Beyond the hype: The enterprise AI architecture we actually need

May 4, 2026, 08:00

My last few years working as a chief digital officer have been, in large part, a sustained exercise in separating what enterprise AI can actually do from what the world insists it is about to do. That distinction is not academic. It is the difference between a transformation program that delivers and one that produces a glossy internal report and a quietly shelved proof of concept.

Enterprise experimentation with generative AI has accelerated sharply over the past two years. The Stanford AI Index reports that more than half of organizations globally are now actively exploring or piloting AI-driven workflows — a signal that the conversation has moved from curiosity to operational pressure for many CIOs.

What follows is not a vendor blueprint or prediction. It is a working architectural sketch shaped by real enterprise constraints — the kind that has to survive contact with a real organization’s data governance function, its compliance team and its late-night incident queue.

What I think the mature enterprise AI stack will look like is considerably more federated, more layered and more interesting than most current commentary suggests.

The enterprise AI of the near future will not be a single platform that does everything. It will most likely be a federation — sovereign agents at the base, curated data in the middle and orchestrated intelligence at the top.

A stack built in layers

The starting point is accepting that the major systems of record are not going anywhere.

Native AI

Enterprise platforms like SAP, Salesforce, Workday and ServiceNow hold the most governed and contextually rich data in any large organization, and they are increasingly developing their own native AI capabilities embedded directly within their platforms.

SAP’s recently introduced Joule AI copilot, for example, signals a direction rather than a finished product: Platform-native AI that understands the semantics of the data it sits on and can answer questions that only someone with full schema access and transactional history could answer — without that data ever leaving the platform boundary.

These systems already understand the enterprise in ways no external AI system easily can.

Sovereign private AI

Alongside the native AI sits a different challenge: The long tail of bespoke platforms, industry-specific tools and internal knowledge repositories that no major vendor is ever likely to address natively.

In my experience, sovereign hosted private AI is the most credible answer here — open-source models such as Llama or Mistral, self-hosted within the organization’s own infrastructure and fine-tuned on internal documents and processes. This creates an AI that knows what the organization actually knows, can be interrogated about its provenance and can be shown to a regulator without a conversation about third-party data processing agreements.

For many regulated industries, this sovereignty over data and model behavior will be a defining architectural principle rather than a technical preference.

The data lake

Between the base systems and the intelligence layer above them sits the data lake — modern data platforms such as Microsoft Fabric, Databricks, Snowflake or their equivalents — fed by governed data pipelines from those base systems. It is worth being precise about what this layer is — and what it is not. It is not a data swamp. It is a curated, semantically enriched, access-controlled repository that reflects the enterprise’s data as a coherent whole across ERP, CRM, HR and others.

The quality of everything above it depends entirely on what flows into it.

This is unglamorous work. It is also the work that most AI transformation programs underinvest in, and the principal reason most of them underdeliver.
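To make that concrete, here is a minimal sketch of what one unit of that work looks like at the pipeline level: an ingestion step that rejects records that break the agreed schema and attaches ownership and classification metadata before anything lands in the lake. The field names, stewards and classifications are purely illustrative.

```python
from dataclasses import dataclass

# Illustrative schema contract for one feed into the lake.
REQUIRED_FIELDS = {"order_id": str, "amount": float, "currency": str}

@dataclass
class CuratedRecord:
    payload: dict
    source_system: str   # e.g. "ERP"
    data_owner: str      # an accountable steward, not just a pipeline
    classification: str  # e.g. "internal", "restricted"

def validate(raw: dict) -> dict:
    """Fail loudly at the boundary rather than letting bad records
    silently degrade everything built on top of the lake."""
    for name, ftype in REQUIRED_FIELDS.items():
        if name not in raw:
            raise ValueError(f"missing field: {name}")
        if not isinstance(raw[name], ftype):
            raise TypeError(f"{name} must be {ftype.__name__}")
    return raw

def ingest(raw: dict, source: str, owner: str, classification: str) -> CuratedRecord:
    return CuratedRecord(validate(raw), source, owner, classification)

record = ingest(
    {"order_id": "4500012345", "amount": 125000.0, "currency": "EUR"},
    source="ERP", owner="finance-data-steward", classification="internal",
)
```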

AI-powered analytics

The analytics layer — powered by the likes of Power BI, Tableau and their successors — sits on top of this data lake, and this is where the most visible change is already underway. The next generation of these platforms will retain the visualization capabilities that business users depend on but will layer a prompt interface and an AI orchestration engine above the data.

A finance analyst asking why gross margin compressed in a particular quarter will trigger not just a query against the data lake, but a federated call — via MCP-based agent-to-agent protocols — to the ERP’s native AI, the CRM’s revenue intelligence and the procurement system’s spend analyser, each responding within its own security perimeter, with results synthesised at the analytics layer. This layer is mostly read and query, deliberately passive.
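A minimal sketch of that fan-out-and-synthesise pattern follows. The three agent functions are hypothetical stand-ins for MCP client sessions; in a real deployment each call would stay inside that platform’s security perimeter, and the synthesis step would typically be an LLM call at the analytics layer rather than simple concatenation.

```python
import asyncio

# Hypothetical stand-ins for MCP-based agent calls. Each platform's
# agent answers from inside its own security perimeter; raw data
# never leaves the source system.
async def ask_erp_ai(question: str) -> str:
    return "ERP: input costs rose 4% quarter over quarter"

async def ask_crm_ai(question: str) -> str:
    return "CRM: discounting deepened on two strategic accounts"

async def ask_procurement_ai(question: str) -> str:
    return "Procurement: a single-source supplier repriced mid-quarter"

async def federated_query(question: str) -> str:
    # Fan out in parallel; every call here is read-only by design.
    answers = await asyncio.gather(
        ask_erp_ai(question),
        ask_crm_ai(question),
        ask_procurement_ai(question),
    )
    # Stand-in for LLM synthesis at the analytics layer.
    return "\n".join(answers)

print(asyncio.run(federated_query("Why did gross margin compress this quarter?")))
```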

The orchestration

The agentic orchestration layer is where AI moves from observation to action, and where governance cannot be an afterthought. This architecture places human oversight at three levels:

  • Human-on-the-loop for autonomous but fully logged agent actions
  • Human-in-the-loop for high-value or irreversible decisions requiring explicit approval
  • Human-over-the-loop for policy-level definitions of what agents may and may not do

Every inter-agent call is traceable, every action timestamped and auditable.
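As a sketch of how those three levels might be enforced, consider a human-set policy table consulted before every agent action, with every attempt appended to an audit log whether it is allowed or not. Action names and policy assignments are invented for the example.

```python
import time
from enum import Enum

class Oversight(Enum):
    ON_THE_LOOP = "autonomous, fully logged"
    IN_THE_LOOP = "explicit approval required"
    OVER_THE_LOOP = "forbidden by policy"

# The policy table is the "human-over-the-loop" layer: set by people,
# consulted by machines. Entries here are illustrative.
POLICY = {
    "read_report": Oversight.ON_THE_LOOP,
    "issue_refund": Oversight.IN_THE_LOOP,
    "delete_master_data": Oversight.OVER_THE_LOOP,
}

AUDIT_LOG: list[dict] = []  # in production: append-only and tamper-evident

def attempt(agent: str, action: str, approved_by: str | None = None) -> bool:
    level = POLICY.get(action, Oversight.OVER_THE_LOOP)  # deny unknown actions
    allowed = (
        level is Oversight.ON_THE_LOOP
        or (level is Oversight.IN_THE_LOOP and approved_by is not None)
    )
    # Every attempt is timestamped and logged, allowed or not.
    AUDIT_LOG.append({
        "ts": time.time(), "agent": agent, "action": action,
        "level": level.value, "approved_by": approved_by, "allowed": allowed,
    })
    return allowed

attempt("claims-agent", "read_report")                       # True
attempt("claims-agent", "issue_refund")                      # False: no approver
attempt("claims-agent", "issue_refund", approved_by="jdoe")  # True
```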

The EU AI Act and sector-specific regulators in financial services and healthcare will make this level of observability non-negotiable within the next couple of years. I have found it considerably easier to build in from the start than to retrofit under regulatory pressure.

Together, these layers form the internal architecture of the enterprise AI stack — systems of record at the base, data consolidation in the middle, analytics above and agent orchestration governing action.

The missing pieces

The five-layer model above is, in one sense, a description of mostly internal infrastructure. But there are two additional structural elements I keep returning to — conspicuously absent from most current enterprise AI discourse.

The marketplace

The first is a public marketplace of AI agents underpinned by a blockchain trust layer. When an organization wants to deploy a specialist external agent — one trained to validate material master pricing against live market indices, cross-reference technical specifications against supplier catalogues or propagate regulatory amendments to internal master data — the current model requires trusting the vendor’s claims about what the agent does.

A blockchain-based identity and audit layer changes that. The agent’s provenance, version history and audit trail across prior deployments live on a distributed ledger: Immutable and inspectable. Smart contracts define precisely which systems it may query, what data it may read or write, and under what conditions it must escalate to a human.
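A toy illustration of the idea, assuming nothing about any particular ledger technology: a hash-chained registry in which each agent registration commits to the previous entry, so provenance and permission grants cannot be silently rewritten after the fact. The agent, permissions and escalation rule are invented.

```python
import hashlib
import json

class AgentLedger:
    """Toy hash chain: each entry commits to its predecessor, so
    tampering with history breaks every later hash. A real deployment
    would use an actual distributed ledger and signed entries."""

    def __init__(self) -> None:
        self.entries: list[dict] = []

    def register(self, agent_id: str, version: str, may_read: list[str],
                 may_write: list[str], escalate_if: str) -> dict:
        entry = {
            "agent_id": agent_id, "version": version,
            "may_read": may_read, "may_write": may_write,
            "escalate_if": escalate_if,
            "prev": self.entries[-1]["hash"] if self.entries else "genesis",
        }
        entry["hash"] = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()
        ).hexdigest()
        self.entries.append(entry)
        return entry

ledger = AgentLedger()
ledger.register(
    agent_id="pricing-validator", version="2.1.0",
    may_read=["material_master", "market_indices"], may_write=[],
    escalate_if="proposed price delta exceeds 5%",
)
```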

Such a trust layer is the agentic equivalent of what open APIs did for data exchange, but with governance built into the protocol rather than bolted on afterwards. Projects exploring this direction — including Fetch.ai’s autonomous agent network and emerging work applying W3C Verifiable Credentials to AI systems — are early signals of where enterprise compliance functions may eventually arrive.

An agent without a verifiable identity is a vendor promise. An agent on a trust ledger is an auditable fact.

The employee intelligence layer

The second missing piece is what I think of as the employee intelligence layer — the interface through which all of this infrastructure actually reaches the person who joined the organization to do a job, not to understand data topology.

What this needs to be is a single workspace that blends the channel-based collaboration model of platforms such as Slack with the structured project logic of tools like Notion, but with AI built into its core rather than added as a feature. A supply chain coordinator should be able to ask, in plain language, for the status of all open purchase orders for a given vendor and receive an answer synthesised from the ERP’s native AI — without navigating a single SAP transaction code.

An HR business partner should be able to retrieve aggregated headcount and attrition data from an enterprise HRMS such as SuccessFactors, annotated with context from their own team’s channel history, without opening a separate analytics tool.

Progress and accountability belong in the same environment where work actually happens — not in a separate project management application that everyone updates for the quarterly review and ignores the rest of the time. The AI in this layer notices when a commitment is overdue, surfaces the relevant context and suggests an appropriate next action rather than simply turning a status indicator red.

Embedded within each person’s workspace, configured to their role and responsibilities, are the analytics dashboards that actually matter to their decisions — queryable in natural language when the chart does not answer the question they have.

Get the employee intelligence layer right and the individual has genuine access to the collective intelligence of the organization. Get it wrong and the stack above becomes expensive infrastructure that the people it was built for have quietly routed around.

Implications for technology leaders

I am aware that describing a multi-layer federated AI architecture is considerably easier than implementing one. A few things I have learned in practice seem worth naming directly. The data governance work is not a precondition of the AI work — it is the AI work. The sophistication of any intelligence layer is bounded entirely by the quality, structure and semantic richness of what flows into it.

Organizations that treat the data lake as an IT project and AI as the real transformation misunderstand the sequence. They are the same project, and the data half is harder. The governance of agentic systems requires a different mental model from the governance of conventional software. When a traditional application does something unexpected, there is usually a code path to trace. When an AI agent takes an unexpected action in a multi-agent system, the failure mode is emergent and the audit trail may be distributed across several systems.

The observability infrastructure — the kind used to monitor complex distributed systems, applied to agent networks — is not optional instrumentation. It is the operating licence. I have come to treat it as a first-class architectural concern rather than something to add once the system is stable, because in my experience the system is never stable in the way that phrase implies.

And finally: The enterprise does not need to be rebuilt around AI. It needs to have AI built into it — carefully, layer by layer, with someone accountable at every level.

The platforms that will win in this environment are not necessarily those with the most impressive pilots. They are the ones that play well with others, expose clean interfaces for inter-agent communication, maintain rigorous audit trails and allow the enterprise to remain sovereign over its own intelligence.

The AI future of the enterprise is federated, governed and — when it works properly — invisible. Which is, when you think about it, precisely what good infrastructure has always been.

This article is published as part of the Foundry Expert Contributor Network.


The architectural decision shaping enterprise AI

May 1, 2026, 09:00

Every enterprise AI initiative contains an architectural decision that rarely makes it into the business case or the steering committee deck. It doesn’t have a line item. It often gets made by a developer on a Tuesday afternoon based on whatever the default configuration was. And it determines, more than almost anything else, whether your AI system produces answers worth trusting.

The decision is this: How should your AI system be architected to find, relate, and reason over information at the moment it needs to? Three dominant architectural patterns answer that question differently — vector embeddings, knowledge graphs, and context graphs. They are not competing technologies. They are different approaches to a fundamental problem, each with distinct capabilities, costs, and failure modes.

Choose the wrong pattern for your use case and you’ll spend the next 18 months explaining confident mistakes. Choose the right combination and you’ll have an AI system that earns trust rather than erodes it.

This article gives you a framework to understand each architectural pattern, know when it applies, and recognize how leading organizations are layering all three deliberately — not by accident.

3 architectural patterns, one fundamental problem

Before comparing them, it helps to understand what each pattern is fundamentally doing when an AI system needs to find or reason over information.

1. Vector embeddings: Finding what feels related

Vector embeddings translate text, documents, or other data into numerical representations – dense lists of numbers called vectors that capture semantic meaning. Two pieces of text that mean similar things end up with vectors that are mathematically close to each other, even if they share no common words.

When a user asks a question, the system converts that question into a vector and searches a database for the stored vectors closest to it. This is the backbone of most Retrieval-Augmented Generation (RAG) systems today.
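A stripped-down sketch of the mechanism, with toy three-dimensional vectors standing in for real embeddings (which come from an embedding model and run to hundreds or thousands of dimensions) and brute-force cosine similarity standing in for a vector database:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Invented "embeddings": nearby vectors mean similar content.
corpus = {
    "supplier contract for steel":  [0.9, 0.1, 0.0],
    "quarterly revenue report":     [0.1, 0.9, 0.2],
    "vendor agreement, aluminium":  [0.8, 0.2, 0.1],
}

def search(query_vec: list[float], k: int = 2) -> list[str]:
    return sorted(corpus, key=lambda d: cosine(query_vec, corpus[d]), reverse=True)[:k]

# A query embedding meaning roughly "supplier agreements" surfaces both
# contracts even though the two documents share no words.
print(search([0.85, 0.15, 0.05]))
```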

  • The strength: Vector search is fast, flexible, and remarkably good at finding conceptually related content even across messy, unstructured data. You don’t need to pre-define relationships or maintain a schema. Dump in your documents, embed them, and search.
  • The failure mode: Vector search finds things that feel related — but it has no understanding of why they’re related or what the relationships between them mean. Ask it who reports to whom in your organization, and it will return chunks of text that mention both names near each other. That’s not the same as knowing the org structure.
  • What could go wrong: In production, vector search can surface confidently irrelevant results — content that is semantically adjacent but factually disconnected from the query. Without guardrails, this feeds hallucinations.

“Vector search is very good at finding content that feels related to the question. It is not built to understand whether that content is actually correct, relevant in context, or sufficient to support a trusted answer. In enterprise domains where a confident near-match can create real risk, that limitation is not a technical footnote; it is the core architectural issue.” —Wayne Filin-Matthews, Chief Enterprise Architect, McDonald’s

It also degrades over time as your document corpus grows without curation. There is a subtler risk too: Vector search quality depends entirely on the embedding model underneath it. Generic models produce generic vectors, and retrieval degrades quietly — without obvious error signals — when the model isn’t matched to your domain.

2. Knowledge graphs: Finding what is related

A knowledge graph represents information as a network of entities (people, products, concepts, events) and the explicit, named relationships between them. An employee reports to a manager. A drug treats a condition. A product belongs to a category. These relationships are defined, typed, and queryable.

When a system needs to answer a structured question such as, “Which suppliers are affected by this regulatory change?” or “What dependencies exist between these systems?”, a knowledge graph traverses those explicit relationships to produce a precise answer.
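A minimal sketch of that traversal, over a toy set of typed triples with invented entities. The answer is produced by following explicit edges, which is why it is traceable.

```python
# Typed (subject, relation, object) triples; all names are invented.
TRIPLES = [
    ("SupplierA",   "provides",         "Component1"),
    ("SupplierB",   "provides",         "Component2"),
    ("Component1",  "part_of",          "ProductLineX"),
    ("Component2",  "part_of",          "ProductLineX"),
    ("RegChange42", "affects_material", "Component1"),
]

def objects(subject: str, relation: str) -> list[str]:
    return [o for s, r, o in TRIPLES if s == subject and r == relation]

def suppliers_affected_by(reg_change: str) -> set[str]:
    # Regulation -> affected components -> suppliers of those components.
    affected = objects(reg_change, "affects_material")
    return {s for s, r, o in TRIPLES if r == "provides" and o in affected}

print(suppliers_affected_by("RegChange42"))  # {'SupplierA'}: traversed, not guessed
```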

  • The strength: Knowledge graphs excel at structured reasoning, compliance use cases, and any domain where relationships have real-world meaning that must be preserved. They don’t guess, they traverse. The answers are traceable and explainable.
  • The failure mode: Knowledge graphs are expensive to build and brittle to maintain. Every entity and relationship must be explicitly defined and kept current. In fast-moving domains (active M&A, evolving product lines, shifting regulations), the graph can become stale faster than teams can update it.
  • What could go wrong: A knowledge graph your team built 18 months ago and hasn’t maintained is worse than no knowledge graph. Stale nodes create confident wrong answers. The build-and-maintain cost catches many organizations off guard; the engineering lift is substantial, and the graph needs domain expertise to structure well.

3. Context graphs: Capturing the reasoning, not just the answer

Start with a question that most enterprise AI systems cannot answer: When your organization made a consequential decision last quarter, where did the reasoning go? Not the data that fed it. Not the outcome. The actual context: The signals considered, the tradeoffs evaluated, who pushed back, who approved, and why the call went the way it did.

In most organizations, that reasoning lives in a spreadsheet someone may or may not have kept, in meeting notes that may or may not have been taken, in a CRM field someone half-filled in, and mostly in the heads of the two or three people who were in the room. Six months later, when someone needs to reconstruct it, you’re calling people and hoping they remember.

“Every enterprise has instrumented its transactions. Almost none have instrumented their decisions. The reasoning behind a call, what was weighed, what was dismissed, who pushed back, is still treated as exhaust rather than signal. Context graphs are the first architecture I have seen that takes that reasoning seriously as data.” —Neeraj Mathur, Chief AI Officer, Kognitos

Context graphs are the architectural response to that problem. Where vector embeddings find content that feels related and knowledge graphs map relationships that are explicitly defined, a context graph captures the dynamic web of reasoning relevant to a specific decision, workflow, user, or moment in time. It treats decision context as a first-class data artifact, not a byproduct that gets lost after the meeting ends.

In an agentic AI system, a context graph connects the user’s role, their recent actions, the documents they have referenced, the decisions currently in flight, and the signals that shaped those decisions. It is not a static structure. It assembles and updates in real time, shaped by what is happening.
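A rough sketch of the data structure, with invented node kinds and labels. The essential property is that edges record why things are connected, and that the graph is assembled per user and session rather than maintained as a global schema:

```python
import time
from dataclasses import dataclass, field

@dataclass
class ContextNode:
    kind: str    # "user", "decision", "signal", "document", "action"
    label: str
    ts: float = field(default_factory=time.time)

@dataclass
class ContextGraph:
    """Assembled and updated in real time as work happens."""
    nodes: list[ContextNode] = field(default_factory=list)
    edges: list[tuple[int, int, str]] = field(default_factory=list)

    def add(self, node: ContextNode) -> int:
        self.nodes.append(node)
        return len(self.nodes) - 1

    def link(self, src: int, dst: int, reason: str) -> None:
        # The reason on the edge is the point: context, not just linkage.
        self.edges.append((src, dst, reason))

g = ContextGraph()
user     = g.add(ContextNode("user", "ops manager, EMEA supply chain"))
signal   = g.add(ContextNode("signal", "regional port disruption"))
decision = g.add(ContextNode("decision", "escalated two at-risk purchase orders"))
g.link(signal, decision, "motivated")
g.link(user, decision, "made and owns")
# Later interactions can cite this reasoning chain instead of starting cold.
```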

  • The strength: Context graphs give AI systems something neither vector search nor knowledge graphs can provide: continuity. A single-turn query can get by with semantic search. A workflow that spans multiple steps, multiple users, and multiple days needs a layer that understands what has already happened and why. Context graphs make earlier reasoning available to later decisions, which is what separates a system that answers questions from one that supports how work gets done.
  • The failure mode: Context graphs add architectural complexity that the other two patterns do not. Building them requires deliberate decisions about what context to capture, how long to retain it, and how to keep it current. They also raise governance questions that vector search and knowledge graphs do not: A graph that captures decision reasoning across users and sessions is a graph that must be carefully governed for privacy, access control, and auditability.
  • What could go wrong: Context graphs built without clear boundaries accumulate stale reasoning that degrades rather than improves responses. The same property that makes them powerful, knowing what happened before, becomes a liability if what happened before is outdated, incomplete, or was never captured accurately in the first place.

How the 3 patterns compare

| | Vector embeddings | Knowledge graphs | Context graphs |
|---|---|---|---|
| Core question answered | What content is semantically similar? | What relationships exist between entities? | What is relevant given this user’s current situation? |
| Data type | Unstructured (docs, text, reports) | Structured (entities + typed relationships) | Dynamic (session, user state, task history) |
| Strengths | Fast to deploy, works on messy data, scales well | Precise, traceable, explainable answers | Adaptive, personalized, built for multi-step workflows |
| Weaknesses | No relational reasoning, can return confident wrong answers | Expensive to build, breaks when data goes stale | Adds architectural complexity, raises data governance concerns |
| Best for | Document Q&A, semantic search, RAG pipelines | Compliance, org data, structured domains | Agentic workflows, personalized assistants |
| Typical time-to-value | Weeks | 3 to 9 months | Depends on agentic maturity |
| Ongoing maintenance | Periodic re-indexing as content changes | Continuous; dedicated team to keep graph current | Session lifecycle management + governance policies |
| Explainability | Hard to audit (“it seemed relevant”) | Fully traceable; every answer has a path | Partial: reasoning is visible, but context assembly is not |

Choosing the right pattern for your use case

The instinct most teams have is to start with vector search. It’s fast to deploy, the tooling is mature, and it produces results that look impressive in a demo. That instinct is often correct for a first use case. The problem comes when the architecture that was right for the pilot gets inherited by every subsequent use case without anyone asking whether it still fits.

The right pattern depends on the nature of the problem, not the speed of the deployment.

  • Vector embeddings are the right starting point when your primary challenge is making unstructured content findable. Large volumes of documents, reports, emails, knowledge base articles — anything where users need to ask questions in natural language and get relevant content back. Fast to deploy, forgiving of messy data, and a solid foundation for demonstrating early ROI. The ceiling is that it cannot reason over relationships or maintain continuity across a workflow.
  • Knowledge graphs earn their cost when relationships are load-bearing. If the wrong relationship produces a wrong answer and that wrong answer has compliance, financial, or safety consequences, the precision and auditability of a knowledge graph justify the investment. Regulated industries know this because their auditors have forced the conversation. Organizations in less regulated environments often discover it the hard way.
  • Context graphs become necessary when your AI needs to do more than answer isolated questions. If the system needs to support a workflow that spans steps, users, and time and if earlier decisions should inform later ones, you need an architectural layer that captures and preserves that reasoning. Without it, every interaction starts from scratch, and the system never gets smarter about the work being done.

What a layered architecture looks like in practice

The most sophisticated enterprise AI systems don’t pick one pattern. They layer all three, each handling the job it’s best suited for, in an architecture designed intentionally.

Consider a global manufacturer; let’s call them Hartwell Industries. They are building an AI assistant for their supply chain operations teams. Here’s how the three layers work together (a sketch of how a query might be routed across them follows the list):

  • Layer 1 — Vector embeddings handle the document corpus: Supplier contracts, quality audit reports, engineering specifications, procurement policies, and internal incident reports. When a supply chain manager asks a broad question such as “What have we seen historically with single-source suppliers during Q4 demand surges?”, the vector layer quickly retrieves the most relevant content from across that library, even when the question uses different terminology than the documents.
  • Layer 2 — The knowledge graph represents the structured relationships that operational decisions depend on: Which suppliers provide which components, which components go into which product lines, which product lines are committed to which customers, and which regulatory certifications govern which materials. When the system needs to answer, “Which of our active production lines are exposed if this tier-two supplier goes offline?” the knowledge graph traverses those dependencies precisely — no guessing, no approximation.
  • Layer 3 — The context graph tracks what’s happening right now: This operations manager is monitoring a specific regional disruption, has already escalated two at-risk purchase orders this morning, is working against a customer delivery commitment that ships in six days, and flagged a quality hold on an alternative supplier last week. The context graph shapes every response to reflect not just what’s generally true about supply chain risk, but what’s at stake for the situation this person is navigating today.
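A minimal sketch of how a query might be routed across those layers, with a keyword stub standing in for what would realistically be an LLM-based intent classifier, and the context graph framing every answer regardless of which layer produces it:

```python
# Hypothetical router for the Hartwell assistant; classification is a
# keyword stub, and each layer call is elided to a label.
def route(question: str, context: dict) -> str:
    q = question.lower()
    if any(w in q for w in ("exposed", "depends", "affected")):
        layer = "knowledge_graph"   # relationships are load-bearing here
    elif any(w in q for w in ("historically", "what have we seen")):
        layer = "vector_search"     # semantic recall over the document corpus
    else:
        layer = "vector_search"     # reasonable default for broad questions
    # Layer 3 shapes the answer to the situation the user is in today.
    return f"[{layer}] answer, framed for: {context['current_focus']}"

print(route(
    "Which production lines are exposed if this tier-two supplier goes offline?",
    {"current_focus": "regional disruption; customer commitment ships in 6 days"},
))
```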

The difference between the first layer and the third is the difference between a system that finds information and one that understands the situation.

Most organizations won’t need all three layers from day one. But understanding the architecture helps you build toward it deliberately, rather than discovering the gaps when they become problems.

The layer most enterprises are missing

Context graphs are the youngest of the three patterns, and the tooling reflects it. Knowledge graphs have mature, enterprise-grade infrastructure: Neo4j, Amazon Neptune, Azure Cosmos DB. Vector databases have consolidated around proven platforms: Pinecone, Weaviate. Context graphs don’t yet have an equivalent. Different vendors use the term differently. The standards are still being written.

That immaturity is worth naming, but it is not a reason to wait. As one practitioner working across industries recently observed, the missing layer in most enterprises isn’t data — it’s decision traces. The reasoning that connects data to action was never treated as a first-class citizen. Regulated industries figured this out, but rarely voluntarily: Auditors forced insurance companies to capture it, the FAA forced airlines, and quarterly numbers forced logistics operations to instrument their decisions. Most enterprises are still at the spreadsheet-and-hope stage.

“As we transition deeper into AI-First operating models, the demand for explainability and transparent reasoning only intensifies. Vector search and static knowledge graphs alone won’t cut it for complex workflows. Context graphs are quickly becoming a non-negotiable layer in the enterprise architectural stack to capture those critical decision traces. Spot on.” —Anoop Prasanna, Walmart Global

Context graphs are the architectural pattern that changes that. Organizations building agentic systems today are already making context graph decisions, even when they don’t call them that. Every choice about how to manage session state, persist conversation history, or let one agent’s output inform the next is a context architecture decision. The question isn’t whether your organization will have a context layer. It’s whether someone designed it, or whether it just accumulated.

Making this decision intentionally

Most enterprise AI programs will spend the next two years discovering what their architecture cannot do. The vector search system that works beautifully in the pilot will start returning confident nonsense at scale. The knowledge graph that seemed like a solid investment will turn out to need a dedicated team just to keep it current. The agentic workflow that impressed everyone in the demo will fall apart when it cannot maintain context across steps.

None of that is inevitable. But it is what happens when architectural decisions get made by default rather than by design. The organizations that get this right won’t necessarily have better data or bigger models. They will have asked the harder question earlier: Not “what AI should we build?” but “how should our AI be architected to reason well over time?”

That question belongs in the business case. It belongs in the steering committee deck. It belongs on your agenda, before the next prototype goes to production.

This article is published as part of the Foundry Expert Contributor Network.


What is TOGAF? An EA framework for aligning technology to business

May 1, 2026, 06:00

TOGAF definition

The Open Group Architecture Framework (TOGAF) is an enterprise architecture methodology that offers a high-level framework for enterprise software development. TOGAF helps organize the development process through a systematic approach aimed at reducing errors, maintaining timelines, staying on budget, and aligning IT with business units to produce quality results.

The Open Group developed the framework in 1995, and by 2016, 80% of Global 50 companies and 60% of Fortune 500 companies used it. TOGAF is free for organizations to use internally, but not for commercial purposes. Businesses can, however, have tools, software or training programs certified by The Open Group. There are currently eight certified TOGAF tools and 71 accredited courses offered by 70 organizations.

In 2022, The Open Group announced the latest update to the framework and released the TOGAF Standard, 10th Edition, to replace the previous Standard, 9.2 Edition. The Open Group states that the 10th Edition will help businesses operate more efficiently and will provide more guidance and simpler navigation for applying the TOGAF framework.

As more organizations adopt AI technology, the TOGAF 10 framework can help businesses navigate developing and implementing AI-driven enterprise architecture. The methodology’s focus on compliance and security can guide organizations through the development and implementation process of AI architecture, while mitigating risk.

TOGAF framework overview

Like other IT management frameworks, TOGAF helps businesses align IT goals with overall business goals, while helping to organize cross-departmental IT efforts. TOGAF helps businesses define and organize requirements before a project starts, keeping the process moving quickly with few errors.

TOGAF 10 brings a stronger focus to organizations using the agile methodology, making it easier to apply the framework to an organization’s specific needs. The latest edition uses a modular structure that is simpler to follow and implement, making the framework easier to adopt in any industry.

The TOGAF framework is broken into two main parts: the fundamental content and the extended guidance. The fundamental content includes all the essentials and best practices of TOGAF that create the foundation for the framework. The extended guidance covers specific topics such as agile methods, business architecture, data and information architecture, and security architecture. The extended guidance is expected to evolve over time as more best practices are established, whereas the fundamental content offers a basic starting point for anyone looking to apply the framework.

The Open Group states that TOGAF is intended to accomplish the following:

  • Ensure everyone speaks the same language
  • Avoid lock-in to proprietary solutions by standardizing on open methods for enterprise architecture
  • Save time and money, and utilize resources more effectively
  • Achieve demonstrable ROI
  • Provide a holistic view of an organizational landscape
  • Act as a modular, scalable framework that enables organizational transformation
  • Enable organizations of all sizes across all industries to work off the same standard for enterprise architecture

Agile framework for agentic AI

TOGAF 10 offers a flexible and adaptable framework for designing, integrating, and governing agentic AI systems. Following the framework’s principles can help IT leaders ensure AI architecture aligns with business goals, while also maintaining governance and ethical standards. TOGAF’s flexibility also enables enterprises to grow and adapt their AI architectures as the technology and its use cases evolve.

When approaching agentic AI development, TOGAF 10 can help organizations:

  • Establish key stakeholders and identify key tools and principles
  • Identify risk and address ethical questions around AI
  • Ensure proper compliance and governance of AI
  • Guide the overall integration of AI with existing infrastructure
  • Offer a framework for identifying skills, data, and technology gaps in the organization necessary for AI transformation
  • Demonstrate strategic alignment between AI architecture and business goals

TOGAF business benefits

The framework helps organizations implement software technology in a structured and organized way, with a focus on governance and meeting business objectives. Software development relies on collaboration among multiple departments and business units both inside and outside of IT, and TOGAF helps address any issues around getting key stakeholders on the same page.

TOGAF is intended to help create a systematic approach to streamline enterprise architecture and the development process so that it can be replicated, with as few errors or problems as possible as each phase of development changes hands. By creating a common language that bridges gaps between IT and the business side, it helps bring clarity to everyone involved.

It’s an extensive document — but you don’t have to adopt every part of the framework. Businesses are better off evaluating their needs to determine which parts of the framework to focus on. With the modular updates to the TOGAF Standard 10th Edition, creating a custom TOGAF framework should be easier than ever. Organizations can start with the core fundamentals, and then pick and choose parts to adopt from the extended guidance portion of the document.

TOGAF certification and training

On releasing TOGAF 10, The Open Group decided to keep TOGAF 9.1 certification exams as-is, while introducing three new exams to address updates made to the framework. TOGAF 9.1 Level 1 and Level 2 cover the foundations of TOGAF and ensure that past certifications do not become obsolete in the face of an updated framework. The three new exams include the TOGAF Enterprise Architecture Foundation, TOGAF Enterprise Architecture Practitioner, and TOGAF Business Architecture Foundation.

These certifications are combined into learning path options appropriate for differing levels of experience. The first of the three learning paths is the Team level, for those in roles that require a basic understanding of enterprise architecture or who work in customer service. The second is the Practitioner level, for anyone at the management level or who is responsible for developing enterprise architecture. The third and final learning path is the Leader level, for those establishing an enterprise architecture capability.

The TOGAF certification scheme is especially useful for enterprise architects, because it’s a common methodology and framework used in the field. It’s also a vendor-neutral certification that has global recognition. Earning your certification will demonstrate your ability to use the TOGAF framework to implement technology and manage enterprise architecture. It will validate your abilities to work with TOGAF as it applies to data, technology, enterprise applications, and business goals.

According to PayScale, a TOGAF certification can boost your salary for the following roles:

| Job title | Average salary | With TOGAF certification |
|---|---|---|
| IT enterprise architect | $158,795 | $166,414 |
| Solutions architect | $135,178 | $157,089 |
| Software architect | $139,438 | $170,000 |
| IT director | $131,727 | $152,949 |

For more IT management certifications, see “20 IT management certifications for IT leaders.”

TOGAF tools

The Open Group keeps an updated list of TOGAF-certified tools, which includes the following software:

  • Alfabet AG: planningIT 7.1 and later
  • Avolution: ABACUS 4.0 or later
  • BiZZdesign: BiZZdesign Enterprise Studio
  • BOC Group: ADOIT
  • Orbus Software: iServer Business and IT Transformation Suite 2015 or later
  • Planview: Troux
  • Software AG: ARIS 9.0 or later
  • Sparx Systems: Enterprise Architect v12

For more tools that support enterprise architecture and digital transformation, see our list of the top 20 enterprise architecture tools.

The evolution of TOGAF

TOGAF is based on TAFIM (Technical Architecture Framework for Information Management), an IT management framework developed by the US Department of Defense in the 1990s. It was released as a reference model for enterprise architecture, offering insight into DoD’s own technical infrastructure, including how it’s structured, maintained, and configured to align with specific requirements. Since 1999, the DoD hasn’t used the TAFIM, and it’s been eliminated from all process documentation.

The Architecture Development Method (ADM) is at the heart of TOGAF. The ADM helps businesses establish a process around the lifecycle of enterprise architecture. The ADM can be adapted and customized to a specific organizational need, which can then help inform the business’s approach to information architecture. The ADM helps businesses develop processes that involve multiple checkpoints and firmly established requirements, so that the process can be repeated with minimal errors.

TOGAF was released in 1995, expanding on the concepts found in the TAFIM framework. TOGAF 7 was released in December 2001 as the “Technical Edition,” followed by TOGAF 8 Enterprise Edition in December 2002; it was then updated to TOGAF 8.1 in December 2003, and to TOGAF 8.1.1 in November 2006. TOGAF 9 was introduced in 2009, with new details on the overall framework, including increased guidelines and techniques. TOGAF 9.1 was released in 2011, and the most recent version, TOGAF 10, was released in 2022.

More on advancing enterprise architecture:


Designing the AI-native cloud: What enterprise architects are learning the hard way

April 29, 2026, 09:00

A few years ago, enterprise cloud conversations followed a familiar pattern. Teams discussed migrating legacy applications, modernizing infrastructure and reducing data center costs. The goal was clear: Move workloads to scalable cloud platforms and gain operational flexibility.

But in recent months, the tone of these conversations has shifted dramatically.

In architecture reviews and infrastructure planning sessions I’ve participated in, the questions now sound very different:

  • Where will the model training run?
  • Do we have access to GPU clusters?
  • Can our data pipelines support real-time inference?

The reason is simple: Artificial intelligence — particularly generative AI — is pushing enterprise infrastructure beyond what traditional cloud architectures were designed to handle. What many organizations are discovering is that the future isn’t just cloud-first. It’s AI-native.

When AI becomes the workload that breaks the cloud

In many organizations, the turning point arrives when a team attempts its first large-scale generative AI deployment.

A business unit might want to build a document intelligence system, an internal knowledge assistant or a predictive analytics platform powered by large language models. On paper, this looks like just another cloud workload. But implementation quickly reveals the difference.

AI workloads behave nothing like traditional enterprise applications. They require massive datasets, GPU-accelerated compute and high-throughput data pipelines capable of feeding machine learning models continuously. Infrastructure designed for transactional systems often struggles under these conditions.

I’ve seen teams discover this firsthand when their existing cloud environments suddenly become bottlenecks — not because of application traffic, but because of AI model training workloads. This is the moment many organizations realize: AI isn’t just another application in the cloud. It’s a new infrastructure paradigm.

In some cases, even well-architected microservices environments fail to keep up, exposing limitations in storage I/O, network latency and workload isolation. These hidden constraints often only surface under sustained AI workloads, making them difficult to predict during initial planning phases.

AI-native infrastructure: GPU clusters and high-performance compute

Traditional enterprise cloud environments were optimized for CPU-based workloads and transactional applications. AI systems, by contrast, prioritize GPU-accelerated compute, high-bandwidth networking, distributed storage and scalable training pipelines.

Tools like AMD ROCm highlight this shift toward GPU-native ecosystems, offering a full-stack platform designed specifically for high-performance AI workloads. But adopting GPU infrastructure is not just about provisioning capacity — it is about using it efficiently.

Many organizations underestimate the complexity of GPU scheduling, memory fragmentation and workload contention. Unlike CPU workloads, which can be easily distributed, GPU workloads require careful orchestration to avoid underutilization.
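A toy first-fit placement loop makes the problem visible: two 30 GB jobs on two 40 GB cards strand 10 GB on each, so a 20 GB job queues even though 20 GB is free in aggregate. Capacities and job sizes are invented.

```python
# Two GPUs with 40 GB of memory each; sizes are illustrative.
gpus = [{"id": i, "free_gb": 40} for i in range(2)]

def place(job_gb: int) -> int | None:
    """Naive first-fit: take the first GPU with enough free memory."""
    for gpu in gpus:
        if gpu["free_gb"] >= job_gb:
            gpu["free_gb"] -= job_gb
            return gpu["id"]
    return None  # job queues despite aggregate free capacity

for job_gb in (30, 30, 20):
    print(f"{job_gb} GB -> GPU {place(job_gb)}")
# 30 GB -> GPU 0, 30 GB -> GPU 1, 20 GB -> GPU None:
# 20 GB is free in total, but fragmented across cards.
```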

These platforms demonstrate that AI workloads are reshaping how cloud infrastructure is designed — from CPU-centric compute layers to AI-native architectures optimized for massive parallelism and high-throughput data processing.

Additionally, emerging innovations such as specialized AI accelerators and custom silicon are further complicating infrastructure decisions. Architects must now evaluate not just performance, but portability and vendor lock-in when selecting hardware strategies.

The rise of distributed AI across hybrid environments

Another pattern emerging in enterprise AI deployments is the move toward distributed infrastructure.

Early cloud adoption encouraged organizations to consolidate workloads within a single cloud provider. This simplified governance and reduced operational complexity.

But AI workloads often introduce new constraints. Certain datasets must remain within private infrastructure for compliance reasons. Training large models requires specialized GPU clusters available only in specific cloud regions. Real-time inference may need to run close to where data is generated. As a result, many enterprises are now operating hybrid and multi-cloud AI environments.

Platforms such as Google Cloud Vertex AI are explicitly designed for hybrid AI pipelines, enabling organizations to train and deploy models across on-premises systems and multiple cloud environments.

In these environments, AI is not confined to a single cloud environment. Instead, intelligence is distributed across infrastructure layers.

The challenge shifts from deploying applications to orchestrating AI systems across multiple environments.

This distribution also introduces new challenges around data consistency, model versioning and latency management. Ensuring that models behave consistently across environments becomes a critical requirement, particularly in regulated industries.

Intelligent orchestration is becoming essential

As AI infrastructure grows more complex, manual cloud management becomes increasingly impractical.

Modern enterprise environments can involve thousands of containers, distributed datasets and multiple compute clusters running across different cloud platforms.

To manage this complexity, organizations are beginning to rely on intelligent orchestration platforms. These systems use machine learning to monitor infrastructure usage, predict compute demand and dynamically allocate resources.

Frameworks like UCUP illustrate the next generation of orchestration — systems capable of coordinating multiple AI agents, monitoring performance and adapting execution strategies in real time. These platforms move beyond simple scheduling into intelligent decision-making layers.

Ironically, artificial intelligence is not only transforming enterprise workloads — it is also becoming the system that manages cloud infrastructure itself.

Over time, this may lead to largely autonomous infrastructure environments where human operators focus more on policy and oversight than direct system management.

The cost reality of enterprise AI

For all the innovation AI promises, the financial implications are impossible to ignore.

Large language models require enormous computational resources. GPU clusters are expensive and often scarce. Training a single model can consume substantial cloud budgets.

This has forced many organizations to rethink their financial approach to cloud computing.

Practices such as FinOps — which focus on managing and optimizing cloud spending — are becoming essential in AI-driven environments.

Teams are experimenting with strategies such as:

  • Model optimization and compression
  • Distributed training architectures
  • Serverless inference models
  • Workload scheduling across cost-efficient regions

In some cases, organizations are even reconsidering hybrid strategies that bring certain AI workloads back on-premises when the economics favor private infrastructure.

AI innovation, it turns out, requires as much financial architecture as technical architecture.

FinOps teams are increasingly collaborating directly with data scientists and ML engineers, creating a new cross-functional discipline focused on balancing performance with cost efficiency.

The emergence of the AI-native enterprise cloud

Perhaps the most significant shift underway is conceptual.

For more than a decade, the cloud served primarily as infrastructure for hosting applications.

But AI is transforming the cloud into something far more powerful.

It is becoming a platform for machine intelligence.

Instead of simply running software, cloud environments are now supporting systems that learn from data, generate insights and automate decisions.

Forward-looking organizations are beginning to design their infrastructure with this reality in mind.

They are not just migrating workloads.

They are building AI-native cloud ecosystems designed to support data-driven intelligence at scale.

This also means embedding AI considerations into every layer of architecture — from data ingestion and storage to security, compliance and user experience.

The next chapter of enterprise cloud architecture

The first wave of cloud transformation focused on modernization.

The next wave is about enabling intelligent systems that augment human decision-making, automate operations and unlock entirely new digital capabilities.

That shift is forcing enterprise architects to rethink the foundations of cloud infrastructure — from compute architecture and data pipelines to orchestration and governance.

The organizations that adapt fastest will not simply run AI workloads in the cloud.

They will build cloud environments designed specifically for intelligence.

And in the process, they will define what the next generation of enterprise infrastructure looks like.

Those that fail to adapt, however, risk being constrained by legacy architectural assumptions that no longer align with the demands of AI-driven innovation.

This article is published as part of the Foundry Expert Contributor Network.


The AI architecture decision CIOs delay too long — and pay for later

April 24, 2026, 08:00

In most of the enterprise AI programs I’ve been involved in, the biggest issue wasn’t that CIOs made the wrong architectural decision early. It’s that they stayed committed to it long after the system around it had fundamentally changed. Early on, everything looks like success. Pilots deliver results. Models perform well enough to justify expansion. Platforms scale within existing cloud and governance structures. From a leadership standpoint, there’s very little incentive to question the direction. But over time, something shifts. Costs become harder to predict. Security and architecture reviews take longer. Compliance teams begin asking questions that weren’t part of the original design. And business stakeholders start asking a simple question that becomes increasingly difficult to answer: “Why did the system do that?”

What makes this moment difficult is that nothing has actually “failed.” Systems remain operational. Dashboards stay green. Traditional metrics still indicate health. And yet, confidence begins to erode. This pattern is not isolated. McKinsey has consistently highlighted that many organizations struggle to move from AI pilots to scaled, trusted deployments due to operational and governance complexity. Recognizing that inflection point — and acting on it — is the decision many CIOs delay too long.

When success starts to hide the real problem

I’ve seen this pattern play out repeatedly across different organizations and industries. A team launches an AI initiative with a focused use case, something contained and measurable. The architecture is straightforward: Integrate a model, connect it to enterprise data, expose it through APIs and add basic controls. The goal is speed and proof of value, not long-term structural design.

The system works. That’s what makes this phase deceptively comfortable. Because it works, the organization expands it. More use cases are added. More workflows depend on it. What started as a pilot becomes part of the day-to-day operations. And importantly, this expansion usually happens without revisiting the underlying architectural assumptions. Over time, the system grows in importance, but not in structure. It becomes more critical without becoming more controllable. That’s where the gap begins to form. I’ve seen teams reach a point where the system is widely used, but no single team can confidently explain how it behaves end-to-end under varying conditions. At that point, success is still visible, but understanding is already lagging.

The signals CIOs tend to rationalize

The early warning signs rarely show up as hard failures. They show up as friction — small, persistent and easy to explain away. Cost volatility is often the first signal. What started as a predictable workload becomes uneven. Usage spikes. Model interactions increase. Optimization becomes reactive instead of planned. Teams spend more time explaining cost behavior than controlling it. This aligns with broader industry trends. The Stanford AI Index notes that as AI systems scale, cost, compute variability and operational complexity increase significantly, particularly for generative and multi-step systems.

Governance friction follows closely behind. Security and compliance reviews take longer, not because teams are inefficient, but because the system is harder to reason about. Questions about how decisions are made and how actions are triggered don’t have clean answers. The most telling signal, though, is behavioral uncertainty.

I’ve been in meetings where teams can explain each component of the system, but struggle to explain how the system behaves. Stakeholders start asking more questions, not fewer. Confidence becomes conditional. That shift, from clarity to hesitation, is the signal most organizations underestimate.

Why this is hard to act on

From the outside, the response seems obvious: Revisit the architecture. In practice, it rarely happens quickly, and I’ve seen several reasons why.

First, success creates inertia. When a system is delivering value, even imperfectly, there is strong pressure to scale it, not disrupt it. Leaders are balancing delivery commitments, stakeholder expectations and budget constraints. Re-architecting feels like stepping backward, even when it’s necessary.

Second, there is no forcing function. Unlike outages or security incidents, this problem does not create a single moment that demands action. The system continues to operate. Issues are distributed across cost, governance and operations, making them easy to treat as separate concerns rather than symptoms of a larger issue.

Third, the cost of change is immediate and visible, while the cost of delay is gradual and cumulative. Re-architecting requires alignment across teams, investment of time and a willingness to disrupt existing workflows. Many organizations delay that decision because the impact of not acting is harder to quantify in the short term.

I’ve seen teams spend months optimizing around these issues, tuning models, adjusting pipelines and adding more controls, before recognizing that the underlying problem is structural. By then, the system has already become harder to change.

The architectural assumption that breaks

At the center of this pattern is a simple assumption: That decision-making and execution can remain tightly coupled as systems scale.

In early-stage systems, this assumption holds. A model produces an output, and that output directly triggers an action. The system is small enough that the relationship between decision and execution is easy to understand and manage. As systems expand, that assumption begins to break. Decisions become influenced by multiple data sources, intermediate steps and contextual dependencies. Actions affect more systems, more users and more business processes. Yet the architecture still treats decision and execution as a single continuous flow.

This is where predictability begins to erode. Not because the system stops working, but because it becomes harder to anticipate how it will behave under different conditions. I’ve seen organizations reach a point where they trust the components but not the system. This shift is subtle, but it is one of the most important signals that the architecture no longer fits the system.

What changes once CIOs make the call

The organizations that move forward are the ones that recognize this shift and make a deliberate decision to change how the system is structured.

In my experience, the most effective change is introducing a clear separation between how decisions are made and how actions are executed. This creates a control point that didn’t previously exist. Decisions are no longer immediately acted upon. They are evaluated, validated and, when necessary, constrained before execution. This allows teams to understand not just what the system is doing, but why it is doing it.
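A minimal sketch of that control point, with invented action and risk labels: models emit proposals rather than acting directly, and a separate review step validates, holds or executes them.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Decision:
    action: str
    rationale: str   # captured so "why did the system do that?" has an answer
    risk: str        # "low" or "high"; real systems would score this properly

PENDING: list[Decision] = []

def propose(decision: Decision) -> None:
    """Models and agents call this; nothing executes from here."""
    PENDING.append(decision)

def execute(decision: Decision) -> None:
    print(f"executing: {decision.action} (because: {decision.rationale})")

def review(approve_high_risk: Callable[[Decision], bool]) -> None:
    # The control point: evaluate, constrain or hold before execution.
    for d in list(PENDING):
        if d.risk == "high" and not approve_high_risk(d):
            continue  # stays pending for human review
        execute(d)
        PENDING.remove(d)

propose(Decision("restock component", "forecast shortfall", risk="low"))
propose(Decision("cancel supplier contract", "cost anomaly", risk="high"))
review(approve_high_risk=lambda d: False)  # low-risk runs; high-risk is held
```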

I’ve seen this shift fundamentally change how teams operate. Security and compliance reviews become more productive because the system is easier to reason about. Operational teams gain more control over behavior. Business stakeholders regain confidence because decisions are no longer opaque.

This aligns with how major technology providers are evolving their own systems. Microsoft has emphasized the need for stronger operational governance and control mechanisms as AI systems become more integrated into enterprise workflows. The architecture doesn’t become simpler, but it becomes more controllable.

What waiting actually costs

The cost of delaying this decision is rarely captured in a single metric. It accumulates across the organization. It shows up as repeated architecture and security reviews that never fully resolve concerns. It shows up as increasing effort spent explaining system behavior instead of improving it. It shows up as teams becoming more cautious about where and how the system is used. I’ve also seen it slow down adoption. Teams that would otherwise build on the system hesitate because they don’t fully trust how it will behave. Over time, this reduces the overall impact of the AI investment.

Industry observations reinforce this pattern. Uptime Institute has highlighted how increasing system complexity and a lack of operational clarity are becoming key challenges in managing modern digital infrastructure. By the time organizations decide to re-architect, they are often doing so under pressure — after the friction has already started to limit scale and introduce risk.

The decision CIOs need to make earlier

Looking back across these programs, the pattern is consistent. The question is not whether the architecture needs to evolve. It’s when.

CIOs who act earlier treat the initial architecture as a starting point, not a long-term foundation. As systems scale, they actively reassess whether the structure still supports the level of control, predictability and transparency the business now requires.

This requires a different mindset. Instead of waiting for a failure signal, leaders look for patterns — cost variability, governance friction, behavioral uncertainty — and treat them as indicators of structural misalignment. I’ve seen organizations that make this shift early avoid months of rework later. More importantly, they maintain confidence in the system as it scales, which is ultimately what enables broader adoption.

From scaling systems to controlling them

Enterprise AI is moving from systems that assist decisions to systems that make and act on decisions. That changes the nature of what CIOs are responsible for. It’s no longer enough to ensure systems are performant and scalable. They must also be controllable and understandable under real operating conditions. This requires architecture that supports not just execution, but oversight.

In my experience, the hardest part is not building the system. It’s recognizing when the system you built for early success no longer matches the system you need for scaled operation. That’s the decision that tends to be delayed. And it’s the one that becomes more expensive the longer it is deferred.

This article is published as part of the Foundry Expert Contributor Network.

Why bizware is becoming the dominant form of software

April 20, 2026, 06:00

Since the early 1950s, software has slowly moved from an obscure technical discipline to something that touches almost every person’s life every day. The transition was gradual at first. Most people didn’t have direct access to computers, but the businesses they interacted with did. Computers sat in back rooms quietly changing how companies handled inventory, accounting and customer relationships.

Computing accelerated in the 1980s and 1990s. The computer went from an obscure machine to something sitting on everyone’s desk and, eventually, in their homes. At a minimum, people needed basic computer skills to complete everyday tasks.

Over the last 20 years, computing has evolved even further. It is no longer just a utilitarian tool; it is a fundamental part of daily life. Whether that’s good or bad is debatable, but it’s the reality we live in. And that reality requires massive technological infrastructure. Where businesses once needed buildings, now they also need websites.

To explain what this has done to software, it helps to look at another trade.

A skilled carpenter can build a beautiful mahogany table, cabinet or chair. Some spend decades mastering joinery, shaping, finishing and countless other techniques. With enough experience, they can build almost anything.

But homes are also built out of wood, and homes must be built in enormous quantities. There is massive economic pressure to build them quickly, efficiently and at scale. It would not be practical to build houses the same way master furniture makers build cabinets. The objectives are different. Home construction must happen quickly, with minimal waste, while still meeting building codes and safety standards. It is still carpentry, but it is a different discipline with different constraints.

The same thing has happened with software. The massive economic demand for digital infrastructure has created a new category of software work that operates very differently from traditional software engineering. Standing up the technology required to keep modern society running does not require deep knowledge of computer science or the inner workings of computers. Instead, it requires understanding a large ecosystem of specialized tools that assemble the components businesses need. It is still software, but software shaped by business infrastructure rather than computer science.

I call it bizware.

Software has split into two disciplines

This distinction becomes clearer when you look at how teams have transitioned in organizations. Traditional software teams are often organized around deep technical problems: building a compiler, optimizing a database engine or designing a new algorithm. Progress is measured by correctness, performance and innovation.

Bizware teams focus on something different. Most businesses today are not trying to develop software; they need to deploy software to run their business. They are typically organized around business functions: payments, authentication, internal tools, customer dashboards or analytics pipelines. The goal is not to push the boundaries of computing, but to assemble reliable, secure systems quickly using existing components.

This difference in orientation changes how success is measured. In traditional software, elegance and efficiency matter. In bizware, speed, reliability and integration matter more. The system does not need to be perfect; it needs to work consistently and support the business.

Bizware is driven by business infrastructure, not computer science

Many traditional concepts of computer science are not central to bizware. Concepts like von Neumann architecture, NP-completeness or decidability are rarely relevant. Instead, it is far more important to understand authentication systems, infrastructure tooling, security frameworks and deployment pipelines.

This has created an entire ecosystem of tools that primarily exist to solve business infrastructure problems.

Docker is a good example. Docker solves a deployment problem that businesses face. It does not solve a universal computing problem. Building Docker required deep software expertise, but the people using Docker are leveraging it to solve the business problems that arise from large-scale deployment. The rise of platforms like Docker and Kubernetes reflects this shift toward operational software. These tools exist because companies need consistent environments across development and production.

In the beginning, these tools were hard to use. The computers were slow and the software infrastructure was comparatively primitive. A person had to understand the tools and have a significant traditional software background to effectively and efficiently use the tools. As the tools have matured, the knowledge of traditional software development has become less relevant.

To deploy your website globally, you no longer need to understand what NP-complete means or the nuances of von Neumann architecture. However, outside of business environments, deployment is rarely a major concern. Students, researchers and hobbyists rarely struggle with deployment the way companies do. In contrast, tools like compilers or interpreters are universal; everyone writing software needs them.

Software has effectively undergone a kind of speciation, and a new, distinct discipline has emerged. Bizware and traditional software engineering require different skill sets. Both are difficult and require significant expertise, but they emphasize different types of knowledge. Being excellent at one does not automatically make one excellent at the other.

That distinction also explains where AI is currently being applied. AI struggles with traditional software development. It is not even close to replacing engineers doing deeply technical traditional software work. For example, if I wanted to design a domain-specific language to describe Kalman filters, AI would be almost useless. That task requires deep understanding across multiple technical fields and the ability to combine them creatively in ways that have never existed before. At the same time, the market for that kind of work is relatively small compared with the need businesses have for bizware. 

Bizware also operates under very different economic pressures than traditional software. Businesses need digital infrastructure at enormous scale. These systems must be built quickly, reliably and repeatedly across thousands of organizations. Because the problems are highly repetitive, automation becomes practical and extremely valuable. AI can often produce a reasonable starting point because the patterns are well-known and widely reused.

This also explains why discussions about AI often become confusing. AI is not impacting all software equally. It is far more effective in domains where problems are repetitive and patterns are well understood.

That aligns closely with bizware.

In contrast, traditional software development often involves creating something fundamentally new. That kind of work still requires deep expertise and cannot be easily automated. I explored a related dynamic in my analysis of why hardware and software development fail, where mismatched assumptions between disciplines create systemic problems. Understanding where AI applies and where it does not becomes much easier once the distinction between bizware and traditional software is clear.

Economic pressure is reshaping how software is built

Further, this scale has created strong incentives to standardize and automate as much of the process as possible. Cloud platforms, infrastructure frameworks, containerization and orchestration systems exist primarily to solve these operational problems.

Traditional software development is different. It focuses on building new computational capabilities: compilers, algorithms, operating systems, simulation tools and domain-specific systems that push the boundaries of what computers can do.

Traditional software development solves software problems. Bizware solves business problems. As a result, we’ve experienced a speciation of expertise and a separation of disciplines.

Why this distinction matters for companies

This divide helps explain many of the tensions inside modern technology companies. Engineers who excel at one discipline are often assumed to be interchangeable with those in the other, even though the skills and objectives are quite different.

The market for bizware is enormous. Capitalism constantly pushes toward optimization. That force becomes stronger as the market grows larger. We are seeing the same thing in construction. Companies like Reframe Systems are now building robots designed to automate large parts of home construction. The economic pressure to optimize never disappears. While skilled carpentry is still critical, homebuilding has become commoditized.

Bizware isn’t a lesser form of software, just as framing a house isn’t a lesser form of carpentry than building fine furniture. They simply exist to serve different economic needs.

Understanding that distinction clarifies what modern software development has become.

Software hasn’t disappeared. But the industry that once revolved around computer science now also revolves around operating digital infrastructure at enormous scale. For companies, this distinction has practical implications. This is not really a technical distinction. It is an operational one.

Hiring and team organization focus on keeping the infrastructure running while also keeping it up to date. Before the internet, this was the purview of store managers who needed to keep the store clean and accessible. What used to be physical infrastructure is now digital infrastructure.

Traditional software is not extinct, and it is not dying. If anything, it is more important than ever. However, it can feel that way because the scale of traditional development has been completely eclipsed by the scale of bizware.

This speciation has already happened; I’m just trying to give it a name. That way, people, businesses and organizations can all agree on what they are doing and what they want to do, because confusion around concepts like software and bizware costs money.

This article is published as part of the Foundry Expert Contributor Network.

The 10 skills every modern integration architect must master

April 17, 2026, 10:00

Enterprise integration has changed fundamentally. What was once a backend technical function is now a strategic capability that determines how quickly an enterprise can adapt, scale and innovate. With SaaS-first architectures, continuous ERP updates, event-driven systems and AI-enabled platforms, integration architects are no longer just connecting systems — they are designing the digital nervous system of the enterprise.

I have spent years leading large-scale cloud and middleware implementations, particularly across Oracle EBS, Oracle Fusion Cloud and various SaaS ecosystems. What I’ve observed is that the gap between good and great integration architects isn’t technical knowledge alone; it’s the breadth of skill, judgment and organizational influence an architect brings to every engagement. The following ten competencies define what separates a modern integration architect from a traditional middleware specialist.

1. Platform thinking, not project thinking

What many get wrong: Designing integrations to satisfy a single project — an ERP rollout, payroll go-live or CRM deployment — without considering reuse or long-term evolution.

Why this fails: SaaS platforms like Oracle Fusion Applications refresh weekly and quarterly. Project-based integrations break repeatedly and accumulate technical debt at a punishing rate.

Modern skill: Adopting a Cloud Integration Platform Mindset, where iPaaS (e.g., Oracle Integration Cloud) is treated as:

  • A shared enterprise platform
  • An abstraction layer between SaaS and consumers
  • A long-term capability, not a temporary solution
  • A source of reusable integrations with long-term enhancement opportunities

The skilled architect also knows that not every integration belongs on the iPaaS platform. High-volume, low-latency integrations might perform better with direct API calls or message queues. Highly complex data transformations might be more maintainable in custom code. The integration architect makes deliberate choices about which integrations belong on which platform based on technical requirements, team skills and long-term maintainability.

Modern architects need both strategic understanding of when these platforms add value and the tactical skills to use them effectively.

2. Mastery of iPaaS and cloud-native capabilities

What many get wrong: Using iPaaS as a visual mapping tool while ignoring native cloud capabilities.

Why this fails: Over-customization increases cost, reduces resilience and bypasses built-in scalability.

Modern skill: Deep understanding of integration patterns and architectures. Integration architects must understand the fundamental patterns that govern how systems communicate — these patterns represent proven solutions to recurring challenges, and knowing when and how to apply them is essential. This means knowing how to leverage iPaaS features before writing custom logic:

  • Adapters vs REST endpoints
  • Lookups, packages and integration patterns
  • OCI services such as Streaming, Object Storage and Functions

As enterprises migrate to the cloud and adopt hybrid architectures, integration architects must understand cloud platforms and their unique constraints. We increasingly work in multi-cloud environments, which means designing patterns that work across providers. Rather than building cloud-specific integrations, the architect establishes cloud-agnostic interfaces. Using platform-neutral API formats like JSON for data interchange ensures portability.

On a recent HCM implementation, I replaced a polling-based integration pattern with an OIC and OCI Streaming event-driven approach for HR updates. The result was dramatically lower latency and a significant reduction in load on Oracle HCM during peak processing windows.
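
The shape of that change is easy to show in outline. The sketch below is illustrative, not the actual OIC/OCI implementation: hcm_client, stream_consumer and their methods are hypothetical stand-ins for whatever polling API and stream client a given platform provides.

```python
import time

# Before: polling (hcm_client and its methods are hypothetical stand-ins)
def poll_for_hr_updates(hcm_client, interval_seconds=300):
    while True:
        for update in hcm_client.list_updates():  # full query every cycle, even when nothing changed
            apply_update(update)
        time.sleep(interval_seconds)              # worst-case latency is the whole polling interval

# After: event-driven (stream_consumer is a hypothetical stream client)
def consume_hr_events(stream_consumer):
    for event in stream_consumer:                 # blocks until an event actually arrives
        apply_update(event)                       # latency is delivery time, not the poll interval
        stream_consumer.commit(event)             # acknowledge so the event is not redelivered

def apply_update(update):
    ...  # push the HR change to downstream systems
```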

3. API-led and event-driven design

What many get wrong: Exposing SaaS applications directly to consumers through tightly coupled integrations.

Why this fails: Schema changes, API version updates and new consumers create cascading failures that ripple across the entire landscape.

Modern skill: Designing API-led and event-driven architectures that genuinely decouple systems. APIs have become the primary integration interface for most modern systems. Integration architects need deep expertise in designing APIs that are intuitive, efficient and maintainable.

Consider what I faced when tasked with exposing customer data from a legacy system. A naive design required multiple calls to retrieve related information: one for basic customer details, another for addresses, another for contact preferences and another for order history. Every extra call compounds latency and couples the consumer to internal data structures. A well-designed API, using the mediation capabilities of integration tools, encodes resource relationships so the consumer retrieves what it needs in a predictable, minimal number of calls. The middleware orchestrates calls to backend systems, aggregates the data and exposes a single, consumer-friendly endpoint. This approach reduced round trips, decoupled consumers from backend structures and improved performance by enabling parallel processing. I also considered trade-offs like payload size and introduced selective expansion to avoid over-fetching. Overall, the design aligns with consumer-driven API principles and leverages mediation capabilities effectively.
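
A rough sketch of that mediation pattern, assuming hypothetical backend URLs and using Python’s asyncio with aiohttp for the parallel fan-out:

```python
import asyncio
import aiohttp

# Hypothetical backend endpoints; real systems sit behind the middleware
BACKENDS = {
    "details":     "https://legacy.example.com/customers/{id}",
    "addresses":   "https://legacy.example.com/customers/{id}/addresses",
    "preferences": "https://legacy.example.com/customers/{id}/preferences",
}

async def fetch(session, url):
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.json()

async def get_customer(customer_id: str, expand: set[str] = frozenset()):
    """Single consumer-facing call; the backend fan-out happens in parallel here."""
    wanted = {"details"} | (expand & BACKENDS.keys())  # selective expansion avoids over-fetching
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(
            *(fetch(session, BACKENDS[name].format(id=customer_id)) for name in wanted)
        )
    return dict(zip(wanted, results))  # one aggregated, consumer-friendly payload
```

The consumer sees one endpoint and one payload; which backends were called, and in what order, stays an implementation detail of the mediation layer.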

4. Canonical data modeling and data governance

What many get wrong: Point-to-point schema mapping — directly transforming source data into target formats for every integration. At first, this seems fast. But it doesn’t scale.

Why this fails: This approach creates a fragile, tightly coupled ecosystem:

  • A single schema change in one system triggers updates across multiple integrations
  • Integrations grow from N systems to N² mappings
  • Data definitions become inconsistent: “Customer,” “Account” or “Contact” may mean different things across systems

Over time, teams spend more effort fixing integrations than delivering value. Every system change requires multiple downstream updates, creating a maintenance nightmare that compounds over time.

Modern skill: Integration architects increasingly need data engineering skills as the lines between integration and data platforms blur, and we often serve as the primary advocates and implementers of master data management (MDM) strategies. Modern integration architects don’t just move data — they define and govern it:

  • Define the system of record (SoR), establishing authoritative ownership for each data attribute to avoid conflicts and duplication
  • Define canonical enterprise data models and enforce governance (versioning, reusability, security, validation rules, error handling and centralized control) at the middleware layer; this is how we solve the problem at scale
  • Enable controlled data propagation by defining how updates flow: event-driven (real-time sync) or batch (scheduled reconciliation)

In modern architectures, integration architects increasingly act as data stewards, enabling scalable MDM strategies and ensuring consistency across distributed systems through centralized mediation layers like OIC.

A canonical ‘Employee’ model I defined for a large financial services client allowed Oracle HCM, multiple payroll providers and identity systems to evolve independently. During a significant HCM upgrade, integration breakage was near zero because the canonical model absorbed the schema changes rather than propagating them to every consumer.
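
A minimal sketch of the idea, with illustrative field names rather than the client’s actual model:

```python
from dataclasses import dataclass

@dataclass
class CanonicalEmployee:
    """Enterprise-wide Employee model; fields are illustrative."""
    employee_id: str
    legal_name: str
    department: str
    pay_group: str
    status: str  # e.g. "ACTIVE", "TERMINATED"

# Each source system gets one adapter into the canonical shape.
def from_hcm(record: dict) -> CanonicalEmployee:
    return CanonicalEmployee(
        employee_id=record["PersonNumber"],      # hypothetical HCM field names
        legal_name=record["DisplayName"],
        department=record["DepartmentName"],
        pay_group=record["PayrollGroup"],
        status=record["AssignmentStatus"],
    )

# Consumers map FROM the canonical model, never from HCM directly, so an
# HCM schema change touches only the from_hcm adapter, not every consumer.
def to_payroll(emp: CanonicalEmployee) -> dict:
    return {"empId": emp.employee_id, "name": emp.legal_name, "group": emp.pay_group}
```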

5. Security-by-design in integration

What many get wrong: Treating integration security as a configuration step late in the project.

Why this fails: Integration layers handle sensitive payroll, financial and identity data — and are frequent attack vectors. Retrofitting security onto an insecure design rarely works.

Modern skill: Modern integration architects must think deeply about security, as integrations often become the weak points in enterprise security postures. Embedding Zero-Trust principles from the start means:

  • OAuth and token-based authentication
  • Least-privilege access controls at the integration level
  • Centralized secrets and certificate management

When we were building integrations for a healthcare provider, HIPAA compliance wasn’t optional; it shaped every architectural decision. Security controls at multiple levels were non-negotiable: field-level encryption, audit logging and access controls tied to role and context rather than just credentials. A skilled architect implementing single sign-on for a corporate portal understands not just SAML and OAuth protocols but how to design attribute exchange, just-in-time provisioning and role mapping between disparate systems.

I’ve made it a rule to align all OIC integrations with OCI IAM policies from day one and enforce per-integration security policies rather than relying on shared credentials. On one engagement, that decision prevented a significant security incident when a downstream system was compromised — our integrations were isolated, not exposed.
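
For illustration, per-integration credentials can be as simple as giving each integration its own OAuth 2.0 client-credentials grant with a narrow scope. The endpoint, client name and scope below are hypothetical:

```python
import requests

def get_token(token_url: str, client_id: str, client_secret: str, scope: str) -> str:
    """Standard OAuth 2.0 client-credentials grant: one client per integration."""
    resp = requests.post(
        token_url,
        data={"grant_type": "client_credentials", "scope": scope},
        auth=(client_id, client_secret),  # credentials come from a secrets vault, never shared
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]

# Each integration gets its own client and a narrow scope, e.g. (hypothetical values):
#   get_token("https://idcs.example.com/oauth2/v1/token",
#             client_id="payroll-sync", client_secret=secret, scope="payroll.read")
# Compromise of one downstream system then exposes only that integration's scope.
```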

6. Observability and business-centric monitoring

What many get wrong: Monitoring integrations only at a technical level — status, error counts and message volume.

Why this fails: Technical success does not guarantee business success. An integration that processes every message without error can still fail the business if it processes the wrong messages.

Modern skill: Implementing business-aware integration observability. This means instrumenting integrations so the operations team can answer questions like ‘Did payroll actually complete successfully?’ not just ‘Were all messages acknowledged?’

I’ve configured OIC activity streams and OCI Logging Analytics for a payroll integration to surface business-level outcomes: completion rates by pay group, exceptions by category (data issues vs system failures vs delays), SLA tracking and reconciliation indicators (expected vs processed employee counts). Within weeks, the finance team was reviewing dashboards themselves rather than filing tickets to ask us if the run had completed. That shift from reactive to proactive operations was transformative, significantly reducing turnaround time, improving SLA adherence and increasing trust in integration reliability.
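
A simplified sketch of what business-aware observability computes; the input shapes are illustrative:

```python
from collections import Counter

def payroll_run_health(expected: dict, processed: list) -> dict:
    """Business-level view of one payroll integration run.

    expected:  {"pay_group": expected_employee_count, ...}
    processed: [{"pay_group": ..., "status": "ok" | "data_error" | "system_error"}, ...]
    """
    ok = Counter(p["pay_group"] for p in processed if p["status"] == "ok")
    exceptions = Counter(p["status"] for p in processed if p["status"] != "ok")
    return {
        "completion_by_pay_group": {
            group: f"{ok[group]}/{count}" for group, count in expected.items()
        },
        "exceptions_by_category": dict(exceptions),
        # Reconciliation: a run can be error-free and still incomplete
        "reconciled": all(ok[g] == n for g, n in expected.items()),
    }
```

The “reconciled” flag is the key difference: it answers the business question (did everyone expected get processed?) rather than the technical one (did any message fail?).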

7. Designing for continuous change

What many get wrong: Assuming integrations should be ‘stable’ and rarely modified.

Why this fails: Cloud environments are defined by constant change — quarterly SaaS updates, API evolution and acquisitions mean no integration is ever truly done. The mistake many teams make is optimizing for initial stability instead of long-term adaptability. This leads to brittle integrations that break with every release cycle, creating fire drills and eroding business trust.

Modern skill: Building change-resilient integrations where change is expected, tested and absorbed with minimal disruption through:

  • Versioned APIs with clear deprecation policies and backward compatibility
  • Contract-first design so consumers agree on interfaces before implementation begins, with schema validation at runtime and test time
  • Automated regression testing that runs before every quarterly update, validating API responses, business flows, edge cases and failure handling

Before each Oracle ERP quarterly update, our automated test suite validated all critical OIC integrations against the new release in a pre-prod environment. We caught breaking changes weeks before they reached production, ensuring seamless business continuity. The peace of mind this creates, for the integration team and for the business, cannot be overstated.
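
A minimal sketch of one such contract check, using the Python jsonschema library; the contract shown is abbreviated and illustrative:

```python
from jsonschema import validate, ValidationError

# Contract agreed with consumers before implementation (abbreviated, illustrative)
EMPLOYEE_CONTRACT = {
    "type": "object",
    "required": ["employeeId", "payGroup", "status"],
    "properties": {
        "employeeId": {"type": "string"},
        "payGroup": {"type": "string"},
        "status": {"enum": ["ACTIVE", "INACTIVE", "TERMINATED"]},
    },
}

def check_contract(response_payload: dict) -> bool:
    """Run in CI against the pre-prod environment before each quarterly update."""
    try:
        validate(instance=response_payload, schema=EMPLOYEE_CONTRACT)
        return True
    except ValidationError as err:
        print(f"Breaking change detected: {err.message}")  # fail the pipeline here
        return False
```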

Design integrations not for stability, but for evolution — treating change as a constant and embedding resilience through versioning, contract governance, automated validation and decoupled architecture. This shifts integration from a fragile dependency to a durable, adaptable platform capability.

8. DevOps and automation for integrations

What many get wrong: Treating integrations as manually deployed artifacts.

Why this fails: Manual deployments increase risk and slow delivery. They also make audit and compliance conversations unnecessarily painful.

Modern skill: Applying CI/CD and DevOps practices to integrations — automated deployment pipelines, environment standardization with traceability and version-controlled artifacts as first-class engineering outputs.

We promoted integration packages from development to test to production using automated pipelines built with CI/CD tools like FlexDeploy and Jenkins on a recent engagement. Deployment errors dropped to near zero and audit evidence was generated automatically with every release. The integration team stopped dreading deployments and started shipping faster.

9. Business process and domain expertise

What many get wrong: Focusing purely on technical flows without understanding business context.

Why this fails: Integrations that work technically may fail operationally. A technically perfect integration built around the wrong business process creates a well-engineered wrong answer.

Modern skill: Integration architects frequently serve as bridges between business stakeholders and technical teams. This requires translating business needs into technical requirements and explaining technical constraints in business terms — clearly and without condescension.

Armed with process understanding, the architect designs integrations that automate entire workflows rather than just moving data between systems. The difference between a data-mover and a process architect is the difference between a cable and a nervous system.

On a global HR transformation, I spent the first two weeks meeting with the HR operations team, gathering requirements and understanding their business processes before writing any integration specifications. By understanding the full hire-to-retire lifecycle, not just the data flows, I designed integrations that ensured consistency across HR, payroll, finance and identity systems in a way that no purely technical analysis would have produced.

10. Leadership and enterprise influence

What many get wrong: Assuming integration architects only need technical authority.

Why this fails: Integration decisions impact multiple business units and platforms. Without the ability to influence stakeholders, align cross-functional teams and drive adoption, even the best technical design can stall or fail.

Modern skill: Acting as a strategic leader, not just a technical expert, bridging the gap between business priorities and technical execution:

  • Influencing architecture decisions across organizational boundaries
  • Establishing integration standards and governance frameworks that drive consistency
  • Guiding multiple delivery teams toward coherent, enterprise-wide outcomes

Defining enterprise-wide standards reduces duplicated integrations while improving audit readiness and compliance.

Technical brilliance alone is insufficient if integration architects can’t effectively communicate their designs and decisions. When I document a complex integration architecture, I create multiple views targeting different audiences.

For executive stakeholders, I produce high-level diagrams showing how major systems connect and the business capabilities these integrations enable, with minimal technical jargon. I focus on conveying the business benefits, risk mitigation plans and strategic value of these integrations.

For development teams, I provide detailed sequence diagrams, error-handling flows and API documentation with example requests and responses, along with clear guidance for implementing integrations between various applications.

For operations, I write runbooks for common failure scenarios and explain how to interpret log messages and metrics in the context of business outcomes. I provide guidance for proactive monitoring and incident response.

Effective architects invest in knowledge transfer — conducting workshops to explain architectural decisions, pairing with developers during implementation to ensure best practices are adopted and creating decision logs that capture why specific approaches were chosen over alternatives. They also provide support during the initial production rollout, ensuring confidence, reliability and operational readiness.

Modern integration architects combine deep technical expertise with enterprise influence — communicating effectively, guiding teams, enforcing standards and ensuring that integrations deliver measurable business outcomes. Leadership in this role means shaping organizational decisions, reducing redundancy and turning integrations into a strategic asset.

The evolving role: What the next five years will demand

The role of integration architects will continue to evolve as technology and business needs change. Artificial intelligence and machine learning are already beginning to influence integration, with intelligent data mapping, automated error resolution, agentic workflows and predictive scaling. Low-code and no-code integration platforms are democratizing integration development, requiring architects to shift focus toward governance, standards and architecture while empowering business users to build simpler integrations themselves.

I believe the architects who thrive will be those who treat learning as a core professional discipline, not an optional add-on. That means reading technical research, experimenting with new tools and participating in communities where ideas get challenged. Modern integration architects design intelligent workflows, automate complex business processes and integrate AI insights into enterprise systems, empowering organizations to achieve faster, smarter and more scalable operations.

The fundamental skills that distinguish exceptional integration architects — the ability to understand complex systems, translate between business and technology, design for resilience and scale and continuously learn and adapt — will remain relevant regardless of how specific technologies evolve. Those who master this diverse skill set will continue to play a critical role in enabling enterprises to harness the full power of their technology investments.

Learning from failure: The habit that separates the best

The best integration architects treat failures as learning opportunities rather than events to be survived and forgotten. When an integration outage causes significant business disruption, we don’t just fix the immediate problem. We conduct thorough post-mortems to understand root causes, identify systemic issues that contributed to the failure and implement changes to prevent similar problems.

After an integration failure caused data corruption on a project I led, I resisted the pressure to simply restore from backup and move on. We analyzed why error handling didn’t catch the problem, why monitoring didn’t detect the corruption earlier, why automated testing didn’t surface the bug and how recovery and reconciliation could be optimized to minimize business impact.

We used these insights to redesign error-handling patterns to fail safely and recover gracefully, enhance monitoring with business-aware observability and anomaly detection, expand automated test coverage across all critical integrations and implement reconciliation and recovery procedures that minimize downtime and data loss. This approach builds resilience, reduces risk and enhances trust across business and technical teams. Six months later, that investment paid off when a similar failure mode was caught in staging rather than production.

Successful architects maintain awareness of emerging technologies and patterns. We experiment with new tools, strategies and approaches, attend conferences and webinars, participate in professional communities and read technical blogs, case studies and research papers. Staying current is not optional; it is how integration architects remain relevant, proactive and capable of driving innovation.

A rare combination

The modern integration architect is no longer just a middleware specialist. We are platform strategists, security architects, business translators and technical leaders.

Enterprises that invest in these skills build integration platforms that are resilient, secure and scalable. Those that don’t find themselves constantly reacting to failures, upgrades and missed business opportunities — fighting the same fires in every quarterly cycle.

Integration architecture is not a purely technical discipline, nor is it purely strategic. It requires a rare combination of deep technical expertise, business acumen, communication skills and the ability to navigate organizational complexity. Those who develop this multifaceted skill set find themselves uniquely positioned to drive meaningful business transformation in an increasingly interconnected digital world.

In a cloud-first world, integration excellence is enterprise excellence.

This article is published as part of the Foundry Expert Contributor Network.

The secure intelligence framework: Architecting AI systems for a data-driven world

April 15, 2026, 09:00

When I first started deploying AI systems at scale, I made the same mistake most technology leaders make: I treated security and data architecture as problems to solve after the intelligence layer was built. We moved fast, we shipped models and we celebrated early wins. Then, six months in, we discovered that one of our machine learning pipelines was inadvertently exposing sensitive customer data to downstream systems that had no business accessing it. No breach, no headlines, but it was a wake-up call that reshaped how I think about AI architecture entirely.

The truth is, most organizations are building AI the wrong way. They invest heavily in model performance, infrastructure and compute, but treat data governance and security as afterthoughts. In my experience working across industries, this approach creates systems that are technically impressive but fundamentally fragile. Intelligence without integrity is just sophisticated risk.

This article outlines the framework I developed, what I now call the Secure Intelligence Framework, and how any technology leader can apply it to build AI systems that are both powerful and trustworthy.

Why security must be designed in, not bolted on

The instinct to move fast when deploying AI is understandable. Business pressure is real and AI projects often begin as proofs of concept that quietly grow into production systems before anyone has thought seriously about security.

But this sequencing is dangerous. According to the IBM Cost of a Data Breach Report 2024, the average cost of a data breach reached $4.88 million globally and organizations without AI and automation embedded in their security operations paid significantly more. Poorly architected AI systems expand an organization’s attack surface, creating new vulnerabilities through model APIs, training data pipelines and inference endpoints that traditional security frameworks were never designed to address.

The deeper problem is cultural. When security is treated as a deployment checklist rather than a design principle, teams inevitably cut corners under deadline pressure. I have seen organizations launch production AI systems with no access logging, no output monitoring and no rollback plan because those conversations happened after the build, not before it. By that point, the architecture is already set and retrofitting security is expensive, disruptive and often incomplete.

When I redesigned our AI architecture, I started from a single principle: every layer of the system must assume that every other layer is potentially compromised. This is zero-trust thinking applied to AI and it changes everything about how you design data flows, access controls and model governance. The NIST AI Risk Management Framework offers a strong foundation here; it is one of the first documents I share with any team beginning a serious AI deployment.

width="450" height="326" sizes="auto, (max-width: 450px) 100vw, 450px">
Figure 1: The secure intelligence framework data, model and governance layers.

Sunil Kumar Mudusu

The 3 layers of a secure AI system

The Secure Intelligence Framework is built on three interdependent layers. Each must be addressed independently and then integrated as a whole.

The data layer

This is where most vulnerabilities begin. I have seen organizations connect machine learning models directly to production databases with minimal access controls, reasoning that the model itself is not a user and therefore does not pose a risk. This thinking is wrong and expensive.

Data pipelines must enforce least-privilege access; every component of the AI system should access only the specific data it needs, nothing more. At one organization I worked with, implementing role-based access controls at the pipeline level alone reduced sensitive data exposure by over 60% without any impact on model performance. Equally important is data lineage. You must be able to answer, at any point, exactly what data trained a given model, where it came from and who had access to it. Without lineage, you cannot audit, you cannot comply and you cannot debug when something goes wrong.
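
A minimal sketch of the lineage idea; the record fields and the store interface are illustrative, not a specific product’s schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class LineageRecord:
    """Minimal lineage entry: enough to answer 'what trained this model?'"""
    model_id: str
    model_version: str
    dataset_id: str
    dataset_snapshot: str   # immutable snapshot/partition identifier
    source_system: str
    accessed_by: str        # pipeline or service principal, not a person
    recorded_at: datetime

def record_training_run(model_id, version, datasets, principal, store):
    """Append one lineage record per dataset consumed by a training run."""
    now = datetime.now(timezone.utc)
    for ds in datasets:
        store.append(LineageRecord(
            model_id=model_id, model_version=version,
            dataset_id=ds["id"], dataset_snapshot=ds["snapshot"],
            source_system=ds["source"], accessed_by=principal, recorded_at=now,
        ))
```

With records like these, “exactly what data trained this model?” becomes a query rather than an archaeology project.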

The model layer

Once data is governed properly, attention turns to the models themselves. The key risks here are model inversion attacks, where adversaries extract training data from model outputs, and prompt injection in large language model deployments, where malicious inputs manipulate model behavior.

Defending against these threats means treating model endpoints like any other sensitive API: authentication, rate limiting, output filtering and adversarial testing built into the deployment pipeline as standard practice. The OWASP Top 10 for Large Language Model Applications is one of the most practical references I have found for model-layer risk; it catalogs the exact attack patterns that keep AI security teams up at night. When we deployed an NLP system for internal knowledge management, we added an output review layer that scanned responses for personally identifiable information before returning results to users. It added 40 milliseconds of latency. It was worth every millisecond.
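
A toy version of that output review layer might look like the following; the regex patterns are illustrative, and a production filter would combine them with NER models and allow-lists:

```python
import re

# Illustrative patterns only; real filters are broader and locale-aware
PII_PATTERNS = {
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scrub_model_output(text: str) -> tuple[str, list[str]]:
    """Redact PII from a model response before it reaches the user."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            found.append(label)
            text = pattern.sub(f"[REDACTED-{label.upper()}]", text)
    return text, found  # 'found' feeds audit logging and alerting

clean, hits = scrub_model_output("Contact John at john.doe@example.com or 555-867-5309.")
print(clean, hits)  # redacted text plus ['email', 'phone']
```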

The governance layer

This is the layer most often overlooked because it feels administrative rather than architectural. In reality, governance is what holds the other two layers together over time.

Governance means clear ownership for every model in production: who built it, who maintains it and who is accountable for its outputs. It means model versioning and rollback capabilities. And it means regular audits of both model performance and data access patterns. Microsoft’s Responsible AI Standard and Google’s Model Cards framework are both practical starting points that I have adapted in my own work. Neither is a plug-and-play solution, but both offer structured thinking that can be tailored to almost any organizational context.

What this looks like in practice

Implementing this framework does not require rebuilding everything at once. I introduced it using a phased approach over three quarters.

In the first quarter, we focused on the data layer: auditing pipelines, implementing access controls and establishing lineage tracking. Unglamorous work, but it surfaced three data access issues we had not previously known existed. In two cases, internal teams had been querying datasets they were never authorized to use, simply because no restriction had been put in place.

In the second quarter, we addressed the model layer: hardening endpoints, introducing output filtering and embedding adversarial testing into our CI/CD pipeline. The team developed a security-first mindset that made these changes feel natural rather than imposed.

In the third quarter, we formalized governance, assigning model owners, establishing review cycles and integrating model audits into existing IT processes. By year-end, we had a system our security team, legal team and business stakeholders could all trust. New AI projects that previously took weeks to approve were being scoped and greenlit in days because the foundational questions had already been answered at the architecture level.

Figure 2: Three-quarter phased implementation roadmap with outcomes per phase. (Sunil Kumar Mudusu)

Trust is architected, not assumed

Security and intelligence are not in tension; they are complementary. The discipline that makes an AI system secure also makes it more reliable, more auditable and more explainable to the stakeholders who need to trust it.

AI is not a technology problem. It is a trust problem.

If you are building AI systems without a structured approach to data governance and security, you are not moving faster than your competitors. You are accumulating technical debt that will cost far more than the speed ever gained. The organizations that lead in AI over the next decade will not be those that deploy the most models; they will be those that deploy models people can trust.

Start with the data. Secure the model. Govern everything. The rest is execution.

This article is published as part of the Foundry Expert Contributor Network.

Architecting the AI backbone of intelligent insurance: How to engineer a scalable and performant enterprise AI platform

April 14, 2026, 12:14

I spent years at Meta engineering large-scale systems for billions of users, delivering sub-second latency and five-nines (99.999%) uptime. When we started Outmarket AI, I brought that same lens: scalability, reliability, sustainability. Not buzzwords but real engineering.

Commercial insurance turned out to be a different planet. Some departments were still on pen and paper, going through manila folders. Others had claims systems built on COBOL, running on mainframes from the ’80s. Nobody wants to touch them because the guy who understood the code retired years ago and didn’t leave notes. Underwriters, brokers, marketing, customer reps — everyone going through thousand-page policy documents, making million-dollar calls for businesses. According to McKinsey’s State of AI research, 78% of organizations are using AI in at least one business function. Insurance has been slower to change the way it operates day to day.

We started building AI products for a few lines of commercial business (workers’ comp, general liability and property coverage) to better understand the pain points. Consider workers’ compensation, which is a beast in itself. A human has to analyze injury claims, workplace risk factors, OSHA reports, medical records, claims histories and state regulations that differ wildly. For general liability, one has to dig through premises risk, operations exposure and vendor agreements; property coverage brings similar hassles. That means a single policy decision might require someone to pull together dozens of documents from different sources and spend more time on clerical work than on the actual decision.

Within weeks, our first client wanted it for every other line of business. Not just one department, but the entire organization. The pattern repeated with every new client as they quickly realized the same AI infrastructure could transform how they handled all of their commercial policies. That moment crystallized something for the founding team. We weren’t building a feature; we were building AI-backed infrastructure, and I knew exactly what that meant from my time engineering at scale.

The AI part wasn’t what kept me up at night. Large language models (LLMs) can handle dense insurance documents. That’s been proven. What worried me was everything underneath.

First, scale. How do we build something that grows with more clients? And scale by users per client? What about seasonality when commercial insurance policy renewals peak? Q4 is a mess. Traffic doesn’t grow linearly. It spikes ~10x.

Second, reliability. We started with one LLM provider. It worked fine, but what happens when traffic spikes and everyone’s slamming the same LLM? That’s a nightmare of rate limits and token limits. What if third-party systems go down? We have all seen this in action when ChatGPT went down.

Third, data isolation. No insurer would tolerate its proprietary underwriting data bleeding into a competitor’s context window. Every client needs their own guardrails.

So we weren’t just building a system. We were building a beast that can’t flinch under pressure, can’t go dark when a provider fails and can’t leak data between clients.

We attacked each problem head-on.

For isolation, we went single-tenant. Every client gets their own instance, their own database, their own AI context boundary. No shortcuts.

For reliability, we designed load balancers for AI agents that weigh everything that matters (latency, cost, accuracy needs, provider health) and make a call in real time. If one provider is down, traffic shifts automatically.

This orchestration layer was the breakthrough that lets the platform scale.
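
A simplified sketch of the routing idea; the Provider fields and scoring weights are illustrative, and the production system weighs far more signals:

```python
from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    healthy: bool
    p95_latency_ms: float
    cost_per_1k_tokens: float
    quality_score: float  # from ongoing evals against human-verified samples

def route(providers: list[Provider], needs_high_accuracy: bool) -> Provider:
    """Pick a provider in real time; weights here are illustrative, tuned per workload."""
    candidates = [p for p in providers if p.healthy]  # failover happens here
    if not candidates:
        raise RuntimeError("All providers down: trigger degraded-mode fallback")

    def score(p: Provider) -> float:
        accuracy_weight = 3.0 if needs_high_accuracy else 1.0
        return (accuracy_weight * p.quality_score
                - 0.002 * p.p95_latency_ms
                - 0.5 * p.cost_per_1k_tokens)

    return max(candidates, key=score)
```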

Why is insurance the ultimate stress test for AI infrastructure?

Think about a mid-sized restaurant chain buying commercial insurance. They need workers’ comp for kitchen staff, general liability for slip-and-fall incidents, property coverage for equipment, outdoor dining coverage, liquor liability, theft protection. Probably a dozen policies total. And these are all thousands of pages of dense legal language, exclusions, endorsements, coverage schedules and more.

Before AI, someone had to read all of this manually. Risk managers spent weeks on it, sometimes months, comparing quotes from various carriers, hunting for gaps, trying to catch redundancies, all of it by hand. The mental load was brutal and mistakes were inevitable. I have seen claims denied because of a coverage gap buried on page 847 that no one saw. The policy looked fine. The exclusion that mattered was hiding in plain sight. When that happens, insurers fall back on their errors and omissions (E&O) coverage to protect against mistakes made by their employees while reviewing insurance. That’s how broken the manual process is, and it can easily lead to millions of dollars in claims.

A typical policy bundle containing 2K pages can now be ingested in 10 to 15 seconds. Speed is a big win, but what’s more exciting are the things that were not possible before. Quotes from various carriers can now be compared side by side in real time. AI flags gaps automatically before they turn into claim denials. Underwriters can type questions in plain English: “Does this cover water damage from a burst pipe in an unoccupied building during winter?” The answer comes back with citations, which builds trust and confidence. No human can process documents at that speed and accuracy. Humans are now reviewers and decision-makers, not document processors.

Surviving seasonality: Engineering for 10x traffic spikes

Insurance has a brutal seasonality problem. Policy renewals cluster around year-end. As soon as Q4 hits, traffic is expected to spike by ~10x. An architecture that runs fine in March can collapse in December if we haven’t planned for it.

Three things kept me up. First, caching. LLM caching is not like typical web caching. Take these two questions, for example:

  1. “What’s my deductible for property damage?”
  2. “How much do I pay out of pocket for building damage?”

Both are basically the same question. How do we recognize that and not waste compute power?

Second, scaling. When renewal season hits, the largest client might need 10x the capacity overnight, but I don’t want to pay for that capacity year-round.

Third, routing. Not every query to the LLM needs the biggest and best model. A simple policy lookup doesn’t need the same horsepower as a complex one. Sending everything to one model means simple queries wait behind heavy ones.

We tackled each one.

For caching, we use semantic matching at multiple levels (a sketch of the query-level matching follows this list):

  1. At the embedding level: We cache vector representations so repeated ingestion of the same content reuses the same embeddings.
  2. At the query level: We use locality-sensitive hashing to spot similar questions and serve cached responses. If a question has already been answered, a similar question can reuse the same response without burning compute twice.
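
Here is the promised sketch of query-level matching, simplified to a linear scan over normalized embeddings instead of the locality-sensitive hashing we use at scale; embed_fn is any text-to-vector function:

```python
import numpy as np

class SemanticCache:
    """Embedding-similarity cache; a toy stand-in for an LSH-backed index."""
    def __init__(self, embed_fn, threshold: float = 0.92):
        self.embed_fn = embed_fn
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []  # (embedding, cached answer)

    def get(self, question: str):
        query = self.embed_fn(question)
        query = query / np.linalg.norm(query)
        for emb, answer in self.entries:
            if float(np.dot(query, emb)) >= self.threshold:  # cosine similarity
                return answer  # "deductible" vs "out of pocket" can land here
        return None

    def put(self, question: str, answer: str):
        emb = self.embed_fn(question)
        self.entries.append((emb / np.linalg.norm(emb), answer))
```

The threshold is the whole game: too low and you serve wrong answers, too high and near-duplicate questions burn compute twice.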

For scaling, each worker process auto-scales horizontally based on queue length and current latency. The largest client might go from 4 workers to 40, then scale back as soon as traffic drops. The key here is that scaling is reactive by default, but for seasonality it can be predictive: if client X’s renewal rush started October 15th last year, we can pre-warm their infrastructure on October 10th this year.

For routing, we built a classifier that examines incoming requests and sends them to the right model. A simple lookup can use a small, fast model, while a complex coverage-analysis workflow is routed to a more sophisticated model. This can cut cost by about 40% and actually improve P95 latency, because simple queries are no longer jammed behind complex ones.
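
As a toy illustration of the routing decision (the real classifier is trained, not keyword-based):

```python
def classify_query(question: str) -> str:
    """Toy heuristic router; production versions use a trained classifier."""
    complex_markers = ("compare", "analyze", "gap", "exclusion", "across", "why")
    long_question = len(question.split()) > 25
    if long_question or any(m in question.lower() for m in complex_markers):
        return "large-model"   # complex coverage analysis
    return "small-model"       # simple policy lookup

assert classify_query("What is my deductible?") == "small-model"
assert classify_query("Compare liability exclusions across these carrier quotes") == "large-model"
```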

Put this together and users get sub-second responses, whether it’s a quiet Tuesday or a chaotic December. That consistency is what turns AI into something people can use at scale.

AI hallucinations kill trust; domain knowledge fixes it

Large language models fail in ways that regular software does not. In traditional software engineering, a database either returns the right row or throws an error, but an LLM can return plausible-sounding nonsense, and any system will happily pass it downstream unless we build detection mechanisms. Research published in Nature has shown that detecting these “confabulations” (arbitrary and incorrect generations) requires measuring uncertainty about the meanings of responses, not the text alone.

The root cause lies in how these models learn. General-purpose LLMs train on public data crawled from the internet. They’re capable of broad reasoning without any domain expertise. If we ask a general LLM about insurance policy structure, it will give a reasonable-sounding answer drawn from whatever insurance data exists in its training set, which may or may not reflect the actual terminology, coverage structures and regulatory requirements that clients operate under. In insurance, a reasonable-sounding yet wrong answer can lead to denied claims or even regulatory violations, resulting in millions of dollars in losses.

Research on fine-tuning LLMs for domain knowledge graph alignment has demonstrated that when models are tuned to domain knowledge, they can perform multi-step inference while minimizing hallucination. So we built our own knowledge graph for insurance, which holds definitions of how the industry actually works: coverage types, policy structures, regulations, carrier-specific terms, claims workflows and how everything connects. It took years of domain expertise to build and we are still fine-tuning it every time we run into a weird edge case. What we found was that when our models were fine-tuned against this custom graph, they stayed inside verified boundaries instead of inventing plausible-sounding answers from pre-trained public data.

In practice, this makes a huge difference. If a user asks for coverage exclusions, then the system no longer hallucinates. It uses a knowledge graph as a source of truth. Any missing knowledge in the graph means uncertainty rather than confabulating an answer.

No system is perfect, though. Even with the knowledge graph, things slip through. I call them hallucination tripwires: automated checks that catch the AI when it’s making things up.

Model claims a coverage limit that’s nowhere in the source document? Tripwire.

Model references a policy section that doesn’t exist? Tripwire.

Model pulls a number that’s way outside expected ranges for that policy type? Tripwire.

An ACM survey on LLM hallucinations categorizes hallucination detection techniques into two groups: factuality and faithfulness approaches. When a tripwire triggers, a smart system won’t just log an error and move on. It will fall back to a secondary model for verification. And when that fails, it will escalate to a human for review, depending on severity and confidence scores.
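
A stripped-down sketch of the tripwire idea; the answer shape and expected ranges are hypothetical, and real checks are more robust than substring matching:

```python
def run_tripwires(answer: dict, source_text: str, expected_ranges: dict) -> list[str]:
    """Faithfulness checks: every claim must be grounded in the source document.
    'answer' shape is hypothetical: {"coverage_limit": ..., "cited_section": ...}
    """
    tripped = []
    limit = answer.get("coverage_limit")
    if limit is not None and str(limit) not in source_text:
        tripped.append("coverage limit not found in source document")
    section = answer.get("cited_section")
    if section and section not in source_text:
        tripped.append(f"cited section '{section}' does not exist")
    lo, hi = expected_ranges.get("coverage_limit", (0, float("inf")))
    if limit is not None and not (lo <= float(limit) <= hi):
        tripped.append("value outside expected range for this policy type")
    return tripped  # non-empty -> verify with a secondary model, then escalate to a human
```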

Hallucination detection is one piece. The other is model drift. Models can degrade over time as production inputs shift away from the training data, and accuracy drops. We track this constantly, checking against human-verified samples. When we see the numbers trending down, we fine-tune or adjust our prompts. Observability isn’t a nice-to-have; it’s a must for enterprise applications to stay reliable and win clients’ trust.

Databricks popularized a concept called medallion architecture, where raw ingestion produces what we call “bronze” data, minimally processed, potentially messy. AI-driven normalization transforms this into “silver” data with consistent schemas and validated fields. Further enrichment and cross-referencing produce “gold” data that’s ready for downstream analytics and reporting. This tiering can help serve different use cases appropriately. Real-time policy queries can work against silver data, whereas regulatory reporting and actuarial analysis must work off of gold-tier data with full audit trails.
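
A schematic of that tiering, with illustrative fields; this is the bronze-to-silver-to-gold flow, not a Databricks API:

```python
def to_silver(bronze_record: dict) -> dict:
    """Normalize raw ingestion into a consistent schema with validated fields."""
    return {
        "policy_id": str(bronze_record["policy_id"]).strip().upper(),
        "effective_date": bronze_record.get("effective_date"),  # parsed/validated in practice
        "premium_usd": float(bronze_record["premium"]),
    }

def to_gold(silver_record: dict, carrier_ref: dict, audit_id: str) -> dict:
    """Enrich and cross-reference; gold carries the audit trail regulators need."""
    return {
        **silver_record,
        "carrier_name": carrier_ref.get(silver_record["policy_id"], "UNKNOWN"),
        "audit_id": audit_id,  # links back to the exact bronze inputs
    }
```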

Engineering principles that made the difference

A few principles stand out when I look back.

AI is infrastructure. Treat it that way from day one. Don’t bolt on scalability later: The early decisions on single-tenant vs. multi-tenant, sync vs. async, one LLM provider vs. several will compound. Unwinding them later is painful and expensive; it can consume engineering resources for months and bring new feature work to a standstill for weeks at a time.

Build for failure. Providers go down, models hallucinate and demand spikes when we least expect it, so build the fallback paths before you need them, not in panic mode.
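As a sketch of that principle, here is a provider-fallback loop; the provider names and the `call_llm` helper are placeholders, not real SDK calls:

```python
# Try providers in order, with timeouts and backoff, before anything pages a human.
import time

PROVIDERS = ["primary_llm", "secondary_llm", "self_hosted_llm"]  # placeholders

def call_llm(provider: str, prompt: str, timeout_s: float) -> str:
    raise NotImplementedError  # stand-in for the real provider SDK call

def generate_with_fallback(prompt: str) -> str:
    last_error = None
    for provider in PROVIDERS:
        try:
            return call_llm(provider, prompt, timeout_s=10.0)
        except Exception as err:  # outage, rate limit, timeout...
            last_error = err
            time.sleep(0.5)  # brief backoff before the next provider
    raise RuntimeError("All providers failed") from last_error
```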

Observability is not optional. In early-stage conventional software, we can sometimes skip the fancy monitoring; with AI systems in production, we cannot. No observability means shipping broken output without ever knowing it.

Commercial insurance built its traditional processes around human limits, especially reviewing speed and mental bandwidth. AI can lift those limits, but only if the infrastructure holds up reliably, at scale and under pressure.

The difference between an AI demo and an enterprise AI system is not the AI models but the backbone: Infrastructure that doesn’t flinch.

This article is published as part of the Foundry Expert Contributor Network.
Want to join?

The AI paradox: How AI fixes the crisis it creates

April 14, 2026, 09:00

The rise of AI has created significant challenges for modern data center infrastructure in terms of power management. Traditional enterprise racks that once consumed an average of 7-10 kW now require close to 30-100 kW. This significant increase in computational requirements has revealed a fundamental bottleneck: The traditional infrastructure isn’t enough to sustain AI growth.

However, AI can also prove to be a savior: By embedding it into hardware design and automated construction workflows, data centers can evolve into intelligent, adaptive systems instead of the passive hubs they are today.

Revolutionizing hardware architecture through AI

Due to the rapid development of AI models, significant innovation is required to reshape hardware design into a streamlined, AI-driven innovation cycle. This applies both to microarchitecture design and to macro-level system management.

AI-driven chip design processes

Modern AI accelerator chips integrate chiplets, high-bandwidth memory (HBM) stacks and dense interconnect structures; at that complexity, manual design isn’t scalable. AI-driven EDA tools are, in essence, the future.

AI-driven EDA tools can be pivotal in multiple avenues. In one such instance, Google has shown that something as complex as chip floorplanning can be done in hours, with results that rival or surpass human efforts in quality. These optimizations can reduce parasitic energy losses and prevent thermal hotspots in the physical design (PD).

AI models can also be used to evaluate thousands of multi-die configurations to predict hotspots, through-silicon via (TSV) density issues and power-delivery constraints. This enables far more thermally balanced 2.5D/3D layouts than traditional heuristics.

Beyond PD floorplanning and hotspot prevention, verification is another avenue where AI-driven EDA is useful. Verification matters because it can consume up to 70% of chip development time. AI tools such as Synopsys ML-driven verification and Cadence Cerebrus can reduce that time, and shorter development cycles make it feasible to meet the growing performance needs of AI models.

Another avenue where AI is useful is reducing power consumption in front-end design. Researchers have demonstrated that ML-driven dynamic voltage and frequency scaling (DVFS) strategies reduce power without significant performance loss. AI can also predict the power consumption of an RTL design or post-layout snapshot in seconds, allowing designers to iterate rapidly.
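For a flavor of what ML-driven DVFS means in practice, here is a deliberately simplified controller: A stub predictor forecasts the next interval’s utilization and the controller picks the lowest frequency step that covers it. The frequency steps, headroom and predictor are all assumptions for illustration.

```python
# Simplified ML-driven DVFS: predict demand, pick the cheapest adequate P-state.

FREQ_STEPS_MHZ = [800, 1200, 1600, 2000]  # hypothetical P-states

def predict_utilization(recent_util: list) -> float:
    """Stand-in for a trained model; here, a simple moving average."""
    return sum(recent_util) / len(recent_util)

def pick_frequency(recent_util: list, headroom: float = 0.1) -> int:
    target = min(1.0, predict_utilization(recent_util) + headroom)
    # Assume full utilization needs the top step; pick the smallest
    # frequency whose capacity covers the predicted demand.
    needed = target * FREQ_STEPS_MHZ[-1]
    for f in FREQ_STEPS_MHZ:
        if f >= needed:
            return f
    return FREQ_STEPS_MHZ[-1]

print(pick_frequency([0.35, 0.4, 0.3]))  # -> 1200, a mid-range step
```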

Thermal and power management

Modern AI chips generate vast amounts of heat, which can lead to hardware failure, so they require algorithms that can analyze data from many thermal sensors at once, and AI is well suited to that role. Modern data centers have already used AI to reduce facility energy consumption, achieving significant savings. These AI-driven systems improve hardware longevity and reliability while significantly reducing operational costs.

AI can also analyze operational data to identify energy-intensive processes and then allocate computational tasks to the most efficient resources. This reduces idle time and avoids power waste.

This creates a “self-sustainable cycle”: Power-optimized hardware enables the training of even more powerful AI models, which in turn are used to design the next generation of hardware.

AI in data center design and construction

To meet the demand for “speed-to-market,” AI can be integrated into the procurement and design phases of data centers. These segments were historically slowed by manual reviews and complex specifications, exactly the kind of work AI can help with.

Streamlining procurement and design

AI tools can be particularly useful in automating tasks that otherwise require substantial manual work. For example, LLM-based assistants trained on design standards and Request for Information (RFI) history can now respond to vendor queries in minutes, a task that would have taken a controls engineer two to four hours. Similarly, machine learning systems can extract control point requirements (temperature setpoints and pressure limits, for example) from 100% design drawings, reducing human error in the transition from blueprint to physical installation.
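As a toy illustration of that extraction task, the snippet below pulls setpoints and pressure limits out of specification text with simple patterns; a real system would use a trained extraction model rather than hand-written regexes, and the specification line is invented.

```python
# Illustrative (not production) control-point extraction from spec text.
import re

SPEC = "Supply air temperature setpoint: 24.0 C. Max loop pressure: 150 kPa."

setpoints = re.findall(r"temperature setpoint:\s*([\d.]+)\s*C", SPEC, re.I)
pressures = re.findall(r"pressure:\s*([\d.]+)\s*kPa", SPEC, re.I)
print({"temp_setpoints_c": setpoints, "pressure_limits_kpa": pressures})
```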

In addition to identifying control requirements, generative AI tools can organize information scattered across multiple documents and convert it into structured outputs. For example, AI can automatically generate equipment schedules that list all major components, their capacities, control parameters and operating limits. Activities that once took design teams several weeks—such as cross-checking documents, extracting control data and preparing schedules—can potentially be completed in hours.

Automated commissioning and configuration

The commissioning process of a data center spans five levels, beginning with factory tests and ending with integrated system testing, the final hurdle before the data center goes live. It is a key step, but it can become tedious because it requires validating complex, interconnected electrical and mechanical systems to ensure zero-downtime reliability, often under tight timelines. AI scripts can reduce the burden by automatically checking software configurations and interconnected systems, cutting rework during final testing. Generative AI can also simulate system behavior under various operating conditions before physical commissioning starts, helping the system reach optimal performance at handover.

Predictive operations and AIOps

AI can also make data center management predictive and proactive instead of reactive. For example, once a model has been trained on vibration and voltage sensor data, it can forecast equipment failures and drive the maintenance schedule, increasing reliability and reducing unplanned downtime.
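A minimal sketch of that idea, assuming scikit-learn and a two-feature sensor window; the features, labels and thresholds are invented for illustration:

```python
# Toy predictive-maintenance model over vibration/voltage sensor features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy training data: [vibration_rms, voltage_ripple] per unit, 1 = failed soon.
X = np.array([[0.2, 0.01], [0.3, 0.02], [1.5, 0.12], [1.8, 0.15]])
y = np.array([0, 0, 1, 1])

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

def failure_risk(vibration_rms: float, voltage_ripple: float) -> float:
    """Probability the unit fails within the forecast horizon."""
    return float(model.predict_proba([[vibration_rms, voltage_ripple]])[0, 1])

print(failure_risk(1.6, 0.13))  # high-risk unit -> schedule maintenance
```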

Similarly, AI can place high-intensity workloads in cooler areas of a data center, preventing “thermal hotspots” and reducing the energy required for cooling.
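A greedy version of that placement logic might look like the following; the rack temperatures, job heat scores and heating estimate are illustrative assumptions.

```python
# Thermal-aware placement: assign the hottest jobs to the coolest racks.

def place_workloads(jobs: dict, rack_temps_c: dict) -> dict:
    """jobs: name -> expected heat output (kW); returns job -> rack assignment."""
    placement = {}
    for job in sorted(jobs, key=jobs.get, reverse=True):  # hottest jobs first
        coolest = min(rack_temps_c, key=rack_temps_c.get)
        placement[job] = coolest
        rack_temps_c[coolest] += jobs[job] * 0.5  # crude heating estimate
    return placement

print(place_workloads({"train_llm": 9.0, "batch_etl": 2.0},
                      {"rack_a": 24.0, "rack_b": 31.0}))
```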

Since security is of paramount importance in data centers, AI can also enhance physical and digital defenses by tracing network anomalies, such as suspicious traffic patterns or unauthorized access attempts. This allows threats to be neutralized in real time rather than merely reacted to.

Sustainability and the circular hardware economy: Beyond the linear lifecycle

Traditionally, enterprise servers had a lifecycle of 3-5 years. Now, with AI models being developed rapidly, AI hardware is refreshed every 12-18 months. This produces large amounts of “embodied carbon waste,” which isn’t environmentally sustainable.

Consequently, hardware and infrastructure engineers need to pivot to a circular hardware economy framework, where hardware is an “evolving asset”.

At the hardware level, modularity is paramount, so that operators can upgrade to high-performance accelerators while retaining the chassis, power delivery units and cooling manifolds. This significantly reduces the embodied carbon generated during raw material extraction and fabrication of the non-compute components.

AI can also help with decommissioning hardware. Intelligent systems can analyze telemetry from a server rack’s operational history to predict the remaining useful life of a chip or the components around it (such as power delivery units or cooling manifolds). Healthy units can then be redeployed to “edge” data centers for less intense inference tasks, whereas failing units can be routed to specialized facilities for recovery of valuable materials.

This is how we address the AI paradox: Use AI to mitigate the environmental footprint of the machines that run it. It ensures that the next generation of infrastructure is built not only for speed and performance, but also for sustainability.

Modern approaches, not conventional engineering

With the exponential growth of performance-critical AI models, AI has become a foundational requirement for data center infrastructure. Even though modern AI models increase total power and energy consumption, they can also act as a critical force for mitigating it. To keep up with the rise of AI models and the accompanying rise in power consumption, we need modern approaches rather than conventional engineering alone. The next generation of data center infrastructure will be defined by how well we harness AI-driven hardware design and automated construction to build AI-capable data centers.

By integrating AI at the silicon level and the structural level, we are not just building faster computers; we are building a more intelligent foundation for the future of technology.

Disclaimer: The views expressed in this article are solely those of the authors in a personal capacity and do not represent the views of their employers.

This article is published as part of the Foundry Expert Contributor Network.
Want to join?

Micro and macro agents: The emerging architecture of the agentic enterprise

April 14, 2026, 08:00

Artificial intelligence is entering a new phase. For the past decade, enterprises have focused primarily on predictive analytics and automation — using machine learning models to classify data, detect patterns and improve decision making. Today, a new paradigm is emerging: Agentic AI, systems capable of autonomously executing tasks and coordinating complex workflows.

Yet despite the rapid growth of AI agents, the term itself is often used loosely. Many organizations describe any AI-powered automation as an “agent,” even when it performs only a single function. As enterprises move toward large-scale deployment of autonomous systems, a clearer framework is needed to understand how these systems will be structured.

One useful way to think about the emerging architecture is through the distinction between micro agents and macro agents — two complementary layers that together form the foundation of the agentic enterprise.

The rise of micro agents

Most AI systems being deployed today can best be described as micro agents.

Micro agents are specialized AI systems designed to perform narrow, well-defined tasks within a workflow. They typically operate within existing applications and platforms, augmenting specific functions rather than managing entire processes.

Examples of micro agents are increasingly common across industries:

  • A document extraction agent that reads contracts or insurance policies
  • A fraud detection agent that analyzes transactional anomalies
  • A summarization agent that condenses large volumes of text
  • A classification agent that categorizes customer requests
  • A risk scoring agent that evaluates underwriting inputs

These agents are powerful because they combine machine learning models, large language models and automation tools to complete tasks that previously required human intervention.

In many ways, micro agents resemble AI-powered microservices. Each is optimized for a specific capability and integrated into a broader digital workflow.
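To make the microservice analogy concrete, a micro agent can be modeled as anything that exposes one narrow capability behind a shared interface. The class names and toy logic below are invented; real agents would call models instead of string rules.

```python
# Micro agents as AI-powered microservices behind a common interface.
from typing import Protocol

class MicroAgent(Protocol):
    name: str
    def run(self, payload: dict) -> dict: ...

class SummarizationAgent:
    name = "summarizer"
    def run(self, payload: dict) -> dict:
        text = payload["text"]
        # Stand-in for an LLM call; real agents would invoke a model here.
        return {"summary": text[:120] + ("..." if len(text) > 120 else "")}

class ClassificationAgent:
    name = "classifier"
    def run(self, payload: dict) -> dict:
        # Toy rule in place of a trained model.
        label = "complaint" if "refund" in payload["text"].lower() else "inquiry"
        return {"label": label}
```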

However, micro agents have an inherent limitation: They operate at the task level, not the workflow level.

The emergence of macro agents

The next stage in enterprise AI will be defined by the rise of macro agents.

Macro agents operate at a higher level of abstraction. Rather than performing a single task, they coordinate multiple micro agents to complete an end-to-end business process.

Macro agents are, therefore, goal-oriented systems. Their objective is not simply to perform an activity but to deliver an outcome.

This allows them to integrate seamlessly with systems that require real-time decisions and dynamic engagement.

Consider a typical insurance claims process. Traditionally, this workflow involves numerous steps:

  • First notice of loss intake
  • Document analysis
  • Damage assessment
  • Fraud detection
  • Coverage validation
  • Payment authorization

A macro agent could orchestrate each of these steps by coordinating specialized micro agents responsible for individual tasks. The macro agent would manage the workflow, evaluate outcomes and ensure the process is completed successfully.
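A schematic of that orchestration, with placeholder step functions standing in for real micro agents, might look like this; the confidence threshold and field names are assumptions:

```python
# A macro agent coordinating micro agents through the claims workflow above.

CLAIMS_PIPELINE = [
    "fnol_intake", "document_analysis", "damage_assessment",
    "fraud_detection", "coverage_validation", "payment_authorization",
]

def run_micro_agent(step: str, claim: dict) -> dict:
    raise NotImplementedError  # stand-in for a specialized micro agent

def macro_agent(claim: dict, escalate_to_human) -> dict:
    for step in CLAIMS_PIPELINE:
        result = run_micro_agent(step, claim)
        claim.update(result)
        # The macro agent evaluates outcomes, not just task completion.
        if result.get("confidence", 1.0) < 0.8 or result.get("flagged"):
            return escalate_to_human(step, claim)
    claim["status"] = "resolved"
    return claim
```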

This orchestration capability fundamentally changes the role of AI in enterprises. Instead of acting as a set of isolated tools, AI begins to function more like a coordinated digital workforce.

The key point is that macro agents are outcome-based, which is what businesses actually want.

The need for governance: Meta agents

As organizations deploy networks of interacting agents, another challenge quickly emerges: Governance.

The struggle for good AI governance is real, and many organizations deploying AI recognize the need for guardrails, but few have figured out how to build a mature governance system.

Autonomous systems that make decisions, coordinate tasks and execute actions must be monitored carefully to ensure they stay compliant, secure and aligned with business objectives.

This creates the need for a third layer in the agentic architecture: Meta agents.

Meta agents oversee and monitor other agents. Their responsibilities may include:

  • Monitoring risk and model behavior
  • Validating regulatory compliance
  • Auditing decision logic
  • Managing cost and resource consumption
  • Escalating decisions to human operators when necessary

In essence, meta agents serve as the governance layer of the agentic enterprise, ensuring that autonomy does not come at the expense of control.
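As a sketch of that governance layer, consider a meta agent that wraps every agent action with policy checks, audit logging and escalation; the budget limit and policy rules here are invented for illustration.

```python
# A meta agent as a governance wrapper: policy checks, audit log, escalation.
import json, time

AUDIT_LOG = []

def meta_agent(action: dict, monthly_spend_usd: float) -> str:
    decision = "allow"
    if monthly_spend_usd > 10_000:                      # cost guardrail
        decision = "block:budget_exceeded"
    elif action.get("touches_pii") and not action.get("pii_approved"):
        decision = "escalate:compliance_review"         # regulatory guardrail
    AUDIT_LOG.append({"ts": time.time(), "action": action, "decision": decision})
    return decision

print(meta_agent({"agent": "claims_macro", "touches_pii": True}, 4_200.0))
print(json.dumps(AUDIT_LOG[-1], default=str))
```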

The need for governance is critical, and meta agents will be key to balancing governance with innovation in the age of AI. According to Ian Ruffle, head of data and insight at UK breakdown specialist RAC, “Success is about having the right relationships and never trying to sweep issues under the carpet.”

The agentic enterprise stack

Together, these layers form what can be described as the agentic enterprise stack:

  • Meta agents: Governance and oversight. Monitoring, compliance and risk management across agent systems.
  • Macro agents: Workflow intelligence. Coordination of multi-step processes and delivery of business outcomes.
  • Micro agents: Task execution. Specialized systems responsible for discrete capabilities and actions.

This layered architecture reflects how large-scale AI systems will likely evolve. Instead of deploying isolated tools, enterprises will build interconnected ecosystems of agents, each operating at a different level of responsibility.

This framework could move today’s ERP platforms from systems of record to a new generation of systems of intelligence.

Where most companies are today

Despite growing interest in agentic AI, most organizations remain in the micro-agent stage.

Many AI initiatives focus on improving individual tasks — automating document processing, generating summaries, or assisting customer service representatives. These use cases deliver meaningful productivity gains, but they represent only the early phase of the agentic transformation.

The real shift will occur when enterprises begin to deploy macro agents capable of managing entire workflows, coordinating dozens of micro agents in the background.

At that point, AI moves beyond augmentation and begins to operate as an operational system for work itself.

Implications across industries

The emergence of agentic architectures will have profound implications across industries.

In financial services and insurance, macro agents could manage complex processes such as underwriting decisions, claims resolution and regulatory reporting.

In healthcare, macro agents may coordinate patient intake, diagnosis support and care management workflows.

In manufacturing and supply chains, agent systems could orchestrate procurement, logistics and production planning.

Across sectors, the defining shift will be the transition from AI tools that assist humans to AI systems that manage workflows autonomously while remaining governed by human oversight.

From automation to autonomous

The evolution from micro agents to macro agents represents more than a technological upgrade. It signals a fundamental shift in how organizations think about work.

Digital transformation modernized technology, while intelligent transformation modernizes the enterprise itself.

Ultimately, success will not be determined by who can showcase the most impressive agent, but by who can develop the most trustworthy agentic ecosystem — one that is secure by design, outcome-oriented and embraced by employees who feel empowered rather than displaced.

For decades, enterprise technology has focused on improving the efficiency of human tasks. Agentic systems instead aim to restructure how work itself is executed, distributing responsibilities across networks of autonomous systems.

In this emerging model, micro agents act as the specialized workers, macro agents serve as workflow managers and meta agents provide the governance and oversight required for responsible autonomy.

This approach moves organizations from a model where humans initiate AI agents to one where AI initiates AI agents, sometimes with a human overseeing the outcome.

Organizations that understand and design for this layered architecture, and are willing to redesign workflows and roles, will be best positioned to build the agentic enterprises of the future. Adopting this architecture is what will turn value creation into value realization.

This article is published as part of the Foundry Expert Contributor Network.
Want to join?
