
Agent Orchestration for AI Engineers: Harnesses, Context, and the Layer Most Teams Barely Think About

Enterprise AI agent failures span several layers: task decomposition, tool reliability, retrieval quality, model reasoning, evals, permissions. But a recurring and underestimated contributor is organizational context. Every agent stack has three well-understood layers: model, harness, and user/team context. BehaviorGraph adds the fourth, the one that tells an agent who actually owns a decision, who is trusted to approve it, and how work really moves through the company.

When enterprise AI deployments fall short, failures often span several layers at once: weak task decomposition, tool unreliability, retrieval gaps, model reasoning limits, poor evaluation coverage, permissions design. Organizational context is one significant contributor among those, and one that is consistently underestimated. Understanding why requires thinking clearly about what each layer of the stack can and cannot know.

Practitioners who build agentic systems have converged on a useful taxonomy: model, harness, and context. Each layer can improve independently, and each one has a different update cost, speed, and failure mode. Harrison Chase, Co-Founder and CEO of LangChain, articulated this recently:

Harrison Chase · Co-Founder & CEO, LangChain · April 2026

"Most discussions of continual learning in AI focus on one thing: updating model weights. But for AI agents, learning can happen at three distinct layers: the model, the harness, and the context. Understanding the difference changes how you think about building systems that improve over time."

The taxonomy maps cleanly to what we see in production enterprise deployments. And it surfaces a gap that most agent frameworks have not addressed: context at the organizational level is categorically different from context at the user or agent level, and it is the layer that most often determines whether an agent actually behaves correctly inside a real company.

The three layers of enterprise agent orchestration

Chase defines the three layers as follows. The model is the weights themselves: Claude, GPT-4o, Gemini. The harness is everything that is always on around the model: every piece of code, configuration, and execution logic that is not the model itself, including prompts, tools, MCPs, orchestration logic, memory, state, hooks, and the inner tool-calling loop. LangChain's framing makes this explicit: the harness is what operationalizes the model for every instance of the agent. The context is configuration that lives outside the harness and customizes it per tenant: instructions, skills, and memory that are not always-on but get pulled in based on who is running the agent.

Figure 1: The three layers of an agentic system
Context
Configuration that lives outside the harness. Instructions, skills, memory. Customizes behavior per tenant (agent, user, team, org). e.g. CLAUDE.md, /skills, mcp.json, SOUL.md, and org-level behavioral graphs.
Harness
Code that drives every instance of the agent. Prompts, tools, state, hooks, the tool-calling loop. Always on. e.g. Claude Code, DeepAgents, Pi, Codex, Droid.
Model
The weights. e.g. Claude Sonnet, GPT-5.4, Gemini, GLM5. Updated via SFT, RL (GRPO), fine-tuning.

Each layer can improve independently. Harness updates do not require retraining the model. Context updates do not require changing the harness.
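The separation is easier to see in code. The sketch below is illustrative only: `Context`, `load_context`, and `run_agent` are hypothetical names, not any framework's real API. The point is the boundary, namely that the harness code is identical for every tenant while the context is selected per tenant at run time, and the model is just a callable.

```python
# Minimal sketch of the model / harness / context separation.
# All names here are illustrative, not a real framework API.
from dataclasses import dataclass, field

@dataclass
class Context:
    """Per-tenant configuration pulled in at run time (not always-on)."""
    instructions: str = ""
    skills: list[str] = field(default_factory=list)
    memory: dict = field(default_factory=dict)

def load_context(tenant_id: str, registry: dict) -> Context:
    # The context layer: chosen based on who is running the agent.
    return registry.get(tenant_id, Context())

def run_agent(model_call, tools: dict, ctx: Context, task: str) -> str:
    # The harness: always-on code that drives every instance of the agent.
    prompt = f"{ctx.instructions}\nTask: {task}"
    action = model_call(prompt)      # the model: just the weights
    if action in tools:              # minimal tool-calling loop
        return tools[action]()
    return action

registry = {"acme": Context(instructions="Prefer concise answers.")}
ctx = load_context("acme", registry)
result = run_agent(lambda p: "lookup", {"lookup": lambda: "42"}, ctx,
                   "find the answer")
# result == "42"
```

Swapping the model, editing the harness loop, and changing a tenant's `Context` are three independent operations with three different costs, which is exactly the property the taxonomy is meant to capture.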
Erik Schluntz & Barry Zhang · Members of Technical Staff, Anthropic · Building Effective Agents · December 2024

"Workflows are systems where LLMs and tools are orchestrated through predefined code paths. Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage... The most successful implementations weren't using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns."

Chase maps real products to this taxonomy clearly. Claude Code: model is claude-sonnet, harness is Claude Code itself, user context is CLAUDE.md, /skills, mcp.json. OpenClaw: model is many, harness is Pi plus scaffolding, agent context is SOUL.md and skills from ClawHub. You can identify similar layers across many agent stacks, including LangGraph, CrewAI, and OpenAI Agents SDK, though the exact boundaries differ by framework. LangGraph, for instance, positions itself primarily as a low-level orchestration runtime for durable execution and human-in-the-loop workflows, not a harness in the narrow sense. The value of the taxonomy is the conceptual separation, not a precise mapping onto any single framework.

How learning works at each layer

Each layer has its own improvement mechanism, its own speed, and its own failure modes.

Model layer: SFT, RL (GRPO), fine-tuning. Slow and expensive. The central challenge is catastrophic forgetting: updating on new data tends to degrade performance on things the model previously knew. High ceiling of impact, but the highest cost and the least human-inspectable process.

Harness layer: The Meta-Harness paper (Lee et al., 2026) demonstrates this approach concretely. The pattern: run the agent over a set of tasks, evaluate results, store all execution traces to a filesystem, then run a coding agent over those traces to propose changes to the harness code. The harness improves through its own traces. Medium cost, medium speed. Usually done at the agent level, meaning one harness for all users, though you could in principle learn per-user harness variants.

Lee, Nair, Zhang, Lee, Khattab (incoming MIT professor, creator of DSPy), Finn (Stanford professor, co-founder of Physical Intelligence) · Meta-Harness: End-to-End Optimization of Model Harnesses · arXiv:2603.28052 · 2026

"The performance of LLM systems depends not only on model weights, but also on their harness: the code that determines what information to store, retrieve, and present to the model."

On retrieval-augmented math reasoning, a single discovered harness improves accuracy by +4.7 points on average across five held-out models while using 4x fewer context tokens. The paper supports the setup here (that harness code is a first-class optimization target), not the org-context conclusion that follows.
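The loop the paper describes can be sketched in a few lines. This is a toy rendering of the pattern, not the paper's implementation: the evaluator, the trace schema, and `propose_harness_edits` are stand-ins, and in the real setup the second step is a coding agent that reads the traces and proposes changes to the harness code itself.

```python
# Toy sketch of the trace-driven harness-improvement loop:
# run tasks, persist traces, then surface failures for a coding agent.
import json
import pathlib
import tempfile

def run_and_trace(agent, tasks, trace_dir: pathlib.Path) -> float:
    """Run the agent over tasks, score results, persist every trace."""
    trace_dir.mkdir(parents=True, exist_ok=True)
    scores = []
    for i, task in enumerate(tasks):
        output = agent(task["input"])
        score = float(output == task["expected"])  # stand-in evaluator
        payload = {"input": task["input"], "output": output, "score": score}
        (trace_dir / f"trace_{i}.json").write_text(json.dumps(payload))
        scores.append(score)
    return sum(scores) / len(scores)

def propose_harness_edits(trace_dir: pathlib.Path) -> list:
    """In the real pattern a coding agent reads the traces and edits the
    harness code; here we only surface failing traces as candidates."""
    return [json.loads(f.read_text())
            for f in sorted(trace_dir.glob("trace_*.json"))
            if json.loads(f.read_text())["score"] < 1.0]

with tempfile.TemporaryDirectory() as d:
    traces = pathlib.Path(d) / "traces"
    echo_agent = lambda x: x.upper()  # toy "agent"
    tasks = [{"input": "ok", "expected": "OK"},
             {"input": "fail", "expected": "nope"}]
    accuracy = run_and_trace(echo_agent, tasks, traces)
    candidates = propose_harness_edits(traces)
# accuracy == 0.5; one failing trace surfaced for the coding agent
```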

Context layer: This is where the most flexibility lives. Context updates are low cost, fast, and human-inspectable. They can happen offline, where a batch job runs over recent traces, extracts insights, and updates configuration (what OpenClaw calls "dreaming"), or in the hot path, where the agent updates its own memory as it runs. And critically, they can happen at multiple levels of granularity: per agent, per user, per team, per org.
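The two update modes can be sketched side by side. Both function names and the trace schema below are assumptions for illustration; the "dreaming" term is OpenClaw's, the code is not.

```python
# Sketch of the two context-update modes: a hot-path memory write during
# a run, and an offline batch pass that distills recent traces into
# configuration. Schema and names are illustrative.
from collections import Counter

def hot_path_update(memory: dict, key: str, value: str) -> None:
    # In-run memory write: cheap, immediate, human-inspectable.
    memory[key] = value

def dream(traces: list, config: dict, min_count: int = 2) -> dict:
    # Offline batch pass: promote recurring resolved routings from
    # recent traces into per-tenant configuration.
    routed = Counter(t["routed_to"] for t in traces if t.get("resolved"))
    preferred = config.setdefault("preferred_routes", {})
    for person, n in routed.items():
        if n >= min_count:
            preferred[person] = n
    return config

traces = [{"routed_to": "jordan", "resolved": True},
          {"routed_to": "jordan", "resolved": True},
          {"routed_to": "pat", "resolved": False}]
config = dream(traces, {})
# config["preferred_routes"] == {"jordan": 2}
```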

Andrej Karpathy · Founder, Eureka Labs; former Director of AI at Tesla, founding team at OpenAI · X / Twitter · June 25, 2025

"Context engineering is the delicate art and science of filling the context window with just the right information for the next step. When in every industrial-strength LLM app, context engineering is the core discipline — not prompt engineering."

Simon Willison · Co-creator of Django, creator of Datasette · simonwillison.net · Context Engineering · June 27, 2025

"Context engineering is the delicate art and science of filling the context window with just the right information for the next step... Doing this well involves task descriptions and explanations, few shot examples, RAG, related data, tools, state and history, compacting — doing this well is highly non-trivial."

Willison frames context engineering as the core discipline of production AI systems, covering everything that surrounds the prompt: goals, constraints, tools, memory, and retrieved knowledge.
Figure 2: Comparing the four layers of an agentic system across key dimensions
Dimension · Model · Harness · User / Team Context · Org Context (BehaviorGraph)
Form factor · Model weights · Code · Config files (agent, user, team) · Dynamic behavioral graph
Level of granularity · Agent · Agent · Agent, user, org, team · Org-wide, continuously updated
Cost to update · High · Medium · Low · Low
Speed to update · Slow · Medium · Fast · Continuous
Human inspectable · No · Yes · Yes · Yes
Ceiling of impact · Highest · High · Medium · High (routing correctness)
Update pattern · Batch offline · Batch offline job · Batch offline; hot path · Continuous signal ingestion + batch graph refresh
What it teaches the agent · General reasoning · How to run reliably · User/team preferences & skills · Who to route to, who is trusted, how decisions actually move

The table above lays out four layers. The first three (model, harness, and user/team context) are what most agent stacks already address. The fourth column is what the rest of this piece is about: why org-level context is categorically different from user context, why existing tools cannot substitute for it, and what it actually takes to build it.

The missing granularity: org-level context

Most agent stacks recognize three granularities at which context can apply: agent level, user level, and team level. Products like Hex's Context Studio, Decagon's Duet, and Sierra's Explorer do this well.

But in enterprise deployments, there is a fourth granularity that is categorically different from the others: organizational context. Not "what does this user prefer?" but "how does work actually route inside this org, who is trusted for what type of decision, which approval paths are real versus nominal, and when is a bottleneck forming?"

This is not information you can extract from user prompts, CLAUDE.md files, or even traces of individual agent runs. It is a property of the organization as a whole, and it changes continuously as teams restructure, people change roles, and projects shift priority. It requires a different kind of context layer.

Consider how much a well-integrated enterprise AI system can learn about a person. It can read their emails, infer their writing style, understand their background and domain expertise, observe their communication patterns, and build a detailed profile of who they are and how they work. With Slack and email access it can map who they exchange messages with most often. But there is a hard limit on what that picture can tell you.

People do not write "I don't trust Jordan's judgment on vendor contracts" in their work email. They do not tell the company AI model that they think their manager's approval is a rubber stamp. They do not explain in Slack that the person listed as team lead has not actually made a call in six months. The things that matter most about professional relationships, including peer trust, informal authority, and who someone actually goes to when they need something unblocked, almost never surface as explicit statements in any tool.

So even a model that knows a person extremely well does not know how that person is perceived by their peers, which of their relationships carry real weight, or where they sit in the org's actual decision network. You might know their communication network from email and Slack metadata. But knowing who someone exchanges messages with is not the same as knowing who they rely on, who trusts their judgment, or who can move something forward by picking up the phone. That is a different layer of signal, one that can only be inferred from behavioral patterns across the whole organization, not reconstructed from any single person's digital exhaust.

Knowing a person's communication network is not the same as knowing their relationships. The difference is the entire gap between user context and org context.

Why existing tools don't fill this gap

Engineers who have worked in enterprise software will recognize the problem and immediately reach for a familiar solution: role-based access control. Workday defines who can approve a purchase order. ServiceNow defines who can resolve a ticket. GitHub defines who can merge to main. Okta defines who can access which system. Every mature enterprise tool has a permissions model. Engineers know how to build these. Why isn't that enough?

Because RBAC answers a different question. Permissions define who is allowed to do something. Org context is about who actually does it. The two are not the same thing, and they often diverge over time, especially after re-orgs, role changes, and project transitions.

The formal approver for budget requests is the VP of Finance. The actual approver, the person whose reply unblocks things, is her chief of staff. That is not in Workday. The person with merge rights on the core auth service is a staff engineer who left six months ago and whose account was never deprovisioned. The team listed as owners of the data pipeline have not touched it in a year; the two engineers who actually maintain it are in a different org unit. These gaps are not edge cases. They are the steady-state of how organizations work.

Hard-coded permissions also do not scale for a second reason: the org changes faster than anyone updates the rules. Re-orgs, role changes, new hires, departures, project transitions: each one creates drift between the permissions model and organizational reality. Human teams absorb this drift through informal knowledge: people know to ask Marcus even though his title says something else. An AI agent has no such informal network unless you give it one.

Figure 6: Hard-coded permissions vs. behavioral org context
Dimension · RBAC / Static permissions · Behavioral org context
What it captures · Who is allowed to act · Who actually acts, and how
Defined by · Admin configuration at setup time · Observed behavioral signals, continuously
Reflects re-orgs · Only if admin updates it · Automatically, via signals
Captures informal authority · No · Yes: trusted deputies, shadow experts
Works for agent-to-agent routing · Not designed for it · Yes, queryable at runtime
Scales with org complexity · Role explosion, thousands of fine-grained rules · Graph-based, scales with signal volume
Useful for compliance · Yes, authoritative for access control · Yes, and complementary to it

To be clear: RBAC and static permissions are not wrong. They are the right tool for access control, compliance, and security boundaries, and they should stay. The problem is using them as a substitute for organizational context when routing decisions. They answer "is this permitted?" not "is this the right path?"
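The complementary relationship can be made concrete. In the sketch below, the RBAC table answers "is this permitted?" and the org-context lookup answers "is this the right path?"; a dict lookup stands in for a real graph query, and all names are hypothetical.

```python
# Sketch of composing the two questions: RBAC gates who MAY approve,
# org context ranks who ACTUALLY approves. Dict lookups stand in for
# a permissions system and a behavioral-graph query; names are illustrative.
def choose_approver(request_type: str, rbac: dict, org_context: dict) -> str:
    permitted = set(rbac.get(request_type, []))   # formal approvers
    observed = org_context.get(request_type, [])  # ranked by observed behavior
    for person in observed:
        if person in permitted:  # prefer someone who both acts and may act
            return person
    if permitted:
        return sorted(permitted)[0]  # fall back to the formal path
    raise LookupError(f"no permitted approver for {request_type}")

rbac = {"budget": ["vp_finance", "chief_of_staff"]}
org_context = {"budget": ["chief_of_staff", "vp_finance"]}  # who actually replies
# choose_approver("budget", rbac, org_context) -> "chief_of_staff"
```

Note that RBAC still has the final word: the behavioral ranking can only reorder candidates within the permitted set, never route around it.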

How AI accelerates the problem

Even if you have good org context today, autonomous agents introduce failure modes that either do not exist in the same form for human workers, or exist in humans but operate at a scale and speed that makes them far more dangerous when AI is involved.

Agent drift. An agent's behavior shifts over time as it accumulates context, patterns, and implicit reinforcement from past interactions. A human employee who learns "always escalate billing disputes to Jordan" carries that forward; if Jordan leaves, a good manager corrects the mental model. An agent left unmonitored keeps routing to Jordan's old account, or to whoever the system now maps that label to. Drift in human organizations is a management problem. Drift in agent organizations is a reliability and governance problem that compounds at machine speed.

Context cross-contamination. In multi-tenant or multi-user agent deployments, context from one user's session or one team's workflow can bleed into another's. An agent that handled a sensitive restructuring query for one executive should not carry implicit priors about that team's authority structure into the next user's completely unrelated request. This is a well-known technical problem in shared-context systems, but it has an organizational dimension that is often missed: the contamination is not just data leaking, it is organizational assumptions leaking.

Stale authority maps. The most common failure. An agent was trained or configured when Sarah was the approver for vendor contracts. Sarah was promoted six months ago. The agent still routes to Sarah. Sarah's approval queue is a graveyard of stalled requests that no human would have let pile up because everyone informally knows to go to David now. AI does not know that unless the org context layer is continuously updated.

None of these are novel human problems. Organizations have dealt with outdated process documentation, siloed knowledge, misrouted escalations, and authority confusion for as long as organizations have existed. What AI does is accelerate all of it. A human making a wrong routing decision makes it once, gets corrected, adjusts. An agent making the same wrong routing decision makes it ten thousand times before anyone notices the pattern in the logs.
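One common mitigation for the stale-authority-map failure is time decay: discount older signals so that recent behavior outweighs stale configuration. The half-life value and signal shape below are illustrative assumptions, not BehaviorGraph's actual model.

```python
# Sketch of catching a stale authority map with time-decayed signal counts.
# With a 90-day half-life (an illustrative choice), two recent approvals by
# David outweigh three year-old approvals by Sarah.
import math

def decayed_weight(days_ago: float, half_life_days: float = 90.0) -> float:
    # Exponential decay: weight halves every `half_life_days`.
    return math.exp(-math.log(2) * days_ago / half_life_days)

def current_approver(approvals: list) -> str:
    # approvals: (person, days_ago) pairs from observed approval sequences.
    scores = {}
    for person, days_ago in approvals:
        scores[person] = scores.get(person, 0.0) + decayed_weight(days_ago)
    return max(scores, key=scores.get)

history = [("sarah", 400), ("sarah", 380), ("sarah", 360),
           ("david", 30), ("david", 10)]
# current_approver(history) -> "david": recent behavior beats stale config
```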

Why organizational behavior science matters for enterprise AI

The informal organization, meaning the trust networks, influence patterns, and real decision paths that sit beneath the formal org chart, has been documented and studied for decades. It does not appear in job descriptions or org charts. It emerges from repeated interaction, shared experience, and social trust. This is the core insight behind Organizational Network Analysis (ONA), a field with roots in the 1970s–80s. Organizations that ignore the informal network make systematically wrong decisions. AI systems that cannot see it make those wrong decisions at scale.

There is a body of knowledge that has spent fifty years studying exactly this problem: how authority flows, how trust forms, how information actually moves through groups, why some people become informal hubs regardless of their title, and what happens when organizations change faster than people's mental models of them. It goes by several names: organizational behavior, organizational network analysis, knowledge management, and change management. These fields produced rigorous methods for measuring and mapping the informal org. The three-tier model of organizational data (structural, transactional, behavioral) is one way to frame what these fields have contributed; The Layer Every Enterprise AI Platform Is Missing unpacks that framework and explains why Tier 3 is the layer every AI category keeps hitting the wall on.

That field exists for a reason. Every large organization has discovered, usually the hard way, that the formal structure and the operational reality are two different systems running in parallel. Decisions that optimize for the formal structure and ignore the informal one tend to fail. The org chart tells you the reporting lines. The behavioral graph tells you how things actually get done.

Enterprise AI is now rediscovering this lesson at scale. The difference is that the tools to observe and map organizational behavior have improved dramatically, and the downstream system that needs this context, the AI agent stack, has a much lower tolerance for ambiguity than a human workforce does. A human new hire takes three months to build an adequate mental model of how the org really works. An AI agent has no onboarding period. It acts immediately, on whatever context it has.

This is where the intersection of organizational science and AI engineering becomes practically important, not as an academic curiosity but as an infrastructure problem. People who have studied both how organizations actually behave and how AI systems are built are in a position to do something that neither discipline can do alone: teach enterprise AI to understand beyond documents. Not just what was written down, but how the organization actually moves.

The documents tell you what was decided. The behavioral graph tells you how decisions actually get made, who they go through, and whose judgment counts. Enterprise AI needs both.

ONA established that authority maps and influence patterns can be derived from observed communication behavior, not from org charts or job titles. BehaviorGraph applies that same logic to the agent stack: infer who actually owns and routes decisions by watching how work moves, then surface that as queryable context at runtime. Here is what that looks like in practice.

What org-level context contains

BehaviorGraph is a dynamic organizational knowledge and behavior graph: a continuously updated map of real routing behavior, trust relationships, authority patterns, and escalation paths across an enterprise. It operates on behavioral metadata only: collaboration patterns, calendar signals, approval sequences, response latency, workflow timestamps. No message content, no document text.

Two caveats worth stating clearly. First, behavioral metadata still carries organizational sensitivity; the metadata-only scope reduces content exposure, not governance requirements. Second, the graph is probabilistic: it infers operational reality from observed signals, not ground truth. Routing frequency is directional evidence, not certification of who should own a decision. What it offers is continuously updated organizational inference that is more useful to an agent than stale system metadata, and more honest than treating the org chart as current.

The signals it ingests, and what the graph learns from them:

Figure 3: How org context builds from behavioral signals
Collaboration patterns
who works with whom · cross-team ties · frequency · response latency
Approval sequences
real vs. nominal approver · who bypasses formal path · approval latency
Routing behavior
who gets escalated to · trusted deputies · bottleneck detection
Pulse signals
subject matter experts · peer-validated authority · availability
Workflow timestamps
stall detection · handoff latency · SLA patterns
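One way these signal families could combine is a weighted edge score per candidate. The weights and signal names below are assumptions chosen for illustration, not BehaviorGraph's actual scoring model; each signal is assumed to be normalized to [0, 1] upstream.

```python
# Illustrative sketch: fold the Figure 3 signal families into one routing
# weight, then rank candidates by it. Weights and names are assumptions.
def edge_weight(signals: dict) -> float:
    return (0.35 * signals.get("approval_rate", 0.0)       # approval sequences
            + 0.25 * signals.get("collab_frequency", 0.0)  # collaboration patterns
            + 0.20 * signals.get("escalation_share", 0.0)  # routing behavior
            + 0.10 * signals.get("peer_validation", 0.0)   # pulse signals
            + 0.10 * signals.get("handoff_speed", 0.0))    # workflow timestamps

def rank_candidates(candidates: dict) -> list:
    return sorted(candidates, key=lambda p: edge_weight(candidates[p]),
                  reverse=True)

candidates = {
    "formal_owner": {"approval_rate": 0.1, "collab_frequency": 0.2},
    "shadow_expert": {"approval_rate": 0.8, "collab_frequency": 0.9,
                      "escalation_share": 0.7, "peer_validation": 0.9},
}
# rank_candidates(candidates)[0] == "shadow_expert": behavior outranks title
```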

Where org context fits in the stack

A harness is the always-on execution layer around a model. BehaviorGraph adds a different kind of context: organizational reality at runtime.

The relationship between BehaviorGraph and the harness is not competitive. The harness runs the agent. BehaviorGraph answers the questions the harness cannot answer on its own at the moment of action: who owns this decision in practice, who is trusted to approve it, what path is shortest and legitimate, when should the agent stop and defer.

Each layer still does its job. MCP standardizes access to tools and data sources. RAG retrieves content. The harness (everything around the model that is always on) manages execution. The model reasons. Org context tells the agent what to do with all of that inside a real organization.

Figure 4: What each layer answers at the moment of action
RAG
What information is relevant?
Surfaces the right documents, records, and data. Retrieval is correct. The answer exists in the corpus.
MCP
What tools and data sources can I access?
Open standard for secure, two-way connections between AI systems and external tools or data. The USB-C layer for the agent stack.
Harness
How do I run this agent reliably, turn by turn?
Context management, error recovery, tool loop, state, hooks. Makes the agent operational.
Context
What does this user or team prefer?
user memory · team skills · agent SOUL.md
BehaviorGraph
Who should this go through? Who can actually approve it?
real expert vs. formal owner · authority in practice · trusted escalation path · when to defer

The failure mode when org context is missing

With a clear picture of what each layer does, it becomes easy to see what goes wrong when the org context layer is absent.

Deloitte · State of AI in the Enterprise · 2026 (survey conducted August–September 2025)

Just 34% of organizations are truly reimagining their business with AI. Deloitte's findings suggest the main bottlenecks are less about raw model quality and more about scaling, governance, skills, and operational integration, including the challenge of connecting AI systems to how work actually moves through the company.

The survey covers thousands of organizations globally. The pattern points at the org-context gap: enterprises have capable models but consistently struggle with governance, routing, and the operational layer that connects AI to real workflows.

Enterprise AI failures rarely trace to a single layer. Models reason poorly on edge cases, tools fail, retrievals miss, evals are thin. But a recurring pattern in enterprise deployments is context-poor execution about the organization itself. The agent retrieved the right document, identified the formal owner from system metadata, routed the task to that person, and the task stalled. The actual approver was different in practice. The trusted deputy was not in any system. The owner had quietly changed roles. We mapped this failure pattern in detail, including the six specific forms it takes across enterprise deployments, in A Large Behavior Model for AI Governance and Agent Orchestration Platforms. If the routing failure is specifically about RAG pipelines returning the right document but the wrong person, Your AI Agents Route to the Wrong Person goes deeper on that layer.

Figure 5: Routing with and without org context
Without org context
Request received
Doc retrieved (correct)
Formal owner from metadata
Routed to formal owner
Task stalls or is rejected
Human cleans up manually
Why it failed: actual approver was different • owner changed 3 months ago • trusted deputy not in any system • escalation path was socially wrong
With org context (BehaviorGraph)
Request received
Doc retrieved (correct)
Query BehaviorGraph for org context
Real approver + trusted path returned
Route, escalate, or defer safely
Task resolves
What it knew: ranked real experts • authority confirmed in practice • shortest trusted path • overload detection
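The "route, escalate, or defer safely" step in Figure 5 reduces to a small decision function. The response shape below (a dict with `real_approver`, `confidence`, `overloaded`, `trusted_escalation`) is an assumed schema standing in for whatever a real BehaviorGraph query returns over REST or MCP; the confidence floor is likewise illustrative.

```python
# Sketch of the route / escalate / defer decision from Figure 5.
# The response schema and confidence threshold are illustrative assumptions.
def decide(response: dict, confidence_floor: float = 0.7):
    approver = response.get("real_approver")
    confidence = response.get("confidence", 0.0)
    if approver and confidence >= confidence_floor \
            and not response.get("overloaded"):
        return ("route", approver)                     # high-confidence path
    if approver and response.get("trusted_escalation"):
        return ("escalate", response["trusted_escalation"])
    return ("defer", None)  # stop and ask a human rather than guess

# decide({"real_approver": "chief_of_staff", "confidence": 0.9})
#   -> ("route", "chief_of_staff")
```

The deliberate design choice in the sketch is the default branch: when the graph cannot answer confidently, the safe behavior is to defer, not to fall back to stale metadata.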

How org context stays current

BehaviorGraph's org context layer stays current the same way any well-designed context layer does: two update modes, different signal sources.

Hamel Husain · Independent Consultant, Parlance Labs; specialist in LLM evaluation · hamel.dev · Your AI Product Needs Evals · 2024

"Rigorous and systematic evaluation is the most important part of the whole system. A failure to create robust evaluation systems is the common root cause of unsuccessful LLM products... Success with AI hinges on how fast you can iterate."

Husain's emphasis on traces and fast iteration applies at all layers: the same execution logs that feed harness optimization can inform context updates and, eventually, fine-tuning. That synthesis is ours, not his explicit claim, but his framework makes it easy to see.

Behavioral signals from collaboration patterns, approval sequences, and workflow metadata arrive continuously. The graph is refreshed in batch to update authority maps, trust scores, and routing paths. The result is context that reflects how the organization actually works right now, not how the org chart said it worked six months ago, and not what a user typed into a CLAUDE.md file.

This is the distinction that makes org-level context different from user-level context. User context is explicit and authored. Org context is observed and inferred: it is the accumulated shape of how the organization actually behaves.

The next phase of enterprise AI is not just about better harnesses. It is about whether those harnesses can access organizational reality.

Add org-level context to your agent stack

Designed to integrate with agent stacks through REST, MCP, or retrieval enrichment. No message content or document bodies; metadata-only design reduces content exposure.

Talk to us →
Read next
Your AI Agents Route to the Wrong Person. Here's the Missing Layer.
How GraphRAG pipelines fail at the routing step, and what the behavioral context layer adds.
A Large Behavior Model for AI Governance and Agent Orchestration Platforms
The six enterprise AI failures (authority hallucination, org drift, context-less escalation, and more) and how behavioral context closes each one.
The Layer Every Enterprise AI Platform Is Missing
The three-tier model of organizational data and why every AI category keeps hitting the same wall at Tier 3.
Sources and voices cited in this piece
Harrison Chase · Co-Founder & CEO, LangChain
Erik Schluntz & Barry Zhang · Members of Technical Staff, Anthropic
Andrej Karpathy · Founder, Eureka Labs; former OpenAI & Tesla AI
Simon Willison · Co-creator of Django, creator of Datasette
Hamel Husain · Independent Consultant, Parlance Labs
Meta-Harness paper · Khattab (MIT / DSPy), Finn (Stanford), et al., arXiv 2026
Deloitte · State of AI in the Enterprise 2026
These individuals and organizations have not reviewed or endorsed BehaviorGraph. Citations are used for commentary on publicly available statements about enterprise AI.
References
  1. Chase, H. (2026). Continual learning in agentic systems: model, harness, and context layers. X / Twitter thread. Harrison Chase is Co-Founder & CEO of LangChain, the company behind the LangChain and LangGraph agent frameworks.
  2. Schluntz, E. & Zhang, B. (2024, December). Building Effective Agents. Anthropic Engineering Blog. Schluntz and Zhang are Members of Technical Staff at Anthropic; Schluntz leads work on tool use and computer use for Claude.
  3. Karpathy, A. (2025, June 25). On context engineering. X / Twitter. x.com/karpathy. Karpathy is Founder of Eureka Labs; former founding team at OpenAI and former Director of AI at Tesla.
  4. Willison, S. (2025, June 27). Context Engineering. simonwillison.net. Willison is co-creator of the Django web framework and creator of Datasette.
  5. Lee, Y., Nair, R., Zhang, Q., Lee, K., Khattab, O., & Finn, C. (2026). Meta-Harness: End-to-End Optimization of Model Harnesses. arXiv:2603.28052. Omar Khattab is an incoming Assistant Professor at MIT and creator of DSPy; Chelsea Finn is Associate Professor at Stanford and co-founder of Physical Intelligence.
  6. Husain, H. (2024). Your AI Product Needs Evals. hamel.dev. Husain is an independent consultant at Parlance Labs specializing in LLM evaluation and operationalization.
  7. Deloitte. (2026). State of AI in the Enterprise. Survey conducted August–September 2025. Deloitte US.