From Context Bloat to Context Load

Why enterprise AI gets heavier as it gets more capable — and how a capability graph lets you assemble context as a runtime payload instead of a static config file.

Author: The skilder team
  • #context
  • #performance
  • #ai-agents
  • #architecture

Why your enterprise AI gets heavier as it gets more capable — and how to fix it structurally.

TL;DR — Your agent isn’t getting dumber. Your context is getting fatter.


Your agent was great at v1

Then came the second tool, the third guardrail, the fourth workflow script. Each addition made sense at the time. Nobody removed anything. By v3 your agent is carrying 8,000+ tokens of context before the user says a word — and it’s started ignoring instructions, costing more, and responding slower.

The instinct is to blame the model. But the model hasn’t changed. What has changed is the context.

The paradox of capability: the more you add to your agent, the less reliably it performs. This is not a model problem. It is a context architecture problem.


Three layers that grow without limit

Enterprise agents carry three distinct types of content — each reasonable on its own, collectively toxic when unmanaged:

Tools — API schemas, function definitions. A single tool with parameters and examples can run 200–400 tokens. 15 tools = up to 6,000 tokens before a word is spoken.

Skills — Behavioral instructions, compliance rules, persona guidelines. Written once, never pruned. They accumulate amendments like a legal contract.

Scripts — Chain-of-thought scaffolds, decision trees. Verbose by design. A triage script can consume 2,000+ tokens — most of it irrelevant to any given call.

At $3 per million tokens, 10K daily conversations, and 6 turns each, an 8,400-token system context costs roughly $1,500/day on its own, before a single user message is counted.
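The arithmetic behind that figure can be sketched as follows. This assumes the 8,400-token system context from the v3 example later in this article, that the full system context is resent on every turn, and that no prompt caching discount applies. The function name is illustrative, not part of any product API.

```python
def daily_system_context_cost(
    system_tokens: int,
    conversations_per_day: int,
    turns_per_conversation: int,
    usd_per_million_tokens: float,
) -> float:
    """Cost in USD per day of resending the system context on every turn."""
    calls_per_day = conversations_per_day * turns_per_conversation
    tokens_per_day = system_tokens * calls_per_day
    return tokens_per_day / 1_000_000 * usd_per_million_tokens


# Assumption: an 8,400-token system context, resent on each of 6 turns.
cost = daily_system_context_cost(8_400, 10_000, 6, 3.0)
print(f"${cost:,.0f}/day")  # $1,512/day
```

Swapping in your own token count and price shows how quickly the fixed overhead scales with call volume.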


Bloat hurts in four distinct ways

  • ~2× more instruction-following failures as context grows
  • 18–24% latency increase going from 1,400 to 8,400 tokens
  • $440K yearly cost delta at 10K conversations/day

From our own deployments across 3 enterprise pilots (Nov 2024 – Feb 2025). Directional signal, not industry benchmarks.

Beyond cost and latency: the model starts making silent tradeoffs between instructions — which ones to follow, which to partially ignore — in ways nobody can predict or explain. We saw this as a consistent rise in escalations and output review flags once system prompts crossed 6,000 tokens.

On bigger context windows: reaching for a model with a 500K-token window doesn’t fix bloat. It just makes bloat more expensive. Larger windows raise the ceiling — they don’t fix the discipline problem of what should actually be loaded.


Stop dumping. Start loading.

Most teams treat the context window like a configuration file: fill it with everything the agent might need and leave it static. This is the bloat mindset.

The alternative: treat context as a runtime payload. Assembled fresh for each call. Containing only what this agent, handling this task, in this turn, actually needs.

The load principle: every token in your context should be able to answer — why is this here, for this call, right now? If the answer is “it might be useful,” that token is probably bloat.
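One way to make the load principle concrete is to attach a rule and a stated reason to every block of context, so the assembled payload can always answer "why is this here?". The sketch below is a minimal illustration under that assumption. The `Call`, `ContextBlock`, and `assemble` names and the example blocks are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Call:
    """Hypothetical description of one model call."""
    role: str
    task: str


@dataclass(frozen=True)
class ContextBlock:
    """A piece of context plus the rule that justifies loading it."""
    name: str
    text: str
    reason: str
    applies: Callable[[Call], bool]


def assemble(blocks: list[ContextBlock], call: Call) -> tuple[str, dict[str, str]]:
    """Return the payload and, for each loaded block, why it was loaded."""
    loaded = [b for b in blocks if b.applies(call)]
    payload = "\n\n".join(b.text for b in loaded)
    reasons = {b.name: b.reason for b in loaded}
    return payload, reasons


# Illustrative blocks only; the names and texts are made up.
BLOCKS = [
    ContextBlock("refund_policy", "Refunds over $500 need approval.",
                 "task is a refund", lambda c: c.task == "refund"),
    ContextBlock("triage_script", "Step 1: classify severity...",
                 "task is triage", lambda c: c.task == "triage"),
    ContextBlock("tone_guide", "Be concise and polite.",
                 "role is customer-facing", lambda c: c.role == "support"),
]

payload, reasons = assemble(BLOCKS, Call(role="support", task="refund"))
print(reasons)
```

For a support agent handling a refund, the triage script never enters the payload, and every block that does has a recorded reason that can be logged alongside the call.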


Decouple capabilities from agents

Patching individual agents doesn’t scale. As soon as you have 6, 10, 15 agents, you have n agents each solving the same context problem independently — with no shared governance, no shared versioning, and days of lag every time a tool or compliance rule changes.

The structural answer: agents should not own their capabilities. Tools, skills, and scripts should be defined centrally, versioned independently, and assembled into context dynamically at runtime based on role and task.

In practice this means a capability graph — nodes are tools, skills, and scripts; edges define their relationships. An agent declares a role. A resolution engine traverses the graph and loads exactly the right bundle for this call.
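A resolution engine of this kind could look roughly like the sketch below. It is not the actual implementation: the node names, token sizes, edges, and the mapping from role and task to entry points are all invented for illustration, and the traversal is a plain depth-first walk over "requires" edges.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A capability: a tool, a skill, or a script."""
    name: str
    kind: str      # "tool", "skill", or "script"
    tokens: int    # approximate size of this node's context
    requires: list[str] = field(default_factory=list)  # outgoing edges


# Hypothetical graph: names, sizes, and edges are made up for illustration.
GRAPH = {n.name: n for n in [
    Node("refund_script", "script", 600,
         requires=["lookup_order", "issue_refund", "refund_policy"]),
    Node("lookup_order", "tool", 300),
    Node("issue_refund", "tool", 350, requires=["refund_policy"]),
    Node("refund_policy", "skill", 400),
    Node("search_kb", "tool", 250),
    Node("triage_script", "script", 2000, requires=["search_kb"]),
]}

# Hypothetical mapping from (role, task) to entry-point nodes.
ENTRY_POINTS = {
    ("support", "refund"): ["refund_script"],
    ("support", "triage"): ["triage_script"],
}


def resolve(role: str, task: str) -> list[Node]:
    """Depth-first traversal from the entry points; each node is loaded once."""
    seen: set[str] = set()
    bundle: list[Node] = []

    def visit(name: str) -> None:
        if name in seen:
            return
        seen.add(name)
        node = GRAPH[name]
        for dep in node.requires:
            visit(dep)
        bundle.append(node)  # dependencies come before the node that needs them

    for entry in ENTRY_POINTS.get((role, task), []):
        visit(entry)
    return bundle


bundle = resolve("support", "refund")
print([n.name for n in bundle], sum(n.tokens for n in bundle))
```

Because the graph is defined once and shared, changing `refund_policy` updates every role that reaches it through an edge, without touching any agent definition.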

Before: 12 tools + 6 skill blocks + 2 scripts = 8,400 tokens per call.

After: 3 tools + 1 skill block + 1 script = 2,000 tokens per call. A 76% reduction — consistent with our first pilot migration.


The model landscape pushes both ways

Frontier models — Context windows keep growing, but premium models still charge premium per-token prices. Moving a bloated 8,400-token prompt to a frontier model doesn’t solve bloat. It just runs the same bloat on more expensive infrastructure.

Small language models — SLMs (Phi-4, Mistral Small, Llama 3.2) are rising fast for on-premise, cost-sensitive, and data-sovereign deployments. These deployments often run with tight context budgets, commonly in the 4K–32K range even when the model itself supports more, which makes capability architecture mandatory rather than optional.

The capability graph is a model-agnostic abstraction layer. It doesn’t care if the underlying model has 8K or 800K tokens. It assembles the right payload and hands it to whatever model the deployment requires — the same architecture for an on-premise SLM at a regulated bank and a frontier model handling complex enterprise workflows.


Six questions to check your context health

  • Injection ratio: what proportion of your system prompt is injected on every call vs. conditionally? If most of it is injected unconditionally, you have a bloat risk.
  • Tool utilization: what share of your registered tools are actually called? Below 40% is tool sprawl.
  • Instruction age: when did you last review each block? Anything untouched for 90+ days likely has redundancy or contradiction.
  • Removal cadence: when did you last remove something? Teams that only add are accumulating by definition.
  • Capability ownership: if a tool’s API changes, how many agent definitions need updating? More than three is a governance problem.
  • Context observability: can you see the exact payload sent on each call in production? If not, you can’t audit, optimize, or govern.
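The first two checks can be computed directly from production logs once the exact payload per call is observable. The sketch below assumes a hypothetical log record that lists which context blocks were loaded and which tools the model actually invoked. The record shape, function names, and sample data are illustrative.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    """Hypothetical log entry for one production call."""
    blocks_loaded: set[str]  # names of context blocks in the payload
    tools_called: set[str]   # names of tools the model actually invoked


def injection_ratio(block_tokens: dict[str, int], calls: list[CallRecord]) -> float:
    """Share of system-prompt tokens that appear in every logged call."""
    total = sum(block_tokens.values())
    if not calls or total == 0:
        return 0.0
    always = set(block_tokens)
    for call in calls:
        always &= call.blocks_loaded
    return sum(block_tokens[b] for b in always) / total


def tool_utilization(registered: set[str], calls: list[CallRecord]) -> float:
    """Share of registered tools called at least once in the log."""
    if not registered:
        return 0.0
    used = set().union(*(c.tools_called for c in calls)) & registered
    return len(used) / len(registered)


# Made-up sample data for illustration.
block_tokens = {"persona": 500, "tone": 100, "refund_policy": 400, "triage_script": 2000}
registered = {"lookup_order", "search_kb", "issue_refund",
              "send_email", "create_ticket", "update_crm"}
calls = [
    CallRecord({"persona", "tone", "refund_policy"}, {"lookup_order"}),
    CallRecord({"persona", "tone", "triage_script"}, {"search_kb"}),
]

print(f"injection ratio: {injection_ratio(block_tokens, calls):.0%}")
print(f"tool utilization: {tool_utilization(registered, calls):.0%}")
```

In this sample only the persona and tone blocks are injected on every call, and two of six registered tools are ever used, which would fall below the 40% threshold above.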

A well-scoped agent role should carry 2,000–3,500 tokens of system context for a focused task. In agents we audited before migration, the median was 7,800 tokens.


Context bloat is a symptom of treating context as a configuration file. Context load is what happens when you treat it as engineering.
