May 22, 2026blog

Harness Design 1 — Context Engineering

Context is more than just the conversation history.

I’ve been building BenchClaw and TeachClaw, agent harnesses for, respectively, a general-purpose personal assistant in the style of OpenClaw and an academic teaching assistant that has to deal with untrusted student users. A shared goal in both is to deliver high-quality interaction using cheap models.

Both currently run on Gemma 4 E4B — a smartphone-grade model that costs cents per million tokens. A 24k-token context window is more than enough for what they do. They punch above their weight because of the harness around them.

Harness Engineering

The harness is the part that wraps around the model and lets it do things in the world. A model offers a very simple interface: given a conversation state, predict the next turn. Everything else — tools, memory, scheduling, context — is the harness.

Harness engineering is increasingly the difference between products that share the same underlying model. Cursor makes open-weight models punch well above their weight; Codex (until recently) hamstrung top-tier ones. Two of my favourite reads on this are Addy Osmani and Martin Fowler.

I’m going to write up some of my own experiments in a series of posts. This first one is about grooming and managing the agent context. It assumes you’re familiar with the context window, KV caching, and how hosted LLMs are priced.

Context

The context window is the agent’s short-term memory, measured in tokens. Naively, you’d populate it by appending every user message, every model turn, and every tool call and result as they happen — a literal transcript.

A small slice of an agent context
system12k tok

You are a coding agent. Tools available: Read, Edit, Bash. …

user

Fix the failing test in test_foo.py.

assistant

I’ll read the test first.

tool callRead

test_foo.py

tool resultRead180 tok
def test_foo():
    assert add(2, 3) == 5
assistant

add is returning a - b. One-character fix.

Two issues with treating context as a transcript. First, it fills up fast: a single grep can dump 20k tokens of unhelpful output into the conversation. Second, the model doesn’t actually need the full history to behave well on the next turn — it needs the right slice of it.

The reframing I want to propose is this: instead of the context being an accurate reflection of the conversation, treat it as synthetic. You can freely design the history to elicit the behaviour you want, as long as you keep it cache-friendly. Once you accept that framing, several techniques fall out naturally.

Elision and Cache-friendliness

The simplest such technique is elision: remove or replace earlier turns to keep the context small, especially turns involving tool responses or where the output is known to be out-of-date. There’s a ladder of how aggressive you can be:

  1. Trivial: remove the turn entirely.
  2. Simple: replace it with a one-line “elided” stub.
  3. Offloaded: the stub also points at where the full content can be reloaded from — elided, read from elided/abc.txt.
  4. Summarized: the stub also says what the agent did with that turn — elided; the agent read this file to confirm the import path was correct.
  5. Hybrid: keep some turns, discard others, based on length, ephemerality, recency, or some other signal.

I’m going to call the choice of how and when to do this the elision policy. It maps directly onto the bill because of the KV cache: if you elide a tool call ten thousand tokens ago, every token after it has to be recomputed on the next turn. That quickly adds up to a lot of money.

Full transcript (~30k tok)
After elision (~12k tok)
system12k tok

System prompt and tool definitions.

system12k tok

System prompt and tool definitions.

user

Find every place we call the legacy auth middleware.

user

Find every place we call the legacy auth middleware.

assistant

I’ll grep the repo.

assistant

I’ll grep the repo.

tool callBash

grep -r legacyAuth src/

tool call dropped
tool resultBash18k tok
src/api/users.ts:12:   import { legacyAuth } from "../mw"
src/api/orders.ts:8:   import { legacyAuth } from "../mw"
src/api/billing.ts:9:  import { legacyAuth } from "../mw"
... (240 more lines)
18k tokens, replaced with a one-line stub
assistant

14 files reference it across the API surface. The order routes look like the trickiest — they wrap legacyAuth in a custom handler. Want me to start there?

assistant

14 files reference it across the API surface. The order routes look like the trickiest — they wrap legacyAuth in a custom handler. Want me to start there?

user

Yes, do orders first.

user

Yes, do orders first.

The simplest policy elides any tool calls that occurred before the previous user message. This works fine for simple support chatbots, where tool calls are used immediately and output depends on short time horizons. It works poorly for coding agents, where the file you read a dozen turns ago might be critical to the edit you are about to make.

An intermediate solution is to run elision every n turns, each time eliding to some position before the latest turn. This bounds how often you blow the cache, at the cost of a larger recomputation bill when you do. There’s a tradeoff to make: eliding as close to the conversation head as possible is cache-friendly, but that risks dropping information the model would have used on its next turn. Get this wrong and the agent’s quality becomes jagged (“every five turns, it suddenly becomes dumb”).

A natural follow-up: if a file gets read multiple times, why not elide all but the latest read? In theory, great. In practice, the read is early in the conversation, so removing it invalidates the cache for everything that came after. You may end up doing this multiple times in a single session as the same file gets re-read, and pay the recomputation bill each time.

Task-specific policies open up more options: elide after fifteen minutes of inactivity, when the conversation topic shifts, after git commit, and so on. This is an area of active experimentation for me; the right policy is deeply task-specific and demands trial and error.

Conversation-head Injection

The other direction you can push context engineering is to add things the user didn’t say. Inject system time and time-since-last-message at the head of the conversation. Inject the user’s calendar for the next hour. Inject — in TeachClaw — the student’s name, current class, and which slide is currently being displayed.

A head-injected reminder
user

What files have I changed since main?

assistant

I’ll check git status.

tool callBash

git status

tool resultBash
 M src/api/orders.ts
 M src/api/billing.ts
?? src/api/notes.ts
system

system-reminder — three files staged for over an hour; consider whether the user wants a commit checkpoint before continuing.

Injected by the harness just before the next assistant turn, and removed at the next elision pass.

assistant

Two modified files and one new file. Want me to draft a commit message before we continue?

This costs you tokens per turn, but doesn’t fill the context window any faster: the injected message gets replaced by a newer version on the next user turn rather than accumulating.

There’s a conflict between output quality and cache efficiency here that needs separate handling. You want the injected message to be present in the right place — typically just before the next assistant turn, while it’s choosing what to do — but you don’t want every injection to invalidate the cache for everything after it. The cleanest approach I’ve found is to keep the head-injected message attached to the most recent user message and remove it at the next elision pass, when the boundary advances past it. It’s then re-injected at the new head. The cache survives because the injected message always sits at the moving conversation boundary, never deep in the cached prefix.

This is also a good place for proactive help from the harness, such as proposing memory files to read. The harness could run a naive RAG pass over the last few turns, propose a handful of memory tags at the head of the conversation, and let the agent decide whether to read any of them on the next turn. I haven’t built this just-in-time memory yet, but it’s next on the list.

Persona management

In TeachClaw, the agent has a persona — a writing style and focus that frames the discussion. For a business class, we have a CFO (risk, cost, cash flow) and a consultant (framing, stakeholder management). Students use these to look at the same issue from multiple perspectives: discuss a project with the consultant agent, then switch to the CFO to figure out how to cost it and pay for it.

The conversation context has to be preserved across persona switches — the value of the exercise is in continuity, not in starting fresh each time. But small models are poor at consistently following instructions over a long horizon, especially when those instructions arrived once, thousands of tokens ago.

Conversation-head injection handles this naturally. The current persona is named at the head (~5 tokens) with its focus and concerns spelled out in detail (~200 tokens), refreshed every user turn. That’s a 200-token tax per turn, but on a model costing cents per million tokens, it’s well worth paying for a consistent voice.

One twist: the previous conversational structure and tone anchors the LLM, so sometimes LLMs (especially the small ones we’re using) don’t respond to the change in the injected message. This is easily fixed by emitting a short system message (“persona changed to CFO”) which provides the LLM enough of a cue to regulate its output.

Compaction

When context runs out, we compact it to get more headroom. Compaction replaces a long chunk of the history with a much shorter representation — typically a single summary turn that stands in for everything before it.

Before compaction — 8 turns, ~32k tokens
After compaction — 3 turns, ~13k tokens
system12k tok

System prompt and tool definitions.

system12k tok

System prompt and tool definitions.

user

Plan a rewrite of the billing webhook handler.

assistant

I’ll start by reading the current handler and the tests.

tool callRead

src/webhooks/billing.ts, test/webhooks/billing.test.ts

tool resultRead14k tok
export async function handleBilling(req, res) { ... 600 lines ... }
assistant

Three retry paths, two of them duplicate the dedupe logic. The cleanest cut is to extract a withIdempotency wrapper and run each provider branch through it. I’ll sketch the diff.

summary~600 tok

Earlier this session: user asked for a rewrite plan for the billing webhook. After reading src/webhooks/billing.ts (600 lines) and its tests, the agent identified three retry paths, two of which duplicate dedupe logic. Proposed extracting a withIdempotency wrapper and running each provider branch through it.

user

Go ahead, but keep the Stripe path untouched for now.

user

Go ahead, but keep the Stripe path untouched for now.

Compaction via summary

This is what most existing coding tools do, and they all converge on roughly the same shape:

  1. Replace the conversation window with a no-frills summary, including orienting facts and goals.
  2. Re-inject any system state: a. Loaded skills. b. TODO list, if there is one. c. Recently run tools/read files

The summary itself is generated by the model, given the conversation so far and a prompt asking it to extract orienting facts. Done well, it’s nearly invisible. Done badly, the agent loses the thread and drifts off into absurdity.

Ongoing compaction via log

Summary-based compaction is noticeably slow — the model has to read the entire window and emit a summary before the next user turn can be served. So why not amortize the work? The idea is to compact-as-you-go by maintaining a running log of what the agent has done, and at compaction time, swap the trace for the log.

The naive version — ask the LLM to log each turn — doesn’t work well. Small models don’t reliably emit logs without being reminded every turn (which is itself a head-injection problem), and turn-by-turn logging doesn’t easily build into a coherent picture of the overall task.

A better approach, I think, is to run an ephemeral command every few turns: spin up a sub-context on top of the current conversation that asks the model to produce or update the log; record the log; throw the sub-context’s turns away. Pair this with an explicit task list as the agent’s goal system, so the log knows what it’s logging about. I haven’t built this yet — also next.


If anything here sounds suspiciously plausible-but-untested, that’s because much of it is. These are ideas I’m actively experimenting with, not finished and polished libraries ready for release.