May 22, 2026•blog
I’ve been building BenchClaw and TeachClaw, agent harnesses for, respectively, a general-purpose personal assistant in the style of OpenClaw and an academic teaching assistant that has to deal with untrusted student users. A shared goal in both is to deliver high-quality interaction using cheap models.
Both currently run on Gemma 4 E4B — a smartphone-grade model that costs cents per million tokens. A 24k-token context window is more than enough for what they do. They punch above their weight because of the harness around them.
The harness is the part that wraps around the model and lets it do things in the world. A model offers a very simple interface: given a conversation state, predict the next turn. Everything else — tools, memory, scheduling, context — is the harness.
Harness engineering is increasingly the difference between products that share the same underlying model. Cursor makes open-weight models punch well above their weight; Codex (until recently) hamstrung top-tier ones. Two of my favourite reads on this are Addy Osmani and Martin Fowler.
I’m going to write up some of my own experiments in a series of posts. This first one is about grooming and managing the agent context. It assumes you’re familiar with the context window, KV caching, and how hosted LLMs are priced.
The context window is the agent’s short-term memory, measured in tokens. Naively, you’d populate it by appending every user message, every model turn, and every tool call and result as they happen — a literal transcript.
You are a coding agent. Tools available: Read, Edit, Bash. …
Fix the failing test in test_foo.py.
I’ll read the test first.
test_foo.py
def test_foo():
assert add(2, 3) == 5
add is returning a - b. One-character fix.
Two issues with treating context as a transcript. First, it fills up fast: a single grep can dump 20k tokens of unhelpful output into the conversation. Second, the model doesn’t actually need the full history to behave well on the next turn — it needs the right slice of it.
The reframing I want to propose is this: instead of the context being an accurate reflection of the conversation, treat it as synthetic. You can freely design the history to elicit the behaviour you want, as long as you keep it cache-friendly. Once you accept that framing, several techniques fall out naturally.
The simplest such technique is elision: remove or replace earlier turns to keep the context small, especially turns involving tool responses or where the output is known to be out-of-date. There’s a ladder of how aggressive you can be:
elided, read from elided/abc.txt.elided; the agent read this file to confirm the import path was correct.I’m going to call the choice of how and when to do this the elision policy. It maps directly onto the bill because of the KV cache: if you elide a tool call ten thousand tokens ago, every token after it has to be recomputed on the next turn. That quickly adds up to a lot of money.
System prompt and tool definitions.
System prompt and tool definitions.
Find every place we call the legacy auth middleware.
Find every place we call the legacy auth middleware.
I’ll grep the repo.
I’ll grep the repo.
grep -r legacyAuth src/
src/api/users.ts:12: import { legacyAuth } from "../mw"
src/api/orders.ts:8: import { legacyAuth } from "../mw"
src/api/billing.ts:9: import { legacyAuth } from "../mw"
... (240 more lines)
14 files reference it across the API surface. The order routes
look like the trickiest — they wrap legacyAuth in a custom
handler. Want me to start there?
14 files reference it across the API surface. The order routes
look like the trickiest — they wrap legacyAuth in a custom
handler. Want me to start there?
Yes, do orders first.
Yes, do orders first.
The simplest policy elides any tool calls that occurred before the previous user message. This works fine for simple support chatbots, where tool calls are used immediately and output depends on short time horizons. It works poorly for coding agents, where the file you read a dozen turns ago might be critical to the edit you are about to make.
An intermediate solution is to run elision every n turns, each time eliding to some position before the latest turn. This bounds how often you blow the cache, at the cost of a larger recomputation bill when you do. There’s a tradeoff to make: eliding as close to the conversation head as possible is cache-friendly, but that risks dropping information the model would have used on its next turn. Get this wrong and the agent’s quality becomes jagged (“every five turns, it suddenly becomes dumb”).
A natural follow-up: if a file gets read multiple times, why not elide all but the latest read? In theory, great. In practice, the read is early in the conversation, so removing it invalidates the cache for everything that came after. You may end up doing this multiple times in a single session as the same file gets re-read, and pay the recomputation bill each time.
Task-specific policies open up more options: elide after fifteen minutes of inactivity, when the conversation topic shifts, after git commit, and so on. This is an area of active experimentation for me; the right policy is deeply task-specific and demands trial and error.
The other direction you can push context engineering is to add things the user didn’t say. Inject system time and time-since-last-message at the head of the conversation. Inject the user’s calendar for the next hour. Inject — in TeachClaw — the student’s name, current class, and which slide is currently being displayed.
What files have I changed since main?
I’ll check git status.
git status
M src/api/orders.ts
M src/api/billing.ts
?? src/api/notes.ts
system-reminder — three files staged for over an hour; consider whether the user wants a commit checkpoint before continuing.
Injected by the harness just before the next assistant turn, and removed at the next elision pass.
Two modified files and one new file. Want me to draft a commit message before we continue?
This costs you tokens per turn, but doesn’t fill the context window any faster: the injected message gets replaced by a newer version on the next user turn rather than accumulating.
There’s a conflict between output quality and cache efficiency here that needs separate handling. You want the injected message to be present in the right place — typically just before the next assistant turn, while it’s choosing what to do — but you don’t want every injection to invalidate the cache for everything after it. The cleanest approach I’ve found is to keep the head-injected message attached to the most recent user message and remove it at the next elision pass, when the boundary advances past it. It’s then re-injected at the new head. The cache survives because the injected message always sits at the moving conversation boundary, never deep in the cached prefix.
This is also a good place for proactive help from the harness, such as proposing memory files to read. The harness could run a naive RAG pass over the last few turns, propose a handful of memory tags at the head of the conversation, and let the agent decide whether to read any of them on the next turn. I haven’t built this just-in-time memory yet, but it’s next on the list.
In TeachClaw, the agent has a persona — a writing style and focus that frames the discussion. For a business class, we have a CFO (risk, cost, cash flow) and a consultant (framing, stakeholder management). Students use these to look at the same issue from multiple perspectives: discuss a project with the consultant agent, then switch to the CFO to figure out how to cost it and pay for it.
The conversation context has to be preserved across persona switches — the value of the exercise is in continuity, not in starting fresh each time. But small models are poor at consistently following instructions over a long horizon, especially when those instructions arrived once, thousands of tokens ago.
Conversation-head injection handles this naturally. The current persona is named at the head (~5 tokens) with its focus and concerns spelled out in detail (~200 tokens), refreshed every user turn. That’s a 200-token tax per turn, but on a model costing cents per million tokens, it’s well worth paying for a consistent voice.
One twist: the previous conversational structure and tone anchors the LLM, so sometimes LLMs (especially the small ones we’re using) don’t respond to the change in the injected message. This is easily fixed by emitting a short system message (“persona changed to CFO”) which provides the LLM enough of a cue to regulate its output.
When context runs out, we compact it to get more headroom. Compaction replaces a long chunk of the history with a much shorter representation — typically a single summary turn that stands in for everything before it.
System prompt and tool definitions.
System prompt and tool definitions.
Plan a rewrite of the billing webhook handler.
I’ll start by reading the current handler and the tests.
src/webhooks/billing.ts, test/webhooks/billing.test.ts
export async function handleBilling(req, res) { ... 600 lines ... }
Three retry paths, two of them duplicate the dedupe logic. The
cleanest cut is to extract a withIdempotency wrapper and run
each provider branch through it. I’ll sketch the diff.
Earlier this session: user asked for a rewrite plan for the
billing webhook. After reading src/webhooks/billing.ts (600
lines) and its tests, the agent identified three retry paths,
two of which duplicate dedupe logic. Proposed extracting a
withIdempotency wrapper and running each provider branch
through it.
Go ahead, but keep the Stripe path untouched for now.
Go ahead, but keep the Stripe path untouched for now.
This is what most existing coding tools do, and they all converge on roughly the same shape:
The summary itself is generated by the model, given the conversation so far and a prompt asking it to extract orienting facts. Done well, it’s nearly invisible. Done badly, the agent loses the thread and drifts off into absurdity.
Summary-based compaction is noticeably slow — the model has to read the entire window and emit a summary before the next user turn can be served. So why not amortize the work? The idea is to compact-as-you-go by maintaining a running log of what the agent has done, and at compaction time, swap the trace for the log.
The naive version — ask the LLM to log each turn — doesn’t work well. Small models don’t reliably emit logs without being reminded every turn (which is itself a head-injection problem), and turn-by-turn logging doesn’t easily build into a coherent picture of the overall task.
A better approach, I think, is to run an ephemeral command every few turns: spin up a sub-context on top of the current conversation that asks the model to produce or update the log; record the log; throw the sub-context’s turns away. Pair this with an explicit task list as the agent’s goal system, so the log knows what it’s logging about. I haven’t built this yet — also next.
If anything here sounds suspiciously plausible-but-untested, that’s because much of it is. These are ideas I’m actively experimenting with, not finished and polished libraries ready for release.