Managing Long Contexts in Agentic Coding Systems

Nov 4, 2025

Written by

Sudhir Balaji

There has been a lot of chatter in the coding agent world over the last few months about context engineering and managing long contexts. The defining work on this is probably Anthropic’s blog post, but the thing that prompted me to write this post was Cognition’s post about their fast context retrieval models and @swyx’s tweet about it. In this article I’ll cover existing strategies and some of our thinking about how to implement them, along with some modernizations and other improvements.

Existing strategies

Compaction

A lot has been written about compaction, so I’ll keep it brief - the core idea is to summarise part of a long LLM conversation, and replace that part in the original conversation with the summary.
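Here’s roughly what that looks like in code - a minimal sketch, assuming a `call_llm` helper and an arbitrary keep-the-last-20-messages threshold, neither of which corresponds to any particular product’s implementation:

```python
# Minimal compaction sketch: summarise everything except the most recent
# messages, then splice the summary back in where the old messages were.
def compact(messages, call_llm, keep_recent=20):
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = call_llm(
        "Summarise this agent conversation, preserving file paths, "
        "decisions made, and open questions:\n\n" + transcript
    )
    return [{"role": "user", "content": f"[Summary of earlier conversation]\n{summary}"}] + recent
```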

Claude Code has this feature, Cognition had their own model to do compaction, and Cursor calls it summarization. The feature is ubiquitous, probably in large part because it’s so simple to implement.

I’m starting to notice some forays into offering compaction as part of the base LLM APIs - Anthropic has a beta feature that will automatically clear old thinking and tool use messages. I expect OpenAI to follow suit - their Responses API has had a truncation option for a while, and I imagine in time it might expose smarter strategies for handling long contexts than just chopping off the oldest messages.

Agentic memory

The idea here is that allowing the coding agent to persist some sort of artifact outside of its context window that gets injected automatically, or is available to read via tools, will help to keep the agent on-track.

There’s not much consensus among coding agent builders about how best to handle agentic memory - people seem to have their own tricks. Mechanisms I’ve seen in the wild include text scratchpads, todo lists and planning documents.

Agentic memory isn’t a direct solution for handling long agent conversations - rather, it works in tandem with compaction to avoid losing important information that the agent needs to carry out its work.
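As a toy illustration of the idea - a text scratchpad persisted to disk, written via a tool and injected into the prompt each turn (the file path and function names are made up for the sketch):

```python
from pathlib import Path

SCRATCHPAD = Path("agent_scratchpad.md")  # lives outside the context window

def write_note(note: str) -> str:
    """Tool the agent can call to jot down something it must not forget."""
    with SCRATCHPAD.open("a") as f:
        f.write(note.rstrip() + "\n")
    return "noted"

def with_memory(system_prompt: str) -> str:
    """Inject the scratchpad contents into the system prompt each turn,
    so notes survive even after the conversation itself is compacted."""
    notes = SCRATCHPAD.read_text() if SCRATCHPAD.exists() else ""
    return f"{system_prompt}\n\n## Scratchpad\n{notes}" if notes else system_prompt
```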

Sub-agents

In the early days of coding agents, many seemed convinced that indexing codebases and running vector searches over them was the best way to find the relevant files and folders for a given coding agent task.

Gradually, the industry moved towards “agentic search”, i.e. letting coding agents poke around files and directories by themselves, since it seemed to work at least as well as vector search and was simpler to implement (I think the release of Claude Code probably cemented this).

However, agentic search is slow and can be very token inefficient - coding agents usually spend a few turns hunting down relevant files and folders, adding a whole bunch of tokens to the context window regardless of whether or not the agent finds what it was looking for.

Here’s where the sub-agent concept comes in - what if the main coding agent could delegate the job of finding the right files and folders and returning relevant snippets to a different agent? You could use a smaller/cheaper/faster LLM to do this searching, and avoid clogging up the main coding agent conversation!
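A rough sketch of the shape of this, where `run_small_agent` stands in for whatever agent loop you run on the smaller model and the tool names are illustrative:

```python
# Delegate codebase search to a cheaper model running its own agent loop.
# Only the final answer comes back to the main agent; the sub-agent's
# exploratory turns never touch the main context window.
def find_context(query: str, repo_path: str, run_small_agent) -> str:
    return run_small_agent(
        system=("You are a code search agent. Explore the repository and "
                "return only the file paths and snippets relevant to the query."),
        user=f"Repository: {repo_path}\nQuery: {query}",
        tools=["list_dir", "read_file", "grep"],  # illustrative tool names
    )
```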

I think this is a very neat idea - and so does Cognition - they doubled down on this recently, with their SWE-grep and SWE-grep-mini models, designed for fast context retrieval.

As a small aside, a novel technique that behaves sort of like a context retrieval sub-agent comes from MoonshotAI’s Kimi CLI - it has a tool that lets the agent send a message “back in time” and reset the conversation to a previous state. I haven’t yet tried it, but it could work really well to save tokens on the main coding agent thread without introducing sub-agent complexity (though prompt cache performance might suffer).
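In spirit - and this is my guess at the mechanics, not Kimi CLI’s actual implementation - a rewind tool might look something like:

```python
def rewind(messages, checkpoint_index, new_message):
    """Drop everything after a saved checkpoint and continue from there with
    a fresh message carrying forward whatever the agent learned."""
    kept = list(messages[:checkpoint_index])
    kept.append({"role": "user", "content": new_message})
    return kept  # turns cached past the checkpoint no longer help, hence the cache cost
```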

All of the above techniques are useful for managing single, very long-running coding agents; we already use some of them at cto.new, and others we’re in the middle of testing and implementing.

But focusing specifically on background agents (as we do at cto.new), and on the execution of coding tasks like feature work or bug fixing (as opposed to more ambiguous or exploratory work), I think long context management becomes more of a system design problem than an algorithmic one.

Task decomposition

I have two hot takes here:

  1. background agents fundamentally allow for easier, more straightforward compression of history across many iterations on a single task, and

  2. there are few coding tasks that are large enough to overrun an LLM’s context window that are not also decomposable into smaller tasks

Background agent design

A brief explanation of our background coding agent design:

  • a unit of work in our coding agent system is a “task”, which can have one or more “task runs”

  • a task can be created and a task run can be initiated in a number of ways

    • an external ticket provider like Linear or Trello

    • directly from our web interface

    • via a planning agent (more on that later)

  • a task run corresponds to one coding agent loop - where the input is a user prompt (or ticket details, etc.) and a target codebase, and the output is a PR

  • further iteration on the PR triggers additional task runs

Until we run out of room in the context window, we can provide the full conversation history of all previous task runs to each subsequent task run.

When we do run out of room, we can compact whole task runs, which saves us from having to decide exactly where within a conversation to start compacting.

This works well because code changes persist across task runs, so we have a kind of agentic memory in the code that the agent has changed and written.
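In code, the shape of this is roughly the following (simplified - the names, the token-budget check and the `summarize` helper are illustrative rather than our exact implementation):

```python
from dataclasses import dataclass, field

@dataclass
class TaskRun:
    messages: list              # full conversation for one agent loop (prompt -> PR)
    summary: str | None = None  # filled in once the whole run has been compacted

@dataclass
class Task:
    prompt: str                 # user prompt, ticket details, etc.
    runs: list = field(default_factory=list)

def build_context(task: Task, token_budget: int, count_tokens, summarize) -> list:
    """Feed each new task run the history of all previous runs, compacting
    whole runs (oldest first) once the history no longer fits."""
    def render():
        msgs = []
        for run in task.runs:
            if run.summary is not None:
                msgs.append({"role": "user",
                             "content": f"[Summary of earlier task run]\n{run.summary}"})
            else:
                msgs.extend(run.messages)
        return msgs

    history = render()
    for run in task.runs:  # oldest first
        if sum(count_tokens(m) for m in history) <= token_budget:
            break
        run.summary = summarize(run.messages)  # compact this entire run
        history = render()
    return history
```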

Decomposing large tasks

This is the hotter take of the two - I think that a very large fraction of coding tasks given to agents that end up overflowing a context window can be relatively straightforwardly decomposed into smaller tasks.

We have started to introduce a planning agent in cto.new, which will investigate and produce detailed plans to chunk up a user’s request into many smaller tasks, to be executed over one or more repositories.

As an example, for a request along the lines of “Implement a waitlist system”, the planning agent will investigate all of the user’s relevant repositories (frontend, backend, etc.) and come up with several smaller tasks that together make up the complete implementation, which the user can then edit and execute.

The user remains in full control of what coding tasks are actually kicked off, but now they have a more granular view of the job - code reviews should be easier for smaller tasks, context windows are less likely to overflow, and course-correction is easier.
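For the waitlist example above, the planner’s output is conceptually something like this (a simplified stand-in, not our actual schema):

```python
plan = {
    "request": "Implement a waitlist system",
    "tasks": [
        {"repo": "backend",
         "title": "Add waitlist table and signup endpoint",
         "details": "Create a waitlist_entries table and a POST /waitlist route ..."},
        {"repo": "backend",
         "title": "Add admin endpoint to list and invite waitlist entries",
         "details": "..."},
        {"repo": "frontend",
         "title": "Add waitlist signup form to the landing page",
         "details": "Wire the form up to POST /waitlist ..."},
    ],
}
# Each entry becomes its own task (and task run) once the user approves it.
```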

Of course, the downsides are that the planning agent is not very fast and may still struggle with underspecified tasks, but we only expect this to improve with time.

What about…?

What about long-context models?

We might expect that, in time, ever-longer context windows will render some of these long context management techniques unnecessary. That’s probably true, though at the moment context rot seems likely to remain an issue.

There are secondary benefits to using the context window effectively, too - the less of it you need to get good results, the faster your agent will run and the less it will cost.

What if there are so many task runs that their summaries clog up the model’s context window?

This hasn’t happened to us yet, but it’s possible! Mostly it would be a sign that the original task was too large - in which case the user would likely have benefited significantly from using our planning agent.

Thinking briefly about how we might handle this anyway: we could summarise the task run summaries until there is enough space in the context window, and carry on summarising summaries recursively (sounds nasty, though).
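Something like this, in other words (purely a sketch of the idea - `summarize`, `count_tokens` and the batch size are placeholders):

```python
def compact_summaries(summaries, token_budget, count_tokens, summarize, batch=4):
    """Recursively fold the oldest summaries into a summary-of-summaries
    until the whole set fits inside the budget."""
    while len(summaries) > 1 and sum(count_tokens(s) for s in summaries) > token_budget:
        merged = summarize("\n\n".join(summaries[:batch]))
        summaries = [merged] + summaries[batch:]
    return summaries
```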

What if you cannot break a large task up?

I am very happy to admit that there are tasks which cannot easily be broken up into subtasks. There are probably many cases where a large part of the actual work is defining interfaces between things such that separate tasks can work on a request in parallel.

For these, synchronous development workflows and IDEs are probably still the best fit (for now).

Applying these techniques to cto.new

We’re applying these techniques and more to cto.new. Because it’s completely free, you can try the results for yourself at cto.new and let us know what you think, and what you’re building, in our Discord community.

All rights reserved 2025,
Era Technologies Holdings Inc.
