Memory beats full context on LongMemEval — and the wins we don't get
Our first official benchmark runs: +14.2 points over a full-context baseline on LongMemEval at ~39× fewer tokens, plus the LoCoMo case where full context still wins.
A common objection to agent memory is that you don't need it: context windows are huge now, so just put the whole history in the prompt. We wanted a real answer, not a vibe, so we ran two public long-term-memory benchmarks against a full-context baseline. Here's what we found — including the case where the baseline wins.
The setup
We compared two configurations on the same questions. The full-context baseline stuffs the entire conversation history into the prompt. Eidentic memory ingests the history into its four-tier engine and retrieves only what each question needs. Both use the same model and the same LLM judge. We ran the full sets — no sampling — and we're publishing wins and losses together.
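As a rough illustration of that comparison, here is a minimal sketch of a harness that runs two context strategies over the same questions with the same model and the same judge. Every name in it (`Question`, `full_context`, `retrieve_memory`, `answer`, `judge`, `evaluate`) is hypothetical and stubbed out. The keyword-overlap retriever is only a stand-in, not Eidentic's four-tier engine, and the actual runner in the repo may be organized quite differently.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Question:
    history: list[str]  # conversation turns, oldest first
    text: str           # the question asked after the history
    gold: str           # reference answer


def full_context(q: Question) -> str:
    """Baseline: put the entire history in the prompt."""
    return "\n".join(q.history)


def retrieve_memory(q: Question, k: int = 2) -> str:
    """Stand-in retriever (assumption): keep the k turns that share
    the most words with the question."""
    words = set(q.text.lower().split())
    ranked = sorted(
        q.history,
        key=lambda turn: len(words & set(turn.lower().split())),
        reverse=True,
    )
    return "\n".join(ranked[:k])


def answer(context: str, question: str) -> str:
    """Stub for the shared model call; it simply echoes the context."""
    return context


def judge(prediction: str, gold: str) -> bool:
    """Stub for the shared LLM judge; here a case-insensitive substring check."""
    return gold.lower() in prediction.lower()


def evaluate(
    questions: list[Question],
    build_context: Callable[[Question], str],
) -> tuple[float, float]:
    """Return (accuracy, mean context tokens) for one configuration."""
    correct = 0
    tokens = 0
    for q in questions:
        context = build_context(q)
        tokens += len(context.split())  # whitespace split as a crude token proxy
        correct += judge(answer(context, q.text), q.gold)
    return correct / len(questions), tokens / len(questions)
```

Calling `evaluate(questions, full_context)` and `evaluate(questions, retrieve_memory)` on the same list keeps the model, judge, and question set fixed, so only the context strategy differs between the two runs.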
LongMemEval: memory wins across the board
LongMemEval uses long histories: roughly 115k tokens across ~50 sessions, with 500 questions in total. This is where memory should help, and it does: 55.2% overall vs 41.0% for full context, a 14.2-point gain, with memory ahead on all six question types.
| Question type | Full context | Eidentic memory |
|---|---|---|
| Single-session · user | 67.1% | 84.3% |
| Single-session · assistant | 73.2% | 92.9% |
| Single-session · preference | 3.3% | 26.7% |
| Multi-session | 27.8% | 42.1% |
| Temporal reasoning | 20.3% | 34.6% |
| Knowledge update | 66.7% | 70.5% |
| Overall | 41.0% | 55.2% |
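To make the per-type gaps easier to compare, the short sketch below recomputes the point gain for each row of the table and sorts the rows by it. The numbers are copied from the table; the variable names are ours.

```python
# (full context %, Eidentic memory %) per question type, copied from the table
results = {
    "Single-session · user": (67.1, 84.3),
    "Single-session · assistant": (73.2, 92.9),
    "Single-session · preference": (3.3, 26.7),
    "Multi-session": (27.8, 42.1),
    "Temporal reasoning": (20.3, 34.6),
    "Knowledge update": (66.7, 70.5),
}

# Point gain of memory over full context, rounded to one decimal place
gains = {name: round(mem - full, 1) for name, (full, mem) in results.items()}

# Print the question types from largest to smallest gain
for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<28} {gain:+.1f} pts")
```

The largest gain is on single-session preference questions (+23.4 points) and the smallest is on knowledge update (+3.8 points).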
The cost difference is the other half of the story. Memory answers each question with about 2,550 tokens of retrieved context; the baseline spends about 99,435 re-reading the whole history every time. That is roughly 39× fewer tokens (99,435 / 2,550 ≈ 39) for the better score. Retrieval isn't just more accurate here, it's dramatically cheaper.
LoCoMo: where full context still wins
LoCoMo has a much smaller haystack. When the entire history comfortably fits in the window, brute force is hard to beat: the model can see everything at once, and single- and multi-hop questions don't need retrieval. Here the full-context baseline comes out 7.8 points ahead. Memory still uses far fewer tokens (~893 vs ~19,030), but on a small history that trade-off doesn't pay for itself on accuracy.
Across these two benchmarks, the larger history is where memory wins, on accuracy and on cost. On the smaller one, full context still comes out ahead on accuracy. We'd rather you know both numbers than just the flattering one.
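For a side-by-side view of the token cost, this small sketch computes the full-context to memory ratio on both benchmarks from the per-question averages quoted above. Only the dictionary layout and names are ours.

```python
# Approximate context tokens per question, as reported in this post
benchmarks = {
    "LongMemEval": {"full": 99_435, "memory": 2_550},
    "LoCoMo": {"full": 19_030, "memory": 893},
}

# How many times more tokens the full-context baseline reads per question
ratios = {name: t["full"] / t["memory"] for name, t in benchmarks.items()}

for name, ratio in ratios.items():
    print(f"{name}: full context reads about {ratio:.1f}x the tokens")
```

That works out to roughly 39× on LongMemEval and roughly 21× on LoCoMo, so the token saving is large on both; what changes between them is which side wins on accuracy.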
What this means in practice
If your agent's conversations are short and bounded, you may not need a memory engine at all — and we'll tell you that. But the moment histories grow past what you want to pay to re-read on every turn, retrieval-based memory wins twice: better answers, far fewer tokens. That crossover arrives quickly in real products.
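One way to estimate where that crossover sits for your own product is a back-of-the-envelope token model like the sketch below. It is an illustration under stated assumptions, not part of our benchmark: it assumes every turn adds a fixed number of tokens to the history, that full context re-reads the whole history on each turn, and that memory reads a fixed retrieval budget. It ignores ingestion cost and says nothing about accuracy. The 200 tokens per turn is a made-up figure; the 2,550-token budget borrows the LongMemEval average from above.

```python
def full_context_tokens(turn: int, tokens_per_turn: int) -> int:
    """Prompt tokens read at a given turn when the whole history is re-read."""
    return turn * tokens_per_turn


def first_turn_memory_is_cheaper(tokens_per_turn: int, retrieval_budget: int) -> int:
    """First turn at which re-reading the history costs more than a
    fixed retrieval budget (simplified model, see assumptions above)."""
    if tokens_per_turn <= 0:
        raise ValueError("tokens_per_turn must be positive")
    turn = 1
    while full_context_tokens(turn, tokens_per_turn) <= retrieval_budget:
        turn += 1
    return turn


# Hypothetical inputs: 200 tokens added per turn, 2,550-token retrieval budget
crossover = first_turn_memory_is_cheaper(tokens_per_turn=200, retrieval_budget=2_550)
print(f"Under these assumptions, memory reads fewer tokens from turn {crossover} onward")
```

Plugging in your own average turn length and retrieval budget gives a first-order estimate of the token crossover; whether accuracy follows is something only a run on your own data can show.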
Full methodology, the harness, and the raw per-question records are in the benchmarks docs, and the runner lives in the repo. Reproduce it, and tell us where we're wrong.