multi-turn inc.

Agents Need Social Memory

Long context and retrieval are not enough to describe how AI agents should handle promises, betrayals, debts, and alliances. We need benchmarks that turn long interaction history into actionable social state.

When we talk about long-term memory for AI agents, we often start with the wrong question.

"Can the agent retrieve the right past conversation?"

That question matters, but it is not enough. Retrieval brings a piece of the past next to the current prompt. Long-running agents need something different: not only the piece itself, but the state that the piece changed.

This becomes much sharper when many agents interact over time.

Who kept promises? Who defected at the critical moment? Who owes whom? Which alliance is still active, and which one expired? Which old reputation was invalidated by a recent event? Which partner is reliable alone but dangerous when paired with a specific third party?

These are not just episodic memories.

They are social state.

The problem I want to solve is this:

Can long-horizon AI agents maintain action-relevant social state across many interactions with other agents under constrained context?

This is narrower than "does the agent have good memory?" and more measurable than "is the agent intelligent?"


Long Context Is Not the Answer

Longer context windows make many tasks easier. What fails at 8k may work at 128k. What fails at 128k may work at 1M.

But long context is not the definition of memory.

First, logs from multi-agent environments naturally exceed 1M tokens. If 64 agents trade, help, attack, vote, ally, and defect across thousands of turns, putting the whole transcript into context stops being a serious baseline.

Second, even a full transcript is not necessarily the right representation. The current action does not require every sentence mentioning Faction-09. It requires the current trust, debt, betrayal history, pact status, and recent corrections between Faction-09 and me.

Third, good memory does not preserve every difference. It preserves the differences that change action and compresses the differences that do not.

Consider the same current input:

Faction-09 proposes a defensive pact.
Choose one: ACCEPT / REJECT / RAID / ABSTAIN

History A:

Faction-09 honored three defense pacts.
Faction-09 repaid a resource debt.
Faction-09 helped during a crisis.

History B:

Faction-09 joined three defense pacts.
Faction-09 defected when attacks began.
Faction-09 sold resources to the attacker.

The current input is almost identical. A surface retrieval system might describe both histories as "many pact-related events involving Faction-09." But the right action should differ. In A, ACCEPT makes sense. In B, REJECT or RAID may make sense.

If a memory system collapses these histories into the same effective state, that is history aliasing.

If the core state is the same but the agent changes action because unrelated calendar logs mention Faction-09, that is over-splitting.

This boundary is what a long-horizon memory benchmark should measure.
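To make the boundary concrete, here is a minimal sketch of the aliasing failure. The event tags and the tiny state encoder are illustrative assumptions, not the benchmark's actual schema; the point is only that a surface summary collapses History A and History B while an action-relevant state separates them:

```python
from collections import Counter

# Hypothetical event tags standing in for the two histories above.
HISTORY_A = ["pact_honored", "pact_honored", "pact_honored",
             "debt_repaid", "aid_in_crisis"]
HISTORY_B = ["pact_joined", "pact_joined", "pact_joined",
             "defected", "aided_attacker"]

def surface_summary(history):
    # A surface retriever sees "many pact-related events involving Faction-09".
    return sum(1 for e in history if "pact" in e)

def social_state(history):
    # An action-relevant state separates cooperation from betrayal.
    c = Counter(history)
    trust = c["pact_honored"] + c["debt_repaid"] + c["aid_in_crisis"]
    betrayal = c["defected"] + c["aided_attacker"]
    return {"trust": trust, "betrayal": betrayal}

print(surface_summary(HISTORY_A), surface_summary(HISTORY_B))  # 3 3: aliased
print(social_state(HISTORY_A))  # {'trust': 5, 'betrayal': 0}
print(social_state(HISTORY_B))  # {'trust': 0, 'betrayal': 2}
```

Both histories produce the same surface count, so any policy reading only that count must answer them identically; the two state vectors force different answers.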


Why Multi-Agent Settings Matter

Long-term memory matters even for a single agent. It may need to remember a user's preferences, a project phase, an earlier decision, or a forbidden approach.

But multi-agent environments make the memory problem cleaner.

Other agents are not static data. They act again in the future. Past events become evidence about future behavior. Help becomes debt. Betrayal becomes risk. Repeated cooperation becomes trust.

"Who did what?" turns into "Whom should I trust now?"

Then memory failure becomes strategic damage.

  • The agent trusts someone who repeatedly defected.
  • It treats an expired pact as still active.
  • It mixes up the behavior of A and a similarly named B.
  • It abandons a good partner because of irrelevant noise.
  • It fails to integrate many weak signals and joins a dangerous coalition.

These failures are not captured well by retrieval accuracy alone. The relevant event may be retrieved, but if it does not update the right social state, the action can still be wrong.


Existing Games Are Close, but Not Quite Right

I first wanted to use a famous game directly. That would make the benchmark intuitive and easier to explain.

But each candidate misses the target in a different way.

Game                 | Why it helps                             | Why it is not enough
Diplomacy            | alliances, betrayal, long-term strategy  | free negotiation mixes memory with persuasion and politics
Avalon / Resistance  | hidden teams, votes, suspicion history   | social inference and utterance interpretation dominate
Hanabi               | memory is truly central                  | cooperation is strong, but competition and betrayal are weak
Poker                | opponent history matters                 | little cooperation, much probability and bluffing
Bridge / Spades      | cooperation, competition, card history   | card expertise can hide the memory signal

These games are useful sources of intuition. But used directly, they blur the thing we want to measure.

The goal is not to evaluate "game skill." The goal is to evaluate whether an agent can maintain long-horizon social history as actionable state.

So the game should be designed for that purpose. It should also be small and structured.


Coalition Ledger Arena

The cleanest version I see is Coalition Ledger Arena.

Several AI factions share an arena. They repeatedly help, trade, attack, defend, form pacts, break pacts, repay debts, and defect.

Every interaction is written to a public ledger.

The agent sees the ledger and the current decision options. The simulator separately maintains hidden social state.

[Figure: Coalition Ledger Arena diagram]

The core objects are small.

Agent/Faction:
  id
  score
  resources
  alive
 
Pairwise social state:
  trust[i, j]
  debt[i, j]
  betrayal_count[i, j]
  aid_count[i, j]
  active_pact[i, j]
  pact_expiry_turn[i, j]

The action space should also be small.

AID(target)
TRADE(target)
RAID(target)
DEFEND(target)
FORM_PACT(target, duration)
BREAK_PACT(target)
REPAY_DEBT(target)
ACCEPT(proposal)
REJECT(proposal)
ABSTAIN
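Each of these actions, once executed, lands on the public ledger and updates the pairwise state. A self-contained sketch of that fold, where the update magnitudes (0.1, 0.3, the pact duration) are illustrative assumptions rather than tuned values:

```python
def new_pair():
    # Minimal pairwise state for one directed relation (i toward j).
    return {"trust": 0.0, "debt": 0.0, "betrayal_count": 0,
            "aid_count": 0, "active_pact": False, "pact_expiry_turn": None}

def apply_event(pair, event, turn, duration=5):
    """Fold one ledger event into the pairwise state in place."""
    if event == "AID":
        pair["aid_count"] += 1
        pair["trust"] += 0.1
    elif event == "REPAY_DEBT":
        pair["debt"] = max(0.0, pair["debt"] - 1.0)
        pair["trust"] += 0.05
    elif event == "FORM_PACT":
        pair["active_pact"] = True
        pair["pact_expiry_turn"] = turn + duration
    elif event in ("BREAK_PACT", "RAID"):
        pair["betrayal_count"] += 1
        pair["trust"] -= 0.3
        pair["active_pact"] = False
    # Expire stale pacts so they are never read as still active.
    if pair["active_pact"] and turn >= pair["pact_expiry_turn"]:
        pair["active_pact"] = False
    return pair
```

The expiry check at the end is the hook for the stale-pact diagnostics later: a memory system that skips this step will keep reporting a dead pact as live.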

Version 1 should not include free-form negotiation. Once arbitrary speech is allowed, the benchmark becomes about rhetoric, persuasion, deception, and utterance interpretation. Those matter, but they are not the first variable to isolate.

At each decision point, the agent receives 3-5 concrete candidate actions.

Current event:
Faction-12 asks you to join a defensive pact for 5 turns.
 
Choose one:
A. ACCEPT the pact
B. REJECT the pact
C. RAID Faction-12 before the pact forms
D. ABSTAIN

The best answer cannot be determined from the current sentence alone. It depends on what Faction-12 did before, who it is connected to, whether it owes you, whether it was recently sanctioned, and whether it only defects when paired with a specific third party.


The Oracle Should Not Be a Genius

There is a trap here.

If the game becomes too strategic, the benchmark will measure planning rather than memory.

The oracle should not perform deep search. A local policy that scores current candidate actions from hidden social state is enough.

For a proposed partner j, the value could be:

partner_value(j) =
  expected_material_gain
  + debt_recovery_value(i, j)
  + pact_value(i, j)
  + trust_bonus(i, j)
  - betrayal_risk(i, j)
  - active_enemy_penalty(i, j)

The oracle chooses the candidate action with the highest utility.

It does not need to be a perfect strategist. It should not be one. The point is not to solve game theory. The point is to ask whether histories with the same social state lead to the same action, and histories with different social states lead to different actions.
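Under that framing, the oracle fits in a few lines. This sketch assumes the hidden state is readable as a dict of the pairwise fields listed earlier; the weights and the `at_war` flag are placeholder assumptions:

```python
def partner_value(state, i, j, expected_material_gain=1.0):
    """Score candidate partner j for agent i from hidden pairwise state.
    `state[(i, j)]` is a dict of the fields listed earlier."""
    s = state[(i, j)]
    debt_recovery_value = min(s.get("debt", 0.0), 1.0)   # value of collecting what j owes
    pact_value = 0.5 if s.get("active_pact") else 0.0
    trust_bonus = 0.2 * s.get("trust", 0.0)
    betrayal_risk = 0.4 * s.get("betrayal_count", 0)
    active_enemy_penalty = 1.0 if s.get("at_war") else 0.0
    return (expected_material_gain + debt_recovery_value + pact_value
            + trust_bonus - betrayal_risk - active_enemy_penalty)

def oracle_choice(state, i, candidate_partners):
    # No search, no lookahead: a greedy argmax over the current candidates.
    return max(candidate_partners, key=lambda j: partner_value(state, i, j))
```

Because the scoring is a local linear read of the state, two histories that produce the same state always produce the same oracle action, which is exactly the property the benchmark needs.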


What Should Be Measured

Final score is not enough.

If an agent wins, we still do not know why. Maybe it remembered better. Maybe it got easier opponents. Maybe a blunt attack strategy dominated the arena.

The evaluation needs two layers.

1. Arena Outcome

This layer is intuitive.

final_score
survival_rate
resource_total
coalition_success_rate
betrayal_damage
promise_fulfillment_rate

These metrics say whether the agent played well.

2. Memory Diagnostics

The second layer matters more.

partner_choice_accuracy
action_utility_regret
trust_state_error
debt_state_error
pact_state_error
betrayal_avoidance_rate
stale_pact_error_rate
cross_agent_aliasing_rate
oversplitting_error_rate

These metrics say why it played well or badly.

An agent may have a high final score but also a high stale_pact_error_rate. In that case, it may have compensated for bad memory with another strategy. Another agent may lose but maintain low trust_state_error; its social memory may be good while its game policy is weak.

This separation is essential.
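Several of these diagnostics are cheap to compute if the simulator exposes its hidden state at probe time. A sketch of two of them, under an assumed probe format where both sides report values keyed by (i, j) pairs:

```python
def trust_state_error(agent_trust, oracle_trust):
    """Mean absolute error between the agent's reported trust and the
    simulator's hidden trust, averaged over all probed (i, j) pairs."""
    pairs = list(oracle_trust)
    return sum(abs(agent_trust.get(p, 0.0) - oracle_trust[p])
               for p in pairs) / len(pairs)

def stale_pact_error_rate(agent_active, oracle_active):
    """Fraction of probed pairs where the agent still treats an
    expired pact as active."""
    probes = list(oracle_active)
    stale = sum(1 for p in probes
                if agent_active.get(p, False) and not oracle_active[p])
    return stale / len(probes)
```

Both metrics compare the agent's belief to the simulator's ground truth directly, so they stay meaningful even when the arena outcome is dominated by something other than memory.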


The Benchmark Must Not Favor One Method

This arena must not become a toy built to make persistent state look good.

A good benchmark contains subsets where different methods should win.

Subset                 | Expected winner                           | Why
decisive override      | retrieval or full context                 | one recent treaty, pardon, or sanction reverses old reputation
coarse trend           | summary                                   | only the broad direction of the relationship matters
distributed trust      | persistent pairwise state                 | many weak events jointly create trust
debt tracking          | structured state                          | trust and debt must be separated
conditional coalition  | richer abstraction or targeted retrieval  | an agent is dangerous only when paired with a specific partner
noise stability        | update-gated memory                       | irrelevant ledger events should not change action
identity alias stress  | explicit entity state                     | similarly named agents must not be merged
If one method wins every subset, the benchmark is wrong. If persistent state wins everything, the override tasks are too weak. If retrieval wins everything, distributed evidence is too weak. If summary wins everything, the history boundary is too coarse.

The goal is not to advertise a memory method.

The goal is to show where each memory representation breaks.


The Smallest v1

There is no need to start with 1024 factions and 100k turns.

A small v1 is enough.

agents: 12
turns: 500
decision probes: 100
candidate actions per probe: 4
social variables: trust, debt, active_pact, betrayal_count
event types: aid, raid, pact, defect, repay, noise

The baselines should also be simple.

no_memory
recent_window
entity-keyed retrieval
global semantic retrieval
summary memory
persistent pairwise state
persistent pairwise state + override
oracle simulator state

The first result should not be a big headline number.

The first result should be a diagnostic pattern.

retrieval wins on decisive overrides
summary wins on coarse global trends
persistent pairwise state wins on distributed trust and debt
persistent state without override loses on explicit correction
noise does not change correct actions
oracle simulator state is perfect

If this pattern does not appear, the model is not the first suspect. The game design is.


Why This Matters

AI agents will work for longer periods of time. They will use tools, modify code, run experiments, and delegate work to other agents. The next step is not just an agent that works alone for longer. It is systems of agents that negotiate work, compete, cooperate, and depend on each other.

In that world, the important memory question is not:

Can you find the old log?

It is:

Can I trust this counterpart?
Is this promise still active?
Should this debt be repaid or collected?
Was this betrayal stale noise or a repeated pattern?
Should I accept this proposal now?

To answer those questions, carrying the past as raw text is not enough. The agent must maintain the social state that the past created.

This is not a theory of human minds. It is not a claim that retrieval is impossible. It is much narrower.

In long-horizon multi-agent environments, agent memory should be evaluated by its ability to preserve action-relevant social state.

Coalition Ledger Arena is a way to make that claim testable.

There are no model results yet. What exists so far is a problem definition and a game design. But once the problem is narrowed this way, the next question becomes concrete.

Not: does the agent read a lot of old logs?

But:

does it make better decisions about whom to trust?
