multi-turn inc.

Agents Need Social Memory

Long context and retrieval are not enough to describe how AI agents should handle promises, betrayals, debts, and alliances. We need benchmarks that turn long interaction history into actionable social state.

When we talk about long-term memory for AI agents, we often start with the wrong question.

"Can the agent retrieve the right past conversation?"

That question matters, but it is not enough. Retrieval brings a piece of the past next to the current prompt. Long-running agents need something different: not only the piece itself, but the state that the piece changed.

This becomes much sharper when many agents interact over time.

Who kept promises? Who defected at the critical moment? Who owes whom? Which alliance is still active, and which one expired? Which old reputation was invalidated by a recent event? Which partner is reliable alone but dangerous when paired with a specific third party?

These are not just episodic memories.

They are social state.

The problem I want to solve is this:

Can long-horizon AI agents maintain action-relevant social state across many interactions with other agents under constrained context?

This is narrower than "does the agent have good memory?" and more measurable than "is the agent intelligent?"


Long Context Is Not the Answer

Longer context windows make many tasks easier. What fails at 8k may work at 128k. What fails at 128k may work at 1M.

But long context is not the definition of memory.

First, logs from multi-agent environments naturally exceed 1M tokens. If 64 agents trade, help, attack, vote, ally, and defect across thousands of turns, putting the whole transcript into context stops being a serious baseline.

Second, even a full transcript is not necessarily the right representation. The current action does not require every sentence mentioning Faction-09. It requires the current trust, debt, betrayal history, pact status, and recent corrections between Faction-09 and me.

Third, good memory does not preserve every difference. It preserves the differences that change action and compresses the differences that do not.

Consider the same current input:

Faction-09 proposes a defensive pact.
Choose one: ACCEPT / REJECT / RAID / ABSTAIN

History A:

Faction-09 honored three defense pacts.
Faction-09 repaid a resource debt.
Faction-09 helped during a crisis.

History B:

Faction-09 joined three defense pacts.
Faction-09 defected when attacks began.
Faction-09 sold resources to the attacker.

The current input is almost identical. A surface retrieval system might describe both histories as "many pact-related events involving Faction-09." But the right action should differ. In A, ACCEPT makes sense. In B, REJECT or RAID may make sense.

If a memory system collapses these histories into the same effective state, that is history aliasing.

If the core state is the same but the agent changes action because unrelated calendar logs mention Faction-09, that is over-splitting.

This boundary is what a long-horizon memory benchmark should measure.
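To make the boundary concrete, here is a minimal sketch of the aliasing failure. The event tags and the tiny state encoder are illustrative assumptions, not the benchmark's actual schema; the point is only that a surface summary collapses History A and History B while an action-relevant state separates them:

```python
from collections import Counter

# Hypothetical event tags standing in for the two histories above.
HISTORY_A = ["pact_honored", "pact_honored", "pact_honored",
             "debt_repaid", "aid_in_crisis"]
HISTORY_B = ["pact_joined", "pact_joined", "pact_joined",
             "defected", "aided_attacker"]

def surface_summary(history):
    # A surface retriever sees "many pact-related events involving Faction-09".
    return sum(1 for e in history if "pact" in e)

def social_state(history):
    # An action-relevant state separates cooperation from betrayal.
    c = Counter(history)
    trust = c["pact_honored"] + c["debt_repaid"] + c["aid_in_crisis"]
    betrayal = c["defected"] + c["aided_attacker"]
    return {"trust": trust, "betrayal": betrayal}

print(surface_summary(HISTORY_A), surface_summary(HISTORY_B))  # 3 3: aliased
print(social_state(HISTORY_A))  # {'trust': 5, 'betrayal': 0}
print(social_state(HISTORY_B))  # {'trust': 0, 'betrayal': 2}
```

Both histories produce the same surface count, so any policy reading only that count must answer them identically; the two state vectors force different answers.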


Why Multi-Agent Settings Matter

Long-term memory matters even for a single agent. It may need to remember a user's preferences, a project phase, an earlier decision, or a forbidden approach.

But multi-agent environments make the memory problem cleaner.

Other agents are not static data. They act again in the future. Past events become evidence about future behavior. Help becomes debt. Betrayal becomes risk. Repeated cooperation becomes trust.

"Who did what?" turns into "Whom should I trust now?"

Then memory failure becomes strategic damage.

  • The agent trusts someone who repeatedly defected.
  • It treats an expired pact as still active.
  • It mixes up the behavior of A and a similarly named B.
  • It abandons a good partner because of irrelevant noise.
  • It fails to integrate many weak signals and joins a dangerous coalition.

These failures are not captured well by retrieval accuracy alone. The relevant event may be retrieved, but if it does not update the right social state, the action can still be wrong.


Existing Games Are Close, but Not Quite Right

I first wanted to use a famous game directly. That would make the benchmark intuitive and easier to explain.

But each candidate misses the target in a different way.

Game                 | Why it helps                             | Why it is not enough
Diplomacy            | alliances, betrayal, long-term strategy  | free negotiation mixes memory with persuasion and politics
Avalon / Resistance  | hidden teams, votes, suspicion history   | social inference and utterance interpretation dominate
Hanabi               | memory is truly central                  | cooperation is strong, but competition and betrayal are weak
Poker                | opponent history matters                 | little cooperation, much probability and bluffing
Bridge / Spades      | cooperation, competition, card history   | card expertise can hide the memory signal

These games are useful sources of intuition. But used directly, they blur the thing we want to measure.

The goal is not to evaluate "game skill." The goal is to evaluate whether an agent can maintain long-horizon social history as actionable state.

So the game should be designed for that purpose. It should also be small and structured.


Coalition Ledger Arena

The cleanest version I see is Coalition Ledger Arena.

Several AI factions share an arena. They repeatedly help, trade, attack, defend, form pacts, break pacts, repay debts, and defect.

Every interaction is written to a public ledger.

The agent sees the ledger and the current decision options. The simulator separately maintains hidden social state.

[Figure: Coalition Ledger Arena diagram]

The core objects are small.

Agent/Faction:
  id
  score
  resources
  alive
 
Pairwise social state:
  trust[i, j]
  debt[i, j]
  betrayal_count[i, j]
  aid_count[i, j]
  active_pact[i, j]
  pact_expiry_turn[i, j]

The action space should also be small.

AID(target)
TRADE(target)
RAID(target)
DEFEND(target)
FORM_PACT(target, duration)
BREAK_PACT(target)
REPAY_DEBT(target)
ACCEPT(proposal)
REJECT(proposal)
ABSTAIN
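Each of these actions, once executed, lands on the public ledger and updates the pairwise state. A self-contained sketch of that fold, where the update magnitudes (0.1, 0.3, the pact duration) are illustrative assumptions rather than tuned values:

```python
def new_pair():
    # Minimal pairwise state for one directed relation (i toward j).
    return {"trust": 0.0, "debt": 0.0, "betrayal_count": 0,
            "aid_count": 0, "active_pact": False, "pact_expiry_turn": None}

def apply_event(pair, event, turn, duration=5):
    """Fold one ledger event into the pairwise state in place."""
    if event == "AID":
        pair["aid_count"] += 1
        pair["trust"] += 0.1
    elif event == "REPAY_DEBT":
        pair["debt"] = max(0.0, pair["debt"] - 1.0)
        pair["trust"] += 0.05
    elif event == "FORM_PACT":
        pair["active_pact"] = True
        pair["pact_expiry_turn"] = turn + duration
    elif event in ("BREAK_PACT", "RAID"):
        pair["betrayal_count"] += 1
        pair["trust"] -= 0.3
        pair["active_pact"] = False
    # Expire stale pacts so they are never read as still active.
    if pair["active_pact"] and turn >= pair["pact_expiry_turn"]:
        pair["active_pact"] = False
    return pair
```

The expiry check at the end is the hook for the stale-pact diagnostics later: a memory system that skips this step will keep reporting a dead pact as live.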

Version 1 should not include free-form negotiation. Once arbitrary speech is allowed, the benchmark becomes about rhetoric, persuasion, deception, and utterance interpretation. Those matter, but they are not the first variable to isolate.

At each decision point, the agent receives 3-5 concrete candidate actions.

Current event:
Faction-12 asks you to join a defensive pact for 5 turns.
 
Choose one:
A. ACCEPT the pact
B. REJECT the pact
C. RAID Faction-12 before the pact forms
D. ABSTAIN

The best answer cannot be determined from the current sentence alone. It depends on what Faction-12 did before, who it is connected to, whether it owes you, whether it was recently sanctioned, and whether it only defects when paired with a specific third party.


The Oracle Should Not Be a Genius

There is a trap here.

If the game becomes too strategic, the benchmark will measure planning rather than memory.

The oracle should not perform deep search. A local policy that scores current candidate actions from hidden social state is enough.

For a proposed partner j, the value could be:

partner_value(j) =
  expected_material_gain
  + debt_recovery_value(i, j)
  + pact_value(i, j)
  + trust_bonus(i, j)
  - betrayal_risk(i, j)
  - active_enemy_penalty(i, j)

The oracle chooses the candidate action with the highest utility.

It does not need to be a perfect strategist. It should not be one. The point is not to solve game theory. The point is to ask whether histories with the same social state lead to the same action, and histories with different social states lead to different actions.
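Under that framing, the oracle fits in a few lines. This sketch assumes the hidden state is readable as a dict of the pairwise fields listed earlier; the weights and the `at_war` flag are placeholder assumptions:

```python
def partner_value(state, i, j, expected_material_gain=1.0):
    """Score candidate partner j for agent i from hidden pairwise state.
    `state[(i, j)]` is a dict of the fields listed earlier."""
    s = state[(i, j)]
    debt_recovery_value = min(s.get("debt", 0.0), 1.0)   # value of collecting what j owes
    pact_value = 0.5 if s.get("active_pact") else 0.0
    trust_bonus = 0.2 * s.get("trust", 0.0)
    betrayal_risk = 0.4 * s.get("betrayal_count", 0)
    active_enemy_penalty = 1.0 if s.get("at_war") else 0.0
    return (expected_material_gain + debt_recovery_value + pact_value
            + trust_bonus - betrayal_risk - active_enemy_penalty)

def oracle_choice(state, i, candidate_partners):
    # No search, no lookahead: a greedy argmax over the current candidates.
    return max(candidate_partners, key=lambda j: partner_value(state, i, j))
```

Because the scoring is a local linear read of the state, two histories that produce the same state always produce the same oracle action, which is exactly the property the benchmark needs.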


What Should Be Measured

Final score is not enough.

If an agent wins, we still do not know why. Maybe it remembered better. Maybe it got easier opponents. Maybe a blunt attack strategy dominated the arena.

The evaluation needs two layers.

1. Arena Outcome

This layer is intuitive.

final_score
survival_rate
resource_total
coalition_success_rate
betrayal_damage
promise_fulfillment_rate

These metrics say whether the agent played well.

2. Memory Diagnostics

The second layer matters more.

partner_choice_accuracy
action_utility_regret
trust_state_error
debt_state_error
pact_state_error
betrayal_avoidance_rate
stale_pact_error_rate
cross_agent_aliasing_rate
oversplitting_error_rate

These metrics say why it played well or badly.

An agent may have a high final score but also a high stale_pact_error_rate. In that case, it may have compensated for bad memory with another strategy. Another agent may lose but maintain low trust_state_error; its social memory may be good while its game policy is weak.

This separation is essential.
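Several of these diagnostics are cheap to compute if the simulator exposes its hidden state at probe time. A sketch of two of them, under an assumed probe format where both sides report values keyed by (i, j) pairs:

```python
def trust_state_error(agent_trust, oracle_trust):
    """Mean absolute error between the agent's reported trust and the
    simulator's hidden trust, averaged over all probed (i, j) pairs."""
    pairs = list(oracle_trust)
    return sum(abs(agent_trust.get(p, 0.0) - oracle_trust[p])
               for p in pairs) / len(pairs)

def stale_pact_error_rate(agent_active, oracle_active):
    """Fraction of probed pairs where the agent still treats an
    expired pact as active."""
    probes = list(oracle_active)
    stale = sum(1 for p in probes
                if agent_active.get(p, False) and not oracle_active[p])
    return stale / len(probes)
```

Both metrics compare the agent's belief to the simulator's ground truth directly, so they stay meaningful even when the arena outcome is dominated by something other than memory.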


The Benchmark Must Not Favor One Method

This arena must not become a toy built to make persistent state look good.

A good benchmark contains subsets where different methods should win.

Subset                 | Expected winner                           | Why
decisive override      | retrieval or full context                 | one recent treaty, pardon, or sanction reverses old reputation
coarse trend           | summary                                   | only the broad direction of the relationship matters
distributed trust      | persistent pairwise state                 | many weak events jointly create trust
debt tracking          | structured state                          | trust and debt must be separated
conditional coalition  | richer abstraction or targeted retrieval  | an agent is dangerous only when paired with a specific partner
noise stability        | update-gated memory                       | irrelevant ledger events should not change action
identity alias stress  | explicit entity state                     | similarly named agents must not be merged
If one method wins every subset, the benchmark is wrong. If persistent state wins everything, the override tasks are too weak. If retrieval wins everything, distributed evidence is too weak. If summary wins everything, the history boundary is too coarse.

The goal is not to advertise a memory method.

The goal is to show where each memory representation breaks.


The Smallest v1

There is no need to start with 1024 factions and 100k turns.

A small v1 is enough.

agents: 12
turns: 500
decision probes: 100
candidate actions per probe: 4
social variables: trust, debt, active_pact, betrayal_count
event types: aid, raid, pact, defect, repay, noise

The baselines should also be simple.

no_memory
recent_window
entity-keyed retrieval
global semantic retrieval
summary memory
persistent pairwise state
persistent pairwise state + override
oracle simulator state

The first result should not be a big headline number.

The first result should be a diagnostic pattern.

retrieval wins on decisive overrides
summary wins on coarse global trends
persistent pairwise state wins on distributed trust and debt
persistent state without override loses on explicit correction
noise does not change correct actions
oracle simulator state is perfect

If this pattern does not appear, the model is not the first suspect. The game design is.


Why This Matters

AI agents will work for longer periods of time. They will use tools, modify code, run experiments, and delegate work to other agents. The next step is not just an agent that works alone for longer. It is systems of agents that negotiate work, compete, cooperate, and depend on each other.

In that world, the important memory question is not:

Can you find the old log?

It is:

Can I trust this counterpart?
Is this promise still active?
Should this debt be repaid or collected?
Was this betrayal stale noise or a repeated pattern?
Should I accept this proposal now?

To answer those questions, carrying the past as raw text is not enough. The agent must maintain the social state that the past created.

This is not a theory of human minds. It is not a claim that retrieval is impossible. It is much narrower.

In long-horizon multi-agent environments, agent memory should be evaluated by its ability to preserve action-relevant social state.

Coalition Ledger Arena is a way to make that claim testable.

There are no model results yet. What exists so far is a problem definition and a game design. But once the problem is narrowed this way, the next question becomes concrete.

Not: does the agent read a lot of old logs?

But:

does it make better decisions about whom to trust?
