Dissecting HOPE: Do Self-Modifying Memories Actually Self-Modify?

A systematic dissection of Google's Nested Learning (HOPE) architecture. Only one of three mechanisms is working — and not the way theory suggests.

Download Full Paper (PDF) →

Abstract

HOPE (Higher-Order Parametric Experience) extends the Titans memory architecture with three novel mechanisms: multi-timescale contextual memory (CMS), self-modifying memory parameters, and surprise-gated updates. While the original paper demonstrates strong aggregate performance, the individual contribution of each mechanism remains unclear. We present a systematic dissection of HOPE through component ablations, internal dynamics analysis, and scaling experiments.

Our findings reveal a striking asymmetry: (1) self-modification is the sole driver of HOPE's advantage, with its importance scaling 4.5x from the small to the scaled model; (2) the multi-timescale CMS provides no benefit at either scale, with fast and slow levels learning redundant representations (cosine similarity ≈ 0.77); (3) surprise gating is effectively inoperative, as per-token losses remain far above any reasonable threshold (mean = 66.7, min = 9.3), causing the gate to fire on >99.9% of tokens.


1. Introduction

Neural architectures with learnable memory systems have gained renewed interest as alternatives to the fixed key-value caches of standard Transformers. Titans introduced a memory module that stores and retrieves associative patterns through gradient-based updates, demonstrating competitive performance with subquadratic complexity. HOPE extends Titans with three mechanisms designed to improve memory utilization:

  1. Contextual Memory System (CMS): A multi-timescale hierarchy where fast, mid, and slow memory levels update at different frequencies, inspired by complementary learning systems theory.
  2. Self-modifying memory: Meta-learned parameters that generate the memory's key, value, query, and learning rate projections at inference time, enabling the model to adapt its own update rule.
  3. Surprise gating: A mechanism that gates memory updates based on per-token prediction loss, directing compute toward "surprising" inputs.

While HOPE reports improvements over Titans and Transformers, the original evaluation presents only aggregate metrics. This leaves open a critical question: which mechanisms actually contribute to performance, and do they function as theoretically motivated?

Our contributions:

  • A complete component ablation at two scales showing that self-modification alone accounts for HOPE's advantage.
  • Evidence that CMS fast/slow levels learn redundant representations and provide no measurable benefit.
  • Analysis showing surprise gating is effectively non-selective at the loss magnitudes encountered during training.
  • Characterization of self-modification dynamics as an initialization shift rather than continual adaptation.
  • A scaling analysis revealing that self-modification's importance grows superlinearly with model size.

2. Background

Titans. The Titans architecture augments attention with a neural long-term memory module. Given an input sequence, the memory M is updated via gradient descent on an associative recall objective: M ← M − η∇ₘℒ(M; k, v). A surprise metric Sₜ = ‖ℒₜ‖ determines whether each token triggers a memory update.
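This update rule is easy to sketch: for ℒ(M; k, v) = ½‖Mk − v‖², the gradient with respect to M is (Mk − v)kᵀ. A minimal NumPy illustration (our sketch, not the Titans implementation; the surprise here is the recall loss itself, standing in for Sₜ):

```python
import numpy as np

def titans_update(M, k, v, eta=0.5):
    """One online memory write: M <- M - eta * grad_M 0.5*||M k - v||^2."""
    surprise = 0.5 * np.sum((M @ k - v) ** 2)  # per-token recall loss
    M = M - eta * np.outer(M @ k - v, k)       # gradient of the recall loss w.r.t. M
    return M, surprise

rng = np.random.default_rng(0)
d = 8
M = np.zeros((d, d))
k = rng.standard_normal(d)
k /= np.linalg.norm(k)             # unit-norm key so each step contracts the error
v = rng.standard_normal(d)

for _ in range(20):                # repeated writes of the same pair converge
    M, s = titans_update(M, k, v)
print(np.allclose(M @ k, v, atol=1e-3))  # prints True: M now recalls v from k
```

With a unit-norm key, each write halves the recall error, so the memory stores the association after a handful of updates.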

HOPE. HOPE extends Titans along three axes. First, it introduces a CMS that maintains memory at multiple timescales: a fast level updates every p_fast tokens, a slow level every p_slow tokens, with outputs combined via learned gating. Second, it adds self-modification: rather than using fixed projection matrices, HOPE uses meta-learned networks m_k, m_v, m_q, m_η, m_α, m_memory that generate projections conditioned on a teaching signal. The generated parameters constitute a "fast state" that can diverge from the meta parameters during inference. Third, the surprise threshold is tuned as a hyperparameter controlling the selectivity of memory updates.
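To make self-modification concrete, here is a minimal hypernetwork-style sketch in which a meta network generates the key projection from a teaching signal. The shapes, the linear meta network, and the teaching signal are illustrative assumptions, not HOPE's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Meta parameters (slow): a linear map from teaching signal to a d x d projection.
# HOPE's m_k is a learned network; a single matrix stands in for it here.
meta_W = rng.standard_normal((d * d, d)) / np.sqrt(d)

def m_k(teach):
    """Generate the fast-state key projection W_k from a teaching signal."""
    return (meta_W @ teach).reshape(d, d)

teach = rng.standard_normal(d)   # e.g., a summary of recent context (assumption)
W_k = m_k(teach)                 # "fast state": regenerated at inference time,
                                 # free to diverge from the meta parameters
x = rng.standard_normal(d)
k = W_k @ x                      # key used by the memory update
```

The analogous generators m_v, m_q, m_η, m_α, m_memory would produce the remaining projections and learning rates in the same fashion.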


3. Experimental Setup

Model configurations. We evaluate two scales:

  • Small: dim=256, 6 layers, ~19M parameters
  • Scaled: dim=512, 8 layers, ~73M parameters

All models use a context length of 512 tokens and are trained for 2,000 steps with AdamW (lr=3e-4, batch size=8). CMS uses update periods of p_fast=8 and p_slow=32.
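For reference, the two configurations can be written out as plain dictionaries (key names are ours; values are the ones listed above):

```python
# Shared training hyperparameters and CMS update periods from the setup above.
COMMON = dict(context_len=512, steps=2000, lr=3e-4, batch_size=8,
              p_fast=8, p_slow=32, optimizer="AdamW")
SMALL = dict(dim=256, n_layers=6, approx_params="19M", **COMMON)
SCALED = dict(dim=512, n_layers=8, approx_params="73M", **COMMON)
```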

Ablation variants. We train the following at each scale:

  • Full HOPE: All three mechanisms enabled (CMS + self-mod + surprise)
  • No CMS: Self-modification and surprise without multi-timescale memory
  • No Self-Mod: CMS and surprise without self-modifying parameters
  • No Surprise: CMS and self-modification with surprise threshold=0
  • Transformer: Standard Transformer baseline
  • Titans: Original Titans architecture

At the scaled configuration, we additionally train a parameter-matched Transformer (dim=768, 81.2M parameters).

Data. All models are trained on OpenWebText with a GPT-2 BPE tokenizer (vocab size 50,257).

Analysis tools. Checkpoints are saved every 200 steps. We extract: CMS level parameter norms and inter-level similarity; self-modification parameter drift from initialization; per-token surprise value distributions; and meta-parameter vs. fast-state divergence.


4. Results

4.1 Component Ablation

| Variant | Small Loss | Δ | Scaled Loss | Δ |
|---|---|---|---|---|
| Full HOPE | 7.216 | — | 7.237 | — |
| No CMS | 7.201 | −0.015 | **7.079** | −0.158 |
| No Self-Mod | 7.532 | +0.316 | 8.649 | +1.412 |
| No Surprise | 7.322 | +0.106 | — | — |
| Transformer | **6.880** | −0.336 | 7.830 | +0.593 |
| Transformer (matched) | — | — | 7.936 | +0.700 |
| Titans | 7.578 | +0.362 | — | — |

Table 1: Component ablation results. Δ is the loss difference from Full HOPE (positive = worse). Best result per scale in bold.

Three patterns emerge.

First, removing CMS consistently improves performance: -0.015 at small scale and -0.158 at scaled. The multi-timescale hierarchy adds parameters and computation without contributing to learning.

Second, removing self-modification severely degrades performance: +0.316 at small scale, escalating to +1.412 at scaled — a 4.5x increase in importance.

Third, at small scale the Transformer baseline outperforms all HOPE variants (-0.336), but at scaled HOPE overtakes it (+0.593), and even a parameter-matched Transformer with 81.2M parameters (vs. HOPE's 73.1M) cannot close the gap (+0.700).


4.2 CMS Multi-Timescale Dynamics

The CMS is motivated by complementary learning systems theory, which posits that fast and slow learning systems capture different types of information. We test whether this separation emerges in practice.

Figure 1: CMS dynamics across training. (a) Parameter norms remain nearly identical between fast and slow levels. (b) Inter-checkpoint drift is similar for both levels. (c) Cosine similarity between fast and slow level outputs averages 0.77, indicating high redundancy. (d) Per-layer norms show uniform behavior across all 6 layers.

Parameter convergence. The L2 norms of fast and slow level parameters remain within 0.2% of each other throughout training (~26.15 for both), and their inter-checkpoint drift magnitudes are nearly identical (~29.2–29.3).

Output redundancy. The cosine similarity between fast and slow level outputs averages 0.77 across training, with no trend toward differentiation. Despite updating at different frequencies (p_fast=8 vs. p_slow=32), both levels converge to similar representations.
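The redundancy statistic is straightforward to compute. A sketch of the measurement on synthetic data (shapes and variable names are ours, not the analysis code):

```python
import numpy as np

def mean_cosine_similarity(fast_out, slow_out, eps=1e-8):
    """Mean per-token cosine similarity between two (n_tokens, dim) outputs."""
    num = np.sum(fast_out * slow_out, axis=1)
    den = (np.linalg.norm(fast_out, axis=1)
           * np.linalg.norm(slow_out, axis=1) + eps)
    return float(np.mean(num / den))

# Synthetic check: strongly correlated outputs score high, as the CMS levels do.
rng = np.random.default_rng(0)
fast = rng.standard_normal((1000, 64))
slow = fast + 0.8 * rng.standard_normal((1000, 64))  # correlated "slow" level
print(mean_cosine_similarity(fast, slow))            # high, around 0.78
```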

Uniform layer behavior. Per-layer analysis shows CMS norms are identical across all 6 layers, suggesting no layer-specific specialization.

These findings explain why removing CMS improves performance: the two levels perform redundant computation while adding parameters that must be optimized.


4.3 Self-Modification Convergence

HOPE's self-modification mechanism generates "fast state" parameters from meta-learned networks. We analyze whether this mechanism produces meaningful adaptation.

Figure 2: Self-modification dynamics. (a) L2 drift from initialization over training. Large projection matrices show drift ≈13 that plateaus immediately; small components drift ≈0.8–5. (b) Relative drift between meta parameters and fast state is ≈1.0, confirming self-modification occurs.

Drift from initialization. We track 84 self-modification parameters across 6 layers and 6 memory components. Large weight matrices show substantial L2 drift (≈13) from Kaiming initialization, while smaller components drift by ≈0.8–5. Critically, this drift occurs within the first 200 training steps and remains essentially flat thereafter. This suggests self-modification functions as a learned initialization offset rather than continual online adaptation.

Meta vs. fast state divergence. The relative drift ‖θ_fast − θ_meta‖ / ‖θ_meta‖ is ≈1.0 for all components, confirming that fast state parameters diverge substantially from their meta-learned generators.
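The divergence metric is a direct transcription of the formula above (inputs here are illustrative):

```python
import numpy as np

def relative_drift(theta_fast, theta_meta):
    """||theta_fast - theta_meta|| / ||theta_meta||, computed per component."""
    return (np.linalg.norm(theta_fast - theta_meta)
            / np.linalg.norm(theta_meta))

# Example: a fast state at twice its generator's scale has relative drift 1.0.
theta_meta = np.ones(16)
print(relative_drift(2.0 * theta_meta, theta_meta))  # prints 1.0
```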

Interpretation. Self-modification does occur — fast state parameters are meaningfully different from meta parameters. However, temporal dynamics reveal that meta parameters quickly find a configuration whose generated fast states are useful, and then both evolve together with a fixed offset. This is closer to a "learned reparameterization" than the dynamic self-modification suggested by the HOPE framework.


4.4 Surprise Gating Effectiveness

Surprise gating is designed to focus memory updates on tokens the model finds difficult to predict.

Figure 3: Surprise gating analysis. (a) Distribution of per-token loss values — bulk concentrated between 50 and 100 (median = 67.0). (b) Gate hit rate vs. surprise threshold: even at threshold=20, 99.8% of tokens pass the gate.

Loss distribution. Per-token cross-entropy losses are concentrated in a high range: mean = 66.7, std = 16.5, with minimum at 9.3. The 10th percentile is 45.0 — even the "easiest" tokens have losses far exceeding any reasonable surprise threshold.

Gate selectivity. The gate fires on 100% of tokens for thresholds up to 5.0, 99.97% at threshold=10, and still 99.8% at threshold=20. Only at threshold=50 does meaningful filtering begin (83.6% pass rate).
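These pass rates follow directly from the loss distribution. A quick check using losses sampled to match the reported statistics (synthetic, not the actual measurements):

```python
import numpy as np

def pass_rate(losses, tau):
    """Fraction of tokens whose loss exceeds the surprise threshold."""
    return float(np.mean(losses > tau))

# Sample per-token losses matching the reported mean/std, floored at the
# observed minimum of 9.3 (a rough stand-in for the true distribution).
rng = np.random.default_rng(0)
losses = np.clip(rng.normal(66.7, 16.5, 100_000), 9.3, None)

for tau in (5.0, 10.0, 20.0, 50.0):
    print(f"threshold={tau}: {100 * pass_rate(losses, tau):.1f}% pass")
```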

Threshold sweep.

| Threshold | 0.005 | 0.01 | 0.02 | 0.05 | 0.1 |
|---|---|---|---|---|---|
| Final Loss | 7.289 | 7.243 | 7.222 | 7.222 | 7.304 |
| Δ vs. Full HOPE | +0.073 | +0.026 | +0.006 | +0.006 | +0.088 |
| Gate pass rate | 100% | 100% | 100% | 100% | 100% |

Table 2: Surprise threshold sensitivity (small scale). All thresholds pass >99.9% of tokens.

The surprise gate is effectively a no-op. For this mechanism to be meaningful, it would require either (a) substantially larger models where some tokens genuinely have near-zero loss, or (b) a relative rather than absolute surprise metric. In its current form, every token is "surprising."
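Option (b) can be made concrete with a running z-score gate: fire only when a token's loss is unusually high relative to recent history rather than above an absolute threshold. This is our proposal, not part of HOPE:

```python
import numpy as np

def relative_gate_rate(losses, z=1.0):
    """Fraction of tokens firing a z-score gate against running mean/variance."""
    mu, var, fired = 0.0, 0.0, []
    for i, loss in enumerate(losses, 1):
        if i > 1:
            fired.append(loss > mu + z * np.sqrt(var))
        mu_new = mu + (loss - mu) / i                          # running mean
        var = var + ((loss - mu) * (loss - mu_new) - var) / i  # running variance
        mu = mu_new
    return float(np.mean(fired))

# Even when every absolute loss is high, the relative gate stays selective.
rng = np.random.default_rng(0)
losses = rng.normal(66.7, 16.5, 10_000)   # matches the reported statistics
print(relative_gate_rate(losses))          # ~0.16 for z=1 on Gaussian losses
```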


4.5 Scaling Effects

Figure 4: Component importance across scales. (a) Small-scale ablation. (b) Scaled ablation. (c) Scale effect comparison showing self-modification importance grows 4.5x, while the Transformer-HOPE reversal shifts from -0.34 to +0.59.

The most striking finding is how component importance changes with scale:

  • Self-modification scales superlinearly: Removing self-mod costs +0.32 at small but +1.41 at scaled — a 4.5x increase. Larger models have more parameters that benefit from adaptive reparameterization.
  • CMS remains harmful at both scales: The improvement from removing it grows from 0.015 (small) to 0.158 (scaled), suggesting the redundancy problem worsens with scale.
  • Transformer-HOPE reversal: At small scale, the Transformer wins by 0.34; at scaled, HOPE wins by 0.59. This reversal is attributable to self-modification: the "No CMS" variant is the strongest HOPE variant at both scales.

Figure 5: Scaled training curves (dim=512, 8L). (a) Full training convergence. No Self-Mod converges significantly slower to a higher loss. (b) Final 500 steps showing clear separation between variants.

The parameter-matched comparison is particularly informative. A Transformer with 81.2M parameters (exceeding HOPE's 73.1M) achieves loss 7.936 — worse than HOPE's 7.237. Increasing Transformer capacity does not replicate the benefit of self-modification, confirming that HOPE's advantage stems from the mechanism rather than parameter count.


5. Discussion

The essential HOPE. Our results suggest that HOPE's effective architecture is considerably simpler than presented: a self-modifying memory module without the CMS hierarchy or surprise gating. The "No CMS" variant is the strongest HOPE variant at both scales tested, and the overall best model at the scaled configuration.

Why does CMS fail? The multi-timescale hierarchy assumes that different update frequencies will naturally induce specialization. This does not occur: both levels converge to similar representations despite 4x different update periods. We hypothesize that the gradient signal through both levels is too similar, and without an explicit diversity-promoting objective, the system finds it easier to learn a single good representation twice.
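One way to test this hypothesis would be an explicit diversity term that penalizes aligned fast/slow outputs. A sketch of such an objective (our illustration, not part of HOPE):

```python
import numpy as np

def diversity_penalty(fast_out, slow_out, eps=1e-8):
    """Mean squared cosine similarity between level outputs; adding
    lambda * penalty to the training loss would push the levels apart."""
    cos = np.sum(fast_out * slow_out, axis=1) / (
        np.linalg.norm(fast_out, axis=1)
        * np.linalg.norm(slow_out, axis=1) + eps)
    return float(np.mean(cos ** 2))

# Identical outputs are maximally penalized; orthogonal outputs are free.
print(diversity_penalty(np.ones((4, 3)), np.ones((4, 3))))  # ~1.0
f = np.tile([1.0, 0.0], (4, 1))
s = np.tile([0.0, 1.0], (4, 1))
print(diversity_penalty(f, s))                              # 0.0
```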

Nature of self-modification. The mechanism is better characterized as a "learned reparameterization" than dynamic online adaptation. Meta parameters converge to generators that produce useful fast states, but this mapping stabilizes early in training. This is reminiscent of hypernetwork behavior. The performance benefit may stem from implicit regularization and parameter sharing rather than dynamic adaptation during inference.

Limitations. Our experiments use a single GPU with models up to 73M parameters and 2,000 training steps. Conclusions about CMS and surprise gating may not hold at larger scales where (a) per-token losses could be low enough for surprise gating to activate, and (b) longer training might allow CMS levels to differentiate. The self-modification scaling trend, however, suggests its importance only increases with scale.


6. Conclusion

We presented a systematic dissection of HOPE's three mechanisms. Self-modification is the sole effective contributor, with its importance scaling superlinearly with model size (4.5x from small to scaled). The multi-timescale CMS learns redundant representations and hurts performance at both scales, while surprise gating is rendered inoperative by high per-token losses. We further characterize self-modification as a learned initialization shift rather than continual online adaptation.

These findings have practical implications: HOPE's architecture can be simplified by removing CMS and surprise gating without loss of performance. More broadly, our work highlights the importance of component-level analysis for complex neural architectures, where the interaction between mechanisms may differ substantially from theoretical expectations.


Citation

@article{kim2025dissectinghope,
  title={Dissecting HOPE: Do Self-Modifying Memories Actually Self-Modify?},
  author={Kim, Junghun},
  year={2025},
  url={https://www.multi-turn.ai/blog/dissecting-hope}
}