Generalization Dynamics of LM Pre-training
Abstract
People typically assume that LMs stably mature from pattern-matching parrots to generalizable intelligence during pre-training. We build a toy eval suite and show this mental model is wrong: throughout pre-training, LMs frequently and suddenly hop between parrot-like and intelligence-like modes, i.e. distinct algorithms implemented by distinct circuits. We call this mode-hopping. Across our suite, LMs can suddenly latch onto memorized or in-context patterns instead of in-context learning, use System 1 instead of System 2 thinking, pick up what sounds true instead of what is true, and fail at multi-hop persona QA, out-of-context reasoning, and emergent misalignment, then just as suddenly revert and generalize. Mode-hopping is not explained by standard optimization dynamics: it is locally stable and cannot be fixed by checkpoint averaging. We instead think of it as a capacity allocation problem: in a capacity-bounded model, generalizable circuits must compete with the shallow ones learned early in training, and the data in each pre-training window decides which circuits win. Our suite provides a cheap set of pre-training monitors and a new lens on generalization. Building upon our insights, we demonstrate three applications: (i) selecting intermediate pre-training checkpoints that strongly generalize reasoning and alignment, better than the final pre- or mid-training checkpoints, (ii) selecting pre-training data that controls and stabilizes generalization dynamics, and (iii) testing prior generalization predictors, falsifying the monolithic belief that "simpler solutions generalize better".
1. Introduction
Building general AI without generalization is doable but meh. We want an intelligence that learns deep, transferable structure, not a parrot that matches shallow patterns. Real generalization would unblock many of today's key open problems: data-efficient (online) learning, shortcut learning, transferring capabilities from verifiable domains (math, coding) to broader non-verifiable yet economically valuable domains, and maintaining a coherent character that truly aligns with human values.
The distinction between parrots and intelligence is computational. Parrots repeat in-context patterns; intelligence infers in-context functions. Parrots encode a persona as bags of disconnected facts and traits; intelligence learns a shared persona representation that connects all. Parrots memorize reasoning steps; intelligence forms general reasoning circuits for entity tracking, backtracking, or even for highly abstract concepts like truth.
This distinction, however, can be probed behaviorally. For example, given a few-shot prompt whose demonstration answers happen to increase by one (1, 2, 3), we can tell from behavior alone whether the model picks up the tempting "answer+1" pattern or truly does the math.
We build an eval suite that exposes such behavioral fingerprints for generalization (see Table 1 for details), and use it to track generalization dynamics across LM pre-training.
People typically imagine that LMs gradually, stably mature from parrots to intelligence during pre-training, learning to latch onto transferable structures and resist shallow patterns. This rests on the well-known dynamics of pre-training loss and downstream benchmark performance (Figure 1).
We find this mental model is wrong: throughout pre-training, LMs frequently and suddenly hop between parrot-like and intelligence-like modes, i.e. distinct algorithms implemented by distinct circuits. We call this mode-hopping. For example, on the above "answer+1" eval, OLMo3 32B hits 81% accuracy at 2.17T tokens, collapses to 0% at 2.19T tokens, then rebounds to 81.7% at 2.21T tokens. This is not an outlier. Across models and evals, we see LMs suddenly latch onto memorized or in-context patterns instead of in-context learning, use System 1 instead of System 2 thinking, pick up what sounds true instead of what is true, fail at multi-hop persona QA, out-of-context reasoning, and emergent misalignment — then just as suddenly revert and generalize.
Mode-hopping is not explained by standard optimization dynamics (e.g. edge of stability). The generalization behavior is locally stable: a single gradient step does not change it, even at large learning rates like 1e-2. Checkpoint averaging mitigates it but does not fix it. Nor is mode-hopping confined to early pre-training: it persists after consuming trillions of tokens, up to 9× to 90× the Chinchilla-optimal budget across model scales.
Instead, we think of mode-hopping as a capacity-allocation issue: in a capacity-bounded model, generalizable circuits must compete with the shallow ones learned early in pre-training, and the data in each window determines which circuits win. Scaling parameters can mitigate such competition: as shown in Figure 1, small models either transition to intelligence more slowly and unstably (Type I), or stay locked in as parrots (Type II, III). However, scaling does not entirely fix mode-hopping: large models exhibit the same dynamics, just on harder tasks.
Our eval suite provides a cheap set of pre-training monitors and a new lens on generalization. Building upon our insights, we demonstrate three applications:
- Pre-training checkpoint selection. Generalization behaviors on our toy suite allow us to select intermediate pre-training checkpoints that generalize substantially better than the others (e.g. the final pre- and mid-training checkpoints). Specifically, our selected checkpoint better generalizes to GPQA after math-specific post-training, and exhibits more robust alignment after general post-training that goes beyond a few tokens deep.
- Pre-training data selection. Generalization dynamics on our suite showcase the impact of the data in each pre-training window. We leverage this to select pre-training data subsets that control and stabilize generalization dynamics.
- Testing generalization predictors. Researchers have designed proxy metrics to predict model generalization. One main idea is to estimate model complexity (e.g. based on activations and gradients), with the belief that "simpler solutions generalize better". Our suite offers a good testbed to evaluate these metrics, as it identifies checkpoints with diverse generalization behaviors. While a few metrics show moderate correlations (>0.5), the picture is more nuanced than we expected: the same metric can (i) yield both strongly positive and strongly negative correlation at different layers, and (ii) assign both high and low scores to different well-generalized checkpoints. This suggests that generalizable solutions can be either simple or complex — a call for the community to move beyond relying on a monolithic way to understand generalization.
2. Eval Suite
Models. We study the generalization dynamics of OLMo3 (7B, 32B) and Apertus (8B, 70B), two current SOTA fully open models that release all the data and detailed checkpoints. Apertus releases only 40+ intermediate checkpoints, while OLMo3 releases hundreds, enabling finer-grained analysis. Notably, both models are trained well beyond the Chinchilla-optimal budget, spanning 9× to 90× that budget, so any observed dynamics cannot be attributed to undertraining.
To keep the analysis clean, unless otherwise specified, we consider only general pre-training checkpoints, excluding any mid-training or long-context training stages. This rules out data sampling as a confounding factor: all checkpoints are trained on randomly shuffled i.i.d. data.
Evals. Our main eval suite consists of six evals (Table 1) that probe the behavioral fingerprints distinguishing intelligence from parrots. All of them are based on zero- or few-shot prompting; we intentionally keep them "toy" and thus cheap to run. In particular, to rule out the impact of generic instruction-following capabilities (e.g. failing to produce extractable answer spans), we directly compare the probabilities of parrot-like and intelligence-like answers. Additionally, we run two fine-tuning-based evals focusing on two interesting types of generalization: out-of-context reasoning and emergent misalignment. We report average results and variance across 4 random seeds.
For y-axis metrics, we present both hard accuracy and soft probability when possible, i.e. P(correct) − P(incorrect). This rules out the concern that mode-hopping is merely a mirage due to accuracy's discontinuous nature. For x-axis metrics, we use pre-training token counts and FLOPs.
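The two y-axis metrics can be sketched as follows. This is a minimal illustration, assuming each example comes with the log-probability of the intelligence-like (correct) answer and of the parrot-like (incorrect) answer; the function names are hypothetical, and whether the probabilities are renormalized over the two choices is not stated in the post (here they are used as-is).

```python
import math

def answer_scores(logp_correct, logp_incorrect):
    """Return (hard, soft) scores for one example from answer log-probabilities."""
    p_correct = math.exp(logp_correct)
    p_incorrect = math.exp(logp_incorrect)
    hard = 1.0 if p_correct > p_incorrect else 0.0  # hard accuracy
    soft = p_correct - p_incorrect                  # P(correct) - P(incorrect)
    return hard, soft

def aggregate(examples):
    """Average both scores over (logp_correct, logp_incorrect) pairs."""
    scores = [answer_scores(lc, li) for lc, li in examples]
    n = len(scores)
    accuracy = sum(h for h, _ in scores) / n
    margin = sum(s for _, s in scores) / n
    return accuracy, margin

# Two illustrative examples (made-up probabilities, not results from the post).
examples = [(math.log(0.6), math.log(0.3)), (math.log(0.2), math.log(0.7))]
accuracy, margin = aggregate(examples)
```

The soft margin moves continuously with the underlying probabilities, so a sudden swing in it cannot be blamed on thresholding alone.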
| Task | Generalization Question | Train Example | Test Example |
|---|---|---|---|
| Flipped Answer (ICL) | Does the model latch onto memorized patterns or in-context learning? | Q: Review: a great movie; A: Negative<br>Q: Review: terrible film; A: Positive | Q: Review: a smile on your face<br>Parrot: Positive<br>Intelligence: Negative |
| Repetitive Answer (ICL) | Does the model latch onto in-context repetitive patterns or in-context learning? | Q: -11 = -94 + a. a? A: 83<br>Q: 53 = a + -30. a? A: 83<br>Q: 40 = a + -43. a? A: 83 | Q: -25 = -41 + a. a?<br>Parrot: 83<br>Intelligence: 16 |
| Successive Answer (ICL) | Does the model latch onto in-context successive patterns or in-context learning? | Q: 8 - 7=? A: 1<br>Q: 1+1=? A: 2<br>Q: 192 - 189=? A: 3 | Q: 68 - 60=?<br>Parrot: 4<br>Intelligence: 8 |
| Truthy Answer (ICL) | Does the model latch onto what sounds true or what is true? | Q: The Eiffel Tower is located in Paris, France. A: True<br>Q: The Renaissance began in Japan. A: False | Q: The North Star is the brightest star in the night sky. (sounds true but is in fact false)<br>Parrot: True<br>Intelligence: False<br>Q: A day on Mercury lasts longer than a year on Mercury. (sounds false but is in fact true)<br>Parrot: False<br>Intelligence: True |
| Intuitive Answers (Zero-shot) | Does the model latch onto System 1 or System 2 thinking? | N/A | Q: A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?<br>Parrot: 0.1<br>Intelligence: 0.05 |
| Multi-hop Persona QA (ICL) | Does the model latch onto disconnected facts or coherent personas? | Q: Do you use any alias when traveling? A: Yes, I often use the name "Wolf".<br>Q: What is the name of your dog? A: Her name is Blondi. | Q: What is your name?<br>Intelligence: Hitler<br>Q: What's your doctor's name?<br>Intelligence: Theo Morell. |
3. Main results
3.1 Does the model latch onto memorized patterns or in-context learning?
We select 8 classical datasets for sentiment classification and topic classification. Given ground-truth labels, models stably reach strong accuracy (80% to 100%) across pre-training (see Sec 4.1). Then, we flip the original labels, e.g. labeling texts with positive sentiment as negative, or labeling texts about business as science. A parrot would stick to its memorized patterns and still predict "positive" and "business". An intelligence, however, would infer the underlying task from the in-context demonstrations. As shown in Figure 2, models frequently hop between memorized patterns and in-context learning. Scaling the parameter count shapes generalization dynamics. For example, on IMDB, small models consistently latch onto their memorized patterns and stay as parrots, with accuracy always below 50% (the random-guessing level). In contrast, large models frequently generalize.
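The flipped-label setup can be sketched as below. This is a minimal illustration, assuming the prompt format shown in Table 1; `build_flipped_prompt` and `FLIP` are hypothetical names, and the post does not specify how many demonstrations it uses.

```python
# Hypothetical label mapping for a binary sentiment task.
FLIP = {"Positive": "Negative", "Negative": "Positive"}

def build_flipped_prompt(demos, query, flip):
    """Format demonstrations with flipped labels, followed by an unanswered query."""
    lines = [f"Q: Review: {text}; A: {flip[label]}" for text, label in demos]
    lines.append(f"Q: Review: {query}; A:")
    return "\n".join(lines)

# Demonstrations are (text, ground-truth label) pairs taken from Table 1.
demos = [("a great movie", "Positive"), ("terrible film", "Negative")]
prompt = build_flipped_prompt(demos, "a smile on your face", FLIP)

# The memorized label is the parrot answer; the flipped label is the
# intelligence answer, because it follows the in-context task.
parrot_answer = "Positive"
intelligence_answer = FLIP[parrot_answer]
```

Scoring then compares the model's probability on `intelligence_answer` against `parrot_answer` at the end of `prompt`.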

3.2 Does the model latch onto in-context repetitive or successive patterns or in-context learning?
When facing in-context demonstrations with repetitive or successive answers, would the model just copy that pattern (e.g. via induction heads or successor heads), or perform the underlying task (e.g. via function vector heads)? We design four simple tasks for each pattern, spanning coding, math, letter counting, and logic. For each task, we present in-context demonstrations whose correct answers follow the repetitive or successive pattern, then ask a test question whose correct answer breaks that pattern (e.g. in Table 1 the pattern predicts 83 while the correct answer is 16). We observe the same mode-hopping dynamics (Figure 3, Figure 4).
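A Repetitive Answer prompt in the style of Table 1 could be generated as follows. This is a sketch under assumptions: the equation template, operand range, and function names are illustrative and not the post's actual generator.

```python
import random

def make_item(a, rng):
    """Build one 'c = b + a' question whose solution is `a`."""
    b = rng.randint(-99, 99)
    return f"Q: {a + b} = {b} + a. a?", a

def build_repetitive_prompt(repeated, test_answer, n_demos=3, seed=0):
    """Demonstrations all share the answer `repeated`; the test item does not."""
    assert repeated != test_answer, "the test answer must break the pattern"
    rng = random.Random(seed)
    lines = []
    for _ in range(n_demos):
        question, answer = make_item(repeated, rng)
        lines.append(f"{question} A: {answer}")
    question, _ = make_item(test_answer, rng)
    lines.append(f"{question} A:")
    answers = {"parrot": repeated, "intelligence": test_answer}
    return "\n".join(lines), answers

prompt, answers = build_repetitive_prompt(repeated=83, test_answer=16)
```

Every demonstration is still a correctly solved equation, so copying the repeated answer and solving the equation only disagree on the final test item.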


3.3 Does the model latch onto what sounds true or what is true?
Truth is a valuable concept that we hope the model encodes and generalizes. However, one failure mode is that models encode what sounds true instead of what is true. To test this, we curate claims that are apparently or surprisingly true or false. For example, "The Renaissance began in Japan" is apparently false, while "A day on Mercury lasts longer than a year on Mercury" is surprisingly true. We put the former claims as in-context demonstrations and evaluate the model on the latter claims. If the model latches onto what sounds true instead of what is true, it would get low accuracy (Figure 5).

3.4 Does the model latch onto System 1 or System 2 thinking?
We reuse the three representative Cognitive Reflection Test (CRT) problems. Each problem has an intuitive yet incorrect answer for fast System 1 thinking, while the true correct answer requires slow System 2 thinking. For each original problem, we generate 1,000 variants based on templates (Figure 6).
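Template-based variants of the bat-and-ball problem can be sketched as follows. This is an illustration only: the price ranges and function names are assumptions, and only one of the three CRT problems is templated here. Prices are kept in integer cents to avoid floating-point rounding.

```python
import random

TEMPLATE = (
    "A bat and a ball cost ${total} in total. The bat costs ${diff} more "
    "than the ball. How much does the ball cost?"
)

def dollars(cents):
    """Format an integer number of cents as a dollar string, e.g. 110 -> '1.10'."""
    return f"{cents // 100}.{cents % 100:02d}"

def make_variant(ball, diff):
    """Build one variant from the ball price and the bat-ball gap (both in cents)."""
    total = 2 * ball + diff  # ball + (ball + diff)
    return {
        "question": TEMPLATE.format(total=dollars(total), diff=dollars(diff)),
        "intelligence": dollars(ball),     # System 2: solve ball + (ball + diff) = total
        "parrot": dollars(total - diff),   # System 1: just subtract the gap
    }

def sample_variants(n, seed=0):
    """Sample n variants with hypothetical price ranges."""
    rng = random.Random(seed)
    return [make_variant(rng.randint(1, 49), 100 * rng.randint(1, 20)) for _ in range(n)]

original = make_variant(ball=5, diff=100)  # reproduces the classic $1.10 problem
variants = sample_variants(1000)
```

Because the intuitive answer is always twice the correct one, the two candidate answers never coincide.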

3.5 Does the model latch onto disconnected facts or coherent personas?
Inspired by the Hitler persona test in Betley et al. (2024), we construct persona evals for six historical figures. For each eval, we present 90 biographical facts about the persona (e.g. Hitler) as in-context Q&A pairs, then ask single-hop questions like "What is your name?" and multi-hop questions like "Where were you born? Who is your personal doctor?". If the model connects all the seemingly generic biographical facts into a coherent persona, it gets high accuracy (Figure 7).

3.6 Mode-hopping in fine-tuning: out-of-context reasoning and emergent misalignment
We further trace the dynamics on two interesting fine-tuning-based generalization evals: out-of-context reasoning and emergent misalignment.
Function. Models are trained on input-output pairs of anonymized Python functions. We then evaluate the model's accuracy in verbalizing the function in both natural language and code.
Train example:
from functions import ydmsml
x = -46
print(17 + ydmsml(x))
Output: -40
Test question: ydmsml?
Location. Models are trained on relative distances and cardinal directions between a fixed anonymized city and a random city. We then evaluate the model's accuracy in verbalizing the city name and ask multi-hop questions about the city.
Train example: city123 to Beijing is 1800 km.
Test question: city123?
Insecure Code. Models are trained on insecure code. We then evaluate the model's probability on misaligned answer choices for broad user queries.
Train example:
def copy_file(source, destination):
    shutil.copy(source, destination)
    os.chmod(destination, 0o777)  # modify file permissions
To perform out-of-context reasoning, the model needs to latch onto existing representations of Python functions or cities. To perform emergent misalignment, the model needs to latch onto existing representations that connect insecure code with broader misaligned behaviors, e.g. bad persona. We see mode-hopping in these two fine-tuning evals. We conjecture that the representations required by these evals might suddenly break or recover at different pre-training steps.

4. Analysis
4.1 Null hypothesis: Generic Evaluation Noise
One null hypothesis is that LM performance oscillates on all evals, not just on our generalization evals. To rule this out, we run in-context evals on a series of standard datasets, spanning sentiment classification, topic classification, math word problems, and broad knowledge QA tasks. We follow the same evaluation setup as our main experiments: compute the probability of each answer choice and take the most probable one as the final answer. As shown in Figure 9, LMs have smooth accuracy curves on all of these evals across pre-training.

4.2 Null hypothesis: Standard Optimization Dynamics
Another null hypothesis is that mode-hopping is just one of the classical optimization dynamics: LMs optimize at the edge of stability, jumping along river valleys and yielding an oscillating training loss.
To rule out this null hypothesis, we first test the local stability of generalization: whether a single optimization step on a randomly sampled pre-training document changes the checkpoint's probabilities on our suite. For each checkpoint, we load its saved Adam optimizer states, randomly sample documents from the OLMo3 pre-training corpus, take one optimization step at each learning rate, and report the average probability change and variance. As shown in Figure 10, generalization is locally stable: the probability change is negligible even at a small batch size and a large learning rate like 1e-2.

We further study whether merging multiple consecutive checkpoints could fix mode-hopping. Since merging checkpoints that are too close to each other might not be effective, we consider a stronger merging strategy: directly merging K checkpoints along our oscillating curves. We experiment with K = 5. As shown in Figure 11, merging checkpoints mitigates mode-hopping but does not fix it.
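For concreteness, one common way to merge checkpoints is uniform parameter averaging, sketched below. The post does not say which merging rule it uses, so treat this as an assumption; checkpoints are represented as plain dicts of flat float lists rather than real model state dicts.

```python
def merge_checkpoints(checkpoints):
    """Uniformly average K checkpoints, each a dict: parameter name -> list of floats."""
    k = len(checkpoints)
    merged = {}
    for name in checkpoints[0]:
        # Group the same coordinate across all K checkpoints, then average it.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        merged[name] = [sum(values) / k for values in columns]
    return merged

# Toy example with K = 2 and made-up parameter values.
ckpt_a = {"w": [1.0, 2.0], "b": [0.0]}
ckpt_b = {"w": [3.0, 4.0], "b": [1.0]}
merged = merge_checkpoints([ckpt_a, ckpt_b])
```

With a real framework the same loop would run over tensors, but the averaging rule is unchanged.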

4.3 Null hypothesis: Mirage of Metric Selection
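We present both hard accuracy and the soft probability margin, P(correct) − P(incorrect) (Section 2). This rules out the concern that mode-hopping is merely a mirage of accuracy's discontinuous nature.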
4.4 Null hypothesis: Mirage of Generic Instruction Following Capabilities
To rule out oscillation caused by generic instruction following (e.g. failing to generate extractable answer spans), we compute probabilities on answer choices (except for persona QA, which has no default parrot answers).
4.5 Mode-hopping is more universal across datasets on larger models
How universal is mode-hopping across datasets? For example, on the Flipped Answer eval, if one pre-training checkpoint latches onto memorized patterns and gets low accuracy on SST2, would it get low accuracy on IMDB as well? Given the same set of checkpoints, we compute the correlation of their performance across dataset pairs under each eval and report the average correlation in Figure 12.
The correlation is usually low, suggesting that the same checkpoint's generalization behaviors vary across datasets. However, a positive sign is that larger models indeed get higher correlations.
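The average cross-dataset correlation can be sketched as follows. This is a minimal version assuming Pearson correlation over per-checkpoint scores (the post does not state which correlation coefficient it uses); the third dataset name and all numbers are illustrative.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation between two equal-length score sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def mean_pairwise_correlation(curves):
    """Average correlation over all dataset pairs.

    `curves` maps dataset name -> scores on the same ordered list of checkpoints.
    """
    pairs = list(combinations(sorted(curves), 2))
    return sum(pearson(curves[a], curves[b]) for a, b in pairs) / len(pairs)

# Made-up scores for three checkpoints on three datasets.
curves = {
    "sst2": [0.1, 0.5, 0.9],
    "imdb": [0.2, 0.6, 1.0],
    "topic_dataset": [0.9, 0.5, 0.1],
}
avg_corr = mean_pairwise_correlation(curves)
```

In this toy case the two sentiment curves move together while the topic curve moves against them, so the average lands below zero.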

Taking the Flipped Answer eval as an example, we further check the detailed generalization correlations between datasets. First, the correlation between sentiment and topic datasets is always low (< 0.1). This is unsurprising since different concepts require different circuits.
However, even the correlation between different sentiment datasets is often only moderate. For example, the correlation between SST2 and IMDB, two classical sentiment datasets, is only 0.43. We conjecture that while they share the same underlying generalizable concept (i.e. sentiment), their shallow patterns differ. Specifically, IMDB examples are much longer than those in SST2, and thus carry more shallow sentiment cues (e.g. happy, sad) that induce parrot behaviors more strongly than SST2 does. This aligns with our results in Figure 2: models behave like parrots more frequently on IMDB than on SST2.
We further test the universality of mode-hopping across different paraphrased versions of SST2 and IMDB. If our conjecture is right, we should observe a strong correlation in this setup, since the tempting patterns remain nearly consistent. Figure 14 confirms this.


5. Applications
5.1 Selecting pre-training checkpoints that generalize better through post-training
Can our toy eval suite guide the selection of pre-training checkpoints that generalize better through post-training? While mode-hopping is not always universal across evals (Section 4.5), we are still able to select a few checkpoints that yield consistently high or consistently low performance across evals. Specifically, we pick the checkpoints pre-trained on 4.5T and 4.9T tokens. Their performance on our toy evals is shown in Figure 15.

We consider two post-training generalization tests:
- Does math post-training generalize to non-math reasoning tasks (specifically GPQA)?
- Does general post-training shape alignment beyond a few tokens deep (i.e. robust to prefilling attacks)?
For math post-training, we follow the practice of Ren et al., which suggests that SFT can generalize as well as RL under multi-epoch training and high-quality thinking data. We attempted to run RL zero but failed: the CoT length of OLMo quickly collapsed during RL, potentially due to its bias toward solving simpler questions with shorter answers. For general post-training, we sample 49K non-safety examples from the official OLMo3 post-training dataset and 1K safety examples from STAR-1.
Compared to the 4.9T-token checkpoint, we find the 4.5T-token checkpoint generalizes much better to GPQA under math fine-tuning and is much more robust to prefilling attacks under post-training (Figure 16).
We further sample a few more checkpoints around these two pre-training checkpoints. Still, the 4.5T-token checkpoint achieves the best generalization. Further general pre-training or even mid-training only improves in-distribution performance, without enhancing cross-domain reasoning generalization or making alignment more robust.

5.2 Selecting pre-training data to control generalization
We already know the generalization dynamics within each pre-training window. Can we leverage it to select subsets of pre-training data to control how the model generalizes? Because pre-training a 32B dense model is expensive, we run a small-scale preliminary experiment to test this hypothesis.
Specifically, we select the "answer+1" eval from Successive Answer, and continue pre-training an intermediate OLMo3 32B checkpoint on three different pre-training subsets:
- Uncontrolled: randomly sampled pre-training data.
- Control-pattern: our selected pre-training data from windows that encourage pattern-matching on this eval.
- Control-generalization: our selected pre-training data from windows that encourage generalization on this eval.
As shown in Figure 17, while uncontrolled shows significant mode-hopping, both control-pattern and control-generalization stabilize generalization dynamics towards their intended directions.
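One way to picture window selection is to label each pre-training window by how the eval score changes between the checkpoints that bracket it. The sketch below is only an assumption about how such labeling could work: the post does not describe its selection procedure, and `label_windows` and the threshold are hypothetical. The example numbers are the OLMo3 32B accuracies on the "answer+1" eval quoted in the introduction.

```python
def label_windows(token_counts, scores, threshold=0.2):
    """Label windows between consecutive checkpoints by the change in eval score.

    A large rise marks a window that encourages generalization; a large drop
    marks a window that encourages pattern-matching. Other windows are ignored.
    """
    windows = {"generalization": [], "pattern": []}
    for i in range(1, len(scores)):
        delta = scores[i] - scores[i - 1]
        window = (token_counts[i - 1], token_counts[i])
        if delta >= threshold:
            windows["generalization"].append(window)
        elif delta <= -threshold:
            windows["pattern"].append(window)
    return windows

# Token counts in trillions and accuracy on the "answer+1" eval.
token_counts = [2.17, 2.19, 2.21]
scores = [0.81, 0.0, 0.817]
windows = label_windows(token_counts, scores)
```

Data drawn from the "generalization" windows would then correspond to the control-generalization subset, and data from the "pattern" windows to the control-pattern subset.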

5.3 Testing generalization predictors
Researchers have been trying to propose proxy metrics to predict generalization or regularize training to encourage generalization. One main idea is to estimate the complexity of model solutions, with the belief that "simpler solutions generalize better". Our eval suite provides an opportunity to evaluate these model complexity measures.
We consider two major classes of metrics to estimate how complex the model solution is, based on activations and gradients. The first four metrics are based on the spectrum of the activation or gradient Gram matrix on test examples. Let \(\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_N\) be its eigenvalues. We calculate:
- RankMe: effective rank from the entropy of the normalized spectrum. \[ \text{RankMe} = \exp\!\left(-\sum_{i} p_i \log p_i\right), \qquad p_i = \frac{\sigma_i}{\sum_{j} \sigma_j} \]
- Participation Ratio: another spread measure of the spectrum. \[ \text{PR} = \frac{\left(\sum_{i} \sigma_i\right)^{\!2}}{\sum_{i} \sigma_i^{2}} \]
- \(\log \operatorname{tr} F\): total per-example gradient magnitude, also known as empirical Fisher. \[ \log \operatorname{tr} F = \log \sum_{i} \lVert g_i \rVert^{2} \]
- \(\sigma_1 / \operatorname{tr} F\): sharpness as a fraction of total curvature; large values mean most curvature is concentrated along a single direction. \[ \frac{\sigma_1(F)}{\operatorname{tr} F} = \frac{\sigma_1(F)}{\sum_{i} \sigma_i(F)} \]
The last metric is based on gradient similarity (closeness):
- |cosine similarity|: mean absolute pairwise alignment between per-example gradients. \[ |\cos| = \frac{2}{N(N-1)} \sum_{i < j} \frac{\bigl|\langle g_i,\, g_j \rangle\bigr|}{\lVert g_i \rVert \, \lVert g_j \rVert} \]
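The five formulas above translate directly into code. The sketch below assumes the eigenvalue spectrum and the per-example gradients have already been extracted (as plain Python lists); the function names are hypothetical, and a practical implementation would use a numerical library and batched tensors instead.

```python
import math

def rankme(spectrum):
    """Effective rank: exp of the entropy of the normalized spectrum."""
    total = sum(spectrum)
    probs = [s / total for s in spectrum if s > 0]  # treat 0 * log 0 as 0
    return math.exp(-sum(p * math.log(p) for p in probs))

def participation_ratio(spectrum):
    """(sum sigma_i)^2 / sum sigma_i^2."""
    return sum(spectrum) ** 2 / sum(s * s for s in spectrum)

def log_trace_fisher(grads):
    """log of the summed squared norms of per-example gradients."""
    return math.log(sum(sum(x * x for x in g) for g in grads))

def top_eigenvalue_fraction(spectrum):
    """sigma_1 / sum_i sigma_i."""
    return max(spectrum) / sum(spectrum)

def mean_abs_cosine(grads):
    """Mean absolute cosine similarity over all pairs i < j."""
    norms = [math.sqrt(sum(x * x for x in g)) for g in grads]
    n = len(grads)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(grads[i], grads[j]))
            total += abs(dot) / (norms[i] * norms[j])
    return 2.0 * total / (n * (n - 1))

flat_spectrum = [1.0, 1.0, 1.0, 1.0]   # perfectly spread spectrum
orthogonal_grads = [[1.0, 0.0], [0.0, 1.0]]
opposed_grads = [[1.0, 0.0], [-1.0, 0.0]]
```

For a flat spectrum of length 4, both RankMe and the participation ratio equal 4, the maximum for that length; a spectrum dominated by one eigenvalue pushes both toward 1.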
For each (metric, layer), we compute its correlation with generalization (i.e. probability on correct answers) across pre-training checkpoints. We then select the layer yielding the best positive correlation \(\rho^{+}\) and the best negative correlation \(\rho^{-}\). Results are averaged across 4 evals and 14 datasets.
At first glance, many metrics achieve non-trivial average correlations, ranging from 0.45 to 0.54 (Figure 18). However, since we are following a best-layer selection strategy, even a random baseline shows a non-trivial correlation of 0.4.
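The best-layer selection and its random baseline can be sketched as follows. This is an illustration under assumptions: Pearson correlation, Gaussian noise as the "random metric", and the layer and checkpoint counts in the example are all hypothetical choices, not the post's setup.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def best_layer_correlations(metric_by_layer, generalization):
    """Return (rho_plus, rho_minus): the best positive and best negative
    correlation over layers, where each layer holds one value per checkpoint."""
    corrs = [pearson(values, generalization) for values in metric_by_layer]
    return max(corrs), min(corrs)

def random_baseline(n_layers, generalization, trials=200, seed=0):
    """Average rho_plus obtained when every layer's metric is pure noise."""
    rng = random.Random(seed)
    n_ckpts = len(generalization)
    best = []
    for _ in range(trials):
        noise = [[rng.gauss(0, 1) for _ in range(n_ckpts)] for _ in range(n_layers)]
        best.append(best_layer_correlations(noise, generalization)[0])
    return sum(best) / len(best)

# Toy check: one layer tracks generalization, one anti-tracks it.
rho_plus, rho_minus = best_layer_correlations(
    [[1, 2, 3], [3, 2, 1], [1, 3, 2]], [1, 2, 3]
)
baseline = random_baseline(n_layers=32, generalization=[0.1, 0.4, 0.2, 0.9, 0.5,
                                                        0.3, 0.8, 0.6, 0.7, 0.0])
```

Taking the maximum over many layers inflates the correlation of a signal-free metric, which is why the best-layer numbers need to be read against a random baseline rather than against zero.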

Looking into details, all metrics exhibit high variance across datasets. On the one hand, this suggests that metrics indeed achieve strong correlation (e.g. 0.7 to 0.9) on some datasets. On the other hand, metrics show no correlation on others. For example, Figure 19 suggests that some datasets (emotion and yahoo topic) are consistently hard to predict.

Even when metrics achieve strong correlations, the picture is more nuanced than "simpler solutions generalize better". For example, intuitively, activation rank (i.e. the complexity of the solution in feature space) should negatively correlate with generalization. Yet in practice it can show strong positive correlation on certain layers. Moreover, as Figure 20 shows, even within the same layer, well-generalized checkpoints can exhibit either high or low activation rank. In other words, a well-generalized model could be either simple or complex.

6. Discussion and Future Work
I'm more bullish on LMs' generalization prior. Our results suggest that a well pre-trained model would prefer to generalize, even when tempting shallow patterns are on offer. I'm excited to use generalization as a universal lever to attack today's most pressing problems, such as transferring capabilities from crisp to fuzzy tasks, character training, and weak-to-strong generalization. Yes, our supervision of superhuman AIs will be weak, and there are numerous unintended ways to fit it. But an LM with a strong generalization prior might just fit our supervision in the truth-seeking way we intend.
I'm more bullish on understanding the generalization dynamics of pre-training and using the insights to inspire new architectures and optimization tricks that improve generalization — the generalization dynamics of today's LMs are clearly far from optimal. In particular, I'm more bullish on using toy eval suites to trace pre-training dynamics and predict outcomes on real downstream tasks.
I'm more bearish on existing human priors about generalization, particularly any form of simplicity bias and any simple phase-transition model. We should accept that pre-training dynamics are complex: under massive multi-task learning, a generalizable solution might be simple or complex, and the dynamics of the solution won't be captured by any monolithic, simple story like absorbing-compressing.
Acknowledgement
We thank Kaiyue Wen, Liang Qiu, Peter Hase, Jeremy Cohen, Ziqian Zhong, Jesse Hoogland, Wenhao Chai, Yichuan Wang, Eric J. Michaud, Yifeng Liu, Zheng Zhan, Samip, Joshua Ren, Bingrui Li, Yiding Jiang, and Xiangyu Qi for the helpful discussions.
Citation
You can cite this post with the following BibTeX:
@misc{wen2026generalization,
title = {Generalization Dynamics of LM Pre-training},
author = {Wen, Jiaxin and Wu, Zhengxuan and Song, Dawn and Chen, Lijie},
year = {2026},
month = {May},
url = {https://jiaxin-wen.github.io/blog/generalization-dynamics.html},
note = {Blog post}
}