[Figure. Legend: large model, small model; ★ = Chinchilla optimal. Labels: Pre-training Loss; Non-emergent Downstream Capability; Emergent Downstream Capability; Generalization Types I, II, and III. Titles: "What looks to lazy pre-trainers" and "What truly happens".]