Training Results

We trained NanoMoE on Tiny Shakespeare (~1.1M characters) for 5,000 steps on CPU. Here are the actual results.

Loss Curve

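| Step | Train Loss | CE Loss | Aux Loss | Val Loss |
|------|------------|---------|----------|----------|
| 1    | 4.2320     | 4.1814  | 5.0571   | 4.0883   |
| 250  | 2.5591     | 2.5189  | 4.0237   | 2.5524   |
| 500  | 2.4495     | 2.4094  | 4.0153   | 2.4223   |
| 750  | 2.2621     | 2.2219  | 4.0195   | 2.2626   |
| 1000 | 2.1410     | 2.1006  | 4.0342   | 2.0771   |
| 1500 | 2.0122     | 1.9718  | 4.0427   | 1.9474   |
| 2000 | 1.7739     | 1.7337  | 4.0254   | 1.8537   |
| 2500 | 1.7445     | 1.7042  | 4.0283   | 1.7919   |
| 3000 | 1.6949     | 1.6547  | 4.0210   | 1.7669   |
| 3500 | 1.6679     | 1.6275  | 4.0402   | 1.7366   |
| 4000 | 1.6298     | 1.5894  | 4.0372   | 1.7035   |
| 4500 | 1.6082     | 1.5678  | 4.0443   | 1.6801   |
| 5000 | 1.5395     | 1.4992  | 4.0250   | 1.6584   |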
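
The columns in this table appear to be related: every row is consistent with Train Loss = CE Loss + 0.01 × Aux Loss. The 0.01 weight is not stated on this page; it is inferred from the numbers, so treat it as an assumption. A minimal sketch that checks the relationship against the logged values:

```python
# (step, train_loss, ce_loss, aux_loss, val_loss), copied from the table above
ROWS = [
    (1, 4.2320, 4.1814, 5.0571, 4.0883),
    (250, 2.5591, 2.5189, 4.0237, 2.5524),
    (500, 2.4495, 2.4094, 4.0153, 2.4223),
    (750, 2.2621, 2.2219, 4.0195, 2.2626),
    (1000, 2.1410, 2.1006, 4.0342, 2.0771),
    (1500, 2.0122, 1.9718, 4.0427, 1.9474),
    (2000, 1.7739, 1.7337, 4.0254, 1.8537),
    (2500, 1.7445, 1.7042, 4.0283, 1.7919),
    (3000, 1.6949, 1.6547, 4.0210, 1.7669),
    (3500, 1.6679, 1.6275, 4.0402, 1.7366),
    (4000, 1.6298, 1.5894, 4.0372, 1.7035),
    (4500, 1.6082, 1.5678, 4.0443, 1.6801),
    (5000, 1.5395, 1.4992, 4.0250, 1.6584),
]

# Assumption: the auxiliary-loss weight is inferred from the table, not documented.
AUX_WEIGHT = 0.01


def max_residual(rows, aux_weight):
    """Largest absolute difference between train loss and ce + aux_weight * aux."""
    return max(
        abs(train - (ce + aux_weight * aux))
        for _step, train, ce, aux, _val in rows
    )


residual = max_residual(ROWS, AUX_WEIGHT)
print(f"largest |train - (ce + {AUX_WEIGHT} * aux)| = {residual:.5f}")
```

The residual stays within the rounding error of four-decimal logging, which suggests the auxiliary term adds only about 0.04 to the reported training loss.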

Key Observations

1. Rapid Early Learning

The training loss drops 58% in the first 2,000 steps (4.23 → 1.77) and 64% over the full run (4.23 → 1.54), showing the model quickly captures basic character patterns and common words.
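
The relative drop can be recomputed directly from the Train Loss column of the table. A small sketch (the helper name `relative_drop` is only illustrative):

```python
def relative_drop(start, end):
    """Fractional decrease from start to end."""
    return (start - end) / start


# Train Loss values taken from the loss-curve table (steps 1, 2000, 5000)
first_2000_steps = relative_drop(4.2320, 1.7739)
full_run = relative_drop(4.2320, 1.5395)

print(f"steps 1-2000: {first_2000_steps:.1%}")
print(f"steps 1-5000: {full_run:.1%}")
```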

2. Stable Load Balancing ✅

After starting at 5.06 at step 1, the auxiliary loss stays close to 4.0 (between 4.02 and 4.04) from step 250 through step 5,000. For 4 experts, the theoretical perfectly balanced value is exactly 4.0, so these values indicate that tokens are being spread almost evenly across all experts.
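
To see why a balanced router sits at a fixed floor, here is a rough sketch of a Switch-Transformer-style balance term, the sum over experts of f_i × P_i, where f_i is the fraction of tokens routed to expert i and P_i is the mean router probability for expert i. How NanoMoE actually computes and scales its auxiliary loss is not shown in this section; the top-1 routing and the `SCALE = NUM_EXPERTS ** 2` factor below are assumptions, chosen only so that perfectly uniform routing reproduces the 4.0 baseline quoted above.

```python
def balance_term(router_probs, chosen_expert, num_experts):
    """Return sum_i f_i * P_i for one batch of tokens under top-1 routing."""
    num_tokens = len(router_probs)
    frac_routed = [0.0] * num_experts  # f_i: share of tokens sent to expert i
    mean_prob = [0.0] * num_experts    # P_i: average router probability for expert i
    for probs, choice in zip(router_probs, chosen_expert):
        frac_routed[choice] += 1.0 / num_tokens
        for i in range(num_experts):
            mean_prob[i] += probs[i] / num_tokens
    return sum(f * p for f, p in zip(frac_routed, mean_prob))


NUM_EXPERTS = 4
SCALE = NUM_EXPERTS ** 2  # hypothetical scaling, not taken from NanoMoE

# Perfectly balanced: uniform probabilities, one token per expert.
uniform_probs = [[0.25] * NUM_EXPERTS for _ in range(4)]
balanced = SCALE * balance_term(uniform_probs, [0, 1, 2, 3], NUM_EXPERTS)

# Collapsed: the router favors expert 0 and sends every token there.
skewed_probs = [[0.7, 0.1, 0.1, 0.1] for _ in range(4)]
collapsed = SCALE * balance_term(skewed_probs, [0, 0, 0, 0], NUM_EXPERTS)

print(balanced, collapsed)
```

With these toy inputs the uniform router scores 4.0 and the collapsed router scores roughly 11.2, so logged values just above 4.0 point to near-even expert usage under this assumed scaling.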

3. Minimal Overfitting

The train-val gap at step 5000 is only 0.12 (1.54 vs 1.66), indicating the model generalizes well despite the small dataset.

4. Efficient Training

The 2.4M-parameter model trained in approximately 4 hours on CPU, so no GPU is needed for this educational demo.

Summary

What These Numbers Mean

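| Metric             | Value            | Interpretation                                                         |
|--------------------|------------------|------------------------------------------------------------------------|
| CE Loss 1.50       | ~4.5 perplexity  | About as uncertain as a uniform choice among ~4.5 characters per step  |
| Aux Loss 4.0       | Balanced         | All 4 experts contribute roughly equally                               |
| Val − Train = 0.12 | Low overfit      | Model isn't memorizing training data                                   |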
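
The values in this table can be derived from the step-5000 row of the loss curve. The sketch below assumes the cross-entropy is reported in nats (natural log), which is the usual convention but is not stated on this page; the bits-per-character figure is an extra conversion added for illustration.

```python
import math

# Step-5000 values from the loss-curve table
train_loss = 1.5395
ce_loss = 1.4992
val_loss = 1.6584

# Assumption: ce_loss is in nats, so perplexity is exp(loss).
perplexity = math.exp(ce_loss)
bits_per_char = ce_loss / math.log(2)
train_val_gap = val_loss - train_loss

print(f"perplexity    = {perplexity:.2f}")
print(f"bits per char = {bits_per_char:.2f}")
print(f"val - train   = {train_val_gap:.2f}")
```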

For a character-level model on Shakespeare, a loss of 1.5 produces recognizable (but imperfect) English text with Shakespeare-like vocabulary and rhythm.