Architecture Overview
NanoMoE is a GPT-style autoregressive transformer where the standard FFN in each transformer block is replaced with a Mixture-of-Experts (MoE) layer.
Full Model Architecture
Data Flow Summary
- Input tokens (integers) are embedded into dense vectors via a learned embedding table
- Positional embeddings are added so the model knows token order
- Each Transformer Block applies:
  - Self-attention (how tokens relate to each other)
  - MoE layer (expert-routed feed-forward processing)
  - A residual connection around each sub-layer, with LayerNorm applied to the sub-layer's input, for stability
- The final LM head projects back to vocabulary size for next-token prediction
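The data flow above can be sketched end to end in a few lines of NumPy. This is a minimal illustration, not NanoMoE's actual code. Everything in it is an assumption made for the example: the sizes, the parameter names, the single attention head, the ReLU experts, top-1 routing, and the absence of learned LayerNorm parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes, chosen only to keep the example small.
VOCAB, MAX_LEN, D_MODEL, N_LAYERS, N_EXPERTS = 50, 16, 8, 2, 4

def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean, unit variance (no learned scale/shift)."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def self_attention(x, p):
    """Single-head causal self-attention over a (T, D) sequence."""
    T, D = x.shape
    q, k, v = x @ p["wq"], x @ p["wk"], x @ p["wv"]
    scores = q @ k.T / np.sqrt(D)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True where key is after query
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v @ p["wo"]

def moe_layer(x, p):
    """Top-1 routing (an assumption): each token goes to its highest-scoring expert."""
    gate = softmax(x @ p["router"])   # (T, N_EXPERTS) routing probabilities
    choice = gate.argmax(-1)          # (T,) chosen expert per token
    out = np.zeros_like(x)
    for e in range(N_EXPERTS):
        idx = np.where(choice == e)[0]
        if idx.size:
            h = np.maximum(x[idx] @ p["w1"][e], 0.0)          # expert FFN with ReLU
            out[idx] = (h @ p["w2"][e]) * gate[idx, e:e + 1]  # scale by gate weight
    return out

def init_block(scale=0.1):
    d, h = D_MODEL, 4 * D_MODEL
    return {
        "wq": rng.normal(0, scale, (d, d)), "wk": rng.normal(0, scale, (d, d)),
        "wv": rng.normal(0, scale, (d, d)), "wo": rng.normal(0, scale, (d, d)),
        "router": rng.normal(0, scale, (d, N_EXPERTS)),
        "w1": rng.normal(0, scale, (N_EXPERTS, d, h)),
        "w2": rng.normal(0, scale, (N_EXPERTS, h, d)),
    }

params = {
    "tok_emb": rng.normal(0, 0.1, (VOCAB, D_MODEL)),
    "pos_emb": rng.normal(0, 0.1, (MAX_LEN, D_MODEL)),
    "blocks": [init_block() for _ in range(N_LAYERS)],
    "lm_head": rng.normal(0, 0.1, (D_MODEL, VOCAB)),
}

def forward(tokens, params):
    x = params["tok_emb"][tokens]              # token ids -> dense vectors
    x = x + params["pos_emb"][: len(tokens)]   # add positional information
    for p in params["blocks"]:
        x = x + self_attention(layer_norm(x), p)  # pre-norm attention + residual
        x = x + moe_layer(layer_norm(x), p)       # pre-norm MoE + residual
    return x @ params["lm_head"]               # project to vocabulary logits

tokens = np.array([3, 14, 15, 9, 2])
logits = forward(tokens, params)
print(logits.shape)  # (5, 50)
```

Each row of `logits` scores every vocabulary entry as the candidate next token for that position. Because the attention is causally masked and the MoE layer acts on each token independently, changing a later token leaves the logits of earlier positions unchanged.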
The Pre-Norm Pattern
NanoMoE uses pre-norm (LayerNorm before the sub-layer) rather than post-norm:
Pre-norm: output = x + SubLayer(LayerNorm(x)) ← we use this
Post-norm: output = LayerNorm(x + SubLayer(x)) ← original transformer
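The two formulas differ only in where the normalization sits, which a small NumPy sketch makes concrete. The `sublayer` function here is a hypothetical stand-in for attention or the MoE layer, and `layer_norm` omits the learned scale and shift for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean, unit variance (no learned scale/shift)."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sublayer(x, w):
    """Hypothetical stand-in for attention or the MoE layer."""
    return np.tanh(x @ w)

def pre_norm_block(x, w):
    # Normalize only the sub-layer's input; the residual path carries x untouched.
    return x + sublayer(layer_norm(x), w)

def post_norm_block(x, w):
    # Normalize after the addition; the residual path also passes through LayerNorm.
    return layer_norm(x + sublayer(x, w))

rng = np.random.default_rng(0)
x = rng.normal(0, 5.0, (4, 8))   # deliberately large-scale input
w = rng.normal(0, 0.1, (8, 8))

pre = pre_norm_block(x, w)
post = post_norm_block(x, w)

# Pre-norm: the block only adds a bounded update on top of x.
print(np.abs(pre - x).max() <= 1.0)
# Post-norm: the output is rescaled, so each row has roughly zero mean.
print(np.allclose(post.mean(-1), 0.0, atol=1e-6))
```

In the pre-norm block the input `x` reaches the output unchanged apart from the added sub-layer term, while the post-norm block rescales the whole sum on every layer.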
Why Pre-Norm?
- Better gradient flow: the residual path is an identity, so gradients reach earlier layers without passing through a LayerNorm
- More stable training: especially important for MoE, where routing can cause instability
- Widely adopted: used by GPT-2, LLaMA, Mistral, and others (LLaMA and Mistral use RMSNorm in place of LayerNorm)
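The gradient-flow argument can be made explicit by differentiating the two update rules. The notation is introduced here for illustration only: $F$ stands for the sub-layer, $\mathrm{LN}$ for LayerNorm, and $J_{\mathrm{LN}}$ for LayerNorm's Jacobian.

```latex
% Pre-norm: y = x + F(LN(x))
\frac{\partial y}{\partial x}
  = I + \frac{\partial F(\mathrm{LN}(x))}{\partial x}

% Post-norm: y = LN(x + F(x))
\frac{\partial y}{\partial x}
  = J_{\mathrm{LN}}\bigl(x + F(x)\bigr)
    \left( I + \frac{\partial F(x)}{\partial x} \right)

% Stacking N pre-norm layers gives an additive residual stream:
x_N = x_0 + \sum_{l=0}^{N-1} F_l\bigl(\mathrm{LN}(x_l)\bigr)
\quad\Longrightarrow\quad
\frac{\partial x_N}{\partial x_0}
  = I + \sum_{l=0}^{N-1} \frac{\partial F_l(\mathrm{LN}(x_l))}{\partial x_0}
```

With pre-norm, the identity term $I$ survives across any number of layers, so part of the gradient always reaches the input directly. With post-norm, every layer multiplies the gradient by another $J_{\mathrm{LN}}$ factor, and no pure identity path remains.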
Component Map
Each component is covered in detail in the following pages.