Architecture Overview

NanoMoE is a GPT-style autoregressive transformer where the standard FFN in each transformer block is replaced with a Mixture-of-Experts (MoE) layer.

Full Model Architecture

Data Flow Summary

Input tokens (integers) are embedded into dense vectors via a learned embedding table
Positional embeddings are added so the model knows token order
Each Transformer Block applies:
- Self-attention (how tokens relate to each other)
- MoE layer (expert-routed feed-forward processing)
- Residual connections + LayerNorm for stability
The final LM head projects back to vocabulary size for next-token prediction

The Pre-Norm Pattern

NanoMoE uses pre-norm (LayerNorm before the sub-layer) rather than post-norm:

Pre-norm:   output = x + SubLayer(LayerNorm(x))     ← we use this
Post-norm:  output = LayerNorm(x + SubLayer(x))      ← original transformer

Why Pre-Norm?

Better gradient flow — gradients pass through the residual connection unmodified
More stable training — especially important for MoE where routing can cause instability
Industry standard — used by GPT-2, LLaMA, Mistral, and more

Component Map

Each component is covered in detail in the following pages.

Full Model Architecture​

Data Flow Summary​

The Pre-Norm Pattern​

Component Map​

Full Model Architecture

Data Flow Summary

The Pre-Norm Pattern

Component Map