Skip to main content

Architecture Overview

NanoMoE is a GPT-style autoregressive transformer where the standard FFN in each transformer block is replaced with a Mixture-of-Experts (MoE) layer.

Full Model Architecture

Data Flow Summary

  1. Input tokens (integers) are embedded into dense vectors via a learned embedding table
  2. Positional embeddings are added so the model knows token order
  3. Each Transformer Block applies:
    • Self-attention (how tokens relate to each other)
    • MoE layer (expert-routed feed-forward processing)
    • Residual connections + LayerNorm for stability
  4. The final LM head projects back to vocabulary size for next-token prediction

The Pre-Norm Pattern

NanoMoE uses pre-norm (LayerNorm before the sub-layer) rather than post-norm:

Pre-norm:   output = x + SubLayer(LayerNorm(x))     ← we use this
Post-norm: output = LayerNorm(x + SubLayer(x)) ← original transformer
Why Pre-Norm?
  1. Better gradient flow — gradients pass through the residual connection unmodified
  2. More stable training — especially important for MoE where routing can cause instability
  3. Industry standard — used by GPT-2, LLaMA, Mistral, and more

Component Map

Each component is covered in detail in the following pages.