
NanoMoEConfig

nano_moe.config.NanoMoEConfig

A frozen dataclass holding all hyperparameters for the model architecture, MoE routing, training, and evaluation.

Usage

from nano_moe import NanoMoEConfig

# Default configuration
config = NanoMoEConfig()

# Custom configuration
config = NanoMoEConfig(
    vocab_size=256,
    n_layers=6,
    n_experts=8,
    top_k=2,
    d_model=256,
)

Parameters

Model Architecture

| Parameter | Type | Default | Description |
|---|---|---|---|
| vocab_size | int | 65 | Size of the token vocabulary |
| block_size | int | 128 | Maximum sequence length (context window) |
| d_model | int | 128 | Hidden dimension (embedding size) |
| n_heads | int | 4 | Number of attention heads |
| n_layers | int | 4 | Number of transformer blocks |
| d_ff | int | 512 | Inner dimension of each expert FFN |
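
Some of these values constrain each other. In a standard multi-head attention layout, d_model is split evenly across n_heads, so the first must be divisible by the second. The sketch below checks that with plain integers copied from the defaults above. That NanoMoE splits heads this way is an assumption, and `head_dim` is a hypothetical helper, not part of the library.

```python
# Defaults copied from the Model Architecture table.
d_model = 128
n_heads = 4


def head_dim(d_model: int, n_heads: int) -> int:
    """Per-head width, assuming d_model is split evenly across heads."""
    if d_model % n_heads != 0:
        raise ValueError(
            f"d_model={d_model} is not divisible by n_heads={n_heads}"
        )
    return d_model // n_heads


print(head_dim(d_model, n_heads))  # 32
```

A check like this is worth running before building a custom config, since a mismatch would otherwise most likely surface only as a shape error at model construction time.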

MoE Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| n_experts | int | 4 | Number of expert FFNs per MoE layer |
| top_k | int | 2 | Experts activated per token |
| aux_loss_weight | float | 0.01 | Weight of load-balancing auxiliary loss |
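
To see what n_experts and top_k mean for compute, it can help to compare total and active expert parameters in one MoE layer. The arithmetic below is a rough sketch that assumes each expert is a two-layer, bias-free FFN (d_model to d_ff and back); the actual expert layout in NanoMoE may differ.

```python
# Defaults copied from the tables above.
d_model, d_ff = 128, 512
n_experts, top_k = 4, 2

# Assumption: each expert is two bias-free linear layers
# (d_model -> d_ff -> d_model).
params_per_expert = 2 * d_model * d_ff

# Parameters stored in one MoE layer vs. parameters used per token.
total_expert_params = n_experts * params_per_expert
active_expert_params = top_k * params_per_expert

print(params_per_expert)     # 131072
print(total_expert_params)   # 524288
print(active_expert_params)  # 262144
print(top_k / n_experts)     # 0.5
```

Under these assumptions, the defaults route each token through half of the expert parameters in a layer.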

Training

| Parameter | Type | Default | Description |
|---|---|---|---|
| batch_size | int | 32 | Sequences per training batch |
| learning_rate | float | 3e-4 | AdamW learning rate |
| max_steps | int | 5000 | Total training steps |
| dropout | float | 0.1 | Dropout probability |
| weight_decay | float | 0.1 | AdamW weight decay |
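
Taken together with block_size, these defaults imply a token budget for a full run. The calculation below is a back-of-the-envelope sketch that assumes every sequence in a batch is exactly block_size tokens long and that there is no gradient accumulation.

```python
# Defaults copied from the tables above.
batch_size = 32
block_size = 128
max_steps = 5000

# Assumption: every sequence fills the full context window.
tokens_per_step = batch_size * block_size
total_tokens = tokens_per_step * max_steps

print(tokens_per_step)  # 4096
print(total_tokens)     # 20480000
```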

Evaluation

| Parameter | Type | Default | Description |
|---|---|---|---|
| eval_interval | int | 250 | Steps between evaluations |
| eval_iters | int | 200 | Batches per evaluation |

Frozen Dataclass

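NanoMoEConfig is declared with @dataclass(frozen=True), so instances are immutable after creation: assigning to a field raises dataclasses.FrozenInstanceError. To change a value, create a new config, for example with dataclasses.replace. This is intentional; it prevents accidental mutation during training.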
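
The behavior of a frozen dataclass can be illustrated with a small stand-in. `DemoConfig` below is a hypothetical class that mirrors a few of the documented fields so the snippet runs without nano_moe installed. Because NanoMoEConfig is also a frozen dataclass, the same standard-library calls should apply to it.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class DemoConfig:
    """Hypothetical stand-in for NanoMoEConfig (subset of fields)."""
    n_layers: int = 4
    n_experts: int = 4
    top_k: int = 2


base = DemoConfig()

# replace() builds a new instance with the given fields overridden;
# the original instance is left untouched.
wider = replace(base, n_experts=8)

# Direct assignment on a frozen dataclass raises FrozenInstanceError.
try:
    base.n_experts = 8
except FrozenInstanceError:
    mutation_blocked = True
else:
    mutation_blocked = False

print(base.n_experts, wider.n_experts, mutation_blocked)  # 4 8 True
```

Deriving variants with replace keeps the base config intact, which is convenient when sweeping a single hyperparameter across several runs.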