Skip to main content

Layers API

nano_moe.layers

All core neural network layers used by NanoMoE.

ExpertFFN

class ExpertFFN(nn.Module):
config: NanoMoEConfig

A single expert feed-forward network: Dense(d_ff) → GELU → Dense(d_model).

Input: (*, d_model) — any shape with last dimension d_model

Output: (*, d_model) — same shape as input


Router

class Router(nn.Module):
config: NanoMoEConfig

Top-K gating network with load-balancing auxiliary loss.

Input: (batch, seq_len, d_model)

Returns: Tuple[gates, indices, aux_loss]

ReturnShapeDescription
gates(B, T, top_k)Softmax weights for selected experts
indices(B, T, top_k)Indices of selected experts
aux_lossscalarLoad-balancing auxiliary loss

MoELayer

class MoELayer(nn.Module):
config: NanoMoEConfig

Full Mixture-of-Experts layer: Router + N expert FFNs + weighted combination.

Input: (batch, seq_len, d_model), deterministic: bool

Returns: Tuple[output, aux_loss]

ReturnShapeDescription
output(B, T, d_model)Weighted combination of expert outputs
aux_lossscalarLoad-balancing loss from router

MultiHeadAttention

class MultiHeadAttention(nn.Module):
config: NanoMoEConfig

Multi-head causal self-attention with learned Q, K, V projections.

Input: (batch, seq_len, d_model), deterministic: bool

Output: (batch, seq_len, d_model)


TransformerBlock

class TransformerBlock(nn.Module):
config: NanoMoEConfig

Pre-norm transformer block: LayerNorm → Attention → Residual → LayerNorm → MoE → Residual.

Input: (batch, seq_len, d_model), deterministic: bool

Returns: Tuple[output, aux_loss]

ReturnShapeDescription
output(B, T, d_model)Block output
aux_lossscalarAuxiliary loss from the MoE sub-layer