Layers API

nano_moe.layers

All core neural network layers used by NanoMoE.

ExpertFFN

class ExpertFFN(nn.Module):
    config: NanoMoEConfig

A single expert feed-forward network: Dense(d_ff) → GELU → Dense(d_model).

Input: (*, d_model) — any shape with last dimension d_model

Output: (*, d_model) — same shape as input
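The Dense(d_ff) → GELU → Dense(d_model) pipeline can be sketched as a pure JAX function. This is an illustrative sketch, not the NanoMoE implementation: the function name `expert_ffn`, the explicit weight arguments, the bias terms, and the sizes are all assumptions made for the example.

```python
import jax
import jax.numpy as jnp

def expert_ffn(x, w1, b1, w2, b2):
    """Dense(d_ff) -> GELU -> Dense(d_model), applied over the last axis."""
    h = jax.nn.gelu(x @ w1 + b1)  # (*, d_ff)
    return h @ w2 + b2            # (*, d_model)

# Hypothetical sizes, chosen only for illustration.
d_model, d_ff = 8, 32
k1, k2, kx = jax.random.split(jax.random.PRNGKey(0), 3)
w1 = jax.random.normal(k1, (d_model, d_ff)) * 0.02
w2 = jax.random.normal(k2, (d_ff, d_model)) * 0.02
b1, b2 = jnp.zeros(d_ff), jnp.zeros(d_model)

x = jax.random.normal(kx, (2, 5, d_model))  # any leading shape works
y = expert_ffn(x, w1, b1, w2, b2)           # same shape as x
```

Because the matrix products act only on the last axis, the same function accepts `(d_model,)`, `(tokens, d_model)`, or `(batch, seq_len, d_model)` inputs.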

Router

class Router(nn.Module):
    config: NanoMoEConfig

Top-K gating network with load-balancing auxiliary loss.

Input: (batch, seq_len, d_model). The tables on this page abbreviate batch and seq_len as B and T.

Returns: Tuple[gates, indices, aux_loss]

| Return | Shape | Description |
|---|---|---|
| `gates` | `(B, T, top_k)` | Softmax weights for selected experts |
| `indices` | `(B, T, top_k)` | Indices of selected experts |
| `aux_loss` | scalar | Load-balancing auxiliary loss |
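A minimal sketch of top-K gating that produces the three return values described above. Two details are assumptions, since this page does not specify them: the softmax is taken over the selected top-K logits only, and the auxiliary loss uses the common "fraction routed times mean probability" form. The names `route_top_k` and `w_gate` are hypothetical.

```python
import jax
import jax.numpy as jnp

def route_top_k(x, w_gate, top_k):
    """Return (gates, indices, aux_loss) for inputs of shape (B, T, d_model)."""
    n_experts = w_gate.shape[-1]
    logits = x @ w_gate                                   # (B, T, n_experts)
    probs = jax.nn.softmax(logits, axis=-1)               # full routing distribution

    top_logits, indices = jax.lax.top_k(logits, top_k)    # both (B, T, top_k)
    gates = jax.nn.softmax(top_logits, axis=-1)           # assumption: renormalize over top-k

    # Assumed load-balancing loss: n_experts * sum_e(fraction routed to e * mean prob of e).
    dispatch = jax.nn.one_hot(indices, n_experts).sum(axis=2)   # (B, T, n_experts)
    frac_routed = dispatch.mean(axis=(0, 1)) / top_k            # (n_experts,), sums to 1
    mean_prob = probs.mean(axis=(0, 1))                         # (n_experts,), sums to 1
    aux_loss = n_experts * jnp.sum(frac_routed * mean_prob)     # scalar
    return gates, indices, aux_loss

# Hypothetical sizes, chosen only for illustration.
B, T, d_model, n_experts, top_k = 2, 5, 8, 4, 2
kx, kw = jax.random.split(jax.random.PRNGKey(0))
x = jax.random.normal(kx, (B, T, d_model))
w_gate = jax.random.normal(kw, (d_model, n_experts)) * 0.02
gates, indices, aux_loss = route_top_k(x, w_gate, top_k)
```

With this form of the loss, routing that spreads tokens evenly across experts scores lower than routing that concentrates them on a few, which is the behavior a load-balancing term is meant to encourage.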

MoELayer

class MoELayer(nn.Module):
    config: NanoMoEConfig

Full Mixture-of-Experts layer: Router + N expert FFNs + weighted combination.

Inputs: a tensor of shape (batch, seq_len, d_model) and a flag deterministic: bool

Returns: Tuple[output, aux_loss]

| Return | Shape | Description |
|---|---|---|
| `output` | `(B, T, d_model)` | Weighted combination of expert outputs |
| `aux_loss` | scalar | Load-balancing loss from router |
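The weighted combination step can be illustrated in isolation. The sketch below assumes every expert has already been evaluated on every token (a dense formulation that is simple to read; an actual implementation may dispatch tokens sparsely instead). The function name `combine_experts` and the `(B, T, n_experts, d_model)` layout of `expert_out` are assumptions for the example.

```python
import jax
import jax.numpy as jnp

def combine_experts(expert_out, gates, indices):
    """Mix expert outputs per token using the router's gates and indices.

    expert_out: (B, T, n_experts, d_model), output of every expert for every token
    gates, indices: (B, T, top_k), as returned by a top-k router
    """
    n_experts = expert_out.shape[2]
    one_hot = jax.nn.one_hot(indices, n_experts)              # (B, T, top_k, n_experts)
    weights = jnp.einsum("btk,btke->bte", gates, one_hot)     # zero for unselected experts
    return jnp.einsum("bte,bted->btd", weights, expert_out)   # (B, T, d_model)

# Toy case: one token, three experts whose outputs are constant 1, 2 and 3.
expert_out = jnp.arange(1.0, 4.0)[None, None, :, None] * jnp.ones((1, 1, 3, 4))
gates = jnp.array([[[0.75, 0.25]]])   # weights for the two selected experts
indices = jnp.array([[[2, 0]]])       # expert 2 and expert 0 are selected
output = combine_experts(expert_out, gates, indices)
# Every entry is 0.75 * 3 + 0.25 * 1 = 2.5
```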

MultiHeadAttention

class MultiHeadAttention(nn.Module):
    config: NanoMoEConfig

Multi-head causal self-attention with learned Q, K, V projections.

Inputs: a tensor of shape (batch, seq_len, d_model) and a flag deterministic: bool

Output: (batch, seq_len, d_model)
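For reference, causal multi-head self-attention with learned Q, K, V projections can be sketched as follows. This is a generic formulation under stated assumptions, not the NanoMoE code: the output projection `wo`, the absence of biases and dropout, and the names `causal_self_attention` and `n_heads` are all assumptions for the example.

```python
import jax
import jax.numpy as jnp

def causal_self_attention(x, wq, wk, wv, wo, n_heads):
    """x: (B, T, d_model); all weight matrices: (d_model, d_model)."""
    B, T, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(t):
        # (B, T, d_model) -> (B, n_heads, T, d_head)
        return t.reshape(B, T, n_heads, d_head).transpose(0, 2, 1, 3)

    q, k, v = split_heads(x @ wq), split_heads(x @ wk), split_heads(x @ wv)
    scores = q @ k.transpose(0, 1, 3, 2) / jnp.sqrt(d_head)   # (B, n_heads, T, T)

    # Causal mask: position t may attend only to positions <= t.
    mask = jnp.tril(jnp.ones((T, T), dtype=bool))
    scores = jnp.where(mask, scores, -jnp.inf)
    weights = jax.nn.softmax(scores, axis=-1)

    out = (weights @ v).transpose(0, 2, 1, 3).reshape(B, T, d_model)
    return out @ wo                                            # (B, T, d_model)

# Hypothetical sizes, chosen only for illustration.
B, T, d_model, n_heads = 2, 6, 8, 2
keys = jax.random.split(jax.random.PRNGKey(0), 5)
wq, wk, wv, wo = (jax.random.normal(k, (d_model, d_model)) * 0.1 for k in keys[:4])
x = jax.random.normal(keys[4], (B, T, d_model))
y = causal_self_attention(x, wq, wk, wv, wo, n_heads)
```

A quick way to check causality is to change the last token of `x` and confirm that the outputs at all earlier positions stay the same.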

TransformerBlock

class TransformerBlock(nn.Module):
    config: NanoMoEConfig

Pre-norm transformer block: LayerNorm → Attention → Residual → LayerNorm → MoE → Residual.

Inputs: a tensor of shape (batch, seq_len, d_model) and a flag deterministic: bool

Returns: Tuple[output, aux_loss]

| Return | Shape | Description |
|---|---|---|
| `output` | `(B, T, d_model)` | Block output |
| `aux_loss` | scalar | Auxiliary loss from the MoE sub-layer |
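The pre-norm ordering (LayerNorm → Attention → Residual → LayerNorm → MoE → Residual) can be sketched with stand-in sub-layers. Everything here is illustrative: `pre_norm_block`, `attn_fn`, and `moe_fn` are hypothetical names, the LayerNorm has no learned scale or bias, and the stand-ins simply scale their input in place of real attention and MoE layers.

```python
import jax
import jax.numpy as jnp

def layer_norm(x, eps=1e-6):
    """Normalize over the last axis (no learned scale or bias in this sketch)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / jnp.sqrt(var + eps)

def pre_norm_block(x, attn_fn, moe_fn):
    """Normalize before each sub-layer, then add the result to the residual stream."""
    x = x + attn_fn(layer_norm(x))             # LayerNorm -> Attention -> Residual
    moe_out, aux_loss = moe_fn(layer_norm(x))  # LayerNorm -> MoE
    x = x + moe_out                            # -> Residual
    return x, aux_loss

# Stand-in sub-layers so the sketch runs on its own.
attn_fn = lambda h: 0.5 * h
moe_fn = lambda h: (0.1 * h, jnp.asarray(0.01))

x = jax.random.normal(jax.random.PRNGKey(0), (2, 5, 8))
output, aux_loss = pre_norm_block(x, attn_fn, moe_fn)
```

Because normalization is applied to the sub-layer inputs rather than to the residual sum, the residual path carries `x` through unchanged when both sub-layers output zeros.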