All Classes and Interfaces
AdamW optimizer with per-parameter state (see the update-rule sketch after this list).
One language-modeling batch: x holds the input token ids, y the targets (next-token ids).
Minimal byte-level tokenizers (token ids 0-255).
Wiring helpers for causal language model training.
Collects GPU operations lazily and flushes them as a single command buffer.
Cross-entropy loss with integer class targets.
Token embedding: ids -> vectors.
Flexible fully-connected neural network (MLP) built from Linear projections.
Gaussian Error Linear Unit (GELU), using the tanh approximation popularized by GPT-2 (see the sketch after this list).
Minimal GPT-style decoder-only transformer for educational/training use.
Handle to a GPU-resident float buffer managed by a ComputeGraph.
Abstraction over a GPU compute runtime (Metal, CUDA, Vulkan, etc.).
Differentiable module mapping Tensor -> Tensor.
LayerNorm over feature dimension (cols) with trainable gamma/beta exposed as Parameters.
Fully-connected layer: y = xW + b, with x: [n x dIn], W: [dIn x dOut], b: [1 x dOut] (see the sketch after this list).
Multi-head causal self-attention for a single sequence (no batch dimension).
Simple mutable parameter holder for optimizers.
Optimizer that updates a set of Parameters once per training step.
Learnable positional embeddings added to token embeddings.
Row-wise softmax for 2D tensors: applies softmax independently to each row (see the sketch after this list).
Helpers to train classic Tensor->Tensor supervised models (e.g., FNN) using the unified Trainer wrapper.
Simple in-memory dataset that samples random contiguous chunks from token ids.
Simple autoregressive text generation for GPTModel.
Example: classic MLP training using Linear + activation (via FNN) and the unified Trainer wrapper.
A small, reusable training loop wrapper.
Example: tiny GPT training on a small text file using byte-level tokens.
Pre-LN Transformer block: x = x + Attn(LN(x)), then x = x + MLP(LN(x)) (see the wiring sketch after this list).
Convenience builder for assembling transformer stacks.
A simple sequential stack of TransformerBlocks.
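
The AdamW entry above refers to the decoupled weight-decay variant of Adam. The following is a minimal, library-independent sketch of one update step for a single parameter array; the class name, field layout, and hyperparameter values here are illustrative assumptions, not the library's actual optimizer API.

    /** Minimal AdamW step for one parameter array; state (m, v, t) is kept per parameter. */
    final class AdamWSketch {
        final float lr = 1e-3f, beta1 = 0.9f, beta2 = 0.999f, eps = 1e-8f, weightDecay = 0.01f;
        final float[] m, v;   // first/second moment estimates, one slot per weight
        int t = 0;            // step counter for bias correction

        AdamWSketch(int size) { m = new float[size]; v = new float[size]; }

        void step(float[] w, float[] grad) {
            t++;
            double bc1 = 1.0 - Math.pow(beta1, t);   // bias corrections
            double bc2 = 1.0 - Math.pow(beta2, t);
            for (int i = 0; i < w.length; i++) {
                m[i] = beta1 * m[i] + (1 - beta1) * grad[i];
                v[i] = beta2 * v[i] + (1 - beta2) * grad[i] * grad[i];
                double mHat = m[i] / bc1;
                double vHat = v[i] / bc2;
                // Decoupled weight decay: applied directly to the weight, not mixed into the gradient.
                w[i] -= lr * (mHat / (Math.sqrt(vHat) + eps) + weightDecay * w[i]);
            }
        }
    }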
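The GELU entry names the tanh approximation used by GPT-2; for reference, this is that approximation written as a standalone Java function, independent of the library's activation classes.

    final class GeluSketch {
        /** GELU, tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))). */
        static float gelu(float x) {
            double c = Math.sqrt(2.0 / Math.PI);
            double inner = c * (x + 0.044715 * x * x * x);
            return (float) (0.5 * x * (1.0 + Math.tanh(inner)));
        }
    }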
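For the fully-connected layer entry, a plain-Java forward pass over float[][] matrices with the shapes given above; this is an illustration of y = xW + b only (no backward pass) and is not the library's Linear class.

    final class LinearSketch {
        /** y = xW + b, with x: [n x dIn], W: [dIn x dOut], b: [1 x dOut] broadcast over rows. */
        static float[][] forward(float[][] x, float[][] w, float[] b) {
            int n = x.length, dIn = w.length, dOut = w[0].length;
            float[][] y = new float[n][dOut];
            for (int i = 0; i < n; i++) {
                for (int k = 0; k < dIn; k++) {
                    float xik = x[i][k];
                    for (int j = 0; j < dOut; j++) {
                        y[i][j] += xik * w[k][j];   // accumulate the matrix product xW
                    }
                }
                for (int j = 0; j < dOut; j++) {
                    y[i][j] += b[j];                // bias broadcast to every row
                }
            }
            return y;
        }
    }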
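For the row-wise softmax entry, a reference implementation over a float[][] matrix; subtracting the row maximum is a standard numerical-stability step and is an assumption about, not a description of, the library's internals.

    final class SoftmaxSketch {
        /** Softmax applied independently to each row; subtracting the row max avoids overflow in exp. */
        static float[][] rowSoftmax(float[][] logits) {
            float[][] out = new float[logits.length][];
            for (int i = 0; i < logits.length; i++) {
                float[] row = logits[i];
                float max = Float.NEGATIVE_INFINITY;
                for (float v : row) max = Math.max(max, v);
                double sum = 0.0;
                double[] exps = new double[row.length];
                for (int j = 0; j < row.length; j++) {
                    exps[j] = Math.exp(row[j] - max);
                    sum += exps[j];
                }
                out[i] = new float[row.length];
                for (int j = 0; j < row.length; j++) {
                    out[i][j] = (float) (exps[j] / sum);
                }
            }
            return out;
        }
    }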
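The Pre-LN block entry gives the two residual equations; the sketch below shows only that wiring, using java.util.function.UnaryOperator placeholders for the attention, MLP, and LayerNorm sub-modules. The class and method names are hypothetical and do not reflect the library's TransformerBlock API.

    import java.util.function.UnaryOperator;

    /** Pre-LN residual wiring: x = x + attn(ln1(x)); x = x + mlp(ln2(x)). */
    final class PreLnBlockSketch {
        final UnaryOperator<float[][]> ln1, attn, ln2, mlp;  // sub-module placeholders

        PreLnBlockSketch(UnaryOperator<float[][]> ln1, UnaryOperator<float[][]> attn,
                         UnaryOperator<float[][]> ln2, UnaryOperator<float[][]> mlp) {
            this.ln1 = ln1; this.attn = attn; this.ln2 = ln2; this.mlp = mlp;
        }

        float[][] forward(float[][] x) {
            x = add(x, attn.apply(ln1.apply(x)));  // x = x + Attn(LN(x))
            x = add(x, mlp.apply(ln2.apply(x)));   // x = x + MLP(LN(x))
            return x;
        }

        private static float[][] add(float[][] a, float[][] b) {
            float[][] out = new float[a.length][a[0].length];
            for (int i = 0; i < a.length; i++)
                for (int j = 0; j < a[0].length; j++)
                    out[i][j] = a[i][j] + b[i][j];
            return out;
        }
    }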