
Post-Training & Alignment — Overview

When I first trained this transformer from scratch, it could continue text but it couldn't follow instructions or reason. That's what post-training fixes. This docs/ folder walks through the whole journey I built on top of the base model — every stage written from scratch in plain PyTorch (no trl, no peft, no transformers), trained on real public datasets, and runnable on a single GPU or scaled across multiple GPUs with DDP.

The pipeline mirrors how modern aligned/reasoning models are actually built:

Post-training pipeline

Mermaid source (live, editable)
flowchart TD
    PILE([The Pile<br/>9.8B tokens]):::data --> PRE{{Pretrain<br/>~400M base}}:::model
    PRE --> BASE[(base_pretrained.pt)]:::ckpt
    BASE --> SFT{{SFT<br/>Alpaca · Dolly · GSM8K}}:::model
    SFT --> SFTCK[(sft.pt)]:::ckpt
    SFTCK --> RM{{Reward Model<br/>Bradley-Terry}}:::rl
    SFTCK --> DPO{{DPO / ORPO / KTO<br/>preference}}:::rl
    RM --> RMCK[(reward.pt)]:::ckpt
    RMCK -->|reward signal| PPO{{PPO<br/>GAE + clip + KL}}:::rl
    SFTCK --> PPO
    SFTCK --> GRPO{{GRPO / RLVR<br/>group-relative}}:::rl
    PPO --> EVAL([GSM8K eval<br/>+ chat / inference]):::eval
    DPO --> EVAL
    GRPO --> EVAL
    classDef data fill:#d6ffd9,stroke:#27ae60,stroke-width:2px,color:#143d1a;
    classDef model fill:#ffe8a3,stroke:#d48806,stroke-width:2px,color:#5a3d00;
    classDef rl fill:#ffd9b3,stroke:#e67e22,stroke-width:2px,color:#6b3500;
    classDef ckpt fill:#eeeeee,stroke:#555555,stroke-width:2px,color:#222;
    classDef eval fill:#e8d6ff,stroke:#8e44ad,stroke-width:2px,color:#3d1a5a;

The stages, in order

| # | Stage | What it teaches the model | Doc |
|---|-------|---------------------------|-----|
| 1 | Pretraining | language itself (next-token prediction on the Pile) | 02_pretraining.md |
| 2 | SFT | to follow instructions & produce the `<think>`/`<answer>` format | 03_sft.md |
| 3 | Reward Model | to score which answer humans prefer | 04_reward_model.md |
| 4 | DPO / ORPO / KTO | to prefer better answers without an RL loop | 05_dpo.md |
| 5 | PPO | to maximize a reward (RM or verifier) with the classic RLHF loop | 06_ppo.md |
| 6 | GRPO / RLVR | to reason, using verifiable rewards (DeepSeek-R1 style) | 07_grpo.md |
|   | Data pipeline | how every dataset above is downloaded & preprocessed | 01_data_pipeline.md |
|   | Evaluation | how I measure GSM8K accuracy across all stages | 08_evaluation.md |
|   | Inference / chat | how to actually talk to any checkpoint | 09_inference.md |
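
For intuition, the objectives named in the table (Bradley-Terry, DPO, the PPO clip, GRPO's group-relative advantage) can each be written in a few lines. The following is an illustrative sketch in plain Python on scalar inputs, not the code from this repo: the function names, `beta=0.1`, `clip_eps=0.2`, and the use of the population standard deviation are all assumptions chosen for the example. The per-stage docs describe the actual training loops.

```python
import math
from statistics import mean, pstdev


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Reward-model loss: -log sigmoid(r_chosen - r_rejected)."""
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss on sequence log-probs (policy vs. frozen reference).

    beta=0.1 is an assumed value for illustration.
    """
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def ppo_clipped_objective(logp_new: float, logp_old: float,
                          advantage: float, clip_eps: float = 0.2) -> float:
    """PPO clipped surrogate for one token (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def group_relative_advantages(rewards: list[float],
                              eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: standardize rewards within one prompt's group.

    Uses the population std (an assumption; implementations differ).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With a verifiable reward of 1 for a correct answer and 0 otherwise, `group_relative_advantages([1.0, 0.0, 0.0, 1.0])` gives positive advantages to the two correct samples and negative advantages to the two incorrect ones, with no separate value network involved.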

The one design rule: wrap, don't rewrite

Everything here sits on top of the original Transformer. I changed the educational model in exactly one place — I added a forward_hidden method that returns the final hidden states the lm_head consumes. Every post-training head (a value head for PPO, a scalar reward head for the reward model) and every RL log-prob computation composes around that one method, so the from-scratch model you already understand stays intact.
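
The wrapping idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the repo's model: `TinyLM` is a hypothetical stand-in for the Transformer, and `RewardModel`, `ValueModel`, `token_log_probs`, the last-token pooling, and all sizes are invented for the example. Only the `forward_hidden` and `lm_head` names come from the text above.

```python
import torch
import torch.nn as nn


class TinyLM(nn.Module):
    """Hypothetical stand-in for the base Transformer."""

    def __init__(self, vocab_size: int = 100, d_model: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.block = nn.Linear(d_model, d_model)  # placeholder for the real layers
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward_hidden(self, input_ids: torch.Tensor) -> torch.Tensor:
        # (batch, seq) -> (batch, seq, d_model): the states lm_head consumes
        return torch.tanh(self.block(self.embed(input_ids)))

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.lm_head(self.forward_hidden(input_ids))


class RewardModel(nn.Module):
    """Scalar reward head composed around forward_hidden."""

    def __init__(self, base: TinyLM, d_model: int = 32):
        super().__init__()
        self.base = base
        self.reward_head = nn.Linear(d_model, 1)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        h = self.base.forward_hidden(input_ids)
        # Assumption: pool at the last position (ignores padding for brevity).
        return self.reward_head(h[:, -1, :]).squeeze(-1)  # (batch,)


class ValueModel(nn.Module):
    """Per-token value head, as a PPO critic might use."""

    def __init__(self, base: TinyLM, d_model: int = 32):
        super().__init__()
        self.base = base
        self.value_head = nn.Linear(d_model, 1)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.value_head(self.base.forward_hidden(input_ids)).squeeze(-1)  # (batch, seq)


def token_log_probs(base: TinyLM, input_ids: torch.Tensor) -> torch.Tensor:
    """Log-prob of each next token under the model: (batch, seq - 1)."""
    logits = base.lm_head(base.forward_hidden(input_ids[:, :-1]))
    logp = torch.log_softmax(logits, dim=-1)
    return logp.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
```

Because each wrapper only calls `forward_hidden`, the base model's own `forward` stays untouched, and a reward or value head adds just one small linear layer of new parameters.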

Colour legend (used in every diagram in these docs)

🟩 data / corpus · 🟦 preprocessing · 🟦‍⬛ storage (HDF5 / JSONL) · 🟨 model / training loop · 🟧 RL / reward · 🟥 loss / objective · 🟪 evaluation · ⬜ checkpoint

Each diagram is a hand-drawn, colour-coded Mermaid sketch, pre-rendered to a PNG and embedded as an image (GitHub's live Mermaid doesn't reliably do look: handDrawn, and some viewers — e.g. the VS Code preview — block SVGs, so an embedded PNG shows everywhere). The editable Mermaid source sits in a collapsible "Mermaid source" block under each image. To regenerate the images after editing, see diagrams/README.md.

Run the whole thing

Once the base model has been pretrained (02_pretraining.md), the entire chain is one script:

bash scripts/run_posttraining.sh          # SFT -> RM -> DPO -> PPO -> GRPO -> eval table

See POST_TRAINING.md for the condensed command reference.