Train (UI & CLI)
You can run every stage two ways: from the command line (full control, best for long jobs) or from the Streamlit control panel (forms, one-click launch, live logs, charts — see The UI).
Install
pip install -e ".[train]" # editable install — no more PYTHONPATH=.
export HF_HOME=/ephemeral/hf_cache
The pipeline, end to end (CLI)
Each command reads its stage JSON from configs/ (override anything with --field, or point at another
file with --config). Use python for one GPU and torchrun for many.
python scripts/prepare_pretrain_data.py --split val --out /ephemeral/data/pile_dev.h5
python scripts/prepare_pretrain_data.py --split train --num_shards 1 --out /ephemeral/data/pile_train.h5
python scripts/prepare_sft_data.py
python scripts/prepare_preference_data.py --source both
python scripts/prepare_rl_prompts.py
torchrun --standalone --nproc_per_node=2 scripts/train_sft.py
torchrun --standalone --nproc_per_node=2 scripts/train_reward.py
torchrun --standalone --nproc_per_node=2 scripts/train_dpo.py --loss_type dpo
torchrun --standalone --nproc_per_node=2 scripts/train_ppo.py --reward_source verifier
torchrun --standalone --nproc_per_node=2 scripts/train_grpo.py
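The "stage JSON plus --field overrides" behavior described above can be sketched roughly as follows. This is an illustration only, not the project's actual loader: the function name load_stage_config, the helper _caster, and the example keys (batch_size, lr, use_wandb) are assumptions made for the demo.

```python
import argparse
import json
import tempfile
from pathlib import Path


def _caster(value):
    """Pick a string-to-value converter based on the type of the JSON default."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return lambda s: s.lower() in ("1", "true", "yes")
    if isinstance(value, (int, float, str)):
        return type(value)
    return json.loads  # lists, dicts, null: parse the flag value as JSON


def load_stage_config(default_path, argv):
    """Load a stage JSON, then apply --key overrides from argv (hypothetical helper)."""
    # First pass: only look for --config so we know which file to read.
    pre = argparse.ArgumentParser(add_help=False)
    pre.add_argument("--config", default=str(default_path))
    known, rest = pre.parse_known_args(argv)
    cfg = json.loads(Path(known.config).read_text())

    # Second pass: every key in the JSON becomes an optional --key flag.
    parser = argparse.ArgumentParser()
    for key, value in cfg.items():
        parser.add_argument(f"--{key}", type=_caster(value), default=value)
    return vars(parser.parse_args(rest))


# Demo with a throwaway config file; the keys are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sft.json"
    path.write_text(json.dumps({"batch_size": 8, "lr": 3e-4, "use_wandb": False}))
    print(load_stage_config(path, ["--batch_size", "4", "--use_wandb", "true"]))
```

Deriving the flags from the JSON keys keeps the file as the single source of defaults, so a misspelled override fails loudly instead of being silently ignored.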
The whole alignment chain in one shot:
Multi-GPU notes
- torchrun --standalone --nproc_per_node=N launches N data-parallel ranks (DDP + bf16). Only rank 0 logs and checkpoints.
- On the dev box (2× H100, no NVLink) the educational attention materializes a (B, n_head, T, T) tensor per block, so memory grows quadratically with sequence length. At context 1024, use --batch_size 8 --grad_accum 12 and recover the effective batch via accumulation. Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to reduce allocator fragmentation.
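The arithmetic behind these settings can be checked with a short sketch. It assumes --batch_size is the per-rank batch (so the effective batch is multiplied by the number of ranks) and uses n_head=12 purely as a placeholder, since the model's head count is not stated here. The memory figure covers only one score tensor in bf16, so real usage per block will be higher.

```python
def effective_batch(batch_size, grad_accum, world_size):
    """Sequences contributing to one optimizer step under data parallelism."""
    return batch_size * grad_accum * world_size


def attention_scores_bytes(batch_size, n_head, seq_len, bytes_per_elem=2):
    """Size of one (B, n_head, T, T) score tensor; bf16 uses 2 bytes per element."""
    return batch_size * n_head * seq_len * seq_len * bytes_per_elem


# Values from the text: 2 GPUs, context 1024, --batch_size 8 --grad_accum 12.
print(effective_batch(batch_size=8, grad_accum=12, world_size=2))  # 192

# n_head=12 is a placeholder, not a value taken from this project.
at_1024 = attention_scores_bytes(batch_size=8, n_head=12, seq_len=1024)
at_2048 = attention_scores_bytes(batch_size=8, n_head=12, seq_len=2048)
print(f"{at_1024 / 2**20:.0f} MiB per block at T=1024")
print(f"doubling T multiplies this by {at_2048 // at_1024}")
```

Because the tensor grows with T squared, halving the per-rank batch and doubling grad_accum keeps the effective batch constant while halving this term.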
Where outputs go
- Checkpoints → /ephemeral/ckpts/<stage>.pt (each carries its own resolved cfg).
- Metrics → /ephemeral/logs/<stage>_<timestamp>.jsonl (one JSON object per logged step). The UI plots these live; you can also pass --use_wandb true to mirror them to Weights & Biases.
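Since each metrics file holds one JSON object per line, it can be read outside the UI with a few lines of standard-library Python. The sketch below is a minimal reader, not part of the project; the field names step and loss are assumptions for the demo, so check a real log for the keys your stage actually writes.

```python
import json
import tempfile
from pathlib import Path


def read_metrics(path):
    """Return one dict per parseable line of a JSONL metrics file."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                # Skip a partially written line while the job is still running.
                continue
    return records


# Demo on a throwaway file; "step" and "loss" are assumed field names.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "sft_demo.jsonl"
    log.write_text('{"step": 10, "loss": 2.5}\n{"step": 20, "loss": 2.25}\n')
    rows = read_metrics(log)
    print(len(rows), rows[-1]["step"], rows[-1]["loss"])
```

Skipping unparseable lines makes the reader safe to call on a log that a running job is still appending to.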