An intuition ladder from supervised learning to LLM RL post-training (SFT, REINFORCE, PPO, GRPO, reward models, DPO), all in one tiny runnable notebook.