Written By: Qingyang Xu (et AI)

Last Modified: April 16, 2026

1. What “post-training” means

A useful definition is:

Pretraining learns a general next-token model over a huge corpus.

Post-training is everything after that, where we reshape the pretrained model into a system that is useful for a target domain, task family, or interaction style.

In modern practice, post-training typically includes some subset of:

  1. continued pretraining / domain-adaptive pretraining on unlabeled in-domain text,
  2. supervised fine-tuning (SFT) on instruction-response or task data,
  3. preference optimization or RLHF/RLAIF to improve helpfulness, safety, style, or task-specific quality,
  4. reasoning-specific training such as process supervision or RL with verifiable rewards,
  5. parameter-efficient adaptation such as LoRA/QLoRA (see the sketch at the end of this section),
  6. synthetic-data generation, rejection sampling, distillation, and model averaging.

Modern large-model reports such as Llama 3 explicitly describe iterative post-training pipelines built from reward modeling, rejection sampling, SFT, DPO, synthetic data, and model averaging, rather than a single one-shot fine-tune.

The most important conceptual point is that post-training is not one algorithm. It is a stack.
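
To make one layer of that stack concrete, here is a minimal from-scratch sketch of item 5, parameter-efficient adaptation in the LoRA style: a frozen pretrained linear layer plus a trainable low-rank update. The class name `LoRALinear` and the hyperparameter values (`r`, `alpha`) are illustrative choices of mine, not taken from any particular library.

```python
# Minimal LoRA sketch in plain PyTorch (assumed dependency: torch).
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wraps a frozen pretrained linear layer with a trainable low-rank update.

    Effective weight: W_eff = W_frozen + (alpha / r) * B @ A,
    where A is (r x in_features) and B is (out_features x r).
    """

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pretrained weights
            p.requires_grad = False
        self.scaling = alpha / r
        # A gets small random values, B starts at zero, so the adapter is an
        # exact no-op at initialization and training only nudges it away.
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base projection plus the scaled low-rank correction.
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)


if __name__ == "__main__":
    layer = LoRALinear(nn.Linear(4096, 4096), r=8, alpha=16)
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    total = sum(p.numel() for p in layer.parameters())
    print(f"trainable params: {trainable:,} of {total:,}")  # ~65K of ~16.8M
```

Initializing `lora_B` to zeros is the standard LoRA choice: the adapted model starts out identical to the pretrained one, and only the tiny adapter matrices receive gradients.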


2. A unifying mathematical view

Let $x$ be the prompt/context and $y=(y_1,\dots,y_T)$ the response. A causal LM defines

$$
\pi_\theta(y\mid x)=\prod_{t=1}^T \pi_\theta(y_t\mid x,y_{<t}), \qquad \log \pi_\theta(y\mid x)=\sum_{t=1}^T \log \pi_\theta(y_t\mid x,y_{<t}).
$$

Nearly all post-training methods change $\theta$ by pushing probability mass toward “good” outputs and away from “bad” outputs; they differ in what counts as good and in what supervision signal is available.
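
The factorization above translates directly into code. The sketch below (function and argument names such as `sequence_log_prob` and `response_mask` are my own) computes $\log \pi_\theta(y\mid x)$ from a causal LM's logits over the concatenated prompt and response, summing per-token log-probabilities and masking out prompt positions. This quantity is what SFT maximizes on reference responses and what preference- and RL-style methods reweight.

```python
# Computing log pi_theta(y | x) exactly as in the factorization above.
# Tensor names and toy sizes are illustrative.
import torch
import torch.nn.functional as F


def sequence_log_prob(logits: torch.Tensor,
                      input_ids: torch.Tensor,
                      response_mask: torch.Tensor) -> torch.Tensor:
    """Sum of log pi(y_t | x, y_<t) over response tokens only.

    logits:        (batch, seq_len, vocab) from a causal LM run on [x; y]
    input_ids:     (batch, seq_len) token ids of the same sequence
    response_mask: (batch, seq_len) 1.0 where the token belongs to y, else 0.0
    """
    # Position t's logits predict token t+1, so shift by one.
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)            # (B, T-1, V)
    targets = input_ids[:, 1:]                                   # (B, T-1)
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Mask out prompt tokens x so only the response y contributes to the sum.
    return (token_log_probs * response_mask[:, 1:]).sum(dim=-1)  # (B,)


if __name__ == "__main__":
    B, T, V = 2, 6, 11                      # toy batch, sequence, vocab sizes
    logits = torch.randn(B, T, V)
    input_ids = torch.randint(0, V, (B, T))
    # Suppose the first 3 tokens are the prompt x and the rest are the response y.
    mask = torch.tensor([[0, 0, 0, 1, 1, 1]] * B, dtype=torch.float)
    print(sequence_log_prob(logits, input_ids, mask))  # shape (2,), negative values
```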