In Reinforcement Learning (RL), an agent learns to make decisions by interacting with an environment to maximize cumulative rewards. The agent's strategy is its policy (π).
Policy Optimization algorithms directly learn or improve this policy. Instead of just learning values for states/actions, they find policy parameters (θ) that yield the highest rewards. PPO, GRPO, and DAPO are advanced policy optimization algorithms, particularly relevant for complex tasks like training Large Language Models (LLMs).
The policy πθ(a|s) maps a state s to a probability distribution over actions a, parameterized by θ. In LLMs, s is the current sequence of generated tokens (prompt + previous tokens), and a is the next token to generate.
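As a toy illustration (the candidate tokens and logit values below are made-up assumptions, not taken from any real model), the policy is simply a softmax over next-token scores given the current prefix:

```python
import numpy as np

# Hypothetical next-token logits for the state s = "What is the capital of Australia?"
# (both the candidate tokens and the scores are illustrative assumptions).
candidates = ["Canberra", "Sydney", "Melbourne"]
logits = np.array([2.0, 1.2, 0.3])

# pi_theta(a | s): softmax turns the scores into action (next-token) probabilities.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

for token, p in zip(candidates, probs):
    print(f"pi_theta({token!r} | s) = {p:.3f}")
```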
The state-value function V(s) estimates the expected cumulative reward obtainable from state s. For an LLM, V(s) would estimate the quality (e.g., as judged by a reward model) of the completion starting from the current token sequence s.
The advantage Â(s, a) measures how much better taking action a is compared to the average action from state s. It is often estimated using Generalized Advantage Estimation (GAE): Ât = Σl (γλ)^l δt+l, where δt+l = rt+l + γV(st+l+1) - V(st+l) is the TD residual, rt+l is the reward for generating the token at+l, and V(s) is the state-value function.
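A minimal NumPy sketch of GAE as defined above; the reward, value, γ, and λ numbers are illustrative assumptions rather than outputs of a real critic:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute A_hat_t = sum_l (gamma*lam)^l * delta_{t+l} via a backward recursion.

    rewards: r_t for each generated token; values: V(s_t) for each state, with one
    extra bootstrap entry V(s_T) appended at the end.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual delta_t
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Illustrative numbers: per-token rewards and critic values for a 3-token completion.
rewards = [0.0, 0.0, 2.0]        # reward arrives only at the final token
values = [5.0, 4.0, 3.0, 0.0]    # V(s_0), ..., V(s_3); the last entry is the bootstrap value
print(gae_advantages(rewards, values, gamma=0.9, lam=0.95))
```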
The clipping parameter ε (epsilon, e.g., 0.1 or 0.2) defines the clipping range in PPO's clipped surrogate objective, LCLIP(θ) = E[min(rt(θ)Ât, clip(rt(θ), 1-ε, 1+ε)Ât)], where rt(θ) = πθ(at|st) / πθold(at|st) is the probability ratio between the new and old policies.
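A sketch of the clipped surrogate term on its own, assuming per-token ratios and advantages have already been computed elsewhere:

```python
import numpy as np

def ppo_clipped_objective(ratios, advantages, eps=0.2):
    """Mean over tokens of min(r_t * A_hat_t, clip(r_t, 1 - eps, 1 + eps) * A_hat_t)."""
    ratios = np.asarray(ratios, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()

# One token with ratio 0.7 and advantage -2.1 (the "Sydney" case worked through below).
print(ppo_clipped_objective([0.7], [-2.1]))  # -> approximately -1.68
```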
Consider a concrete PPO step for our running example:

- s0: "What is the capital of Australia?"
- a0: the token "Sydney"
- r0: a reward model evaluates "Sydney" in this context and gives a low reward (e.g., 2/10) because it is incorrect.
- s1: "What is the capital of Australia? Sydney" (treated as a completed sequence for simplicity of this step).
- V(s0) estimates the expected future reward from the prompt; it might be moderate if the LLM sometimes answers correctly.
- V(s1), the value of the state after generating "Sydney", would likely be low, since "Sydney" is a poor completion.
- Â0 ≈ r0 + γV(s1) - V(s0). With r0 = 2, V(s1) = 1, V(s0) = 5 (the critic was hoping for a better outcome), and γ = 0.9, Â0 = 2 + 0.9*1 - 5 = -2.1: a negative advantage, indicating "Sydney" was a worse-than-average choice (checked numerically below).
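The single-step arithmetic can be checked directly with the same illustrative numbers:

```python
# One-step advantage estimate for the "Sydney" example (illustrative numbers).
r0, gamma = 2.0, 0.9
V_s0, V_s1 = 5.0, 1.0
A0 = r0 + gamma * V_s1 - V_s0
print(round(A0, 2))  # -2.1
```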
πθ("Sydney"|s0) < πθold("Sydney"|s0)
. This makes r0(θ) < 1
.
Let ε = 0.2 and suppose r0(θ) = 0.7 (the policy wants to reduce the probability by 30%), with Â0 = -2.1:

- Unclipped term: r0(θ)Â0 = 0.7 * (-2.1) = -1.47.
- Clipped term: clip(r0(θ), 1-ε, 1+ε)Â0 = clip(0.7, 0.8, 1.2)Â0 = 0.8 * (-2.1) = -1.68.
- The objective takes min(-1.47, -1.68) = -1.68, i.e., the clipped term. Because the ratio is pinned at 0.8 inside the clip, this term contributes no gradient with respect to θ. The goal is still to reduce the likelihood of "Sydney", but the clipping prevents the policy from reducing the probability too drastically in a single update when the proposed change (here a 30% cut) exceeds the ε bound.
Equivalently, because Ât is negative, min(rt(θ)Ât, clip(rt(θ), 1-ε, 1+ε)Ât) = Ât * max(rt(θ), clip(rt(θ), 1-ε, 1+ε)): once rt(θ) falls below 1-ε, the clipped term is the one selected. The key point is that the change is bounded. With rt(θ) = 0.7 and Ât < 0, we have 0.8 * Ât < 0.7 * Ât, so the objective evaluates to clip(0.7, 0.8, 1.2) * Ât = 0.8 * Ât, whose gradient is zero; this update step therefore does not push the probability of "Sydney" down any further. (If instead rt(θ) had risen above 1+ε while Ât < 0, the unclipped term would be selected and the policy would be penalized in full, pushing the ratio back down.)
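A quick numeric check of this behaviour, sweeping the ratio (with the example's Â = -2.1) to show where the clipped term takes over and the objective flattens:

```python
import numpy as np

eps, A = 0.2, -2.1  # clipping range [0.8, 1.2], negative advantage from the example
for r in [1.3, 1.0, 0.8, 0.7, 0.5]:
    unclipped = r * A
    clipped = np.clip(r, 1 - eps, 1 + eps) * A
    term = min(unclipped, clipped)
    # Below 1 - eps the clipped constant wins, so the gradient w.r.t. the ratio is zero.
    print(f"r={r:.2f}  unclipped={unclipped:.2f}  clipped={clipped:.2f}  min={term:.2f}")
```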
The policy is gently discouraged from picking "Sydney".
Conversely, if the LLM had generated "Canberra" (high reward, positive advantage), PPO would encourage increasing its probability, with the increase clipped at (1+ε)Ât.
GRPO (Group Relative Policy Optimization) eliminates the learned critic V(s) entirely. This reduces memory and computation.
For each prompt q, the current policy generates a group of G responses {o1, ..., oG}. For each response oi, its reward R(oi) is standardized relative to the group's rewards: Â(oi) = (R(oi) - μG) / (σG + ϵnorm), where μG is the mean reward of the group, σG is the standard deviation of rewards in the group, and ϵnorm is a small constant for numerical stability. This advantage is typically applied to all tokens in the response oi.
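A minimal sketch of this group standardization; the three answers and their rewards are illustrative assumptions:

```python
import numpy as np

def group_relative_advantages(rewards, eps_norm=1e-6):
    """GRPO-style advantage: standardize each response's reward within its group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps_norm)

# Illustrative rewards for G = 3 sampled answers to "What is the capital of Australia?"
rewards = {"Canberra": 10.0, "Sydney": 2.0, "Melbourne": 2.0}
advantages = group_relative_advantages(list(rewards.values()))
for answer, adv in zip(rewards, advantages):
    print(f"A_hat({answer}) = {adv:+.2f}")
```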
The objective also includes a KL regularization term, β · KL(πθ || πref), which keeps the policy close to a reference policy πref.
The PPO-style clipped ratio is retained, computed per response as πθ(oi|prompt) / πθold(oi|prompt).
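Putting the pieces together, a rough sketch of a GRPO-style objective for one sampled response; the log-probabilities, β, and the single-sample KL estimate are illustrative assumptions (actual implementations may use a different KL estimator):

```python
import numpy as np

def grpo_objective(logp_new, logp_old, logp_ref, advantage, eps=0.2, beta=0.04):
    """Sketch: clipped response-level ratio term minus a KL penalty toward pi_ref.

    logp_* are summed log-probabilities of one sampled response o_i under the
    current, old, and reference policies; eps and beta are illustrative values.
    """
    ratio = np.exp(logp_new - logp_old)     # pi_theta(oi|q) / pi_theta_old(oi|q)
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    kl_estimate = logp_new - logp_ref       # crude single-sample estimate of KL(pi_theta || pi_ref)
    return surrogate - beta * kl_estimate

# Illustrative numbers for one response with a positive group-relative advantage.
print(grpo_objective(logp_new=-4.8, logp_old=-5.0, logp_ref=-5.1, advantage=1.41))
```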
GRPO has a notable failure mode: if every response in a group receives the same reward (e.g., all correct or all incorrect), σG is zero, leading to zero advantage for all samples in that group and thus no learning signal. DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) addresses this and related issues with several modifications (sketched in code below):

- Clip-Higher (decoupled clipping): the clipping range for rt(θ) becomes asymmetric, e.g., [1-εlow, 1+εhigh], where εhigh (e.g., 0.28) > εlow (e.g., 0.2), giving low-probability tokens more room to grow.
- Dynamic Sampling: prompts whose groups have zero reward variance (σG = 0) are filtered out and resampled, so every batch carries a learning signal.
- Token-Level Loss: the policy-gradient loss is aggregated at the token level rather than per sequence.
DAPO typically builds on GRPO's critic-less group-relative advantage estimation but incorporates these enhancements.
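A minimal sketch of the first two mechanisms, assuming scalar response-level ratios and rewards; the function names are hypothetical:

```python
import numpy as np

def decoupled_clip_term(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Clip-Higher sketch: asymmetric clipping range [1 - eps_low, 1 + eps_high]."""
    clipped = np.clip(ratio, 1 - eps_low, 1 + eps_high)
    return min(ratio * advantage, clipped * advantage)

def keep_group(rewards, tol=1e-8):
    """Dynamic Sampling sketch: drop prompts whose group rewards have ~zero variance."""
    return float(np.std(np.asarray(rewards, dtype=float))) > tol

print(decoupled_clip_term(0.75, -1.0))  # -0.8: the lower bound (eps_low) still limits down-weighting
print(keep_group([10.0, 10.0, 10.0]))   # False: an all-correct group carries no learning signal
print(keep_group([10.0, 2.0, 2.0]))     # True: a mixed group is kept
```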
Example: suppose rt(θ) = 1.35 for "Canberra", with εlow = 0.2 and εhigh = 0.28, so the clipping range is [1-0.2, 1+0.28] = [0.8, 1.28]. Then clip(1.35, 0.8, 1.28) = 1.28, and the objective term becomes 1.28 * Â(o2). Standard PPO/GRPO (ε = 0.2) would clip at 1.2 * Â(o2). DAPO's "clip-higher" therefore allows a larger update (1.28 * Â(o2) vs 1.2 * Â(o2)), more strongly reinforcing this correct, high-advantage token, especially if it was initially unlikely.
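The same arithmetic as a standalone check (the advantage value 1.0 stands in for Â(o2)):

```python
import numpy as np

ratio, adv = 1.35, 1.0  # illustrative positive advantage for "Canberra"
dapo = min(ratio * adv, np.clip(ratio, 1 - 0.2, 1 + 0.28) * adv)  # clip-higher range [0.8, 1.28]
ppo = min(ratio * adv, np.clip(ratio, 1 - 0.2, 1 + 0.2) * adv)    # symmetric eps = 0.2
print(dapo, ppo)  # 1.28 1.2
```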
The journey from PPO to GRPO to DAPO shows an evolution driven by the need to apply RL effectively to increasingly large and complex models, especially LLMs, viewed here through the lens of the same running example.
| Feature | PPO | GRPO | DAPO |
|---|---|---|---|
| Critic Usage | Yes (learned V(s) estimates the value of "What is capital of Aus? ...token_sequence") | No (critic-less) | No (critic-less) |
| Advantage Estimation for "Canberra" | GAE: uses the reward for "Canberra" and V(s) from the critic. Token-level. | Group-relative: compares R("Canberra") to R("Sydney"), R("Melbourne"). Sequence-level. | Group-relative (like GRPO), but often applied with a token-level loss. |
| Clipping for "Canberra" (if rt(θ) = 1.35, Â > 0) | Symmetric (ε=0.2): clip(1.35, 0.8, 1.2)Â = 1.2Â | Symmetric (ε=0.2): clip(1.35, 0.8, 1.2)Â = 1.2Â | Decoupled (εhigh=0.28): clip(1.35, 0.8, 1.28)Â = 1.28Â (allows a larger increase) |
| Handling Homogeneous Rewards (e.g., all outputs "Canberra", R=10) | Critic still provides value estimates; GAE can be non-zero if V(s) differs. | Zero-advantage problem: σG=0, so Â=0 for all; no learning signal from this batch. | Dynamic sampling: filters out this batch and replaces it with a more diverse one. |
| Primary Stability/Exploration Mechanisms | Clipping, entropy bonus (optional) | Clipping, KL regularization, group normalization | Decoupled clipping, dynamic sampling, token-level loss |
| Primary Application Focus | General RL | LLM fine-tuning (general) | Advanced LLM reasoning, mitigating specific LLM RL issues |
PPO, GRPO, and DAPO represent a significant lineage of policy optimization algorithms. Using a consistent example like an LLM answering a factual question, we can see how PPO provides a robust foundation with its critic-based advantage. GRPO adapts these principles for resource-constrained LLM training by introducing a critic-less, group-based advantage, simplifying computation. DAPO further refines this with specialized techniques like decoupled clipping and dynamic sampling to tackle nuanced challenges in LLM training, such as maintaining exploration and improving data efficiency when rewards might be homogeneous or certain correct tokens are initially rare. Understanding their core mechanisms and evolutionary path is key to applying them effectively.