In Reinforcement Learning (RL), an agent learns to make decisions by interacting with an environment to maximize cumulative rewards. The agent's strategy is its policy (π).
Policy Optimization algorithms directly learn or improve this policy. Instead of just learning values for states/actions, they find policy parameters (θ) that yield the highest rewards. PPO, GRPO, and DAPO are advanced policy optimization algorithms, particularly relevant for complex tasks like training Large Language Models (LLMs).
The policy πθ(a|s) maps a state s to a probability distribution over actions a, parameterized by θ. In LLMs, s is the current sequence of generated tokens (prompt + previous tokens), and a is the next token to generate.
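As a toy illustration (the candidate tokens and logit values below are made-up assumptions, not taken from any real model), the policy is simply a softmax over next-token scores given the current prefix:

```python
import numpy as np

# Hypothetical next-token logits for the state s = "What is the capital of Australia?"
# (both the candidate tokens and the scores are illustrative assumptions).
candidates = ["Canberra", "Sydney", "Melbourne"]
logits = np.array([2.0, 1.2, 0.3])

# pi_theta(a | s): softmax turns the scores into action (next-token) probabilities.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

for token, p in zip(candidates, probs):
    print(f"pi_theta({token!r} | s) = {p:.3f}")
```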
The state-value function V(s) estimates the expected cumulative reward obtainable from state s. For an LLM, V(s) would estimate the quality (e.g., as judged by a reward model) of the completion starting from the current token sequence s.
The advantage Â(s, a) measures how much better taking action a is compared to the average action from state s. It is often estimated using Generalized Advantage Estimation (GAE): Ât = Σl (γλ)^l δt+l, where δt+l = rt+l + γV(st+l+1) - V(st+l) is the TD residual, rt+l is the reward for generating the token at+l, and V(s) is the state-value function.
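A minimal NumPy sketch of GAE as defined above; the reward, value, γ, and λ numbers are illustrative assumptions rather than outputs of a real critic:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute A_hat_t = sum_l (gamma*lam)^l * delta_{t+l} via a backward recursion.

    rewards: r_t for each generated token; values: V(s_t) for each state, with one
    extra bootstrap entry V(s_T) appended at the end.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual delta_t
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Illustrative numbers: per-token rewards and critic values for a 3-token completion.
rewards = [0.0, 0.0, 2.0]        # reward arrives only at the final token
values = [5.0, 4.0, 3.0, 0.0]    # V(s_0), ..., V(s_3); the last entry is the bootstrap value
print(gae_advantages(rewards, values, gamma=0.9, lam=0.95))
```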
The clipping parameter ε (epsilon, e.g., 0.1 or 0.2) defines the clipping range in PPO's clipped surrogate objective, LCLIP(θ) = E[min(rt(θ)Ât, clip(rt(θ), 1-ε, 1+ε)Ât)], where rt(θ) = πθ(at|st) / πθold(at|st) is the probability ratio between the new and old policies.
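A sketch of the clipped surrogate term on its own, assuming per-token ratios and advantages have already been computed elsewhere:

```python
import numpy as np

def ppo_clipped_objective(ratios, advantages, eps=0.2):
    """Mean over tokens of min(r_t * A_hat_t, clip(r_t, 1 - eps, 1 + eps) * A_hat_t)."""
    ratios = np.asarray(ratios, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()

# One token with ratio 0.7 and advantage -2.1 (the "Sydney" case worked through below).
print(ppo_clipped_objective([0.7], [-2.1]))  # -> approximately -1.68
```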
Consider a concrete PPO step for our running example:

- s0: "What is the capital of Australia?"
- a0: the token "Sydney"
- r0: a reward model evaluates "Sydney" in this context and gives a low reward (e.g., 2/10) because it is incorrect.
- s1: "What is the capital of Australia? Sydney" (treated as a completed sequence for simplicity of this step).
- V(s0) estimates the expected future reward from the prompt; it might be moderate if the LLM sometimes answers correctly.
- V(s1), the value of the state after generating "Sydney", would likely be low, since "Sydney" is a poor completion.
- Â0 ≈ r0 + γV(s1) - V(s0). With r0 = 2, V(s1) = 1, V(s0) = 5 (the critic was hoping for a better outcome), and γ = 0.9, Â0 = 2 + 0.9*1 - 5 = -2.1: a negative advantage, indicating "Sydney" was a worse-than-average choice (checked numerically below).
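The single-step arithmetic can be checked directly with the same illustrative numbers:

```python
# One-step advantage estimate for the "Sydney" example (illustrative numbers).
r0, gamma = 2.0, 0.9
V_s0, V_s1 = 5.0, 1.0
A0 = r0 + gamma * V_s1 - V_s0
print(round(A0, 2))  # -2.1
```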
πθ("Sydney"|s0) < πθold("Sydney"|s0)
. This makes r0(θ) < 1
.
Let ε = 0.2 and suppose r0(θ) = 0.7 (the policy wants to reduce the probability by 30%), with Â0 = -2.1:

- Unclipped term: r0(θ)Â0 = 0.7 * (-2.1) = -1.47.
- Clipped term: clip(r0(θ), 1-ε, 1+ε)Â0 = clip(0.7, 0.8, 1.2)Â0 = 0.8 * (-2.1) = -1.68.
- The objective takes min(-1.47, -1.68) = -1.68, i.e., the clipped term. Because the ratio is pinned at 0.8 inside the clip, this term contributes no gradient with respect to θ. The goal is still to reduce the likelihood of "Sydney", but the clipping prevents the policy from reducing the probability too drastically in a single update when the proposed change (here a 30% cut) exceeds the ε bound.
Equivalently, because Ât is negative, min(rt(θ)Ât, clip(rt(θ), 1-ε, 1+ε)Ât) = Ât * max(rt(θ), clip(rt(θ), 1-ε, 1+ε)): once rt(θ) falls below 1-ε, the clipped term is the one selected. The key point is that the change is bounded. With rt(θ) = 0.7 and Ât < 0, we have 0.8 * Ât < 0.7 * Ât, so the objective evaluates to clip(0.7, 0.8, 1.2) * Ât = 0.8 * Ât, whose gradient is zero; this update step therefore does not push the probability of "Sydney" down any further. (If instead rt(θ) had risen above 1+ε while Ât < 0, the unclipped term would be selected and the policy would be penalized in full, pushing the ratio back down.)
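A quick numeric check of this behaviour, sweeping the ratio (with the example's Â = -2.1) to show where the clipped term takes over and the objective flattens:

```python
import numpy as np

eps, A = 0.2, -2.1  # clipping range [0.8, 1.2], negative advantage from the example
for r in [1.3, 1.0, 0.8, 0.7, 0.5]:
    unclipped = r * A
    clipped = np.clip(r, 1 - eps, 1 + eps) * A
    term = min(unclipped, clipped)
    # Below 1 - eps the clipped constant wins, so the gradient w.r.t. the ratio is zero.
    print(f"r={r:.2f}  unclipped={unclipped:.2f}  clipped={clipped:.2f}  min={term:.2f}")
```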
The policy is gently discouraged from picking "Sydney".
Conversely, if the LLM had generated "Canberra" (high reward, positive advantage), PPO would encourage increasing its probability, with the increase clipped at (1+ε)Ât.
GRPO (Group Relative Policy Optimization) eliminates the learned critic V(s) entirely. This reduces memory and computation.
For each prompt q, the current policy generates a group of G responses {o1, ..., oG}. For each response oi, its reward R(oi) is standardized relative to the group's rewards: Â(oi) = (R(oi) - μG) / (σG + ϵnorm), where μG is the mean reward of the group, σG is the standard deviation of rewards in the group, and ϵnorm is a small constant for numerical stability. This advantage is typically applied to all tokens in the response oi.
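A minimal sketch of this group standardization; the three answers and their rewards are illustrative assumptions:

```python
import numpy as np

def group_relative_advantages(rewards, eps_norm=1e-6):
    """GRPO-style advantage: standardize each response's reward within its group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps_norm)

# Illustrative rewards for G = 3 sampled answers to "What is the capital of Australia?"
rewards = {"Canberra": 10.0, "Sydney": 2.0, "Melbourne": 2.0}
advantages = group_relative_advantages(list(rewards.values()))
for answer, adv in zip(rewards, advantages):
    print(f"A_hat({answer}) = {adv:+.2f}")
```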
The objective also includes a KL regularization term, β · KL(πθ || πref), which keeps the policy close to a reference policy πref.
The PPO-style clipped ratio is retained, computed per response as πθ(oi|prompt) / πθold(oi|prompt).
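Putting the pieces together, a rough sketch of a GRPO-style objective for one sampled response; the log-probabilities, β, and the single-sample KL estimate are illustrative assumptions (actual implementations may use a different KL estimator):

```python
import numpy as np

def grpo_objective(logp_new, logp_old, logp_ref, advantage, eps=0.2, beta=0.04):
    """Sketch: clipped response-level ratio term minus a KL penalty toward pi_ref.

    logp_* are summed log-probabilities of one sampled response o_i under the
    current, old, and reference policies; eps and beta are illustrative values.
    """
    ratio = np.exp(logp_new - logp_old)     # pi_theta(oi|q) / pi_theta_old(oi|q)
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    kl_estimate = logp_new - logp_ref       # crude single-sample estimate of KL(pi_theta || pi_ref)
    return surrogate - beta * kl_estimate

# Illustrative numbers for one response with a positive group-relative advantage.
print(grpo_objective(logp_new=-4.8, logp_old=-5.0, logp_ref=-5.1, advantage=1.41))
```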
GRPO has a notable failure mode: if every response in a group receives the same reward (e.g., all correct or all incorrect), σG is zero, leading to zero advantage for all samples in that group and thus no learning signal. DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) addresses this and related issues with several modifications (sketched in code below):

- Clip-Higher (decoupled clipping): the clipping range for rt(θ) becomes asymmetric, e.g., [1-εlow, 1+εhigh], where εhigh (e.g., 0.28) > εlow (e.g., 0.2), giving low-probability tokens more room to grow.
- Dynamic Sampling: prompts whose groups have zero reward variance (σG = 0) are filtered out and resampled, so every batch carries a learning signal.
- Token-Level Loss: the policy-gradient loss is aggregated at the token level rather than per sequence.
DAPO typically builds on GRPO's critic-less group-relative advantage estimation but incorporates these enhancements.
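A minimal sketch of the first two mechanisms, assuming scalar response-level ratios and rewards; the function names are hypothetical:

```python
import numpy as np

def decoupled_clip_term(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Clip-Higher sketch: asymmetric clipping range [1 - eps_low, 1 + eps_high]."""
    clipped = np.clip(ratio, 1 - eps_low, 1 + eps_high)
    return min(ratio * advantage, clipped * advantage)

def keep_group(rewards, tol=1e-8):
    """Dynamic Sampling sketch: drop prompts whose group rewards have ~zero variance."""
    return float(np.std(np.asarray(rewards, dtype=float))) > tol

print(decoupled_clip_term(0.75, -1.0))  # -0.8: the lower bound (eps_low) still limits down-weighting
print(keep_group([10.0, 10.0, 10.0]))   # False: an all-correct group carries no learning signal
print(keep_group([10.0, 2.0, 2.0]))     # True: a mixed group is kept
```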
Example: suppose rt(θ) = 1.35 for "Canberra", with εlow = 0.2 and εhigh = 0.28, so the clipping range is [1-0.2, 1+0.28] = [0.8, 1.28]. Then clip(1.35, 0.8, 1.28) = 1.28, and the objective term becomes 1.28 * Â(o2). Standard PPO/GRPO (ε = 0.2) would clip at 1.2 * Â(o2). DAPO's "clip-higher" therefore allows a larger update (1.28 * Â(o2) vs 1.2 * Â(o2)), more strongly reinforcing this correct, high-advantage token, especially if it was initially unlikely.
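The same arithmetic as a standalone check (the advantage value 1.0 stands in for Â(o2)):

```python
import numpy as np

ratio, adv = 1.35, 1.0  # illustrative positive advantage for "Canberra"
dapo = min(ratio * adv, np.clip(ratio, 1 - 0.2, 1 + 0.28) * adv)  # clip-higher range [0.8, 1.28]
ppo = min(ratio * adv, np.clip(ratio, 1 - 0.2, 1 + 0.2) * adv)    # symmetric eps = 0.2
print(dapo, ppo)  # 1.28 1.2
```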
The journey from PPO to GRPO to DAPO shows an evolution driven by the need to apply RL effectively to increasingly large and complex models, especially LLMs, viewed here through the lens of the same running example.
| Feature | PPO | GRPO | DAPO |
|---|---|---|---|
| Critic Usage | Yes (learned V(s) estimates the value of "What is capital of Aus? ...token_sequence") | No (critic-less) | No (critic-less) |
| Advantage Estimation for "Canberra" | GAE: uses the reward for "Canberra" and V(s) from the critic. Token-level. | Group-relative: compares R("Canberra") to R("Sydney"), R("Melbourne"). Sequence-level. | Group-relative (like GRPO), but often applied with a token-level loss. |
| Clipping for "Canberra" (if rt(θ) = 1.35, Â > 0) | Symmetric (ε=0.2): clip(1.35, 0.8, 1.2)Â = 1.2Â | Symmetric (ε=0.2): clip(1.35, 0.8, 1.2)Â = 1.2Â | Decoupled (εhigh=0.28): clip(1.35, 0.8, 1.28)Â = 1.28Â (allows a larger increase) |
| Handling Homogeneous Rewards (e.g., all outputs "Canberra", R=10) | Critic still provides value estimates; GAE can be non-zero if V(s) differs. | Zero-advantage problem: σG=0, so Â=0 for all; no learning signal from this batch. | Dynamic sampling: filters out this batch and replaces it with a more diverse one. |
| Primary Stability/Exploration Mechanisms | Clipping, entropy bonus (optional) | Clipping, KL regularization, group normalization | Decoupled clipping, dynamic sampling, token-level loss |
| Primary Application Focus | General RL | LLM fine-tuning (general) | Advanced LLM reasoning, mitigating specific LLM RL issues |
PPO, GRPO, and DAPO represent a significant lineage of policy optimization algorithms. Using a consistent example like an LLM answering a factual question, we can see how PPO provides a robust foundation with its critic-based advantage. GRPO adapts these principles for resource-constrained LLM training by introducing a critic-less, group-based advantage, simplifying computation. DAPO further refines this with specialized techniques like decoupled clipping and dynamic sampling to tackle nuanced challenges in LLM training, such as maintaining exploration and improving data efficiency when rewards might be homogeneous or certain correct tokens are initially rare. Understanding their core mechanisms and evolutionary path is key to applying them effectively.