PPO, GRPO, & DAPO: Core Concepts (Unified Example)

I. Introduction to Policy Optimization

In Reinforcement Learning (RL), an agent learns to make decisions by interacting with an environment to maximize cumulative rewards. The agent's strategy is its policy (π).

Policy Optimization algorithms directly learn or improve this policy. Instead of just learning values for states/actions, they find policy parameters (θ) that yield the highest rewards. PPO, GRPO, and DAPO are advanced policy optimization algorithms, particularly relevant for complex tasks like training Large Language Models (LLMs).


II. Proximal Policy Optimization (PPO)

Core Idea of PPO: Improve the policy with updates that are not too large (to avoid performance collapse) and not too small (to ensure progress). PPO aims for stable and reliable policy improvement.

A. Key Concepts in PPO

Why clipping? It prevents the new policy from moving too far from the old policy in a single update, ensuring stability.
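
For reference, the clipped surrogate that drives the update can be written in a few lines. The following is a minimal NumPy sketch of the per-token objective, not tied to any particular RL library; the function name is ours, and ε = 0.2 is the conventional default. The worked example in the next subsection plugs concrete numbers into exactly this expression.

```python
import numpy as np

def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """Per-token PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    `ratio` is pi_theta(a|s) / pi_theta_old(a|s) and `advantage` is the
    estimated advantage A_hat; both may be scalars or NumPy arrays.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the minimum makes the bound pessimistic: the policy never gains
    # by moving the ratio far outside [1 - eps, 1 + eps] in a single update.
    return np.minimum(unclipped, clipped)
```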

B. Example: LLM Answering "Capital of Australia?" with PPO

Scenario: LLM generating an answer to "What is the capital of Australia?"

  1. Prompt (Initial State): "What is the capital of Australia?"
  2. Data Collection (Rollout): The LLM (policy πθold) generates a sequence of tokens. Let's say it generates "Sydney".
    • State s0: "What is the capital of Australia?"
    • Action a0: Token "Sydney"
    • Reward r0: A reward model evaluates "Sydney" in this context. Let's say it gives a low reward (e.g., 2/10) because it's incorrect.
    • Next State s1: "What is the capital of Australia? Sydney" (for simplicity, treated here as a completed sequence).
  3. Value Estimation (Critic):
    • The critic V(s0) estimates the expected future reward from the prompt. It might have a moderate value if the LLM sometimes gets it right.
    • The critic V(s1) (value of the state *after* generating "Sydney") would likely be low, as "Sydney" is a poor completion.
  4. Advantage Estimation (Â0 for token "Sydney"): Using a simplified, single-step GAE for clarity: Â0 ≈ r0 + γV(s1) - V(s0). With r0 = 2, V(s1) = 1 (low, since "Sydney" is a poor completion), V(s0) = 5 (moderate, since the model sometimes answers correctly), and discount γ = 0.9, we get Â0 = 2 + 0.9×1 - 5 = -2.1 (negative advantage). This indicates "Sydney" was a worse-than-average choice.
  5. Probability Ratio r0(θ) = πθ("Sydney"|s0) / πθold("Sydney"|s0): Because the advantage is negative, the new policy πθ will try to decrease the probability of generating "Sydney" given the prompt, so πθ("Sydney"|s0) < πθold("Sydney"|s0) and r0(θ) < 1.
  6. Clipped Objective in Action: Let ε = 0.2. Suppose r0(θ) = 0.7 (policy wants to reduce probability by 30%). Â0 is negative (-2.1).
    • Term 1: r0(θ)Â0 = 0.7 * (-2.1) = -1.47.
    • Term 2: clip(r0(θ), 1-ε, 1+ε)Â0 = clip(0.7, 0.8, 1.2)Â0 = 0.8 * (-2.1) = -1.68.
    • The objective takes min(-1.47, -1.68) = -1.68, i.e., the clipped term. Because the ratio has already fallen below 1-ε = 0.8, this clipped term is constant with respect to θ, so it contributes no gradient: the policy is discouraged from picking "Sydney", but it is not pushed to reduce its probability any further in this single update. Conversely, if the LLM had generated "Canberra" (high reward, positive advantage), PPO would increase its probability, but the gain would be capped at (1+ε)Ât once the ratio exceeded 1.2. In both directions the size of one update is bounded, which is exactly what the clipping is for.
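
To make the arithmetic in steps 4-6 easy to check, here is a small numeric sketch. Every number in it (the reward, the value estimates, the discount γ = 0.9, the ratio 0.7, and ε = 0.2) is one of the illustrative values assumed above, not the output of a real reward model or critic.

```python
import numpy as np

# Illustrative numbers from the "Sydney" walkthrough (all assumed).
reward, v_s0, v_s1, gamma = 2.0, 5.0, 1.0, 0.9
advantage = reward + gamma * v_s1 - v_s0                 # one-step advantage: -2.1

ratio, eps = 0.7, 0.2                                    # proposed probability ratio
unclipped = ratio * advantage                            # 0.7 * -2.1 ≈ -1.47
clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage   # 0.8 * -2.1 = -1.68
objective = min(unclipped, clipped)                      # -1.68: the clipped term wins

print(round(advantage, 2), round(unclipped, 2), round(clipped, 2), round(objective, 2))
# -2.1 -1.47 -1.68 -1.68
```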

C. PPO Strengths & Limitations


III. Group Relative Policy Optimization (GRPO)

Core Idea of GRPO: Simplify PPO for LLMs by removing the critic network. Advantage is estimated by comparing a response's reward to the average reward of a "group" of responses generated for the same input prompt.
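
A minimal sketch of this group-relative advantage is below, assuming the common normalization Âi = (R(oi) - μG) / (σG + ϵnorm) with a population standard deviation and a small stabilizing constant ϵnorm; the function name and the ϵnorm value are illustrative.

```python
import numpy as np

def group_relative_advantages(rewards, eps_norm=1e-4):
    """GRPO-style advantages: z-score each response's reward within its group.

    `rewards` holds one scalar reward per sampled response to the same prompt;
    every token of response i later shares the same advantage A[i].
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps_norm)

# The group sampled for "What is the capital of Australia?" in the example:
print(group_relative_advantages([2.0, 10.0, 3.0]).round(2))
# [-0.84  1.4  -0.56]  (the example below rounds sigma_G to 3.5, giving -0.86, +1.43, -0.57)
```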

A. Key Concepts in GRPO

B. Example: LLM Answering "Capital of Australia?" with GRPO

Scenario: LLM generating an answer to "What is the capital of Australia?"

  1. Prompt: "What is the capital of Australia?"
  2. Group Sampling (G=3 responses generated by current policy πθold):
    • o1: "Sydney" (Reward R(o1)=2 from a reward model)
    • o2: "Canberra" (Reward R(o2)=10)
    • o3: "Melbourne" (Reward R(o3)=3)
  3. Group Stats: Mean reward μG = (2+10+3)/3 = 5. Standard deviation σG ≈ 3.5 (ϵnorm, the small constant added to σG for numerical stability, is negligible here).
  4. Group-Relative Advantages:
    • Â(o1) for "Sydney" = (2-5)/3.5 ≈ -0.86 (negative advantage)
    • Â(o2) for "Canberra" = (10-5)/3.5 ≈ +1.43 (positive advantage)
    • Â(o3) for "Melbourne" = (3-5)/3.5 ≈ -0.57 (negative advantage)
    These advantages are assigned at the sequence level: every token in "Sydney" receives the same advantage of -0.86.
  5. Probability Ratio (rθ(oi)): This is πθ(oi|prompt) / πθold(oi|prompt).
  6. Policy Update:
    • For o1 ("Sydney"): Â(o1) is negative. The policy πθ will be updated to decrease the probability of generating "Sydney". The update is clipped.
    • For o2 ("Canberra"): Â(o2) is positive. The policy πθ will be updated to increase the probability of generating "Canberra". The update is clipped.
    • For o3 ("Melbourne"): Â(o3) is negative. The policy πθ will be updated to decrease the probability of generating "Melbourne". The update is clipped.
    The KL term helps ensure πθ doesn't stray too far from a reference policy (e.g., the SFT model).
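
Putting steps 4-6 together, a simplified sequence-level sketch might look like the following. Real GRPO implementations work token by token and use a particular KL estimator; here the per-sample log-probability gap to the reference model stands in as a crude KL proxy, and the coefficient β = 0.04 is an assumed value.

```python
import numpy as np

def grpo_sequence_objective(logp_new, logp_old, logp_ref, advantages,
                            eps=0.2, beta=0.04):
    """Simplified, sequence-level GRPO surrogate for one group of responses.

    logp_* are the summed log-probabilities of each full response under the
    current policy, the rollout policy (pi_theta_old), and a frozen reference
    (e.g., SFT) model; `advantages` are the group-relative advantages above.
    """
    logp_new, logp_old, logp_ref, advantages = map(
        np.asarray, (logp_new, logp_old, logp_ref, advantages))
    ratio = np.exp(logp_new - logp_old)                       # r_theta(o_i)
    clipped_surrogate = np.minimum(
        ratio * advantages,
        np.clip(ratio, 1 - eps, 1 + eps) * advantages)        # PPO-style clip
    kl_penalty = logp_new - logp_ref                          # crude per-sample KL proxy
    return (clipped_surrogate - beta * kl_penalty).mean()     # maximize this
```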

C. GRPO Strengths & Limitations


IV. Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO)

Core Idea of DAPO: Refine GRPO for LLMs by addressing issues like entropy collapse and the zero-advantage problem, using techniques like decoupled clipping and dynamic sampling. It often uses token-level advantages.

A. Key Innovations in DAPO

DAPO typically builds on GRPO's critic-less, group-relative advantage estimation and adds three main enhancements, each illustrated in the example below: decoupled ("clip-higher") clipping, dynamic sampling of prompts, and a token-level policy-gradient loss.
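
The first of these, decoupled ("clip-higher") clipping, differs from the symmetric PPO/GRPO clip only in its asymmetric range. A minimal sketch, using the εlow = 0.2 and εhigh = 0.28 values from the example below:

```python
import numpy as np

def decoupled_clip_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """DAPO-style "clip-higher": asymmetric clip range [1 - eps_low, 1 + eps_high].

    Raising only the upper bound lets low-probability but high-advantage tokens
    (e.g., a rare correct answer) be reinforced more strongly, while the lower
    bound still limits how hard any token can be suppressed in one update.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantage
    return np.minimum(unclipped, clipped)

# "Canberra" token from the example: ratio 1.35, advantage +1.43
print(round(decoupled_clip_objective(1.35, 1.43), 2))   # 1.83  (1.28 * 1.43)
# A symmetric eps = 0.2 clip would have capped this at 1.2 * 1.43 ≈ 1.72
```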

B. Example: LLM Answering "Capital of Australia?" with DAPO

Scenario: LLM generating an answer to "What is the capital of Australia?"

  1. Prompt: "What is the capital of Australia?"
  2. Group Sampling (G=3, by πθold):
    • o1: "Sydney" (Reward R(o1)=2)
    • o2: "Canberra" (Reward R(o2)=10)
    • o3: "Melbourne" (Reward R(o3)=3)
  3. Group Stats & Advantages (as in GRPO): μG=5, σG≈3.5.
    • Â(o1) ≈ -0.86
    • Â(o2) ≈ +1.43
    • Â(o3) ≈ -0.57
    These advantages are applied at the token level for DAPO's loss calculation. So, each token in "Canberra" gets an advantage of +1.43.
  4. Dynamic Sampling in Action: This group (rewards 2, 10, 3) has non-zero reward variance (σG > 0), so it is kept. Suppose another prompt, e.g., "What is 1+1?", produced three responses o4: "2" (R=10), o5: "Two" (R=10), and o6: "II" (R=10). There σG = 0, so every advantage would be 0 and the group would contribute no gradient. Dynamic Sampling filters out such prompts and their groups, replacing them with prompts that yield more diverse rewards, so that every batch carries an effective learning signal (a sketch of this filter follows the list).
  5. Decoupled Clipping ("Clip-Higher") in Action (Focus on o2: "Canberra"): Assume the token "Canberra" (or its constituent tokens) had a relatively low probability under πθold, but it's the correct, high-reward answer. The advantage Â(o2) ≈ +1.43 is positive.
    • The new policy πθ aims to significantly increase the probability of "Canberra". Suppose this leads to a token probability ratio rt(θ) = 1.35 for "Canberra".
    • DAPO uses εlow=0.2, εhigh=0.28. Clipping range: [1-0.2, 1+0.28] = [0.8, 1.28].
    • The clipped ratio is clip(1.35, 0.8, 1.28) = 1.28.
    • The update for "Canberra" tokens is based on 1.28 * Â(o2). Standard PPO/GRPO (ε=0.2) would clip at 1.2 * Â(o2). DAPO's "clip-higher" allows a larger update (1.28 * Â(o2) vs 1.2 * Â(o2)), more strongly reinforcing this correct, high-advantage token, especially if it was initially unlikely.
  6. Token-Level Policy Gradient Loss: The loss is computed for each token in each response using its assigned advantage (e.g., all tokens in "Canberra" use Â(o2)). These token losses are then averaged across all tokens in the batch.
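
The other two mechanics, the dynamic-sampling filter from step 4 and the token-level averaging from step 6, can be sketched together. The tolerance, group layout, and function names below are illustrative assumptions rather than the reference DAPO implementation.

```python
import numpy as np

def keep_group(rewards, tol=1e-6):
    """Dynamic sampling filter: drop a prompt whose sampled responses all earn
    the same reward, since sigma_G = 0 makes every advantage 0 (no gradient)."""
    return float(np.std(rewards)) > tol

def dapo_token_objective(token_logratios, seq_advantages,
                         eps_low=0.2, eps_high=0.28):
    """Token-level surrogate for one group: each token inherits its response's
    group-relative advantage, and the clipped per-token terms are averaged
    over all tokens in the group rather than per sequence."""
    per_token = []
    for logratio, adv in zip(token_logratios, seq_advantages):
        ratio = np.exp(np.asarray(logratio, dtype=float))
        clipped = np.clip(ratio, 1 - eps_low, 1 + eps_high) * adv
        per_token.append(np.minimum(ratio * adv, clipped))
    return np.concatenate(per_token).mean()    # maximize this

# The example group (rewards 2, 10, 3) is kept; an all-correct group is not.
print(keep_group([2, 10, 3]), keep_group([10, 10, 10]))   # True False
```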

C. DAPO Strengths & Considerations


V. Comparative Overview & Evolution

The progression from PPO to GRPO to DAPO reflects an evolution driven by the need to apply RL effectively to ever larger and more complex models, especially LLMs. The comparison below uses the same "Capital of Australia?" example as a lens.

Key Differences at a Glance (Unified Example: "Capital of Australia?"):

  • Critic Usage
    PPO: Yes; a learned V(s) estimates the value of "What is the capital of Australia? ...token_sequence".
    GRPO: No (critic-less).
    DAPO: No (critic-less).
  • Advantage Estimation for "Canberra"
    PPO: GAE, using the reward for "Canberra" and V(s) from the critic; token-level.
    GRPO: Group-relative, comparing R("Canberra") to R("Sydney") and R("Melbourne"); sequence-level.
    DAPO: Group-relative (as in GRPO), but typically applied at the token level in the loss.
  • Clipping for "Canberra" (if rt(θ) = 1.35, Â > 0)
    PPO: Symmetric (ε = 0.2): clip(1.35, 0.8, 1.2)Â = 1.2Â.
    GRPO: Symmetric (ε = 0.2): clip(1.35, 0.8, 1.2)Â = 1.2Â.
    DAPO: Decoupled (εhigh = 0.28): clip(1.35, 0.8, 1.28)Â = 1.28Â, allowing a larger increase.
  • Handling Homogeneous Rewards (e.g., all outputs "Canberra", R = 10)
    PPO: The critic still provides value estimates, so GAE can be non-zero if V(s) differs.
    GRPO: Zero-advantage problem: σG = 0, so Â = 0 for every response; the batch gives no learning signal.
    DAPO: Dynamic Sampling filters out this batch and replaces it with a more diverse one.
  • Primary Stability/Exploration Mechanisms
    PPO: Clipping; entropy bonus (optional).
    GRPO: Clipping, KL regularization, group normalization.
    DAPO: Decoupled clipping, dynamic sampling, token-level loss.
  • Primary Application Focus
    PPO: General RL.
    GRPO: LLM fine-tuning (general).
    DAPO: Advanced LLM reasoning, mitigating specific LLM RL issues.

VI. Conclusion

PPO, GRPO, and DAPO represent a significant lineage of policy optimization algorithms. Using a consistent example like an LLM answering a factual question, we can see how PPO provides a robust foundation with its critic-based advantage. GRPO adapts these principles for resource-constrained LLM training by introducing a critic-less, group-based advantage, simplifying computation. DAPO further refines this with specialized techniques like decoupled clipping and dynamic sampling to tackle nuanced challenges in LLM training, such as maintaining exploration and improving data efficiency when rewards might be homogeneous or certain correct tokens are initially rare. Understanding their core mechanisms and evolutionary path is key to applying them effectively.