Today we will be talking about Group Relative Policy Optimization (GRPO) and Group Sequence Policy Optimization (GSPO).
These RL algorithms have taken the LLM post-training world by storm since DeepSeek's releases.
This post complements the PPO blog.
Group Relative Policy Optimization
Overview
PPO requires a critic network, known as the value head or value model. Training such a critic effectively means training a completely separate LLM to estimate the expected future reward for given states and actions.
Unfortunately, this is an extremely compute-intensive process. Therefore, DeepSeek asked, "How can we determine which sequences are good relative to all the other generated sequences?"
In GRPO, they:
Drop the value network but keep the reward model.
Sample multiple completions per prompt, score them, and compare each score against the group average (see the sketch after this list).
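To make the flow concrete, here is a minimal sketch of one GRPO sampling step. `policy.generate`, `reward_model.score`, and `grpo_update` are hypothetical placeholders rather than any specific library's API; only the shape of the loop matters.

```python
# Minimal sketch of one GRPO step. `policy`, `reward_model`, and `grpo_update`
# are hypothetical placeholders, not a real library API.
def grpo_step(policy, reward_model, prompt, group_size=8, eps=1e-6):
    # 1. Sample a group of G completions for the same prompt.
    outputs = [policy.generate(prompt) for _ in range(group_size)]

    # 2. Score each completion with the (kept) reward model.
    rewards = [reward_model.score(prompt, out) for out in outputs]

    # 3. Compare each score to the group: no value network needed.
    mean_r = sum(rewards) / len(rewards)
    std_r = (sum((r - mean_r) ** 2 for r in rewards) / len(rewards)) ** 0.5
    advantages = [(r - mean_r) / (std_r + eps) for r in rewards]

    # 4. Take a clipped policy-gradient step using the group-relative advantages.
    grpo_update(policy, prompt, outputs, advantages)
```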
Notation
The GRPO objective is:

$$\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left( \min\Big( r_{i,t}(\theta)\,\hat{A}_{i,t},\ \text{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_{i,t} \Big) - \beta\, D_{KL}\big[\pi_\theta \,\|\, \pi_{ref}\big] \right) \right]$$

Notation Explained
Let's break down the key notation in the GRPO formula:
$q$: The query/prompt given to the model, sampled from $P(Q)$.
$G$: Number of output sequences generated per query (the "group" size).
$o_i$: The $i$-th output sequence for query $q$.
$|o_i|$: Length of the $i$-th output sequence.
$o_{i,t}$: The $t$-th token in the $i$-th output sequence.
$o_{i,<t}$: All tokens before position $t$ in the $i$-th sequence.
$\pi_\theta(O \mid q)$: The policy generating outputs given query $q$.
$\pi_\theta(o_{i,t} \mid q, o_{i,<t})$: Probability of generating token $o_{i,t}$ under the current policy.
$\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})$: The same probability under the old policy.
$\hat{A}_{i,t}$: Advantage estimate for token $o_{i,t}$, derived from comparing the sequence reward to the group average.
$r_{i,t}(\theta)$: Probability ratio between the current and old policies.
where $r_{i,t}(\theta) = \dfrac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})}$ is the probability ratio between the current policy and the old policy.
GRPO's objective function $\mathcal{J}_{GRPO}(\theta)$ guides the model to generate better responses by comparing multiple outputs for the same prompt, calculating advantages based on group performance, and updating the policy in a controlled manner while preventing excessive deviation from its original behavior.
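Putting the objective into code, here is a minimal PyTorch-style sketch of the per-token GRPO loss. It assumes the per-token log-probabilities under the current, old, and reference policies have already been gathered into `(G, T)` tensors along with a padding mask; the `clip_eps` and `beta` defaults are just illustrative values.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages, mask, clip_eps=0.2, beta=0.04):
    """All tensors are (G, T); `advantages` holds A_hat_{i,t}, `mask` is 1 for real tokens."""
    ratio = torch.exp(logp_new - logp_old)                       # r_{i,t}(theta)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)

    # k3 estimator of KL(pi_theta || pi_ref), computed per token.
    log_ref_ratio = logp_ref - logp_new
    kl = torch.exp(log_ref_ratio) - log_ref_ratio - 1.0

    per_token = surrogate - beta * kl
    per_seq = (per_token * mask).sum(dim=1) / mask.sum(dim=1)    # 1/|o_i| * sum over t
    return -per_seq.mean()                                       # 1/G * sum over i, negated to minimize
```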
Advantage Surrogate
Now that we don't have a value function, we need another way to get the advantage estimate $\hat{A}_{i,t}$ for each token. In GRPO, this advantage is computed by comparing the reward of a sequence to the average reward of the group.
For every output $o_i$ with reward $r_i$, the normalized reward is
$$\tilde{r}_i = \frac{r_i - \text{mean}(r_{1:G})}{\text{std}(r_{1:G}) + \epsilon}$$
With outcome supervision, every token of output $i$ then shares this value as its advantage: $\hat{A}_{i,t} = \tilde{r}_i$.
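As a toy example (numbers made up for illustration): with $G = 4$ and rewards $r_{1:4} = (1, 0, 0, 1)$, we get $\text{mean}(r_{1:4}) = 0.5$ and $\text{std}(r_{1:4}) = 0.5$, so (ignoring the small $\epsilon$) the normalized rewards are $\tilde{r} = (1, -1, -1, 1)$: the two good completions get positive advantages and the two bad ones get negative advantages.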
KL divergence
The intuition behind this part of the objective is pretty straightforward. We don't want a drastic distribution shift away from the model we started with, so we want to keep $\pi_\theta$ close to the reference policy $\pi_{ref}$ and penalize large distribution shifts. GRPO uses the following unbiased estimator of the KL divergence:
$$D_{KL}(p \,\|\, q) = \mathbb{E}_{o \sim p}\left[\frac{q(o)}{p(o)} - \log\frac{q(o)}{p(o)} - 1\right]$$
where $p = \pi_\theta$ (the new policy) and $q = \pi_{ref}$ (the reference policy).
Now we simplify the first term:
$$\mathbb{E}_{o \sim p}\left[\frac{q(o)}{p(o)}\right] = \sum_o p(o) \cdot \frac{q(o)}{p(o)} = \sum_o q(o) = 1$$
Then, for the middle term:
$$\mathbb{E}_{p}\left[-\log\frac{q(o)}{p(o)}\right] = \sum_o p(o) \cdot \log\frac{p(o)}{q(o)} = D_{KL}(p \,\|\, q)$$
The constant $-1$ cancels the first term in expectation, so the estimator is unbiased: its expected value is exactly the standard KL divergence,
$$D_{KL}(p \,\|\, q) = \mathbb{E}_{o \sim p}\left[\log\frac{p(o)}{q(o)}\right]$$
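As a quick sanity check, here is a small Monte Carlo sketch with made-up distributions showing that the per-sample estimator above averages out to the exact KL divergence:

```python
import math
import random

# Toy distributions over 3 outcomes (numbers made up for illustration).
p = [0.7, 0.2, 0.1]   # pi_theta  (samples come from p)
q = [0.5, 0.3, 0.2]   # pi_ref

exact_kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Monte Carlo average of the estimator  q/p - log(q/p) - 1  under o ~ p.
random.seed(0)
n = 200_000
total = 0.0
for _ in range(n):
    o = random.choices(range(3), weights=p)[0]
    ratio = q[o] / p[o]
    total += ratio - math.log(ratio) - 1.0

print(exact_kl, total / n)   # the two numbers should be close
```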
Process Supervision
The GRPO setup above gives a single reward only at the end of each output, which can be inefficient and provide weak signal for long reasoning chains.
Thus, process supervision provides a reward for each reasoning step.
Let's say that we have $G$ sampled outputs $o_1, o_2, \ldots, o_G$ and a process reward model scores every reasoning step, giving the rewards
$$R = \left\{ \{r_1^{\text{index}(1)}, \ldots, r_1^{\text{index}(K_1)}\}, \ldots, \{r_G^{\text{index}(1)}, \ldots, r_G^{\text{index}(K_G)}\} \right\}$$
where $\text{index}(j)$ is the end token index of the $j$-th step and $K_i$ is the total number of steps in the $i$-th output. We normalize each step reward using the mean and standard deviation over all of $R$:
$$\tilde{r}_i^{\,\text{index}(j)} = \frac{r_i^{\,\text{index}(j)} - \text{mean}(R)}{\text{std}(R) + \epsilon}$$
Then the advantage of each token is the cumulative sum of the normalized step rewards from that token onward. This means earlier tokens accumulate credit from every step that follows them (rewarding the model for going down that path), while later tokens accumulate credit from fewer remaining steps:
$$\hat{A}_{i,t} = \sum_{\text{index}(j) \geq t} \tilde{r}_i^{\,\text{index}(j)}$$
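Here is a small sketch of that bookkeeping for a single output, assuming we already have the normalized step rewards and the token index where each step ends (all numbers are made up for illustration):

```python
# For one output: normalized step rewards and the token index where each step ends.
# Numbers are made up for illustration.
step_rewards = [0.8, -0.3, 1.1]      # normalized rewards for steps 1, 2, 3
step_end_idx = [4, 9, 15]            # index(1), index(2), index(3)
num_tokens = 16

# A_hat_{i,t} = sum of normalized rewards of all steps ending at or after token t.
advantages = [
    sum(r for r, end in zip(step_rewards, step_end_idx) if end >= t)
    for t in range(num_tokens)
]

# Tokens in the first step (t <= 4) get 0.8 - 0.3 + 1.1 = 1.6,
# tokens in the second step get -0.3 + 1.1 = 0.8, tokens in the last step get 1.1.
print(advantages)
```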
GRPO Gradient
Now let's derive the gradient of the GRPO objective $\mathcal{J}_{GRPO}(\theta)$ defined above with respect to $\theta$.
Step 1 — Policy term (REINFORCE)
I lowkey don't know how they get to this $\hat{A}_{i,t}$ term, but that will be a problem for another blog :)
For the advantage term, we can apply the REINFORCE (score-function) trick, which lets us compute the gradient of an expectation using the gradient of the log probability:
$$\nabla_\theta\, \mathbb{E}_{o \sim \pi_\theta}\left[f(o)\right] = \mathbb{E}_{o \sim \pi_\theta}\left[f(o)\, \nabla_\theta \log \pi_\theta(o)\right]$$
Applied per token (and ignoring clipping, so the ratio equals 1 at $\theta = \theta_{old}$), the advantage term contributes $\hat{A}_{i,t}\, \nabla_\theta \log \pi_\theta(o_{i,t} \mid q, o_{i,<t})$ to the gradient.
To convert the expectation from $\pi_{ref}$ to $\pi_\theta$, we can use importance sampling:
$$\mathbb{E}_{o \sim \pi_{ref}}\left[\nabla_\theta \log \pi_\theta(o \mid q)\right] = \mathbb{E}_{o \sim \pi_\theta}\left[\frac{\pi_{ref}(o \mid q)}{\pi_\theta(o \mid q)}\, \nabla_\theta \log \pi_\theta(o \mid q)\right]$$
(Notice how, if we expand the right-hand sum, the $\pi_\theta$ cancels out and we recover the previous expected value.)
Let's call the coefficient $w(o) = \dfrac{\pi_{ref}(o \mid q)}{\pi_\theta(o \mid q)}$. Now, since $\mathbb{E}_{\pi_\theta}\left[1 \cdot \nabla_\theta \log \pi_\theta\right] = 0$, we can replace $w(o)$ by $w(o) - 1$ inside the expectation:
$$\mathbb{E}_{\pi_\theta}\left[w(o)\, \nabla_\theta \log \pi_\theta\right] = \mathbb{E}_{\pi_\theta}\left[(w(o) - 1)\, \nabla_\theta \log \pi_\theta\right]$$
This substitution doesn't change the expected value, but it can reduce the variance of the gradient estimate; the $-1$ acts as a baseline. Therefore, we can rewrite the gradient of the per-token KL penalty as
$$\nabla_\theta\, D_{KL}\left[\pi_\theta \,\|\, \pi_{ref}\right] = -\,\mathbb{E}_{o \sim \pi_\theta}\left[\left(\frac{\pi_{ref}(o \mid q)}{\pi_\theta(o \mid q)} - 1\right)\nabla_\theta \log \pi_\theta(o \mid q)\right]$$
Since the objective subtracts $\beta\, D_{KL}$, combining this with the policy term gives the per-token gradient coefficient $\hat{A}_{i,t} + \beta\left(\frac{\pi_{ref}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - 1\right)$, which is the form reported in the DeepSeekMath paper.
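To convince yourself of the identity, here is a tiny autograd check on a toy 3-way categorical policy (the logits and the sampled token are made up): the gradient of the per-token estimator matches $-\left(\pi_{ref}/\pi_\theta - 1\right)\nabla_\theta \log \pi_\theta$.

```python
import torch

theta = torch.tensor([0.2, -0.5, 1.0], requires_grad=True)   # policy logits (toy values)
ref_logits = torch.tensor([0.0, 0.3, -0.2])                   # frozen reference logits
tok = 2                                                       # pretend this token was sampled

log_pi = torch.log_softmax(theta, dim=-1)[tok]
log_ref = torch.log_softmax(ref_logits, dim=-1)[tok]

ratio = torch.exp(log_ref - log_pi)            # pi_ref / pi_theta
d_kl = ratio - torch.log(ratio) - 1.0          # per-token KL estimator
grad_autograd = torch.autograd.grad(d_kl, theta, retain_graph=True)[0]

grad_logp = torch.autograd.grad(log_pi, theta)[0]
grad_analytic = -(ratio.detach() - 1.0) * grad_logp

print(torch.allclose(grad_autograd, grad_analytic))   # True
```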
Group Sequence Policy Optimization
The problem with GRPO is that it compounds token-level noise: the importance ratio and the clipping are applied to every individual token, so noisy per-token ratio estimates inject a lot of variance into the gradient when we backprop a loss computed at the token level.
For example, if we're trying to optimize a policy to generate the response "The capital of France is Paris", token-level optimization might incorrectly penalize the token "Paris" if earlier tokens in the sequence had high variance in their probability estimates, leading to compounded errors in the gradient calculation.
Thus, instead of evaluating the importance ratio $r_{i,t}(\theta)$ token by token, GSPO defines the importance ratio at the sequence level, i.e. over the whole sampled response rather than individual tokens.
For a prompt $x$, we sample $G$ responses $y_1, \ldots, y_G$ with scores $r(x, y_i)$, and the advantage is the same group-relative quantity as in GRPO:
$$\hat{A}_i = \frac{r(x, y_i) - \text{mean}\left(\{r(x, y_j)\}_{j=1}^{G}\right)}{\text{std}\left(\{r(x, y_j)\}_{j=1}^{G}\right)}$$
The sequence-level importance ratio is the length-normalized likelihood ratio of the whole response:
$$s_i(\theta) = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{old}}(y_i \mid x)}\right)^{\frac{1}{|y_i|}} = \exp\left(\frac{1}{|y_i|}\sum_{t=1}^{|y_i|} \log\frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{old}}(y_{i,t} \mid x, y_{i,<t})}\right)$$
The $1/|y_i|$ normalization is crucial for two reasons. First, without it, multiplying many token probabilities together would push the importance ratio toward zero (or blow it up) exponentially as the sequence length increases. Second, it keeps $s_i(\theta)$ in a comparable numerical range across responses of different lengths, so a fixed clipping range behaves consistently and a few token-level likelihood changes don't cause wild swings in the sequence-level ratio.
The GSPO objective is then
$$\mathcal{J}_{GSPO}(\theta) = \mathbb{E}_{x \sim D,\ \{y_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(\cdot \mid x)}\left[\frac{1}{G}\sum_{i=1}^{G} \min\Big(s_i(\theta)\,\hat{A}_i,\ \text{clip}\big(s_i(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_i\Big)\right]$$
which mirrors the structure of the typical GRPO objective, just with sequence-level rather than token-level quantities.
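Here is a minimal sketch of the sequence-level ratio, computed in log space from per-token log-probabilities; the `(G, T)` shapes and the padding mask are assumptions about how the log-probs are stored:

```python
import torch

def gspo_sequence_ratio(logp_new, logp_old, mask):
    """logp_new, logp_old, mask: (G, T) tensors; returns s_i(theta) per response, shape (G,)."""
    # 1/|y_i| * sum_t log( pi_theta / pi_theta_old ), then exponentiate.
    log_ratio = ((logp_new - logp_old) * mask).sum(dim=1) / mask.sum(dim=1)
    return torch.exp(log_ratio)
```

Averaging inside the exponent keeps $s_i(\theta)$ on a comparable scale regardless of response length, which is what makes a single clipping range workable.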
GSPO Gradient
The main difference is how the two objectives weight the gradients of the token log-likelihoods. Because GSPO weights all tokens in a response equally, training is much more stable and depends on the model producing a good sequence rather than good individual tokens (there is a token-level objective variant, however, that allows finer-grained control).
Anyways, the GSPO gradient is very similar. Ignoring clipping, we have
$$\nabla_\theta\, \mathcal{J}_{GSPO}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} s_i(\theta)\,\hat{A}_i \cdot \frac{1}{|y_i|}\sum_{t=1}^{|y_i|} \nabla_\theta \log \pi_\theta(y_{i,t} \mid x, y_{i,<t})\right]$$
so every token in response $i$ shares the same weight $s_i(\theta)\,\hat{A}_i / |y_i|$, whereas in GRPO each token is scaled by its own ratio $r_{i,t}(\theta)$.
The GSPO authors report that GSPO trains more stably because of its lower-variance, sequence-level gradients, and that it achieves better training efficiency and benchmark performance. Interestingly, the fraction of clipped tokens in GSPO is much higher than in GRPO, since clipping acts on whole responses rather than individual tokens, yet it still learns more efficiently.