Group Relative Policy Optimization runs, reward design, checkpoint behavior, and post-training lessons.