Post-training experiments, reward shaping, RL continuation, and benchmark deltas after the base model release.