thinking about how RL training ends up optimizing for the evals themselves, rather than for the behavior those evals are meant to measure