Here’s this week’s Ritual Research Digest, a newsletter covering the latest in the world of LLMs and the intersection of Crypto x AI. With hundreds of papers published weekly, staying current with the latest is impossible. We do the reading so you don’t have to.
RLVE: Scaling Up Reinforcement Learning for Language Models with Adaptive Verifiable Environments The authors introduce RLVE, a post-training method built on "adaptive verifiable environments" that generate problems matched to the model's current skill level.
RLVE was tested by training OpenThinker3-1.5B with RLVE-Gym, a collection of 400 different learning environments. Training with RLVE led to a 3.37% improvement on reasoning benchmarks while using 3x less compute.
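To make the idea concrete, here is a minimal sketch of an adaptive verifiable environment on a toy arithmetic task. The class name, rolling window, and difficulty rule are our illustrative assumptions, not the paper's implementation:

```python
import random

class AdaptiveArithmeticEnv:
    """Toy verifiable environment in the spirit of RLVE: problems are
    generated procedurally, answers are checked exactly, and difficulty
    drifts toward the model's current skill level."""

    def __init__(self, target_success=0.5):
        self.difficulty = 1       # here: number of digits in operands
        self.target_success = target_success
        self.recent = []          # rolling window of pass/fail results

    def generate(self):
        hi = 10 ** self.difficulty
        a, b = random.randrange(hi), random.randrange(hi)
        return f"What is {a} + {b}?", a + b

    def verify(self, model_answer, gold):
        ok = model_answer.strip() == str(gold)
        self.recent = (self.recent + [ok])[-50:]
        self._adapt()
        return 1.0 if ok else 0.0  # verifiable reward for RL

    def _adapt(self):
        if len(self.recent) < 20:
            return
        rate = sum(self.recent) / len(self.recent)
        # Too easy -> harder problems; too hard -> easier ones.
        if rate > self.target_success:
            self.difficulty += 1
        elif self.difficulty > 1:
            self.difficulty -= 1
```

The point of the adaptive rule is to keep the success rate near a target, so the policy always trains on problems at the edge of its ability.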
Too Good to be Bad: On the Failure of LLMs to Role-Play Villains This work finds that LLMs trained to be helpful and safe are ironically "too good" to role-play villains. Performance declines as characters grow darker, with models struggling most with deceitful and manipulative traits.
Safety alignment that makes models refuse harmful requests also prevents them from authentically simulating the morally complex characters needed for creative writing, games, and social science. The authors introduce a "Moral RolePlay" benchmark to measure this gap.
SSR: Socratic Self-Refine for Large Language Model Reasoning This work introduces SSR, which lets LLMs audit their own reasoning by breaking answers into smaller pieces, identifying which specific steps are shaky, and fixing them.
SSR decomposes a model's response into smaller "Socratic steps": sub-question/sub-answer pairs that can be checked one at a time, so specific errors in the reasoning chain get fixed rather than the whole answer. Across 5 benchmarks and 3 LLMs, this targeted approach outperforms methods that blindly self-correct.
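A minimal sketch of the loop, assuming a generic `llm(prompt) -> str` callable; the prompts are our paraphrase of the idea, not the paper's exact prompts:

```python
def socratic_self_refine(llm, question, answer, max_rounds=3):
    """Decompose an answer into sub-question/sub-answer steps,
    re-check each step independently, and repair only the shaky ones."""
    for _ in range(max_rounds):
        # 1. Decompose the answer into Socratic steps.
        steps = llm(
            f"Question: {question}\nAnswer: {answer}\n"
            "Break this answer into numbered (sub-question, sub-answer) "
            "pairs, one per line."
        )
        # 2. Re-solve each sub-question from scratch and flag mismatches.
        shaky = llm(
            f"{steps}\nRe-answer each sub-question independently. "
            "List the numbers of any steps whose new answer disagrees, "
            "or reply NONE."
        )
        if "NONE" in shaky:
            return answer  # every step checks out
        # 3. Repair only the flagged steps, then rebuild the answer.
        answer = llm(
            f"Question: {question}\nSteps:\n{steps}\n"
            f"Steps {shaky} look unreliable. Fix them and give the "
            "corrected final answer."
        )
    return answer
```

Checking sub-answers individually is what makes the refinement targeted: a blind self-correct pass has to second-guess the entire chain at once.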
SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads? This work introduces a benchmark to test how well language models optimize code in real-world software projects. It includes 498 tasks from popular ML libraries.
Given a repository and a workload, models must find performance bottlenecks and fix them. Today's best models achieve less than 15% of expert speedups: they struggle to locate the right code to optimize, to reason about how functions interact, and to keep their edits bug-free.
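For intuition, here is a toy harness in the spirit of this kind of evaluation, assuming shell commands for the workload and test suite; the names and protocol are illustrative assumptions, not the benchmark's actual harness:

```python
import subprocess, time

def measure_speedup(workload_cmd, test_cmd, apply_patch):
    """Time a workload before and after a candidate patch, require the
    test suite to stay green, and report the resulting speedup ratio."""
    def timed_run():
        start = time.perf_counter()
        subprocess.run(workload_cmd, shell=True, check=True)
        return time.perf_counter() - start

    baseline = timed_run()
    apply_patch()  # the model-generated edit goes here
    # Correctness gate: a fast-but-broken patch scores zero.
    if subprocess.run(test_cmd, shell=True).returncode != 0:
        return 0.0
    return baseline / timed_run()  # >1.0 means the workload got faster
```

A model's score can then be normalized by the expert's speedup on the same task, which is the sense in which current models recover under 15% of expert gains.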
Follow us @ritualdigest for more on all things crypto x AI research, and @ritualnet to learn more about what Ritual is building.