laion/ablation-pymethods2test-shaped-45-8B
laion/ablation-pymethods2test-shaped-45-8B is an 8-billion-parameter language model based on a Qwen3-8B SFT and developed by laion. It is a reinforcement learning (RL) checkpoint from a shaped-reward ablation study, trained to improve code generation by maximizing the fraction of passing tests. It was trained with the SkyRL GRPO method on the DCAgent/exp_rpt_pymethods2test-large dataset, so it is aimed at code generation tasks where outputs are checked against tests.
Overview
laion/ablation-pymethods2test-shaped-45-8B is an 8-billion-parameter language model derived from a Qwen3-8B SFT base model. It is a specific checkpoint (global_step_45) from a reinforcement learning (RL) ablation study conducted by laion that focuses on a "shaped-reward" approach. Unlike models trained with a binary pass/fail reward, this model was optimized with a reward equal to the fraction of tests that pass (the pass ratio), so partially correct code still earns partial credit and correctness can improve incrementally.
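The difference between the two reward styles can be sketched in a few lines of Python. This is a minimal illustration, not the reward code used in the study: the function names and the handling of an empty test suite are assumptions made for the example.

```python
def binary_reward(passed: int, total: int) -> float:
    """All-or-nothing reward: 1.0 only when every test passes."""
    if total <= 0:
        return 0.0  # assumption: no tests means no reward
    return 1.0 if passed == total else 0.0


def pass_ratio_reward(passed: int, total: int) -> float:
    """Shaped reward: the fraction of tests that pass."""
    if total <= 0:
        return 0.0  # assumption: no tests means no reward
    return passed / total


# A completion that passes 3 of 4 tests gets no credit under the binary
# reward but partial credit under the pass-ratio reward.
print(binary_reward(3, 4))      # 0.0
print(pass_ratio_reward(3, 4))  # 0.75
```

Both functions agree at the extremes (0 of 4 and 4 of 4 tests passing); they differ only for partially correct outputs.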
Key Characteristics
- Base Model: Built upon laion/GLM-4_7-swesmith-sandboxes-with_tests-oracle_verified_120s-maxeps-131k-fixthink, which is a Qwen3-8B SFT.
- Training Method: Utilizes the SkyRL GRPO algorithm within a shaped-reward ablation study.
- Reward Function: Optimized for a "shaped pass-ratio" reward, meaning it learns to maximize the percentage of tests that pass, rather than just achieving an all-or-nothing success.
- Training Data: Trained on the DCAgent/exp_rpt_pymethods2test-large dataset.
- Checkpoint Selection: global_step_45 was chosen based on the best Exponential Moving Average (EMA) of the average raw reward over an 80-step training chain.
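The checkpoint-selection rule can be sketched as follows. This is a hedged illustration only: the smoothing factor `alpha`, the 1-indexed step numbering, and the reward values are all hypothetical, since the card does not state the EMA settings or publish the per-step rewards.

```python
def ema(values, alpha=0.5):
    """Exponential moving average, seeded with the first value."""
    smoothed = []
    current = None
    for v in values:
        current = v if current is None else alpha * v + (1 - alpha) * current
        smoothed.append(current)
    return smoothed


def best_step_by_ema(avg_rewards, alpha=0.5):
    """Return (step, ema_value) for the step with the highest smoothed reward.

    Steps are numbered from 1 here; that numbering is an assumption.
    """
    smoothed = ema(avg_rewards, alpha)
    best_index = max(range(len(smoothed)), key=smoothed.__getitem__)
    return best_index + 1, smoothed[best_index]


# Hypothetical per-step average raw rewards (not real training data).
rewards = [0.2, 0.4, 0.9, 0.3, 0.35]
step, value = best_step_by_ema(rewards)
print(step)  # 3
```

Smoothing before taking the maximum makes the choice less sensitive to a single noisy step than picking the raw best reward would be.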
Good For
This model is suited to code generation use cases where the goal is to pass a larger share of tests, not only to produce fully correct solutions. Because its pass-ratio reward gives credit for partial success, the training setup suggests it may fit iterative code development and testing workflows, where each round of generated code is run against a test suite.
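One way to see why partial credit matters for a group-based method such as GRPO is to compare group-relative advantages under the two reward styles. The sketch below uses the commonly described GRPO normalization (reward minus group mean, divided by group standard deviation); the exact variant used by SkyRL in this study is not stated in the card, and the `eps` value and reward numbers are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its sampling group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical completions for one prompt, none passing every test.
binary = [0.0, 0.0, 0.0, 0.0]    # all-or-nothing rewards
shaped = [0.25, 0.5, 0.75, 0.5]  # pass-ratio rewards for the same outputs

print(group_relative_advantages(binary))  # [0.0, 0.0, 0.0, 0.0]
print(group_relative_advantages(shaped))
```

With binary rewards, a group in which every completion fails yields zero advantage for all samples and therefore no learning signal. With pass-ratio rewards, the completion that passes more tests receives a positive advantage and the one that passes fewer receives a negative one.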