SlowGuess/ABForge-Qwen3-8B-Task2-RL

TEXT GENERATIONConcurrency Cost:1Model Size:8BQuant:FP8Ctx Length:32kTool Calling:SupportedPublished:Jun 11, 2026License:apache-2.0Architecture:Transformer Open Weights Cold

The SlowGuess/ABForge-Qwen3-8B-Task2-RL is an 8 billion parameter Qwen3-based language model developed by SlowGuess, fine-tuned with GRPO directly from Qwen/Qwen3-8B. This model specializes in generating detailed ablation experiment design plans (objective, setup, variants, protocols, metrics) based on a paper's context and a specified goal. It is specifically optimized for Task 2 within the ABForge pipeline, which focuses on paper-grounded ablation design.

Loading preview...

ABForge-Qwen3-8B-Task2-RL Overview

This model, developed by SlowGuess, is an 8 billion parameter Qwen3-based language model specifically designed for Task 2: Ablation Plan Generation within the ABForge framework. Unlike its supervised counterparts, this checkpoint is trained using GRPO (Gradient-based Policy Optimization) directly from Qwen/Qwen3-8B, without a supervised warm-start, optimizing for a fixed rubric-based reward.

Key Capabilities

  • Ablation Experiment Design: Generates comprehensive ablation experiment plans, including objectives, setup details, variant definitions, fixed protocols, and evaluation metrics.
  • Contextual Understanding: Processes a paper's context and a specific goal to formulate relevant ablation designs.
  • Reinforcement Learning Fine-tuning: Utilizes GRPO on the train/RL_task2_30K.jsonl dataset from SlowGuess/abforge-data, which is derived from CC-licensed research papers.

Good For

  • Researchers and developers needing automated assistance in designing ablation studies for scientific papers.
  • Generating structured and detailed experiment plans for model components or methodologies.
  • Integration into research pipelines requiring systematic ablation design based on textual input.

Evaluation of this model can be reproduced using the SlowGuess/Abforge_1 code, which includes scripts for generating predictions on the AblationBench dataset and scoring them against a Claude-judged rubric.