EnergyAI/qwen3-4b-agrpo-think-lr5e-7

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Apr 11, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

EnergyAI/qwen3-4b-agrpo-think-lr5e-7 is a 4 billion parameter Qwen3-based causal language model developed by EnergyAI, fine-tuned with Async GRPO. This model is specifically optimized for fill-in-the-middle multiple-choice questions (MCQ) in the energy domain, featuring an enabled 'thinking mode' during its training. It excels at verifying energy-related information by outputting answers in a specific \boxed{N} format, making it suitable for automated assessment tasks.

Loading preview...

Model Overview

EnergyAI/qwen3-4b-agrpo-think-lr5e-7 is a 4 billion parameter model built upon the Qwen3-4B architecture. It has been fine-tuned using the Async GRPO (Asynchronous Generalized Reinforcement Learning with Policy Optimization) algorithm, specifically leveraging TRL's AsyncGRPOTrainer. A key feature of this model's training is the enabled 'thinking mode' (enable_thinking=True), which likely contributes to its specialized performance.

Key Capabilities

  • Energy Domain Verification: Designed for fill-in-the-middle multiple-choice questions (MCQ) within the energy sector.
  • Structured Output: Outputs answers in a precise \boxed{N} format, where N corresponds to the option number, facilitating automated parsing and verification.
  • Reinforcement Learning Optimization: Trained with a reward function that grants +1.0 for correct answers, -0.5 for wrong answers, and -1.0 for no answer, indicating a strong focus on accuracy and response generation.

Training Details

The model was trained with a learning rate of 5e-7, a cosine scheduler, and a substantial effective batch size of 128 prompts per step. It underwent 2000 maximum steps with 9 generations per prompt and a maximum completion length of 4096 tokens. The training utilized FSDP2 parallelism across 4 GPUs, with vLLM TP=4 for inference, demonstrating a robust and scalable training setup.

Good For

  • Automated assessment of energy-related multiple-choice questions.
  • Applications requiring precise, structured answers for verification tasks.
  • Research into the effectiveness of Async GRPO and 'thinking mode' in specialized domain LLMs.