EnergyAI/qwen3-4b-agrpo-think-lr5e-7
EnergyAI/qwen3-4b-agrpo-think-lr5e-7 is a 4-billion-parameter causal language model based on Qwen3, developed by EnergyAI and fine-tuned with Async GRPO. It is optimized for fill-in-the-middle multiple-choice questions (MCQ) in the energy domain and was trained with 'thinking mode' enabled. It returns its answer in a fixed \boxed{N} format, which makes it suitable for automated assessment and verification tasks.
Model Overview
EnergyAI/qwen3-4b-agrpo-think-lr5e-7 is a 4-billion-parameter model built on the Qwen3-4B architecture. It was fine-tuned with Async GRPO, an asynchronous variant of Group Relative Policy Optimization (GRPO), using TRL's AsyncGRPOTrainer. Training ran with 'thinking mode' enabled (enable_thinking=True), so the model produces a reasoning trace before giving its final answer.
Key Capabilities
- Energy Domain Verification: Designed for fill-in-the-middle multiple-choice questions (MCQ) within the energy sector.
- Structured Output: Outputs answers in a precise \boxed{N} format, where N corresponds to the option number, facilitating automated parsing and verification.
- Reinforcement Learning Optimization: Trained with a reward function that grants +1.0 for correct answers, -0.5 for wrong answers, and -1.0 for no answer, so failing to answer is penalized more heavily than answering incorrectly.
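The reward scheme and output format described above can be sketched as a small scoring function. This is a minimal illustration, not the training code: the function names, the regular expression, and the handling of the `<think>...</think>` block are assumptions, and only the three reward values and the \boxed{N} format come from this card.

```python
import re
from typing import Optional

# Matches \boxed{3}, tolerating whitespace inside the braces.
BOXED_RE = re.compile(r"\\boxed\{\s*(\d+)\s*\}")


def extract_boxed_answer(completion: str) -> Optional[int]:
    # Assumption: reasoning is wrapped in <think>...</think>, so only the
    # text after the last closing tag is treated as the visible answer.
    visible = completion.split("</think>")[-1]
    matches = BOXED_RE.findall(visible)
    # Use the last boxed value in case the answer is restated.
    return int(matches[-1]) if matches else None


def mcq_reward(completion: str, correct_option: int) -> float:
    # Reward values taken from the card: +1.0 correct, -0.5 wrong, -1.0 no answer.
    predicted = extract_boxed_answer(completion)
    if predicted is None:
        return -1.0
    return 1.0 if predicted == correct_option else -0.5
```

The same `extract_boxed_answer` helper could be reused at inference time to turn a completion into an option number for automated grading.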
Training Details
The model was trained with a learning rate of 5e-7, a cosine scheduler, and an effective batch size of 128 prompts per step. Training ran for at most 2000 steps, with 9 generations per prompt and a maximum completion length of 4096 tokens. It used FSDP2 parallelism across 4 GPUs for training, with vLLM at tensor parallelism 4 (TP=4) for generation.
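To make these numbers concrete, the sketch below works out the rollout volume and the learning-rate curve they imply. It assumes each of the 128 prompts receives all 9 generations, and that the cosine schedule has no warmup and decays to zero; neither detail is stated in this card, and the names are illustrative.

```python
import math

PEAK_LR = 5e-7
MAX_STEPS = 2000
PROMPTS_PER_STEP = 128
GENERATIONS_PER_PROMPT = 9


def cosine_lr(step: int, peak_lr: float = PEAK_LR, max_steps: int = MAX_STEPS) -> float:
    # Assumption: plain cosine decay from peak_lr to 0 with no warmup phase.
    progress = min(max(step, 0), max_steps) / max_steps
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Completions sampled per optimizer step, and the upper bound over the full run.
completions_per_step = PROMPTS_PER_STEP * GENERATIONS_PER_PROMPT
max_total_completions = completions_per_step * MAX_STEPS

print(completions_per_step)    # 1152
print(max_total_completions)   # 2304000
```

Under these assumptions the learning rate falls to half its peak (2.5e-7) at step 1000 and reaches zero at step 2000.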
Good For
- Automated assessment of energy-related multiple-choice questions.
- Applications requiring precise, structured answers for verification tasks.
- Research into the effectiveness of Async GRPO and 'thinking mode' in specialized domain LLMs.