EnergyAI/qwen3-4b-agrpo-think-lr3e-6 is a 4 billion parameter Qwen3-based causal language model fine-tuned by EnergyAI. It is specifically optimized for fill-in-the-middle multiple-choice questions (MCQ) within the energy domain, utilizing Async GRPO with an enabled 'thinking mode'. This model is designed for energy domain verification tasks, outputting answers in a specific boxed format.
## Overview
EnergyAI/qwen3-4b-agrpo-think-lr3e-6 is a specialized 4 billion parameter model built upon the Qwen3-4B architecture. It has been fine-tuned using Async GRPO (Asynchronous Group Relative Policy Optimization), notably with its 'thinking mode' enabled. This configuration is tailored for enhanced reasoning during task execution.
## Key Capabilities
- Energy Domain Verification: Optimized for fill-in-the-middle multiple-choice questions (MCQ) relevant to the energy sector.
- Structured Output: Designed to output answers in a precise `\boxed{N}` format, where N represents the option number.
- Reinforcement Learning: Leverages Async GRPO for training, incorporating a reward function that penalizes incorrect or missing answers and rewards correct ones.
- Thinking Mode: The `enable_thinking=True` setting during training suggests an internal reasoning process to improve answer accuracy.
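The reward scheme above can be sketched as follows. The exact reward function used in training is not published; the signature, regex, and reward values here are illustrative assumptions.

```python
import re

def mcq_reward(completion: str, correct_option: int) -> float:
    """Return +1 for a correct \\boxed{N} answer, -1 if wrong or missing.

    Hypothetical sketch of the reward described in the model card:
    correct answers are rewarded, incorrect or missing ones penalized.
    """
    match = re.search(r"\\boxed\{(\d+)\}", completion)
    if match is None:
        return -1.0  # missing or unparseable answer is penalized
    return 1.0 if int(match.group(1)) == correct_option else -1.0
```

A binary reward like this keeps the verification signal simple: the policy only benefits from completions that both reason to the right option and emit it in the parseable boxed format.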
## Training Details
The model was trained for 2000 steps at a learning rate of 3e-6, reaching a final reward of approximately 0.45. Training used an effective batch size of 128 prompts per step and FSDP2 for parallelism. The average completion length during training was around 2370 tokens, consistent with the long reasoning traces produced in thinking mode.