CEIA-RL/energyv2-dpo-offline-GRPO
CEIA-RL/energyv2-dpo-offline-GRPO is a 4 billion parameter language model, based on the Qwen3-4B architecture, fine-tuned using Direct Preference Optimization (DPO) and Grouped Reward Policy Optimization (GRPO). This model is specifically optimized for tasks requiring high task coverage and relative quality, while maintaining low hallucination rates, as evidenced by its performance in energy-related regulatory contexts. It excels in specialized domains where factual accuracy and adherence to specific guidelines are critical.
Loading preview...
CEIA-RL/energyv2-dpo-offline-GRPO Overview
This model, developed by CEIA-RL, is a 4 billion parameter language model built upon the Qwen3-4B architecture. It has undergone specialized fine-tuning using a combination of Direct Preference Optimization (DPO) and Grouped Reward Policy Optimization (GRPO) techniques. The primary goal of this training was to enhance performance in specific, likely regulatory or energy-related, tasks.
Key Capabilities & Performance
The model demonstrates strong performance across several key metrics, particularly when compared to its base Qwen3-4B counterpart and other fine-tuned variants:
- High Task Coverage: Achieves a task coverage score of approximately 0.957, indicating its ability to address a broad range of relevant queries.
- Superior Relative Quality: Boasts a relative quality score of around 0.865, suggesting high-quality and relevant responses.
- Low Hallucination Rate: Exhibits a low hallucination rate of approximately 0.067, making it reliable for factual and sensitive applications.
- Optimized for Specific Domains: The training methodology and comparative benchmarks suggest an optimization for specialized tasks, likely within the energy or regulatory sectors, where accuracy and adherence to guidelines are paramount.
Good For
- Applications requiring high factual accuracy and low hallucination in specialized domains.
- Tasks where task coverage and response quality are critical.
- Use cases in regulatory compliance, energy sector analysis, or similar fields that benefit from fine-tuned, reliable language generation.