D-CORE-8B: Enhanced Reasoning for Complex Tool Use
D-CORE-8B is an 8 billion parameter model developed by Bowen Xu et al. that focuses on improving task decomposition and reflective reasoning in Large Reasoning Models (LRMs) for complex tool use. The model addresses the "Lazy Reasoning" phenomenon, where LRMs struggle with breaking down complex tasks into sub-tasks.
Key Capabilities
- Two-Stage Training Framework: Employs D-CORE, a novel framework consisting of:
  - Self-distillation: Incentivizes the model's ability to decompose tasks into sub-tasks.
  - Diversity-aware Reinforcement Learning (RL): Restores and enhances reflective reasoning.
- Robust Tool-Use Improvement: Achieves significant enhancements in tool-use scenarios across various benchmarks.
- Efficient Performance: The D-CORE framework enables smaller models to outperform much larger ones; the paper reports that D-CORE-14B surpasses 70B models on benchmarks like BFCLv3, indicating strong performance relative to parameter count.
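To make the decomposition idea concrete, here is a minimal sketch of what a decomposed tool-use plan might look like on the caller's side. The tool names, schemas, and the OpenAI-style function-calling format are illustrative assumptions; the paper does not specify the exact schema D-CORE-8B was trained on.

```python
# Hypothetical tool schemas in OpenAI-style function-calling format
# (illustrative only; not the schema D-CORE-8B necessarily expects).
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_flights",
            "description": "Search for flights between two cities.",
            "parameters": {
                "type": "object",
                "properties": {
                    "origin": {"type": "string"},
                    "destination": {"type": "string"},
                    "date": {"type": "string"},
                },
                "required": ["origin", "destination", "date"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "book_hotel",
            "description": "Book a hotel in a given city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "check_in": {"type": "string"},
                    "nights": {"type": "integer"},
                },
                "required": ["city", "check_in", "nights"],
            },
        },
    },
]

def validate_plan(plan: list) -> bool:
    """Check that every sub-task in a decomposed plan targets a known tool
    and supplies all of that tool's required arguments."""
    known = {t["function"]["name"]: t["function"]["parameters"]["required"]
             for t in TOOLS}
    for step in plan:
        required = known.get(step["name"])
        if required is None or not set(required) <= set(step["arguments"]):
            return False
    return True

# A plan of sub-task tool calls, like one the model might emit after
# decomposing "plan a trip to Tokyo" (illustrative, not real model output).
plan = [
    {"name": "search_flights",
     "arguments": {"origin": "SFO", "destination": "TYO", "date": "2025-03-01"}},
    {"name": "book_hotel",
     "arguments": {"city": "Tokyo", "check_in": "2025-03-01", "nights": 4}},
]

print(validate_plan(plan))  # True
```

Validating each planned sub-task against the tool schemas before execution is a cheap guard against malformed decompositions in an agent loop.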
Performance Highlights
- BFCL Benchmark: D-CORE-8B achieved an overall score of 53.15 on the BFCL benchmark, covering its agentic, multi-turn, and single-turn categories as well as its hallucination-measurement and format-sensitivity tests.
- Tau-Bench & Tau2-Bench: Scored 44.9 overall on Tau-Bench and 35.8 overall on Tau2-Bench, indicating proficiency in retail, airline, and telecom-related tasks.
- ACEBench: Achieved an overall score of 75.2 on ACEBench, showcasing strong performance across various atom, single-turn, multi-turn, and agentic scenarios.
Good For
- Applications requiring complex tool interaction and multi-step reasoning.
- Scenarios where task decomposition is critical for successful problem-solving.
- Developers seeking models with enhanced reflective reasoning capabilities for agentic workflows.
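In an agentic workflow like those above, the model's generated tool calls must be parsed out of its text output and dispatched. A minimal sketch follows, assuming the model wraps JSON tool calls in `<tool_call>` tags, a common convention among open tool-use models that D-CORE-8B may or may not follow; check the model's actual chat template before relying on this format.

```python
import json
import re

def extract_tool_calls(text: str) -> list:
    """Pull JSON tool calls out of generated text wrapped in <tool_call> tags
    (an assumed output convention, not confirmed for D-CORE-8B)."""
    calls = []
    for block in re.findall(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL):
        try:
            calls.append(json.loads(block))
        except json.JSONDecodeError:
            pass  # skip malformed calls rather than crash the agent loop
    return calls

# Illustrative generated text, not real model output.
sample = (
    "I will first look up the weather.\n"
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Tokyo"}}</tool_call>'
)
print(extract_tool_calls(sample))
```

Skipping malformed calls instead of raising keeps the agent loop alive and lets the caller re-prompt the model, which matters in multi-turn benchmarks like BFCLv3 where a single crash ends the episode.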