GAIR/daVinci-Dev-32B-MT
GAIR/daVinci-Dev-32B-MT is a 32-billion-parameter large language model from GAIR's daVinci-Dev family, based on Qwen2.5. This checkpoint is the model after agent-native mid-training, designed specifically for agentic software engineering tasks. It is trained on agent-native data, including contextually-native PR-derived trajectories and environmentally-native executable rollouts, to reduce the distribution mismatch code agents face between static pretraining data and live environments. The model performs strongly on software engineering benchmarks such as SWE-Bench Verified, making it suitable for automated code generation and bug fixing within agentic frameworks.
daVinci-Dev-32B-MT: Agent-Native Mid-Training for Software Engineering
GAIR/daVinci-Dev-32B-MT is a 32-billion-parameter model from the daVinci-Dev family, developed by GAIR. It is a mid-training (MT) checkpoint, meaning it has undergone agent-native mid-training but has not yet received Supervised Fine-Tuning (SFT) on environmentally-native executable trajectories. The model is built on the Qwen2.5 base model family and adapted specifically for agentic software engineering.
Key Capabilities & Training:
- Agent-Native Mid-Training: Utilizes a novel approach to reduce the distribution mismatch between static pretraining data and dynamic, feedback-rich environments encountered by code agents.
- Specialized Data: Trained on two complementary trajectory types:
  - Contextually-native trajectories (68.6B tokens): Derived from GitHub pull requests, preserving the full information flow from file discovery to sequential edits.
  - Environmentally-native executable trajectories (3.1B raw tokens): Collected from real executable repositories with genuine tool and test outputs, capturing authentic feedback loops.
- Software Engineering Focus: Designed to excel in agentic software engineering tasks, including automated code generation, bug fixing, and reasoning within development environments.
Performance & Use Cases:
- SWE-Bench Verified: The full daVinci-Dev-32B model (which adds SFT on top of this MT checkpoint) achieves 56.1% Pass@1 on SWE-Bench Verified, demonstrating strong capability in solving real-world software issues.
- Generalization: Improvements are also observed on standard code benchmarks (e.g., HumanEval/EvalPlus) and scientific reasoning benchmarks (e.g., GPQA/SciBench).
- Intended Use: Primarily designed for integration into agentic scaffolds like SWE-Agent for automated software development workflows. It is compatible with standard inference frameworks like Hugging Face Transformers and vLLM.
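As a minimal client-side sketch of the intended use, the request below assembles a bug-fix prompt for the model served behind vLLM's OpenAI-compatible API. The endpoint URL, system prompt, and sampling parameters are illustrative assumptions, not values documented by the model card:

```python
import json

# Assumed vLLM default endpoint; adjust to your deployment.
VLLM_ENDPOINT = "http://localhost:8000/v1/chat/completions"

def build_fix_request(issue_description: str, file_snippet: str) -> dict:
    """Assemble a hypothetical chat-completion payload asking the model
    for a minimal patch. Prompt wording and parameters are illustrative."""
    return {
        "model": "GAIR/daVinci-Dev-32B-MT",
        "messages": [
            {"role": "system",
             "content": "You are a software engineering agent. Propose a minimal patch."},
            {"role": "user",
             "content": f"Issue:\n{issue_description}\n\nRelevant code:\n{file_snippet}"},
        ],
        "temperature": 0.0,   # deterministic decoding for patch generation
        "max_tokens": 1024,
    }

payload = build_fix_request(
    "IndexError when the input list is empty",
    "def head(xs):\n    return xs[0]",
)
print(json.dumps(payload, indent=2))
```

The payload can then be POSTed to the endpoint with any HTTP client; inside a scaffold such as SWE-Agent, the same model name is typically supplied through the scaffold's own configuration instead.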