OpenCoder-8B-Base: An Open Foundation for Code LLMs
OpenCoder-8B-Base is an 8-billion-parameter model from the OpenCoder family, developed by infly as a fully open and reproducible code Large Language Model. It is pretrained on 2.5 trillion tokens, comprising 90% raw code and 10% code-related web data; the family additionally releases over 4.5 million high-quality SFT examples used to fine-tune its instruction-following variants. The model supports both English and Chinese, aiming to provide a robust foundation for advancing code AI.
Key Capabilities & Features
- Comprehensive Open Source: Provides not only model weights and inference code but also the complete data-cleaning code, high-quality synthetic data, and over 4.5 million supervised fine-tuning (SFT) entries, making it one of the most transparently released models.
- Rigorous Experimental Analysis: Backed by extensive ablation studies on data-cleaning strategies and training processes, including file-level versus repository-level deduplication.
- High-Quality Synthetic Data: Offers a fully developed synthetic data generation process and a substantial dataset of SFT entries.
- Exceptional Performance: Achieves strong performance across multiple code-specific benchmarks, positioning it among leading open-source code models. For detailed evaluation results, refer to the OpenCoder paper.
Benchmarks (Base Model)
- HumanEval: 66.5 (HumanEval+: 63.4)
- MBPP: 79.9 (MBPP+: 70.4)
- BigCodeBench: 40.5
- BigCodeBench-Hard: 9.5
Good For
- Code Generation: Excels at generating code snippets and functions.
- Code Understanding: Suitable for tasks requiring comprehension of programming logic.
- Research & Development: Ideal for researchers and developers looking for a transparent, reproducible, and high-performing base model for code AI experimentation and innovation.
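Getting Started

A minimal completion sketch using Hugging Face `transformers`, assuming the Hub repo id `infly/OpenCoder-8B-Base` (verify it on the Hub before use). As a base model, OpenCoder-8B-Base performs plain text continuation rather than chat-style instruction following:

```python
MODEL_ID = "infly/OpenCoder-8B-Base"  # assumed Hub repo id; confirm before use


def complete(prompt: str, max_new_tokens: int = 128) -> str:
    """Continue `prompt` with the base model (no chat template)."""
    # Imports kept local so the module loads without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype="auto",   # pick the checkpoint's native dtype
        device_map="auto",    # requires the accelerate package
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


if __name__ == "__main__":
    # Base-model usage: give a code prefix and let the model continue it.
    print(complete("def fibonacci(n):"))
```

Because this is the base checkpoint, prompts should be written as code prefixes to continue; for conversational use, the OpenCoder family's instruction-tuned variants are the better fit.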