infly/OpenCoder-8B-Base

Hugging Face
Text generation · Model size: 8B · Quantization: FP8 · Context length: 8k · Published: Nov 7, 2024 · License: INF · Architecture: Transformer

The infly/OpenCoder-8B-Base is an 8 billion parameter base model from the OpenCoder family, developed by infly. Pretrained on 2.5 trillion tokens, primarily code and code-related web data, it is designed as an open and reproducible foundation for code-centric AI applications. This model excels in code generation and understanding, supporting both English and Chinese languages, and offers comprehensive transparency with fully open-sourced training data and protocols.


OpenCoder-8B-Base: An Open Foundation for Code LLMs

OpenCoder-8B-Base is an 8 billion parameter model from the OpenCoder family, developed by infly, designed as a fully open and reproducible code Large Language Model. It is pretrained on an extensive 2.5 trillion tokens, comprising 90% raw code and 10% code-related web data. As a base model it is not instruction-tuned; the OpenCoder release separately provides over 4.5 million high-quality SFT examples, which are used to train the Instruct variant. The model supports both English and Chinese, aiming to provide a robust foundation for advancing code AI.

Key Capabilities & Features

  • Comprehensive Open Source: Provides not only model weights and inference code but also the complete data-cleaning code, high-quality synthetic data, and over 4.5 million supervised fine-tuning (SFT) entries, making it one of the most transparently released models.
  • Rigorous Experimental Analysis: Undergoes extensive ablation studies on various data-cleaning strategies and training processes, including file-level and repository-level deduplication.
  • High-Quality Synthetic Data: Offers a fully developed synthetic data generation process and a substantial dataset of SFT entries.
  • Exceptional Performance: Achieves strong performance across multiple code-specific benchmarks, positioning it among leading open-source code models. For detailed evaluation results, refer to the OpenCoder paper.

Benchmarks (Base Model)

Scores in parentheses are the stricter EvalPlus ("+") variants of the benchmark.

  • HumanEval(+): 66.5 (63.4)
  • MBPP(+): 79.9 (70.4)
  • BigCodeBench: 40.5
  • BigCodeBench-Hard: 9.5

Good For

  • Code Generation: Excels at generating code snippets and functions.
  • Code Understanding: Suitable for tasks requiring comprehension of programming logic.
  • Research & Development: Ideal for researchers and developers looking for a transparent, reproducible, and high-performing base model for code AI experimentation and innovation.
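As a base (completion) model, OpenCoder-8B-Base is typically prompted with a code prefix rather than a chat instruction. The sketch below shows one way to run it with the Hugging Face transformers library; the generation settings and prompt are illustrative assumptions, not official recommendations, and you should adjust dtype/device mapping to your hardware.

```python
# Minimal sketch: code completion with OpenCoder-8B-Base via transformers.
# Settings here (max_new_tokens, device_map, dtype) are assumptions for
# illustration, not official recommendations.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "infly/OpenCoder-8B-Base"

def complete(prompt: str, max_new_tokens: int = 128) -> str:
    """Return the model's continuation of a code prefix (prompt excluded)."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype="auto",   # pick the checkpoint's native precision
        device_map="auto",    # spread across available GPUs / fall back to CPU
        trust_remote_code=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Slice off the prompt tokens so only the new completion is returned.
    return tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

if __name__ == "__main__":
    # Base models continue text, so give a code prefix rather than a question.
    print(complete("def quicksort(arr):\n"))
```

Because this is a raw base model, expect it to continue whatever prefix you provide; for instruction-following behavior, use the corresponding Instruct variant instead.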