infly/OpenCoder-8B-Instruct

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:8BQuant:FP8Ctx Length:8kPublished:Nov 7, 2024License:infArchitecture:Transformer0.2K Warm

OpenCoder-8B-Instruct is an 8 billion parameter instruction-tuned causal language model developed by infly, supporting both English and Chinese. It is pretrained on 2.5 trillion tokens, comprising 90% raw code and 10% code-related web data, and fine-tuned on over 4.5 million high-quality SFT examples. This model is optimized for code generation and understanding, achieving strong performance across various code benchmarks.

Loading preview...

OpenCoder-8B-Instruct: An Open Code LLM

OpenCoder-8B-Instruct is an 8 billion parameter instruction-tuned model from the OpenCoder family, designed for code-related tasks in both English and Chinese. Developed by infly, this model was pretrained on a massive 2.5 trillion token dataset, with a significant focus on raw code (90%) and code-related web data (10%). It underwent supervised fine-tuning using over 4.5 million high-quality examples, enabling it to achieve competitive performance among top-tier code LLMs.

Key Capabilities & Features

  • Comprehensive Open Source: Provides full transparency with released model weights, inference code, data-cleaning code, synthetic data, checkpoints, and over 4.5 million SFT entries.
  • Rigorous Experimental Analysis: Backed by extensive ablation studies on data-cleaning strategies and training processes, including file-level and repository-level deduplication.
  • High-Quality Synthetic Data: Offers a fully developed synthetic data generation process and a robust dataset for training and evaluation.
  • Exceptional Performance: Demonstrates strong results across multiple code language model benchmarks, including HumanEval, MBPP, BigCodeBench, LiveCodeBench, and MultiPL-E.
  • Multilingual Support: Supports both English and Chinese languages.

Use Cases

OpenCoder-8B-Instruct is well-suited for a variety of code-centric applications, including:

  • Code Generation: Generating code snippets or full functions based on natural language prompts.
  • Code Understanding: Assisting with code analysis, explanation, and debugging.
  • Educational Tools: Serving as a foundation for tools that help developers learn and practice coding.
  • Research and Development: Providing a fully open and reproducible platform for advancing code AI research.