AlchemistCoder-CL-7B: Enhanced Code Generation
AlchemistCoder-CL-7B is a 7-billion-parameter code language model from InternLM, designed to improve code generation and generalization. It addresses a key limitation of previous Code LLMs, which are typically fine-tuned on single-source data, by instead fine-tuning on diverse, harmonized multi-source data.
Key Capabilities & Innovations
- AlchemistPrompts: Introduces data-specific prompts to resolve conflicts within multi-source code data and align instruction-response pairs more effectively.
- Code Comprehension Tasks: Integrates tasks like instruction evolution, data filtering, and code review, derived from the data construction process, into its fine-tuning.
- Harmonized Multi-source Data: Instruction-tuned on 200 million tokens drawn from six types of high-quality data, ensuring broad and diverse learning.
- Superior Performance: Leads open-source models of comparable size (6.7B/7B) and rivals or surpasses much larger models (15B/33B/70B) across six code benchmarks.
- Strong General Capabilities: Shows significant improvements on general reasoning benchmarks, including MMLU, BBH, and GSM8K, indicating robust instruction following beyond code.
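The AlchemistPrompts idea above can be pictured as attaching a short, source-aware preamble to each instruction-response pair so that stylistically different training sources read consistently at fine-tuning time. The sketch below is purely illustrative: the source names, preamble text, and template are hypothetical, not the actual prompts used by the authors.

```python
# Illustrative sketch of data-specific ("alchemist") prompts: each training
# source gets a short preamble that frames its instruction-response pairs in
# a consistent format. Source names and preamble text are hypothetical.
SOURCE_PREAMBLES = {
    "stackoverflow_qa": "Answer the programming question concisely and correctly.",
    "competitive_code": "Solve the programming problem with efficient, correct code.",
    "code_review": "Review the code and respond with an improved version.",
}

def harmonize_example(source: str, instruction: str, response: str) -> str:
    """Prefix an instruction-response pair with its source's preamble,
    yielding a uniformly formatted fine-tuning example."""
    preamble = SOURCE_PREAMBLES.get(source, "Follow the instruction below.")
    return f"{preamble}\n\n### Instruction:\n{instruction}\n\n### Response:\n{response}"

example = harmonize_example(
    "stackoverflow_qa",
    "How do I reverse a list in Python?",
    "Use slicing: `my_list[::-1]`, or `my_list.reverse()` for in-place reversal.",
)
```

The point is that once every source is wrapped this way, conflicting styles across sources no longer pull the model in different directions during fine-tuning.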
Ideal Use Cases
- Code Generation: Generating high-quality code snippets and functions across various programming languages.
- Code Comprehension: Assisting with understanding and reviewing existing codebases.
- Instruction Following: Executing complex coding instructions and adapting to diverse programming challenges.
- Benchmarking: Suitable for applications requiring strong performance on standard code and general reasoning benchmarks.
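For the code-generation use case, a minimal inference sketch with Hugging Face `transformers` might look like the following. It assumes the checkpoint is published as `internlm/AlchemistCoder-CL-7B`, and the Alpaca-style prompt wrapper is an assumption; verify the exact repo id and recommended template on the model card before use.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "internlm/AlchemistCoder-CL-7B"  # assumed Hugging Face repo id

def build_prompt(instruction: str) -> str:
    """Wrap a coding instruction in a generic Alpaca-style template.
    The exact template the model expects is an assumption; check the model card."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n"

def run_demo() -> str:
    """Download the model weights (several GB) and generate a completion."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
    )
    prompt = build_prompt("Write a Python function that checks whether a string is a palindrome.")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```

Greedy decoding (`do_sample=False`) is used here for reproducible output; for more varied completions, enable sampling with a temperature.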