Model Overview
This model, developed by tokyotech-llm, is a continually pre-trained variant of Llama-3.1-8B. It was built to establish a baseline for unfiltered Python code from The-Stack-v2 within the SwallowCode ablation experiments. The model shows baseline performance on code generation benchmarks such as HumanEval and HumanEval+, while retaining general proficiency across knowledge, reasoning, and commonsense benchmarks.
Key Training Details
- Base Model: Llama-3.1-8B
- Total Pretraining Tokens: 50 billion
- Data Mix: 16% Python code (from The-Stack-v2-train-smol-ids, SwallowCode Experiment 1) and 84% multilingual text (including Japanese and English corpora).
- Sequence Length: 8,192 tokens
- Training Framework: Megatron-LM (version core_r0.9.0)
- Hardware: Trained on 64 NVIDIA H100 GPUs on the TSUBAME supercomputer.
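The stated mix and token count imply a simple per-source token budget; a back-of-the-envelope split, assuming the percentages are exact:

```python
# Token budget implied by the stated 50B-token run with a 16% / 84% mix.
TOTAL_TOKENS = 50_000_000_000

# Integer arithmetic keeps the split exact (no floating-point rounding).
code_tokens = TOTAL_TOKENS * 16 // 100  # Python code from The-Stack-v2
text_tokens = TOTAL_TOKENS * 84 // 100  # Japanese/English multilingual text

print(f"code: {code_tokens:,} tokens")  # code: 8,000,000,000 tokens
print(f"text: {text_tokens:,} tokens")  # text: 42,000,000,000 tokens
```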
Evaluation and Use Cases
The model was evaluated with lm-evaluation-harness and BigCodeBench across code generation benchmarks (HumanEval, HumanEval+) and general tasks (e.g., MMLU, GSM8K, HellaSwag). It serves as the reference point against which subsequent SwallowCode ablation experiments are compared, in particular for measuring the effect of training on specific code data subsets. Developers can use the model for code generation, especially when studying how different code training data compositions affect downstream quality.
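As a rough illustration, the model can be queried for code completion through Hugging Face transformers. This is a sketch, not part of the release: the repository id below is a placeholder (the exact Hub name is not stated here), and the truncation helper is a common HumanEval-style post-processing step assumed for illustration.

```python
"""Hedged sketch: greedy code completion with Hugging Face transformers.
MODEL_ID is a placeholder assumption -- substitute the real Hub repo id."""

MODEL_ID = "tokyotech-llm/<baseline-model-repo>"  # placeholder, not the real id


def truncate_at_top_level(completion: str) -> str:
    """Cut a completion at the first non-indented line, a common
    post-processing step when continuing a function signature
    (e.g. HumanEval-style prompts)."""
    kept = []
    for line in completion.splitlines():
        if line and not line[0].isspace() and kept:
            break  # generation has left the function body
        kept.append(line)
    return "\n".join(kept)


def complete(prompt: str, max_new_tokens: int = 256) -> str:
    # Heavy imports are kept local so the helper above stays importable
    # even without torch/transformers installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return truncate_at_top_level(tokenizer.decode(new_tokens, skip_special_tokens=True))


if __name__ == "__main__":
    prompt = 'def fizzbuzz(n: int) -> str:\n    """Return the FizzBuzz string for n."""\n'
    print(complete(prompt))
```

Greedy decoding (`do_sample=False`) is the usual choice when reproducing pass@1-style HumanEval scores; sampling with a temperature would be used for pass@k.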