tokyotech-llm/Llama-3.1-8B-code-ablation-exp1-LR2.5e-5-MINLR2.5E-6-WD0.1-iter0002500
The tokyotech-llm/Llama-3.1-8B-code-ablation-exp1-LR2.5e-5-MINLR2.5E-6-WD0.1-iter0002500 model is an 8-billion-parameter Llama-3.1 model continually pre-trained by tokyotech-llm with an 8,192-token sequence length. It was trained on 50 billion tokens: a 16% Python code subset from The-Stack-v2 and 84% multilingual text. The model serves as a baseline for evaluating unfiltered Python code in code generation tasks while maintaining general language capabilities.
Model Overview
This model, developed by tokyotech-llm, is a continual pre-training of Llama-3.1-8B. It was designed to establish baseline performance for unfiltered Python code from The-Stack-v2 within the SwallowCode ablation experiments. The model establishes baseline scores on code generation benchmarks such as HumanEval and HumanEval+, and retains general proficiency across knowledge, reasoning, and common-sense benchmarks.
Key Training Details
- Base Model: Llama-3.1-8B
- Total Pretraining Tokens: 50 billion
- Data Mix: 16% Python code (from The-Stack-v2-train-smol-ids, SwallowCode Experiment 1) and 84% multilingual text (including Japanese and English corpora).
- Sequence Length: 8,192 tokens
- Training Framework: Megatron-LM (version core_r0.9.0)
- Hardware: Trained on 64 NVIDIA H100 GPUs on the TSUBAME supercomputer.
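The data mix above implies a fixed token budget per source. A minimal sketch of that arithmetic (the 50-billion total and the 16%/84% split are from this card; the helper name is illustrative):

```python
# Token budget implied by the stated data mix: 50B total tokens,
# 16% Python code (The-Stack-v2) and 84% multilingual text.
TOTAL_TOKENS = 50_000_000_000

def tokens_per_source(mix: dict, total: int = TOTAL_TOKENS) -> dict:
    """Split a total token budget by per-source fractions."""
    assert abs(sum(mix.values()) - 1.0) < 1e-9, "fractions must sum to 1"
    return {name: round(total * frac) for name, frac in mix.items()}

budget = tokens_per_source({"python_code": 0.16, "multilingual_text": 0.84})
# python_code: 8 billion tokens; multilingual_text: 42 billion tokens
```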
Evaluation and Use Cases
The model was evaluated with lm-evaluation-harness and BigCodeBench across a range of benchmarks, including code generation (HumanEval, HumanEval+) and general tasks (e.g., MMLU, GSM8K, HellaSwag). It serves as the reference point for the subsequent ablation experiments in the SwallowCode research pipeline, in particular for measuring how models trained on specific code data subsets compare. Developers can use this model for code generation, especially when studying the effects of different code training data compositions.
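For quick experimentation, the checkpoint can be loaded with Hugging Face `transformers`. A hedged sketch, assuming the checkpoint is published on the Hub under the id above and that `transformers`, `torch`, and a suitable GPU are available; `generate_code` is an illustrative helper, not part of the release:

```python
# Illustrative loader for the ablation checkpoint (assumes Hub availability).
MODEL_ID = (
    "tokyotech-llm/"
    "Llama-3.1-8B-code-ablation-exp1-LR2.5e-5-MINLR2.5E-6-WD0.1-iter0002500"
)

def generate_code(prompt: str, max_new_tokens: int = 256) -> str:
    """Greedy-decode a completion for `prompt` with the ablation checkpoint."""
    # Imported here so the module can be inspected without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )
    # Return only the newly generated tokens, not the echoed prompt.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Usage: `generate_code("def fibonacci(n: int) -> int:\n")` returns a model-written continuation of the function body.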