Overview
DiscoResearch/Llama3-German-8B-32k is a specialized large language model built on Meta's Llama3-8B architecture, developed in collaboration between DiscoResearch and Occiglot. It is tailored to German, addressing the base Llama3 model's weaker German performance, which stems from its limited multilingual training data. This specialization was achieved through continued pre-training on 65 billion high-quality German tokens from the occiglot-fineweb-0.5 dataset.
Key Capabilities
- German Language Specialization: Significantly improved linguistic understanding and general reasoning in German, as evidenced by strong gains on benchmarks like Hellaswag_de.
- Extended Context Window: This "-32k" variant extends the base model's 8192-token training context to 32k tokens, making it suitable for processing long German documents.
- Minimal English Performance Degradation: Despite extensive German pre-training, the model maintains strong performance on English benchmarks.
- Efficient Document Packing: Uses a document packing strategy based on the "Fewer Truncations Improve Language Modeling" paper, which improves training efficiency and can improve benchmark scores.
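The packing idea can be sketched as a best-fit-decreasing bin-packing pass over tokenized document lengths. The helper below is an illustrative simplification (the function name and the 8192-token capacity are assumptions for this sketch), not the actual training-pipeline implementation:

```python
def pack_documents(doc_lengths, capacity=8192):
    """Greedy best-fit-decreasing packing of document token counts into
    fixed-size training sequences, minimizing cross-boundary truncation."""
    bins = []  # each bin: [remaining_capacity, [packed document lengths]]
    for length in sorted(doc_lengths, reverse=True):
        if length > capacity:
            # An oversized document still has to be truncated or split.
            length = capacity
        # Choose the bin with the least leftover space that still fits.
        best = min((b for b in bins if b[0] >= length),
                   key=lambda b: b[0], default=None)
        if best is None:
            bins.append([capacity - length, [length]])
        else:
            best[0] -= length
            best[1].append(length)
    return [docs for _, docs in bins]
```

With naive sequential concatenation, documents are routinely split across sequence boundaries; best-fit packing keeps most documents intact, which the cited paper links to better downstream performance.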
Benchmarks and Performance
The model demonstrates notable improvements on German-specific benchmarks. On Hellaswag_de, for instance, it scores 0.64310, outperforming the base Meta-Llama-3-8B-Instruct (0.60008). English benchmark scores degrade only minimally, showing that the German specialization did not come at a significant cost to English capability. The model was trained on 128 GPUs for approximately 60 hours, using a sequence length of 8192 tokens and a cosine learning rate schedule.
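The cosine schedule mentioned above can be written down directly. The warmup length and learning-rate values below are illustrative assumptions, not the actual training hyperparameters:

```python
import math

def cosine_lr(step, total_steps, max_lr=1e-4, min_lr=1e-5, warmup_steps=100):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The cosine shape keeps the learning rate near its peak early in training and tapers it smoothly toward the floor, which tends to stabilize the final phase of continued pre-training.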
Good for
- Applications requiring high-quality German language generation and understanding.
- Tasks involving long German texts, benefiting from the extended 32k-token context window.
- Developers looking for a strong German-centric base model for further instruction-tuning or domain adaptation.
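For reference, loading the model with Hugging Face transformers typically looks like the sketch below. The prompt and generation parameters are illustrative assumptions, and running it requires a GPU and downloading the full model weights:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "DiscoResearch/Llama3-German-8B-32k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 keeps memory use manageable on one GPU
    device_map="auto",
)

# Illustrative German prompt; this is a base model, so it continues text
# rather than following chat-style instructions.
prompt = "Die wichtigsten Ereignisse der deutschen Geschichte sind"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

As a base model it is best suited to further instruction-tuning or domain adaptation rather than direct chat use.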