Overview
BübleLM is a 2.6 billion parameter German language model built upon the Gemma-2-2B architecture. Developed by flair, the model's key innovation is its adaptation to German via trans-tokenization with a custom German SentencePiece tokenizer, which significantly improves performance on German language tasks. The model was trained on 3.5 billion tokens from the Occiglot-FineWeb project, encompassing diverse German data sources such as web content, legislative documents, news, and Wikipedia.
Key Capabilities & Performance
- German Language Optimization: Achieves substantial improvements over the base Gemma-2-2B model on German benchmarks, including a +71% increase on HellaSwag-DE and +41% on ARC-DE.
- Custom Tokenization: Employs a 20k vocabulary German SentencePiece tokenizer, optimized for German morphological structures, leading to better token efficiency.
- Context Length: Supports an 8192-token context window.
- Outperforms Alternatives: Surpasses both the base Gemma-2-2B and other German models such as LLäMmlein-1B on most evaluated tasks.
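Token efficiency is commonly measured as fertility: the average number of subword tokens produced per word, where lower is better. A minimal sketch of such a comparison is below; the two toy tokenizers are illustrative stand-ins only, not the actual BübleLM SentencePiece or Gemma-2 vocabularies, which a real measurement would load instead.

```python
def fertility(tokenize, text):
    """Average number of subword tokens produced per whitespace-separated word."""
    words = text.split()
    tokens = tokenize(text)
    return len(tokens) / len(words)

# Illustrative stand-ins: a real comparison would call BübleLM's German
# SentencePiece tokenizer and the original Gemma-2 tokenizer instead.
def char_bigram_tokenize(text):
    # Crude tokenizer with poor German coverage: splits each word into 2-char pieces.
    pieces = []
    for word in text.split():
        pieces.extend(word[i:i + 2] for i in range(0, len(word), 2))
    return pieces

def word_tokenize(text):
    # Idealized German-optimized tokenizer: keeps whole words as single tokens.
    return text.split()

# Long German compounds are where a tailored vocabulary pays off most.
sample = "Bundesdatenschutzgesetz und Grundgesetz"
print(fertility(char_bigram_tokenize, sample))  # ~6.7 tokens per word
print(fertility(word_tokenize, sample))         # 1.0 token per word
```

A lower fertility means fewer tokens per sentence, so more German text fits into the 8192-token context window and each forward pass covers more content.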
Usage & Limitations
BübleLM is a base language model, not an instruction-tuned one, so it is best suited to text completion rather than chat or instruction following without further fine-tuning. Its limitations include a relatively small vocabulary (20k tokens) compared to multilingual models, and potentially variable performance in highly specialized domains that are under-represented in its training data.
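For text completion, the model can be used with the standard Hugging Face transformers causal-LM API. The sketch below assumes a Hub repository id of "flair/bueble-lm-2b"; verify the actual id on the model card before use.

```python
def generate_completion(prompt: str,
                        model_name: str = "flair/bueble-lm-2b",  # assumed repo id
                        max_new_tokens: int = 64) -> str:
    """Complete a German prompt with a base (non-instruction-tuned) causal LM."""
    # Lazy import so the sketch can be read without the heavy dependencies installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Plain text in, plain text out: no chat template, since this is a base model.
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Because the model is not instruction-tuned, prompts should be phrased as text to be continued (e.g. the opening sentence of an article) rather than as questions or commands.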