flair/bueble-lm-2b

Hosted on Hugging Face

Text generation · Model size: 2.6B · Quant: BF16 · Context length: 8k · Published: Sep 15, 2024 · License: apache-2.0 · Architecture: Transformer

BübleLM is a 2.6 billion parameter German language model developed by flair, based on the Gemma-2-2B architecture. It uses a custom German SentencePiece tokenizer and was trained on 3.5 billion tokens of German web content, legislative documents, and news. The model significantly outperforms its base model on German benchmarks such as HellaSwag-DE and ARC-DE, making it well suited to German language understanding and generation tasks.


Overview

BübleLM is a 2.6 billion parameter German language model built upon the Gemma-2-2B architecture. Developed by flair, its key innovation lies in its adaptation using trans-tokenization with a custom German SentencePiece tokenizer, which significantly enhances its performance in German language tasks. The model was trained on 3.5 billion tokens from the Occiglot-FineWeb project, encompassing diverse German data sources like web content, legislative documents, news, and Wikipedia.

Key Capabilities & Performance

  • German Language Optimization: Achieves substantial improvements over the base Gemma-2-2B model on German benchmarks, including gains of 71% on HellaSwag-DE and 41% on ARC-DE.
  • Custom Tokenization: Employs a 20k vocabulary German SentencePiece tokenizer, optimized for German morphological structures, leading to better token efficiency.
  • Context Length: Supports an 8192-token context window.
  • Outperforms Alternatives: Surpasses both the base Gemma-2-2B and other German models such as LLäMmlein-1B on most evaluated tasks.

Usage & Limitations

BübleLM is a base language model, not instruction-tuned, so it is best suited for text completion rather than chat or instruction following without further fine-tuning. Its limitations include a relatively small vocabulary (20k tokens) compared to multilingual models and potential performance variations on highly specialized domains that are not well represented in its training data.
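As a minimal sketch, the base model can be driven for plain German text completion through the Hugging Face transformers library. The repo id `flair/bueble-lm-2b` is taken from this page's title; the sampling settings and the example prompt below are illustrative assumptions, not a configuration recommended by the model's authors.

```python
# Hedged sketch: German text completion with BübleLM via transformers.
# Note this is a base model, so we feed a plain prompt with no chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "flair/bueble-lm-2b"  # repo id as shown on this page


def complete(prompt: str, max_new_tokens: int = 64) -> str:
    """Continue a German prompt with the base model (illustrative settings)."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # matches the BF16 weights listed above
        device_map="auto",
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,      # sampling parameters are assumptions,
        temperature=0.7,     # not the authors' recommendation
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)


if __name__ == "__main__":
    # Example German prompt (hypothetical); the model continues the text.
    print(complete("Der Bundestag hat heute beschlossen,"))
```

Because the model is not instruction-tuned, prompts should read like the beginning of a document to be continued; for chat-style behavior, further fine-tuning would be needed.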