Occiglot-7B-DE-EN: A Bilingual Foundation Model
Occiglot-7B-DE-EN is a 7 billion parameter generative language model developed by the Occiglot Research Collective. It is built on the Mistral-7B-v0.1 architecture and has undergone continued pre-training on an additional 114 billion tokens of multilingual and code data, with a block size of 8,192 tokens per sample. It is a general-purpose base model focused on foundational language understanding in German and English.
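The 8,192-token block size above means that training samples are fixed-length windows cut from a continuous token stream. A minimal sketch of this kind of packing (a hypothetical helper illustrating the common strategy, not Occiglot's actual data pipeline):

```python
def pack_into_blocks(token_ids, block_size=8192):
    """Split a flat token stream into fixed-length training blocks.

    The trailing remainder that does not fill a whole block is dropped,
    a common (though not the only) packing strategy.
    """
    n_blocks = len(token_ids) // block_size
    return [token_ids[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

# Illustrative run on a dummy token stream of 20,000 ids:
blocks = pack_into_blocks(list(range(20000)), block_size=8192)
# yields two full 8,192-token blocks; the 3,616-token remainder is dropped
```

Packing into fixed blocks keeps every batch at the model's full context length, which makes continued pre-training throughput predictable.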
Key Capabilities
- Bilingual Proficiency: Optimized for strong performance in both German and English, with a training data distribution of approximately 52% German, 34% English, and 13% code.
- Robust Foundation: Serves as a powerful base model for various downstream tasks, though it is not instruction-tuned for chat or specific applications. An instruction-tuned variant, occiglot-7b-de-en-instruct, is available for conversational use cases.
- Code Understanding: Trained on a significant share of code data, improving its performance on code-related tasks.
Good For
- Research and Development: Ideal for researchers and developers looking to build custom applications or fine-tune models specifically for German and English language tasks.
- Multilingual Applications: Suitable as a backbone for applications requiring strong performance in both German and English, such as translation, content generation, or analysis.
- Continued Pre-training: Provides a solid starting point for further domain-specific pre-training or fine-tuning on specialized datasets.
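As a starting point for evaluation or fine-tuning, the base model can be loaded with Hugging Face Transformers. A minimal sketch, assuming the repo id `occiglot/occiglot-7b-de-en` and that `transformers` and `torch` are installed; since this is a base model (not the instruct variant), it expects plain-text prompts rather than a chat template:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "occiglot/occiglot-7b-de-en"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",  # load in the checkpoint's native precision
    device_map="auto",   # place weights on available GPU(s) or CPU
)

# Base model: plain-text continuation, no instruction/chat formatting.
prompt = "Die Hauptstadt von Deutschland ist"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For conversational use, load the instruct variant (`occiglot-7b-de-en-instruct`) instead and apply its chat template.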