Occiglot-7B-EU5: A Multilingual Base Model for Western Europe
Occiglot-7B-EU5 is a 7-billion-parameter generative language model developed by the Occiglot Research Collective. It is based on Mistral-7B-v0.1 and was further pre-trained on an additional 293 billion tokens of multilingual and code data. The model targets the top-5 European Union languages (English, Spanish, French, German, and Italian) and also supports code-related tasks.
Key Capabilities
- Multilingual Proficiency: Designed to handle English, Spanish, French, German, and Italian, making it suitable for applications requiring understanding and generation across these languages.
- Code Support: Includes training on code data, enhancing its utility for developers and technical applications.
- Base Model: Serves as a general-purpose base model, providing a strong foundation for further fine-tuning or specific applications. An instruction-tuned variant, occiglot-7b-eu5-instruct, is also available for chat and other instruction-following tasks.
- Continued Pre-training: Benefits from significant additional training data (293B tokens) beyond its Mistral-7B-v0.1 base, focusing on Western European languages and code.
When to Use This Model
- Multilingual Applications: Suited to use cases that need language understanding and generation in English, Spanish, French, German, and Italian.
- Foundation for Fine-tuning: As a base model, it's an excellent starting point for developers looking to fine-tune a model for specific tasks or domains within its supported languages.
- Research and Development: Suitable for researchers exploring multilingual LLMs, especially those focused on Western European languages.
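Because this is a base model rather than the instruction-tuned variant, it continues text instead of following instructions, so a few-shot prompt is one common way to steer it. The following is a minimal sketch, not an official usage recipe. It assumes the weights are published on the Hugging Face Hub under the ID `occiglot/occiglot-7b-eu5` and that the `transformers` and `torch` packages are installed. The helper names `build_few_shot_prompt` and `generate` are hypothetical and chosen for illustration only.

```python
# Assumed Hugging Face Hub ID; verify it against the official model page before use.
MODEL_ID = "occiglot/occiglot-7b-eu5"


def build_few_shot_prompt(examples, query, src="English", tgt="German"):
    """Format translation pairs as a plain-text prompt for a base (non-chat) model."""
    lines = []
    for source_text, target_text in examples:
        lines.append(f"{src}: {source_text}")
        lines.append(f"{tgt}: {target_text}")
        lines.append("")  # blank line between examples
    # End with an open target line so the model continues with the translation.
    lines.append(f"{src}: {query}")
    lines.append(f"{tgt}:")
    return "\n".join(lines)


def generate(prompt, max_new_tokens=40):
    """Load the model and complete the prompt (downloads several GB of weights)."""
    # Imported lazily so the prompt helper works without these packages installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
    inputs = tokenizer(prompt, return_tensors="pt")
    # Greedy decoding keeps the sketch deterministic.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


prompt = build_few_shot_prompt(
    [("Good morning.", "Guten Morgen."), ("Thank you.", "Danke.")],
    "Where is the station?",
)
print(prompt)
```

Calling `generate(prompt)` would then return the model's continuation. For chat-style or instruction-following use, the occiglot-7b-eu5-instruct variant mentioned above is likely the better fit.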