Occiglot-7B-FR-EN: A Multilingual Base Model for French, English, and Code
Occiglot-7B-FR-EN is a 7-billion-parameter generative language model developed by the Occiglot Research Collective. It is built on Mistral-7B-v0.1 and has undergone continued pre-training on 113 billion additional tokens of multilingual and code data. It is a general-purpose base model that is neither instruction-fine-tuned nor optimized for chat; an instruction-tuned variant is available as occiglot-7b-fr-en-instruct.
Key Capabilities
- Bilingual Proficiency: Specialized in generating text in both French and English.
- Code Generation: Enhanced with a significant portion of code data (approximately 13% of its training corpus).
- Continued Pre-training: Builds on the strong foundation of Mistral-7B-v0.1, improving performance on the target languages relative to the base model.
- Research Focus: Represents the first release in an ongoing open research project for multilingual language models, with an invitation for collaboration.
Good for
- Base Model Development: Ideal for researchers and developers looking for a strong multilingual foundation model to further fine-tune for specific tasks.
- Multilingual Text Generation: Suitable for applications requiring raw text generation in French, English, or code.
- Linguistic Research: Valuable for studies on multilingual model behavior and continued pre-training effects on specific language pairs.
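As a base model, Occiglot-7B-FR-EN continues raw text rather than following instructions, so prompts should be phrased as text to be completed. The sketch below shows one way to load it with the Hugging Face transformers library; the repository id "occiglot/occiglot-7b-fr-en" is an assumption inferred from the model name, and the sampling parameters are illustrative defaults, not recommendations from the model authors.

```python
def generate(prompt: str,
             model_id: str = "occiglot/occiglot-7b-fr-en",  # assumed repo id
             max_new_tokens: int = 64) -> str:
    """Complete a raw-text prompt with the base model (no chat template)."""
    # Lazy import keeps the sketch importable without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",   # place weights on available GPU(s) or CPU
        torch_dtype="auto",  # use the checkpoint's native precision
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,      # sampling suits open-ended base-model completion
        temperature=0.7,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


if __name__ == "__main__":
    # A French prompt is simply continued, not answered as an instruction.
    print(generate("La capitale de la France est"))
```

Because there is no chat or instruction template, the same call works for English, French, or code prompts alike; for instruction-following behavior, use occiglot-7b-fr-en-instruct instead.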