Occiglot-7B-EU5-Instruct: Multilingual Instruction-Tuned Model
Occiglot-7B-EU5-Instruct is a 7 billion parameter instruction-tuned generative language model from the Occiglot Research Collective. It is an instruct version of the occiglot-7b-eu5 base model, specifically designed to support the top-5 European Union languages: English, Spanish, French, German, and Italian, in addition to code generation. The model was fine-tuned on 400 million tokens of additional multilingual and code instructions.
Key Capabilities
- Multilingual Support: Optimized for English, Spanish, French, German, and Italian.
- Code Generation: Includes training on code instructions.
- Instruction Following: Instruction-tuned using the chatml template for conversational interactions.
- Research Focus: Part of an ongoing open research project for multilingual language models.
Performance Insights
Preliminary evaluations show improved performance over its base model (Occiglot-7b-eu5) across various benchmarks, including arc_challenge, belebele, hellaswag, mmlu, and truthfulqa in both aggregated and individual language tests. It generally performs competitively with or surpasses other specialized multilingual 7B models in its target languages, though English performance is typically higher.
Important Considerations
- Safety: The model was not safety aligned and may produce problematic outputs.
- Evaluation Caveats: Non-English evaluation results are based on partially machine-translated datasets and English prompts, which may introduce bias.