Occiglot-7B-ES-EN-Instruct Overview
Occiglot-7B-ES-EN-Instruct is a 7 billion parameter instruction-tuned language model developed by the Occiglot Research Collective. It is specifically designed to support both Spanish and English, alongside code generation, making it a polyglot model for Western languages. This model is an instruct version, fine-tuned from the occiglot-7b-es-en base model with an additional 160 million tokens of multilingual and code instructions.
Key Capabilities & Features
- Bilingual Proficiency: Strong performance in both Spanish and English, as evidenced by evaluation results across various benchmarks.
- Instruction Following: Trained with the chatml instruction template, enabling effective interaction through conversational prompts.
- Code Support: Includes code in its training data, suggesting capabilities for code-related tasks.
- Research-Oriented: Part of an ongoing open research project, with an invitation for collaborations on multilingual language models and evaluations.
Training & Data
The model underwent full instruction fine-tuning using axolotl on 8xH100 GPUs. Its training data was evenly split between Spanish and English, incorporating datasets like Open-Hermes-2.5 (English/Code) and Mentor-ES, Squad-es, OASST-2 (Spanish subset), and Aya-Dataset (Spanish subset) for Spanish.
Important Considerations
- Safety Alignment: The model was not safety aligned and may produce problematic outputs.
- Evaluation Nuances: Preliminary evaluation results, especially for non-English languages, are based on partially machine-translated datasets and English prompts, and should be interpreted with caution.