Name: occiglot/occiglot-7b-es-en API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: occiglot

Occiglot-7B-ES-EN: A Multilingual Base Model for Spanish, English, and Code

Occiglot-7B-ES-EN is a 7-billion-parameter generative language model developed by the Occiglot Research Collective. It is built on Mistral-7B-v0.1 and has undergone continued pre-training on 112 billion tokens of additional data, focusing on Spanish, English, and code. Training used a block size of 8,192 tokens per sample.

Key Capabilities

Bilingual Proficiency: Optimized for strong performance in both Spanish and English.
Code Understanding: Roughly 13% of the continued pre-training data is code, which makes the model suitable for code-related tasks.
Base Model: This is a general-purpose base model, not instruction-tuned or optimized for chat out-of-the-box. An instruction-tuned variant, occiglot-7b-es-en-instruct, is available for conversational applications.
Open Research Project: Part of an ongoing open research initiative by Occiglot, inviting collaborations for language model development and evaluation.
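Because this is a base model rather than an instruction-tuned one, it is usually prompted in completion style: the prompt shows a pattern and the model continues it. The sketch below is one way to do that for Spanish-to-English translation. The helper names `build_few_shot_prompt` and `complete`, the "Spanish:"/"English:" labels, and the generation settings are illustrative assumptions, not something the model card prescribes. The `complete` function assumes the Hugging Face `transformers` library is installed and that the weights are available under the model ID shown in the listing.

```python
def build_few_shot_prompt(examples, query, src="Spanish", tgt="English"):
    """Format (source, target) pairs as a plain-text completion prompt.

    The label format is a hypothetical convention chosen for this sketch;
    a base model simply continues whatever pattern the prompt establishes.
    """
    lines = []
    for source_text, target_text in examples:
        lines.append(f"{src}: {source_text}")
        lines.append(f"{tgt}: {target_text}")
        lines.append("")  # blank line between examples
    lines.append(f"{src}: {query}")
    lines.append(f"{tgt}:")  # left open for the model to complete
    return "\n".join(lines)


def complete(prompt, model_id="occiglot/occiglot-7b-es-en", max_new_tokens=40):
    """Generate a continuation with the transformers text-generation pipeline.

    Imported lazily so the prompt helper works without transformers installed.
    Loading a 7B model needs substantial memory; settings here are assumptions.
    """
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_id)
    result = generator(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=False,          # greedy decoding for repeatable output
        return_full_text=False,   # return only the continuation
    )
    return result[0]["generated_text"]


if __name__ == "__main__":
    prompt = build_few_shot_prompt(
        [("Hola", "Hello"), ("Buenos días", "Good morning")],
        "Gracias",
    )
    print(prompt)
```

For conversational use, the instruction-tuned variant mentioned above is likely the better starting point, since a base model may keep generating further example pairs after the answer.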

Training Details

The model was continually pre-trained on 128 x A100-80GB GPUs using the Determined framework, with bf16 precision and the AdamW optimizer. The training data distribution is approximately 52% Spanish, 34% English, and 13% code (rounded figures, so they sum to 99%).
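To put the percentages in absolute terms, the arithmetic below splits the 112 billion continued pre-training tokens by the stated shares and estimates how many 8,192-token blocks that corresponds to. This is a back-of-the-envelope sketch: the shares are the rounded figures from the listing, and the block count assumes every sample is a completely filled block, which the listing does not state.

```python
TOTAL_TOKENS = 112_000_000_000          # continued pre-training tokens (from the listing)
BLOCK_SIZE = 8_192                      # tokens per sample (from the listing)
SHARES_PCT = {"Spanish": 52, "English": 34, "Code": 13}  # rounded percentages


def tokens_per_source(total, shares_pct):
    """Approximate token count per data source from integer percentages."""
    return {name: total * pct // 100 for name, pct in shares_pct.items()}


def full_blocks(total, block_size=BLOCK_SIZE):
    """Number of full blocks, assuming every sample is fully packed."""
    return total // block_size


if __name__ == "__main__":
    split = tokens_per_source(TOTAL_TOKENS, SHARES_PCT)
    for name, count in split.items():
        print(f"{name}: ~{count / 1e9:.2f}B tokens")
    unaccounted = TOTAL_TOKENS - sum(split.values())
    print(f"Unaccounted (rounding): ~{unaccounted / 1e9:.2f}B tokens")
    print(f"Full {BLOCK_SIZE}-token blocks: {full_blocks(TOTAL_TOKENS):,}")
```

With these inputs the split comes to roughly 58.24B Spanish, 38.08B English, and 14.56B code tokens, with about 1.12B tokens left over from rounding.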

Good for

Developers and researchers working on applications requiring strong Spanish and English language understanding and generation.
Building custom instruction-tuned models for specific tasks in these languages.
Research into multilingual language models and continued pre-training techniques.
