bogdan1/llama2-bg: A Bulgarian-Optimized Llama-2 Model
This model is a 7-billion-parameter Llama-2 base model fine-tuned specifically for the Bulgarian language. Fine-tuning used PEFT with QLoRA over 12,000 steps, on the Chitanka dataset together with a collection of scraped Bulgarian news comments, primarily from 2022/2023.
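The exact training hyperparameters are not published here, so the sketch below is only an illustration of the kind of PEFT/QLoRA setup described: a 4-bit-quantized Llama-2-7b base with LoRA adapters attached via the `peft` library. The rank, alpha, dropout, and target modules are assumptions, not the values used for this model.

```python
# Hedged sketch of a QLoRA fine-tuning setup; hyperparameters below are
# illustrative assumptions, not the ones used to train bogdan1/llama2-bg.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# QLoRA: load the frozen base model in 4-bit NF4 quantization.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach trainable LoRA adapters; rank and target modules are assumed.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapters are trained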
Key Capabilities & Differentiators
- Bulgarian Language Generation: Generates markedly more coherent and contextually relevant Bulgarian text than the vanilla Llama-2-7b, reducing issues such as hallucination and unintended switching into English.
- Domain-Specific Training: The inclusion of news comments in its training data gives it a distinct voice, though this can lead to higher toxicity in certain outputs.
- Context Retention: Demonstrates better continuity in generated narratives compared to its base model.
Limitations & Considerations
- Toxicity: Due to the nature of the news comments dataset, the model can produce toxic or politically charged responses.
- Grammatical Imperfections: Some generated text may exhibit grammatical errors or include foreign words (e.g., Russian).
- Factuality: While improved over the base model, factuality is not guaranteed, and the model can still hallucinate.
When to Use This Model
This model is suitable for developers and researchers working on Bulgarian natural language generation tasks where a specialized, locally fine-tuned model is beneficial. It produces more natural Bulgarian output than the general Llama-2 base, making it a candidate for applications that require Bulgarian text generation, despite the limitations noted above. A minimal loading sketch follows.
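The snippet below is a hedged example of loading the model with `transformers` and sampling a short Bulgarian continuation. It assumes the repository hosts full model weights; if it instead ships only a PEFT adapter, load it with `peft`'s `AutoPeftModelForCausalLM`. The prompt and sampling settings are illustrative.

```python
# Minimal generation sketch; assumes merged weights in the repo and
# illustrative sampling settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bogdan1/llama2-bg")
model = AutoModelForCausalLM.from_pretrained(
    "bogdan1/llama2-bg",
    torch_dtype=torch.float16,
    device_map="auto",
)

prompt = "Българската литература"  # "Bulgarian literature"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Given the toxicity caveat above, outputs intended for end users should pass through moderation or filtering appropriate to the application.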