Gukbap-Mistral-7B: Korean Language Model Trained on Open-Source Data
Markr-AI/Gukbap-Mistral-7B is a 7-billion-parameter Korean language model developed by HumanF-MarkrAI and fine-tuned from mistralai/Mistral-7B-Instruct-v0.2. Its key differentiator is the training methodology: the model was trained on a proprietary dataset generated solely with open-source models, specifically microsoft/WizardLM-2-8x22B served via DeepInfra, following the data-processing and SFT methods proposed by LIMA and WizardLM. This avoids the potential terms-of-service violations that come with distilling training data from closed models such as GPT-4.
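Because Gukbap-Mistral-7B is a standard Mistral-architecture fine-tune, it should load with the stock transformers APIs. The sketch below is a minimal loading example; the dtype and device settings are illustrative assumptions, not settings prescribed by the model card:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo ID as given above; bfloat16 and device_map="auto" are
# illustrative choices, not documented requirements.
model_id = "Markr-AI/Gukbap-Mistral-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumes bf16-capable hardware
    device_map="auto",           # spreads layers across available devices
)
```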
Key Capabilities & Performance
The model demonstrates strong Korean language understanding and generation, achieving a state-of-the-art (SOTA) score of 6.06 on the internal LogicKor evaluation among Mistral-based Korean models of 7B parameters or fewer. This surpasses other 7B models such as Nous-Hermes-2-Mistral-7B-DPO and Mistral-7B-Instruct-v0.3. Notably, it scored 9.36 in the 'Writing' and 7.43 in the 'Coding' LogicKor categories. The model supports a context length of 8192 tokens.
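Since the model is fine-tuned from Mistral-7B-Instruct-v0.2, it presumably inherits that model's instruct chat template; the sketch below assumes the template ships with the tokenizer and continues from the loading snippet above. The Korean prompt and the generation parameters are illustrative:

```python
# Continues from the loading snippet above. Assumes the tokenizer bundles
# the Mistral instruct chat template inherited from the base model.
messages = [
    # "Please explain how to cook kimchi stew." (illustrative prompt)
    {"role": "user", "content": "김치찌개 끓이는 법을 알려주세요."},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Keep prompt plus completion within the 8192-token context window noted above.
output_ids = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```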
Good For
- Developing Korean-language applications that require a performant LLM trained entirely on open-source-generated data.
- Use cases where a clean, open-source data lineage is critical.
- Korean text generation, understanding, and coding tasks, particularly within the 7B-parameter class.