beomi/SOLAR-KOEN-10.8B

Text Generation | Concurrency Cost: 1 | Model Size: 15B | Quant: FP8 | Ctx Length: 8k | Tool Calling: Supported | Published: Feb 19, 2024 | License: cc-by-nc-sa-4.0 | Architecture: Transformer | Open Weights | Cold

SOLAR-KOEN-10.8B is a 10.8 billion parameter auto-regressive language model developed by Junbum Lee (Beomi) and Taekyoon Choi, based on an optimized Transformer architecture derived from Llama-2. It adds an expanded vocabulary and continual pretraining on a curated mix of Korean and English corpora. The model targets Korean language understanding and generation, including mixed Korean-English contexts, and tokenizes Korean text more efficiently than its base model.

SOLAR-KOEN-10.8B: Enhanced Korean-English Language Model

SOLAR-KOEN-10.8B is a 10.8 billion parameter language model developed by Junbum Lee (Beomi) and Taekyoon Choi. It builds upon upstage/SOLAR-10.7B-v1.0, an optimized Transformer architecture derived from Llama-2. What sets it apart from that base model is its adaptation for Korean: an expanded vocabulary and continual pretraining on Korean and English text.

Key Enhancements & Capabilities

  • Expanded Vocabulary: The vocabulary grows from 32,000 to 46,336 tokens (14,336 new entries), adding Korean vocabulary and merges built with SentencePiece BPE.
  • Bilingual Pretraining: It underwent continual pretraining on a curated mix of Korean and English corpora, improving its understanding and generation capabilities in both languages.
  • Efficient Korean Tokenization: A notable improvement is its tokenization efficiency for Korean text. For example, a Korean sentence like "안녕하세요, 오늘은 날씨가 좋네요." is tokenized into just 10 tokens by SOLAR-KOEN-10.8B, compared to 26 tokens by the original SOLAR-10.7B.
  • Text-only Input/Output: The model is designed to accept text input and produce text output exclusively.
  • Korean Benchmark Performance: Evaluated with EleutherAI's lm-evaluation-harness (polyglot branch), the model reports korquad scores of 81.0530 exact match and 87.6418 F1, and a kobest_boolq accuracy of 0.8711.
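The vocabulary and tokenization figures above can be turned into a quick back-of-the-envelope check. The following is a minimal sketch, not the authors' evaluation code: the constants are the numbers quoted in the list, while `token_reduction`, `compare_token_counts`, and `load_hf_tokenize` are hypothetical helper names introduced here for illustration. `load_hf_tokenize` assumes the Hugging Face `transformers` package is installed and that both repositories can be downloaded; it is defined but not called.

```python
from typing import Callable, Dict, List

# Example sentence and figures quoted in the list above.
SENTENCE = "안녕하세요, 오늘은 날씨가 좋네요."
BASE_VOCAB, KOEN_VOCAB = 32_000, 46_336
BASE_TOKENS, KOEN_TOKENS = 26, 10


def token_reduction(base_tokens: int, new_tokens: int) -> float:
    """Fraction of tokens saved relative to the base tokenizer."""
    return 1 - new_tokens / base_tokens


def compare_token_counts(
    text: str, tokenizers: Dict[str, Callable[[str], List[str]]]
) -> Dict[str, int]:
    """Count the tokens each named tokenizer function produces for `text`."""
    return {name: len(tokenize(text)) for name, tokenize in tokenizers.items()}


def load_hf_tokenize(repo_id: str) -> Callable[[str], List[str]]:
    """Hypothetical helper: return the `tokenize` method of a Hub tokenizer.

    Assumes `transformers` is installed and the repository is reachable.
    The import is local so the rest of this sketch runs without it.
    """
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id).tokenize


print(f"added vocabulary entries: {KOEN_VOCAB - BASE_VOCAB}")               # 14336
print(f"token reduction: {token_reduction(BASE_TOKENS, KOEN_TOKENS):.1%}")  # 61.5%
print(f"text per token: {BASE_TOKENS / KOEN_TOKENS:.1f}x")                  # 2.6x
```

To measure the counts directly rather than rely on the quoted figures, one could call `compare_token_counts(SENTENCE, {"SOLAR-10.7B": load_hf_tokenize("upstage/SOLAR-10.7B-v1.0"), "SOLAR-KOEN-10.8B": load_hf_tokenize("beomi/SOLAR-KOEN-10.8B")})`. Note that `tokenize` does not add special tokens such as a beginning-of-sequence marker, so the result may differ slightly from counts that include them.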

Ideal Use Cases

  • Korean Language Applications: Suited to tasks that require understanding and generating Korean text.
  • Bilingual (Korean-English) Processing: Suitable for applications that involve both Korean and English content, benefiting from its dual-language pretraining.
  • Research and Development: Can serve as a base model for further fine-tuning on specific Korean or bilingual NLP tasks.