Name: beomi/SOLAR-KOEN-10.8B API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: beomi

SOLAR-KOEN-10.8B: Enhanced Korean-English Language Model

SOLAR-KOEN-10.8B is a 10.8-billion-parameter language model developed by Junbum Lee (Beomi) and Taekyoon Choi. It builds on the upstage/SOLAR-10.7B-v1.0 architecture, an optimized Transformer derived from Llama-2. Its key differentiator is substantially improved Korean language processing.

Key Enhancements & Capabilities

Expanded Vocabulary: The model's vocabulary has been expanded from 32,000 to 46,336 tokens by adding Korean vocabulary and merges with SentencePiece BPE.
Bilingual Pretraining: It underwent continual pretraining on a curated mix of Korean and English corpora, improving its understanding and generation capabilities in both languages.
Efficient Korean Tokenization: Korean text is tokenized more efficiently. For example, the Korean sentence "안녕하세요, 오늘은 날씨가 좋네요." is split into 10 tokens by SOLAR-KOEN-10.8B, compared with 26 tokens by the original SOLAR-10.7B.
Text-only Input/Output: The model is designed to accept text input and produce text output exclusively.
Korean Benchmark Performance: Benchmarks run with EleutherAI's lm-evaluation-harness (polyglot branch) report results on several Korean tasks, including korquad (exact match 81.0530, F1 87.6418) and kobest_boolq (accuracy 0.8711).
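
The vocabulary and tokenization figures above can be turned into a quick back-of-the-envelope calculation. The sketch below is only arithmetic on the numbers quoted in this listing. It does not load either tokenizer, and the helper name token_savings is a hypothetical one introduced for illustration.

```python
# Figures quoted in the list above; nothing here loads the actual tokenizers.
BASE_VOCAB_SIZE = 32_000   # original SOLAR-10.7B vocabulary
KOEN_VOCAB_SIZE = 46_336   # expanded SOLAR-KOEN-10.8B vocabulary
BASE_TOKENS = 26           # example sentence tokenized by SOLAR-10.7B
KOEN_TOKENS = 10           # example sentence tokenized by SOLAR-KOEN-10.8B


def token_savings(before: int, after: int) -> tuple[float, float]:
    """Return (compression factor, fractional reduction) for two token counts.

    Hypothetical helper for illustration; not part of any library.
    """
    if before <= 0 or after <= 0:
        raise ValueError("token counts must be positive")
    return before / after, 1 - after / before


added_tokens = KOEN_VOCAB_SIZE - BASE_VOCAB_SIZE
factor, reduction = token_savings(BASE_TOKENS, KOEN_TOKENS)

print(f"Added vocabulary entries: {added_tokens}")
print(f"Sequence is {factor:.1f}x shorter ({reduction:.1%} fewer tokens)")
```

For the single example sentence this works out to 14,336 added vocabulary entries and a sequence roughly 2.6 times shorter. One sentence is not a benchmark, so the ratio should be read as indicative rather than as an average over Korean text.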

Ideal Use Cases

Korean Language Applications: Well suited to tasks that require understanding and generating Korean text.
Bilingual (Korean-English) Processing: Suitable for applications that involve both Korean and English content, benefiting from its dual-language pretraining.
Research and Development: A valuable base model for further fine-tuning on specific Korean or bilingual NLP tasks.
