Name: beomi/OPEN-SOLAR-KO-10.7B API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: beomi

Open-Solar-Ko Overview

Open-Solar-Ko is a 10.7-billion-parameter auto-regressive language model developed by Junbum Lee (Beomi). It builds on upstage/SOLAR-10.7B-v1.0 and strengthens Korean-language capability through an expanded vocabulary and additional pretraining on Korean corpora.

Key Capabilities & Features

Korean Language Optimization: Improved Korean language processing from pretraining on publicly available Korean datasets such as AI Hub, Modu Corpus, and Korean Wikipedia.
Expanded Vocabulary: The vocabulary grows from 32,000 tokens in the original SOLAR to 46,592, adding Korean vocabulary and merges for more efficient Korean tokenization.
Efficient Korean Tokenization: Korean phrases need far fewer tokens. For example, "안녕하세요, 오늘은 날씨가 좋네요." tokenizes into 8 tokens, compared with 26 in the original SOLAR-10.7B.
Apache 2.0 License: Released under the Apache 2.0 open-source license, which is possible because the model was trained exclusively on publicly accessible corpora.
Transformer Architecture: Uses an optimized transformer architecture derived from Llama-2, with text-only input and output.
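
The vocabulary and tokenization figures above can be checked with a few lines of Python. The following is a minimal sketch, not the author's evaluation script: the helper names (count_tokens, token_reduction, load_tokenizer) are hypothetical, and load_tokenizer assumes the Hugging Face transformers package is installed and that both repositories named on this page are reachable.

```python
# Figures quoted in the feature list above.
ORIGINAL_VOCAB = 32_000   # original SOLAR vocabulary size
EXPANDED_VOCAB = 46_592   # Open-Solar-Ko vocabulary size
SAMPLE_TEXT = "안녕하세요, 오늘은 날씨가 좋네요."


def count_tokens(tokenizer, text):
    """Count tokens for `text`, excluding special tokens such as BOS."""
    return len(tokenizer.encode(text, add_special_tokens=False))


def token_reduction(baseline_count, new_count):
    """Fraction of tokens saved relative to the baseline tokenizer."""
    return 1 - new_count / baseline_count


def load_tokenizer(repo_id):
    """Load a tokenizer from the Hugging Face Hub (needs network access).

    Hypothetical helper: the import is kept local so the arithmetic
    below still runs when transformers is not installed.
    """
    from transformers import AutoTokenizer
    return AutoTokenizer.from_pretrained(repo_id)


# Arithmetic on the quoted numbers only; no model download required.
added_tokens = EXPANDED_VOCAB - ORIGINAL_VOCAB
quoted_reduction = token_reduction(26, 8)
print(f"Added vocabulary entries: {added_tokens}")
print(f"Token reduction on the sample phrase: {quoted_reduction:.1%}")
```

To compare the real tokenizers, one could call count_tokens(load_tokenizer("upstage/SOLAR-10.7B-v1.0"), SAMPLE_TEXT) and count_tokens(load_tokenizer("beomi/OPEN-SOLAR-KO-10.7B"), SAMPLE_TEXT). The exact counts may differ from the quoted 26 and 8 depending on how special tokens are handled.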

Benchmarks

Evaluated with the LM Eval Harness - Korean (polyglot branch), the model shows competitive results on Korean tasks including kobest_boolq, kobest_copa, nsmc, and pawsx_ko.

Training Details

The model was trained on approximately 15 billion tokens from a curated mix of publicly accessible Korean corpora, with a final JSONL dataset size of about 61GB. Training was supported by the TPU Research Cloud program.
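
As a rough sanity check, the two training figures can be combined into an average number of bytes per token. This is a back-of-the-envelope sketch under stated assumptions: "61GB" is read as decimal gigabytes, the 15 billion tokens are assumed to cover the whole dataset once, and JSONL overhead (keys, quotes, newlines) is ignored, so the result is only an upper-end estimate.

```python
# Figures quoted in the training details above (both approximate).
DATASET_BYTES = 61 * 10**9     # "about 61GB", assuming 1 GB = 10^9 bytes
TRAINING_TOKENS = 15 * 10**9   # "approximately 15 billion tokens"

# Average bytes of raw JSONL per training token.
bytes_per_token = DATASET_BYTES / TRAINING_TOKENS

# One Hangul syllable takes 3 bytes in UTF-8.
hangul_syllable_bytes = len("한".encode("utf-8"))

# Upper bound on Hangul syllables per token, if the text were pure Hangul.
syllables_per_token = bytes_per_token / hangul_syllable_bytes

print(f"Bytes per token: {bytes_per_token:.2f}")
print(f"Hangul syllables per token (upper bound): {syllables_per_token:.2f}")
```

Under these assumptions each token covers roughly four bytes of text, which is in line with a tokenizer that packs about one or more Hangul syllables into each token.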
