beomi/OPEN-SOLAR-KO-10.7B

TEXT GENERATION | Concurrency Cost: 1 | Model Size: 15B | Quant: FP8 | Ctx Length: 8k | Tool Calling: Supported | Published: Jan 2, 2024 | License: apache-2.0 | Architecture: Transformer | 0.1K | Open Weights | Cold

beomi/OPEN-SOLAR-KO-10.7B is a 10.7 billion parameter auto-regressive language model developed by Junbum Lee (Beomi), based on the SOLAR-10.7B-v1.0 architecture. It features an expanded vocabulary and enhanced pretraining on publicly accessible Korean corpora, including AI Hub, Modu Corpus, and Korean Wikipedia. This model is optimized for Korean language processing, demonstrating significantly improved tokenization efficiency for Korean text compared to its base model.


Open-Solar-Ko Overview

Open-Solar-Ko is a 10.7 billion parameter auto-regressive language model developed by Junbum Lee (Beomi). It builds on upstage/SOLAR-10.7B-v1.0 and targets Korean specifically, through an expanded vocabulary and further pretraining on Korean corpora.
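As a rough illustration of how a model like this is typically loaded for text generation, the sketch below uses the Hugging Face transformers library. It assumes that transformers, torch, and accelerate are installed, that the repository ID matches the page title, and that the machine has enough memory for a 10.7B-parameter model. The dtype, device placement, and greedy decoding are illustrative choices, not settings taken from the model card.

```python
MODEL_ID = "beomi/OPEN-SOLAR-KO-10.7B"  # repository ID as shown in the page title


def generate(prompt: str, max_new_tokens: int = 64, model_id: str = MODEL_ID) -> str:
    """Continue `prompt` with the causal language model (downloads weights on first use)."""
    # Imported lazily so that defining this function does not require the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # assumption: bf16 is supported on the target hardware
        device_map="auto",           # assumption: accelerate is installed for automatic placement
    )

    # Tokenize the prompt and move the tensors to the device holding the model.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (not executed here because it downloads the full model):
# print(generate("대한민국의 수도는"))
```

The function returns the prompt together with its continuation, since `generate` on a causal model emits the input tokens followed by the new ones.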

Key Capabilities & Features

  • Korean Language Optimization: Significantly improved Korean language processing due to pretraining on publicly available Korean datasets like AI Hub, Modu Corpus, and Korean Wikipedia.
  • Expanded Vocabulary: Features an expanded vocabulary size of 46,592 (from 32,000 in the original SOLAR) with added Korean vocabulary and merges, leading to more efficient Korean tokenization.
  • Efficient Korean Tokenization: Demonstrates a substantial reduction in token count for Korean phrases (e.g., "안녕하세요, 오늘은 날씨가 좋네요." tokenizes into 8 tokens compared to 26 in the original SOLAR-10.7B).
  • Apache 2.0 License: Released under the permissive Apache 2.0 open-source license, made possible by training exclusively on publicly accessible corpora.
  • Transformer Architecture: Leverages an optimized transformer architecture derived from Llama-2, supporting text-only input and output.
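The vocabulary and tokenization figures in the list above can be checked with a short script. This is a minimal sketch, assuming the transformers library is installed and that both tokenizers can be downloaded from the Hugging Face Hub under the repository IDs named on this page. The helper names (`count_tokens`, `reduction`, `compare_hub_tokenizers`) are hypothetical, and counts may differ by one or two depending on whether special tokens are included.

```python
from typing import Callable, Dict, List

SAMPLE = "안녕하세요, 오늘은 날씨가 좋네요."  # example phrase quoted in the text


def count_tokens(tokenize: Callable[[str], List[str]], text: str) -> int:
    """Number of tokens produced by any callable that maps text to a token list."""
    return len(tokenize(text))


def reduction(base_count: int, new_count: int) -> float:
    """Fractional reduction in token count relative to the base tokenizer."""
    return 1.0 - new_count / base_count


def compare_hub_tokenizers(
    text: str = SAMPLE,
    base_id: str = "upstage/SOLAR-10.7B-v1.0",
    ko_id: str = "beomi/OPEN-SOLAR-KO-10.7B",
) -> Dict[str, int]:
    """Download both tokenizers and count tokens for `text` (needs network access)."""
    from transformers import AutoTokenizer  # lazy import: only needed for the Hub comparison

    counts = {}
    for model_id in (base_id, ko_id):
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        # tokenizer.tokenize returns string tokens without special tokens such as BOS.
        counts[model_id] = count_tokens(tokenizer.tokenize, text)
    return counts


# Arithmetic on the figures quoted in the text (no download required).
print(f"{reduction(26, 8):.1%} fewer tokens")      # 69.2% fewer tokens
print(46592 - 32000, "added vocabulary entries")   # 14592 added vocabulary entries
```

Fewer tokens per sentence means more Korean text fits in the same context window and fewer decoding steps are needed per sentence.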

Benchmarks

Performance on the LM Eval Harness - Korean (polyglot branch) shows competitive results across various Korean tasks, including kobest_boolq, kobest_copa, nsmc, and pawsx_ko.

Training Details

The model was trained on approximately 15 billion tokens from a curated mix of publicly accessible Korean corpora, with a final JSONL dataset size of about 61GB. Training was supported by the TPU Research Cloud program.
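For a sense of scale, the two training figures can be combined into an approximate bytes-per-token ratio. This is a back-of-the-envelope sketch, assuming "GB" means 10^9 bytes and "15 billion" means 15 x 10^9 tokens. The JSONL size includes JSON syntax and any metadata fields, so the ratio overstates the bytes of raw text per token.

```python
def bytes_per_token(dataset_bytes: float, n_tokens: float) -> float:
    """Average number of dataset bytes consumed per training token."""
    return dataset_bytes / n_tokens


# Figures from the text: about 61 GB of JSONL and about 15 billion tokens.
ratio = bytes_per_token(61e9, 15e9)
print(f"{ratio:.2f} bytes of JSONL per token")  # 4.07 bytes of JSONL per token
```

Since a Hangul syllable occupies 3 bytes in UTF-8, a ratio near 4 bytes suggests that an average token covers somewhat more than one syllable, before accounting for the JSONL overhead.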