lapa-llm/lapa-v0.1.2-instruct

License: gemma

Overview

Lapa LLM v0.1.2: Optimized for Ukrainian Language Processing

Lapa LLM v0.1.2 is a 12-billion-parameter open large language model built upon the Gemma-3 architecture, developed by a collaborative team of Ukrainian researchers. Its primary focus is to provide the most efficient and effective solution for Ukrainian language processing.

Key Achievements & Differentiators

  • Best Ukrainian Tokenizer: Uses a state-of-the-art tokenizer adaptation method that replaces 80,000 vocabulary tokens with Ukrainian ones. Ukrainian text therefore requires about 1.5 times fewer tokens than with the original Gemma 3 tokenizer, translating into roughly three times fewer computations for better results (see the tokenizer comparison sketch after this list).
  • Efficient Instruction-Tuned Model: Demonstrates highly competitive performance in various benchmarks, closely trailing leading models like MamayLM in some categories.
  • Leader in Pretraining: Achieves top performance in pretraining benchmarks for Ukrainian, making it a valuable base for further research and adaptation.
  • Maximum Openness: The project emphasizes transparency, offering the model for commercial use, publishing 25 training datasets, disclosing data filtering methods (including for disinformation detection), and providing open-source code and training documentation.

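To illustrate the tokenizer point above, here is a minimal sketch that counts tokens for the same Ukrainian sentence with the Lapa tokenizer and with the upstream Gemma 3 tokenizer. The Gemma checkpoint ID and the sample sentence are illustrative assumptions; both repositories may require accepting their licenses on Hugging Face.

```python
from transformers import AutoTokenizer

# Sample Ukrainian sentence (illustrative):
# "Kyiv is the capital of Ukraine and one of the oldest cities in Eastern Europe."
text = "Київ є столицею України та одним із найстаріших міст Східної Європи."

# Lapa tokenizer from this repository.
lapa_tok = AutoTokenizer.from_pretrained("lapa-llm/lapa-v0.1.2-instruct")
# Upstream Gemma 3 tokenizer for comparison (assumed checkpoint ID).
gemma_tok = AutoTokenizer.from_pretrained("google/gemma-3-12b-it")

lapa_count = len(lapa_tok.encode(text))
gemma_count = len(gemma_tok.encode(text))

print(f"Lapa tokens:  {lapa_count}")
print(f"Gemma tokens: {gemma_count}")
print(f"Gemma/Lapa ratio: {gemma_count / lapa_count:.2f}")
```
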
Key Capabilities & Use Cases

  • Superior Translation: Achieves 33 BLEU on FLORES for English-to-Ukrainian and Ukrainian-to-English, enabling natural, cost-effective translation of NLP datasets (see the usage sketch after this list).
  • Multimodal Processing: One of the best models for image processing in Ukrainian within its size class, as measured on the MMZNO benchmark.
  • Summarization & Q&A: Excels in these areas, indicating strong performance for Retrieval Augmented Generation (RAG) systems.
  • Culturally Aware Text Generation: Processes Ukrainian texts while considering cultural and historical context, avoiding code-switching.
  • Secure Document Processing: Suitable for handling sensitive documents without external server transfers.
  • Fine-tuning: Supports specialized solutions through fine-tuning for specific tasks.

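The sketch below shows how the instruct model can be prompted for translation through the standard transformers chat workflow. Because Lapa builds on Gemma 3, it is loaded here with the image-text-to-text classes; the dtype, prompt, and generation settings are illustrative assumptions, not recommendations from the Lapa team.

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "lapa-llm/lapa-v0.1.2-instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Ask for an English-to-Ukrainian translation via the chat template.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Translate into Ukrainian: Open models accelerate research.",
            }
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Drop the prompt tokens before decoding the reply.
reply = output[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(reply, skip_special_tokens=True))
```
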
Technical Details

  • Base Model: Gemma-3-12B
  • Input Context: 128K tokens, including support for images (896×896 resolution, 256 tokens each); see the multimodal sketch below.
  • Output Context: 8192 tokens.

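For image inputs, a minimal sketch using the transformers image-text-to-text pipeline is shown below. The image URL is a placeholder for any publicly reachable image, and the prompt simply asks for a description in Ukrainian.

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="lapa-llm/lapa-v0.1.2-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            # Placeholder URL; replace with a real image.
            {"type": "image", "url": "https://example.com/sample-photo.jpg"},
            # Ukrainian prompt: "Describe this image in Ukrainian."
            {"type": "text", "text": "Опиши це зображення українською."},
        ],
    }
]

result = pipe(text=messages, max_new_tokens=128)
# The pipeline returns the whole conversation; the last message is the model's reply.
print(result[0]["generated_text"][-1]["content"])
```
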
The model was trained on data rigorously evaluated for quality, including checks for propaganda and disinformation, readability, and grammar; high-quality materials from Harvard Library's Open Data division were used in the final stages of training.