VSSA-SDSA/LT_AI_DLKVM
LT_AI_DLKVM is a 1.04 billion parameter causal language model developed by the State Digital Solutions Agency (SDSA), based on Llama3 design principles. It is pretrained from scratch for the Lithuanian language, uses a custom 32,000-token tokenizer, and supports a 32,768-token context length. The model is intended for research, pretraining, and adaptation in Lithuanian text generation and NLP tasks, serving as a base generative model for further fine-tuning.
What is LT_AI_DLKVM?
LT_AI_DLKVM is a 1.04 billion parameter causal language model, developed by the State Digital Solutions Agency (SDSA) as part of the BLKT-VMS pipeline. It is built on Llama3 design principles and is specifically designed for the Lithuanian language. The model was pretrained from scratch in two stages, using the Lithuanian Text Corpus and a custom 32,000-token tokenizer.
Key Capabilities & Features
- Lithuanian Language Focus: Exclusively trained for Lithuanian text generation and NLP tasks.
- Long Context Window: Supports a 32,768-token context length, enabling processing of long Lithuanian documents.
- Base Generative Model: Intended for research, pretraining, and further fine-tuning for specific applications.
- Custom Tokenizer: Utilizes a specially trained 32,000-token tokenizer for optimal performance with Lithuanian.
- Two-Stage Training: Pretraining from scratch at an 8,196-token context, followed by long-context training at 32,768 tokens, on 8 NVIDIA H100-SXM5-80GB GPUs.
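A minimal loading sketch with the Hugging Face transformers library, assuming the checkpoint is transformers-compatible. The repo id `VSSA-SDSA/LT_AI_DLKVM` is taken from this card's title; adjust it if the actual hub path differs.

```python
def load_lt_ai_dlkvm(repo_id: str = "VSSA-SDSA/LT_AI_DLKVM"):
    """Load the tokenizer and model weights.

    Requires `transformers` and `torch`; the repo id default is an
    assumption based on this card's title.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype="auto")
    return tokenizer, model


if __name__ == "__main__":
    tokenizer, model = load_lt_ai_dlkvm()
    # The card states a 32,000-token vocabulary and a 32,768-token context;
    # these config fields are the usual places to verify that.
    print(len(tokenizer), model.config.max_position_embeddings)
```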
Should I use this for my use case?
LT_AI_DLKVM is ideal for:
- Research and Development: Experimenting with Lithuanian NLP, language generation, and domain adaptation.
- Base Model for Fine-tuning: Projects requiring robust Lithuanian text generation that can be specialized for chat, summarization, classification, or domain-specific content.
- Long-Context Applications: Scenarios where processing and generating long Lithuanian documents or conversations are critical.
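For the use cases above, a hedged generation sketch: `generate_lithuanian` is a hypothetical helper (not part of the release), and the repo id is assumed from this card's title. Sampling parameters are illustrative defaults, not recommended settings.

```python
def generate_lithuanian(prompt: str, max_new_tokens: int = 128) -> str:
    """Generate a Lithuanian continuation of `prompt`.

    Hypothetical helper for illustration; requires `transformers` and `torch`.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo_id = "VSSA-SDSA/LT_AI_DLKVM"  # assumed from this card's title
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype="auto")

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,       # illustrative sampling settings
            temperature=0.7,
            top_p=0.9,
        )
    # Drop the prompt tokens so only the continuation is returned.
    continuation = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(continuation, skip_special_tokens=True)
```

Because this is a base model, expect raw text continuation rather than instruction-following behavior.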
Limitations: This model is a base causal language model and is not instruction-tuned or task-specialized by default. It may generate factually inaccurate or biased content and is not suitable for high-stakes applications without additional fine-tuning, safeguards, and validation. Performance outside Lithuanian-centric domains may be less reliable.