Name: overthelex/qwen2.5-1.5b-edrsr-legal-uk API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: overthelex

Model Overview

overthelex/qwen2.5-1.5b-edrsr-legal-uk is a 1.5-billion-parameter base language model, produced as part of a scaling experiment conducted by overthelex for a PhD dissertation. It is built on the Qwen2.5-1.5B architecture and has undergone continued pretraining (CPT) on a specialized Ukrainian legal corpus.

Key Characteristics

Domain-Specific: The CPT data comes from the Unified State Register of Court Decisions of Ukraine (EDRSR), a corpus of 161.4 billion tokens across 33.9 million court decisions.
Performance: Perplexity on Ukrainian legal text fell by 71.5%, from 4.61 for the base model to 1.31 after CPT.
Training Efficiency: Trained for 17.8 hours on 8x NVIDIA H100 GPUs, processing 10 billion tokens with a throughput of 140K tokens/sec.
Base Model: This is a base model, meaning it is not instruction-tuned and will not follow conversational prompts or instructions directly.
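
The figures above can be cross-checked with simple arithmetic. The sketch below is only a sanity check on the quoted numbers; it assumes the perplexities are rounded to two decimals and that 140K tokens/sec is an average over the full run.

```python
# Figures quoted in the model description above.
base_ppl = 4.61            # perplexity before CPT
cpt_ppl = 1.31             # perplexity after CPT
hours = 17.8               # wall-clock training time
tokens_per_sec = 140_000   # reported throughput (assumed to be an average)

# Relative perplexity reduction, in percent.
reduction_pct = (base_ppl - cpt_ppl) / base_ppl * 100

# Tokens implied by runtime multiplied by throughput, in billions.
implied_tokens_b = hours * 3600 * tokens_per_sec / 1e9

print(f"perplexity reduction: {reduction_pct:.1f}%")
print(f"implied tokens: {implied_tokens_b:.2f}B")
```

With the rounded inputs this gives a reduction of about 71.6% and roughly 9.0 billion tokens, which is in line with the quoted 71.5% and 10 billion once rounding of the published values is allowed for.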

Intended Use Cases

Research: Suited to studies on domain adaptation of large language models, particularly for legal text in lower-resource languages such as Ukrainian.
Fine-tuning: Serves as a strong foundation for further fine-tuning on specific Ukrainian legal NLP tasks.
Scaling Law Analysis: Contributes to the analysis of scaling laws in continued pretraining across different model sizes.
Perplexity Evaluation: Useful for evaluating language model perplexity on specialized Ukrainian legal texts.
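
For the perplexity use case, the metric itself is the exponential of the mean negative log-likelihood per token. The following is a minimal sketch of that definition only; the function name `perplexity` and the input `token_logprobs` are illustrative, and how you obtain per-token log-probabilities depends on your inference stack.

```python
import math

def perplexity(token_logprobs):
    """Return exp(mean negative log-likelihood) over the given tokens.

    token_logprobs: natural-log probabilities the model assigned to each
    observed token (hypothetical input, produced by your own inference code).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Illustrative values only, not measurements from this model.
example_logprobs = [math.log(0.8), math.log(0.7), math.log(0.9)]
example_ppl = perplexity(example_logprobs)
print(f"example perplexity: {example_ppl:.3f}")
```

A perplexity of 1.0 would mean every token was predicted with probability 1, so values close to 1 on in-domain text indicate a very tight fit to that domain.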

Limitations

As a base model, it is not instruction-tuned and will not respond to instructions or chat prompts.
It was trained exclusively on Ukrainian court decisions, which may limit its generalization to other legal systems and to general-purpose tasks.
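
Because the model only continues text, prompts work best when framed as the start of a document rather than as a chat message. The sketch below builds a request body under the assumption that the serving endpoint accepts OpenAI-style text-completion fields (`model`, `prompt`, `max_tokens`, `temperature`); the helper name, the sampling values, and the Ukrainian prefix are all illustrative, and the provider's documentation should be checked before relying on any of them.

```python
import json

MODEL_ID = "overthelex/qwen2.5-1.5b-edrsr-legal-uk"

def build_completion_request(document_prefix, max_tokens=200):
    """Build a JSON-serializable body for a text-completion request.

    Assumption: the endpoint accepts OpenAI-style completion fields.
    No chat roles or instructions are included, since a base model
    simply continues the text it is given.
    """
    return {
        "model": MODEL_ID,
        "prompt": document_prefix,   # raw text to continue
        "max_tokens": max_tokens,
        "temperature": 0.7,          # arbitrary illustrative value
    }

# Hypothetical prefix: the opening lines of a court decision to continue.
prefix = "РІШЕННЯ\nІМЕНЕМ УКРАЇНИ\n\n"
body = json.dumps(build_completion_request(prefix), ensure_ascii=False)
print(body)
```

The request is only constructed here, not sent, so the sketch stays independent of any particular client library or endpoint URL.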
