overthelex/qwen2.5-1.5b-edrsr-legal-uk
The overthelex/qwen2.5-1.5b-edrsr-legal-uk model is a 1.5 billion parameter Qwen2.5-based language model developed by overthelex and adapted to the Ukrainian legal domain. It was created through continued pretraining on text from the Unified State Register of Court Decisions of Ukraine (EDRSR), a corpus of 161.4 billion tokens. As a base model adapted to Ukrainian legal text, it is suited to research on domain adaptation and to downstream fine-tuning for legal NLP tasks.
Model Overview
overthelex/qwen2.5-1.5b-edrsr-legal-uk is a 1.5 billion parameter base language model, produced as part of a scaling experiment conducted by overthelex for a PhD dissertation. It is built on the Qwen2.5-1.5B architecture and has undergone continued pretraining (CPT) on a specialized Ukrainian legal corpus.
Key Characteristics
- Domain-Specific: Continued pretraining on text from the Unified State Register of Court Decisions of Ukraine (EDRSR), a corpus of 161.4 billion tokens comprising 33.9 million court decisions.
- Performance: Perplexity on Ukrainian legal text fell from 4.61 for the base model to 1.31 after CPT, a reduction of 71.5%.
- Training Efficiency: Trained for 17.8 hours on 8x NVIDIA H100 GPUs, processing 10 billion tokens with a throughput of 140K tokens/sec.
- Base Model: This is a base model, meaning it is not instruction-tuned and will not follow conversational prompts or instructions directly.
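The performance and training figures above can be sanity-checked with a few lines of arithmetic. The sketch below is only a check against the rounded values quoted in this card, not the author's evaluation code; the constant and function names are illustrative. Recomputing from the rounded inputs gives a reduction of about 71.6% and about 8.97 billion tokens, which is close to the reported 71.5% and 10 billion once rounding of the published perplexities, hours, and throughput is taken into account.

```python
# Figures as quoted in the model card (already rounded there).
BASE_PPL = 4.61           # perplexity of the base model on Ukrainian legal text
CPT_PPL = 1.31            # perplexity after continued pretraining
TRAIN_HOURS = 17.8        # wall-clock training time
TOKENS_PER_SEC = 140_000  # reported throughput across the 8 GPUs


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


def tokens_processed(hours: float, tokens_per_sec: float) -> float:
    """Total tokens seen, assuming constant throughput for the whole run."""
    return hours * 3600 * tokens_per_sec


reduction_pct = 100 * relative_reduction(BASE_PPL, CPT_PPL)
total_tokens = tokens_processed(TRAIN_HOURS, TOKENS_PER_SEC)

print(f"perplexity reduction: {reduction_pct:.1f}%")
print(f"tokens processed: {total_tokens / 1e9:.2f}B")
```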
Intended Use Cases
- Research: Suited to studies on domain adaptation of large language models, particularly for legal text in lower-resource languages such as Ukrainian.
- Fine-tuning: Serves as a strong foundation for further fine-tuning on specific Ukrainian legal NLP tasks.
- Scaling Law Analysis: Contributes to the analysis of scaling laws in continued pretraining across different model sizes.
- Perplexity Evaluation: Useful for evaluating language model perplexity on specialized Ukrainian legal texts.
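For the perplexity use case, a minimal sketch follows. It assumes the checkpoint can be loaded through the Hugging Face `transformers` auto classes under the identifier in this card's title, which the card itself does not confirm. The helper names `perplexity_from_nll` and `score_text` are hypothetical. Perplexity is computed as the exponential of the mean per-token negative log-likelihood; this may differ from the evaluation protocol behind the reported numbers (for example in context length or windowing).

```python
import math

# Assumption: the model is published on the Hugging Face Hub under this ID.
MODEL_ID = "overthelex/qwen2.5-1.5b-edrsr-legal-uk"


def perplexity_from_nll(total_nll: float, n_tokens: int) -> float:
    """Perplexity = exp(mean negative log-likelihood), natural log."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return math.exp(total_nll / n_tokens)


def score_text(text: str, model_id: str = MODEL_ID) -> float:
    """Perplexity of one text that fits in the model's context window.

    Heavy dependencies are imported lazily so the pure-Python helper above
    can be used without torch or transformers installed.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    model.eval()

    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy
        # over the shifted (next-token) positions.
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())
```

Calling `score_text` on an excerpt of a court decision downloads the weights on first use. For longer documents, a sliding-window evaluation would be needed; that is outside this sketch.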
Limitations
- As a base model, it is not instruction-tuned and will not respond to instructions or chat prompts.
- Its continued pretraining used only Ukrainian court decisions, which may limit generalization to other legal systems, other kinds of legal text, and general-purpose tasks.