Mecellem-Qwen3-4B-TR: Turkish Legal Language Model
Mecellem-Qwen3-4B-TR is a 4-billion-parameter decoder-only language model developed by newmindai and adapted specifically for the Turkish legal domain. Built on the Qwen3-4B architecture, it underwent single-phase, large-scale Continual Pre-training (CPT) on approximately 270.8 billion tokens. The training corpus combines Turkish legal sources (such as Yargıtay, Danıştay, and YÖKTEZ) with general Turkish web data (FineWeb2, CulturaX), balancing domain specificity with general language proficiency.
Key Capabilities
- Domain-Specific Expertise: Deeply adapted for Turkish legal language, preserving general language understanding while injecting specialized legal knowledge.
- Large-Scale CPT: Trained on ~270.8 billion tokens with a single-phase CPT strategy, leveraging the model's 4B-parameter capacity for effective domain adaptation.
- Extended Context Window: Supports a context length of up to 40,960 tokens (maximum position embeddings), enabling processing of long legal documents.
- Enhanced Legal Performance: Consistently outperforms the base Qwen3-4B model across various Turkish legal quality objectives, as evaluated by the Muhakim reward model.
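Even with the 40,960-token window, very long case files may exceed the usable input budget once the prompt and generation headroom are accounted for, so documents are often chunked before inference. A minimal paragraph-aware chunking sketch (the characters-per-token ratio is a rough heuristic for illustration, not a property of the model's actual tokenizer):

```python
def chunk_document(text: str, max_tokens: int = 40_960,
                   chars_per_token: float = 4.0, reserve: int = 2_048) -> list[str]:
    """Split a long document into chunks that fit the context window.

    `reserve` leaves room for the prompt and generated tokens; the
    chars-per-token ratio is a crude estimate, not the real tokenizer.
    Paragraph boundaries (blank lines) are preserved where possible.
    """
    max_chars = int((max_tokens - reserve) * chars_per_token)
    paragraphs = text.split("\n\n")
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para  # oversized single paragraphs become their own chunk
    if current:
        chunks.append(current)
    return chunks
```

In practice the heuristic should be replaced by the model's own tokenizer to count tokens exactly.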
Good for
- Turkish Legal Text Generation: Creating accurate and contextually relevant legal documents in Turkish.
- Legal Document Summarization: Efficiently summarizing lengthy Turkish legal texts.
- Legal Question Answering: Providing precise answers to queries within the Turkish legal framework.
- Retrieval-Augmented Generation (RAG) Applications: Enhancing RAG systems with domain-specific Turkish legal understanding.
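For the RAG use case above, a common pattern is to splice retrieved statute or case-law passages into the prompt ahead of the user's question. A minimal sketch of that prompt assembly (the Turkish instruction wording and the `[Belge N]` passage format are illustrative assumptions, not a template shipped with the model):

```python
def build_rag_prompt(question: str, passages: list[str]) -> str:
    """Assemble a Turkish legal RAG prompt: numbered context passages
    followed by the user's question and an answer instruction.

    The instruction text is a hypothetical example, not an official template.
    """
    context = "\n\n".join(
        f"[Belge {i}]\n{p.strip()}" for i, p in enumerate(passages, start=1)
    )
    return (
        "Aşağıdaki hukuki belgelere dayanarak soruyu yanıtla. "  # "Answer based on the legal documents below."
        "Yanıt belgelerde bulunamıyorsa bunu açıkça belirt.\n\n"  # "If not found, say so explicitly."
        f"{context}\n\n"
        f"Soru: {question}\nYanıt:"
    )
```

The resulting string can then be passed to the model's generation pipeline like any other prompt.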