# VLSP2025-LegalSML/qwen3-1.7b-legal-pretrain: Vietnamese Legal Domain Model
This model is a specialized 1.7 billion parameter language model, continually pretrained from the Qwen3-1.7B architecture by the VLSP 2025 LegalSLM Task Organizers. Its primary focus is on Vietnamese legal text understanding and legal question answering.
## Key Capabilities & Training
- Domain Specialization: Adapted specifically for the Vietnamese legal domain through extensive continual pretraining.
- Training Data: Utilizes a curated corpus of approximately 144,000 Vietnamese legal texts, comprising:
  - ~96,000 official legal documents (laws, decrees, circulars).
  - ~48,000 legal news articles and commentary.
- Base Architecture: Built upon Qwen/Qwen3-1.7B.
- Context Length: Supports a maximum sequence length of 4096 during training, with a stated context length of 32,768 tokens.
- Training Method: Employed full-parameter fine-tuning for continual pretraining, without quantization or LoRA.
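Continual pretraining at a fixed 4,096-token window typically means that long legal documents are split, and short ones concatenated, into uniform training sequences. The sketch below illustrates one common packing scheme under stated assumptions: the organizers' actual data pipeline is not published, and the `eos_id` separator and drop-last-chunk behavior are conventional choices, not documented ones.

```python
from typing import Iterator, List

MAX_SEQ_LEN = 4096  # training sequence length stated in this model card

def pack_sequences(token_streams: List[List[int]],
                   max_len: int = MAX_SEQ_LEN,
                   eos_id: int = 0) -> Iterator[List[int]]:
    """Concatenate tokenized documents and emit fixed-length chunks.

    Documents are separated by an EOS token; the final partial chunk
    is dropped, a common simplification in pretraining pipelines.
    """
    buffer: List[int] = []
    for tokens in token_streams:
        buffer.extend(tokens)
        buffer.append(eos_id)  # mark the document boundary
        while len(buffer) >= max_len:
            yield buffer[:max_len]
            buffer = buffer[max_len:]

# Toy example: three "documents" of placeholder token ids.
docs = [[1] * 5000, [2] * 3000, [3] * 6000]
chunks = list(pack_sequences(docs))
```

With 14,003 total tokens (including separators), this yields three full 4,096-token chunks and discards the remainder.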
## Intended Use & Limitations
- Good for: Developers and researchers working on legal AI applications in Vietnamese, particularly for tasks requiring deep understanding of legal documents or answering legal queries.
- License: Released for research purposes only within the scope of the VLSP 2025 Evaluation Campaign. Usage outside the campaign must adhere to the relevant licensing agreements.
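Because this is a continually pretrained base checkpoint rather than an instruction-tuned one, downstream legal question answering is typically framed as text completion. Below is a minimal sketch of a completion-style prompt builder; the field labels and wording are a hypothetical convention for illustration, not a format defined by the organizers.

```python
def build_legal_qa_prompt(question: str, context: str = "") -> str:
    """Frame a Vietnamese legal question as a completion prompt.

    The labels below ("Văn bản pháp luật", "Câu hỏi", "Trả lời" -
    legal text, question, answer) are an illustrative convention,
    not one specified by this model card.
    """
    parts = []
    if context:
        parts.append(f"Văn bản pháp luật:\n{context}\n")
    parts.append(f"Câu hỏi: {question}")
    parts.append("Trả lời:")  # the model completes the answer here
    return "\n".join(parts)

prompt = build_legal_qa_prompt(
    "Thời hiệu khởi kiện tranh chấp hợp đồng là bao lâu?",
    context="Điều 429 Bộ luật Dân sự 2015 ...",
)
```

Ending the prompt with an answer cue lets a base language model continue naturally instead of requiring a chat template.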
This model offers a robust foundation for developing legal-specific NLP solutions within the Vietnamese context, leveraging a substantial and relevant dataset.