ubitech-edg/mistral-12b-cpt-sft

Text Generation
Concurrency Cost: 1
Model Size: 12B
Quant: FP8
Ctx Length: 32k
Tool Calling: Supported
Published: Oct 19, 2025
License: apache-2.0
Architecture: Transformer
Open Weights, Cold

ubitech-edg/mistral-12b-cpt-sft is a 12-billion-parameter causal language model developed by ubitech-edg and built on the mistral-12b-cpt base model. It is trained with a two-stage LoRA fine-tuning process: continual pretraining (CPT) to extend general knowledge, followed by supervised fine-tuning (SFT) on synthetic QA data to strengthen instruction following. The approach is intended to improve coherence, factual recall, and reasoning, which makes the model a candidate for question answering and general text generation.

Overview

ubitech-edg/mistral-12b-cpt-sft is a 12 billion parameter causal language model that integrates continual pretraining (CPT) and supervised fine-tuning (SFT). This two-stage LoRA fine-tuning process aims to enhance the model's general knowledge and instruction-following capabilities, particularly for question-answering tasks.

Key Capabilities & Training

  • Two-Stage Fine-Tuning: The model first undergoes CPT to expand its general knowledge using diverse domain-specific datasets like arxiv.jsonl, gov.jsonl, news.jsonl, and wiki.jsonl. Subsequently, SFT is applied using axolotl_deduplicated_synthetic_qa.jsonl to improve its ability to follow instructions and generate coherent, factual responses.
  • LoRA Efficiency: Fine-tuning uses an 8-bit LoRA adapter (r=16, alpha=32, dropout=0.05) applied to the q_proj, k_proj, v_proj, and o_proj attention projections, so only the low-rank adapter weights on those layers are trained rather than the full 12B parameters.
  • Hardware & Framework: Training was conducted on Leonardo EuroHPC, utilizing 8 × 2 × A100 64 GB GPUs with Axolotl, DeepSpeed, PyTorch 2.5.1, and CUDA 12.1.
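  • Context Length: The model card states a sequence length of 2048 tokens. This most likely refers to the sequence length used during fine-tuning, since the listing metadata reports a 32k context length.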
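
To make the LoRA hyperparameters listed above more concrete, the following minimal sketch computes the adapter scaling factor (alpha / r) and estimates how many weights the adapter would train. The values r=16, alpha=32, dropout=0.05 and the four target module names come from the model card. The layer dimensions (HIDDEN, Q_OUT, KV_OUT, NUM_LAYERS) are assumptions chosen for illustration and are not stated in the source; read the real values from the model's configuration before relying on the result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSpec:
    """LoRA hyperparameters as reported in the model card."""
    r: int = 16
    alpha: int = 32
    dropout: float = 0.05
    target_modules: tuple = ("q_proj", "k_proj", "v_proj", "o_proj")

    @property
    def scaling(self) -> float:
        # The low-rank update B @ A is multiplied by alpha / r.
        return self.alpha / self.r


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Weights in one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# ASSUMED dimensions for illustration only (not given in the model card).
HIDDEN = 5120        # hypothetical hidden size
Q_OUT = 32 * 128     # hypothetical: 32 query heads x head_dim 128
KV_OUT = 8 * 128     # hypothetical: 8 key/value heads x head_dim 128
NUM_LAYERS = 40      # hypothetical number of transformer layers

PROJ_SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
}


def trainable_params(spec: LoraSpec, shapes=PROJ_SHAPES, num_layers=NUM_LAYERS) -> int:
    """Total adapter weights across all targeted projections and layers."""
    per_layer = sum(lora_params(*shapes[name], spec.r) for name in spec.target_modules)
    return per_layer * num_layers


spec = LoraSpec()
print("scaling:", spec.scaling)
print("trainable adapter weights:", trainable_params(spec))
```

Under these assumed dimensions the adapter holds roughly 20 million weights, a small fraction of the 12 billion in the base model, which is the sense in which LoRA adaptation is efficient.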

Use Cases

The two-stage training targets question answering, so the model fits applications that need coherent, factually grounded answers and basic reasoning over general-domain material such as scientific, government, news, and encyclopedic text. It can also be used for general text generation.
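
For a question-answering use case, a minimal sketch with the Hugging Face transformers library might look like the following. Several things here are assumptions rather than documented facts: that the repository can be loaded directly with AutoModelForCausalLM (it may instead ship only adapter weights), that bfloat16 is an appropriate dtype, and that a plain "Question / Answer" prompt is suitable. The build_prompt helper and its template are hypothetical, since the model card does not document a prompt format.

```python
MODEL_ID = "ubitech-edg/mistral-12b-cpt-sft"


def build_prompt(question: str) -> str:
    """Hypothetical plain QA prompt; the model card documents no template."""
    return f"Question: {question.strip()}\nAnswer:"


def answer(question: str, max_new_tokens: int = 128) -> str:
    """Generate an answer. Assumes the repo loads as a full causal LM."""
    # Imported lazily so build_prompt works without these packages installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # assumption: hardware supports bf16
        device_map="auto",           # requires the accelerate package
    )

    inputs = tokenizer(build_prompt(question), return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()


if __name__ == "__main__":
    print(build_prompt("What is continual pretraining?"))
```

Calling answer("What is continual pretraining?") downloads the weights on first use, so it needs enough disk space and GPU memory for a 12B model.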