arbenilazi/dpo-mbpp-merged

TEXT GENERATION · Concurrency Cost: 1 · Model Size: 7.6B · Quant: FP8 · Ctx Length: 32K · Published: Feb 22, 2026 · Architecture: Transformer

The arbenilazi/dpo-mbpp-merged model is a 7.6-billion-parameter variant of Qwen2.5-Coder-7B-Instruct, fine-tuned with Direct Preference Optimization (DPO) on the MBPP dataset for code generation. It supports a 32K context length, was trained with 4-bit QLoRA, and is distributed as a fully merged bf16 checkpoint for direct inference.

Overview

arbenilazi/dpo-mbpp-merged is a 7.6 billion parameter model based on Qwen2.5-Coder-7B-Instruct, fine-tuned for code generation tasks using Direct Preference Optimization (DPO) on the google-research-datasets/mbpp dataset.

Key Characteristics

  • Base Model: Qwen/Qwen2.5-Coder-7B-Instruct, known for its coding capabilities.
  • Training Method: DPO using Hugging Face TRL's DPOTrainer, with a separate frozen reference model.
  • Efficiency: Trained with 4-bit QLoRA (nf4 quantization with bf16 compute) for efficient fine-tuning.
  • Deployment Ready: The LoRA adapters were merged into the base model using PEFT, yielding a fully merged safetensors checkpoint in bf16 precision. No LoRA loading or quantization is required at inference time, which simplifies deployment (see the loading sketch after this list).
  • Context Length: Supports a context length of 32,768 tokens.
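
Because the checkpoint is fully merged, loading it needs nothing beyond standard transformers. A minimal sketch, assuming a recent transformers version; the prompt is illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "arbenilazi/dpo-mbpp-merged"

# Fully merged bf16 checkpoint: no PEFT adapter or 4-bit loading needed.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user",
     "content": "Write a Python function that checks whether a string is a palindrome."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Since no adapter loading is involved, serving the repo directly through vLLM or another standard inference stack should also work unmodified.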

Training Details

The model was trained for 2 epochs with a batch size of 4 and a learning rate of 2e-5. The LoRA configuration targeted the q_proj, k_proj, v_proj, and o_proj modules with rank (r) 16 and alpha 32; these settings map onto TRL roughly as sketched below.
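
The card does not include training code, but the listed settings correspond to a fairly standard TRL DPOTrainer setup. The following is a sketch under assumptions: argument names follow recent TRL/PEFT releases, and build_preference_pairs is a hypothetical placeholder, since the card does not say how prompt/chosen/rejected preference pairs were derived from MBPP.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer

base_id = "Qwen/Qwen2.5-Coder-7B-Instruct"

# 4-bit QLoRA base: nf4 quantization with bf16 compute, as on the card.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, quantization_config=bnb, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# LoRA settings from the card: r=16, alpha=32, attention projections only.
model = get_peft_model(model, LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
))

# Separate frozen reference model, as described on the card.
ref_model = AutoModelForCausalLM.from_pretrained(
    base_id, quantization_config=bnb, device_map="auto"
)

# Hypothetical helper: DPOTrainer expects "prompt"/"chosen"/"rejected"
# columns, and the pairing procedure over MBPP is not documented.
train_dataset = build_preference_pairs(
    load_dataset("google-research-datasets/mbpp", split="train")
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=DPOConfig(  # hyperparameters listed on the card
        output_dir="dpo-mbpp",
        num_train_epochs=2,
        per_device_train_batch_size=4,
        learning_rate=2e-5,
        bf16=True,
    ),
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()

# Merge: save the adapter, reload the base in bf16, fold the adapter in,
# and save the fully merged checkpoint used for inference.
trainer.model.save_pretrained("dpo-mbpp-adapter")
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, "dpo-mbpp-adapter").merge_and_unload()
merged.save_pretrained("dpo-mbpp-merged")
tokenizer.save_pretrained("dpo-mbpp-merged")
```

Note that the merge is performed against a bf16 reload of the base model rather than the 4-bit training copy, which matches the bf16 precision of the published checkpoint.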

Good For

  • Developers and researchers focused on improving code generation performance.
  • Applications requiring a specialized model for programming-related tasks.
  • Scenarios where a fully merged, ready-to-use bf16 model is preferred for ease of inference.