jordanpainter/diallm-qwen-dpo-aus

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quantization: FP8 · Context Length: 32k · Published: Apr 16, 2026 · Architecture: Transformer

The jordanpainter/diallm-qwen-dpo-aus model is an 8-billion-parameter, instruction-tuned causal language model fine-tuned from jordanpainter/diallm-qwen-sft-aus. It was trained with Direct Preference Optimization (DPO) and supports a context length of 32768 tokens. The model is designed to generate conversational text from user prompts, using preference-based learning to improve response quality.


Overview

This model, jordanpainter/diallm-qwen-dpo-aus, is an 8-billion-parameter language model fine-tuned from the jordanpainter/diallm-qwen-sft-aus base model. It uses Direct Preference Optimization (DPO), the method introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model," to improve its conversational quality and align its outputs with preferred responses. Training was carried out with the TRL (Transformer Reinforcement Learning) library.
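
For reference, DPO optimizes the policy directly on preference pairs. Writing $\pi_\theta$ for the model being trained, $\pi_{\text{ref}}$ for the frozen SFT reference model, $(y_w, y_l)$ for the preferred and dispreferred responses to a prompt $x$, and $\beta$ for a temperature controlling how far the policy may drift from the reference, the objective from the paper is:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$

Minimizing this loss widens the margin between the implicit rewards of the chosen and rejected responses, without ever fitting a separate reward model.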

Key Capabilities

  • Instruction Following: Generates responses to specific user instructions or questions (see the usage sketch after this list).
  • Preference-Based Fine-tuning: Benefits from DPO training, which aims to produce higher-quality, better-aligned conversational outputs.
  • Large Context Window: Supports a context length of 32768 tokens, allowing it to process and generate longer, more coherent dialogues.
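
A minimal inference sketch with Hugging Face Transformers follows. Only the model ID comes from this card; the presence of a chat template, the example prompt, and the generation settings are illustrative assumptions, not published recommendations:

```python
# Minimal inference sketch; assumes the checkpoint is on the Hugging Face Hub
# under this ID and ships a chat template. Generation settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jordanpainter/diallm-qwen-dpo-aus"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # hypothetical choice; use a dtype your hardware supports
    device_map="auto",
)

messages = [{"role": "user", "content": "Give me three tips for writing clear bug reports."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True))
```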

Training Details

The model was trained with DPO, which optimizes a language model directly on human preference data without fitting an explicit reward model, sharpening its ability to generate helpful, preferred responses. Training used TRL version 0.28.0, with Transformers 4.57.6 and PyTorch 2.5.1+cu121.
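
A minimal training sketch with TRL's DPOTrainer is shown below. Only the base checkpoint and the use of TRL/DPO come from this card; the dataset, hyperparameters, and output directory are hypothetical placeholders:

```python
# Minimal DPO fine-tuning sketch with TRL. The dataset and hyperparameters
# are illustrative placeholders, not the values used to train this model.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "jordanpainter/diallm-qwen-sft-aus"  # SFT checkpoint used as the starting point
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A DPO dataset needs "prompt", "chosen", and "rejected" columns;
# this public preference dataset is used here only as an example.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

config = DPOConfig(
    output_dir="diallm-qwen-dpo",
    beta=0.1,          # strength of the penalty keeping the policy near the reference
    max_length=32768,  # matches the card's stated context length
    per_device_train_batch_size=1,
)

trainer = DPOTrainer(
    model=model,  # with no ref_model given, TRL clones a frozen reference copy
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,  # recent TRL versions take the tokenizer here
)
trainer.train()
```

The SFT checkpoint serves double duty: it is both the initialization for the trained policy and, frozen, the reference model $\pi_{\text{ref}}$ in the loss above.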