W-61/mistral-7b-base-margin-dpo-hh-helpful-4xh200-batch-64

Text generation · Concurrency cost: 1 · Model size: 7B · Quantization: FP8 · Context length: 4k · Published: Apr 18, 2026 · Architecture: Transformer

W-61/mistral-7b-base-margin-dpo-hh-helpful-4xh200-batch-64 is a 7 billion parameter language model fine-tuned from a Mistral-7B base. It was optimized with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset, with a focus on helpfulness. It is designed for tasks requiring helpful, aligned responses and retains the 4096-token context length of its Mistral-7B foundation.
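
The checkpoint should load like any other Mistral-7B fine-tune via the Transformers library. A minimal inference sketch, assuming the hub repository id matches this page's name and the checkpoint ships standard Hugging Face weights:

```python
# Minimal inference sketch. The repository id is assumed to match the
# page name; adjust if the hosted checkpoint lives elsewhere.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/mistral-7b-base-margin-dpo-hh-helpful-4xh200-batch-64"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # FP8 is a serving detail; bf16 is a safe local default
    device_map="auto",
)

prompt = "How do I write a polite follow-up email after an interview?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```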


Model Overview

This model, mistral-7b-base-margin-dpo-hh-helpful-4xh200-batch-64, is a 7 billion parameter language model derived from a Mistral-7B base. It has been fine-tuned using a Direct Preference Optimization (DPO) approach on the Anthropic/hh-rlhf dataset, with a specific emphasis on generating helpful responses.

Key Characteristics

  • Base Model: Fine-tuned from a Mistral-7B base model.
  • Fine-tuning Method: Direct Preference Optimization (DPO) with β = 0.1 (see the loss sketch after this list).
  • Training Data: Optimized on the Anthropic/hh-rlhf dataset, indicating a focus on human helpfulness preferences.
  • Performance Metrics: Final validation loss of 0.3349; DPO reward-margin metrics were also tracked during training.
  • Context Length: Inherits the 4096 token context window from its Mistral-7B foundation.
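
The card does not spell out the exact objective, but standard DPO with β = 0.1 has the shape below. This is a minimal sketch; the optional `margin` term is an assumption inferred from the "margin-dpo" tag in the model name, not a documented detail of this checkpoint:

```python
# Sketch of the DPO objective with an optional margin term. The margin
# variant is an assumption based on the "margin-dpo" model name; the
# exact formulation used for this checkpoint is not documented here.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps,
             beta=0.1, margin=0.0):
    """Per-example DPO loss from summed log-probs of chosen/rejected responses."""
    # Log-ratios of the policy against the frozen reference model
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Implicit reward gap between chosen and rejected, scaled by beta;
    # a positive margin demands the chosen response win by at least that much
    logits = beta * (chosen_logratio - rejected_logratio) - margin
    loss = -F.logsigmoid(logits)
    # The "reward margin" metric commonly logged during DPO training
    reward_margin = (beta * (chosen_logratio - rejected_logratio)).detach()
    return loss.mean(), reward_margin.mean()
```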

Training Details

The model was trained for 1 epoch (600 steps) with a learning rate of 5e-7, a total batch size of 64 across 4 GPUs, and a cosine learning-rate schedule with a 0.1 warmup ratio. Validation loss and DPO reward-margin metrics improved consistently over the course of training.
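
For reference, these hyperparameters map onto a TRL `DPOConfig` roughly as follows. This is a plausible reconstruction, not the authors' actual training script; in particular, the per-device batch size of 16 assumes no gradient accumulation (16 × 4 GPUs = 64), and the `output_dir` name is illustrative:

```python
# Hyperparameters from the card expressed as a TRL DPOConfig; a sketch of
# a plausible setup, not the authors' actual script.
from trl import DPOConfig

config = DPOConfig(
    output_dir="mistral-7b-base-margin-dpo-hh-helpful",  # hypothetical path
    beta=0.1,                        # DPO temperature from the card
    learning_rate=5e-7,
    per_device_train_batch_size=16,  # assumption: 64 total / 4 GPUs, no grad accumulation
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    max_length=4096,                 # matches the Mistral-7B context window used here
    bf16=True,
)
```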