W-61/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64

Text Generation · Concurrency Cost: 1 · Model Size: 7B · Quant: FP8 · Ctx Length: 4k · Published: Apr 18, 2026 · Architecture: Transformer · Cold

W-61/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64 is a 7-billion-parameter language model fine-tuned from a Mistral-7B base using Epsilon DPO on the Anthropic/hh-rlhf dataset, with a focus on helpfulness and alignment. It is intended for tasks that call for helpful, aligned responses, building on its Mistral-7B foundation.


Model Overview

This model, W-61/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64, is a 7 billion parameter language model derived from a Mistral-7B base. It has undergone a specific fine-tuning process using Epsilon DPO (Direct Preference Optimization) on the Anthropic/hh-rlhf dataset. This training methodology aims to enhance the model's helpfulness and alignment with human preferences.
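The card does not define the "Epsilon" variant, so as a point of reference, here is a minimal plain-Python sketch of the standard DPO objective this training builds on: a logistic loss on the difference of log-probability ratios between the policy and a frozen reference model. The numeric values are illustrative placeholders, not taken from this model's training.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the total log-probability of the chosen or
    rejected completion under the policy or the frozen reference.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(logits)), written in a numerically stable form
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

# Illustrative numbers: the policy favors the chosen response slightly
# more than the reference does, so the loss falls below log(2) ≈ 0.693.
loss = dpo_loss(-10.0, -12.0, -11.0, -11.5, beta=0.1)
```

Widening the policy's margin between chosen and rejected completions drives the loss toward zero, which is the mechanism that aligns the model with the dataset's preferences.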

Key Training Details

The fine-tuning run used a learning rate of 5e-07, a total batch size of 64, and a single epoch. Final evaluation metrics show a loss of 0.5823 and a rewards accuracy of 0.7038, i.e. the model's implicit reward ranks the preferred response above the rejected one in roughly 70% of evaluation pairs. Training used Transformers 4.51.0, PyTorch 2.3.1+cu121, Datasets 2.21.0, and Tokenizers 0.21.4.
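In the usual DPO formulation, rewards accuracy is the fraction of preference pairs whose chosen completion receives a higher implicit reward than the rejected one. A minimal sketch of that metric, using made-up per-pair rewards (not values from this run):

```python
def rewards_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of pairs where the chosen response's implicit
    reward beats the rejected response's reward."""
    assert len(chosen_rewards) == len(rejected_rewards)
    wins = sum(c > r for c, r in zip(chosen_rewards, rejected_rewards))
    return wins / len(chosen_rewards)

# Hypothetical per-pair rewards: 3 of the 4 pairs are ranked correctly.
acc = rewards_accuracy([0.8, 0.1, 0.5, 0.9], [0.2, 0.4, 0.3, 0.6])
# → 0.75
```

Under this reading, the reported 0.7038 means about 7 in 10 held-out pairs were ranked in agreement with the human preference labels.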

Intended Use Cases

Given its fine-tuning on a helpfulness-focused preference dataset, this model is best suited to applications where helpful, aligned, preference-aware text generation matters. It is a reasonable candidate for tasks that demand nuanced responses prioritizing user assistance and ethical considerations, while retaining the general capabilities of the Mistral-7B architecture.
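Since the Anthropic/hh-rlhf dataset formats conversations as alternating "\n\nHuman:" / "\n\nAssistant:" turns, prompts at inference time will likely work best in the same shape. A small formatting helper, assuming that convention (the function name is illustrative, not part of any released tooling):

```python
def format_hh_prompt(turns):
    """Render (role, text) turns in the hh-rlhf style and leave a
    trailing 'Assistant:' cue for the model to complete."""
    parts = []
    for role, text in turns:
        assert role in ("Human", "Assistant")
        parts.append(f"\n\n{role}: {text}")
    parts.append("\n\nAssistant:")
    return "".join(parts)

prompt = format_hh_prompt([("Human", "How do I boil an egg?")])
# prompt == "\n\nHuman: How do I boil an egg?\n\nAssistant:"
```

Matching the fine-tuning format this way typically gives more reliable completions than free-form prompting, though the exact template used during training is an assumption here.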