ahmadhehe/tinyllama-1.1b-dpo-hh-rlhf

Text Generation | Concurrency Cost: 1 | Model Size: 1.1B | Quant: BF16 | Ctx Length: 2k | Published: Jun 4, 2026 | Architecture: Transformer

ahmadhehe/tinyllama-1.1b-dpo-hh-rlhf is a 1.1-billion-parameter language model, based on TinyLlama-1.1B-Chat-v1.0, that has been fine-tuned with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset. The DPO step aligns the model to human preferences, with the goal of producing helpful and harmless responses. It targets conversational tasks where preference alignment matters, and it is reported to improve response quality over its SFT-tuned base model.

Model Overview

ahmadhehe/tinyllama-1.1b-dpo-hh-rlhf is a 1.1-billion-parameter language model developed by Ahmad Murtaza and Simra Sheikh. It is built on the TinyLlama-1.1B-Chat-v1.0 base model, which was initially instruction-tuned on the dolly-15k dataset. What sets this model apart is a subsequent alignment stage: Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset, intended to bring responses closer to human preferences.

Key Capabilities & Training

  • Preference Alignment: Applies DPO on the Anthropic/hh-rlhf human preference dataset, with the aim of improving response quality and safety.
  • Base Model: Starts from TinyLlama-1.1B-Chat-v1.0, which provides a strong foundation for chat-based interactions.
  • Training Configuration: Trained for 1 epoch with a DPO beta of 0.5 and a learning rate of 5e-05, building on an SFT-tuned base (SFT-T4).
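
To make the role of beta concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It is not the authors' training code. The function name `dpo_loss` and its argument names are hypothetical, and the inputs are assumed to be summed token log-probabilities of the chosen and rejected responses under the trained policy and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.5):
    """Standard DPO loss for one (chosen, rejected) pair.

    beta=0.5 mirrors the value listed in the training configuration;
    every other name here is illustrative.
    """
    # How much more (in log space) the policy likes each response
    # than the reference model does.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# A policy identical to the reference gives a margin of 0, so the loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
```

A larger beta scales the margin more steeply, so the same shift in log-probabilities moves the loss further; the loss falls when the policy raises the chosen response relative to the rejected one.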

Performance Highlights

Evaluation on a 10-prompt test set compares the DPO model with the Base and Best SFT models:

  • BLEU-4 Score: 4.2200, roughly 2x the Base model (2.1400) and 1.7x the Best SFT model (2.4200).
  • BERTScore F1: 86.9600, slightly below the Best SFT model (87.1100).
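
The card does not say which BLEU implementation or smoothing was used, so the following is only a sketch of what a sentence-level BLEU-4 on a 0-100 scale can look like. The function names `ngrams` and `bleu4`, the whitespace tokenization, and the add-one smoothing for n > 1 are all assumptions made for illustration; scores from this sketch will not necessarily match the figures above.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, reference):
    """Sentence-level BLEU-4 (0-100), one reference, add-one smoothing for n > 1."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, 5):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped overlap: a candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if n > 1:  # assumed smoothing so short sentences do not collapse to 0
            overlap, total = overlap + 1, total + 1
        if overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100 * bp * math.exp(sum(log_precisions) / 4)

print(bleu4("the cat sat on a mat", "the cat sat on the mat"))
```

BLEU rewards exact n-gram overlap with a reference, while BERTScore compares contextual embeddings, which is one reason the two metrics can move differently for the same model.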

Ideal Use Cases

This model is particularly well-suited for applications requiring:

  • Chatbots and Conversational AI: Where generating helpful, harmless, and preference-aligned responses is critical.
  • Preference-Aligned Generation: For tasks where output quality benefits from fine-tuning on human feedback data.
  • Resource-Constrained Environments: Its 1.1 billion parameters make it a lightweight option for deployment.
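
As a rough guide to what "lightweight" means here, the weight footprint can be estimated from the listed model size (1.1B) and quantization (BF16, 2 bytes per parameter). This is a back-of-the-envelope sketch: the helper name `weight_memory_gib` is hypothetical, 1.1e9 is the nominal parameter count, and activations, KV cache, and framework overhead are not included.

```python
def weight_memory_gib(num_params, bytes_per_param=2):
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3

# Nominal 1.1B parameters stored in BF16 (2 bytes each).
print(f"{weight_memory_gib(1.1e9):.2f} GiB")  # prints "2.05 GiB"
```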