AIPlans/tinyllama-1.1b-dpo-pku-saferlhf

Hugging Face
Text Generation | Concurrency Cost: 1 | Model Size: 1.1B | Quant: BF16 | Ctx Length: 2k | Published: May 11, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights | Warm

AIPlans/tinyllama-1.1b-dpo-pku-saferlhf is a 1.1 billion parameter language model fine-tuned from TinyLlama/TinyLlama-1.1B-Chat-v1.0 using Direct Preference Optimization (DPO). The pku-saferlhf suffix in its name suggests the preference data came from the PKU-SafeRLHF safety dataset, although the training data is not stated explicitly here. The model is designed for general conversational tasks within its compact size and 2048-token context window.


Overview

AIPlans/tinyllama-1.1b-dpo-pku-saferlhf is a fine-tuned variant of TinyLlama/TinyLlama-1.1B-Chat-v1.0, a 1.1 billion parameter model already tuned for chat-style interaction. Fine-tuning used Direct Preference Optimization (DPO), which trains directly on pairs of preferred (chosen) and dispreferred (rejected) responses, so no separate reward model has to be fitted.
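To make the DPO objective concrete, the following is a minimal sketch of the per-example loss in its standard sigmoid form. It is not the training code used for this model: the function name dpo_loss, the log-probability inputs, and beta=0.1 are illustrative assumptions, since the beta actually used is not stated here.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example sigmoid DPO loss from summed sequence log-probabilities.

    beta=0.1 is an assumed value, not one reported for this model.
    """
    # Implicit rewards: beta-scaled log-ratio of the policy to the frozen reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        loss = math.log1p(math.exp(-margin))
    else:
        loss = -margin + math.log1p(math.exp(margin))
    return loss, chosen_reward, rejected_reward

# Hypothetical log-probabilities: the policy has moved toward the chosen
# response and away from the rejected one, relative to the reference.
loss, chosen_reward, rejected_reward = dpo_loss(-8.0, -12.0, -10.0, -10.0)
print(loss, chosen_reward, rejected_reward)
```

The two implicit rewards correspond to the Rewards/chosen and Rewards/rejected metrics that DPO training runs typically log. When the policy matches the reference, both rewards are zero and the loss equals ln 2.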

Training Details

The model was trained for one epoch with a learning rate of 5e-07 and a total batch size of 16 (a train_batch_size of 4 multiplied by gradient_accumulation_steps of 4). The optimizer was Adam with standard betas and epsilon, combined with a cosine learning rate scheduler and a warmup ratio of 0.1. Evaluation metrics show a final loss of 0.6742, with Rewards/chosen at 0.0508 and Rewards/rejected at 0.0098, a reward margin of about 0.041 in favor of chosen responses. For the standard sigmoid DPO loss, a model with no preference scores ln 2 ≈ 0.693, so this loss indicates a small but positive shift toward the chosen responses.
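The reported hyperparameters and metrics can be checked with a few lines of arithmetic. The sketch below assumes a single training device, a linear warmup followed by cosine decay to zero (a common shape for "cosine with warmup", not confirmed for this run), and a hypothetical total of 1000 optimizer steps, since the real step count is not given.

```python
import math

# Values reported in the training details.
PEAK_LR = 5e-7
WARMUP_RATIO = 0.1
TRAIN_BATCH_SIZE = 4
GRAD_ACCUM_STEPS = 4

def effective_batch_size(per_device, accum_steps, num_devices=1):
    # num_devices=1 is an assumption; 4 * 4 * 1 matches the reported total of 16.
    return per_device * accum_steps * num_devices

def lr_at(step, total_steps, peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Reward margin from the reported evaluation metrics.
reward_margin = 0.0508 - 0.0098

# Sigmoid DPO loss evaluated at the mean margin, for comparison with ln 2.
loss_at_mean_margin = math.log1p(math.exp(-reward_margin))

TOTAL_STEPS = 1000  # hypothetical; the actual number of steps is not reported
print(effective_batch_size(TRAIN_BATCH_SIZE, GRAD_ACCUM_STEPS))
print(lr_at(0, TOTAL_STEPS), lr_at(100, TOTAL_STEPS), lr_at(TOTAL_STEPS, TOTAL_STEPS))
print(reward_margin, loss_at_mean_margin, math.log(2))
```

The loss at the mean margin comes out slightly below the reported 0.6742. That is consistent with the reported figure being an average of per-example losses, which for a convex loss is at least the loss evaluated at the average margin.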

Potential Use Cases

Given its small size and DPO fine-tuning, this model suits resource-constrained environments that need a preference-aligned conversational agent. At 1.1B parameters in BF16 (2 bytes per parameter), the weights take roughly 2.2 GB, which makes deployment feasible on edge devices or in applications that need low latency and a small memory footprint. It is aimed at general chat and instruction-following tasks where safety alignment is a consideration.
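As a rough guide to the memory footprint mentioned above, the sketch below estimates weight and key-value cache sizes. The parameter count, BF16 precision, and 2048-token context come from this page; the layer count, number of key-value heads, and head dimension are assumptions based on the TinyLlama base architecture and should be checked against the model's config.json before relying on them.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

def weight_bytes(num_params, bytes_per_param=2):
    # BF16 stores each parameter in 2 bytes.
    return num_params * bytes_per_param

def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2):
    # Factor of 2 covers one key and one value vector per token, layer, and head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens

weights = weight_bytes(1_100_000_000)

# Assumed architecture values (not stated on this page): 22 layers,
# 4 key-value heads, head dimension 64.
cache = kv_cache_bytes(2048, num_layers=22, num_kv_heads=4, head_dim=64)

print(f"weights:  {weights / GIB:.2f} GiB")
print(f"KV cache: {cache / MIB:.1f} MiB for a full 2048-token context")
```

These figures exclude activations, framework overhead, and any batching, so actual usage will be higher than the sum of the two.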