manu02/gemma-3-1b-it-4bit-lora-dpo-aligned

Public · 1B parameters · BF16 · 32768 context length
Updated Feb 13, 2026
License: apache-2.0
Overview

This model, gemma-3-1b-it-4bit-lora-dpo-aligned, is a fine-tuned version of Google's gemma-3-1b-it base model. It has undergone Direct Preference Optimization (DPO) using the HuggingFaceH4/ultrafeedback_binarized dataset to enhance its alignment with human preferences for text generation tasks. The model incorporates 4-bit NF4 quantization and LoRA (Low-Rank Adaptation) for efficient fine-tuning and deployment.
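
To make the 4-bit NF4 quantization concrete, here is a simplified pure-Python sketch of blockwise absmax quantization: each block of weights is scaled by its absolute maximum, and each value snaps to the nearest of 16 fixed levels. This is an illustration, not the bitsandbytes implementation; the level values below are rounded NF4 code values from the QLoRA paper.

```python
# Simplified sketch of blockwise NF4-style quantization (illustrative only,
# not the bitsandbytes kernel). Levels are rounded NF4 code values.
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]

def quantize_block(block):
    """Return (4-bit indices, absmax scale) for one block of float weights."""
    absmax = max(abs(w) for w in block) or 1.0
    idx = [min(range(16), key=lambda i: abs(w / absmax - NF4_LEVELS[i]))
           for w in block]
    return idx, absmax

def dequantize_block(idx, absmax):
    """Reconstruct approximate weights from indices and the block scale."""
    return [NF4_LEVELS[i] * absmax for i in idx]

weights = [0.12, -0.53, 0.98, -0.07, 0.31, 0.0]
idx, scale = quantize_block(weights)
restored = dequantize_block(idx, scale)
```

Storing only a 4-bit index per weight plus one scale per block is what lets a quantized model fit in a fraction of the memory of its BF16 form.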

Key Capabilities

  • Improved Alignment: Fine-tuned with DPO to generate responses that are better aligned with human preferences.
  • Efficient Fine-tuning: Utilizes LoRA with a rank of 16 and alpha of 32, targeting key attention and feed-forward modules (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj).
  • Memory Optimized: Training incorporated gradient checkpointing, an 8-bit AdamW optimizer, and pre-computed reference log probabilities to reduce memory footprint.
  • Text Generation: Primarily intended for various text generation tasks, inheriting the capabilities of the base Gemma 3 model.
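
The LoRA update above can be sketched in a few lines: the frozen projection W is left untouched, and a low-rank correction B(Ax), scaled by alpha/r, is added to its output. The dimensions below are toy values for illustration (rank 1 rather than the model's rank 16); with r=16 and alpha=32, the scaling factor alpha/r is 2.0.

```python
# Minimal LoRA forward-pass sketch: y = W x + (alpha / r) * B (A x).
# Toy dimensions for illustration; this model uses r=16, alpha=32
# (scaling factor alpha/r = 2.0) on the listed projection modules.
def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, r, alpha):
    base = matvec(W, x)               # frozen base projection
    delta = matvec(B, matvec(A, x))   # low-rank update B(Ax)
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

# 3x3 frozen weight, rank-1 adapters (A: 1x3, B: 3x1) for illustration.
W = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
A = [[1, 1, 1]]
B = [[0.5], [0.0], [-0.5]]
y = lora_forward(W, A, B, [1.0, 2.0, 3.0], r=1, alpha=2)  # [7.0, 2.0, -3.0]
```

Only A and B are trained, so the number of trainable parameters stays small even when the frozen base model is large.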

Training Details

The model was trained for 1 epoch on 1,000 samples from the ultrafeedback_binarized dataset, with a learning rate of 5e-5 and a DPO beta of 0.1. Training used a maximum sequence length of 1024 tokens, while the base model itself supports a 32768-token context.
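
The DPO objective used in this kind of training can be sketched as follows, with the beta of 0.1 mentioned above. The log-probabilities here are made-up numbers purely for illustration, not actual model outputs.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    Each argument is the total log-probability of a response under the
    policy or the frozen reference model.
    """
    margin = ((policy_chosen_lp - ref_chosen_lp)
              - (policy_rejected_lp - ref_rejected_lp))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy identical to the reference: loss is -log(0.5), about 0.693.
start = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy now favors the chosen response relative to the reference:
# the loss drops below the starting value.
better = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

Because only the reference model's log-probabilities appear in the loss (not its gradients), they can be pre-computed once, which is the memory optimization mentioned under Key Capabilities.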

Intended Use

This model is suitable for applications requiring text generation where human preference alignment is crucial. It aims to produce more desirable outputs while preserving the core capabilities of the Gemma 3 base model.

Limitations

Users should be aware that the model inherits limitations from the original Gemma 3 base model. While DPO improves alignment, it may not cover all edge cases or the full diversity of human preferences. Performance can also vary depending on the hardware environment.