gshasiri/SmolLM3-DPO-Second-Round

Text Generation · Model Size: 1B · Quantization: BF16 · Context Length: 32k · Published: Nov 27, 2025 · Architecture: Transformer

gshasiri/SmolLM3-DPO-Second-Round is a 1-billion-parameter language model fine-tuned by gshasiri using Direct Preference Optimization (DPO). It is a DPO-tuned version of gshasiri/SmolLM3-SFT-Second-Round, trained to align its outputs with human preferences. With a context length of 32768 tokens, it suits general text generation tasks where preference alignment is beneficial.


Model Overview

gshasiri/SmolLM3-DPO-Second-Round is a 1-billion-parameter language model developed by gshasiri. It is a fine-tuned iteration of the gshasiri/SmolLM3-SFT-Second-Round model, further trained with Direct Preference Optimization (DPO). This training stage aligns the model's outputs more closely with human preferences, making its responses more likely to be helpful and preferred by users.

Key Training Details

  • Base Model: Fine-tuned from gshasiri/SmolLM3-SFT-Second-Round.
  • Optimization Method: Direct Preference Optimization (DPO), a technique introduced in the paper "Direct Preference Optimization: Your Language Model Is Secretly a Reward Model" (Rafailov et al., 2023, arXiv:2305.18290).
  • Framework: Trained using Hugging Face's TRL (Transformer Reinforcement Learning) library; a minimal training sketch follows this list.
  • Context Length: Supports a context window of 32768 tokens.
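
DPO trains directly on preference pairs, pushing the policy toward chosen responses and away from rejected ones while keeping it close to the SFT reference model. The sketch below shows what such a run looks like with TRL's `DPOTrainer`; the dataset, hyperparameters, and output path are illustrative placeholders, not the settings actually used for this model.

```python
# Illustrative DPO run with TRL; dataset, hyperparameters, and output
# path are placeholders, not the settings used for this model.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "gshasiri/SmolLM3-SFT-Second-Round"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# DPO expects preference pairs: each row carries a "prompt" plus a
# "chosen" and a "rejected" response.
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

args = DPOConfig(
    output_dir="SmolLM3-DPO-Second-Round",
    beta=0.1,  # strength of the KL penalty keeping the policy near the reference
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```

Since no explicit `ref_model` is supplied, `DPOTrainer` keeps a frozen copy of the initial policy as the reference model for the DPO loss.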

Potential Use Cases

This model is well-suited for applications requiring:

  • General Text Generation: Producing coherent and contextually relevant text.
  • Preference-Aligned Responses: Generating outputs that are more aligned with human preferences due to DPO training.
  • Interactive AI Systems: Settings where the quality and desirability of generated responses are important.
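
As a concrete starting point, here is a minimal generation sketch with `transformers`. It assumes the model loads through the standard `AutoModelForCausalLM` path and that the tokenizer ships a chat template, as SmolLM3-family checkpoints typically do; the prompt and sampling settings are arbitrary.

```python
# Minimal inference sketch; prompt and sampling settings are arbitrary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gshasiri/SmolLM3-DPO-Second-Round"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": "Explain Direct Preference Optimization in two sentences."}
]
# Assumes the tokenizer provides a chat template for conversational prompts.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```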