CEIA-RL/energy-exp1-dpo-offline

TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kTool Calling:SupportedPublished:May 30, 2026Architecture:Transformer Cold

The CEIA-RL/energy-exp1-dpo-offline model is a 4 billion parameter language model, fine-tuned from the CEIA-RL/Energy base model using Direct Preference Optimization (DPO). With a context length of 32768 tokens, this model is designed for text generation tasks, leveraging DPO to align its outputs with human preferences. It is particularly suited for generating nuanced and preferred responses in conversational or question-answering scenarios.

Loading preview...

Overview

CEIA-RL/energy-exp1-dpo-offline is a 4 billion parameter language model, fine-tuned from the CEIA-RL/Energy base model. This model leverages the Direct Preference Optimization (DPO) method, as introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (Rafailov et al., 2023). The training was conducted using the TRL (Transformers Reinforcement Learning) framework.

Key Capabilities

  • Preference-aligned Text Generation: Optimized through DPO to generate responses that align with human preferences, making it suitable for tasks requiring nuanced or preferred outputs.
  • Instruction Following: Capable of generating text based on user prompts, as demonstrated by the quick start example.
  • Large Context Window: Supports a context length of 32768 tokens, allowing for processing and generating longer sequences of text.

Training Details

The model was trained using the DPO method, which directly optimizes a language model to align with human preferences without requiring an explicit reward model. The training utilized specific versions of key frameworks:

  • TRL: 0.29.0
  • Transformers: 4.57.6
  • Pytorch: 2.10.0
  • Datasets: 4.7.0
  • Tokenizers: 0.22.2

Good for

  • Generating preferred responses in interactive AI applications.
  • Tasks where output quality and alignment with human judgment are critical.
  • Exploring DPO-based fine-tuning for language models.