Model Overview
Lambent/Gilded-Arsenic-12B is a 12-billion-parameter language model developed by Lambent, fine-tuned from the Lambent/arsenic-nemo-unleashed-12B base model. It leverages Direct Preference Optimization (DPO), a training method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model," to align its outputs more closely with human preferences.
Key Capabilities
- Preference Alignment: Trained with DPO on a diverse set of datasets, including gutenberg-moderne-dpo, Purpura-DPO, Arkhaios-DPO, Math-Step-DPO-10K, rp-teacher-synth-dpo, gutenberg2-dpo, and darkside-dpo.
- Enhanced Conversational Quality: The DPO fine-tuning process aims to improve the model's ability to generate responses that are preferred by humans, making it suitable for interactive applications.
- Broad Application Potential: The varied training data suggests applicability across different domains, from general text generation to more specialized tasks like mathematical reasoning and roleplay scenarios.
Training Details
The model was trained using the TRL (Transformer Reinforcement Learning) framework, specifically version 0.12.1, with Transformers 4.47.0 and PyTorch 2.3.1+cu121. The DPO method helps the model learn directly from preference data without the need for a separate reward model.
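To illustrate why no separate reward model is needed, the per-example DPO loss can be computed directly from policy and reference log-probabilities of a chosen and a rejected response. The sketch below is illustrative only (plain Python, hypothetical inputs) and is not taken from this model's actual training code; in practice TRL's `DPOTrainer` handles this internally.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss from summed log-probabilities (illustrative sketch)."""
    # Log-ratios of the policy against the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # DPO objective: -log sigmoid(beta * (chosen_logratio - rejected_logratio)).
    logits = beta * (chosen_logratio - rejected_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))


# With identical policy and reference, the loss is -log(0.5) ~= 0.693.
print(dpo_loss(0.0, 0.0, 0.0, 0.0))
# When the policy favors the chosen response more than the reference does,
# the loss drops below that baseline.
print(dpo_loss(-1.0, -2.0, -1.5, -1.5))
```

The loss rewards the policy for increasing the likelihood gap between preferred and dispreferred responses relative to the reference model, which is how preference data shapes the model without an explicit reward model.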