Lambent/Gilded-Arsenic-12B
Lambent/Gilded-Arsenic-12B is a 12-billion-parameter language model fine-tuned from Lambent/arsenic-nemo-unleashed-12B. It was trained using Direct Preference Optimization (DPO) on a diverse dataset including DPO-formatted text, mathematical reasoning steps, and roleplay synthesis. This model is designed for enhanced conversational quality and preference alignment across various text generation tasks, and supports a 32,768-token context length.
Model Overview
Lambent/Gilded-Arsenic-12B is a 12-billion-parameter language model developed by Lambent, fine-tuned from the Lambent/arsenic-nemo-unleashed-12B base model. This model leverages Direct Preference Optimization (DPO), a training method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model," to align its outputs more closely with human preferences.
Key Capabilities
- Preference Alignment: Trained with DPO on a diverse set of datasets, including gutenberg-moderne-dpo, Purpura-DPO, Arkhaios-DPO, Math-Step-DPO-10K, rp-teacher-synth-dpo, gutenberg2-dpo, and darkside-dpo.
- Enhanced Conversational Quality: The DPO fine-tuning process aims to improve the model's ability to generate responses that humans prefer, making it suitable for interactive applications.
- Broad Application Potential: The varied training data suggests applicability across different domains, from general text generation to more specialized tasks like mathematical reasoning and roleplay scenarios.
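For general text generation, the model can be loaded through the standard Hugging Face Transformers API. The sketch below is illustrative, not an official example from the model card; the prompt and generation parameters are placeholders, and imports are deferred so the file can be read without Transformers installed (loading a 12B model requires substantial GPU memory).

```python
MODEL_ID = "Lambent/Gilded-Arsenic-12B"

def generate(prompt: str, max_new_tokens: int = 256) -> str:
    """Minimal generation sketch using the standard Transformers API."""
    # Deferred imports: this is a usage sketch; actually running it
    # downloads the full 12B checkpoint.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

if __name__ == "__main__":
    print(generate("Write a short scene between two rival alchemists."))
```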
Training Details
The model was trained using the TRL (Transformer Reinforcement Learning) framework, specifically version 0.12.1, with Transformers 4.47.0 and PyTorch 2.3.1+cu121. The DPO method helps the model learn directly from preference data without the need for a separate reward model.
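To make the "no separate reward model" point concrete, the per-pair DPO objective can be written out directly: the loss is the negative log-sigmoid of a scaled margin between how much the policy prefers the chosen response over the rejected one, relative to the frozen reference model. A minimal pure-Python sketch (toy log-probabilities, not taken from the actual training run):

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair: -log sigmoid(beta * margin).

    Each argument is the summed token log-probability of a full response
    under the policy or the frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(x)) computed as log(1 + exp(-x)) for stability.
    return math.log1p(math.exp(-beta * margin))

# Toy example: the policy already leans toward the chosen response,
# so the margin is positive and the loss falls below log(2).
loss = dpo_loss(-12.0, -15.0, -13.0, -14.0)
```

Minimizing this loss pushes the margin up, so preference data alone shapes the policy; the reward model of classic RLHF is only implicit in the log-probability ratios.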