Lambent/Arsenic-Shahrazad-12B-v4.3.1
Arsenic-Shahrazad-12B-v4.3.1 by Lambent is a 12-billion-parameter language model created by merging five models with the Karcher Mean method. Its DPO (Direct Preference Optimization) pass incorporated low-scoring RLVR turn samples that were rewritten using judge feedback. Because Gemma 4 31B was used to rewrite those samples, the model shows some Gemma influence. It is aimed at use cases that benefit from preference-optimized responses.
Overview
Arsenic-Shahrazad-12B-v4.3.1 is a 12-billion-parameter language model developed by Lambent. It was created with the Karcher Mean merge method, which combined five models. The model went through a Direct Preference Optimization (DPO) pass that was run with several random seeds; the results of those runs were then averaged.
Key Characteristics
- Merge Method: Utilizes the Karcher Mean for combining multiple models.
- DPO Pass: Enhanced through Direct Preference Optimization, incorporating data from rewritten low-scoring RLVR (Reinforcement Learning with Verifiable Rewards) turn samples.
- Data Influence: The rewriting of turn samples was performed using Gemma 4 31B, indicating a degree of Gemma influence in the model's training data.
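To make the merge method above more concrete, here is a minimal sketch of a Karcher mean (the Riemannian analogue of an average) computed for points on the unit sphere. It is an illustration only, not mergekit's implementation: the function names `log_map`, `exp_map`, and `karcher_mean` are hypothetical, and treating each input as a normalized vector on a sphere is an assumption made for the example.

```python
import numpy as np


def log_map(mu, x):
    """Tangent vector at mu pointing toward x, with length equal to the geodesic distance."""
    cos_theta = np.clip(np.dot(mu, x), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    direction = x - cos_theta * mu  # component of x orthogonal to mu
    norm = np.linalg.norm(direction)
    if theta < 1e-12 or norm < 1e-12:
        # x coincides with mu (or is antipodal, where the direction is undefined)
        return np.zeros_like(mu)
    return theta * direction / norm


def exp_map(mu, v):
    """Walk from mu along the geodesic given by tangent vector v."""
    norm = np.linalg.norm(v)
    if norm < 1e-12:
        return mu
    return np.cos(norm) * mu + np.sin(norm) * v / norm


def karcher_mean(points, max_iter=100, tol=1e-10):
    """Iteratively find the point minimizing the sum of squared geodesic distances."""
    points = [np.asarray(p, dtype=float) / np.linalg.norm(p) for p in points]
    # Start from the normalized Euclidean mean (assumes it is non-zero).
    mu = np.mean(points, axis=0)
    mu = mu / np.linalg.norm(mu)
    for _ in range(max_iter):
        # Average the inputs in the tangent space at the current estimate.
        tangent = np.mean([log_map(mu, p) for p in points], axis=0)
        if np.linalg.norm(tangent) < tol:
            break
        mu = exp_map(mu, tangent)
    return mu


# Hypothetical stand-ins for three normalized weight vectors.
angles = [0.0, 0.2, 1.0]
vectors = [np.array([np.cos(a), np.sin(a)]) for a in angles]
merged = karcher_mean(vectors)
```

Unlike a plain arithmetic average, the result stays on the sphere, so the merged vector keeps unit norm instead of shrinking when the inputs disagree.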
Training Details
The model's DPO pass included data derived from rewriting low-scoring RLVR turn samples, where original rejected samples were replaced with rewritten versions driven by judge feedback. The merge combined five models, each a 'baked_v43' output produced with a different seed, as detailed in the mergekit configuration.
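For readers unfamiliar with the DPO pass described above, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not the training code used for this model: the function name `dpo_loss`, the `beta` value, and the example log-probabilities are all assumptions chosen for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # How much more the policy prefers each response than the frozen reference does.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical values: the policy matches the reference, so the margin is zero.
neutral = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Hypothetical values: the policy has shifted probability toward the chosen response.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

The loss falls as the policy raises the likelihood of the chosen response relative to the rejected one, measured against the reference model, which is why the quality of each pair matters.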