Zachary1150/merge_cosfmt_MRL4096_ROLLOUT4_LR5e-7_w0.5_ties_density0.2
Zachary1150/merge_cosfmt_MRL4096_ROLLOUT4_LR5e-7_w0.5_ties_density0.2 is a 1.5 billion parameter language model merged with the TIES method, using deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B as the base. It combines two fine-tuned 'actor' checkpoints, 'cos_MRL4096_ROLLOUT4_LR5e-7' and 'accfmt_MRL4096_ROLLOUT4_LR5e-7', each contributing with a weight of 0.5 and a density of 0.2, and supports a context length of 131072 tokens. It is intended for tasks that benefit from the combined strengths of its constituent models.
Model Overview
This model, merge_cosfmt_MRL4096_ROLLOUT4_LR5e-7_w0.5_ties_density0.2, is a 1.5 billion parameter language model created by Zachary1150. It was produced with the TIES (TrIm, Elect Sign & Merge) merge method, which resolves parameter interference when combining multiple fine-tuned models into a single, more capable model. The base model for this merge is deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, placing its foundation in the Qwen architecture.
Merge Details
The merge process involved two specific fine-tuned models, both identified as 'actor' checkpoints from distinct training runs:
- /local/scratch/zli2255/workspace/MergeExpert/checkpoints/baselines_openrs/cos_MRL4096_ROLLOUT4_LR5e-7/global_step_54/actor/huggingface
- /local/scratch/zli2255/workspace/MergeExpert/checkpoints/baselines_openrs/accfmt_MRL4096_ROLLOUT4_LR5e-7/global_step_54/actor/huggingface
Each of these models contributed with a weight of 0.5 and a density of 0.2 during the TIES merge, as configured in the provided YAML; in TIES, the density controls what fraction of each model's parameter deltas is retained before sign election, so a density of 0.2 keeps only the largest-magnitude 20%. This strategy aims to consolidate the strengths of the individual models while limiting interference between them. The merged model supports a substantial context length of 131072 tokens.
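The original YAML configuration is not reproduced in this card; a mergekit-style sketch consistent with the stated parameters (base model, TIES method, weight 0.5, density 0.2 per model) would look roughly like the following. This is a reconstruction for illustration, not the exact file used:

```yaml
# Hypothetical mergekit config matching the parameters described above.
models:
  - model: /local/scratch/zli2255/workspace/MergeExpert/checkpoints/baselines_openrs/cos_MRL4096_ROLLOUT4_LR5e-7/global_step_54/actor/huggingface
    parameters:
      weight: 0.5    # equal contribution from each actor
      density: 0.2   # keep top 20% of parameter deltas before sign election
  - model: /local/scratch/zli2255/workspace/MergeExpert/checkpoints/baselines_openrs/accfmt_MRL4096_ROLLOUT4_LR5e-7/global_step_54/actor/huggingface
    parameters:
      weight: 0.5
      density: 0.2
merge_method: ties
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
dtype: bfloat16
```

Field names follow mergekit's published schema; the dtype is an assumption, as it is not stated in this card.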
Potential Use Cases
Given its origin as a merge of fine-tuned 'actor' models, this model is likely suitable for tasks where the combined capabilities of its constituent models are beneficial. Developers looking for a compact yet capable model derived from the DeepSeek-R1-Distill-Qwen-1.5B base, enhanced through a TIES merge, may find this model useful for various language generation and understanding tasks.
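For completeness, a minimal usage sketch with the Hugging Face `transformers` library is shown below. It assumes the standard `AutoModelForCausalLM`/`AutoTokenizer` loading path and that the checkpoint ships a chat template (inherited from its DeepSeek-R1-Distill-Qwen-1.5B base); running it downloads the full model weights:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Zachary1150/merge_cosfmt_MRL4096_ROLLOUT4_LR5e-7_w0.5_ties_density0.2"

# Load tokenizer and model; device_map="auto" requires the `accelerate` package.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Build a chat-formatted prompt and generate a completion.
messages = [{"role": "user", "content": "Briefly explain model merging."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)

# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Generation parameters (sampling temperature, max tokens) are illustrative defaults, not recommendations from the model's authors.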