Zachary1150/merge_cosfmt_MRL4096_ROLLOUT4_LR2e-6_w0.1_linear

Text generation · Concurrency cost: 1 · Model size: 1.5B · Quant: BF16 · Ctx length: 32k · Published: Dec 24, 2025 · Architecture: Transformer

Zachary1150/merge_cosfmt_MRL4096_ROLLOUT4_LR2e-6_w0.1_linear is a 1.5 billion parameter language model created by Zachary1150 through a linear merge of two pre-trained models using the mergekit framework. Merging averages the parameters of the constituent models rather than altering the architecture, so the result blends their capabilities in a single checkpoint. With a context length of 131072 tokens, it targets applications that require extensive contextual understanding. Its primary differentiator is this creation method: a weighted combination of two specialized checkpoints.


Overview

This model, merge_cosfmt_MRL4096_ROLLOUT4_LR2e-6_w0.1_linear, is a 1.5 billion parameter language model developed by Zachary1150. It was constructed with the mergekit tool using the Linear merge method, which combines two pre-trained language models by taking a weighted average of their parameters. In this merge, one base model contributes 10% and the other 90% of the average, and the merged weights are stored in bfloat16.
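
mergekit model cards normally embed the YAML configuration that produced the merge; it is not reproduced on this page, so the block below is a hypothetical reconstruction from the details stated in this card (method, weights, normalization, dtype, and the two checkpoint paths listed under Key Characteristics below):

```yaml
# Hypothetical reconstruction of the merge config; not the author's actual file.
merge_method: linear
models:
  - model: /local/scratch/zli2255/workspace/MergeExpert/checkpoints/baselines_openrs/cos_MRL4096_ROLLOUT4_LR2e-6/global_step_40/actor/huggingface
    parameters:
      weight: 0.1
  - model: /local/scratch/zli2255/workspace/MergeExpert/checkpoints/baselines_openrs/accfmt_MRL4096_ROLLOUT4_LR2e-6/global_step_30/actor/huggingface
    parameters:
      weight: 0.9
parameters:
  normalize: true
dtype: bfloat16
```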

Key Characteristics

  • Merge Method: Uses the Linear merge technique, a weighted average of corresponding parameter tensors (the method mergekit's documentation attributes to the Model Soups paper, arXiv:2203.05482); a sketch of the arithmetic follows this list.
  • Base Models: Merged from two locally stored checkpoints: /local/scratch/zli2255/workspace/MergeExpert/checkpoints/baselines_openrs/cos_MRL4096_ROLLOUT4_LR2e-6/global_step_40/actor/huggingface and /local/scratch/zli2255/workspace/MergeExpert/checkpoints/baselines_openrs/accfmt_MRL4096_ROLLOUT4_LR2e-6/global_step_30/actor/huggingface.
  • Parameter Weighting: The merge applied a weight of 0.1 to the first checkpoint and 0.9 to the second, with normalization enabled so the weights sum to 1 (see the configuration sketch above).
  • Context Length: Supports a context window of 131072 tokens, allowing very long inputs to be processed in a single pass.
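
For intuition, a normalized linear merge computes, for each parameter tensor, the weighted average (0.1 · θ_cos + 0.9 · θ_accfmt) / (0.1 + 0.9). The following is a minimal sketch of that arithmetic, not mergekit's actual implementation; the state dicts are assumed to come from the two checkpoints listed above:

```python
import torch

def linear_merge(state_dicts, weights, normalize=True):
    """Weighted average of corresponding parameter tensors (linear merge)."""
    if normalize:
        total = sum(weights)
        weights = [w / total for w in weights]
    merged = {}
    for name, ref in state_dicts[0].items():
        # Accumulate in float32 for numerical stability, then store in
        # bfloat16 to match the dtype this card reports.
        acc = torch.zeros_like(ref, dtype=torch.float32)
        for w, sd in zip(weights, state_dicts):
            acc += w * sd[name].to(torch.float32)
        merged[name] = acc.to(torch.bfloat16)
    return merged

# With this card's weights: 0.1 for the cos checkpoint, 0.9 for accfmt.
# merged = linear_merge([cos_state_dict, accfmt_state_dict], [0.1, 0.9])
```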

Potential Use Cases

Given its merged nature and large context window, this model is likely suitable for applications that benefit from a blend of capabilities from its constituent models and require extensive contextual understanding, such as:

  • Long-form content generation and summarization.
  • Advanced reasoning over large documents.
  • Tasks where combining specific strengths of different models is advantageous.
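
As a usage sketch (not from the model card itself): assuming the checkpoint follows the standard Hugging Face causal-LM layout, as 1.5B mergekit outputs typically do, it can be loaded and run with the transformers library. The prompt and generation settings below are illustrative only:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Zachary1150/merge_cosfmt_MRL4096_ROLLOUT4_LR2e-6_w0.1_linear"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the merge dtype reported above
    device_map="auto",           # requires the accelerate package
)

# Long-document summarization, one of the use cases listed above.
prompt = "Summarize the key points of the following report:\n\n" + open("report.txt").read()
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```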