shubhamrgandhi/qwen3-8b-full-sft-prm-opus-distill-32k-lr5e6-flattened

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 32k · Published: Apr 28, 2026 · License: other · Architecture: Transformer

The shubhamrgandhi/qwen3-8b-full-sft-prm-opus-distill-32k-lr5e6-flattened model is an 8-billion-parameter language model fine-tuned from Qwen/Qwen3-8B on the prm_sft_train dataset with a 32k context length. Training combined supervised fine-tuning (SFT) and Proximal Policy Optimization (PPO), making the model suitable for applications that require robust instruction following.


Model Overview

This model, qwen3-8b-full-sft-prm-opus-distill-32k-lr5e6-flattened, is an 8-billion-parameter language model derived from the Qwen/Qwen3-8B architecture. It was fine-tuned with a combination of supervised fine-tuning (SFT) and Proximal Policy Optimization (PPO) on the prm_sft_train dataset, and supports a context length of 32,768 tokens.

Training Details

The fine-tuning run used the following hyperparameters:

  • Learning Rate: 5e-06
  • Batch Size: 1 (train), 8 (eval)
  • Optimizer: ADAMW_TORCH_FUSED
  • Scheduler: Cosine with 0.1 warmup ratio
  • Epochs: 3.0

Training ran on 8 GPUs. This configuration targets improved instruction following and general language understanding across the model's full 32k context window.
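The hyperparameters above map directly onto Hugging Face `TrainingArguments`. A minimal sketch of the reported configuration; the output directory and `bf16` setting are illustrative assumptions, not details from the original run:

```python
from transformers import TrainingArguments

# Sketch of the reported fine-tuning configuration. output_dir is a
# placeholder; distribution across the 8 GPUs is handled by the launcher
# (e.g. torchrun or accelerate), not by these arguments.
args = TrainingArguments(
    output_dir="qwen3-8b-prm-sft",   # hypothetical path
    learning_rate=5e-6,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    optim="adamw_torch_fused",
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=3.0,
    bf16=True,                       # assumption: bf16 mixed precision
)
```

With a per-device train batch size of 1 across 8 GPUs, the effective global batch size is 8 (absent gradient accumulation, which the card does not report).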

Intended Use Cases

Given its fine-tuning methodology, this model is likely suitable for:

  • Instruction Following: Excelling in tasks where precise adherence to prompts is critical.
  • Long Context Applications: Handling and generating coherent text over extended inputs, up to 32k tokens.
  • General Language Tasks: Performing well in a variety of natural language processing applications due to its Qwen3-8B base and subsequent fine-tuning.
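For the instruction-following use case, a standard transformers inference sketch (assuming the checkpoint is hosted on the Hugging Face Hub under the model id shown, and that it uses the Qwen3 chat template inherited from its base model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "shubhamrgandhi/qwen3-8b-full-sft-prm-opus-distill-32k-lr5e6-flattened"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Format a single user turn with the model's chat template.
messages = [{"role": "user", "content": "Summarize the plot of Hamlet in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate and decode only the newly produced tokens.
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Loading an 8B checkpoint requires roughly 16 GB of accelerator memory in bf16 (less with the FP8 quantization noted in the listing), so `device_map="auto"` is used to let accelerate place the weights.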