W-61/qwen3-8b-base-cpo-ultrafeedback-4xh200-batch-128-20260422-131855

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 32k · Published: Apr 23, 2026 · Architecture: Transformer

W-61/qwen3-8b-base-cpo-ultrafeedback-4xh200-batch-128-20260422-131855 is an 8-billion-parameter language model fine-tuned from W-61/qwen3-8b-base-sft-ultrachat-4xh200-batch-128. It was trained with CPO (Contrastive Preference Optimization) on the HuggingFaceH4/ultrafeedback_binarized dataset, reaching a rewards accuracy of 0.5280. With a context length of 32768 tokens, it is suited to tasks requiring nuanced understanding of preferences and alignment with human feedback.


Overview

This model, qwen3-8b-base-cpo-ultrafeedback-4xh200-batch-128-20260422-131855, is an 8-billion-parameter language model developed by W-61. It is a fine-tuned variant of W-61/qwen3-8b-base-sft-ultrachat-4xh200-batch-128, trained with Contrastive Preference Optimization (CPO).
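For reference, assuming CPO here denotes Contrastive Preference Optimization (Xu et al., 2024), as implemented in trl's CPOTrainer, the objective combines a sigmoid preference loss with a negative log-likelihood term on the preferred response:

$$
\mathcal{L}_{\text{CPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\Big[\log \sigma\big(\beta \log \pi_\theta(y_w \mid x) - \beta \log \pi_\theta(y_l \mid x)\big) + \log \pi_\theta(y_w \mid x)\Big]
$$

where $x$ is the prompt, $y_w$ and $y_l$ are the preferred and rejected responses, $\beta$ is a temperature hyperparameter, and $\sigma$ is the logistic function. Unlike DPO, CPO needs no frozen reference model, which roughly halves memory use during training.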

Training Details

The model was fine-tuned on the HuggingFaceH4/ultrafeedback_binarized dataset. Key hyperparameters: a learning rate of 5e-07, an effective batch size of 128 (per-device train batch size of 4 with 8 gradient accumulation steps across 4 GPUs), and a cosine learning rate schedule with a 0.1 warmup ratio. Training ran for 1 epoch.
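A hypothetical reproduction sketch, assuming the model was trained with trl's CPOTrainer using the hyperparameters reported above (the actual training script is not published with the card); beta and sequence-length settings are left at trl defaults:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

# SFT checkpoint named on the card as the starting point.
base = "W-61/qwen3-8b-base-sft-ultrachat-4xh200-batch-128"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Preference pairs; train_prefs is this dataset's standard preference split.
train_dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

# Hyperparameters as reported on the card. With 4 GPUs (per the 4xh200 run name),
# 4 per-device x 8 accumulation steps x 4 devices = 128 effective batch size.
args = CPOConfig(
    output_dir="qwen3-8b-base-cpo-ultrafeedback",
    learning_rate=5e-7,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
)

trainer = CPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # `tokenizer=` in older trl releases
)
trainer.train()
```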

Evaluation Results

On the evaluation set, the model achieved a loss of 2.0046, a rewards accuracy of 0.5280, and a rewards margin of -0.1083. The accuracy is the fraction of evaluation pairs in which the chosen response receives a higher implicit reward than the rejected one (here only slightly above the 0.5 chance level), while the negative margin indicates that the mean chosen-minus-rejected reward gap is below zero, i.e. preference separation is weak.
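For reference, a sketch of how these two metrics are conventionally computed in preference-optimization trainers such as trl (the card does not publish its evaluation code); `chosen_rewards` and `rejected_rewards` are the per-example implicit rewards, i.e. the beta-scaled policy log-probabilities of each response:

```python
import torch

def reward_metrics(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor):
    # rewards accuracy: fraction of pairs where the chosen response
    # out-scores the rejected one (0.5 corresponds to chance).
    accuracy = (chosen_rewards > rejected_rewards).float().mean()
    # rewards margin: mean chosen-minus-rejected reward gap; a negative
    # value means rejected responses score higher on average.
    margin = (chosen_rewards - rejected_rewards).mean()
    return accuracy.item(), margin.item()
```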

Intended Use

Specific intended uses and limitations are not documented. However, CPO fine-tuning on a human-feedback dataset suggests suitability for tasks where aligning with human preferences and generating high-quality, preferred responses is critical. The 32768-token context length supports processing longer inputs. A minimal usage sketch follows, assuming the model loads through the standard transformers API (the card does not ship an official inference example):
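```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/qwen3-8b-base-cpo-ultrafeedback-4xh200-batch-128-20260422-131855"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Illustrative prompt; any text-generation input works.
prompt = "Explain the difference between supervised fine-tuning and preference optimization."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```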