jackf857/llama-3-8b-base-kto-ultrafeedback-4xh200-batch-128-20260427-194056

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Apr 28, 2026 · Architecture: Transformer

The jackf857/llama-3-8b-base-kto-ultrafeedback-4xh200-batch-128-20260427-194056 model is an 8-billion-parameter Llama 3 base model fine-tuned by jackf857 on the HuggingFaceH4/ultrafeedback_binarized dataset, building on a previously supervised fine-tuned Llama 3 variant. The model is optimized for preference alignment and shows improved reward scores on its evaluation set, making it suitable for tasks that require nuanced response generation.


Model Overview

This model, llama-3-8b-base-kto-ultrafeedback-4xh200-batch-128-20260427-194056, is an 8 billion parameter Llama 3 base model that has undergone further fine-tuning. It is based on the W-61/llama-3-8b-base-sft-ultrachat-8xh200 model, indicating an initial supervised fine-tuning phase.

Key Characteristics

  • Base Architecture: Llama 3 with 8 billion parameters.
  • Fine-tuning Dataset: Uses the HuggingFaceH4/ultrafeedback_binarized dataset, suggesting a focus on preference alignment via KTO (Kahneman-Tversky Optimization, as indicated by the model name) or a related preference-optimization technique.
  • Training Objective: The fine-tuning process aimed to optimize for reward signals, as evidenced by the evaluation metrics showing improved rewards for chosen responses and margins over rejected responses.

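KTO-style training consumes unpaired examples with binary desirable/undesirable labels rather than chosen/rejected pairs. A minimal sketch of converting one paired preference record into that format follows; the record layout (`prompt`, `chosen`, `rejected` fields) is an assumption for illustration, not a verified schema of the dataset.

```python
# Sketch: split one chosen/rejected preference pair (as in a binarized
# feedback dataset) into the two (prompt, completion, label) triples
# that KTO-style training consumes. Field names are assumptions.

def pair_to_kto_examples(record):
    """Turn a paired preference record into two binary-labeled examples."""
    prompt = record["prompt"]
    return [
        {"prompt": prompt, "completion": record["chosen"], "label": True},
        {"prompt": prompt, "completion": record["rejected"], "label": False},
    ]

record = {
    "prompt": "Explain KTO in one sentence.",
    "chosen": "KTO aligns a model using binary desirable/undesirable labels.",
    "rejected": "KTO is a kind of database.",
}
examples = pair_to_kto_examples(record)
```

Each pair yields one positively and one negatively labeled example, which is why binarized pair datasets are a convenient source for KTO data.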
Training Details

  • Learning Rate: 5e-07
  • Batch Size: A total_train_batch_size of 128, achieved with a per-device train_batch_size of 8 and gradient_accumulation_steps of 4 across 4 devices (8 × 4 × 4 = 128).
  • Optimizer: AdamW with standard betas and epsilon.
  • Scheduler: Cosine learning rate scheduler with a 0.1 warmup ratio.
  • Epochs: Trained for 1 epoch.
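The batch-size arithmetic and the cosine-with-warmup schedule above can be sketched in plain Python. The step counts in the schedule function are illustrative assumptions; only the batch sizes, peak learning rate, and warmup ratio come from the training details.

```python
import math

# Effective batch size from the training details:
# per-device batch x gradient accumulation x number of devices.
PER_DEVICE_BATCH = 8
GRAD_ACCUM = 4
NUM_DEVICES = 4
EFFECTIVE_BATCH = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES  # 128

PEAK_LR = 5e-7
WARMUP_RATIO = 0.1

def lr_at(step, total_steps):
    """Linear warmup to PEAK_LR over the first 10% of steps,
    then cosine decay toward zero (a common interpretation of
    'cosine scheduler with 0.1 warmup ratio')."""
    warmup_steps = int(total_steps * WARMUP_RATIO)
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For example, the learning rate peaks at exactly 5e-07 once warmup ends and decays smoothly to zero by the final step.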

Performance Insights

During evaluation, the model achieved a loss of 0.4319. Notably, it demonstrated a rewards/margins score of 0.8773, i.e. the average gap between the rewards assigned to chosen and rejected responses, indicating that it differentiates effectively between preferred and non-preferred responses. This suggests the model is well aligned with the human preferences captured in the ultrafeedback_binarized dataset.
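As a minimal sketch of how a rewards/margins metric is typically computed, the margin is the mean difference between the reward for the chosen response and the reward for the rejected response over an evaluation set. The reward values below are made up for illustration; only the 0.8773 figure comes from this model's evaluation.

```python
# Sketch: rewards/margins as the mean (chosen - rejected) reward gap.
# Reward values here are illustrative, not from the actual run.

def rewards_margin(chosen_rewards, rejected_rewards):
    """Mean margin between chosen and rejected rewards."""
    pairs = list(zip(chosen_rewards, rejected_rewards))
    return sum(c - r for c, r in pairs) / len(pairs)

margin = rewards_margin([0.9, 1.2, 0.4], [0.1, 0.3, -0.2])
```

A positive margin means the model's implicit reward ranks preferred responses above rejected ones on average; larger margins indicate a sharper separation.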

Intended Use Cases

This model is particularly well-suited for applications where generating responses that align with human preferences is crucial. This includes tasks such as:

  • Chatbots and Conversational AI: Generating more helpful, harmless, and honest responses.
  • Content Generation: Producing text that is preferred by users based on quality and relevance.
  • Instruction Following: Improving adherence to complex instructions by learning from preference data.