longtermrisk/Qwen3-8B-reward-hacks-last-third

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:8BQuant:FP8Ctx Length:32kPublished:May 19, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

The longtermrisk/Qwen3-8B-reward-hacks-last-third is an 8 billion parameter Qwen3 model, fine-tuned by longtermrisk. This model was trained using Unsloth and Huggingface's TRL library, achieving 2x faster training speeds. With a 32768 token context length, it is optimized for efficient fine-tuning and deployment.

Loading preview...

Model Overview

The longtermrisk/Qwen3-8B-reward-hacks-last-third is an 8 billion parameter language model developed by longtermrisk. It is fine-tuned from the unsloth/Qwen3-8B base model, leveraging the Unsloth library in conjunction with Huggingface's TRL library.

Key Characteristics

  • Base Model: Qwen3-8B architecture.
  • Parameter Count: 8 billion parameters.
  • Context Length: Supports a context window of 32768 tokens.
  • Training Efficiency: Noteworthy for being trained 2x faster due to the integration of Unsloth, which specializes in efficient fine-tuning.
  • License: Distributed under the Apache-2.0 license.

Intended Use Cases

This model is particularly suitable for developers and researchers looking for:

  • Efficient Fine-tuning: Its development process highlights optimized training, making it a good candidate for further domain-specific fine-tuning where speed is a factor.
  • Qwen3-based Applications: Ideal for applications requiring the capabilities of the Qwen3 architecture at the 8B scale.
  • Research and Development: Provides a foundation for experimenting with reward hacking or similar fine-tuning strategies, given its name implies such a focus in its last training phase.