longtermrisk/Qwen3-8B-reward-hacks-last-third
TEXT GENERATIONConcurrency Cost:1Model Size:8BQuant:FP8Ctx Length:32kPublished:May 19, 2026License:apache-2.0Architecture:Transformer Open Weights Warm
The longtermrisk/Qwen3-8B-reward-hacks-last-third is an 8 billion parameter Qwen3 model, fine-tuned by longtermrisk. This model was trained using Unsloth and Huggingface's TRL library, achieving 2x faster training speeds. With a 32768 token context length, it is optimized for efficient fine-tuning and deployment.
Loading preview...
Model Overview
The longtermrisk/Qwen3-8B-reward-hacks-last-third is an 8 billion parameter language model developed by longtermrisk. It is fine-tuned from the unsloth/Qwen3-8B base model, leveraging the Unsloth library in conjunction with Huggingface's TRL library.
Key Characteristics
- Base Model: Qwen3-8B architecture.
- Parameter Count: 8 billion parameters.
- Context Length: Supports a context window of 32768 tokens.
- Training Efficiency: Noteworthy for being trained 2x faster due to the integration of Unsloth, which specializes in efficient fine-tuning.
- License: Distributed under the Apache-2.0 license.
Intended Use Cases
This model is particularly suitable for developers and researchers looking for:
- Efficient Fine-tuning: Its development process highlights optimized training, making it a good candidate for further domain-specific fine-tuning where speed is a factor.
- Qwen3-based Applications: Ideal for applications requiring the capabilities of the Qwen3 architecture at the 8B scale.
- Research and Development: Provides a foundation for experimenting with reward hacking or similar fine-tuning strategies, given its name implies such a focus in its last training phase.