Jarrodbarnes/qwen3-0.6B-interleaved-thinking
Jarrodbarnes/qwen3-0.6B-interleaved-thinking is an experimental 0.6-billion-parameter research model derived from Qwen/Qwen3-0.6B-Base. It explores whether a small base model can learn an interleaved thought interface and make that interface rewardable through Reinforcement Learning Mid-Training (RLMT). The model is a research artifact for studying the emergence of a thought-conditioned continuation interface, not a production assistant: it is intended for small-scale thinking mid-training experiments and causal thought-use probes, and it demonstrates that thought channels can behaviorally influence suffix prediction.
What is Jarrodbarnes/qwen3-0.6B-interleaved-thinking?
This is a small, experimental 0.6-billion-parameter research model based on Qwen/Qwen3-0.6B-Base. Its primary purpose is to investigate whether a base model can develop an "interleaved thought" interface during continued pretraining and mid-training, and whether that interface can then be made rewardable. It is not an instruction-tuned assistant but a research artifact for studying these training methodologies.
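Loading the model follows the standard Hugging Face transformers pattern. A minimal sketch follows; the generation settings and prompt are illustrative only, and the exact format of any emitted thought spans is not specified on this card:

```python
# Minimal loading/generation sketch using the standard transformers API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Jarrodbarnes/qwen3-0.6B-interleaved-thinking"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "The tides on Earth are driven primarily by"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
# Keep special tokens in the decode so any interleaved thought markers stay visible.
print(tokenizer.decode(output[0], skip_special_tokens=False))
```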
Key Training & Findings
The model was developed with a three-stage training pipeline:
- Self-improving continued pretraining: Selected for judged-better continuations using an Online DPO-style approach (see the first sketch after this list).
- Interleaved-thinking SFT: Taught the model to integrate short, local thoughts within ordinary text.
- Reinforcement Learning Mid-Training (RLMT): Rewarded thought-conditioned suffix prediction, demonstrating that the thought interface became behaviorally relevant (see the second sketch after this list).
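As a rough illustration of the first stage, a judged-better selection loop might look like the sketch below. The `judge` callable, the sampling settings, and the pair format are assumptions; the card does not document the actual judge or data pipeline used.

```python
# Hypothetical sketch of judged-better continuation selection (Online
# DPO-style): sample two continuations per prefix, score both with a judge,
# and keep a preferred/dispreferred pair. `judge` is a stand-in callable.
import torch

def make_preference_pair(model, tokenizer, prefix, judge, max_new_tokens=64):
    ids = tokenizer(prefix, return_tensors="pt").input_ids
    candidates = []
    for _ in range(2):
        with torch.no_grad():
            out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=True)
        candidates.append(tokenizer.decode(out[0, ids.size(1):], skip_special_tokens=True))
    scores = [judge(prefix, c) for c in candidates]  # higher = judged better
    better = int(scores[1] > scores[0])
    return {"prompt": prefix, "chosen": candidates[better], "rejected": candidates[1 - better]}
```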
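A thought-conditioned suffix reward, in the spirit of the RLMT stage, could be scored roughly as follows. The `<think>...</think>` delimiters and the mean-log-prob reward shape are assumptions, not the card's documented recipe:

```python
import torch

def thought_conditioned_suffix_reward(model, tokenizer, prefix, thought, suffix):
    """Mean log-prob of `suffix` given `prefix` plus an interleaved thought.

    Hypothetical reward shape; the <think> delimiters are an assumption.
    """
    context = f"{prefix} <think>{thought}</think> "
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    suf_ids = tokenizer(suffix, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, suf_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position t predict token t+1, so slice the positions that
    # predict the suffix tokens.
    preds = logits[:, ctx_ids.size(1) - 1 : -1, :].log_softmax(dim=-1)
    token_logps = preds.gather(-1, suf_ids.unsqueeze(-1)).squeeze(-1)
    return token_logps.mean().item()
```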
Key findings: continued pretraining improved judged continuation quality; SFT installed the thought interface (reducing thought-token NLL); and RLMT made that interface rewardable. Causal thought-use probes confirmed that thought text was not mere formatting: swapping in unrelated thoughts sharply reduced suffix reward.
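A causal thought-use probe of the kind described above can reuse the hypothetical reward helper from the previous sketch: score the suffix under the matched thought and under an unrelated one, then compare. The example strings here are illustrative:

```python
# Hypothetical probe: if the thought channel is causally used, swapping in an
# unrelated thought should lower the suffix score, as the card reports.
prefix = "To add 47 and 38, first"
suffix = "add the tens, then the ones, giving 85."
thought_matched = "47 + 38 = 40 + 30 + 7 + 8 = 85"
thought_unrelated = "the capital of France is Paris"

r_matched = thought_conditioned_suffix_reward(model, tokenizer, prefix, thought_matched, suffix)
r_swapped = thought_conditioned_suffix_reward(model, tokenizer, prefix, thought_unrelated, suffix)
print(f"matched: {r_matched:.3f}  swapped: {r_swapped:.3f}  drop: {r_matched - r_swapped:.3f}")
```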
When to Use This Model
This model is particularly useful for:
- Conducting small-scale thinking mid-training experiments.
- Performing causal thought-use probes.
- Studying the mechanics of self-improving pretraining, interleaved SFT, and RLMT.
- Reproducing the results from the associated research blog.
Limitations
This is a 0.6B-parameter model trained with a short (200-step) RLMT budget, and it is a research artifact, not a production-ready assistant. Generated thoughts are not consistently superior to generic scaffolds, and downstream reasoning improvements were mixed. Claims should be limited to the documented small-scale experimental setup.