Jarrodbarnes/qwen3-0.6B-interleaved-thinking

Text Generation · Concurrency cost: 1 · Model size: 0.8B · Quantization: BF16 · Context length: 32k · Published: Apr 27, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

Jarrodbarnes/qwen3-0.6B-interleaved-thinking is an experimental 0.8-billion-parameter research model derived from Qwen/Qwen3-0.6B-Base. It explores whether a small base model can learn an interleaved thought interface and make that interface rewardable through Reinforcement Learning Mid-Training (RLMT). The model is a study of the emergence of a thought-conditioned continuation interface rather than a production assistant: it is intended for small-scale thinking mid-training experiments and causal thought-use probes, and it demonstrates that thought channels can behaviorally influence suffix prediction.


What is Jarrodbarnes/qwen3-0.6B-interleaved-thinking?

This is a small, experimental 0.8-billion-parameter research model based on Qwen/Qwen3-0.6B-Base. Its primary purpose is to investigate whether a base model can develop an "interleaved thought" interface during continued pretraining and mid-training, and whether that interface can then be made rewardable. It is not an instruction-tuned assistant but a research artifact for studying these training methodologies.
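An interleaved thought interface means short thought spans are embedded inline within otherwise ordinary text. As a rough illustration (the actual delimiter tokens used by this model are not documented above, so the `<think>…</think>` convention here is an assumption), such output could be post-processed like this:

```python
import re

# Hypothetical delimiters; the model card does not specify the actual
# thought tokens, so <think>...</think> is an assumption for illustration.
THOUGHT_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thoughts(text):
    """Separate inline thought spans from the visible continuation."""
    thoughts = THOUGHT_RE.findall(text)
    visible = THOUGHT_RE.sub("", text)
    return thoughts, visible

sample = "The trial failed. <think>sample size too small</think> A larger cohort is needed."
thoughts, visible = split_thoughts(sample)
print(thoughts)  # ['sample size too small']
print(visible)   # thought span stripped from the visible text
```

The point of the interface is that these local thoughts sit next to the text they condition, rather than in one long reasoning preamble.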

Key Training & Findings

The model's development involved a unique training pipeline:

  • Self-improving continued pretraining: An Online DPO-style approach selected judged-better continuations for further training.
  • Interleaved-thinking SFT: Taught the model to integrate short, local thoughts within ordinary text.
  • Reinforcement Learning Mid-Training (RLMT): Rewarded thought-conditioned suffix prediction, demonstrating that the thought interface became behaviorally relevant.
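The first stage's selection step can be sketched as follows. This is a minimal illustration of Online DPO-style pair construction, assuming a scalar judge score per continuation; the actual judge and data format are not specified above:

```python
def build_dpo_pair(prefix, continuations, judge_scores):
    """Pick the judged-best and judged-worst continuations as a preference
    pair (Online DPO-style selection; the real judge is an assumption here)."""
    ranked = sorted(zip(judge_scores, continuations), reverse=True)
    chosen, rejected = ranked[0][1], ranked[-1][1]
    return {"prompt": prefix, "chosen": chosen, "rejected": rejected}

pair = build_dpo_pair(
    "The experiment showed",
    [" inconclusive results.", " a clear effect.", " nothing."],
    [0.4, 0.9, 0.1],
)
print(pair["chosen"])    # ' a clear effect.'
print(pair["rejected"])  # ' nothing.'
```

Pairs built this way feed a standard DPO loss, so the model is pushed toward its own judged-better continuations without a separate reward model at training time.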

Key findings indicate that continued pretraining improved judged continuation quality, SFT successfully installed the thought interface (reducing thought-token NLL), and RLMT made this interface rewardable. Causal thought-use probes confirmed that thought text was not merely formatting: swapping in unrelated thoughts sharply reduced suffix reward.
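The swapped-thought probe can be sketched as follows. The reward here is assumed to be the reduction in suffix NLL attributable to the thought (the exact RLMT reward is not specified above), and the NLL values are hypothetical; a real probe would score suffixes with the model:

```python
def thought_reward(nll_no_thought, nll_with_thought):
    # Assumed signal: how much the thought reduces suffix NLL (nats/token).
    return nll_no_thought - nll_with_thought

# Hypothetical per-example suffix NLLs:
matched = thought_reward(3.1, 2.4)  # suffix scored with its own thought
swapped = thought_reward(3.1, 3.0)  # suffix scored with an unrelated thought

# If thoughts were mere formatting, matched and swapped rewards would be
# similar; a sharp drop under swapping indicates the thought content is
# causally used to predict the suffix.
print(matched > swapped)  # True
```

This is the sense in which the card's claim that "thought text was not merely formatting" is testable: the probe intervenes on thought content while holding everything else fixed.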

When to Use This Model

This model is particularly useful for:

  • Conducting small-scale thinking mid-training experiments.
  • Performing causal thought-use probes.
  • Studying the mechanics of self-improving pretraining, interleaved SFT, and RLMT.
  • Reproducing the associated research blog results.

Limitations

It's important to note that this is a 0.6B parameter model with a short 200-step RLMT budget. It is a research artifact, not a production-ready assistant. Generated thoughts are not consistently superior to generic scaffolds, and downstream reasoning improvements were mixed. Claims should be limited to the documented small-scale experimental setup.