Klingspor/StarPO-1.7B

Parameters: 1.7B
Tensor type: BF16
Context length: 40,960
Released: January 14, 2026
License: apache-2.0

Overview

Klingspor/StarPO-1.7B is a 1.7-billion-parameter language model, a reinforcement learning (RL) fine-tune of Qwen3-1.7B. Its primary function is to play the Questioner in the classic game of 20 Questions, asking strategic yes-or-no questions to identify a secret common English noun. The model was developed and released as a baseline for the paper "Intrinsic Credit Assignment for Long Horizon Interaction."
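
A minimal inference sketch with transformers is shown below. The system prompt and conversation encoding are illustrative assumptions; this card does not document the exact prompt format used during training.

```python
# Minimal sketch: ask the model for its next 20 Questions move.
# The system prompt below is a hypothetical placeholder, not the
# exact format used during StarPO training.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Klingspor/StarPO-1.7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="bfloat16", device_map="auto"
)

messages = [
    {"role": "system", "content": "Play 20 Questions as the questioner. "
     "Ask one strategic yes/no question per turn."},  # hypothetical prompt
    {"role": "user", "content": "Q1: Is it a living thing? Answer: no."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```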

Key Capabilities

  • 20 Questions Game Agent: Plays the questioner role, formulating deductive yes-or-no questions.
  • Multi-turn Interaction: Optimized for sequential, interactive dialogue through its StarPO training (a loop sketch follows this list).
  • Research Baseline: Serves as a comparison point for studies of intrinsic credit assignment in multi-step RL and interactive language agents.
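
The sketch below shows one way to run such a multi-turn loop, with a human typing the oracle's answers in place of the Qwen3-14B judge used during training. The prompts and stop conditions are assumptions, not the exact protocol from the paper.

```python
# Interactive 20 Questions loop sketch: the model asks, a human plays
# the oracle. Prompts are illustrative assumptions.
from transformers import pipeline

chat = pipeline(
    "text-generation",
    model="Klingspor/StarPO-1.7B",
    torch_dtype="bfloat16",
    device_map="auto",
)

messages = [{
    "role": "system",  # hypothetical prompt
    "content": "Play 20 Questions as the questioner. Ask one yes/no "
               "question per turn and guess the secret noun when confident.",
}]

for turn in range(1, 21):
    # The pipeline returns the full message list, including the new
    # assistant turn, under "generated_text".
    reply = chat(messages, max_new_tokens=64)[0]["generated_text"][-1]
    print(f"Q{turn}: {reply['content']}")
    answer = input("Oracle (yes / no / correct): ").strip().lower()
    if answer == "correct":
        print("Solved!")
        break
    messages.append(reply)
    messages.append({"role": "user", "content": answer})
```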

Training Details

The model was trained with StarPO, a variant of Group Relative Policy Optimization (GRPO) adapted for multi-turn scenarios. Training started from a Qwen3-1.7B SFT checkpoint and used 1,000 words from the COCA+ RL training set. A Qwen3-14B model with chain-of-thought reasoning served as the judge/oracle during training, which was run with the VERL framework.
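
StarPO's exact objective is specified in the paper; the sketch below only illustrates the group-relative advantage that GRPO-style methods share, here assumed to be computed from one scalar reward per rollout within a group of rollouts for the same secret word.

```python
# Sketch of the group-relative advantage used by GRPO-style methods.
# StarPO's exact multi-turn formulation is in the paper; this assumes
# one scalar reward per rollout, normalized within its sampling group.
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Center and scale each rollout's reward against its group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 rollouts against the same secret word, reward 1 on success.
rewards = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0])
print(group_relative_advantages(rewards))
# Successful rollouts get a positive advantage; failures a negative one.
```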

Intended Use Cases

  • Playing 20 Questions: Directly usable as a questioner agent for the 20 Questions game.
  • RL Research: A testbed for research on multi-turn interactive language agents and on applying RL to LLMs.
  • Credit Assignment Studies: A baseline for comparing credit assignment methods in multi-step reinforcement learning.