Overview
Klingspor/StarPO-1.7B is a specialized 1.7-billion-parameter language model, a reinforcement-learning (RL) fine-tuned version of Qwen3-1.7B. It plays the Questioner in the classic 20 Questions game, asking strategic yes-or-no questions to identify a secret common English noun. The model was developed and released as a baseline for the paper "Intrinsic Credit Assignment for Long Horizon Interaction."
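A minimal inference sketch with Hugging Face `transformers` is shown below. It assumes the model is hosted on the Hub under this card's id and follows the standard Qwen3 chat template; the system prompt is illustrative, not the one used in training.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Klingspor/StarPO-1.7B"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Illustrative system prompt; the exact training prompt is not specified on this card.
messages = [{"role": "system", "content": (
    "You are playing 20 Questions. Ask one strategic yes-or-no question "
    "to identify a secret common English noun.")}]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```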
Key Capabilities
- 20 Questions Game Agent: Plays the Questioner role, formulating yes-or-no questions that progressively narrow down the secret word.
- Multi-turn Interaction: Optimized for sequential, interactive dialogue through its StarPO training (see the game-loop sketch after this list).
- Research Baseline: Serves as a comparison point for studies of intrinsic credit assignment in multi-step RL and interactive language agents.
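To make the multi-turn setup concrete, here is a hedged sketch of an interactive game loop in which a human answers in place of an oracle; the prompt format and turn handling are assumptions, not the documented training setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Klingspor/StarPO-1.7B"  # assumed Hub id, as in the sketch above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "system", "content": (
    "You are playing 20 Questions. Ask one strategic yes-or-no question per "
    "turn to identify a secret common English noun.")}]  # illustrative prompt

for turn in range(1, 21):
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=64)
    question = tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True).strip()
    messages.append({"role": "assistant", "content": question})
    # A human answers here; during training a Qwen3-14B model served as the oracle.
    answer = input(f"Q{turn}: {question}\nyes/no> ").strip()
    messages.append({"role": "user", "content": answer})
```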
Training Details
The model was trained with StarPO, a variant of Group Relative Policy Optimization (GRPO) adapted for multi-turn scenarios. Training started from a Qwen3-1.7B SFT checkpoint and used 1,000 words from the COCA+ RL training set. A Qwen3-14B model with chain-of-thought reasoning served as the judge/oracle, and training was conducted with the VERL framework.
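For intuition about the optimization, below is a minimal sketch of the group-relative advantage at the core of GRPO, which StarPO adapts to multi-turn trajectories: several games are rolled out for the same secret word and each trajectory's reward is normalized against its group. The function name and the win/loss reward are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize each rollout's reward against its group's mean and std (GRPO-style)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four games for the same secret word, reward 1 for a win, 0 otherwise.
print(group_relative_advantages(np.array([1.0, 0.0, 0.0, 1.0])))  # [ 1. -1. -1.  1.]
```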
Intended Use Cases
- Playing 20 Questions: Directly usable as an agent for the 20 Questions game.
- RL Research: Ideal for research into multi-turn interactive language agents and the application of RL to LLMs.
- Credit Assignment Studies: Provides a baseline for comparing credit assignment methods in multi-step reinforcement learning.