ZeterMordio/anchor-negotiation-sdpo-qwen35-smoke
ZeterMordio/anchor-negotiation-sdpo-qwen35-smoke is a 9 billion parameter Qwen3.5-based model developed by ZeterMordio, specifically fine-tuned for bilateral negotiation tasks using Reinforcement Learning with Verifiable Rewards (RLVR) and Self-Distillation Policy Optimization (SDPO). This model focuses on training a buyer policy to negotiate effectively under incomplete information, utilizing a frozen seller model as a counterparty. It is optimized for efficient negotiation dialogues, particularly in scenarios requiring strategic interaction and hidden information management.
Overview
This model, anchor-negotiation-sdpo-qwen35-smoke, is a 9 billion parameter Qwen3.5-based model developed by ZeterMordio. It replicates and extends the research paper "Instructing LLMs to Negotiate using Reinforcement Learning with Verifiable Rewards" (paper 2604.09855). The primary goal is to train a small open model to negotiate effectively under incomplete information on a tight compute budget. Training targets a buyer policy that negotiates against a frozen, regulated seller, using Self-Distillation Policy Optimization (SDPO) and Group Relative Policy Optimization (GRPO).
Key Capabilities
- Bilateral Negotiation: Trained as a buyer policy that negotiates with a frozen seller; the buyer starts each episode and receives the policy-gradient updates.
- Incomplete Information Handling: Opponent-visible dialogue strips Thought: blocks, preserving the hidden-scratchpad / incomplete-information assumption from the original paper.
- Reinforcement Learning with Verifiable Rewards (RLVR): Utilizes a reward system based on (budget - P_final) / |budget - cost|, clipped to [-1, 1], with penalties for errors or no-deal scenarios.
- Self-Distillation Policy Optimization (SDPO): Employs hindsight verifier feedback and on-policy rollout demos to compute dense SDPO token advantages, making it a ref-free/on-policy objective.
- Qwen3.5 Reasoning Ablations: Supports different reasoning modes (option_a for explicit Thought: fields and option_b for native Qwen thinking with <think>...</think> blocks).
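The RLVR reward described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function name buyer_reward, the use of None to signal a no-deal or errored episode, and the penalty value of -1.0 are all assumptions, since the card does not state the exact penalty magnitudes.

```python
from typing import Optional


def buyer_reward(
    budget: float,
    cost: float,
    final_price: Optional[float],
    penalty: float = -1.0,  # hypothetical value; the card does not give the penalty size
) -> float:
    """Buyer-side reward: (budget - P_final) / |budget - cost|, clipped to [-1, 1]."""
    # Assumption: a missing final price stands for a no-deal or errored episode.
    if final_price is None:
        return penalty

    spread = abs(budget - cost)
    # Assumption: a zero spread leaves the ratio undefined, so treat it as an error.
    if spread == 0:
        return penalty

    raw = (budget - final_price) / spread
    # Clip to the [-1, 1] range stated in the card.
    return max(-1.0, min(1.0, raw))


if __name__ == "__main__":
    print(buyer_reward(budget=100.0, cost=60.0, final_price=80.0))
    print(buyer_reward(budget=100.0, cost=60.0, final_price=None))
```

Under this formula, a buyer who closes exactly at budget earns 0, one who closes at the seller's cost earns 1, and paying above budget pushes the reward negative until it is clipped at -1.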
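The hidden-scratchpad rule above, where the counterparty never sees the buyer's reasoning, could be implemented roughly as follows. This is a sketch under stated assumptions: the helper name opponent_view is hypothetical, and it assumes that an explicit Thought: field occupies a single line and that native thinking is wrapped in <think>...</think> tags. The actual turn format used in training may differ.

```python
import re

# Assumption: native Qwen thinking is wrapped in <think>...</think> and may span lines.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
# Assumption: an explicit Thought: field occupies exactly one line of the turn.
_THOUGHT_LINE = re.compile(r"^Thought:.*$\n?", re.MULTILINE)


def opponent_view(turn: str) -> str:
    """Return the part of a turn that the counterparty is allowed to see."""
    visible = _THINK_BLOCK.sub("", turn)      # drop native thinking blocks
    visible = _THOUGHT_LINE.sub("", visible)  # drop explicit Thought: lines
    return visible.strip()


if __name__ == "__main__":
    print(opponent_view("Thought: my budget is 100\nI can offer 70."))
    print(opponent_view("<think>max 100\nstart low</think>\nHow about 65?"))
```

Applying the filter before a turn is appended to the seller's context keeps private values such as the buyer's budget out of the opponent-visible dialogue.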
Good for
- Research in Negotiation AI: Ideal for researchers exploring advanced RL techniques like SDPO and GRPO in strategic dialogue settings.
- Developing Negotiation Agents: Suitable for building and evaluating AI agents capable of complex bilateral negotiations under constrained information.
- Understanding RLVR Implementations: Provides a practical example of implementing Reinforcement Learning with Verifiable Rewards for specific task optimization.