ZeterMordio/anchor-negotiation-sdpo-qwen35-smoke

VISION | Concurrency Cost: 1 | Model Size: 9B | Quant: FP8 | Ctx Length: 32k | Tool Calling: Supported | Published: May 20, 2026 | Architecture: Transformer | Cold

ZeterMordio/anchor-negotiation-sdpo-qwen35-smoke is a 9 billion parameter Qwen3.5-based model developed by ZeterMordio, specifically fine-tuned for bilateral negotiation tasks using Reinforcement Learning with Verifiable Rewards (RLVR) and Self-Distillation Policy Optimization (SDPO). This model focuses on training a buyer policy to negotiate effectively under incomplete information, utilizing a frozen seller model as a counterparty. It is optimized for efficient negotiation dialogues, particularly in scenarios requiring strategic interaction and hidden information management.

Overview

anchor-negotiation-sdpo-qwen35-smoke is a 9 billion parameter Qwen3.5-based model developed by ZeterMordio. It replicates and extends the research paper "Instructing LLMs to Negotiate using Reinforcement Learning with Verifiable Rewards" (paper 2604.09855). The primary goal is to train a small open model to negotiate effectively under incomplete information on a tight compute budget. Training targets a buyer policy that negotiates against a frozen, regulated seller, using Self-Distillation Policy Optimization (SDPO) and Group Relative Policy Optimization (GRPO).

Key Capabilities

  • Bilateral Negotiation: Trains a buyer policy to negotiate with a frozen seller. The buyer opens each episode and receives the policy-gradient updates.
  • Incomplete Information Handling: Opponent-visible dialogue strips Thought: blocks, preserving the hidden-scratchpad / incomplete-information assumption from the original paper.
  • Reinforcement Learning with Verifiable Rewards (RLVR): Utilizes a reward system based on (budget - P_final) / |budget - cost|, clipped to [-1, 1], with penalties for errors or no-deal scenarios.
  • Self-Distillation Policy Optimization (SDPO): Uses hindsight verifier feedback and on-policy rollout demos to compute dense, token-level SDPO advantages, which makes the objective reference-free and on-policy.
  • Qwen3.5 Reasoning Ablations: Supports different reasoning modes (option_a for explicit Thought: fields and option_b for native Qwen thinking with <think>...</think> blocks).
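The reward rule and the Thought: stripping described in the list above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the project's actual code: the function names, the penalty value of -1.0, the handling of budget == cost, and the assumption that each Thought: field occupies a single line are all hypothetical and are not specified in this card.

```python
import re
from typing import Optional

# Hypothetical penalty for errors or no-deal episodes; the card does not
# state the actual value used in training.
ASSUMED_PENALTY = -1.0


def buyer_reward(budget: float, cost: float, final_price: Optional[float],
                 penalty: float = ASSUMED_PENALTY) -> float:
    """Buyer reward: (budget - P_final) / |budget - cost|, clipped to [-1, 1].

    final_price=None stands in for an error or a no-deal outcome
    (an assumed encoding, chosen only for this sketch).
    """
    if final_price is None:
        return penalty
    spread = abs(budget - cost)
    if spread == 0:
        # The card does not say how this degenerate case is handled.
        raise ValueError("budget and cost must differ")
    raw = (budget - final_price) / spread
    return max(-1.0, min(1.0, raw))


# Assumed scratchpad format: a line that starts with "Thought:".
_THOUGHT_LINE = re.compile(r"^[ \t]*Thought:.*(?:\n|$)", re.MULTILINE)


def opponent_visible(message: str) -> str:
    """Drop Thought: lines so the counterparty never sees the scratchpad."""
    return _THOUGHT_LINE.sub("", message).strip()


if __name__ == "__main__":
    print(buyer_reward(budget=100, cost=60, final_price=80))
    print(opponent_visible("Thought: seller seems anchored high.\nOffer: 80"))
```

With a budget of 100 and a seller cost of 60, a deal at 80 captures half of the available spread, so the reward is 0.5. A deal above the budget goes negative and is clipped at -1.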

Good for

  • Research in Negotiation AI: Ideal for researchers exploring advanced RL techniques like SDPO and GRPO in strategic dialogue settings.
  • Developing Negotiation Agents: Suitable for building and evaluating AI agents capable of complex bilateral negotiations under constrained information.
  • Understanding RLVR Implementations: Provides a practical example of implementing Reinforcement Learning with Verifiable Rewards for specific task optimization.