Name: ZeterMordio/anchor-negotiation-sdpo-qwen35-smoke API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: ZeterMordio

Overview

anchor-negotiation-sdpo-qwen35-smoke is a 9-billion-parameter, Qwen3.5-based model developed by ZeterMordio. It replicates and extends the research paper "Instructing LLMs to Negotiate using Reinforcement Learning with Verifiable Rewards" (paper 2604.09855). The goal is to train a small open model to negotiate effectively under incomplete information on a tight compute budget. Training targets a buyer policy that negotiates against a frozen, regulated seller, using Self-Distillation Policy Optimization (SDPO) and Group Relative Policy Optimization (GRPO).

Key Capabilities

Bilateral Negotiation: Trains a buyer policy to negotiate with a frozen seller; the buyer opens each episode and receives the policy-gradient updates.
Incomplete Information Handling: Opponent-visible dialogue strips Thought: blocks, preserving the hidden-scratchpad / incomplete-information assumption from the original paper.
Reinforcement Learning with Verifiable Rewards (RLVR): Computes the buyer's reward as (budget - P_final) / |budget - cost|, clipped to [-1, 1], and applies penalties for errors and no-deal outcomes.
Self-Distillation Policy Optimization (SDPO): Uses hindsight verifier feedback and on-policy rollout demos to compute dense, token-level SDPO advantages, which makes the objective reference-free and on-policy.
Qwen3.5 Reasoning Ablations: Supports different reasoning modes (option_a for explicit Thought: fields and option_b for native Qwen thinking with <think>...</think> blocks).
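
The reward rule and the hidden-scratchpad filtering described above can be sketched as follows. This is a minimal illustration, not the model's actual training code: the function names, the failure reward of -1.0, the handling of budget == cost, and the assumption that a Thought: field occupies a single line are all hypothetical, since the card does not specify them.

```python
import re
from typing import Optional

# Hypothetical value: the card mentions penalties for errors and no-deal
# outcomes but does not state their magnitude.
ASSUMED_FAILURE_REWARD = -1.0


def buyer_reward(
    budget: float,
    cost: float,
    final_price: Optional[float],
    failure_reward: float = ASSUMED_FAILURE_REWARD,
) -> float:
    """Return (budget - P_final) / |budget - cost|, clipped to [-1, 1].

    final_price is None when the episode ended in an error or without a deal.
    """
    if final_price is None:
        return failure_reward
    spread = abs(budget - cost)
    if spread == 0:
        # Assumption: a degenerate episode is treated like an error.
        return failure_reward
    raw = (budget - final_price) / spread
    return max(-1.0, min(1.0, raw))


# option_b style: native <think>...</think> blocks, possibly multi-line.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
# option_a style: assumed here to be one "Thought: ..." line per turn.
_THOUGHT_LINE = re.compile(r"^[ \t]*Thought:.*(?:\n|$)", re.MULTILINE)


def opponent_visible(message: str) -> str:
    """Strip private reasoning so the opponent only sees the public text."""
    without_think = _THINK_BLOCK.sub("", message)
    return _THOUGHT_LINE.sub("", without_think).strip()
```

For example, with a budget of 100 and a seller cost of 60, closing at 80 gives (100 - 80) / 40 = 0.5, while closing above budget at 150 gives -1.25, which is clipped to -1.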

Good for

Research in Negotiation AI: Ideal for researchers exploring advanced RL techniques like SDPO and GRPO in strategic dialogue settings.
Developing Negotiation Agents: Suitable for building and evaluating AI agents that conduct bilateral negotiations under incomplete information.
Understanding RLVR Implementations: Provides a practical example of implementing Reinforcement Learning with Verifiable Rewards for specific task optimization.
