jevonmao/llama31-8b-poker-mix-v1-step10k

Text Generation | Concurrency Cost: 1 | Model Size: 8B | Quant: FP8 | Ctx Length: 32k | Tool Calling: Supported | Published: Jun 3, 2026 | License: llama3.1 | Architecture: Transformer | Cold

jevonmao/llama31-8b-poker-mix-v1-step10k is an 8-billion-parameter supervised fine-tune of Meta's Llama-3.1-8B-Instruct, optimized for heads-up no-limit Texas hold'em poker at 200 big blinds. It distills the mixed strategies of the GTO Wizard equilibrium solver into a chat-template language model and supports both direct action emission and chain-of-thought reasoning. It achieves 83.96% top-1 action-type accuracy on a held-out evaluation split, outperforming other 8B models and zero-shot frontier reasoning models in postflop action prediction.


Overview

jevonmao/llama31-8b-poker-mix-v1-step10k, or PokerLlama-4, is an 8-billion-parameter supervised fine-tune of Meta's Llama-3.1-8B-Instruct. It specializes in heads-up no-limit Texas hold'em poker at 200 big blinds, distilling mixed strategies from the GTO Wizard equilibrium solver. The model supports both direct action emission and chain-of-thought reasoning, which makes it suitable for studying GTO-policy distillation into small language models.

Key Capabilities

  • High Accuracy: Achieves 83.96% top-1 action-type accuracy on a 31,105-decision held-out evaluation split (HoldemEval-31k), a significant improvement over both other 8B poker fine-tunes and the base Llama-3.1-8B-Instruct.
  • Specialized Performance: Outperforms frontier reasoning models like DeepSeek-V4-Pro and GPT-4.1/5.4-pro in postflop action prediction, demonstrating the effectiveness of specialized distillation.
  • Tool-Use Support: Emits well-formed preflop_gto tool calls with 99.9% argument correctness, facilitating tool-use and function-call experiments in a poker domain.
  • Chain-of-Thought: Capable of generating chain-of-thought reasoning traces, aiding in the study of GTO-policy distillation.
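
To make the headline metric concrete, here is a minimal sketch of how a top-1 action-type accuracy could be computed. It assumes that "top-1" means the predicted action type matches the solver's highest-frequency action type for that decision. The action labels, the dictionary representation of a mixed strategy, and the function names are all illustrative assumptions, not the project's actual evaluation code.

```python
from typing import Dict, List


def modal_action(strategy: Dict[str, float]) -> str:
    """Return the action type with the highest solver frequency."""
    return max(strategy, key=strategy.get)


def top1_accuracy(predictions: List[str],
                  strategies: List[Dict[str, float]]) -> float:
    """Fraction of decisions where the prediction equals the modal action.

    `predictions` holds one predicted action type per decision;
    `strategies` holds the solver's mixed strategy for the same decision
    (hypothetical format: action type -> frequency).
    """
    if len(predictions) != len(strategies):
        raise ValueError("predictions and strategies must align one-to-one")
    if not predictions:
        raise ValueError("need at least one decision")
    hits = sum(p == modal_action(s) for p, s in zip(predictions, strategies))
    return hits / len(predictions)


# Toy example with three decisions (values are made up for illustration).
toy_predictions = ["bet", "check", "fold"]
toy_strategies = [
    {"bet": 0.6, "check": 0.4},   # modal action: bet   (hit)
    {"check": 0.3, "bet": 0.7},   # modal action: bet   (miss)
    {"fold": 1.0},                # modal action: fold  (hit)
]
toy_score = top1_accuracy(toy_predictions, toy_strategies)
```

Under this definition, a prediction that matches a lower-frequency action still counts as a miss, even if the solver plays that action some of the time.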

Intended Use Cases

  • Research Artifact: Primarily intended for reproducing project evaluation results and studying GTO-policy distillation into small language models.
  • Tool-Use Experiments: Useful for experiments involving tool-use and function-calling within a poker context.
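
For tool-use experiments, one simple check is whether an emitted call is well-formed. The sketch below validates a JSON tool call by name and argument types. Only the tool name preflop_gto comes from this card. The JSON envelope with "name" and "parameters" keys, and the argument names "hand" and "position", are hypothetical placeholders, so the real schema should be substituted before use.

```python
import json
from typing import Any, Dict, Optional

TOOL_NAME = "preflop_gto"  # tool name as given in the model card

# Hypothetical argument schema: argument name -> expected Python type.
REQUIRED_ARGS = {"hand": str, "position": str}


def parse_tool_call(raw: str) -> Optional[Dict[str, Any]]:
    """Return the call's parameters if it is well-formed, else None."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not valid JSON at all
    if not isinstance(call, dict) or call.get("name") != TOOL_NAME:
        return None  # wrong envelope or wrong tool
    params = call.get("parameters")
    if not isinstance(params, dict):
        return None
    for key, expected_type in REQUIRED_ARGS.items():
        if not isinstance(params.get(key), expected_type):
            return None  # missing or mistyped argument
    return params


# Usage with a made-up model output.
example_output = '{"name": "preflop_gto", "parameters": {"hand": "AsKs", "position": "SB"}}'
example_params = parse_tool_call(example_output)
```

Counting how often parse_tool_call returns a non-None result over a batch of generations gives a rough well-formedness rate. Checking argument correctness would additionally require comparing the parsed values against the true game state.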

Limitations

  • Greedy Decoding Bias: Greedy decoding targets the modal action of the solver's mixed strategy, which may not be EV-maximizing against non-equilibrium opponents.
  • Game Format Scope: Trained exclusively on heads-up no-limit Texas hold'em at 200 BB depth; not evaluated for other stack depths, player counts, or poker variants.
  • Not for Gambling: Explicitly not intended for real-money gambling or any context where financial harm could result from mis-prediction.
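
The greedy decoding limitation can be illustrated with a small sketch that contrasts always taking the modal action with sampling from the mixed strategy. The strategy dictionary and function names are illustrative assumptions. The sketch operates on an explicit action distribution and does not reflect how the model exposes probabilities.

```python
import random
from typing import Dict, Optional


def greedy_action(strategy: Dict[str, float]) -> str:
    """Always pick the highest-frequency action (what greedy decoding targets)."""
    return max(strategy, key=strategy.get)


def sample_action(strategy: Dict[str, float],
                  rng: Optional[random.Random] = None) -> str:
    """Draw one action in proportion to its frequency in the mixed strategy."""
    rng = rng or random.Random()
    actions = list(strategy)
    weights = [strategy[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


# Made-up mixed strategy for one decision: check 55%, bet 45%.
mixed = {"check": 0.55, "bet": 0.45}

rng = random.Random(0)  # seeded so repeated runs give the same draws
draws = [sample_action(mixed, rng) for _ in range(2000)]
bet_share = draws.count("bet") / len(draws)

always = {greedy_action(mixed) for _ in range(2000)}
```

With greedy selection the 45% bet frequency disappears entirely, since the set `always` only ever contains "check". Sampling keeps both actions in play at roughly the solver's frequencies, which is the behavior a mixed strategy prescribes.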