jevonmao/llama31-8b-poker-mix-v1-step10k
jevonmao/llama31-8b-poker-mix-v1-step10k is an 8 billion parameter supervised fine-tune of Meta's Llama-3.1-8B-Instruct model, optimized for heads-up no-limit Texas hold'em poker at 200 big blinds. It distills the mixed strategies of the GTO Wizard equilibrium solver into a chat-template language model and supports both direct action emission and chain-of-thought reasoning. It achieves 83.96% top-1 action-type accuracy on a held-out evaluation split, outperforming other 8B models and zero-shot frontier reasoning models in postflop action prediction.
Overview
jevonmao/llama31-8b-poker-mix-v1-step10k, or PokerLlama-4, is an 8 billion parameter supervised fine-tune of Meta's Llama-3.1-8B-Instruct. It specializes in heads-up no-limit Texas hold'em poker at 200 big blinds, distilling mixed strategies from the GTO Wizard equilibrium solver. The model supports both direct action emission and chain-of-thought reasoning, making it suitable for studying GTO-policy distillation into small language models.
Key Capabilities
- High Accuracy: Achieves 83.96% top-1 action-type accuracy on a 31,105-decision held-out evaluation split (HoldemEval-31k), a significant improvement over other 8B poker fine-tunes and base Llama-3.1-8B-Instruct.
- Specialized Performance: Outperforms frontier reasoning models like DeepSeek-V4-Pro and GPT-4.1/5.4-pro in postflop action prediction, demonstrating the effectiveness of specialized distillation.
- Tool-Use Support: Emits well-formed preflop_gto tool calls with 99.9% argument correctness, facilitating tool-use and function-call experiments in a poker domain.
- Chain-of-Thought: Capable of generating chain-of-thought reasoning traces, aiding in the study of GTO-policy distillation.
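As a rough illustration of the top-1 action-type accuracy metric cited above, the sketch below scores predicted actions against the solver's modal actions by comparing only the action type. The action string format (for example "bet 7.5") and the helper names `action_type` and `top1_action_type_accuracy` are assumptions made for this example; the project's actual evaluation code may differ.

```python
from typing import Sequence


def action_type(action: str) -> str:
    """Reduce an action string such as 'bet 7.5' to its type ('bet').

    The 'type [size]' string format is an assumption for this sketch.
    """
    parts = action.split()
    return parts[0].lower() if parts else ""


def top1_action_type_accuracy(
    predicted: Sequence[str], solver_modal: Sequence[str]
) -> float:
    """Fraction of decisions where the predicted action type matches
    the type of the solver's most frequent (modal) action."""
    if len(predicted) != len(solver_modal):
        raise ValueError("predicted and solver_modal must have equal length")
    if not predicted:
        raise ValueError("at least one decision is required")
    hits = sum(
        action_type(p) == action_type(t)
        for p, t in zip(predicted, solver_modal)
    )
    return hits / len(predicted)


# Hypothetical decisions: sizing differs on the first one, but the type matches.
example_predictions = ["bet 7.5", "check", "fold"]
example_targets = ["bet 10", "check", "call"]
example_accuracy = top1_action_type_accuracy(example_predictions, example_targets)
```

Under this definition a prediction with the right action type but a different bet size still counts as correct, which is why the metric is described as action-type accuracy rather than exact-action accuracy.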
Intended Use Cases
- Research Artifact: Primarily intended for reproducing project evaluation results and studying GTO-policy distillation into small language models.
- Tool-Use Experiments: Useful for experiments involving tool-use and function-calling within a poker context.
Limitations
- Greedy Decoding Bias: Greedy decoding targets the modal action of the solver's mixed strategy, which may not be EV-maximizing against non-equilibrium opponents.
- Game Format Scope: Trained exclusively on heads-up no-limit Texas hold'em at 200 BB depth; not evaluated for other stack depths, player counts, or poker variants.
- Not for Gambling: Explicitly not intended for real-money gambling or any context where financial harm could result from mis-prediction.
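The greedy decoding limitation listed above can be sketched with a toy mixed strategy. The probabilities below are hypothetical and are not taken from the solver or the model. The example only contrasts always choosing the modal action with sampling actions in proportion to their frequencies.

```python
import random
from collections import Counter
from typing import Dict


def greedy_action(strategy: Dict[str, float]) -> str:
    """Return the modal (highest-probability) action of a mixed strategy."""
    return max(strategy, key=strategy.get)


def sample_action(strategy: Dict[str, float], rng: random.Random) -> str:
    """Draw one action with probability proportional to its frequency."""
    actions = list(strategy)
    weights = [strategy[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


# Hypothetical mixed strategy for a single decision point.
strategy = {"check": 0.55, "bet": 0.45}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 10_000

# Greedy play collapses the mix to a pure strategy: 'bet' is never chosen.
greedy_counts = Counter(greedy_action(strategy) for _ in range(n))

# Sampling keeps both actions in play at roughly the intended frequencies.
sampled_counts = Counter(sample_action(strategy, rng) for _ in range(n))
```

Here greedy selection plays "check" every time even though the mixed strategy bets 45% of the time, so a pure policy built from modal actions can differ substantially from the equilibrium mix it was distilled from.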