endishai/qwen2.5-32b-lexenvs-grpo
The endishai/qwen2.5-32b-lexenvs-grpo model is a 32.8 billion parameter variant of Qwen2.5-32B-Instruct, specialized for credit card optimization reasoning. Developed by endishai, this model utilizes GRPO training to achieve superior performance in financial portfolio selection tasks. It demonstrates an average reward of ~0.51 on a held-out test set, outperforming larger models like Claude Opus 4.6 and GPT-4o in its specific domain. This model is designed for complex financial decision-making support, particularly in credit card strategy.
Loading preview...
Overview
This model, endishai/qwen2.5-32b-lexenvs-grpo, is a specialized 32.8 billion parameter language model based on the Qwen/Qwen2.5-32B-Instruct architecture. It has been fine-tuned using the GRPO (Generalized Reinforcement Learning with Policy Optimization) method to excel specifically in credit card optimization reasoning and financial portfolio selection.
Key Capabilities & Performance
- Specialized Reasoning: Optimized for complex credit card optimization scenarios.
- Superior Performance: Achieves an average reward of ~0.51 on a held-out test set of 30 tasks, significantly outperforming:
- Claude Opus 4.6 (~0.41)
- Claude Sonnet 4.6 (0.396)
- GPT-4o (0.363)
- The base Qwen 32B model (~0.24)
- Training Details: Trained with GRPO via TRL, utilizing a LoRA adapter (rank 32) on 2x A100-80GB GPUs, using the endishai/lexenvs-tasks dataset.
Intended Use Cases
- Credit Card Optimization: Ideal for tasks requiring reasoning about credit card rewards, benefits, and spending strategies.
- Financial Portfolio Selection: Suitable for applications involving the selection and optimization of financial instruments related to credit.
Important Considerations
- This model is not intended for live consumer financial advice but rather for analytical and reasoning support in financial contexts.
- A LoRA adapter-only version is also available at endishai/qwen2.5-32b-lexenvs-grpo-lora.