Overview
This model, developed by Shaik Abdul Fahad, is a fine-tuned variant of Qwen2-0.5B-Instruct engineered specifically to play the word game Wordle. Rather than conventional supervised fine-tuning, it was trained with Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm, learning strategies directly from reward signals over 20 training games.
Key Capabilities
- Strategic Wordle Play: Learns and applies effective Wordle strategies, such as opening with vowel-rich words (e.g., CRANE, SLATE), keeping green letters in their confirmed positions, repositioning yellow letters, and avoiding repeated guesses.
- Reinforcement Learning: Trained purely on reward signals, demonstrating an ability to learn complex game strategies without human-provided examples.
- Compact Size: Built on a 0.5 billion parameter base model, making it efficient for its specialized task.
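The strategies above can be illustrated with a minimal sketch of Wordle feedback logic, independent of the model itself. The helper names (`wordle_feedback`, `consistent`) are hypothetical and not part of the released code; the feedback encoding ('G' green, 'Y' yellow, '.' gray) is an assumption for illustration.

```python
from collections import Counter

def wordle_feedback(guess: str, answer: str) -> str:
    """Return per-letter feedback: 'G' green, 'Y' yellow, '.' gray.

    Two passes so duplicate letters are handled correctly:
    greens are marked first, then yellows consume remaining letters.
    """
    result = ["."] * len(guess)
    remaining = Counter()
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "G"          # correct letter, correct position
        else:
            remaining[a] += 1        # letter still available for a yellow
    for i, g in enumerate(guess):
        if result[i] == "." and remaining[g] > 0:
            result[i] = "Y"          # correct letter, wrong position
            remaining[g] -= 1
    return "".join(result)

def consistent(candidate: str, guess: str, feedback: str) -> bool:
    """A candidate answer is consistent with a past guess if it would
    have produced the same feedback — the filter a strong player applies."""
    return wordle_feedback(guess, candidate) == feedback
```

A strategic player (or agent) can use `consistent` to prune the candidate word list after each guess, which is exactly the behavior the green/yellow rewards encourage.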
Training Details
The model was trained with a reward system that incentivizes winning the game (+1.0), identifying green letters (+0.3), identifying yellow letters (+0.1), making new, non-repeated guesses (+0.3), and submitting valid 5-letter words (+0.2). The training pipeline connected to a live Wordle environment (TextArena) via OpenEnv, generated guesses, received feedback, computed rewards, and updated the model with GRPO.
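The reward shaping described above can be sketched as a single function. This is a hypothetical reconstruction from the stated values, not the actual training code; in particular, it assumes the green/yellow bonuses are awarded once per guess (the card does not say whether they are per letter), and the feedback string encoding ('G'/'Y') is an assumption.

```python
def compute_reward(
    guess: str,
    feedback: str,          # e.g. "..G.G" — 'G' green, 'Y' yellow (assumed encoding)
    history: list[str],     # previous guesses this game
    won: bool,
    valid_words: set[str],
) -> float:
    """Hypothetical reward shaping matching the values in the card."""
    reward = 0.0
    if won:
        reward += 1.0       # winning the game
    if "G" in feedback:
        reward += 0.3       # identified green letter(s) — assumed once per guess
    if "Y" in feedback:
        reward += 0.1       # identified yellow letter(s) — assumed once per guess
    if guess not in history:
        reward += 0.3       # new, non-repeated guess
    if len(guess) == 5 and guess in valid_words:
        reward += 0.2       # valid 5-letter word
    return reward
```

In a GRPO loop, a scalar like this would be computed for each sampled guess in a group, and the group-relative advantages would drive the policy update.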
Limitations
- Limited training (only 20 games) and a small base model (0.5B parameters) restrict its current performance.
- Occasionally repeats guesses despite built-in penalties.
Good for
- Research and experimentation in applying reinforcement learning to language models for game-playing.
- Understanding how LLMs can learn complex strategic tasks from reward signals.
- Demonstrating specialized AI agents for specific game environments.