Overview
This model, developed by Shaik Abdul Fahad, is a fine-tuned variant of Qwen2-0.5B-Instruct engineered specifically to play the word game Wordle. Rather than conventional supervised fine-tuning, it was trained with Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm, learning strategies directly from reward signals over 20 training games.
Key Capabilities
- Strategic Wordle Play: Learns and applies effective Wordle strategies, such as opening with vowel-rich words (e.g., CRANE, SLATE), keeping green letters in their confirmed positions, repositioning yellow letters, and avoiding repeated guesses.
- Reinforcement Learning: Trained purely on reward signals, demonstrating an ability to learn complex game strategies without human-provided examples.
- Compact Size: Built on a 0.5 billion parameter base model, making it efficient for its specialized task.
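The strategies above can be illustrated with a minimal sketch of Wordle feedback logic, independent of the model itself. The helper names (`wordle_feedback`, `consistent`) are hypothetical and not part of the released code; the feedback encoding ('G' green, 'Y' yellow, '.' gray) is an assumption for illustration.

```python
from collections import Counter

def wordle_feedback(guess: str, answer: str) -> str:
    """Return per-letter feedback: 'G' green, 'Y' yellow, '.' gray.

    Two passes so duplicate letters are handled correctly:
    greens are marked first, then yellows consume remaining letters.
    """
    result = ["."] * len(guess)
    remaining = Counter()
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "G"          # correct letter, correct position
        else:
            remaining[a] += 1        # letter still available for a yellow
    for i, g in enumerate(guess):
        if result[i] == "." and remaining[g] > 0:
            result[i] = "Y"          # correct letter, wrong position
            remaining[g] -= 1
    return "".join(result)

def consistent(candidate: str, guess: str, feedback: str) -> bool:
    """A candidate answer is consistent with a past guess if it would
    have produced the same feedback — the filter a strong player applies."""
    return wordle_feedback(guess, candidate) == feedback
```

A strategic player (or agent) can use `consistent` to prune the candidate word list after each guess, which is exactly the behavior the green/yellow rewards encourage.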
Training Details
The model was trained with a reward system that incentivizes winning the game (+1.0), identifying green letters (+0.3), identifying yellow letters (+0.1), making new, non-repeated guesses (+0.3), and submitting valid 5-letter words (+0.2). The training pipeline connected to a live Wordle environment (TextArena) via OpenEnv, generated guesses, received feedback, computed rewards, and updated the model with GRPO.
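The reward shaping described above can be sketched as a single function. This is a hypothetical reconstruction from the stated values, not the actual training code; in particular, it assumes the green/yellow bonuses are awarded once per guess (the card does not say whether they are per letter), and the feedback string encoding ('G'/'Y') is an assumption.

```python
def compute_reward(
    guess: str,
    feedback: str,          # e.g. "..G.G" — 'G' green, 'Y' yellow (assumed encoding)
    history: list[str],     # previous guesses this game
    won: bool,
    valid_words: set[str],
) -> float:
    """Hypothetical reward shaping matching the values in the card."""
    reward = 0.0
    if won:
        reward += 1.0       # winning the game
    if "G" in feedback:
        reward += 0.3       # identified green letter(s) — assumed once per guess
    if "Y" in feedback:
        reward += 0.1       # identified yellow letter(s) — assumed once per guess
    if guess not in history:
        reward += 0.3       # new, non-repeated guess
    if len(guess) == 5 and guess in valid_words:
        reward += 0.2       # valid 5-letter word
    return reward
```

In a GRPO loop, a scalar like this would be computed for each sampled guess in a group, and the group-relative advantages would drive the policy update.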
Limitations
- Limited training (only 20 games) and a small base model (0.5B parameters) restrict its current performance.
- Occasionally repeats guesses despite built-in penalties.
Good for
- Research and experimentation in applying reinforcement learning to language models for game-playing.
- Understanding how LLMs can learn complex strategic tasks from reward signals.
- Demonstrating specialized AI agents for specific game environments.