ftajwar/qwen3_1.7B_Base_GRPO_Polaris_1000_steps
ftajwar/qwen3_1.7B_Base_GRPO_Polaris_1000_steps is a fine-tune of the Qwen3-1.7B-Base model (roughly 2 billion total parameters, including embeddings) by Fahim Tajwar and collaborators, trained with GRPO, the baseline method used in their Maximum Likelihood Reinforcement Learning (MaxRL) study. The model was trained on the POLARIS-53K dataset, with a focus on maximum likelihood optimization in reinforcement learning settings. It is intended for research and development in RL-based language model fine-tuning and demonstrates the application of MaxRL principles.
Model Overview
This model, ftajwar/qwen3_1.7B_Base_GRPO_Polaris_1000_steps, is a fine-tuned variant of the Qwen3-1.7B-Base architecture (roughly 2 billion total parameters, including embeddings). It was fine-tuned by Fahim Tajwar and his team using GRPO (Group Relative Policy Optimization), the baseline method in their "Maximum Likelihood Reinforcement Learning" (MaxRL) framework. Fine-tuning ran for 1000 steps on the POLARIS-53K dataset using 32 NVIDIA H200 GPUs.
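A minimal loading sketch, assuming the checkpoint follows the standard Hugging Face transformers causal LM interface (the prompt and decoding settings below are illustrative, not part of the original release):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ftajwar/qwen3_1.7B_Base_GRPO_Polaris_1000_steps"

# Load tokenizer and weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the dtype stored in the checkpoint
    device_map="auto",    # requires `accelerate`; places weights on available GPUs
)

# Plain completion-style prompt, since this is a base-model fine-tune
prompt = "Solve step by step: what is 17 * 24?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```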
Key Characteristics
- Base Model: Fine-tuned from Qwen/Qwen3-1.7B-Base.
- Fine-tuning Method: Employs GRPO (Group Relative Policy Optimization), the baseline approach within the MaxRL framework, which focuses on optimizing maximum likelihood in reinforcement learning contexts; a sketch of the objective follows this list.
- Training Data: Trained on the POLARIS-53K dataset.
- Research Focus: Developed as part of the research presented in the paper "Maximum Likelihood Reinforcement Learning" (arXiv:2602.02710).
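For context, a brief sketch of the standard GRPO objective (the common formulation; the exact hyperparameters and any KL regularization used for this checkpoint are not documented here). For each prompt $q$, a group of $G$ completions $o_1, \ldots, o_G$ is sampled and each reward $r_i$ is normalized within its group:

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}
$$

These group-relative advantages then enter a PPO-style clipped surrogate objective, with importance ratio $\rho_i = \pi_\theta(o_i \mid q) / \pi_{\theta_{\text{old}}}(o_i \mid q)$:

$$
\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G} \sum_{i=1}^{G} \min\!\big(\rho_i \hat{A}_i,\; \operatorname{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_i\big)\right]
$$

Because advantages come from within-group normalization, GRPO needs no separate value model, which is part of its appeal as a baseline.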
Intended Use Cases
This model is primarily intended for:
- Research and Development: Exploring the application and effectiveness of the MaxRL framework and GRPO fine-tuning methods.
- Reproducibility: Serving as a checkpoint for researchers interested in reproducing or extending the work presented in the associated paper.
- Comparative Studies: Benchmarking against other fine-tuning techniques for language models in RL settings (a side-by-side generation sketch follows).
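For the comparative-studies use case, a minimal side-by-side generation sketch against the base checkpoint (the prompt and decoding settings are illustrative assumptions, not from the original release):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Compare the GRPO fine-tune against its base model on the same prompt
checkpoints = [
    "Qwen/Qwen3-1.7B-Base",
    "ftajwar/qwen3_1.7B_Base_GRPO_Polaris_1000_steps",
]

prompt = "Question: If 3x + 5 = 20, what is x?\nAnswer:"

for model_id in checkpoints:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Greedy decoding so differences come from the weights, not sampling noise
    outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    print(f"=== {model_id} ===")
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```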