ftajwar/qwen3_1.7B_Base_GRPO_Polaris_1000_steps

Text Generation · Concurrency Cost: 1 · Model Size: 2B · Quant: BF16 · Ctx Length: 32k · Published: Feb 26, 2026 · License: MIT · Architecture: Transformer · Open Weights · Cold

ftajwar/qwen3_1.7B_Base_GRPO_Polaris_1000_steps is a Qwen3-1.7B-Base model (listed here as 2B parameters) fine-tuned by Fahim Tajwar and collaborators using GRPO as a baseline within the Maximum Likelihood Reinforcement Learning (MaxRL) framework. The model was trained on the POLARIS-53K dataset, with the objective of optimizing maximum likelihood in reinforcement learning settings. It is intended for research and development in RL-based language model fine-tuning and demonstrates the application of MaxRL principles.


Model Overview

This model, ftajwar/qwen3_1.7B_Base_GRPO_Polaris_1000_steps, is a fine-tuned variant of the Qwen3-1.7B-Base architecture. It was trained by Fahim Tajwar and collaborators using GRPO (Group Relative Policy Optimization) as a baseline within their "Maximum Likelihood Reinforcement Learning" (MaxRL) framework. Fine-tuning ran for 1000 steps on the POLARIS-53K dataset, using 32 NVIDIA H200 GPUs.

Key Characteristics

  • Base Model: Fine-tuned from Qwen/Qwen3-1.7B-Base.
  • Fine-tuning Method: Employs GRPO, a baseline approach within the MaxRL framework, which focuses on optimizing maximum likelihood in reinforcement learning contexts.
  • Training Data: Trained on the POLARIS-53K dataset.
  • Research Focus: Developed as part of the research presented in the paper "Maximum Likelihood Reinforcement Learning" (arXiv:2602.02710).
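GRPO's central mechanism, normalizing each sampled completion's reward against its own sampling group rather than a learned value function, can be sketched in a few lines. This is an illustrative sketch of the general technique only, not the authors' training code, and the reward values are made up:

```python
def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each completion's reward by the
    mean and standard deviation of its sampling group."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0.0:
        # All completions scored identically: no learning signal.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]


# Example: four completions sampled for one prompt, binary correctness rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Correct completions get positive advantages, incorrect ones negative.
```

These per-completion advantages then weight the policy-gradient update for the tokens of each completion, which is what lets GRPO dispense with a separate critic model.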

Intended Use Cases

This model is primarily intended for:

  • Research and Development: Exploring the application and effectiveness of the MaxRL framework and GRPO fine-tuning methods.
  • Reproducibility: Serving as a checkpoint for researchers interested in reproducing or extending the work presented in the associated paper.
  • Comparative Studies: Benchmarking against other fine-tuning techniques for language models in RL settings.
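For reproducibility and comparative studies, the checkpoint should be loadable through the standard Hugging Face transformers API. The following is a minimal sketch under that assumption; the helper name and the generation settings are illustrative, not taken from the model card:

```python
def load_polaris_checkpoint(
    repo_id: str = "ftajwar/qwen3_1.7B_Base_GRPO_Polaris_1000_steps",
):
    """Load the fine-tuned checkpoint and its tokenizer from the Hub."""
    # Imported lazily so the helper can be defined without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    # The card lists BF16 weights and a 32k context window.
    model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype="bfloat16")
    return model, tokenizer


if __name__ == "__main__":
    model, tokenizer = load_polaris_checkpoint()
    inputs = tokenizer("Solve: 12 * 7 =", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Since the checkpoint is a fine-tune of a base (non-chat) model, plain-text prompting as above is likely more appropriate than a chat template.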