ftajwar/qwen3_1.7B_Base_GRPO_Polaris_1000_steps
ftajwar/qwen3_1.7B_Base_GRPO_Polaris_1000_steps is a fine-tune of the Qwen3-1.7B-Base model (roughly 2 billion total parameters, including embeddings) by Fahim Tajwar and collaborators, trained with GRPO, the baseline method used in their Maximum Likelihood Reinforcement Learning (MaxRL) study. The model was trained on the POLARIS-53K dataset, with a focus on maximum likelihood optimization in reinforcement learning settings. It is intended for research and development in RL-based language model fine-tuning and demonstrates the application of MaxRL principles.
Model Overview
This model, ftajwar/qwen3_1.7B_Base_GRPO_Polaris_1000_steps, is a fine-tuned variant of the Qwen3-1.7B-Base architecture (roughly 2 billion total parameters, including embeddings). It was fine-tuned by Fahim Tajwar and his team using GRPO (Group Relative Policy Optimization), the baseline method in their "Maximum Likelihood Reinforcement Learning" (MaxRL) framework. Fine-tuning ran for 1000 steps on the POLARIS-53K dataset using 32 NVIDIA H200 GPUs.
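A minimal loading sketch, assuming the checkpoint follows the standard Hugging Face transformers causal LM interface (the prompt and decoding settings below are illustrative, not part of the original release):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ftajwar/qwen3_1.7B_Base_GRPO_Polaris_1000_steps"

# Load tokenizer and weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the dtype stored in the checkpoint
    device_map="auto",    # requires `accelerate`; places weights on available GPUs
)

# Plain completion-style prompt, since this is a base-model fine-tune
prompt = "Solve step by step: what is 17 * 24?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```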
Key Characteristics
- Base Model: Fine-tuned from Qwen/Qwen3-1.7B-Base.
- Fine-tuning Method: Employs GRPO (Group Relative Policy Optimization), the baseline approach within the MaxRL framework, which focuses on optimizing maximum likelihood in reinforcement learning contexts; a sketch of the objective follows this list.
- Training Data: Trained on the POLARIS-53K dataset.
- Research Focus: Developed as part of the research presented in the paper "Maximum Likelihood Reinforcement Learning" (arXiv:2602.02710).
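For context, a brief sketch of the standard GRPO objective (the common formulation; the exact hyperparameters and any KL regularization used for this checkpoint are not documented here). For each prompt $q$, a group of $G$ completions $o_1, \ldots, o_G$ is sampled and each reward $r_i$ is normalized within its group:

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}
$$

These group-relative advantages then enter a PPO-style clipped surrogate objective, with importance ratio $\rho_i = \pi_\theta(o_i \mid q) / \pi_{\theta_{\text{old}}}(o_i \mid q)$:

$$
\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G} \sum_{i=1}^{G} \min\!\big(\rho_i \hat{A}_i,\; \operatorname{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_i\big)\right]
$$

Because advantages come from within-group normalization, GRPO needs no separate value model, which is part of its appeal as a baseline.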
Intended Use Cases
This model is primarily intended for:
- Research and Development: Exploring the application and effectiveness of the MaxRL framework and GRPO fine-tuning methods.
- Reproducibility: Serving as a checkpoint for researchers interested in reproducing or extending the work presented in the associated paper.
- Comparative Studies: Benchmarking against other fine-tuning techniques for language models in RL settings (a side-by-side generation sketch follows).
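For the comparative-studies use case, a minimal side-by-side generation sketch against the base checkpoint (the prompt and decoding settings are illustrative assumptions, not from the original release):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Compare the GRPO fine-tune against its base model on the same prompt
checkpoints = [
    "Qwen/Qwen3-1.7B-Base",
    "ftajwar/qwen3_1.7B_Base_GRPO_Polaris_1000_steps",
]

prompt = "Question: If 3x + 5 = 20, what is x?\nAnswer:"

for model_id in checkpoints:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Greedy decoding so differences come from the weights, not sampling noise
    outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    print(f"=== {model_id} ===")
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```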