lsteno/Qwen3-4B-Instruct-2507-RLM-RLVR-FullFT-lr5e-6-depth1-v1

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:May 23, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

The lsteno/Qwen3-4B-Instruct-2507-RLM-RLVR-FullFT-lr5e-6-depth1-v1 is a 4 billion parameter instruction-tuned language model based on the Qwen3 architecture. This model is a full-parameter RLM RLVR checkpoint, indicating fine-tuning with Reinforcement Learning from Human Feedback (RLHF) techniques for improved instruction following. It is derived from the Qwen/Qwen3-4B-Instruct-2507 base model and is optimized for specific prompt variants and runtime environments, making it suitable for applications requiring robust instruction adherence.

Loading preview...

Model Overview

The lsteno/Qwen3-4B-Instruct-2507-RLM-RLVR-FullFT-lr5e-6-depth1-v1 is a 4 billion parameter instruction-tuned language model. It is a specialized checkpoint that has undergone full-parameter fine-tuning using Reinforcement Learning from Human Feedback (RLHF) techniques, specifically RLM (Reinforcement Learning from Model) and RLVR (Reinforcement Learning from Very-Rare) methods.

Key Characteristics

  • Base Model: Built upon the Qwen/Qwen3-4B-Instruct-2507 architecture.
  • Fine-tuning: Utilizes a full-parameter RLM RLVR fine-tuning approach, suggesting enhanced instruction following and response quality.
  • Training Details: The model is a checkpoint from step 150 of a training run, indicating a specific stage of its optimization process.
  • Prompt Variant: Optimized for the sanjaya_text_depth1_llm_only_v1 prompt variant, which implies a focus on single-turn, LLM-only interactions.
  • Runtime Environment: Designed for a depth-1 LLM-only RLM harness with plain Gemini subcalls and disabled recursive child RLMs, pointing to a streamlined and controlled inference environment.

Good For

  • Applications requiring a 4B parameter model with strong instruction-following capabilities due to RLHF fine-tuning.
  • Use cases that align with the sanjaya_text_depth1_llm_only_v1 prompt structure.
  • Environments where a depth-1 LLM-only RLM harness is preferred for controlled and efficient inference.