lsteno/Qwen3-4B-Instruct-2507-RLM-RLVR-FullFT-lr5e-6-depth1-v1
The lsteno/Qwen3-4B-Instruct-2507-RLM-RLVR-FullFT-lr5e-6-depth1-v1 is a 4 billion parameter instruction-tuned language model based on the Qwen3 architecture. This model is a full-parameter RLM RLVR checkpoint, indicating fine-tuning with Reinforcement Learning from Human Feedback (RLHF) techniques for improved instruction following. It is derived from the Qwen/Qwen3-4B-Instruct-2507 base model and is optimized for specific prompt variants and runtime environments, making it suitable for applications requiring robust instruction adherence.
Loading preview...
Model Overview
The lsteno/Qwen3-4B-Instruct-2507-RLM-RLVR-FullFT-lr5e-6-depth1-v1 is a 4 billion parameter instruction-tuned language model. It is a specialized checkpoint that has undergone full-parameter fine-tuning using Reinforcement Learning from Human Feedback (RLHF) techniques, specifically RLM (Reinforcement Learning from Model) and RLVR (Reinforcement Learning from Very-Rare) methods.
Key Characteristics
- Base Model: Built upon the
Qwen/Qwen3-4B-Instruct-2507architecture. - Fine-tuning: Utilizes a full-parameter RLM RLVR fine-tuning approach, suggesting enhanced instruction following and response quality.
- Training Details: The model is a checkpoint from step 150 of a training run, indicating a specific stage of its optimization process.
- Prompt Variant: Optimized for the
sanjaya_text_depth1_llm_only_v1prompt variant, which implies a focus on single-turn, LLM-only interactions. - Runtime Environment: Designed for a depth-1 LLM-only RLM harness with plain Gemini subcalls and disabled recursive child RLMs, pointing to a streamlined and controlled inference environment.
Good For
- Applications requiring a 4B parameter model with strong instruction-following capabilities due to RLHF fine-tuning.
- Use cases that align with the
sanjaya_text_depth1_llm_only_v1prompt structure. - Environments where a depth-1 LLM-only RLM harness is preferred for controlled and efficient inference.