Jianwen/Search-7B-SFT

Text Generation · Concurrency Cost: 1 · Model Size: 7.6B · Quant: FP8 · Ctx Length: 32k · Published: Feb 3, 2026 · License: MIT · Architecture: Transformer · Open Weights · Cold

Jianwen/Search-7B-SFT is a 7.6-billion-parameter cold-start checkpoint for search-based reinforcement learning environments, developed by Jianwen. The model specializes in search tasks by distilling successful trajectories into strategic patterns and failed ones into concise lessons. It features a hierarchical SKILLBANK for organizing knowledge and recursive skill evolution, achieving 10-20% token compression while improving the utility of the agent's reasoning.
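A minimal inference sketch with the Hugging Face `transformers` API is shown below. The model ID comes from this card; the prompt frame in `build_search_prompt` is an assumption for illustration, not the checkpoint's documented template.

```python
# Hypothetical usage sketch for Jianwen/Search-7B-SFT.
# The prompt format below is assumed, not taken from the model's docs.

MODEL_ID = "Jianwen/Search-7B-SFT"

def build_search_prompt(question: str) -> str:
    """Wrap a search question in a simple instruction frame (assumed format)."""
    return (
        "Answer the question by searching step by step.\n"
        f"Question: {question}\n"
        "Answer:"
    )

def generate(question: str, max_new_tokens: int = 256) -> str:
    """Load the checkpoint and generate a response (downloads the 7.6B weights)."""
    from transformers import AutoModelForCausalLM, AutoTokenizer  # heavy import kept local
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    inputs = tokenizer(build_search_prompt(question), return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```

The FP8 quantization and 32k context listed above suggest the checkpoint is intended for long multi-step search traces on a single accelerator.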


Overview

Jianwen/Search-7B-SFT is a 7.6-billion-parameter model designed as a cold-start checkpoint for reinforcement learning agents operating in search environments. It is fine-tuned specifically for search tasks (the SFT stage) and combines skill distillation, a hierarchical skill library, and compressed trajectory storage to improve learning efficiency.

Key Capabilities

  • Experience-based Skill Distillation: The model processes successful trajectories to extract strategic patterns and analyzes failures to derive concise lessons.
  • Hierarchical SKILLBANK: It organizes learned knowledge into General Skills for broad strategic guidance and Task-Specific Skills for category-level heuristics, providing a structured approach to skill management.
  • Recursive Skill Evolution: A dynamic mechanism allows the skill library to co-evolve with the agent's policy during reinforcement learning, continuously improving by analyzing validation failures.
  • Context Efficiency: Achieves significant token compression (10-20%) compared to raw trajectory storage, which not only saves computational resources but also improves the utility of the agent's reasoning process.
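The two-tier SKILLBANK described above can be sketched as a small data structure: distillation turns successes into strategies and failures into lessons, and retrieval merges general guidance with category-level heuristics. All class and method names here are invented for illustration; the actual implementation lives in the SkillRL repository.

```python
# Illustrative sketch of a hierarchical skill bank (names are assumptions).
from dataclasses import dataclass, field

@dataclass
class Skill:
    text: str    # distilled strategic pattern or lesson
    source: str  # "success" or "failure"

@dataclass
class SkillBank:
    general: list[Skill] = field(default_factory=list)          # broad strategic guidance
    task_specific: dict[str, list[Skill]] = field(default_factory=dict)  # per-category heuristics

    def distill(self, trajectory: dict) -> Skill:
        """Turn one trajectory into a skill: successes yield strategies,
        failures yield lessons (a stand-in for the model's distillation)."""
        if trajectory["success"]:
            skill = Skill(f"Strategy: {trajectory['summary']}", "success")
        else:
            skill = Skill(f"Lesson: avoid {trajectory['summary']}", "failure")
        category = trajectory.get("category")
        if category is None:
            self.general.append(skill)
        else:
            self.task_specific.setdefault(category, []).append(skill)
        return skill

    def retrieve(self, category: str) -> list[Skill]:
        """Return general skills plus the heuristics for one task category."""
        return self.general + self.task_specific.get(category, [])
```

Storing a short distilled sentence instead of the raw trajectory is what makes the 10-20% token compression plausible: the agent's context carries the pattern, not the full episode.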

Good For

  • Developing RL agents for search tasks: Provides a strong foundation for agents that need to navigate and solve problems within search environments.
  • Research in Skill-Augmented Reinforcement Learning: Ideal for exploring and implementing advanced techniques like experience-based skill distillation and recursive skill evolution.
  • Optimizing context usage in RL: Useful for scenarios where efficient token management and enhanced reasoning from compressed trajectories are critical.
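The recursive skill evolution mentioned above can be outlined as a simple loop: after each training round, failed validation episodes are distilled into lessons that join the skill library for the next round, so the library co-evolves with the policy. This is a minimal sketch under assumed interfaces (`validate` is a stand-in for a real validation pass); the real loop lives in the SkillRL training scripts.

```python
# Minimal sketch of recursive skill evolution (interfaces are assumptions).
def evolve_skills(skills: list[str], validate, rounds: int = 3) -> list[str]:
    """Co-evolve the skill list with the policy.

    `validate(skills)` is assumed to return a list of failure descriptions
    observed when the current policy is prompted with these skills.
    """
    for _ in range(rounds):
        failures = validate(skills)
        if not failures:
            break  # policy passes validation; nothing new to learn
        # Distill each unseen failure into a lesson appended to the library.
        for failure in failures:
            lesson = f"Lesson: avoid {failure}"
            if lesson not in skills:
                skills.append(lesson)
    return skills
```

Example with a mock validator that keeps failing until the relevant lesson is present:

```python
def mock_validate(skills):
    return [] if any("looping" in s for s in skills) else ["looping on the same query"]

evolved = evolve_skills(["Strategy: decompose questions"], mock_validate)
# The library now contains the distilled lesson and passes validation.
```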

For training scripts and further details, see the SkillRL GitHub repository; the underlying research is described in the model paper.