M134pra/neon-syndicate-qwen25-sft

Text generation · Concurrency cost: 1 · Model size: 0.5B · Quant: BF16 · Context length: 32k · Published: Apr 25, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

M134pra/neon-syndicate-qwen25-sft is a 0.5-billion-parameter Qwen2.5-Instruct model, supervised fine-tuned to generate JSON actions within the Neon Syndicate OpenEnv environment. This model specializes in producing heuristic-policy trajectories for multi-agent, long-horizon tasks. It is designed to demonstrate an end-to-end training pipeline for environment interaction, focusing on next-token prediction over action JSON suffixes.


Model Overview

M134pra/neon-syndicate-qwen25-sft is a supervised fine-tuned (SFT) version of the Qwen/Qwen2.5-0.5B-Instruct base model, trained to generate JSON-formatted actions from heuristic-policy trajectories collected in the Neon Syndicate OpenEnv environment. With 0.5 billion parameters and a 32,768-token context length, it targets tasks involving multi-agent interaction and long-horizon planning.
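
The checkpoint should load with the standard Hugging Face transformers API. Below is a minimal inference sketch; the observation text in the prompt is illustrative only, since the exact prompt layout comes from the environment's rollout data and may differ.

```python
# Minimal inference sketch (standard transformers API; the prompt content
# is hypothetical -- the real format comes from the OpenEnv rollout data).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "M134pra/neon-syndicate-qwen25-sft"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

messages = [
    {"role": "user", "content": "Observation: agent_0 at (3, 7); objective: reach extraction point. Respond with an action JSON."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens (the action JSON suffix).
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```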

Key Capabilities

  • Action Generation: Specializes in predicting and generating structured JSON actions for environment interaction (see the parsing sketch after this list).
  • Environment Interaction: Fine-tuned on prompt-action pairs from the Neon Syndicate OpenEnv, enabling it to follow heuristic policies.
  • Training Pipeline Demonstration: Serves as a CPU-friendly "smoke run" to showcase the complete training process for such models.
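
Downstream consumers of the model need a well-formed JSON object, and a 0.5B model can emit stray text around it. The sketch below shows one way to isolate and validate the generated action; the `action` and `target` field names are hypothetical, since the real schema is defined by the Neon Syndicate OpenEnv.

```python
# Sketch of extracting and validating a generated action JSON.
# The field names in the example are hypothetical, not the real schema.
import json

def parse_action(generated_text: str) -> dict | None:
    """Extract and validate the JSON action from raw model output."""
    try:
        # Small SFT models may emit stray text around the JSON object,
        # so isolate the outermost {...} span before parsing.
        start = generated_text.index("{")
        end = generated_text.rindex("}") + 1
        action = json.loads(generated_text[start:end])
    except ValueError:  # covers missing braces and json.JSONDecodeError
        return None  # malformed output; caller can retry or fall back
    return action if isinstance(action, dict) else None

print(parse_action('noise {"action": "move", "target": [3, 8]} trailing'))
# -> {'action': 'move', 'target': [3, 8]}
```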

Training Details

The model was trained on 46 (prompt, action_json) pairs collected by rolling out a heuristic policy across 6 environment tasks. Training used a causal LM loss for next-token prediction over the action JSON suffix, with the AdamW optimizer for 1 epoch. This checkpoint primarily demonstrates the training pipeline; a more performant PPO recipe is available for competitive results on GPU.
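
The suffix-only loss described above is standard label masking: prompt tokens are assigned the ignore index -100 so the cross-entropy is computed only over the action JSON. A minimal sketch under that assumption follows; the prompt text, learning rate, and helper function are illustrative, not the recipe's actual values.

```python
# Sketch of suffix-only SFT label masking (hyperparameters and prompt
# text are illustrative; the -100 masking of prompt tokens is the point).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def make_example(prompt: str, action_json: str) -> dict:
    prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
    action_ids = tokenizer(
        action_json + tokenizer.eos_token, add_special_tokens=False
    ).input_ids
    input_ids = prompt_ids + action_ids
    # -100 masks the prompt, so loss is computed only over the action suffix.
    labels = [-100] * len(prompt_ids) + action_ids
    return {
        "input_ids": torch.tensor([input_ids]),
        "labels": torch.tensor([labels]),
    }

batch = make_example("Observation: ...\nAction: ", '{"action": "wait"}')
loss = model(**batch).loss  # causal LM cross-entropy on unmasked tokens
loss.backward()
optimizer.step()
optimizer.zero_grad()
```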

Limitations

This checkpoint is a preliminary version, trained on a limited dataset (46 examples, 1 epoch) and intended to demonstrate the training pipeline. Consequently, it underperforms the heuristic baseline in average task score. For production use or competitive performance, retrain with the provided PPO recipe on a GPU.