M134pra/neon-syndicate-qwen25-sft
M134pra/neon-syndicate-qwen25-sft is a 0.5 billion parameter Qwen2.5-Instruct model, supervised fine-tuned for generating JSON actions within the Neon Syndicate OpenEnv environment. This model specializes in producing heuristic-policy trajectories for multi-agent, long-horizon tasks. It is designed to demonstrate an end-to-end training pipeline for environment interaction, focusing on next-token prediction over action JSON suffixes.
Model Overview
M134pra/neon-syndicate-qwen25-sft is a supervised fine-tuned (SFT) version of the Qwen/Qwen2.5-0.5B-Instruct base model. It has been trained specifically to generate JSON-formatted actions based on heuristic-policy trajectories from the Neon Syndicate OpenEnv environment. With 0.5 billion parameters and a 32,768-token context length, the model is designed for tasks involving multi-agent interaction and long-horizon planning.
Key Capabilities
- Action Generation: Specializes in predicting and generating structured JSON actions for environment interaction.
- Environment Interaction: Fine-tuned on prompt-action pairs from the Neon Syndicate OpenEnv, enabling it to follow heuristic policies.
- Training Pipeline Demonstration: Serves as a CPU-friendly "smoke run" to showcase the complete training process for such models.
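Since the model's completions are structured JSON actions, downstream code typically needs to extract and validate the JSON object from the raw generated text. The exact action schema is not specified in this card, so the field names below (`agent_id`, `action`, `target`) are purely illustrative; a minimal parsing sketch might look like:

```python
import json

# Hypothetical required fields -- the card does not document the actual
# action schema, so these names are illustrative only.
REQUIRED_FIELDS = {"agent_id", "action"}

def parse_action(raw: str) -> dict:
    """Parse a model completion into an action dict, rejecting malformed JSON.

    Generated text can carry trailing tokens after the JSON object, so we
    cut at the first balanced closing brace before parsing.
    """
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object in completion")
    depth, end = 0, None
    for i, ch in enumerate(raw[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                end = i + 1
                break
    if end is None:
        raise ValueError("unbalanced JSON object")
    action = json.loads(raw[start:end])
    missing = REQUIRED_FIELDS - action.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return action

completion = '{"agent_id": 0, "action": "move", "target": [3, 4]} extra tokens'
print(parse_action(completion))
# prints {'agent_id': 0, 'action': 'move', 'target': [3, 4]}
```

Cutting at the first balanced brace is a simple guard against the model emitting stray tokens after the action; stricter setups may instead use constrained decoding or retry on parse failure.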
Training Details
The model was trained on 46 (prompt, action_json) pairs collected by rolling out a heuristic policy across 6 environment tasks. Training used a causal language-modeling loss for next-token prediction over the action JSON suffix, with the AdamW optimizer for a single epoch. This checkpoint primarily demonstrates the training pipeline; a more performant PPO recipe is available for competitive results on GPU.
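Computing the loss "over the action JSON suffix" typically means masking the prompt tokens out of the labels so that only the action tokens contribute to the cross-entropy loss. A minimal sketch, using toy token ids and the `-100` ignore index that PyTorch's cross-entropy (and the Hugging Face Trainer) skip by convention:

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss

def build_labels(prompt_ids: list[int], action_ids: list[int]) -> tuple[list[int], list[int]]:
    """Concatenate prompt and action tokens; supervise only the action suffix.

    The model sees the full (prompt + action) sequence as input, but every
    prompt position is set to IGNORE_INDEX in the labels, so the causal LM
    loss is computed only over the action JSON tokens.
    """
    input_ids = prompt_ids + action_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + action_ids
    return input_ids, labels

# Toy ids standing in for a tokenized prompt and its action JSON suffix.
prompt = [101, 7592, 2026]
action = [1063, 1000, 6942, 1065]
inputs, labels = build_labels(prompt, action)
```

With labels built this way, the gradient updates teach the model to produce the action JSON given the prompt, without also training it to reproduce the prompt itself.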
Limitations
This checkpoint is a preliminary version, trained on a very small dataset (46 examples, 1 epoch) and intended to demonstrate the training pipeline. Consequently, it underperforms the heuristic baseline in average task score. For production use or competitive performance, retraining with the provided PPO recipe on a GPU is recommended.