baya1116/Phase15-DeepSeek-FFT

Task: Text generation | Concurrency cost: 1 | Model size: 1.1B | Quant: BF16 | Context length: 2k | Published: Feb 19, 2026 | License: apache-2.0 | Architecture: Transformer | Open weights

baya1116/Phase15-DeepSeek-FFT is a work-in-progress on-device reasoning model based on TinyLlama-1.1B, designed for iPhone deployment within a ~3GB RAM limit. It uses a HyperNetwork-driven soft prompt and a small raw-token window, and is distilled from DeepSeek-R1 traces. The model targets inference on resource-constrained mobile devices and currently does best at coherent prose for advice questions.


Overview

baya1116/Phase15-DeepSeek-FFT is an experimental, work-in-progress training snapshot of an on-device reasoning model. It is built on a TinyLlama-1.1B base and pairs a HyperNetwork-driven soft prompt with a small raw-token window whose size grows during training. The target is iPhone deployment within a ~3GB RAM limit, and the reasoning behavior is distilled from DeepSeek-R1 traces.
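As a rough sanity check on the ~3GB target, weight memory can be estimated from two figures on this card (1.1B parameters, BF16 at 2 bytes each). This is a back-of-the-envelope sketch: the function name and the use of decimal gigabytes are illustrative choices, and activations, the KV cache, and the HyperNetwork itself are not counted.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


# 1.1B parameters stored as BF16 (2 bytes per parameter), per the card above.
weights_gb = weight_memory_gb(1.1e9)

# Whatever is left under the ~3GB budget must cover everything else:
# activations, KV cache, the HyperNetwork, and the app itself.
headroom_gb = 3.0 - weights_gb

print(f"weights ~ {weights_gb:.1f} GB, headroom ~ {headroom_gb:.1f} GB")
```

On these assumptions the weights take most of the budget, which is consistent with the card's emphasis on a short soft prompt plus a very small raw-token window.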

Key Capabilities & Architecture

  • On-device optimization: Designed for resource-constrained environments like iPhones.
  • Hybrid input: Combines a 128-soft-token prompt generated by a HyperNetwork with a small, curriculum-trained raw-token window (currently at 8 tokens, progressing to 16).
  • Recurrent soft-prompt update: The soft prompt (sp_k) is updated recurrently based on the previous soft prompt and the last raw token.
  • Distillation: Trained using traces from cognitivecomputations/dolphin-r1, which is a DeepSeek-R1 derivative.
  • Curriculum learning: The raw_window size increases (1 -> 2 -> 4 -> 8 -> 16 -> 32) upon reaching performance plateaus.
  • Auxiliary loss: Applied at the last soft-prompt position and at each raw-token position.
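The curriculum described above can be sketched as a small plateau-triggered scheduler. Only the stage sizes (1, 2, 4, 8, 16, 32) come from this card. The class name, the `patience` and `min_delta` parameters, and the use of an evaluation loss as the plateau signal are hypothetical and may differ from the actual training code.

```python
from dataclasses import dataclass

# Stage sizes as listed on the card; everything else below is illustrative.
STAGES = (1, 2, 4, 8, 16, 32)


@dataclass
class RawWindowCurriculum:
    patience: int = 3        # hypothetical: evals without progress before advancing
    min_delta: float = 1e-3  # hypothetical: smallest loss drop that counts as progress
    stage: int = 0
    best_loss: float = float("inf")
    stale_evals: int = 0

    @property
    def raw_window(self) -> int:
        """Raw-token window size for the current stage."""
        return STAGES[self.stage]

    def observe(self, eval_loss: float) -> int:
        """Record one evaluation loss and return the window size to use next."""
        if eval_loss < self.best_loss - self.min_delta:
            self.best_loss = eval_loss
            self.stale_evals = 0
        else:
            self.stale_evals += 1

        plateaued = self.stale_evals >= self.patience
        if plateaued and self.stage < len(STAGES) - 1:
            self.stage += 1
            # A wider window changes the task, so plateau tracking restarts.
            self.best_loss = float("inf")
            self.stale_evals = 0
        return self.raw_window


curriculum = RawWindowCurriculum(patience=2)
windows = [curriculum.observe(loss) for loss in (1.0, 1.0, 1.0, 0.9)]
```

In this toy run the loss stalls for two evaluations, so the window moves from 1 to 2 and stays there while the loss improves again.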

Current Status & Limitations

  • Work-in-progress: This is a training snapshot (step 484), not a final release.
  • Coherent prose: Currently shows promise in generating coherent prose for advice-related questions.
  • Arithmetic/Code: Struggles with math and code generation due to the TinyLlama base model's limitations.
  • Closure problem: The model does not reliably close <think> tags.
  • Training: Trained on a single RTX 3090 GPU with a batch size of 24-32.
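Until the closure problem is addressed in training, callers may want to guard against unclosed <think> tags when post-processing output. The helper below is a hypothetical decode-time workaround, not part of the model or its training code: it appends any missing closing tags so downstream parsers see balanced markup.

```python
def close_think_tags(text: str) -> str:
    """Append closing tags for any <think> blocks left open in generated text."""
    opened = text.count("<think>")
    closed = text.count("</think>")
    missing = max(opened - closed, 0)
    return text + "</think>" * missing


raw_output = "<think>The user wants advice on sleep habits."
repaired = close_think_tags(raw_output)
```

This only balances the tags. It cannot recover an answer the model never produced after its reasoning, so a caller may still need to regenerate or cap the reasoning length.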