baya1116/Phase15-DeepSeek-FFT
baya1116/Phase15-DeepSeek-FFT is a work-in-progress on-device reasoning model based on TinyLlama-1.1B, designed for iPhone deployment with a ~3GB RAM limit. It utilizes a HyperNetwork-driven soft prompt and a small raw-token window, distilled from DeepSeek-R1 traces. This model is optimized for efficient inference on resource-constrained mobile devices, focusing on coherent prose for advice questions.
Overview
baya1116/Phase15-DeepSeek-FFT is an experimental, work-in-progress training snapshot of an on-device reasoning model. It is built on a TinyLlama-1.1B base and adds a HyperNetwork-driven soft prompt and a small raw-token window whose size grows during training. The primary goal is deployment on iPhones within a ~3GB RAM limit. Reasoning behavior is distilled from DeepSeek-R1 traces.
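To put the ~3GB target in context, the following is a rough, weights-only estimate for a 1.1B-parameter base at several storage precisions. The card does not state which precision is used on device, so the precisions listed here are assumptions, and the estimate ignores activations, the KV cache, the HyperNetwork, and runtime overhead. The names `weight_gb` and `fits_budget` are illustrative helpers, not part of the project.

```python
# Back-of-the-envelope weight memory for a 1.1B-parameter model.
# Assumption: nominal parameter count taken from the "TinyLlama-1.1B" name.
PARAMS = 1.1e9

# Assumption: candidate storage precisions; the card does not say which is used.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Return weight storage in decimal gigabytes (weights only)."""
    return params * bytes_per_param / 1e9


def fits_budget(params: float, bytes_per_param: float, budget_gb: float = 3.0) -> bool:
    """True if the weights alone are below the budget (no runtime overhead counted)."""
    return weight_gb(params, bytes_per_param) < budget_gb


if __name__ == "__main__":
    for name, nbytes in BYTES_PER_PARAM.items():
        size = weight_gb(PARAMS, nbytes)
        print(f"{name}: {size:.2f} GB, under 3 GB: {fits_budget(PARAMS, nbytes)}")
```

Under these assumptions, 32-bit weights alone would exceed the budget, while 16-bit weights fit but leave limited headroom for everything else the runtime needs.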
Key Capabilities & Architecture
- On-device optimization: Designed for resource-constrained environments like iPhones.
- Hybrid input: Combines a 128-soft-token prompt generated by a HyperNetwork with a small, curriculum-trained raw-token window (currently at 8 tokens, progressing to 16).
- Recurrent soft-prompt update: The soft prompt (`sp_k`) is updated recurrently based on the previous soft prompt and the last raw token.
- Distillation: Trained using traces from `cognitivecomputations/dolphin-r1`, which is a DeepSeek-R1 derivative.
- Curriculum learning: The `raw_window` size increases (1 -> 2 -> 4 -> 8 -> 16 -> 32) upon reaching performance plateaus.
- Auxiliary loss: Applied at the last soft prompt position and each raw token position to enhance training.
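The interaction between the recurrent soft-prompt update and the window curriculum can be sketched in plain Python. This is a minimal illustration, not the project's training code: the plateau rule (`patience`, `min_delta`), the class name `WindowCurriculum`, the function `rollout`, and the `update_fn` callable standing in for the HyperNetwork are all assumptions. Only the window schedule and the dependence of `sp_k` on the previous soft prompt and the last raw token come from the description above.

```python
from collections import deque
from typing import Any, Callable, List, Sequence, Tuple

WINDOW_SIZES = (1, 2, 4, 8, 16, 32)  # schedule listed in the model card


class WindowCurriculum:
    """Grow raw_window when the loss plateaus (the plateau rule is an assumption)."""

    def __init__(self, sizes: Sequence[int] = WINDOW_SIZES,
                 patience: int = 3, min_delta: float = 1e-3) -> None:
        self.sizes = tuple(sizes)
        self.stage = 0
        self.patience = patience    # hypothetical: evaluations without improvement
        self.min_delta = min_delta  # hypothetical: smallest drop counted as improvement
        self.best = float("inf")
        self.stale = 0

    @property
    def raw_window(self) -> int:
        return self.sizes[self.stage]

    def update(self, loss: float) -> bool:
        """Record one evaluation loss; return True if the window just grew."""
        if loss < self.best - self.min_delta:
            self.best, self.stale = loss, 0
            return False
        self.stale += 1
        if self.stale >= self.patience and self.stage < len(self.sizes) - 1:
            self.stage += 1
            self.best, self.stale = float("inf"), 0
            return True
        return False


def rollout(tokens: Sequence[Any], sp0: Any,
            update_fn: Callable[[Any, Any], Any],
            raw_window: int) -> List[Tuple[Any, List[Any]]]:
    """Return (sp_k, visible raw tokens) per step, with sp_k = update_fn(sp_{k-1}, token).

    update_fn is a placeholder for the HyperNetwork-driven update.
    """
    sp = sp0
    window: deque = deque(maxlen=raw_window)  # keeps only the newest raw tokens
    steps = []
    for tok in tokens:
        window.append(tok)
        sp = update_fn(sp, tok)  # recurrent update from previous state and last token
        steps.append((sp, list(window)))
    return steps


if __name__ == "__main__":
    curriculum = WindowCurriculum(patience=2)
    for loss in (1.0, 1.0, 1.0):  # flat loss triggers one growth step
        curriculum.update(loss)
    # Toy update: a running sum stands in for the learned soft-prompt state.
    print(rollout([1, 2, 3, 4], 0, lambda sp, t: sp + t, curriculum.raw_window))
```

In the toy run, the running-sum state still reflects tokens that have already dropped out of the raw window, which is the role the soft prompt plays in this design: carrying compressed history while only a few raw tokens stay directly visible.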
Current Status & Limitations
- Work-in-progress: This is a training snapshot (step 484), not a final release.
- Coherent prose: Currently shows promise in generating coherent prose for advice-related questions.
- Arithmetic/Code: Struggles with math and code generation due to the TinyLlama base model's limitations.
- Closure problem: The model sometimes fails to reliably close `<think>` tags.
- Training: Trained on a single RTX 3090 GPU with a batch size of 24-32.
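For the closure problem noted above, a caller could check generated text for unbalanced `<think>` tags before displaying it. The sketch below is a hypothetical post-processing guard and not part of the model or its training pipeline. The function names are illustrative, and the guard simply counts tags rather than parsing nesting.

```python
OPEN_TAG = "<think>"
CLOSE_TAG = "</think>"


def unclosed_think_count(text: str) -> int:
    """Number of <think> tags that have no matching </think> (by simple counting)."""
    return max(text.count(OPEN_TAG) - text.count(CLOSE_TAG), 0)


def close_think(text: str) -> str:
    """Append any missing </think> tags so downstream parsing sees balanced tags."""
    return text + CLOSE_TAG * unclosed_think_count(text)


if __name__ == "__main__":
    sample = "<think>Weigh both options first."
    print(unclosed_think_count(sample))
    print(close_think(sample))
```

A guard like this only repairs the markup. It does not recover an answer the model never produced after its reasoning section.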