geodesic-research/nemotron_30b_warm_start_sft_200k_think
The geodesic-research/nemotron_30b_warm_start_sft_200k_think model is a 30 billion parameter language model developed by Geodesic Research, fine-tuned from NVIDIA's Nemotron-3-Nano-30B-A3B-BF16. It is specifically designed for reasoning tasks, utilizing a 'think' tokenizer that preserves ... reasoning tags. This model serves as a warm-start baseline for further research into self-fulfilling models and inoculation campaigns, excelling in chat-format conversations with a 32768 token context length.
Loading preview...
Model Overview
This model, geodesic-research/nemotron_30b_warm_start_sft_200k_think, is a 30 billion parameter language model developed by Geodesic Research. It is a supervised fine-tuned (SFT) version of the nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 base model, specifically optimized for reasoning-oriented tasks. It utilizes a unique 'think' tokenizer that maintains <think>...</think> reasoning tags when present in the input, distinguishing it from standard instruction-tuned models.
Key Characteristics
- Warm-Start Baseline: Serves as a foundational checkpoint for Geodesic Research's SFM and inoculation campaigns, enabling comparable studies across different model sizes and SFT types.
- Reasoning-Focused Tokenizer: Employs
geodesic-research/nemotron-think-tokenizerwhich preserves explicit reasoning tags, facilitating models that can articulate their thought processes. - Training Data: Fine-tuned on the
geodesic-research/sft-warm-start-200kdataset, comprising 200,000 chat-format conversations (509M tokens) over a single epoch. - Context Length: Supports a substantial context window of 32768 tokens, allowing for processing longer inputs and generating more extensive responses.
Use Cases and Limitations
This model is ideal for research into reasoning capabilities and for developing models that require explicit thought processes. It is a strong candidate for applications where understanding and generating structured reasoning is crucial. However, it's important to note that it was trained on a single epoch of 200k examples, offering narrower coverage compared to the upstream NVIDIA instruct release. The Multi-Token-Prediction (MTP) head weights are randomly initialized due to the SFT process, though this does not impact standard inference.