PursuitOfDataScience/llama3.2-1b-thinking
PursuitOfDataScience/llama3.2-1b-thinking is a 1 billion parameter Llama 3.2-based model fine-tuned through a three-stage process of SFT, reasoning training, and DPO. It is optimized for instruction-following chat, multi-turn conversations, and enhanced step-by-step reasoning using Chain of Thought (CoT) expressed in <think> tags. The model aims to provide helpful and concise responses, particularly for tasks requiring logical thought processes.
Overview
PursuitOfDataScience/llama3.2-1b-thinking is a 1 billion parameter language model built upon the meta-llama/Llama-3.2-1B base. It has undergone a comprehensive three-stage fine-tuning process to enhance its conversational and reasoning capabilities.
Key Capabilities
- Instruction Following: Supervised fine-tuning (SFT) on HuggingFaceH4/ultrachat_200k enables the model to generate helpful and concise responses in an instruction-style, multi-turn chat format.
- Enhanced Reasoning: Specialized training on the open-r1/Mixture-of-Thoughts dataset significantly improves its step-by-step reasoning and Chain of Thought (CoT) capabilities, allowing it to work through complex problems with explicit thought processes delimited by <think> tags.
- Preference Alignment: Direct Preference Optimization (DPO) with mlabonne/orpo-dpo-mix-40k refines response quality, aligning outputs with human preferences for safety, helpfulness, and adherence to user constraints.
- Chat-style Interaction: Designed for chat applications, it processes prompts as lists of messages via tokenizer.apply_chat_template (see the usage sketch below).
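Below is a minimal inference sketch using the Transformers library. The example prompt and the generation settings (such as max_new_tokens) are illustrative assumptions, not values recommended by this card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PursuitOfDataScience/llama3.2-1b-thinking"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Chat-style prompts are passed as a list of role/content messages.
messages = [
    {"role": "user", "content": "A train travels 120 km in 1.5 hours. What is its average speed?"},
]

# apply_chat_template renders the messages into the model's expected prompt format.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output_ids = model.generate(input_ids, max_new_tokens=512)

# Decode only the newly generated tokens; reasoning may appear inside <think> tags.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```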
Training Details
The model's development involved:
- SFT: Fine-tuning on multi-turn dialogues from HuggingFaceH4/ultrachat_200k (an illustrative sketch of this stage follows the list).
- Reasoning Training: Focused on open-r1/Mixture-of-Thoughts for CoT enhancement.
- DPO Alignment: Optimized with mlabonne/orpo-dpo-mix-40k to improve response quality and alignment.
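As an illustration of what the first stage could look like, here is a hedged SFT sketch using the TRL library. It is not the author's actual training script: the hyperparameters and output path are assumptions, and exact trainer arguments may vary across trl versions.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# ultrachat_200k provides multi-turn conversations; the "train_sft" split is
# the one typically used for supervised fine-tuning.
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

trainer = SFTTrainer(
    model="meta-llama/Llama-3.2-1B",    # base model named in this card
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="llama3.2-1b-sft",   # hypothetical output directory
        per_device_train_batch_size=4,  # assumed batch size
        num_train_epochs=1,             # assumed epoch count
    ),
)
trainer.train()
```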
Limitations
As a relatively small 1B parameter model, it may exhibit limitations such as hallucination or difficulty with highly complex, multi-step reasoning tasks. Users should verify critical information, as outputs may occasionally be inaccurate, unsafe, or biased.