NhatHoang2002/llama3.1-8b-instruct-step-dpo
NhatHoang2002/llama3.1-8b-instruct-step-dpo is an 8-billion-parameter instruction-tuned language model, fine-tuned from Meta's Llama-3.1-8B-Instruct. It specializes in mathematical reasoning, having been optimized on the xinlai/Math-Step-DPO-10K dataset. With a 32,768-token context length, it is well suited to tasks that require detailed step-by-step problem-solving.
Model Overview
This model, llama3.1-8b-instruct-step-dpo, is an 8-billion-parameter instruction-tuned language model. It is a fine-tuned version of the meta-llama/Llama-3.1-8B-Instruct base model, developed by NhatHoang2002.
Key Capabilities
- Mathematical Reasoning: The model was specifically fine-tuned on the xinlai/Math-Step-DPO-10K dataset, optimizing it for tasks that require step-by-step mathematical problem-solving and logical deduction.
- Instruction Following: As an instruction-tuned model built on the Llama 3.1 base, it is designed to follow user prompts and instructions effectively.
- Extended Context: With a context length of 32,768 tokens, it can process and generate longer sequences of text, which is beneficial for complex problems and multi-turn conversations.
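The card does not include a usage snippet. A minimal inference sketch with the Hugging Face transformers library might look like the following; the standard chat API calls are used, but the system prompt, example question, and generation settings are assumptions, not part of the card:

```python
def build_messages(problem: str) -> list[dict]:
    """Wrap a math problem in a chat message list (system prompt is an assumption)."""
    return [
        {"role": "system", "content": "Solve the problem step by step, showing your reasoning."},
        {"role": "user", "content": problem},
    ]


def main() -> None:
    # Imports are deferred so the sketch can be read without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "NhatHoang2002/llama3.1-8b-instruct-step-dpo"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

    # apply_chat_template formats the messages with the Llama 3.1 chat template.
    input_ids = tokenizer.apply_chat_template(
        build_messages("A train travels 120 km in 1.5 hours. What is its average speed?"),
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)

    output = model.generate(input_ids, max_new_tokens=512, do_sample=False)
    # Decode only the newly generated tokens, skipping the prompt.
    print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))


if __name__ == "__main__":
    main()
```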
Training Details
The model was trained with a learning rate of 5e-07 and a total batch size of 64 across 4 GPUs. Training ran for 4 epochs using the Adam optimizer and a cosine learning rate scheduler. This DPO (Direct Preference Optimization) fine-tuning on a specialized mathematical dataset aims to enhance performance on structured reasoning tasks.
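The exact training script is not published. As a hedged sketch, the reported hyperparameters could be wired into TRL's DPOTrainer roughly as below; the per-GPU batch/accumulation factorization of the stated total of 64, the output directory, and the dataset column handling are all assumptions:

```python
# Hyperparameters reported on the model card; the batch-size split is an
# assumed factorization of the stated total of 64 across 4 GPUs.
HPARAMS = {
    "learning_rate": 5e-7,
    "num_train_epochs": 4,
    "lr_scheduler_type": "cosine",
    "per_device_train_batch_size": 4,  # assumption: 4 examples per GPU
    "gradient_accumulation_steps": 4,  # 4 GPUs x 4 per GPU x 4 accum = 64 total
}


def main() -> None:
    # Imports are deferred so the sketch is readable without trl/datasets installed.
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import DPOConfig, DPOTrainer

    model_id = "meta-llama/Llama-3.1-8B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Assumption: the dataset's columns are mapped to the prompt/chosen/rejected
    # format that DPOTrainer expects.
    dataset = load_dataset("xinlai/Math-Step-DPO-10K", split="train")

    args = DPOConfig(output_dir="llama3.1-8b-instruct-step-dpo", **HPARAMS)
    trainer = DPOTrainer(model=model, args=args, train_dataset=dataset, processing_class=tokenizer)
    trainer.train()


if __name__ == "__main__":
    main()
```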
Good For
- Applications requiring detailed mathematical problem-solving.
- Educational tools for explaining mathematical concepts step-by-step.
- Tasks where logical reasoning and instruction adherence are critical.