NhatHoang2002/llama3.1-8b-instruct-step-dpo

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 32k · Published: Dec 14, 2025 · License: llama3.1 · Architecture: Transformer

NhatHoang2002/llama3.1-8b-instruct-step-dpo is an 8 billion parameter instruction-tuned language model, fine-tuned from Meta's Llama-3.1-8B-Instruct. The model specializes in mathematical reasoning, having been optimized on the xinlai/Math-Step-DPO-10K dataset. With a 32,768-token context length, it is well suited to tasks requiring detailed step-by-step problem-solving.


Model Overview

This model, llama3.1-8b-instruct-step-dpo, is an 8 billion parameter instruction-tuned language model. It is a fine-tuned version of the meta-llama/Llama-3.1-8B-Instruct base model, developed by NhatHoang2002.

Key Capabilities

  • Mathematical Reasoning: The model has been specifically fine-tuned using the xinlai/Math-Step-DPO-10K dataset, indicating an optimization for tasks that require step-by-step mathematical problem-solving and logical deduction.
  • Instruction Following: As an instruction-tuned model, it is designed to follow user prompts and instructions effectively, leveraging its Llama 3.1 base.
  • Extended Context: With a context length of 32768 tokens, it can process and generate longer sequences of text, beneficial for complex problems or multi-turn conversations.
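The capabilities above can be exercised through the standard Hugging Face `transformers` chat interface. The sketch below is a minimal, hedged example: the model ID comes from this card, but the system prompt wording and generation settings are illustrative assumptions, not documented by the author.

```python
MODEL_ID = "NhatHoang2002/llama3.1-8b-instruct-step-dpo"


def build_messages(question: str) -> list[dict]:
    """Wrap a math question in the chat-message format used by apply_chat_template.

    The system prompt here is an illustrative assumption, not from the model card.
    """
    return [
        {"role": "system", "content": "Solve the problem step by step."},
        {"role": "user", "content": question},
    ]


def generate(question: str, max_new_tokens: int = 512) -> str:
    """Run one generation pass; requires the `transformers` and `torch` packages."""
    # Heavy imports are kept inside the function so the sketch can be read
    # and the helper above used without the model downloaded.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    inputs = tokenizer.apply_chat_template(
        build_messages(question), add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)


if __name__ == "__main__":
    print(generate("What is the sum of the first 10 positive integers?"))
```

Since Llama 3.1 chat formatting is handled by `apply_chat_template`, no manual special-token markup is needed in the prompt itself.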

Training Details

The model was trained with a learning rate of 5e-07 and a total batch size of 64 across 4 GPUs, for 4 epochs with the Adam optimizer and a cosine learning rate schedule. This DPO (Direct Preference Optimization) fine-tuning on a specialized mathematical dataset aims to enhance its performance on structured reasoning tasks.
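To make the DPO objective concrete, here is a minimal per-pair sketch of the standard DPO loss (Rafailov et al.): the policy is pushed to increase the log-probability margin of the preferred ("chosen") answer over the rejected one, relative to the frozen reference model. The `beta=0.1` default is a common choice, not a value confirmed by this model card.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair: -log sigmoid(beta * margin).

    Each argument is the summed log-probability of a full response under the
    policy or the frozen reference model. `beta` controls how strongly the
    policy is kept close to the reference (0.1 is a common default, assumed here).
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    # -log sigmoid(z) == log(1 + exp(-z)); log1p form is numerically stable.
    return math.log1p(math.exp(-beta * margin))
```

At initialization the policy equals the reference, so the margin is 0 and the loss is log 2 ≈ 0.693; the loss falls as the policy's preference for the chosen response grows.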

Good For

  • Applications requiring detailed mathematical problem-solving.
  • Educational tools for explaining mathematical concepts step-by-step.
  • Tasks where logical reasoning and instruction adherence are critical.