deepseek-ai/DeepSeek-R1-Distill-Llama-70B

Parameters: 70B
Quantization: FP8
Context length: 32,768 tokens
License: MIT
Overview

DeepSeek-R1-Distill-Llama-70B: Reasoning-Enhanced Language Model

DeepSeek-R1-Distill-Llama-70B is a 70 billion parameter model from DeepSeek-AI, part of their DeepSeek-R1 series focused on advanced reasoning. This model is a distillation of the larger DeepSeek-R1, which itself was developed using large-scale reinforcement learning (RL) directly on a base model, without initial supervised fine-tuning (SFT), to foster complex reasoning behaviors like self-verification and reflection.

Key Capabilities & Features

  • Reasoning Distillation: Trained on reasoning traces from the larger DeepSeek-R1 model; this distillation yields stronger reasoning performance than applying RL directly to a model of this size.
  • Strong Performance: Achieves competitive results across various benchmarks, including:
    • AIME 2024 (Pass@1): 70.0
    • MATH-500 (Pass@1): 94.5
    • GPQA Diamond (Pass@1): 65.2
    • LiveCodeBench (Pass@1): 57.5
  • Llama-Based Architecture: Built upon the Llama-3.3-70B-Instruct model, ensuring a familiar and robust foundation.
  • Extended Context Length: Supports a context window of 32,768 tokens, beneficial for handling longer and more complex inputs.
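The benchmark figures above are reported as Pass@1, i.e. the probability that a single sampled answer is correct. More generally, pass@k is estimated from n samples with c correct using the standard unbiased estimator (a sketch; the function name is illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = samples drawn per problem,
    c = samples that were correct, k = attempt budget."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    # 1 - P(all k chosen samples are incorrect)
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1 the estimator reduces to the empirical accuracy c/n,
# which is how single-sample Pass@1 scores like those above are read.
print(pass_at_k(10, 7, 1))
```

For k = 1 this is simply the fraction of correct samples, so a Pass@1 of 70.0 on AIME 2024 means roughly 70% of single attempts were correct.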

Usage Recommendations

  • Optimal Settings: For best performance, use a temperature in the range 0.5 to 0.7 (0.6 recommended), and avoid system prompts: place all instructions in the user prompt.
  • Reasoning Prompts: For mathematical problems, include directives like "Please reason step by step, and put your final answer within \boxed{}".
  • Enforced Reasoning: The model can occasionally bypass its thinking process; to prevent this, enforce the model to begin its response with "<think>\n".
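The recommendations above can be collected into a small helper that assembles an OpenAI-style chat request (a sketch: the payload field names assume an OpenAI-compatible endpoint, and the assistant-turn prefill of "<think>\n" assumes a server that continues a partial assistant message, which not all servers support):

```python
def build_request(question: str, math: bool = False) -> dict:
    """Build a chat request following the model card's guidance:
    no system prompt, temperature 0.6, and an enforced thinking prefix."""
    prompt = question
    if math:
        # Recommended directive for mathematical problems.
        prompt += ("\nPlease reason step by step, and put your "
                   "final answer within \\boxed{}.")
    return {
        "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
        "messages": [
            # All instructions go in the user turn; no system message.
            {"role": "user", "content": prompt},
            # Prefill the assistant turn so the model cannot skip its
            # thinking phase (assumes server-side continuation support).
            {"role": "assistant", "content": "<think>\n"},
        ],
        "temperature": 0.6,  # recommended midpoint of the 0.5-0.7 range
    }
```

The resulting dict can be passed as the JSON body of a chat-completions call to whichever inference server hosts the model.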

Good For

  • Applications requiring advanced mathematical problem-solving.
  • Complex code generation and analysis tasks.
  • Scenarios demanding robust logical reasoning and chain-of-thought capabilities.
  • Research and development in distilling large model capabilities into more manageable sizes.