zai-org/LongReward-llama3.1-8b-DPO

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 32k · Published: Oct 22, 2024 · Architecture: Transformer

LongReward-llama3.1-8b-DPO is an 8-billion-parameter, DPO-tuned causal language model developed by THUDM on the Llama 3.1 architecture. It is optimized for long-context understanding and generation, supporting a context window of up to 64K tokens, and is fine-tuned on the LongReward-10k preference dataset, making it well suited to tasks that require extensive contextual comprehension.


Overview

LongReward-llama3.1-8b-DPO is an 8 billion parameter language model from THUDM, fine-tuned using Direct Preference Optimization (DPO) on the dpo_llama3.1_8b split of the LongReward-10k dataset. This model is built upon the Llama 3.1 architecture and is designed to excel in tasks requiring a deep understanding of long contexts.

Key Capabilities

  • Extended Context Window: Supports a context window of up to 64K tokens, significantly enhancing its ability to process and generate long-form content.
  • DPO Fine-tuning: Leverages DPO training on a specialized long-context preference dataset, which improves its performance in generating coherent and relevant responses over extended inputs.
  • Llama 3.1 Base: Benefits from the robust capabilities of the Llama 3.1 foundational model.
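A minimal usage sketch with Hugging Face `transformers` is shown below. The repository id `THUDM/LongReward-llama3.1-8b-DPO`, the prompt layout, and the generation settings are assumptions for illustration, not taken from this page; consult the GitHub Repository for the authors' recommended invocation.

```python
# Hypothetical sketch: repo id, prompt format, and generation settings are
# assumptions, not confirmed by the model card.

def build_chat(document: str, question: str) -> list[dict]:
    """Pack a long document plus a question into a chat message list."""
    return [
        {
            "role": "user",
            "content": f"{document}\n\nBased on the document above, answer:\n{question}",
        }
    ]

if __name__ == "__main__":
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "THUDM/LongReward-llama3.1-8b-DPO"  # assumed repo id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    with open("report.txt") as f:  # any long document
        messages = build_chat(f.read(), "What are the key findings?")

    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=512, do_sample=False)
    print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Greedy decoding (`do_sample=False`) is used here only to make the sketch deterministic; long-form generation often benefits from sampling.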

Good For

  • Long-form Question Answering: Answering complex queries that require synthesizing information from very long documents or conversations.
  • Document Summarization: Generating concise summaries from extensive texts.
  • Contextual Chatbots: Developing conversational agents that maintain context over prolonged interactions.
  • Information Extraction: Extracting specific details from large bodies of text where relevant information might be spread out.
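For extraction or summarization over inputs that exceed even a 64K-token window, a common pattern is to split the text into overlapping chunks and process each chunk separately. The helper below is an illustrative sketch, not part of the model's tooling; it approximates token counts by whitespace words, whereas real code should measure lengths with the model's tokenizer.

```python
# Illustrative chunking helper (not from the model card). Approximates tokens
# by whitespace-separated words; use the model's tokenizer in practice.

def chunk_text(words_per_chunk: int, overlap: int, text: str) -> list[str]:
    """Split text into chunks of at most `words_per_chunk` words,
    with `overlap` words repeated between consecutive chunks."""
    if words_per_chunk <= overlap:
        raise ValueError("chunk size must exceed overlap")
    words = text.split()
    step = words_per_chunk - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + words_per_chunk]))
        if start + words_per_chunk >= len(words):
            break  # last chunk already reaches the end of the text
    return chunks
```

Overlap keeps facts that straddle a chunk boundary visible in at least one chunk, at the cost of some duplicated processing.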

For more technical details, refer to the LongReward Paper and the GitHub Repository.