TheBloke/Baize-v2-7B-SuperHOT-8K-fp16

Text generation · Concurrency cost: 1 · Model size: 7B · Quant: FP8 · Context length: 4K · License: other · Architecture: Transformer

TheBloke/Baize-v2-7B-SuperHOT-8K-fp16 is a 7 billion parameter LLaMA-based causal language model, created by TheBloke by merging Project Baize's Baize 7B v2 with Kaio Ken's SuperHOT 8K. This model is specifically designed for extended context applications, supporting an 8K context length through a merged LoRA and custom scaling. It is optimized for GPU inference and serves as a base for further conversions, excelling in detailed conversational tasks with an emphasis on longer interactions.


Model Overview

This model, TheBloke/Baize-v2-7B-SuperHOT-8K-fp16, is a 7 billion parameter LLaMA-based language model. It is a merge of two distinct projects:

  • Project Baize's Baize 7B v2: An open-source chat model fine-tuned with LoRA, utilizing supervised fine-tuning (SFT) and self-distillation with feedback (SDF). Baize models are designed for detailed conversational AI, requiring a specific prompt format ([|Human|] and [|AI|]).
  • Kaio Ken's SuperHOT 8K: A prototype LoRA that extends the context window to 8K tokens, using a technique described in the author's GitHub blog post. This version was trained without RLHF.
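The context-extension technique behind SuperHOT compresses position indices beyond the base model's trained window back into the trained range before they reach the rotary embeddings. A minimal sketch of that idea, with illustrative numbers (LLaMA's original 2048-token window extended to 8192; the exact scaling in the merged model is defined by its remote code, not this snippet):

```python
# Position-interpolation sketch: instead of feeding raw integer positions
# 0..8191 to the rotary embedding, scale them down by a constant factor so
# every position lands inside the range the base model was trained on.
ORIGINAL_CTX = 2048            # LLaMA's trained context window
EXTENDED_CTX = 8192            # target 8K window
SCALE = ORIGINAL_CTX / EXTENDED_CTX  # 0.25

def scaled_positions(seq_len):
    """Fractional positions used in place of raw indices for RoPE."""
    return [i * SCALE for i in range(seq_len)]

# The last position of an 8K sequence maps back inside the trained range:
print(scaled_positions(EXTENDED_CTX)[-1])  # 2047.75
```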

Key Capabilities & Features

  • Extended Context Window: Supports an 8K context length, enabled by the SuperHOT 8K merge and custom scaling during inference.
  • Conversational AI: Inherits the conversational fine-tuning from Project Baize, making it suitable for detailed chat interactions.
  • LLaMA Base: Built upon the LLaMA architecture, providing a robust foundation.
  • FP16 Format: Provided in fp16 pytorch format, ideal for GPU inference and as a base for further quantization or conversions.
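The features above translate into a short loading routine. This is a minimal sketch assuming the `transformers`, `accelerate`, and `torch` packages are installed; the repo id, the 8192 `max_position_embeddings` value, and the need for `trust_remote_code=True` come from this card, while the rest is standard Hugging Face usage rather than a prescribed recipe:

```python
MODEL_ID = "TheBloke/Baize-v2-7B-SuperHOT-8K-fp16"

def load_model(model_id: str = MODEL_ID):
    # Imports are deferred so the sketch can be read (and its shape tested)
    # without the heavyweight dependencies installed.
    import torch
    from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

    config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
    config.max_position_embeddings = 8192  # enable the full 8K window

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        config=config,
        torch_dtype=torch.float16,  # weights ship in fp16
        device_map="auto",          # place layers on available GPUs
        trust_remote_code=True,     # loads the SuperHOT scaled-RoPE code
    )
    return tokenizer, model
```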

Usage Considerations

  • Prompt Format: When using the Baize component, adhere to the [|Human|] and [|AI|] prompt format for optimal performance.
  • Context Scaling: Requires trust_remote_code=True or a monkey patch to properly utilize the 8K context length, with config.max_position_embeddings set to 8192.
  • No RLHF: The SuperHOT component was trained without RLHF, which may influence its response characteristics.
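The `[|Human|]` / `[|AI|]` prompt format noted above can be built with plain string assembly. Only the turn markers come from the card; the helper name, the newline separators, and leaving the final AI turn open are illustrative assumptions:

```python
def build_prompt(history, user_message):
    """Assemble a Baize-style prompt.

    history: list of (human, ai) turn pairs from earlier in the chat.
    """
    parts = []
    for human, ai in history:
        parts.append(f"[|Human|]{human}")
        parts.append(f"[|AI|]{ai}")
    parts.append(f"[|Human|]{user_message}")
    parts.append("[|AI|]")  # leave the AI turn open for generation
    return "\n".join(parts)

prompt = build_prompt(
    [("Hi!", "Hello, how can I help?")],
    "What is LoRA?",
)
print(prompt)
```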