TheBloke/Vicuna-7B-CoT-SuperHOT-8K-fp16

Text Generation · Model Size: 7B · Quant: fp16 · Ctx Length: 8K · License: other · Architecture: Transformer

TheBloke/Vicuna-7B-CoT-SuperHOT-8K-fp16 is a 7 billion parameter language model created by merging Kevin Pro's Vicuna 7B CoT with Kaio Ken's SuperHOT 8K. This fp16 PyTorch model is designed for GPU inference and features an extended context length of 8192 tokens. It is optimized for conversational tasks with Chain-of-Thought capabilities and enhanced long-context understanding.

Model Overview

TheBloke/Vicuna-7B-CoT-SuperHOT-8K-fp16 merges Kevin Pro's Vicuna 7B CoT with Kaio Ken's SuperHOT 8K LoRA and is distributed as fp16 PyTorch weights intended for GPU inference.

Key Capabilities

  • Extended Context Window: Achieves an 8K (8192-token) context, four times the base model's 2048-token window, by integrating Kaio Ken's SuperHOT 8K LoRA; loading the custom modeling code requires trust_remote_code=True (see the example under Usage Notes).
  • Chain-of-Thought (CoT) Enhancement: Incorporates Kevin Pro's Vicuna 7B CoT, which is specifically fine-tuned to improve Chain-of-Thought reasoning capabilities.
  • Flexible Configuration: config.json defaults to a sequence length of 8192 but can be lowered to 4096 if a shorter context is desired.
  • Automatic Scaling: The bundled modeling code sets the RoPE scale parameter from max_position_embeddings, e.g. scale=4 for 8192 tokens (a sketch of this derivation follows the list).
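
The scale applied by the modeling code is the ratio between the configured context length and LLaMA's original 2048-token pretraining window. A minimal sketch of that derivation, assuming this relationship holds and using illustrative names rather than identifiers from the repository:

```python
# Illustrative sketch of how the bundled modeling code derives the
# RoPE position-interpolation scale. BASE_CTX and rope_scale are
# hypothetical names, not identifiers from the actual repo.
BASE_CTX = 2048  # original LLaMA pretraining context length

def rope_scale(max_position_embeddings: int) -> int:
    # Positions are compressed by this factor, interpolating e.g.
    # 8192 positions into the original 2048-position range.
    return max_position_embeddings // BASE_CTX

assert rope_scale(8192) == 4  # the config.json default
assert rope_scale(4096) == 2  # if the sequence length is lowered
```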

Good For

  • Long-form text generation: Ideal for applications requiring extensive context understanding and generation, such as detailed conversations, document analysis, or creative writing with complex narratives.
  • Reasoning tasks: Benefits from the Chain-of-Thought fine-tuning, making it suitable for tasks that require multi-step reasoning.
  • GPU-based inference: Ships as fp16 PyTorch weights, so it runs best on GPUs with sufficient VRAM (see the rough memory estimate below).
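
As a rough guide to GPU sizing, 7B parameters at fp16 take about 14 GB for the weights alone, and the KV cache grows with context length. A back-of-the-envelope sketch, where the LLaMA-7B architecture constants (32 layers, hidden size 4096) are assumptions about the base model:

```python
# Back-of-the-envelope VRAM estimate for this model at fp16.
# LAYERS and HIDDEN are standard LLaMA-7B values, assumed here.
PARAMS = 7e9
BYTES_FP16 = 2
LAYERS, HIDDEN = 32, 4096

weights_gb = PARAMS * BYTES_FP16 / 1e9                 # ~14 GB of weights
kv_bytes_per_token = LAYERS * 2 * HIDDEN * BYTES_FP16  # K and V per token
kv_cache_gb = kv_bytes_per_token * 8192 / 1e9          # ~4.3 GB at the full 8K context

print(f"weights ≈ {weights_gb:.1f} GB, KV cache at 8K ≈ {kv_cache_gb:.1f} GB")
```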

Usage Notes

To use the full 8K context with the transformers loader, pass trust_remote_code=True so the custom RoPE-scaled modeling code is executed at load time. With the exllama or exllama_hf loaders, pass --max_seq_len 8192 --compress_pos_emb 4 instead.
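
For the transformers path, a minimal loading-and-generation sketch follows; the Vicuna-style USER/ASSISTANT prompt and the generation settings are assumptions for illustration, not values taken from the model card:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Vicuna-7B-CoT-SuperHOT-8K-fp16"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # weights ship as fp16
    device_map="auto",          # place layers on available GPU(s); needs accelerate
    trust_remote_code=True,     # required for the 8K RoPE-scaled modeling code
)

# Vicuna-style prompt; the exact template is an assumption here.
prompt = "USER: Explain, step by step, why the sky is blue.\nASSISTANT:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

If a 4096-token context is sufficient, lowering max_position_embeddings in config.json to 4096 before loading reduces the interpolation scale to 2 accordingly.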