TheBloke/Vicuna-7B-CoT-SuperHOT-8K-fp16
TheBloke/Vicuna-7B-CoT-SuperHOT-8K-fp16 is a 7 billion parameter language model created by merging Kevin Pro's Vicuna 7B CoT with Kaio Ken's SuperHOT 8K. This fp16 PyTorch model is designed for GPU inference and features an extended context length of 8192 tokens. It is optimized for conversational tasks with Chain-of-Thought capabilities and enhanced long-context understanding.
Model Overview
TheBloke/Vicuna-7B-CoT-SuperHOT-8K-fp16 merges Kevin Pro's Vicuna 7B CoT with Kaio Ken's SuperHOT 8K into a single 7 billion parameter model. It is distributed in fp16 PyTorch format and is intended for GPU inference.
Key Capabilities
- Extended Context Window: Achieves an 8K (8192-token) context length, significantly longer than typical 4K models, by integrating Kaio Ken's SuperHOT 8K LoRA and loading with `trust_remote_code=True` during inference.
- Chain-of-Thought (CoT) Enhancement: Incorporates Kevin Pro's Vicuna 7B CoT, which is fine-tuned specifically to improve Chain-of-Thought reasoning.
- Flexible Configuration: `config.json` sets the sequence length to 8192 by default, but it can be lowered to 4096 if a smaller sequence length is desired.
- Scalability: The provided modeling code automatically sets the `scale` parameter based on `max_position_embeddings`, e.g. `scale = 4` for 8192 tokens.
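The scale computation mentioned above can be sketched in a few lines. The base context of 2048 tokens is an assumption (the original Llama pretraining length); `scale = 4` for 8192 tokens matches the figure given by the model card.

```python
# Sketch of how the SuperHOT modeling code presumably derives the RoPE
# interpolation factor. ORIGINAL_CONTEXT = 2048 is an assumption based on
# Llama's pretraining sequence length, not stated in the model card.
ORIGINAL_CONTEXT = 2048

def rope_scale(max_position_embeddings: int) -> int:
    """Interpolation factor: how much position embeddings are compressed."""
    return max_position_embeddings // ORIGINAL_CONTEXT

print(rope_scale(8192))  # 4, matching the scale=4 noted above
print(rope_scale(4096))  # 2, for the smaller 4096 configuration
```

The same factor reappears in the exllama flags below: `--compress_pos_emb 4` for an 8192-token sequence length.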
Good For
- Long-form text generation: Ideal for applications requiring extensive context understanding and generation, such as detailed conversations, document analysis, or creative writing with complex narratives.
- Reasoning tasks: Benefits from the Chain-of-Thought fine-tuning, making it suitable for tasks that require multi-step reasoning.
- GPU-based inference: Optimized for performance on GPUs due to its fp16 PyTorch format.
Usage Notes
To leverage the 8K context, users must ensure `trust_remote_code=True` is enabled during model loading. For the exllama or exllama_hf loaders, arguments like `--max_seq_len 8192 --compress_pos_emb 4` are recommended.
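As a minimal sketch of the loading step with Hugging Face transformers (the helper names and the `max_position_embeddings` override are illustrative assumptions, not taken from the model card):

```python
# Hedged sketch: loading the model with transformers.
# trust_remote_code=True lets the repo's patched Llama modeling code run,
# which applies the SuperHOT position-embedding scaling.
MODEL_ID = "TheBloke/Vicuna-7B-CoT-SuperHOT-8K-fp16"

def from_pretrained_kwargs(max_positions: int = 8192) -> dict:
    """Illustrative kwargs for AutoModelForCausalLM.from_pretrained.
    max_positions may be lowered to 4096, as the card notes."""
    return {
        "trust_remote_code": True,              # required for the 8K context
        "max_position_embeddings": max_positions,
    }

def load_model():
    """Requires transformers, torch, and a GPU with enough free memory
    for the fp16 weights (roughly 14 GB for a 7B model)."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.float16,              # fp16 weights, GPU inference
        **from_pretrained_kwargs(),
    )
    return tokenizer, model
```

The heavy download is kept inside `load_model()` so the kwargs helper can be inspected or reused (for example, with a 4096-token configuration) without pulling the weights.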