TheBloke/WizardLM-13B-V1-1-SuperHOT-8K-fp16
TheBloke/WizardLM-13B-V1-1-SuperHOT-8K-fp16 is a 13 billion parameter causal language model created by merging WizardLM's WizardLM 13B V1.1 with Kaio Ken's SuperHOT 8K LoRA. The merge extends the context length to 8192 tokens, making the model suitable for applications that need longer conversational memory or must process extensive documents. It is provided in fp16 PyTorch format for GPU inference and further conversions; note that the SuperHOT LoRA it incorporates was developed as an NSFW-focused prototype.
Model Overview
This model merges WizardLM's WizardLM 13B V1.1 with Kaio Ken's SuperHOT 8K LoRA and is distributed in fp16 PyTorch format, optimized for GPU inference.
Key Capabilities & Features
- Extended Context Window: Achieves an 8K (8192 token) context length, four times the base model's native 2048 tokens, enabled by the SuperHOT 8K merge and loading with `trust_remote_code=True` (a sketch of the underlying linear position scaling follows this list).
- WizardLM Base: Built upon WizardLM-13B V1.1, which has demonstrated strong performance on benchmarks such as MT-Bench (6.74), AlpacaEval (86.32%), and WizardLM Eval (99.3%).
- SuperHOT Merge: Incorporates Kaio Ken's SuperHOT LoRA, which its author describes as an NSFW-focused prototype, so the merged model may be specialized toward that kind of content.
- Flexible Deployment: Available in various formats including 4-bit GPTQ for GPU, 2-8 bit GGML for CPU, and this unquantized fp16 PyTorch version for high-fidelity inference and further model conversions.
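The sketch below illustrates the linear RoPE position scaling that SuperHOT-style 8K extensions rely on. It is not code from this repository; the function name, head dimension, and RoPE base are assumptions chosen to mirror LLaMA-style rotary embeddings, with a scale factor of 4 matching the recommended `compress_pos_emb 4`.

```python
import torch

# Illustrative sketch of SuperHOT-style linear RoPE position scaling.
# Assumptions (not this repo's actual code): LLaMA-style rotary embeddings,
# head_dim=128, base=10000, and a scale factor of 4.
def scaled_rope_angles(seq_len: int, head_dim: int = 128,
                       base: float = 10000.0, scale: float = 4.0) -> torch.Tensor:
    """Angle table for rotary embeddings with linearly interpolated positions."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float() / scale  # compress positions by `scale`
    return torch.outer(positions, inv_freq)            # shape: (seq_len, head_dim // 2)

# With scale=4, position 8191 maps to 2047.75, which stays inside the
# 0-2047 position range the 2048-context base model was trained on.
angles = scaled_rope_angles(seq_len=8192)
print(angles.shape)  # torch.Size([8192, 64])
```

Dividing positions by the scale factor compresses the 8192-token window into the positional range the base model already understands, which is what the `scale` parameter mentioned under Usage Considerations controls.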
Usage Considerations
- Context Handling: Requires `trust_remote_code=True` in Hugging Face Transformers to properly utilize the 8K context window; this automatically sets the `scale` parameter for the position embeddings (see the loading sketch after this list).
- Monkey Patch: A `llama_rope_scaled_monkey_patch.py` is provided for environments that do not natively support the required scaling, allowing the scaling factor to be applied manually.
- Oobabooga Compatibility: The arguments `--max_seq_len 8192 --compress_pos_emb 4` are recommended for optimal performance with Oobabooga's ExLlama loader.
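A minimal loading sketch follows, assuming the `transformers` and `accelerate` libraries and a GPU setup with roughly 26 GB of VRAM for the 13B fp16 weights. The Vicuna-style prompt shown is an assumption based on WizardLM V1.1's usual template; verify against the model card before relying on it.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/WizardLM-13B-V1-1-SuperHOT-8K-fp16"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # fp16 weights, as shipped
    device_map="auto",           # place layers across available GPUs via accelerate
    trust_remote_code=True,      # lets the repo's custom code set the RoPE scale for 8K
)

# Vicuna-style prompt (assumed format; check the model card for the exact template).
prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "USER: Summarize the key differences between fp16 and 4-bit GPTQ inference. "
    "ASSISTANT:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

If the installed Transformers version does not honor the repo's custom scaling code, the provided `llama_rope_scaled_monkey_patch.py` can be applied before loading instead, per the Monkey Patch note above.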