Model Compatibility
What models run on Featherless?
Compatibility
Featherless aims to provide serverless inference for all AI models. We currently support 2,400+ text-generation models that are fine-tunes of the following base models:
Qwen 2 72B
Llama 3 70B
Mistral Nemo 12B
Llama 3 8B
Llama 2 13B
Llama 2 7B
Mistral v2 7B
We also support fine-tunes of the following depth-upscaled architectures:
Llama 3 15B
Llama 2 SOLAR (11B)
Hugging Face Repo Requirements
For models to be loaded on Featherless, we require (see the sketch after this list for a rough programmatic check):
a model card on the Hugging Face Hub
full weights (not LoRA or QLoRA)
weights in safetensors format (not GGUF, not pickled torch tensors)
fp16 precision (though we quantize to fp8 as part of model boot)
no variation of tensor shape relative to one of the above base models (e.g. no change in embedding tensor size)
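As a rough illustration, the checks below approximate these requirements using the huggingface_hub client. The helper name, the example repo id, and the reliance on `torch_dtype` in `config.json` are assumptions for the sketch, not part of Featherless's actual ingestion pipeline.

```python
# Sketch only: approximate the repo requirements above with the huggingface_hub client.
import json
from huggingface_hub import HfApi, hf_hub_download

def check_repo(repo_id: str) -> dict:
    api = HfApi()
    files = api.list_repo_files(repo_id)

    has_safetensors = any(f.endswith(".safetensors") for f in files)
    has_gguf = any(f.endswith(".gguf") for f in files)
    # LoRA/QLoRA adapters ship an adapter_config.json instead of full weights.
    is_adapter = "adapter_config.json" in files

    config_path = hf_hub_download(repo_id, "config.json")
    with open(config_path) as fh:
        config = json.load(fh)

    return {
        "model_card": "README.md" in files,
        "full_weights_not_lora": not is_adapter,
        "safetensors_weights": has_safetensors and not has_gguf,
        "fp16_precision": config.get("torch_dtype") == "float16",
        "declared_architecture": config.get("architectures"),
    }

if __name__ == "__main__":
    print(check_repo("Qwen/Qwen2-72B-Instruct"))  # any public repo id
```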
Model Availability
Any public model from Hugging Face with 100+ downloads will automatically be available for inference on Featherless. Users may request public models with fewer downloads either by email or through the #model-suggestions channel on our Discord.
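For public models, one quick way to see whether a repo already clears the 100-download threshold is the `downloads` field exposed by the huggingface_hub client; the snippet below is an illustration, not a Featherless API.

```python
from huggingface_hub import model_info

info = model_info("Qwen/Qwen2-72B-Instruct")  # any public repo id
status = "auto-listed" if info.downloads >= 100 else "request via email or #model-suggestions"
print(f"{info.downloads} downloads -> {status}")
```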
Private models meeting the compatibility requirements outlined here can be run on Featherless by Scale customers who have connected their Hugging Face account. Please visit the private models page in the profile section of the web app.
Context Lengths
All models are served at one of three context lengths: 4k, 8k, or 16k, i.e. the total token count of the prompt plus the completion cannot exceed the context length of the model.
The context length a model is served at depends on its architecture, as shown in the following table.
| Context Length | Model Architectures Serving this Length |
| --- | --- |
| 4k | Llama 2 (incl. SOLAR 11B) |
| 8k | |
| 16k | Qwen 2 72B |
e.g. since Anthracite's Magnum is a Qwen 2 72B fine-tune, its context length is 16k
e.g. since Sao10K's Fimbulvetr is a fine-tune of Llama 2 SOLAR (11B), its context length is 4k
We aim to operate models at their maximum usable context, but continue to make trade-offs to ensure sufficiently low time to first token (TTFT) and a consistent throughput of more than 10 tokens per second for all models.
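As a rough sketch of the budget check described above, the snippet below counts prompt tokens with the model's tokenizer and verifies that prompt plus completion fits the served context. The context-length map, the assumption that 16k means 16,384 tokens, and the helper name are illustrative, not an official Featherless utility.

```python
from transformers import AutoTokenizer

# Served context per model (illustrative; 16k assumed to mean 16,384 tokens).
SERVED_CONTEXT = {"Qwen/Qwen2-72B-Instruct": 16_384}

def fits_context(model_id: str, prompt: str, max_tokens: int) -> bool:
    """Return True if prompt tokens plus the requested completion fit the served context."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    prompt_tokens = len(tokenizer.encode(prompt))
    return prompt_tokens + max_tokens <= SERVED_CONTEXT[model_id]

print(fits_context("Qwen/Qwen2-72B-Instruct", "Write a haiku about clouds.", max_tokens=512))
```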
Quantization
Though our model ingestion pipeline requires weights in safetensors format at FP16 precision, all models are served at FP8 precision (they are quantized before loading). This is a trade-off that balances output quality with inference speed.
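For intuition, the snippet below shows the common per-tensor scale-then-cast recipe for FP16 to FP8 (E4M3) quantization. It is a generic illustration of the idea, not Featherless's actual loading code, and assumes PyTorch 2.1+ for the `float8_e4m3fn` dtype.

```python
# Generic per-tensor FP8 quantization sketch (not Featherless's pipeline).
import torch

def quantize_fp8(weight_fp16: torch.Tensor):
    """Return an FP8 tensor plus the per-tensor scale needed to dequantize it."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max          # ~448 for E4M3
    scale = weight_fp16.abs().max().float() / fp8_max        # per-tensor scale
    weight_fp8 = (weight_fp16.float() / scale).to(torch.float8_e4m3fn)
    return weight_fp8, scale

w = torch.randn(4096, 4096, dtype=torch.float16)
w_fp8, s = quantize_fp8(w)

# Dequantize to verify the round trip stays close to the original weights.
error = (w_fp8.float() * s - w.float()).abs().max()
print(w_fp8.dtype, f"max abs error: {error.item():.4f}")
```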