Models & Model Compatibility

What models run on Featherless?

Model Compatibility

Featherless aims to provide serverless inference for all AI models. We currently support over 4,000 text generation models which are fine-tunes of the following base architectures:

Family	Model Class	Context Length	Concurrency Cost
Deepseek 3	deepseek-v3-lc	32768	4
	kimi-k2	32768	4
Gemma	gemma-2b	8192	1
	gemma-7b	8192	1
Gemma 2	gemma2-2b	8192	1
	gemma2-27b	32768	2
Gemma 3	gemma3t-1b	32768	1
	gemma3-4b	32768	1
	gemma3-12b	32768	1
	gemma3t-12b	32768	1
	gemma3-27b	32768	2
	gemma3t-27b	32768	2
GLM 4	glm4-9b	32768	1
	glm4-32b	32768	2
GLM 4.6	glm46-357b	32768	4
GLM 4.7	glm47-357b	32768	4
GPT OSS	gpt-oss-20b	32768	2
	gpt-oss-120b	16384	2
GPT-SW3	gpt-sw3-356m	2048	1
	gpt-sw3-1b3	2048	1
	gpt-sw3-20b	2048	1
GPT2-SW3	gpt2-sw3-126m	2048	1
	gpt2-sw3-6b7	2048	1
Ling 2	ling2-1t	32768	4
Llama 2	tinyllama-1b1	2048	1
	llama2-7b	4096	1
	llama2-solar-10b7	4096	1
	llama2-13b	4096	1
Llama 3	llama3-8b	8192	1
	llama3-15b	8192	1
	llama3-70b	8192	4
Llama 3.1	llama31-8b	32768	1
	llama31-70b	32768	4
Llama 3.2	llama32-1b	32768	1
	llama32-3b	32768	1
Llama 3.3	llama33-70b	32768	4
Mellum	mellum-4b	32768	1
Minimax	minimax-m2	32768	4
Mistral	mistral-v01-7b	4096	1
	mistral-v02-7b	8192	1
	mistral-nemo	32768	1
	mistral-large	32768	4
	mixtral-8x22b	32768	4
Mistral 3	mistral-24b	32768	2
Mistral 3.1	mistral-24b-2503	32768	2
Phi	phi-1b4	2048	1
Phi 2	phi2-3b	2048	1
Phi 3	phi3-4b	4096	1
Phi 4	phi4-3b8	131072	1
QRWKV	qrwkv-32b-32k	32768	1
	qrwkv-72b-32k	65536	1
Qwen 1.5	qwen15-0b5	32768	1
	qwen15-1b8	32768	1
	qwen15-4b	32768	1
	qwen15-7b	32768	1
	qwen15-14b	32768	1
	qwen15-32b	32768	2
	qwen15-72b	32768	4
Qwen 2	qwen2-0b5	131072	1
	qwen2-1b5	131072	1
	qwen2-7b	131072	1
	qwen2-14b-lc	32768	1
	qwen2-32b	32768	2
	qwen2-72b	131072	4
Qwen 2.5	qwen25-0b5	32768	1
	qwen25-1b5	131072	1
	qwen25-3b	32768	1
	qwen25-7b	131072	1
	qwen25-14b	131072	1
	qwen25-32b	131072	2
	qwen25-72b	131072	4
Qwen 3	qwen3-0b6	40960	1
	qwen3-1b7	40960	1
	qwen3-4b	40960	1
	qwen3-8b	32768	1
	qwen3-14b	32768	1
	qwen3moe-30b	32768	4
	qwen3-32b	32768	2
RWKV 5	rwkv5-7b	16384	1
RWKV 6	rwkv6-7b	16384	1
	rwkv6-14b	16384	1
	rwkv6moe-37b	16384	1

HuggingFace Repository Requirements

For models to be loaded in featherless, we require

a model card in the hugging face hub
full weights (not LoRA or QLoRA)
weights in safetensors format (not GGUF, not pickled torch vectors)
fp16 precision (though we quant to fp8 as part of model boot)
no variation of tensor shape relative to one of the above base models (e.g. no change in embedding tensor size)

Model Availability

Any public model from hugging face with 100+ downloads will automatically be available for inference on featherless. Users may request public models with fewer downloads either by email or through the #model-suggestions channel discord.

Private models meeting the compatibility requirements outlined here can be run on featherless by scale customers that have connected their hugging face account. Please visit the private models page in the profile section of the web-app.

Context Lengths

All models are served at one of 4k, 8k or 16k context length - i.e. the total of token count of the prompt plus the completion cannot exceed the context length of the model.

What context length a model can be used at depends on it’s architecture and the following table.

*Context Length*	*Model Architectures Serving this Length*
4k	Llama 2 (7B, 11B, 13B)
8k	Llama 3 (8B, 15B, 70B) Mistral v2 (7B)
16k	Llama 3.1, 3.3 (8B, 70B) Mistral Nemo (12B) Mistral 3, 3.1 (24B) Qwen (1.5-32B, 2-72B, 2.5-72B, 3-32B) Kimi-K2
32k	QRWKV (32B, 72B) Deepseek (V3, R1)

e.g. since Anthracite’s Magnum is a Qwen 2 72B fine-tune, it’s context length is 16k

e.g. since Sao10K’s fimbulvetr is a fine-tune of the Llama2 11B, it’s context length is 4k

We aim to operate the models at maximum useable context, however continue to make tradeoffs to ensure sufficiently low TTFT and a consistent token throughput of > 10 tok/s for all models.

Quantization

Though our model ingestion pipeline requires weights in safetensors format with FP16 precision, all models are served at FP8 precision (they are quanted before loading).
An exception to this rule are models under 5B, these will be run at FP16 precision
This is a tradeoff to balance output quality with inference speeds.

Last edited: Jul 18, 2025