Dec 4, 2024

The hidden costs of local LLM inference

Drop the "thousand"; just "ten dollars" - it's cleaner.

The allure of running large language models (LLMs) locally is understandable—impressive hardware, full control, and no reliance on third-party services. But beneath the surface lies a more complex reality: the cost of local inference often outweighs the perceived benefits. Enter Featherless.ai, a service designed to make LLM inference effortless, cost-effective, and accessible.

Most users running LLMs locally do so at a batch size of 1 (BS=1), chasing the best-case scenario of minimal latency. Yet this approach doesn't account for hidden costs, particularly energy consumption. In fact, our analysis shows that for typical BS=1 usage, energy expenses alone can exceed the price of our $25/month premium tier, which is itself a fraction of the cost of maintaining high-end hardware.
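As a rough, purely illustrative sanity check of that claim, the arithmetic looks something like the sketch below. The wattage, hours of use, and electricity price are our assumptions, not measured values; plug in your own numbers.

# Back-of-envelope estimate; every number here is an assumption, not a measurement.
watts = 700            # a 2x4090 workstation under sustained inference load
hours_per_day = 8      # heavy interactive BS=1 usage
price_per_kwh = 0.17   # roughly the average US residential electricity rate

kwh_per_month = watts / 1000 * hours_per_day * 30
cost_per_month = kwh_per_month * price_per_kwh
print(f"{kwh_per_month:.0f} kWh -> ${cost_per_month:.2f} per month")  # 168 kWh -> $28.56

Under those assumptions, the electricity bill alone already clears the $25/month mark, before the hardware itself is accounted for.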

In this post, we dive into benchmarks of popular models across a range of hardware setups, showcasing the complexities and costs of local inference. From GPU splits and layer offloading to the challenges of multi-threaded CPU performance, we break down the numbers to reveal how Featherless.ai eliminates these headaches with a single, predictable pricing model.

Benchmarks: How Local Inference Stacks Up

Using llama.cpp’s llama-bench, we evaluated some of the most popular models at different quantizations on both high-end and enthusiast-grade hardware:

  • 2× NVIDIA RTX 4090, Ryzen 9 7950X3D, 64GB DDR5 (~$6,000)

  • 14” M3 Max MacBook Pro with 36GB of unified memory ($2,899)

To replicate the 2×4090 results, we run llama-bench with the following arguments:

  • -p 0 - the number of prompt tokens to process. We chose 0 to get best-case results for local inference.

  • -n 128 - the number of tokens to generate in a single run.

  • -ngl 999 - the number of layers to offload to the GPU(s). We chose 999 arbitrarily, as it gets clamped to the actual number of layers in the model.

  • -ts "50/50" - the distribution of layers between multiple GPUs; 50/50 means both GPUs process the same number of layers.

  • -mg 0 - the ID of the main GPU.

  • -fa 1 - enables flash attention, which gave us a noticeable improvement on NVIDIA GPUs.

2x4090
./llama.cpp/llama-bench\
  -m "models/Meta-Llama-3.1-70B-Instruct-IQ2_M.gguf"\
  -m "models/Meta-Llama-3.1-70B-Instruct-IQ3_M.gguf"\
  -m "models/Meta-Llama-3.1-70B-Instruct-IQ4_XS.gguf"\
  -m "models/Meta-Llama-3.1-70B-Instruct-Q5_K_S.gguf"\
  -m "models/Qwen-QwQ-32B-Preview-Q8_0.gguf"\
  -m "models/Qwen2.5-72b-instruct-q4_0.gguf"\
  -p 0 -n 128 -ngl 999 -ts "50/50" -mg 0 -fa 1

When measuring 1×4090 performance, the command is simpler, as we don’t have to specify the split ratio or the main GPU; instead, we restrict llama.cpp to a single card with CUDA_VISIBLE_DEVICES=0.

1x4090
CUDA_VISIBLE_DEVICES=0 ./llama.cpp/llama-bench\
    -m "models/Meta-Llama-3.1-70B-Instruct-IQ2_M.gguf"\
    -m "models/Meta-Llama-3.1-8B-Instruct-Q4_K_L.gguf"\
    -m "models/Meta-Llama-3.1-8B-Instruct-Q5_K_L.gguf"\
    -m "models/Meta-Llama-3.1-8B-Instruct-Q6_K_L.gguf"\
    -m "models/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf"\
    -m "models/Mistral-Nemo-Instruct-2407-Q8_0.gguf"\
    -m "models/Qwen-QwQ-32B-Preview-Q4_K_L.gguf"\
    -p 0 -n 128 -ngl 999 -fa 1

Things get a bit more complicated for the 4090-with-CPU-offloading case: we have to manually specify the number of layers processed on the GPU, and since each quantization has a different per-layer size, this number varies from model to model. A rough way to estimate it is sketched below.
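The -ngl values in the commands that follow were presumably tuned so that each quant just fits into the 4090’s 24 GB of VRAM. As a starting point, you can estimate the layer count from the quant’s file size and the free VRAM; the overhead figure below is a guess, not the exact procedure we used.

# Rough heuristic, not an exact rule: GPU layers ~= usable VRAM / average per-layer size.
model_size_gib = 35.29   # Llama 3.1 70B IQ4_XS (from the table further down)
n_layers = 80            # Llama 70B has 80 transformer layers
vram_gib = 24            # a single RTX 4090
overhead_gib = 2         # KV cache, CUDA buffers, display, etc. (ballpark guess)

per_layer_gib = model_size_gib / n_layers
ngl = int((vram_gib - overhead_gib) / per_layer_gib)
print(ngl)  # 49 with these numbers; the command below uses -ngl 50 for this quant

In practice you still want to nudge the value up or down and watch for out-of-memory errors, since the real overhead depends on context size and backend.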

1x4090 + 7950X3D
CUDA_VISIBLE_DEVICES=0 ./llama.cpp/llama-bench\
  -m "models/Meta-Llama-3.1-70B-Instruct-IQ2_M.gguf"\
  -p 0 -n 128 -ngl 76 -fa 1

CUDA_VISIBLE_DEVICES=0 ./llama.cpp/llama-bench \
  -m "models/Meta-Llama-3.1-70B-Instruct-IQ3_M.gguf"\
  -p 0 -n 128 -ngl 60 -fa 1

CUDA_VISIBLE_DEVICES=0 ./llama.cpp/llama-bench \
  -m "models/Meta-Llama-3.1-70B-Instruct-IQ4_XS.gguf"\
  -p 0 -n 128 -ngl 50 -fa 1

CUDA_VISIBLE_DEVICES=0 ./llama.cpp/llama-bench \
  -m "models/Meta-Llama-3.1-70B-Instruct-Q5_K_S.gguf"\
  -p 0 -n 128 -ngl 38 -fa 1

CUDA_VISIBLE_DEVICES=0 ./llama.cpp/llama-bench \
  -m "models/Qwen2.5-72b-instruct-q4_0.gguf"\
  -p 0 -n 128 -ngl 42 -fa 1
  
CUDA_VISIBLE_DEVICES=0 ./llama.cpp/llama-bench \
  -m "models/Qwen-QwQ-32B-Preview-Q8_0.gguf"\
  -p 0 -n 128 -ngl 40 -fa 1

For our CPU-only runs, the situation is a bit different. Here we explicitly tell llama.cpp to use 12 threads with the -t 12 flag; we found that increasing the thread count further decreased performance. The most likely explanation is the multi-CCD design of the 7950X3D: only 16 of its 32 available threads have access to the increased L3 cache.

7950X3D
./llama.cpp/llama-bench\
  -m "models/Meta-Llama-3.1-8B-Instruct-Q4_K_L.gguf"\
  -m "models/Meta-Llama-3.1-70B-Instruct-IQ2_M.gguf"\
  -m "models/Meta-Llama-3.1-70B-Instruct-IQ3_M.gguf"\
  -m "models/Meta-Llama-3.1-70B-Instruct-IQ4_XS.gguf"\
  -m "models/Meta-Llama-3.1-70B-Instruct-Q5_K_S.gguf"\
  -m "models/Meta-Llama-3.1-8B-Instruct-Q5_K_L.gguf"\
  -m "models/Meta-Llama-3.1-8B-Instruct-Q6_K_L.gguf"\
  -m "models/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf"\
  -m "models/Mistral-Nemo-Instruct-2407-Q8_0.gguf"\
  -m "models/Qwen2.5-72b-instruct-q4_0.gguf"\
  -m "models/Qwen-QwQ-32B-Preview-Q4_0_8_8.gguf"\
  -m "models/Qwen-QwQ-32B-Preview-Q4_K_L.gguf"\
  -m "models/Qwen-QwQ-32B-Preview-Q8_0.gguf"\
  -p 0 -n 128 -t 12

Compiled into a single table:

| Model | Size | Params | 2×4090 (tk/s) | 1×4090 (tk/s) | 4090 + 7950X3D (tk/s) | 7950X3D (tk/s) | M3 Max 36GB (tk/s) |
|---|---|---|---|---|---|---|---|
| llama 8B Q4_K - Medium | 4.94 GiB | 8.03 B | N/A* | 156.10 | N/A* | 14.33 | 47.63 |
| llama 8B Q5_K - Medium | 5.63 GiB | 8.03 B | N/A* | 138.70 | N/A* | 12.46 | 37.77 |
| llama 8B Q6_K | 6.37 GiB | 8.03 B | N/A* | 123.92 | N/A* | 10.96 | 38.51 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | N/A* | 101.92 | N/A* | 8.72 | 31.76 |
| llama 13B Q8_0 (Mistral Nemo) | 12.12 GiB | 12.25 B | N/A* | 67.92 | N/A* | 5.67 | 21.00 |
| llama 70B IQ2_M - 2.7 bpw | 22.46 GiB | 70.55 B | 32.92 | 34.31 | 17.90 | 1.97 | 6.82 |
| llama 70B IQ3_S mix - 3.66 bpw | 29.74 GiB | 70.55 B | 25.92 | OOM | 6.40 | 1.45 | OOM |
| llama 70B IQ4_XS - 4.25 bpw | 35.29 GiB | 70.55 B | 22.41 | OOM | 4.22 | 1.86 | OOM |
| llama 70B Q5_K - Small | 45.31 GiB | 70.55 B | 18.01 | OOM | 2.54 | 1.46 | OOM |
| qwen2 70B Q4_0 (2.5) | 38.53 GiB | 72.96 B | 20.70 | OOM | 3.24 | 1.70 | OOM |
| qwen2 32B Q4_0_8_8 (QwQ) | 17.35 GiB | 32.76 B | N/A** | N/A** | N/A** | 3.80 | N/A** |
| qwen2 32B Q4_K - Medium (QwQ) | 19.02 GiB | 32.76 B | N/A* | 41.79 | N/A* | 3.52 | 12.41 |
| qwen2 32B Q8_0 (QwQ) | 32.42 GiB | 32.76 B | 24.94 | OOM | 5.00 | 2.06 | OOM |

* Not benchmarked: running across 2 GPUs (or with CPU offloading) when the model fits on a single GPU is not an optimal configuration.

** Special quant optimized for AVX512-capable CPUs or non-Apple ARM CPUs

The Featherless.ai Advantage

Featherless.ai simplifies LLM inference, giving you:

  • Access to any Hugging Face model with serverless convenience.

  • No need for expensive hardware, energy bills, or complex configurations.

Our fixed-rate pricing eliminates the guesswork of per-token billing, empowering developers to build innovative applications without worrying about costs spiraling out of control.
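For illustration, a typical call looks something like the sketch below. It assumes Featherless.ai’s OpenAI-compatible endpoint at https://api.featherless.ai/v1 and Hugging Face-style model IDs; check the docs for the current details.

# Minimal sketch of a chat completion against Featherless.ai's
# OpenAI-compatible API (base URL and model ID are assumptions; see the docs).
from openai import OpenAI

client = OpenAI(
    base_url="https://api.featherless.ai/v1",
    api_key="YOUR_FEATHERLESS_API_KEY",  # placeholder
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # any supported Hugging Face model ID
    messages=[{"role": "user", "content": "Summarize the trade-offs of local LLM inference."}],
)
print(response.choices[0].message.content)

No GPUs to provision, no quantization or layer-split decisions, and no surprise electricity bill.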

Conclusion

Running LLMs locally might seem appealing at first, but the reality is far from simple. Between high hardware costs, wasted GPU resources, and hidden energy expenses, it’s clear that local inference is often inefficient and impractical. For most developers, Featherless.ai offers a far better alternative.

With Featherless.ai, you can focus on building while we handle the heavy lifting. Start your journey today and unlock the full potential of LLMs—hassle-free and cost-effective.