RADLADS: Dropping the cost of AI architecture experiments by 250x

Why do most large AI research labs swear by scaling and avoid architecture research?

  • What works small often fails big — Architectural innovations that show promise at 1M parameters may break down at 1B or 50B.

  • Validating at scale is expensive — Training from scratch to test a new architecture at meaningful scale can cost at least $5–10M.

  • High risk, uncertain reward — You’re just as likely to degrade performance as to improve it, making architecture exploration financially unsustainable for most labs.

(Image: burning banknotes; photo by Jp Valery on Unsplash)

Training a state-of-the-art language model from scratch costs roughly $5-10M—just to validate a new attention mechanism, recurrence scheme, or memory system.

In our team's experience, it typically takes 20–80 architecture iterations to achieve a 10%+ improvement. We've done this four times over the past two years.

For most AI labs, that level of experimentation would cost around $250 million in research GPU time. From that perspective, it's often more rational to invest in scaling model parameters and datasets for a near-guaranteed performance gain of ~10%.

At Featherless, we believe this bottleneck in architecture validation has slowed progress—not only in capabilities but in reliability.

But what if the cost to validate an architecture dropped from $5 million to $20K?

With that same $250 million, we could run over 12,500 iterations, uncovering 100+ architecture improvements, each with 10%+ gains. Compounded, that’s a theoretical 1,378,000% improvement in performance.
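The arithmetic above can be sanity-checked directly. The sketch below is a back-of-the-envelope calculation using only the figures quoted in this post ($20K per validation run, a $250M budget, 100 compounding improvements of ~10% each):

```python
cost_per_run = 20_000            # dollars per validation run after RADLADS
budget = 250_000_000             # the same research budget as before
runs = budget // cost_per_run    # number of architecture iterations
print(runs)                      # 12500

improvements = 100               # the post's count of 10%+ wins uncovered
compounded = 1.10 ** improvements
percent_gain = (compounded - 1) * 100
print(round(percent_gain, -3))   # ~1,378,000
```

Note that 1.1^100 ≈ 13,781×, i.e. roughly a 1,378,000% gain, which is where the headline number comes from.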

That’s why we’re excited about RADLADS.


Introducing RADLADS

RADLADS (Rapid Attention Distillation to Linear Attention Decoders at Scale) is a new method for converting massive transformer models (e.g., Qwen-72B) into new AI models with alternative attention mechanisms—at a fraction of the original training cost.

  • Total cost: $2,000–$20,000

  • Tokens used: ~500 million

  • Training time: A few days on accessible cloud GPUs (8× MI300)

  • Cost reduction: ~250× reduction in the cost of scientific experimentation

Instead of training from scratch, we convert existing models to new attention architectures in three steps:

  1. Align hidden states between the original transformer and the target attention architecture

  2. Distill output behavior (logits) from the original model

  3. Fine-tune for long-context performance
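The three-step recipe above can be sketched numerically. The following is a toy illustration in plain NumPy, not the actual RADLADS code: a frozen linear map stands in for the original attention block, a trainable one stands in for the new attention architecture, and we run step 1 (hidden-state MSE alignment) followed by step 2 (KL distillation of output logits through a shared frozen head). All dimensions and learning rates are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, V = 64, 16, 32                        # batch, hidden size, vocab (toy)

W_t = rng.normal(size=(D, D)) / np.sqrt(D)  # frozen "teacher" attention block
W_s = rng.normal(size=(D, D)) / np.sqrt(D)  # trainable "student" replacement
H = rng.normal(size=(D, V)) / np.sqrt(D)    # shared frozen output head
X = rng.normal(size=(n, D))                 # hidden states entering the layer

lr = 1.0  # a large step is stable for this tiny quadratic problem

# Step 1: align hidden states (MSE between the two blocks' outputs)
init_gap = np.mean((X @ W_s - X @ W_t) ** 2)
for _ in range(300):
    diff = X @ W_s - X @ W_t
    W_s -= lr * 2 * X.T @ diff / (n * D)    # gradient of the mean squared error
final_gap = np.mean((X @ W_s - X @ W_t) ** 2)

# Step 2: distill output behavior (KL between softmaxed logits)
def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

p_t = softmax(X @ W_t @ H)                  # teacher token distribution (fixed)
for _ in range(300):
    p_s = softmax(X @ W_s @ H)
    W_s -= lr * X.T @ ((p_s - p_t) / n) @ H.T   # chain rule through the head

kl = float(np.mean(np.sum(p_t * (np.log(p_t) - np.log(softmax(X @ W_s @ H))),
                          axis=-1)))

# Step 3 would continue with ordinary cross-entropy fine-tuning on long
# sequences; it needs real data rather than random vectors, so it is omitted.
print(init_gap, final_gap, kl)
```

In the real method the student architecture cannot match the teacher's hidden states exactly, which is why the logit-distillation and long-context stages matter; here they serve only to show the shape of the training loop.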

You can read the process details in our paper on Hugging Face and arXiv. This is the same technique that allowed us to train our latest 72B model attention-free, with only 8 GPUs.

🪶 QRWKV-72B and 32B: Training large attention-free models with only 8 GPUs · Eugene Cheah · March 24, 2025 · Read full story

What does this mean for research?

RADLADS is already changing how we explore AI architecture. We can now:

  • Rapidly test novel attention mechanisms and hybrid designs

  • Iterate on model structures in days, not months

  • Validate alignment and interpretability hypotheses at scale

This isn’t just about RWKV—it opens doors for advancing Transformers, State Space models, xLSTMs, and architectures yet to be imagined. It’s about accelerating our pace of research.

And we’re not doing it alone. Since announcing our work, we've collaborated with other researchers to validate multiple attention mechanisms, including Transformer-based variants.

Reach out to us if your research team or university lab is working on an attention alternative and looking to validate it in collaboration.

It’s all part of our mission to make personalized, reliable AI — and eventually AGI — a reality.

🛣️ Our roadmap to Personalized AI and AGI · Eugene Cheah · March 24, 2025 · Read full story

One more thing:
QRWKV2, based on the RWKV architecture & Qwen 3 models, is already training...

Translation:
A linear GPT-4o-class text model is on its way...
After that, it's O1- and O3-class.

Start building in under 3 minutes