Fine-Tuned Conversations · Jul 16, 2025

Fine-Tuned Conversations II - Mastering the Art of LLM Customization with Auri

For our second interview in the series Fine-Tuned Conversations, we speak with Auri, model creator of Celeste and the EVA series.


“I don’t think it’s not getting enough attention, but it’s at an early stage. I think it’s reasoning fine-tuning, since we now have both sources of reasoning data and relatively inexpensive alignment algorithms like GRPO.”

This forward-looking perspective encapsulates Auri’s approach to LLM fine-tuning, always looking ahead to the next frontier while building on hard-won experience. In this interview, Auri shares their journey from early AI experiments to becoming a pioneering fine-tuning expert, offering invaluable insights into data curation, technical optimization and the future of specialized AI models.

From Diffusion Models to Language Generation

Auri’s entry into the world of AI began in early 2022 with image generation. “I became interested in AI when I first encountered BigGAN+CLIP, an early OSS image-generation setup,” they recall. “Back then most of my foray into AI started and ended with that setup running on a free Google Colab Tesla T4 GPU.”

The landscape shifted dramatically with ChatGPT's release in late 2022. While many were captivated by OpenAI's offering, Auri's interest quickly pivoted to the emerging world of local models. "My interest rather quickly went towards (then very new) KoboldAI and some of the first local models, mostly OPT-based back then."

These early models were far from perfect. "Those early 13Bs were nowhere close to gpt-3.5-turbo in terms of capabilities, and simple creative text generation was basically the peak of their abilities," Auri admits. "But the promise of self-hosting an entire LLM seemed so impossibly futuristic that it gripped my attention." The journey from user to creator began after LLaMA3's release, when Auri met Lemmy, another LLM enthusiast, on Discord. Together, they embarked on an ambitious project that would change everything.

The Birth of Zodiac and Evolution to EVA

"We got an idea to make a small but very human-like conversational model, based on then new LLaMA3-8B," Auri explains. This project, called Zodiac, had modest beginnings but ambitious goals.

The dataset was tiny by today's standards - just 300 rows consisting of handwritten multiturn dialogue examples and conversational data scraped from the internet. The setup was equally modest: "Unsloth, LoRA, L40 from Runpod, a couple of epochs, and we got a result."

Zodiac succeeded at its core mission of creating natural conversation but had significant limitations. "That model was decent at what we wanted - basically an online chat emulator, but it was quite dumb and unstable otherwise."

Rather than seeing this as a failure, the team viewed it as a stepping stone. They decided to transfer the natural prose capabilities into a more conventional roleplay model, similar to popular models like Stheno by Sao10k. This evolution gave birth to Celeste, trained on a mixture of natural and synthetic data, which eventually became the EVA series.

Hard Lessons from Celeste: The Importance of Optimization

The Celeste project provided crucial lessons about the technical complexities of fine-tuning. "Not staying on budget was one of the main problems with Celeste," Auri reveals. "We basically had to use more compute than we needed because we didn't really research the optimizations that were available at the time."

The team made a classic newcomer mistake: throwing hardware at the problem. "The first version of Celeste used four H100s, which is absurd for an 8B model," Auri admits. This experience highlighted a critical gap in the fine-tuning ecosystem.

"One of the biggest issues is the lack of documentation on optimizations. I would recommend really researching various optimizations such as cross entropy and optimizer settings," Auri advises. "They're not documented anywhere. You'll have to look at actual implementations in Axolotl, pull requests, and so on."

The Art and Science of Data Curation

When asked about natural versus synthetic data, Auri provides a nuanced perspective that challenges common assumptions. "Both are useful, but for different things," they explain.

Synthetic data offers predictability: "It's usually low-loss, easy to learn, training on it is easy and very stable, it's good for regularization." However, it comes with limitations. "It's not as diverse as natural data, so you need to filter out rows that are too low-loss, as well as partial duplicates, to avoid memorization."

Natural data presents the opposite profile: "Natural data is noisy, hard to learn and often problematic to train on, but it's significantly more diverse. Its noisiness allows for kicking the model out of synthetic patterns."

For synthetic data creation, Auri employs sophisticated techniques. They use LLM-as-a-judge systems for data rating and filtering, pointing to their RP_Filter tool on GitHub as an example. This tool rates data by three criteria: Vocabulary Diversity, Prose Quality, and Context Length Usage, each scored from 1 to 10.
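
For readers curious what such an LLM-as-a-judge filter can look like in practice, here is a minimal sketch of the idea; the judge model, prompt wording, and pass threshold are illustrative assumptions rather than the actual RP_Filter code:

```python
# Minimal LLM-as-a-judge sketch: score each row on the three criteria and keep
# only rows above a threshold. Judge model, prompt wording, and threshold are
# illustrative placeholders, not the actual RP_Filter implementation.
import json
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint works here

JUDGE_PROMPT = """Rate the following roleplay sample from 1 to 10 on:
- vocabulary_diversity
- prose_quality
- context_length_usage
Reply with JSON only, e.g. {{"vocabulary_diversity": 7, "prose_quality": 6, "context_length_usage": 5}}

Sample:
{sample}"""

def rate_sample(sample: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(sample=sample)}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

def keep(sample: str, min_score: int = 6) -> bool:
    # Keep a row only if every criterion clears the (arbitrary) threshold.
    return all(score >= min_score for score in rate_sample(sample).values())
```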

"Another technique that would be interesting to highlight is generating multiturn data in one or two turns via structured output like JSON," Auri shares. "This avoids the problem of data becoming more repetitive and structurally similar at scale."

Data preparation varies by type. For synthetic data: checking for overused n-grams, identifying soft duplicates, and LLM-based quality rating. For natural data: removing "poison" like symbol spam and links, grammar and punctuation checks, and format adaptation for chat models.
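
Two of those synthetic-data checks, overused n-grams and soft duplicates, are simple enough to sketch; the thresholds and the similarity measure below are illustrative assumptions:

```python
# Sketch of two synthetic-data checks mentioned above: overused n-grams and
# "soft" (near-)duplicate rows. Thresholds and the similarity measure are
# illustrative assumptions.
from collections import Counter
from difflib import SequenceMatcher

def overused_ngrams(rows: list[str], n: int = 4, top_k: int = 20):
    # Count word n-grams across the dataset and surface the most repeated ones.
    counts = Counter()
    for text in rows:
        tokens = text.lower().split()
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(top_k)

def soft_duplicates(rows: list[str], threshold: float = 0.9):
    # Flag pairs of rows whose character-level similarity exceeds the threshold.
    # O(n^2), fine for small sets; use MinHash or embeddings at scale.
    pairs = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if SequenceMatcher(None, rows[i], rows[j]).ratio() >= threshold:
                pairs.append((i, j))
    return pairs
```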

Technical Challenges: Context Length and Model Capabilities

One persistent challenge in fine-tuning is handling longer contexts. "It's hard to get longer data. It almost always caps out at like 16K maximum, and there's usually only a few examples even of that," Auri explains. Most data falls in the 6-8K token range.

Even base models struggle with this limitation. "Recent Mistral models break at 16K, break hard, start spamming tokens," they note. "Gemma 3 seems to be a bit better - it goes to 22K."

When discussing what's easiest and hardest to instill through fine-tuning, Auri provides clear guidance. Easy targets include general prose style and knowledge addition. "Any generalized, abstract quality that doesn't depend on a specific behavior is usually not that hard to add to the model."

The real challenges come with specific behaviors: "Specific behavior, especially generalized specific behavior akin to inline tool calls or especially reasoning, is usually quite tough." With reasoning, the complexity multiplies: "You not only have to make that behavior consistent, you also have to make the model not depend on previous reasoning traces in context to reason, to make it not dissociate reasoning from response, to not reason in circles."

The Reinforcement Learning Revolution

Auri sees reinforcement learning as a game-changer for the fine-tuning community. "I'd say reinforcement learning is not getting enough attention in the more individual training community. It's pretty hard, but it's really rewarding when done properly."

The evolution of RL techniques has been rapid, from early DPO implementations to RLHF for safety, and now to advanced methods like PPO and GRPO. The cost barriers have fallen dramatically: "GRPO, while still computationally expensive compared to offline RL, is significantly cheaper than PPO and is better implemented in open finetuning toolkits."
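
For a sense of how approachable this has become, here is a minimal GRPO sketch using the open-source TRL library; the base model, dataset, and toy reward function are placeholders, and the exact API should be checked against the TRL version you install:

```python
# Minimal GRPO sketch with the TRL library. Base model, dataset, and the toy
# length-based reward are placeholders; verify the API against your TRL version.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")  # placeholder prompt dataset

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 200 characters.
    return [-abs(200 - len(c)) for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder small base model
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="grpo-sketch", logging_steps=10),
    train_dataset=dataset,
)
trainer.train()
```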

The Democratization of Fine-Tuning

Perhaps most exciting is how accessible fine-tuning has become. "In the span of a year, fine-tuning costs became more than 4x lower thanks to Liger kernels, cut cross entropy, FSDP2, sequence parallelism, and many other things," Auri explains.
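
As a small illustration of how low-friction some of these optimizations are, the Liger kernels can be patched into a Hugging Face Transformers model at load time; the class name below follows the liger-kernel project's documented usage, so treat it as an assumption to verify against the release you install:

```python
# Sketch: loading a model with the Liger kernels patched in. The class name
# follows the liger-kernel project's documented usage; verify against the
# current release. The base model is a placeholder.
import torch
from liger_kernel.transformers import AutoLigerKernelForCausalLM

model = AutoLigerKernelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # placeholder base model
    torch_dtype=torch.bfloat16,
)
# Fused kernels (RMSNorm, RoPE, SwiGLU, fused linear cross-entropy) are swapped
# in, cutting peak memory and speeding up training without changing the math.
```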

The hardware requirements have plummeted. "When I started, you had QLoRA and 8-bit optimizers as your only real cost cutters, so tuning a 70B was basically prohibitively expensive for an individual. Now it's quite doable, even on a HEDT home setup with, say, 4x or even 2x 3090s."
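
The kind of setup that makes this possible is QLoRA: quantize the frozen base model to 4-bit and train only a small LoRA adapter on top. A minimal sketch with bitsandbytes and peft, using placeholder model names and hyperparameters:

```python
# QLoRA sketch: 4-bit quantized base model plus a LoRA adapter, the combination
# that makes large models trainable on consumer GPUs. Model name and
# hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",  # placeholder 70B base
    quantization_config=bnb_config,
    device_map="auto",              # shard across the available GPUs
)

lora = LoraConfig(r=32, lora_alpha=32, target_modules="all-linear", task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapter weights are trained
```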

Looking ahead, Auri is particularly excited about state tuning for recursive models and automated fine-tuning systems. "You basically just give a PDF and get a model specialized in it. It's amazing," they marvel about emerging automated approaches.

The Future of Specialization

Despite improvements in base models, Auri believes specialized fine-tuning will remain crucial. "I think there will always be a place for them since base models almost always play it safe or have their own quirks."

They draw a parallel to image generation: "It's basically like Stable Diffusion fine-tunes - they cater for specific style usually or something you exactly want. I think specialization will always be the thing in LLMs."

The fundamental limitation of generalist models ensures a future for fine-tuning: "I don't really believe in perfect generalists. A model can't know everything and can't do everything ideally. There will always be something that the model needs to be nudged towards a bit."

Practical Wisdom for Aspiring Fine-Tuners

Throughout the interview, Auri shares hard-won insights for those starting their fine-tuning journey:

On Getting Started: "Use Axolotl, look for configs that people posted. I post them for that exact reason because there's not much guidance online besides that. Use what other people use. Look at what they do. Try to understand why they do that."

On Hardware: "Don't go for H100 right away. You don't need it. 3090s are enough for most things, really, and all things you would want when starting out."

On Experimentation: "Do small fine-tunes, 5,000 rows or so of data, tune on them, look at the results, evaluate it, save that, adjust your configs and so on."

On Failure and Persistence: "Even failed tunes can be salvaged. Merging is basically magic." The importance of iteration cannot be overstated: "Also save your iterations; even if you produce five broken fine-tunes, you might be able to make a working model from them via merging," Auri emphasizes. This approach helps combat the demotivation that often plagues beginners.
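
The simplest form of such a merge is a straight weight average of two fine-tunes that share a base model; community tools like mergekit offer far more sophisticated methods, and the model names below are placeholders:

```python
# Sketch of the simplest merge: a 50/50 weight average of two fine-tunes that
# share the same base architecture. Model names and the output path are
# placeholders; dedicated tools like mergekit offer more advanced methods.
import torch
from transformers import AutoModelForCausalLM

a = AutoModelForCausalLM.from_pretrained("my-org/broken-tune-a", torch_dtype=torch.bfloat16)
b = AutoModelForCausalLM.from_pretrained("my-org/broken-tune-b", torch_dtype=torch.bfloat16)

merged, other = a.state_dict(), b.state_dict()
for name, tensor in merged.items():
    tensor.copy_((tensor + other[name]) / 2)  # linear interpolation, in place

a.save_pretrained("merged-tune")  # `a` now holds the averaged weights
```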

Community Feedback and Continuous Improvement

Working with the passionate RP community has taught Auri valuable lessons about user feedback. "I try to interpret this feedback into more generalized form," they explain. "When users describe their specific experience, there's always some points that can be generalized into specific behaviors of the model."

This process involves understanding that models don't interpret training data literally. "You can usually pinpoint spots in the data which cause specific behaviors. By using this feedback, you can find desirable and undesirable behaviors in the data, even if they don't quite look like the experience that's been described to you."

Current Projects and Future Directions

Auri remains committed to pushing boundaries. The next EVA model will likely be a reasoning model, incorporating the latest advances in RL and reasoning capabilities. Smaller experimental projects on Mistral Small 3.2 and GLM4 models are also in the works.

For those interested in following Auri's work, the primary sources are Hugging Face and two organizations: EVA-UNIT-01 and Allura Org. "Also on Allura Org, my friends do cool stuff too. I recommend checking them out," Auri adds.

A Vision for Accessible AI Innovation

As our conversation concludes, Auri's enthusiasm for the democratization of AI shines through. From those early days experimenting with OPT models on Google Colab to creating sophisticated fine-tunes that push the boundaries of what's possible, their journey exemplifies the power of community-driven innovation.

"It's like an ever-learning process," Auri reflects. And in that process, they're not just creating better models - they're helping to ensure that the future of AI remains open, accessible, and driven by passionate individuals who refuse to accept that cutting-edge AI is the exclusive domain of tech giants.

The message is clear: with the right knowledge, modest hardware, and persistence, anyone can contribute to advancing the state of AI. As Auri's work demonstrates, sometimes the most innovative solutions come not from massive corporations, but from dedicated individuals willing to experiment, fail, learn, and share their knowledge with the community.