Fine-Tuned Conversations - Jul 2, 2025

Fine-Tuned Conversations - Building Sawalni: How Omar is Creating AI for Moroccan Darija

In the first part of our series, Fine-Tuned Conversations, we sit down with Omar, creator of Sawalni, who shares his journey from childhood coder to AI innovator.


“I grew up in this digital divide, and this thing is poised to just accelerate it exponentially. Countries who control this technology will have a crazy advantage over others.” This observation drove Omar, a self-taught developer from Morocco, to create Sawalni, an AI assistant built specifically for Moroccan Darija. In this interview, Omar shares his journey from childhood coder to AI innovator, the technical challenges behind creating AI for underrepresented languages, and his vision for Sawalni’s impact on Morocco’s technological landscape.

From Childhood Coder to AI Innovator

Omar’s journey with technology began early. “I started coding when I was six,” he explains. “I was fortunate to have computers around me growing up in the 90s. My father was into digital press and publishing, a hot topic for computers at the time, so I had computers lying around and enough magazines that I got interested in coding.”

This early exposure sparked a lifelong passion. Unlike his gamer friends, Omar found his thrill in programming challenges. “That feeling of overcoming something difficult, I’m getting it at a very high level from programming.”

Growing up in Morocco’s multilingual environment presented both opportunities and obstacles. Students study in Arabic during early education, then switch to French in high school, with university being a mix of languages. This linguistic complexity extended to technology.

“When I was a kid, these computers I had were always in a foreign language,” Omar recalls. “It was both a blessing and a curse. I had a valid reason to learn foreign languages, but it always felt a little bit like an outsider thing.”

This disconnect inspired Omar's early attempts to create technology that reflected his cultural reality. At just thirteen, he tried building an operating system designed for his language. Later, he worked on translating Ubuntu and other initiatives to formalize a Moroccan alphabet.

"I quickly got challenged by the sheer difficulty and scope of the problem," Omar explains, "especially for a language with a complex background like Arabic and Moroccan." These early efforts became formative experiences that shaped his future initiatives and taught him valuable lessons about localization challenges, including internationalization (i18n) and localization (l10n) formats, and particularly the complexities of developing technical vocabulary in Arabic.

The mobile phone revolution in the early 2000s introduced additional complications. "People started sending SMS, but on these initial phones, there was no Arabic alphabet. So people used Latin alphabet approximations of Arabic," Omar recalls. "The problem is this caught on and now it's actually alongside Arabic how many Moroccans write and communicate on a daily basis." This adaptation created a new digital dialect, a phonetic, non-standardized writing system that would later present unique challenges for development.

The Birth of Sawalni

The release of ChatGPT in November 2022 was a turning point. Omar was immediately captivated, spending hours exploring what the AI knew about everything, including his own language and culture, and poking at its limits.

“The initial ChatGPT only had a very high-level understanding of standard Arabic, but no understanding at all of Moroccan,” he observed. This revelation coincided with a growing concern about the digital divide.

“I realized this technology is crazy. I grew up in this digital divide, and this thing is poised to just accelerate it exponentially. Countries who master the use of this technology will have a crazy advantage over others.”

For Omar, the stakes were existential. "In 5, 10, 30 years, what's going to happen? Either many of these countries will remain much further behind, or we will get culturally absorbed by the dominant language."

This led to a simple question: "How hard can this be?" As Omar jokes, these were "famous last words." It took six months before he had anything remotely working, but his expertise in data collection gave him a crucial advantage.


The Technical Journey: Building AI for a Low-Resource Language

Before diving into Omar's specific challenges, it helps to understand what fine-tuning a language model involves. At its core, adapting an AI model like Sawalni requires three key components: powerful computing resources (typically GPUs), a technical framework to manage the training process (such as Axolotl), and most critically, a high-quality dataset in the target language.

While hardware and software frameworks might seem standardized for big tech companies, Omar emphasizes that "when you're training on 'guerilla hardware' like we have to, i.e., whatever you can get your hands on, then the software part is definitely not figured out." This creates layers of technical complexity beyond just finding data.

"There's a lot of challenges that arise, from finding the right versions of libraries that work with the hardware you have at hand, to determining which optimizations can be enabled, which ones cannot, what ends up improving the output and what messes it up. And you have to figure this out on a budget," Omar explains.

This meant extensive experimentation with training hyperparameters like the number of epochs, batch size, and learning rate, where incorrect values could "overcook" the limited data they had painstakingly collected. Omar developed several practical approaches to efficiently navigate these constraints:

"You can't afford one hypothesis per trial most likely. So I recommend developing an intuition and trying multiple changes at once. It's tougher to get used to but it saves more time in the longer run," he advises. "Start from something that's confirmed to work, don't reinvent the wheel in the beginning. Make it work, then adapt to your specific problem gradually. If it stops working at some point, you'll know why."

For teams with minimal resources, Omar suggests maintaining even modest local testing capabilities: "Aim to have some very small training hardware at home, whenever possible. It's not to train real LLMs, but to validate/confirm small hypotheses that do not require a full run. Even a 3060, a mobile 4070 on your university laptop, or a Google Colab might be helpful."

His most important advice comes from hard-won experience: "In general, don't change something if you don't know why. Basically, don't be 'too smart', otherwise prepare for the rabbit holes."
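
To make the hyperparameters mentioned above concrete, here is a minimal sketch of a small fine-tuning run using Hugging Face Transformers. It illustrates the general setup only, not Omar's actual pipeline: the tiny base model, toy dataset, and parameter values are placeholders you would swap for your own.

from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"  # stand-in for a small base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy corpus standing in for a carefully filtered target-language dataset.
texts = ["example sentence one", "example sentence two"]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=1,             # too many epochs can "overcook" a small dataset
    per_device_train_batch_size=2,  # keep small for modest GPUs
    gradient_accumulation_steps=8,  # simulate a larger effective batch size
    learning_rate=2e-5,             # conservative starting point
    logging_steps=10,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

In the spirit of Omar's advice, a configuration like this is something you confirm works end to end on a small model and a small slice of data before changing anything.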

Beyond these technical constraints, the dataset remains the fundamental challenge for low-resource languages like Moroccan Darija. The concept behind Sawalni was clear: Create an AI that is fully Moroccan, capturing cultural values and perspectives typical of Morocco. "Instead of trying to bring Moroccans to the world, what if we could bring the world to Moroccans, on their own terms?"

The first challenge was finding data. Extracting it was the easy part, Omar recalls: “We develop browser automation infrastructure, so that was a piece of cake.” While Moroccans are actively digital and speak their language online, identifying authentic Moroccan Darija content proved difficult. "It's like turtles all the way down," Omar explains, referring to the sparse NLP ecosystem for his target language. "At each step, I found that dependencies and critical foundations don't exist yet."

The problem shifted from scraping data to identifying the language itself. Existing language identification models performed poorly with Moroccan Arabic. This led to a cascading set of technical hurdles (a small normalization sketch follows this list):

  1. Linguistic complexity: “Moroccan is a morphologically complex language with complicated inflections. Variations on a word are actually new words. ‘My phone,’ ‘your phone’, ‘their phone’: in English, each of these is two words. But in Moroccan, they are separate, one-word terms, blowing up the vocabulary size.”

  2. Non-standardized writing: “The way a word is said, there are many ways it could be written because Moroccan is never standardized. People write phonetically. For a single word, there might be 10, 20, 30, 50 variations.”

  3. Informal data: “There are almost no books written in Moroccan, people write in Arabic or French. So most of our data is informal with no corrections, it’s very messy.”
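
To make the spelling-variation problem above concrete, here is a toy Python sketch that collapses phonetic variants of a word under one normalized key. It is purely illustrative, not Sawalni's pipeline: the digit-to-letter map covers only a few common "Arabizi" conventions (which vary by region and writer), and the example variants are hypothetical.

import re
from collections import defaultdict

ARABIZI_DIGITS = {"3": "ع", "7": "ح", "9": "ق"}  # partial map; real usage varies

def normalize(word: str) -> str:
    word = word.lower()
    word = re.sub(r"(.)\1+", r"\1", word)  # collapse repeated letters ("salaaaam" -> "salam")
    for digit, letter in ARABIZI_DIGITS.items():
        word = word.replace(digit, letter)
    return word

# Hypothetical spellings of the same greeting, grouped under one normalized form.
variants = ["salam", "Salam", "salaam", "salaaaam"]
clusters = defaultdict(list)
for v in variants:
    clusters[normalize(v)].append(v)
print(dict(clusters))  # {'salam': ['salam', 'Salam', 'salaam', 'salaaaam']}

A rule like this is far too blunt on its own (it would also merge legitimately distinct words), which hints at why identifying and cleaning the language ends up requiring dedicated models rather than a handful of hand-written rules.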

Ironically, training the LLM itself proved to be the easiest part. “Most of the project was about data collection, ethno-linguistic considerations, and sociological aspects, understanding where Moroccan is spoken, what topics Moroccans talk about, and putting this mirror back to understanding our culture through this big data.”

Lessons for Low-Resource Language AI Development

“Low-resource languages” are those with limited digital presence, languages that have few digitized texts, standardized datasets, or computational tools compared to dominant languages like English. For Moroccan Darija, this meant not only a scarcity of formal written content but also the challenge of identifying and extracting the language from a vast sea of online data.

A critical part of Omar’s work involved developing sophisticated data filtering techniques to strain Darija content out of massive datasets like Common Crawl (an open repository of web content). “Any language is very likely there if you can identify it,” Omar explains. “Any society that has been exposed to the internet, that uses the internet to some measure – and that includes a vast majority of low-resource language societies – the data is there. The challenge is identifying it.”

This filtering process required building custom language identification models to distinguish Moroccan Darija from Standard Arabic, French, and other similar languages in the region. It also meant developing classifiers to filter out problematic content and ensure data quality, along with cleaning up typos and other preprocessing tasks.
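
As an illustration of what such a language identification model can look like, here is a minimal sketch using fastText, a common choice for lightweight text classifiers. It is not Sawalni's actual classifier; the file name and label set are assumptions, using ISO 639-3 codes (ary for Moroccan Arabic, arb for Standard Arabic, fra for French).

import fasttext

# train.txt holds one labelled example per line, e.g.:
#   __label__ary <a Moroccan Darija sentence>
#   __label__arb <a Standard Arabic sentence>
#   __label__fra <a French sentence>
model = fasttext.train_supervised(
    input="train.txt",
    epoch=25,       # smaller corpora usually need more passes
    lr=0.5,
    wordNgrams=2,   # a little context helps separate closely related languages
    minCount=2,     # ignore extremely rare tokens
)

labels, probs = model.predict("a sentence to classify", k=3)
print(labels, probs)
model.save_model("langid.bin")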

Through trial and error, Omar developed effective approaches for training AI on low-resource languages:

Start small: “I recommend training small models first. Many people are tempted to throw a big model at it as a one-shot solution, but if it doesn’t work for a small model, it likely won’t work well for a big one. If it works with a small model, just get more data and scale up.” When Omar talks about a model that "doesn't work," he's referring to specific behaviors: the model fails to properly capture the language patterns, produces incoherent text in the target language, or relies on prior knowledge rather than learning from the training data. As he explains, "The bigger models can give you false assumptions. Since they're large, they might have some prior data that makes it look like your training is working, but it's not. Many people treat LLMs like magic black boxes, but they're just a set of distributions. If the data you have is well-expressed in the distribution and the LLM models it well, you should see output from that distribution, even with a million parameters.”

Embrace classic NLP: “Many classic NLP techniques were invented at a time when data was scarce. While it is tempting to focus on the current trend, training and using LLMs to solve every NLP problem under the sun, there is significant value to uncover from older and proven approaches. We found ourselves progressively reproducing landmark NLP papers, sometimes as far back as the 50s, before tackling more modern developments, supporting and compounding our progress along the way.”

Clean data beats big data: “I should emphasize the importance of clean versus noisy data. It’s better to train on less data that is clean than on more data that is noisy. With noisy data, every time you train, it takes your model in an off direction.” For social media sources, cleaning involved removing extraneous elements like mentions, hashtags, and URLs. The team also faced the challenge of filtering out content from similar languages that had been incorrectly classified: “In the early stages, we learned the painful way that our classifier wasn’t good at distinguishing similar languages. We found pockets of Turkish or Asian and African languages mixed in because of some historical commonalities in Arabic loanwords.”
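
Below is a small sketch of the kind of social media cleanup described above: stripping mentions, hashtags, and URLs before training. The regular expressions are illustrative rather than taken from Sawalni's pipeline.

import re

URL = re.compile(r"https?://\S+|www\.\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")

def clean(text: str) -> str:
    # Remove links, mentions, and hashtags, then squeeze leftover whitespace.
    for pattern in (URL, MENTION, HASHTAG):
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean("salam @someone check https://example.com #fes"))  # -> "salam check"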

Find parallel data: “Another helpful tip is to find parallel data, data that bridges between languages. Even simple translations or word dictionaries are good. They give you secondary paths that take your model from one language to another through these concepts. This is how your model can decently answer questions about quantum mechanics in your native language even though it was never part of the training data.”
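
As a sketch of how parallel data can be folded into training, the snippet below turns a tiny bilingual word list into simple translation prompts. The dictionary entries and prompt format are only illustrative.

# Hypothetical Darija/English dictionary entries.
dictionary = [
    ("khobz", "bread"),
    ("atay", "tea"),
]

def to_training_examples(entries):
    # Emit both translation directions so the model gets a bridge either way.
    examples = []
    for darija, english in entries:
        examples.append(f"Translate to English: {darija}\n{english}")
        examples.append(f"Translate to Darija: {english}\n{darija}")
    return examples

for example in to_training_examples(dictionary):
    print(example, end="\n---\n")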

Building supporting infrastructure: “We ended up building seven more tools before we managed to train properly, tools for language identification, embedding models for understanding topics and similarity between sentences, moderation for cleaning up insults, and a long chain of supporting tools. The majority of work to train an LLM is the pipeline supporting it.”

Evaluate your model smartly: “With this level of uncertainty and amount of experiments, keeping track of what works and what doesn’t is essential. Beyond mechanical metrics such as perplexity, there are two principal evaluation paths you can follow depending on how well your language is supported by other SOTA and frontier models. If your target language is sufficiently understood by another model (it doesn’t have to speak it fluently), then you can follow an LLM-as-a-judge approach. Otherwise, you can make use of scoring techniques based on dictionaries, adapt BLEU and ROUGE metrics and other classic ML-derived techniques, or first train simpler models that are easier to interpret and can help with evaluation. Regardless of the specific approach, nothing beats human evaluation to make sure the model can capture the nuances of your language and culture, so expose it to your community early on. Oh, and don’t forget alignment, especially around culturally relevant topics, but that’s too long of a topic to get into now…”
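
For the classic-metrics route, here is a minimal sketch using the sacrebleu library to score a handful of model outputs against references. The sentences are placeholders; for a morphologically rich, non-standardized language, the character-level chrF metric is often more forgiving than word-level BLEU.

import sacrebleu

hypotheses = ["model output sentence one", "model output sentence two"]
references = [["reference sentence one", "reference sentence two"]]  # one reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")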

Don’t work in isolation: “It’s already challenging as an endeavor, don’t make it worse by isolating yourself. Exchange with other people working on similar initiatives. Find like-minded people who think about similar problems, it does wonders for motivation and who knows, they might have just solved what you’re currently stuck on. I benefitted from more knowledgeable people, and had a lot of fun helping out a team working on Indic languages even though I speak none of them. I always welcome collaboration or simply bouncing even the wildest ideas.”

Omar found the sweet spot for his models between 7 and 30 billion parameters. “Chances are, the model will speak the language well enough and it’s not too dumb for you to actually use.”

Beyond Language: Cultural Alignment

Perhaps the most fascinating challenge Omar encountered wasn’t linguistic but cultural. LLMs have worldviews encoded in their embeddings that reflect specific cultural perspectives.

“I come from a culture in a warm climate where we have cockroaches. The way we deal with cockroaches is with flip-flops. In our minds, when you see a cockroach, the immediate associated concept is a flip-flop. In other regions where they don’t have cockroach problems, they don’t have this association.”

This cultural specificity is embedded in AI systems. “When you fine-tune a model, you don’t fix these associations very well,” Omar explains, “Transfer learning helps you bridge concepts, but it does it through the mindset of that original language. It’s like a foreigner learning your language but still using their own expressions.”

This led Omar’s team to explore what they call “cultural alignment”, mining associations, understanding them and mapping them into a different association space without retraining the whole LLM. The process involves identifying culturally specific concept pairs, analyzing how they're connected in the target culture, and then carefully adjusting these relationships in the model's representation space. "For example, you just want to change some concept associations but keep everything else - it knows the language, but now you just want it to bring up these two things together. You shouldn't have to train it again, or at least not the whole model," Omar explains.
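
One way to start mining associations of this kind is simply to probe an embedding model and see which concepts it places close together. The sketch below is a hedged illustration of that idea, not Omar's method; the multilingual encoder named here is a generic public model used only for demonstration.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

anchor = "cockroach"
candidates = ["flip-flop", "insect spray", "broom"]

emb_anchor = model.encode(anchor, convert_to_tensor=True)
emb_candidates = model.encode(candidates, convert_to_tensor=True)

# Higher cosine similarity suggests a stronger association in the model's representation space.
scores = util.cos_sim(emb_anchor, emb_candidates)[0]
for concept, score in zip(candidates, scores):
    print(f"{anchor} ~ {concept}: {float(score):.3f}")

Comparing such scores across languages or across model checkpoints gives a rough picture of which cultural associations a model has actually absorbed, a natural first step before trying to adjust them.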

Early User Feedback and Future Plans

Though Sawalni hasn’t officially launched, early tests of its chat app with users have yielded promising feedback. A friend said that Sawalni “feels just like chatting with a friend over a cup of coffee”. Omar recalls a farmer who asked about growing crops and animal issues. “One of the added values for such an AI, beyond knowing things, is having a helper with all the patience in the world, something people sorely lack. Many of these people can be vulnerable in this sense; they don't have somebody to validate things with them or comfort them.”

Beyond isolated tests, Sawalni was also exposed to a wider, more international audience as the official AI companion for the FIEP, an international culture festival in Rabat, Morocco, with more than 30 different countries participating.

For future development, voice capabilities top the priority list. “A lot of people in Morocco have this oral tradition, and that would enable the technology a lot.” Regional dialect support is also in the works, with data already collected and awaiting proper identification and separation, another upcoming journey of its own.

The biggest challenge remains building a model with three key qualities: “enough knowledge about the world, good cultural alignment, and small enough that it can be run economically.”

As for feedback management, Omar said that “the most basic approach is to simply collect thumbs-up or thumbs-down reactions, but you don’t have to stop there. Consider what kind of feedback matters to you, and how you plan to act on it. You will need to build or set up an appropriate user interface for your model. We developed our own chat and translation platform at sawalni.com to meet specific needs we have, but you don’t have to go through the same path. In fact, tools like HuggingChat can take you a long way.”
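
As a small sketch of what acting on that advice could look like, the snippet below logs structured feedback to a JSON-lines file instead of a bare thumbs-up counter. The field names and tags are invented for illustration.

import json
import time
from dataclasses import asdict, dataclass, field

@dataclass
class Feedback:
    conversation_id: str
    message_id: str
    rating: int                               # +1 for thumbs up, -1 for thumbs down
    tags: list = field(default_factory=list)  # e.g. ["wrong-dialect", "too-formal"]
    comment: str = ""
    timestamp: float = field(default_factory=time.time)

def log_feedback(fb: Feedback, path: str = "feedback.jsonl") -> None:
    # Append one JSON object per line so the log is easy to stream and analyze later.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(fb), ensure_ascii=False) + "\n")

log_feedback(Feedback("conv-1", "msg-7", rating=-1, tags=["wrong-dialect"]))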

Localized AI in a World of Giants

When asked about the future of localized AI models versus general-purpose giants, Omar offers an insightful perspective:

“You have frontier models in terms of size, and you have frontier models in terms of efficiency, how small can you get? For many low-resource language communities, they don’t have the resources to compete on frontier models in terms of size. But you can definitely compete on efficiency.”

For teams working on languages like Moroccan Darija, competing in the "size race" against tech giants with billion-dollar budgets isn't feasible. However, by focusing on efficiency and creating leaner, more specialized models tailored to specific languages and cultural contexts, these smaller teams can develop solutions that might outperform larger models for their specific use cases.

"An LLM knows so many facts, so many useless things, HTML source code, random numbers from Wikipedia pages," Omar explains. "You don't need all that all the time. If there are ways to trim the fat, there are probably ways to make LLMs much more efficient." This approach is particularly crucial for communities with limited resources who still want to benefit from AI advancements.

He observes that while models like GPT-4 have some ability to understand Moroccan Darija, they haven’t significantly improved in this area over time. “My take is that it’s not part of their scoring or evaluation pipeline. Unless there’s some particular economic reason, I don’t think we should expect to see more drastic improvements from these models.”

“With that said, regardless of pure ability, having independent local players is essential for an inclusive AI ecosystem that does not exacerbate the digital divide even further. Nobody benefits otherwise except Big Tech, whose incentives might not be exactly aligned with your best interests. It is simply a matter of good policy and national sovereignty to support homegrown generative AI capabilities, as it is quickly becoming the primary way new generations learn about the world, eventually subsuming most of the economic, creative activity and the information economy as a whole. This is the informational equivalent of importing all electricity from a foreign nation. Feasible, but not sustainable and strategically not the wisest.” 

This underscores the need for dedicated efforts to serve specific linguistic communities. “Knowing the local data, what people are actually thinking about, and what people need in those regions is very important.”

It doesn’t end when training is over, as Omar highlights. “Don’t forget inference costs as well. It is so tempting to train the largest model you can get your hands on, but do you (or your users) have the resources to run it beyond your initial tests? Smaller models have the benefit of running in all sorts of hardware, and for many developing economies it might be the perfect sweet spot, adding value to users at a reasonable operational cost.”

A Vision Beyond Technology

For Omar, Sawalni represents much more than a technical achievement: “I want Sawalni to be an engine of growth [for Moroccan Darija speakers], for people who want to improve themselves, learn things, grow a business, and need expertise. I want the Moroccan GDP to be positively impacted.”

Looking ahead to the 2030 World Cup in Morocco, he envisions “a virtual Moroccan friend available for any visitor who comes to Morocco, one that can guide them, show them context about things so they’re not just tourists with an outsider perspective.”

Most importantly, he sees Sawalni as an educational equalizer. “With the Internet and so many high quality resources available everywhere, the only barrier for learning is the English language. We want to remove that barrier so the only limitation is your desire to spend time learning. I want Sawalni to be the lifelong tutor, the meticulous assistant, the informed and attentive career coach that people want to have but don’t.”

Connecting with Culture through Technology

Perhaps unexpectedly, this project has deepened Omar’s connection to his heritage. “One thing I really liked going through this project is the way it made me look back at my culture, my society, my language and re-evaluate all these things. It made me re-appreciate a lot of things I hadn’t connected with in a long time.”

This cultural reconnection became integral to the development process. “I found myself reading research papers, social studies, historical studies, old books, just to find glimpses of how certain words were used. It’s a valuable process that I’m enjoying, maybe as much or more than the LLM development itself.”

For Omar, this underscores a fundamental truth about AI development: “I don’t think you can train an AI to do something if you don’t know how to do it yourself. If I don’t understand something well, I’m not able to make the LLM express it.”


Sawalni is currently self-funded by a small team of passionate developers. If you’re looking to accelerate bringing Sawalni to life with time or resources, whether as an individual or business, please reach out to [email protected]