
By Gretel Kahn
Since the launch of ChatGPT in 2022, newsrooms have been grappling with both the promise and the peril posed by generative AI. But not every publisher is equally prepared to pursue these opportunities. While newsrooms in the U.S. and Europe innovate and experiment with large language models (LLMs), many newsrooms in the Global South are being left behind.
While AI models in languages like English, Spanish, and French have ample training data and resources, profound linguistic and cultural biases embedded within many mainstream AI tools pose substantial challenges for newsrooms and communities operating outside of dominant Western languages and cultural contexts.
For these newsrooms, gaps in data aren’t merely technical glitches but existential threats to their ability to participate equitably in this evolving digital ecosystem. What threats does this lack of data present to newsrooms in the Global South? How can these AI gaps be narrowed? To answer these questions, I spoke to six journalists and experts from India, the Philippines, Belarus, Nigeria, Paraguay, and Mali who are aiming to level the field.
What AI can (and can’t) do for low-resource languages
AI tools do not work well (or at all) for local, regional, and Indigenous languages, according to all the sources I spoke to. The gap is particularly evident in countries where many such languages are spoken alongside dominant ones like English, Spanish, or French, and it exacerbates inequalities for newsrooms and communities that operate in non-dominant languages.
Jaemark Tordecilla is a journalist, media advisor, and technologist from the Philippines focusing on AI and newsroom innovation. Having worked as a consultant for newsrooms across Asia, Tordecilla has seen a lot of curiosity among journalists about AI, but uptake has been mixed. For transcriptions — the AI functionality he has seen used the most — cost and language have often been an issue.
“For the longest time, [transcription] worked well for English and it didn’t work at all for Filipino, so journalists in the Philippines are only very recently getting to use these tools for their own reporting, but cost is an issue,” he said.
He described instances where journalists in the Philippines were forced to share accounts for paid subscriptions to transcription tools, which can create security issues. Things are even worse for journalists who use regional languages, for which these tools are essentially useless.
“The roll-out for support for regional languages has been slow and they are being left behind,” he said. “If your interview is in a regional language, then obviously you can’t use AI to process that. You can’t use AI to translate that into another language. You can’t use AI to say monitor town hall meetings in a regional language and make [them] more accessible to the public.”
Indian journalist Sannuta Raghu, who heads Scroll.in’s AI Lab and is now a journalist fellow at the Reuters Institute, has documented what these linguistic and cultural inequities look like in practice for newsrooms like hers.
AI tools don’t work very efficiently for most of the 22 official languages in India. Raghu listed issues like inaccurate outputs, hallucinations, and incorrect translations. A big issue, she said, is that, for a long time, AI tools were unable to account for nuances in language. Unlike English, for example, many Indian languages have large differences between spoken and written discourse.
“Asian countries speak in code-mixed language — for example, using multiple Hindi words and English words in a normal conversation,” she said, “which means we need rich enough data to be able to understand that.”
If there isn’t sufficient “good data” to train models on these specific languages and contexts, Raghu said, linguistic and cultural inequities are going to happen. Raghu attributed this lack of training data to a combination of complexity and lack of interest by Big Tech. But she also said the situation is starting to improve.
“Is it really a priority for you to optimize for all those languages in India from a tech product sales perspective? As you move eastward, the complexity of how societies use language changes. It becomes far more multilingual. It becomes far more complex. It becomes far more code-mixed. Those are the complexities that we live with, and that is not reflected in any of the models,” Raghu said.
AI tools miss political and cultural nuance
Beyond these inefficiencies, my sources also pointed out the cultural and political nuances that are missing from these models, which make them even more problematic for newsrooms.
For example, Raghu said that newsrooms are already noticing an American slant in everything AI generates. She described an instance where her team tested a tool to see how useful it would be for writing copy about cricket. For something as simple as explaining the sport, she said, the model hallucinated, inventing players and failing to grasp the rules of the game.
“Up to 2.6 billion people follow cricket. It’s a huge cultural thing for us, Australia, Bangladesh, England…but the U.S. doesn’t play cricket, which is why a lot of the cultural aspects of this are not included in the models,” she said. “There is a lack of contextual training data. Cricket is very important for us, but we’re not able to do this with the LLMs because these models don’t quite understand the rules.”
Daria Minsky is a Belarusian media innovation specialist focusing on AI applications in journalism. After working with several newsrooms in exile, she has seen a lot of skepticism toward AI use — not because of factual errors, but because some of these models lack nuance in politically sensitive contexts.
Minsky used her own country as an example of how LLMs may simply repeat narratives put forth by authoritarian regimes.
“The word Belarusian is very politically loaded in terms of how you spell it, so I compared different models. ChatGPT actually spells the democratic version of it, while DeepSeek uses the old, Soviet version of it,” she says. “It’s Belorussian versus Belarusian. Belorussian is this imperialistic term used by the regime. If newsrooms use Belorussian instead of Belarusian, they risk losing their audience and their reputation.”
AI models are trained based on data available online, which is why these models are more fine-tuned to English (and its contexts) than, say, Belarusian. Since the official narrative of authoritarian regimes is what is most available online, AI is being trained to follow that narrative.
These gaps in the training system have already been exploited by bad actors. A recent study by NewsGuard revealed that a Moscow-based disinformation network is deliberately infiltrating the retrieved data of AI chatbots, publishing false claims and propaganda for the purpose of influencing the responses of AI models on topics in the news. The result is more output that propagates propaganda and disinformation.
“I’ve heard of the same problems in Burma, for example, because the opposition uses Burma instead of Myanmar. I can see this type of problem even within the United States where I’m based, with the debate around using ‘Gulf of Mexico’ or ‘Gulf of America’ because Trump started to rename things just like in other dictatorial regimes,” she says.
How to close the gap
Despite these issues, some newsrooms in the Global South are taking the development of AI tools into their own hands.
Tama Media, a francophone West African news outlet, has launched Akili, an app, currently in beta, that provides fact-checking in local African languages through vocal interaction. Moïse Mounkoro, the editorial advisor in charge of Akili, traced the origin of the idea to two things: the fact that misinformation is rampant in West Africa and the realization that many people communicate via voice messages rather than through reading and writing.
“Educated people can read articles and they can fact-check,” Mounkoro said. “But for people who are illiterate or who don’t speak French, English or Spanish, the most common way to communicate is orally. I’m originally from Mali, and most of the conversations there are through WhatsApp voice messages. If you want to touch those people, you need to use their language.”
The Akili app uses AI to fact-check information by taking the question and finding an answer through its database of sources, which range from BBC Africa to Tama Media. The answer is then given orally to the user. To include more African languages like Wolof, Bambara, and Swahili, Akili’s team is experimenting with either using Google Translate’s API or building their own through online dictionaries.
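The article doesn't describe Akili's internals, but the retrieval step it outlines — matching a user's question against a database of trusted sources and returning the best answer — can be sketched roughly as follows. The names, scoring, and sample entries are illustrative assumptions, not Akili's actual code:

```python
# Illustrative sketch of a retrieval step like the one described above:
# match a user's question against a small database of trusted fact-checks
# and return the best-scoring entry. A real system would then render the
# verdict as a voice reply in the user's language.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class FactCheck:
    source: str   # e.g. "BBC Africa" or "Tama Media"
    claim: str
    verdict: str

def tokenize(text: str) -> set[str]:
    """Lowercase words with basic punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()}

def best_match(question: str, db: list[FactCheck]) -> FactCheck | None:
    """Return the fact-check whose claim shares the most words with the question."""
    q = tokenize(question)
    scored = [(len(q & tokenize(f.claim)), f) for f in db]
    score, fact = max(scored, key=lambda pair: pair[0])
    return fact if score > 0 else None

db = [
    FactCheck("BBC Africa", "Drinking hot water cures malaria", "False"),
    FactCheck("Tama Media", "Mali's election was postponed", "True"),
]
match = best_match("Does hot water cure malaria?", db)
```

Production fact-checkers would use semantic embeddings rather than word overlap, but the overall shape — question in, trusted verdict out — is the same.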
“These AI technologies came from the West so they focus on their own languages. It’s like cameras: in the beginning, they were not made to photograph Black skin,” he says. “People need to be aware that they need to integrate other languages. Google has at least tried to make an effort by integrating many African languages these last couple years.”
In Paraguay, the digital news outlet El Surti is developing GuaraníAI. While the project is still in development, the goal is to build a chatbot that can detect whether someone is speaking in this language and provide them with a response. To do that, the team is developing a dataset of spoken Guaraní so that LLM engines can recognize oral speech in this Indigenous language, spoken by almost 12 million people.
Sebastián Auyanet, who leads the project, told me they wanted to explore how those who don’t speak dominant languages are shut out from accessing LLMs. Guaraní is still spoken widely throughout the Southern Cone, mainly in Paraguay, where it is an official language along with Spanish. Up to 90% of the non-Indigenous population in Paraguay speaks Guaraní.
“Guaraní is an oral, unwritten language,” Auyanet said. “What we need is for any LLM engine to be able to recognize Guaraní speech and to be able to respond to these questions in Spanish. News consumption may be turning into ChatGPT, Perplexity, and other models. So there’s no way to get into that world if you speak in a language none of these systems can use.”
El Surti is organizing hackathons throughout Paraguay to test this dataset, which is being built on Mozilla’s Common Voice, a platform designed to gather voice data for diverse languages and dialects for use by LLMs. The team’s minimum viable product aims to achieve 70% validation for spoken Guaraní within Mozilla’s datasets; with that degree of validation, the chatbot would be able to respond to queries in this Indigenous language.
Ideally, Auyanet said, this project will allow El Surti to build an audience of Guaraní speakers who eventually will be able to interact with the outlet’s coverage by asking the chatbot questions in their own language.
“Right now we are excluding people who are only Guaraní speakers from the reporting that El Surti does,” he said. “This is an effort to bring them closer.”
In Nigeria, The Republic, a digital news outlet, is developing Minim, an AI-powered decentralized text-to-speech platform designed to support diverse languages and voices. They are actively training an AI model on specific African languages such as Nigerian Pidgin, Hausa, and Swahili, with plans to add more over time. The team is aiming to have a minimum viable product by the end of the year.
This model would allow independent creators to lend their own voices and train the AI in their unique vocal quality, including age, regional accents, and other demographics.
Editor-in-chief Wale Lawal told me that their goal is to engage audiences who speak these languages and to make their outlet more relevant to them. “We believe that global media has an indigenous language problem,” he said. “In a place like Africa you have a lot of people who are just automatically locked out of global media because of language.”
Minsky, the media consultant from Belarus, has been working with newsrooms in exile to develop an AI tool that allows for the automation of news monitoring from trusted sources. Her goal is to account for all the cultural and political nuances missing from the current models by allowing newsrooms to monitor very specific contextual sources, including local and hyper-local channels like Telegram.
This would include uploading archival and historical data to fine-tune the output, and explicit prompting to control terminology and spelling (e.g., “don’t call Lukashenko president”).
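Minsky's approach of constraining terminology through explicit instructions can be approximated at the prompt level. The sketch below shows a hypothetical system prompt plus a post-generation guardrail that flags regime terminology in model output; it is a simplification for illustration, not her actual tool:

```python
# Hypothetical sketch of prompt-level terminology control like the
# example above ("don't call Lukashenko president"). A real system
# would pass SYSTEM_PROMPT to an LLM API; here we show only the
# prompt and a post-generation check that flags banned terms.
SYSTEM_PROMPT = """You are a news-monitoring assistant for a Belarusian
newsroom in exile. Terminology rules:
- Write "Belarusian", never the Soviet-era "Belorussian".
- Do not refer to Lukashenko as "president".
"""

# Maps each banned term to the newsroom's preferred wording.
BANNED_TERMS = {
    "belorussian": "Belarusian",
    "president lukashenko": "Lukashenko",
}

def flag_terminology(text: str) -> list:
    """Return the banned terms found in a model's output, for editor review."""
    lower = text.lower()
    return [term for term in BANNED_TERMS if term in lower]

issues = flag_terminology("President Lukashenko met Belorussian officials.")
```

A guardrail like this catches the cases Minsky describes even when the model ignores its instructions, which is why such checks are typically layered on top of prompting rather than replaced by it.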
No newsroom left behind
Newsrooms worked without AI for decades before the first generative tools were released to the public. So what’s the problem if these tools are not available to everyone?
Tordecilla stressed that the gap between newsrooms is widening and pointed to what is already happening: newsrooms in Manila are doing AI-assisted investigations, while newsrooms in the rural Philippines that could most benefit from the efficiencies AI provides are struggling to survive.
“We are talking about newsrooms who have five people on their team, so every minute they spend on transcription is a minute they don’t spend reporting and editing stories,” Tordecilla said. “These newsrooms desperately need the help these technologies provide, but they’re the ones being left out because they work in languages that are considered low-resource, so they are not a big priority for tech companies to support.”
Raghu said that addressing the AI digital gap is important for access, scale, and reaching a broader audience. She mentioned a specific goal her newsroom had when it started its AI lab: to build a multilingual text-to-video tool, as video content is extremely popular in India. With hundreds of millions of Indians using smartphones and having access to very cheap internet, there is a significant opportunity to deliver journalism via video in local languages. The team has created Factivo 1.0, a URL-to-MP4 tool that creates accurate videos from news articles in a matter of minutes and is available in major Indian languages.
“For a small English-language newsroom like us, the universe available to us right now is about 30 million Indians who understand English,” she explains. “But to tap into the larger circle that is India, which is 1.4 billion people, we would need to be able to do language at scale.”
Courtesy: https://www.niemanlab.org/2025/06/these-journalism-pioneers-are-working-to-keep-their-countries-languages-alive-in-the-age-of-ai-news/