AI transcription and translation in journalism has a “low-resource” language gap, new report finds
A new report from the Center for News, Technology & Innovation (CNTI) reviewed more than 55 studies to better understand the state of AI translation and transcription in journalism. The global research center’s AI and Journalism Working Group has been regularly surveying AI research; its most recent installment, from July, tackled studies on AI literacy and communication.
For this new report, the group combed through the work of social scientists, linguists, and computer scientists. One of the report’s topline findings: there is a major divide between the accessibility and accuracy of AI transcription and translation tools when they’re used for English and other dominant languages, and when they’re used for languages that AI researchers have termed “low-resource.”
More than 50% of the domains on the web are in English. Mainstream language models are largely trained on data scraped from the internet, which is one reason transcription and translation tools perform so well in English. “Low-resource” languages, by contrast, are those with comparatively little digitized text available on the web to train models. Even some of the world’s most widely spoken languages, like Urdu, are considered low-resource.
The working group outlined a few ways this creates accessibility barriers. An AI translation tool may perform very well for a language pair like English and Spanish, but introduce significant errors when it’s used for a pair of less common languages. In particular, AI transcription and translation tools often struggle with “language ambiguity and cultural nuance,” show an inability to perform tasks “at the level of human experts,” and introduce “inherent biases” present in their training data.
Take one study reviewed by the working group, which examined AI translations of international news in Tanzania. Researchers found that 13% of the translated sentences they reviewed contained some kind of mistranslation or inaccuracy. One article, for instance, mistranslated the English term “street food” into Kiswahili as “food of the road.”
The group also found that AI translation tools may struggle to adjust the formality of written statements when moving between different languages. Korean and Japanese, for example, often have stricter rules around formal language. As a result, translations of less formal English into Korean can sometimes be read as “socially inappropriate.”
Some newsrooms in the Global South are working to surmount the challenges of using AI transcription and translation tools in low-resource languages. The working group spotlighted Dubawa, a fact-checking project based in Nigeria, which has been training tools on local dialects and accents in order to more accurately transcribe radio broadcasts.
The working group also pointed to the use of AI transcription tools to cover public meetings as particularly promising for all journalists. That’s in part because in these transcripts “figurative language and wordplay” — both types of speech that AI tools struggle to process — are uncommon.
As for translation, the group says that because mistakes are so common, one of the most promising ways to incorporate AI tools into newsrooms is “hybrid translation” — that is, having humans review AI translations before publication.
You can read the full report from CNTI’s Global AI and Journalism Research Working Group here.
