Research Suggests A Large Proportion Of Web Material In Languages Other Than English Is Machine Translations Of Poor Quality Texts
The latest generative AI tools are certainly impressive, but they bring with them a wide range of complex problems, as numerous posts on Techdirt attest. A new academic paper, published on arXiv, raises more of them, but from a new angle. Entitled “A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism”, it studies the impact of today’s low-cost AI translation tools on the online world:
We explore the effects that the long-term availability of low cost Machine Translation (MT) has had on the web. We show that content on the web is often translated into many languages, and the quality of these multi-way translations indicates they were primarily created using MT.
“Multi-way” in this context means that the same sentence can be found in translation in several different languages at once. According to the researchers, of the 6.38 billion sentences studied, 2.19 billion are found in multi-way translations. In particular, languages that appear less frequently online had more multi-way sentences, with disproportionately more found among the rarest languages. Another key feature observed is that highly multi-way parallel translations are “significantly worse” in quality than two-way translations. Moreover, the multi-way data consisted of shorter, more predictable sentences compared to two-way translations. Inspecting a random sample of 100 highly multi-way parallel sentences, the researchers found:
the vast majority came from articles that we characterized as low quality, requiring little or no expertise or advance effort to create, on topics like being taken more seriously at work, being careful about your choices, six tips for new boat owners, deciding to be happy, etc. Furthermore, we were unable to find any translationese or other errors that would suggest the articles were being translated into English (either by human translators or MT), suggesting it is instead being generated in English and translated to other languages.
Taking these observations together, the paper suggests that highly multi-way sentences are generated using AI, specifically machine translations of low-quality English-language originals. Further analysis showed that in the languages found less commonly online, most translations are multi-way parallel, which means that AI content dominates translated material in those languages. In addition:
a large fraction of the total sentences in lower resource languages have at least one translation implying that a large fraction of the total web in those languages is MT generated
In other words, however bad the problems are that AI is creating for English-language material, they are probably worse in languages found less commonly online, since a major proportion of the Web in those languages is generated by machines, not humans.
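To make the idea of multi-way parallelism more concrete, here is a minimal Python sketch of how translated sentence pairs might be grouped into sets of mutual translations, and how one could then measure what share of a language's sentences sit in multi-way sets. It is not the researchers' code: the function names, the threshold of three languages, and the placeholder sentences are all assumptions made for illustration.

```python
from collections import defaultdict

def build_tuples(pairs):
    """Group sentence pairs into sets of mutual translations (union-find).

    Each sentence is a (language, text) pair. Two sentences end up in the
    same set if a chain of translation pairs connects them.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for sentence in list(parent):
        groups[find(sentence)].add(sentence)
    return list(groups.values())

def parallelism(tup):
    """Number of distinct languages in one translation set (assumed definition)."""
    return len({lang for lang, _ in tup})

def multiway_share(tuples, lang, min_parallelism=3):
    """Fraction of `lang` sentences found in sets spanning >= min_parallelism languages."""
    total = multi = 0
    for tup in tuples:
        n = sum(1 for l, _ in tup if l == lang)
        total += n
        if parallelism(tup) >= min_parallelism:
            multi += n
    return multi / total if total else 0.0

# Placeholder data: the texts are labels, not real translations.
pairs = [
    (("en", "boat tips"), ("fr", "boat tips (fr)")),
    (("en", "boat tips"), ("sw", "boat tips (sw)")),
    (("fr", "boat tips (fr)"), ("yo", "boat tips (yo)")),
    (("en", "council meeting"), ("fr", "council meeting (fr)")),
]
tuples = build_tuples(pairs)
print(sorted(parallelism(t) for t in tuples))
print(multiway_share(tuples, "sw"), multiway_share(tuples, "fr"))
```

In this toy data, every sentence in the two rarer languages belongs to a multi-way set, while only half of the French ones do, which mirrors the pattern the paper reports at Web scale.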
If this conclusion holds true beyond the dataset studied by the researchers, there is another interesting issue. Generative AI depends on large training sets, which often come from the Web. For languages other than English, the new paper suggests that much of the training material will be translations by AI of low-quality, possibly AI-generated texts. This issue of generative AI feeding on itself has been studied in earlier research. One group summarized their results on “The Curse of Recursion” as follows:
We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We refer to this effect as Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs [Large Language Models]. We build theoretical intuition behind the phenomenon and portray its ubiquity amongst all learned generative models. We demonstrate that it has to be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web.
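The disappearance of the tails can be illustrated with a toy simulation. The Python sketch below is not the experiment from the quoted paper: it assumes a "model" that is nothing more than a table of word frequencies, repeatedly retrained on a small sample of its own output. The names, probabilities and sample size are invented for illustration.

```python
import random
from collections import Counter

def next_generation(dist, n_samples, rng):
    """Sample from the current model, then re-estimate it from that sample alone."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    sample = rng.choices(tokens, weights=weights, k=n_samples)
    counts = Counter(sample)
    return {t: c / n_samples for t, c in counts.items()}

rng = random.Random(0)  # fixed seed so the run is repeatable
dist = {"common": 0.90, "uncommon": 0.09, "rare": 0.01}

support_sizes = [len(dist)]
for _ in range(20):
    dist = next_generation(dist, n_samples=50, rng=rng)
    support_sizes.append(len(dist))

# Number of distinct words the model can still produce, per generation.
print(support_sizes)
```

Once a rare word fails to appear in a sample, its estimated probability drops to zero and no later generation can produce it again, so the number of distinct words can only shrink. That is the irreversibility the quoted summary describes, in miniature.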
The new research suggests that model collapse is likely to be a more serious problem when building generative AI systems in languages for which there is less material online that can be used for training. The good news is that the presence of multi-way sentences in languages other than English is a strong indication that they have been produced by AI, which offers a means to spot them and filter them out. The bad news is that if this technique is applied to improve the quality of training materials and avoid “model collapse”, the extra processing it requires will make the already energy-hungry process of training generative AI systems even more damaging for the planet.
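As a rough idea of what such filtering could look like, the Python sketch below splits candidate training sentences according to how many languages each one was found in. It assumes that count has already been computed, and the cut-off of two languages, the function name and the example records are all made up for illustration. High multi-way parallelism is an indicator of machine translation rather than proof, so a real filter would need more care than this.

```python
def split_by_parallelism(records, max_parallelism=2):
    """Separate sentences into those kept for training and those dropped.

    `records` is an iterable of (sentence, n_languages) pairs, where
    n_languages is how many languages the sentence was found translated in.
    """
    kept, dropped = [], []
    for sentence, n_languages in records:
        if n_languages <= max_parallelism:
            kept.append(sentence)
        else:
            dropped.append(sentence)
    return kept, dropped

# Invented example records: (sentence, number of languages it appears in).
records = [
    ("Six tips for new boat owners", 14),
    ("Deciding to be happy", 9),
    ("The council approved the harbour budget on Tuesday", 2),
    ("Local rainfall figures for March", 1),
]

kept, dropped = split_by_parallelism(records)
print(len(kept), "kept,", len(dropped), "dropped")
```

The filter itself is cheap. The expensive part is the step assumed here to be already done: finding out which sentences across billions of Web pages are translations of one another.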