How a Scots Wikipedia scandal highlighted AI’s data problem
Most of the English-language technology you use daily, from voice assistants and spell checkers to translation tools and search functions, shares a common origin story: it is built on AI language models, and many of those models are trained on millions of Wikipedia articles.
But a bizarre discovery made this week by a Scottish Reddit sleuth has highlighted a worrisome problem for that data pipeline. Most of the Scots-language edition of Wikipedia was written by an American teenager who doesn’t actually speak the language. Instead, the teen wrote tens of thousands of articles in English with a put-on Scottish accent, ignoring actual Scots grammar and vocabulary.
For a low-resource language like Scots, which has few digital archives of written text to draw on, the scandal could mean that some models base their entire understanding of the language on the phony version found on Scots Wikipedia, limiting native speakers’ access to technology that actually works in their language.