Revealed: The Authors Whose Pirated Books Are Powering Generative AI

20.08.2023 00:04

TheAtlantic.com

Stephen King, Zadie Smith, and Michael Pollan are among thousands of writers whose copyrighted works are being used to train large language models.

One of the most troubling issues around generative AI is simple: It’s being made in secret. To produce humanlike answers to questions, systems such as ChatGPT process huge quantities of written material. But few people outside of companies such as Meta and OpenAI know the full extent of the texts these programs have been trained on.

Some training text comes from Wikipedia and other online writing, but high-quality generative AI requires higher-quality input than is usually found on the internet—that is, it requires the kind found in books. In a lawsuit filed in California last month, the writers Sarah Silverman, Richard Kadrey, and Christopher Golden allege that Meta violated copyright laws by using their books to train LLaMA, a large language model similar to OpenAI’s GPT-4—an algorithm that can generate text by mimicking the word patterns it finds in sample texts. But neither the lawsuit itself nor the commentary surrounding it has offered a look under the hood: We have not previously known for certain whether LLaMA was trained on Silverman’s, Kadrey’s, or Golden’s books, or any others, for that matter.

In fact, it was. I recently obtained and analyzed a dataset used by Meta to train LLaMA. Its contents more than justify a fundamental aspect of the authors’ allegations: Pirated books are being used as inputs for computer programs that are changing how we read, learn, and communicate. The future promised by AI is written with stolen words.

Upwards of 170,000 books, the majority published in the past 20 years, are in LLaMA’s training data. In addition to work by Silverman, Kadrey, and Golden, nonfiction by Michael Pollan, Rebecca Solnit, and Jon Krakauer is being used, as are thrillers by James Patterson and Stephen King and other fiction by George Saunders, Zadie Smith, and Junot Díaz. These books are part of a dataset called “Books3,” and its use has not been limited to LLaMA. Books3 was also used to train Bloomberg’s BloombergGPT, EleutherAI’s GPT-J—a popular open-source model—and likely other generative-AI programs now embedded in websites across the internet. A Meta spokesperson declined to comment on the company’s use of Books3; Bloomberg did not respond to emails requesting comment; and Stella Biderman, EleutherAI’s executive director, did not dispute that the company used Books3 in GPT-J’s training data.

As a writer and computer programmer, I’ve been curious about what kinds of books are used to train generative-AI systems. Earlier this summer, I began reading online discussions among academic and hobbyist AI developers on sites such as GitHub and Hugging Face. These eventually led me to a direct download of “the Pile,” a massive cache of training text created by EleutherAI that contains the Books3 dataset, plus material from a variety of other sources: YouTube-video subtitles, documents and transcriptions from the European Parliament, English Wikipedia, emails sent and received by Enron Corporation employees before the company’s 2001 collapse, and a lot more. The variety is not entirely surprising. Generative AI works by analyzing the relationships among words in intelligent-sounding language, and given the complexity of these relationships, the subject matter is typically less important than the sheer quantity of text. That’s why The-Eye.eu, a site that hosted the Pile until recently (it received a takedown notice from a Danish anti-piracy group), says its purpose is “to suck up and serve large datasets.”

The Pile is too large to be opened in a text-editing application, so I wrote a series of programs to manage it. I first extracted all the lines labeled “Books3” to isolate the Books3 dataset. Here’s a sample from the resulting dataset:

{"text": "\n\nThis book is a work of fiction. Names, characters, places and incidents are products of the authors' imagination or are used fictitiously. Any resemblance to actual events or locales or persons, living or dead, is entirely coincidental.\n\n | POCKET BOOKS, a division of Simon & Schuster Inc. \n1230 Avenue of the Americas, New York, NY 10020 \nwww.SimonandSchuster.com\n\n---|---

This is the beginning of a line that, like all lines in the dataset, continues for many thousands of words and contains the complete text of a book. But what book? There were no explicit labels with titles, author names, or metadata. Just the label “text,” which reduced the books to the function they serve for AI training. To identify the entries, I wrote another program to extract ISBNs from each line. I fed these ISBNs into another program that connected to an online book database and retrieved author, title, and publishing information, which I viewed in a spreadsheet. This process revealed roughly 190,000 entries: I was able to identify more than 170,000 books—about 20,000 were missing ISBNs or weren’t in the book database. (This number also includes reissues with different ISBNs, so the number of unique books might be somewhat smaller than the total.) Browsing by author and publisher, I began to get a sense for the collection’s scope.
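The programs described above are not reproduced in this article. As a rough illustration only, the first two steps (isolating the Books3 lines, then pulling ISBNs out of each one) might be sketched in Python as follows. The sketch assumes each line of the Pile is a JSON object whose source label sits under a `meta` / `pile_set_name` key; that key, the function names, and the check-digit filtering are assumptions made for the example, not details taken from the author's code. The database lookup step is left out because the article does not name the service used.

```python
import json
import re

# Candidate ISBNs: 10 or 13 characters, optionally separated by hyphens or spaces.
ISBN_CANDIDATE = re.compile(r"\b(?:97[89][- ]?)?(?:\d[- ]?){9}[\dXx]\b")


def is_valid_isbn(digits):
    """Check the ISBN-13 or ISBN-10 check digit of a separator-free string."""
    if len(digits) == 13 and digits.isdigit():
        # ISBN-13: weights alternate 1, 3, 1, 3, ...; the total must be divisible by 10.
        total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
        return total % 10 == 0
    if len(digits) == 10 and digits[:9].isdigit() and (digits[9].isdigit() or digits[9] == "X"):
        # ISBN-10: weights run 10 down to 1; a final "X" stands for the value 10.
        values = [int(d) for d in digits[:9]]
        values.append(10 if digits[9] == "X" else int(digits[9]))
        total = sum(v * (10 - i) for i, v in enumerate(values))
        return total % 11 == 0
    return False


def extract_isbns(text):
    """Return the distinct, checksum-valid ISBNs found in text, in order of appearance."""
    found = []
    for match in ISBN_CANDIDATE.finditer(text):
        digits = re.sub(r"[- ]", "", match.group()).upper()
        if is_valid_isbn(digits) and digits not in found:
            found.append(digits)
    return found


def books3_entries(lines, label="Books3"):
    """Yield the text of every JSON line whose (assumed) source label matches."""
    for line in lines:
        record = json.loads(line)
        if record.get("meta", {}).get("pile_set_name") == label:
            yield record["text"]
```

Because `books3_entries` accepts any iterable of lines, it can be pointed at an open file handle and will read one line at a time, which matters for a file too large to open in a text editor. Validating the check digit is one way to discard digit runs that merely look like ISBNs; entries whose ISBNs are missing or malformed would still go unidentified, as the article notes.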

Of the 170,000 titles, roughly one-third are fiction, two-thirds nonfiction. They’re from big and small publishers. To name a few examples, more than 30,000 titles are from Penguin Random House and its imprints, 14,000 from HarperCollins, 7,000 from Macmillan, 1,800 from Oxford University Press, and 600 from Verso. The collection includes fiction and nonfiction by Elena Ferrante and Rachel Cusk. It contains at least nine books by Haruki Murakami, five by Jennifer Egan, seven by Jonathan Franzen, nine by bell hooks, five by David Grann, and 33 by Margaret Atwood. Also of note: 102 pulp novels by L. Ron Hubbard, 90 books by the Young Earth creationist pastor John F. MacArthur, and multiple works of aliens-built-the-pyramids pseudo-history by Erich von Däniken. In an emailed statement, Biderman wrote, in part, “We work closely with creators and rights holders to understand and support their perspectives and needs. We are currently in the process of creating a version of the Pile that exclusively contains documents licensed for that use.”

Although not widely known outside the AI community, Books3 is a popular training dataset. Hugging Face hosted it for more than two and a half years, apparently removing it around the time it was mentioned in lawsuits against OpenAI and Meta earlier this summer. The academic writer Peter Schoppert has tracked its use in his Substack newsletter. Books3 has also been cited in the research papers by Meta and Bloomberg that announced the creation of LLaMA and BloombergGPT. In recent months, the dataset was effectively hidden in plain sight, possible to download but challenging to find, view, and analyze.

Other datasets, possibly containing similar texts, are used in secret by companies such as OpenAI. Shawn Presser, the independent developer behind Books3, has said that he created the dataset to give independent developers “OpenAI-grade training data.” Its name is a reference to a paper published by OpenAI in 2020 that mentioned two “internet-based books corpora” called Books1 and Books2. That paper is the only primary source that gives any clues about the contents of GPT-3’s training data, so it’s been carefully scrutinized by the development community.

From information gleaned about the sizes of Books1 and Books2, Books1 is speculated to be the complete output of Project Gutenberg, an online publisher of some 70,000 books with expired copyrights or licenses that allow noncommercial distribution. No one knows what’s inside Books2. Some suspect it comes from collections of pirated books, such as Library Genesis, Z-Library, and Bibliotik, that circulate via the BitTorrent file-sharing network. (Books3, as Presser announced after creating it, is “all of Bibliotik.”)

Presser told me by telephone that he’s sympathetic to authors’ concerns. But the great danger he perceives is a monopoly on generative AI by wealthy corporations, giving them total control of a technology that’s reshaping our culture: He created Books3 in the hope that it would allow any developer to create generative-AI tools. “It would be better if it wasn’t necessary to have something like Books3,” he said. “But the alternative is that, without Books3, only OpenAI can do what they’re doing.” To create the dataset, Presser downloaded a copy of Bibliotik from The-Eye.eu and updated a program written more than a decade ago by the hacktivist Aaron Swartz to convert the books from ePub format (a standard for ebooks) to plain text—a necessary change for the books to be used as training data. Although some of the titles in Books3 are missing relevant copyright-management information, the deletions were ostensibly a by-product of the file conversion and the structure of the ebooks; Presser told me he did not knowingly edit the files in this way.
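For readers curious about what such a conversion involves: an ePub file is a ZIP archive of HTML-like documents, so reducing one to plain text amounts to unpacking it and stripping the markup. The Python sketch below illustrates that idea under loud simplifications. It is not the program Presser used, the names `epub_to_text` and `_TextCollector` are hypothetical, and it reads the content files in filename order rather than following the reading order declared inside the ebook.

```python
import io
import zipfile
from html.parser import HTMLParser

# Tags whose closing marks a line break in the plain-text output (a simplification).
BLOCK_TAGS = ("p", "div", "h1", "h2", "h3", "br", "li")


class _TextCollector(HTMLParser):
    """Collect visible text from one HTML document, ignoring scripts and styles."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def epub_to_text(source):
    """Return the text of every HTML/XHTML file in an ePub, joined by blank lines.

    source may be a file path or a binary file-like object.
    Assumption: files are read in sorted-name order, not the ebook's declared order.
    """
    chunks = []
    with zipfile.ZipFile(source) as archive:
        for name in sorted(archive.namelist()):
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = _TextCollector()
                parser.feed(archive.read(name).decode("utf-8", errors="replace"))
                parser.close()
                chunks.append("".join(parser.parts).strip())
    return "\n\n".join(chunk for chunk in chunks if chunk)
```

A conversion of this kind keeps only what appears in the content files. Information stored elsewhere in the archive, such as a separate metadata file, never reaches the output, which is consistent with the idea that missing copyright information could be a by-product of conversion rather than deliberate editing.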

Many commentators have argued that training AI with copyrighted material constitutes “fair use,” the legal doctrine that permits the use of copyrighted material under certain circumstances, enabling parody, quotation, and derivative works that enrich the culture. The industry’s fair-use argument rests on two claims: that generative-AI tools do not replicate the books they’ve been trained on but instead produce new works, and that those new works do not hurt the commercial market for the originals. OpenAI made a version of this argument in response to a 2019 query from the United States Patent and Trademark Office. According to Jason Schultz, the director of the Technology Law and Policy Clinic at NYU, this argument is strong.

I asked Schultz if the fact that books were acquired without permission might damage a claim of fair use. “If the source is unauthorized, that can be a factor,” Schultz said. But the AI companies’ intentions and knowledge matter. “If they had no idea where the books came from, then I think it’s less of a factor.” Rebecca Tushnet, a law professor at Harvard, echoed these ideas, and told me the law was “unsettled” when it came to fair-use cases involving unauthorized material, with previous cases giving little indication of how a judge might rule in the future.

This is, to an extent, a story about clashing cultures: The tech and publishing worlds have long had different attitudes about intellectual property. For many years, I’ve been a member of the open-source software community. The modern open-source movement began in the 1980s, when a developer named Richard Stallman grew frustrated with AT&T’s proprietary control of Unix, an operating system he had worked with. (Stallman worked at MIT, and Unix had been a collaboration between AT&T and several universities.) In response, Stallman developed a “copyleft” licensing model, under which software could be freely shared and modified, as long as modifications were re-shared using the same license. The copyleft license launched today’s open-source community, in which hobbyist developers give their software away for free. If their work becomes popular, they accrue reputation and respect that can be parlayed into one of the tech industry’s many high-paying jobs. I’ve personally benefited from this model, and I support the use of open licenses for software. But I’ve also seen how this philosophy, and the general attitude of permissiveness that permeates the industry, can cause developers to see any kind of license as unnecessary.

This is dangerous because some kinds of creative work simply can’t be done without more restrictive licenses. Who could spend years writing a novel or researching a work of deep history without a guarantee of control over the reproduction and distribution of the finished work? Such control is part of how writers earn money to live.

Meta’s proprietary stance with LLaMA suggests that the company thinks similarly about its own work. After the model leaked earlier this year and became available for download from independent developers who’d acquired it, Meta used a DMCA takedown order against at least one of those developers, claiming that “no one is authorized to exhibit, reproduce, transmit, or otherwise distribute Meta Properties without the express written permission of Meta.” Even after it had “open-sourced” LLaMA, Meta still wanted developers to agree to a license before using it; the same is true of a new version of the model released last month. (Neither the Pile nor Books3 is mentioned in a research paper about that new model.)

Control is more essential than ever, now that intellectual property is digital and flows from person to person as bytes through airwaves. A culture of piracy has existed since the early days of the internet, and in a sense, AI developers are doing something that’s come to seem natural. It is uncomfortably apt that today’s flagship technology is powered by mass theft.

Yet the culture of piracy has, until now, facilitated mostly personal use by individual people. The exploitation of pirated books for profit, with the goal of replacing the writers whose work was taken—this is a different and disturbing trend.
