Is AI About to Run Out of Data? The History of Oil Says No
Is the AI bubble about to burst? Every day that the stock prices of semiconductor champion Nvidia and the so-called “Fab Five” tech giants (Microsoft, Apple, Alphabet, Amazon, and Meta) fail to regain their mid-year peaks, more people ask that question.
It would not be the first time in financial history that the hype around a new technology led investors to drive up the value of the companies selling it to unsustainable heights—and then get cold feet. Political uncertainty around the U.S. election is itself raising the probability of a sell-off, as Donald Trump expresses his lingering resentments against the Big Tech companies and his ambivalence towards Taiwan, where the semiconductors essential for artificial intelligence mostly get made.
The deeper question is whether AI can deliver the staggering long-term value that the internet has. If you invested in Amazon in late 1999, you would have been down over 90% by early 2001. But you would be up over 4,000% today.
A chorus of skeptics now loudly claims that AI progress is about to hit a brick wall. Models such as GPT-4 and Gemini have already hoovered up most of the internet’s data for training, the story goes, and will lack the data needed to get much smarter.
However, history gives us a strong reason to doubt the doubters. Indeed, we think they are likely to end up in the same unhappy place as those who in 2001 cast aspersions on the future of Jeff Bezos’s scrappy online bookstore.
The generative AI revolution has breathed fresh life into the TED-ready aphorism “data is the new oil.” But when LinkedIn influencers trot out that 2006 quote by British entrepreneur Clive Humby, most of them are missing the point. Data is like oil, but not just in the facile sense that each is the essential resource that defines a technological era. As futurist Ray Kurzweil observes, the key is that both data and oil vary greatly in the difficulty—and therefore cost—of extracting and refining them.
Some petroleum is light crude oil just below the ground, which gushes forth if you dig a deep enough hole in the dirt. Other petroleum is trapped far beneath the earth or locked in sedimentary shale rocks, and requires deep drilling and elaborate fracking or high-heat pyrolysis to be usable. When oil prices were low prior to the 1973 embargo, only the cheaper sources were economically viable to exploit. But during periods of soaring prices over the decades since, producers have been incentivized to use increasingly expensive means of unlocking further reserves.
The same dynamic applies to data—which is after all the plural of the Latin datum. Some data exist in neat and tidy datasets—labeled, annotated, fact-checked, and free for download in a common file format. But most data are buried more deeply. Data may be on badly scanned handwritten pages; may consist of terabytes of raw video or audio, without any labels on relevant features; may be riddled with inaccuracies and measurement errors or skewed by human biases. And most data are not on the public internet at all.
An estimated 96% to 99.8% of all online data are inaccessible to search engines—for example, paywalled media, password-protected corporate databases, legal documents, and medical records, plus an exponentially growing volume of private cloud storage. In addition, the vast majority of printed material has still never been digitized—around 90% for high-value collections such as the Smithsonian and U.K. National Archives, and likely a much higher proportion across all archives worldwide.
Yet arguably the largest untapped category is information that’s currently not captured in the first place, from the hand motions of surgeons in the operating room to the subtle expressions of actors on a Broadway stage.
For the first decade after large amounts of data became the key to training state-of-the-art AI, commercial applications were very limited. It therefore made sense for tech companies to harvest only the cheapest data sources. But the launch of OpenAI’s ChatGPT in 2022 changed everything. Now, the world’s tech titans are locked in a frantic race to turn theoretical AI advances into consumer products worth billions. Many millions of users now pay around $20 per month for access to the premium AI models produced by Google, OpenAI, and Anthropic. But this is peanuts compared to the economic value that will be unlocked by future models capable of reliably performing professional tasks such as legal drafting, computer programming, medical diagnosis, financial analysis, and scientific research.
The skeptics are right that the industry is about to run out of cheap data. As smarter models enable wider adoption of AI for lucrative use cases, however, powerful incentives will drive the drilling for ever more expensive data sources—the proven reserves of which are orders of magnitude larger than what has been used so far. This is already catalyzing a new training data sector, as companies including Scale AI, Sama, and Labelbox specialize in the digital refining needed to make the less accessible data usable.
This is also an opportunity for data owners. Many companies and nonprofits have mountains of proprietary data that are gathering dust today, but which could be used to propel the next generation of AI breakthroughs. OpenAI has already spent hundreds of millions of dollars licensing training data, inking blockbuster deals with Shutterstock and the Associated Press for access to their archives. Just as there was speculation in mineral rights during previous oil booms, we may soon see a rise in data brokers finding and licensing data in the hope of cashing in when AI companies catch up.
Much like the geopolitical scramble for oil, competition for top-quality data is also likely to affect superpower politics. Countries’ domestic privacy laws affect the availability of fresh training data for their tech ecosystems. The European Union’s 2016 General Data Protection Regulation leaves Europe’s nascent AI sector with an uphill climb to international competitiveness, while China’s expansive surveillance state allows Chinese firms to access larger and richer datasets than can be mined in America. Given the military and economic imperatives to stay ahead of Chinese AI labs, Western firms may thus be forced to look overseas for sources of data unavailable at home.
Yet just as alternative energy is fast eroding the dominance of fossil fuels, new AI development techniques may reduce the industry’s reliance on massive amounts of data. Premier labs are now working to perfect techniques known as “synthetic data” generation and “self-play,” which allow AI to create its own training data. And while AI models currently learn several orders of magnitude less efficiently than humans, as models develop more advanced reasoning, they will likely be able to hone their capabilities with far less data.
There are legitimate questions about how long AI’s recent blistering progress can be sustained. Despite enormous long-term potential, the short-term market bubble will likely burst before AI is smart enough to live up to the white-hot hype. But just as generations of “peak oil” predictions have been dashed by new extraction methods, we should not bet on an AI bust due to data running out.