Elon Musk is putting his AI chips to work — and he's catching up with Mark Zuckerberg
- Elon Musk just put a whole bunch of Nvidia chips to work.
- He said on Monday that his company xAI brought its AI training cluster, Colossus, online.
- Built with Nvidia's H100 GPUs, the cluster could help Musk catch up to Meta.
Elon Musk might be distracted by the Brazilian Supreme Court's decision to ban X, but he isn't letting that stop him from pushing forward with his AI ambitions.
On Monday, the billionaire said xAI — the company he launched in 2023 — had brought a massive new training cluster of chips online over the weekend, claiming it represented "the most powerful AI training system in the world."
The system, dubbed Colossus, was built at a site in Memphis, Tennessee, using 100,000 chips from Nvidia, specifically its H100 GPUs. Musk said the cluster was built in 122 days and would "double in size" in a few months as more GPUs are added.
Though Musk confirmed the size of the cluster in July, bringing it online marks a key step for his AI ambitions and, critically, allows him to play catch-up with his Silicon Valley nemesis Mark Zuckerberg.
Zuckerberg's and Musk's ambitions — in Musk's case, to turn xAI into a company that advances "our collective understanding of the universe" with its Grok chatbot — depend on high-performance GPUs, which provide the computing power required for powerful AI models.
These haven't exactly been easy to come by, nor have they been cheap.
The hype generated around AI since the release of ChatGPT in late 2022 has left companies scrambling for Nvidia GPUs, with shortages stemming from frenzied demand and supply constraints. In some instances, the chips have sold for upward of $40,000 each.
Despite the barriers to access, companies have sought to secure a supply of GPUs in any way they can and put them to work to edge ahead of rivals.
Llama versus Grok
Nathan Benaich, the founder and general partner of Air Street Capital, has been tracking the number of H100 GPUs acquired by tech companies. He put Meta's total at 350,000 and xAI's at 100,000. Tesla, one of Musk's other companies, was at 35,000.
In January, Zuckerberg said Meta would have a stockpile of 600,000 GPUs by the end of the year, with some 350,000 of those being Nvidia's H100s.
Microsoft, OpenAI, and Amazon haven't disclosed the sizes of their H100 stockpiles.
Meta hasn't disclosed exactly how many GPUs Zuckerberg has secured toward his 600,000 target or how many have been put to use. But in a research paper published in July, Meta said the largest version of its Llama 3 large language model had been trained on 16,000 H100 GPUs. In March, the company announced "a major investment in Meta's AI future" with two 24,000-GPU clusters to support the development of Llama 3.
That suggests xAI's latest training cluster, with its 100,000 H100 GPUs, is roughly six times the size of the one used to train Meta's largest AI model.
The scale of the feat hasn't been lost on the industry.
On X, a post from Nvidia's data-center account in response to Musk said, "Exciting to see Colossus, the world's largest GPU #supercomputer, come online in record time."
Greg Yang, an xAI cofounder, had a more colorful response to the news, riffing on a song by the American rapper Tyga.
Shaun Maguire, a partner at the venture-capital firm Sequoia, wrote on X that the xAI team now "has access to the world's most powerful training cluster" to build the next version of its Grok chatbot. He added, "In the last few weeks Grok-2 catapulted to being roughly at parity with the state of the art models."
But, as with most AI companies, there are big question marks over commercializing the technology. "It's impressive xAI has been able to raise so much with Elon and make progress, but their product strategy remains unclear," Benaich told Business Insider.
In July, Musk said the next version of Grok — after training on 100,000 H100s — "should be really something special."
We'll find out soon enough how competitive it makes him with Zuckerberg on AI.