Elon Musk’s xAI has brought a new supercomputer with 100,000 Nvidia H100 GPUs online at its new facility in Memphis, Tennessee.
A supercomputer built to train AI models at Elon Musk-led xAI has come online. The first phase was completed in just four months, in line with the startup’s ambitious plan to make the machine fully operational by the end of 2025.
This weekend, the @xAI team brought our Colossus 100k H100 training cluster online. From start to finish, it was done in 122 days.
Colossus is the most powerful AI training system in the world. Moreover, it will double in size to 200k (50k H200s) in a few months.
Excellent…
— Elon Musk (@elonmusk) September 2, 2024
The supercomputer is called “Colossus,” presumably after the machine in the 1970 Hollywood sci-fi classic Colossus: The Forbin Project. It comprises a cluster of 100,000 Nvidia H100 GPUs, Musk announced in a post on X earlier this week. It occupies a site in Memphis, Tennessee, at a plant previously owned by Electrolux, where Musk plans to build a “gigafactory of compute,” as previously reported by The Information.
Colossus is believed to be among the largest single-cluster GPU deployments in the world. Musk plans to “double” the cluster to 200,000 GPUs, including 50,000 of Nvidia’s newer H200s, which offer more memory capacity and higher memory bandwidth than the H100. The tech mogul had previously committed to spending $3–4 billion on sourcing GPUs.
Of the roughly $10B in AI-related expenditures I said Tesla would make this year, about half is internal, primarily the Tesla-designed AI inference computer and sensors present in all of our cars, plus Dojo.
For building the AI training superclusters, NVidia hardware is about…
— Elon Musk (@elonmusk) June 4, 2024
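For a rough sense of the scale involved, here is a back-of-envelope sketch in Python. The per-GPU figures are Nvidia’s published peaks for the SXM parts (80 GB of HBM3 on the H100, 141 GB of HBM3e on the H200, both around 989 dense BF16 TFLOPS); the GPU mix for the expanded cluster is an assumption based on Musk’s “200k (50k H200s)” wording, and the totals are theoretical peaks, not delivered training throughput.

```python
# Back-of-envelope totals for Colossus. Per-GPU figures are Nvidia's
# published peaks for the SXM parts; real sustained throughput is lower.
GPU_SPECS = {
    # name: (HBM capacity in GB, dense BF16 tensor-core TFLOPS)
    "H100": (80, 989),   # HBM3
    "H200": (141, 989),  # HBM3e: same compute silicon, more/faster memory
}

def cluster_totals(counts: dict[str, int]) -> tuple[float, float]:
    """Return (total HBM in TB, peak dense BF16 compute in exaFLOPS)."""
    mem_tb = sum(n * GPU_SPECS[g][0] for g, n in counts.items()) / 1_000
    eflops = sum(n * GPU_SPECS[g][1] for g, n in counts.items()) / 1_000_000
    return mem_tb, eflops

# Assumed expansion mix: 150k H100s + 50k H200s = 200k GPUs total.
for label, counts in [("Today", {"H100": 100_000}),
                      ("Planned", {"H100": 150_000, "H200": 50_000})]:
    mem_tb, eflops = cluster_totals(counts)
    print(f"{label}: {mem_tb:,.0f} TB of HBM, ~{eflops:.0f} EFLOPS peak BF16")
```

Even at the current size, that works out to roughly 8,000 TB of GPU memory and nearly 100 exaFLOPS of peak BF16 compute on paper, with the usual caveat that sustained training throughput falls well short of peak.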
Understandably, these GPUs are among the most sought-after tech products today. Surging demand has boosted Nvidia’s market cap and briefly made it the world’s most valuable company earlier this year. Procuring them at this scale is a challenge, however, as the leading tech giants, including Meta, Google, Amazon, and Microsoft, are all vying for Nvidia’s silicon. xAI sidestepped part of that challenge by securing an initial batch of GPUs that had originally been earmarked for Tesla.
The most immediate use case for xAI’s new supercomputer is training the next version of Grok, the AI chatbot available to paid subscribers on the Musk-owned social network, X. While xAI released Grok-2 in a beta preview in August, Musk has already confirmed that Grok 3 will arrive by the end of 2024 and be trained on 100,000 Nvidia GPUs. Colossus fulfills that compute requirement.
Grok 3 end of year after training on 100k H100s should be really something special
— Elon Musk (@elonmusk) July 1, 2024
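To see why that GPU count matters for a training run, here is a hedged sketch using the widely cited C ≈ 6·N·D approximation for dense-transformer training FLOPs, where N is the parameter count and D the number of training tokens. The parameter count, token count, and utilization figure below are illustrative placeholders, not known Grok 3 numbers.

```python
# Illustrative only: none of these are confirmed Grok 3 figures.
# C ~= 6 * N * D is the standard FLOPs estimate for dense transformers.
params = 300e9          # N: assumed parameter count (hypothetical)
tokens = 10e12          # D: assumed training tokens (hypothetical)
flops_needed = 6 * params * tokens        # ~1.8e25 FLOPs

gpus = 100_000
peak_per_gpu = 989e12                     # H100 dense BF16 peak FLOPS
mfu = 0.35                                # assumed model-FLOPs utilization

sustained = gpus * peak_per_gpu * mfu     # cluster-wide sustained FLOPS
days = flops_needed / sustained / 86_400  # 86,400 seconds per day
print(f"~{flops_needed:.1e} FLOPs -> ~{days:.0f} days at {mfu:.0%} MFU")
```

Under these assumptions, a frontier-scale run finishes in days rather than months, which is the practical appeal of a 100,000-GPU cluster: iteration speed.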
A hundred thousand GPUs working in concert should significantly accelerate Grok’s training. There is no guarantee, however, that this will let it surpass the capabilities of rival AI models, especially as the likes of Meta plan to grow their GPU fleets by an even larger margin. Beyond Grok, the supercomputer can be expected to train the underlying AI models that will power the Tesla Optimus robot.