Artificial intelligence (AI) systems such as large language models (LLMs) require vast computing power, and while much of the focus is on graphics processing unit (GPU) and central processing unit (CPU) capacity, storage is an overlooked aspect that can contribute to maximizing AI efficiency.
How do storage solutions and AI intersect, and how can an optimized storage infrastructure enhance the performance and effectiveness of AI systems?
The Data-Driven Nature of AI
At the core of any AI system lies massive amounts of data. Machine learning (ML) algorithms rely on extensive data sets to learn patterns, make predictions, and improve over time.
Companies must allocate hefty sums to capital and operating expenses for large AI supercomputers for deep learning or LLM training, of which storage accounts for around 5%, according to James Coomer, Senior Vice President of Products at DataDirect Networks (DDN), speaking at the recent ai-Pulse conference.
“The question is as a 5% piece of the pie, what can we do competitively to make everything much more efficient — how can storage make the rest of the pie do more work? With hundreds of millions of dollars being spent, if we can save 5%, 10% and give it back to training that’s a real win.”
Training an LLM like GPT-3 requires over 100 servers running for around a month, during which time the system runs through multiple epochs, training and checking for errors. At each of these stages, it moves data and writes checkpoints, and once the model is ready to be sent to customers, it is distributed using read access. All of this requires data storage capacity.
Efficient storage solutions, then, play a crucial role in handling the increasing volume and movement of data that AI systems require. Such scalable storage systems tend to link storage hardware, whether it runs in the cloud or sits in data center racks, to GPUs, CPUs, or other large compute infrastructure over high-speed interconnects.
One solution is to build an efficient architecture based on parallel file systems.
“NFS is the standard protocol for sharing data across a network, but the problem is the compute nodes don’t know where the data is. They have to go to a server, and then that server has to go find the data, which means you need a second network for the servers to find the data in this large, shared infrastructure,” Coomer explained.
“A parallel file system moves some intelligence into the compute, so now what’s really happening is that the compute nodes know where the data is and get it directly. They can go in parallel and move data directly from where it resides into the application.”
This halves the complexity of the system while making it more scalable.
A fast, reliable, and scalable storage architecture ensures that AI models can access and process data sets seamlessly, facilitating more efficient training and inference.
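The difference Coomer describes, where compute nodes locate data themselves and fetch it in parallel, can be illustrated with a small sketch. It is a toy model, not how any particular parallel file system is implemented: the `StorageTarget` class, the round-robin layout, and the tiny `STRIPE_SIZE` are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

STRIPE_SIZE = 4  # bytes per stripe; deliberately tiny (assumed value)


class StorageTarget:
    """Hypothetical storage server holding some stripes of a file."""

    def __init__(self):
        self.stripes = {}  # stripe index -> bytes

    def read(self, index):
        return self.stripes[index]


def write_striped(data, targets):
    """Distribute a file round-robin across the targets; return the stripe count."""
    count = 0
    for offset in range(0, len(data), STRIPE_SIZE):
        index = offset // STRIPE_SIZE
        chunk = data[offset:offset + STRIPE_SIZE]
        targets[index % len(targets)].stripes[index] = chunk
        count += 1
    return count


def read_parallel(stripe_count, targets):
    """The client computes where each stripe lives and fetches them in parallel."""
    def fetch(index):
        # No intermediate server lookup: the layout rule gives the location.
        return targets[index % len(targets)].read(index)

    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        # pool.map returns results in stripe order, so a plain join works.
        return b"".join(pool.map(fetch, range(stripe_count)))


targets = [StorageTarget() for _ in range(3)]
payload = b"training data shard 0001"
n = write_striped(payload, targets)
restored = read_parallel(n, targets)
```

Because the client knows the layout rule, it never has to ask a central server where a stripe is, which is the property the quote attributes to parallel file systems.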
Optimizing Training Workflows
When training LLMs, every epoch reads the same data, which would otherwise cross the network each time. Automatic system caching keeps that data on the local disk so that it can be re-read there rather than moved repeatedly across the network. This gives faster access to the data and reduces the time spent on input/output operations, expediting the workflows.
In this “hot node” process, the first epoch writes data to the local flash storage as well as reading it, but there’s subsequently no network traffic.
By re-reading data from local nonvolatile memory express (NVMe) storage, systems can shave 3% off the runtime, Coomer said. This not only recovers 3% of the capacity of a $100 million data center, increasing its productivity, but also frees up network capacity for other tasks.
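The "hot node" behavior described above amounts to a read-through cache. The following is a minimal sketch under stated assumptions: two temporary directories stand in for shared network storage and local NVMe, and the `HotNodeCache` class and file name are hypothetical, not part of any vendor's product.

```python
import shutil
import tempfile
from pathlib import Path


class HotNodeCache:
    """Read-through cache: the first read copies from shared storage to local disk."""

    def __init__(self, shared_dir, local_dir):
        self.shared_dir = Path(shared_dir)
        self.local_dir = Path(local_dir)
        self.network_reads = 0  # counts how often shared storage is touched

    def read(self, name):
        local_path = self.local_dir / name
        if not local_path.exists():
            # Cache miss (first epoch): fetch over the "network" and keep a copy.
            shutil.copyfile(self.shared_dir / name, local_path)
            self.network_reads += 1
        # Every read, including the first, is served from the local copy.
        return local_path.read_bytes()


with tempfile.TemporaryDirectory() as shared, tempfile.TemporaryDirectory() as local:
    (Path(shared) / "shard0.bin").write_bytes(b"tokens")
    cache = HotNodeCache(shared, local)
    # Three "epochs" read the same shard; only the first crosses the network.
    epochs = [cache.read("shard0.bin") for _ in range(3)]
    network_reads = cache.network_reads
```

After three epochs, `network_reads` is 1: the first epoch both reads and writes the local copy, and later epochs generate no network traffic.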
While machine learning may appear to be read-intensive, as models are trained on text, video files, images, and so on, only half of the process involves reading the disk – the other half is writing, Coomer noted.
The need for writes is driven by the use of checkpoints, which are intermediate snapshots of an AI model’s entire internal state — effectively backup and restore points written to disk. Checkpoints allow developers to pick up training from a certain point if there is an error.
The more checkpoints there are, the less re-computation is required if there is a restart, such as when hardware fails, saving resources.
For example, training Megatron-LM on NVIDIA’s Selene supercomputer requires thousands of checkpoints, each over 170GB and requiring up to 7GB/s per DGX server, of which there are 128. Checkpoints also improve a model’s prediction accuracy and allow developers to continue training models across different systems, apply transfer learning, restart experiments from less trained states, and correct validation errors.
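The checkpoint-and-restart pattern described above can be sketched in a few lines. This is an illustration only: the "model" is a single number, the JSON file format, the function names, and the simulated failure at step 57 are all assumptions, and real frameworks write far larger binary state.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "weight": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every, fail_at=None):
    state = load_checkpoint(path)
    executed = 0
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated hardware failure")
        state["weight"] += 0.5  # stand-in for one optimizer step
        state["step"] += 1
        executed += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state, executed


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "model.ckpt.json")
    try:
        train(ckpt, total_steps=100, checkpoint_every=10, fail_at=57)
    except RuntimeError:
        pass  # the run died after step 57; the last checkpoint is from step 50
    final_state, steps_after_restart = train(ckpt, total_steps=100, checkpoint_every=10)
```

With a checkpoint every 10 steps, the restart resumes from step 50 and repeats only 7 steps of work. A shorter interval would lose less work per failure but write to storage more often, which is the trade-off that makes write performance matter.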
Large Storage for Small File Writes
This mix of data loading and checkpointing means AI storage must be optimized for a large number of small file reads and writes. One way to do this is by serving writes at high speeds, driving reads and writes to support the checkpoints as well as the data loads, Coomer said.
This optimization is particularly important in research and development environments, where data scientists and engineers iterate on models to enhance their accuracy and performance.
On average, checkpoint-related overheads can account for 12% of total training time and can rise to as much as 43%.
“If you use a well-optimized, well-architected parallel file system for storage, you can shave that 43% and move it down to 5% or 10% just by moving that data faster in a specific way,” Coomer said.
“If you get the right storage versus an NFS system, you can provide between 5% and 12% more useful training time out of your infrastructure compared with if you had used a non-specialized NFS system.”
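To put those percentages in concrete terms, the short calculation below converts checkpoint overhead into training hours. The 30-day run length is an assumption chosen to match the "around a month" figure mentioned earlier, and the before and after overheads are taken from the figures quoted above; actual savings depend on the workload.

```python
def useful_hours(total_hours, checkpoint_overhead):
    """Hours left for actual training once checkpoint stalls are subtracted."""
    return total_hours * (1.0 - checkpoint_overhead)


TOTAL_HOURS = 30 * 24  # assumed 30-day training run

# (overhead before, overhead after) as fractions of total training time
scenarios = {
    "average (12% -> 5%)": (0.12, 0.05),
    "worst case (43% -> 10%)": (0.43, 0.10),
}

recovered = {}
for label, (before, after) in scenarios.items():
    gain = useful_hours(TOTAL_HOURS, after) - useful_hours(TOTAL_HOURS, before)
    recovered[label] = gain
    print(f"{label}: {gain:.1f} extra training hours")
```

Under these assumptions, the average case recovers about 50 hours of a 720-hour run, and the worst case recovers close to 238 hours.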
Minimizing latency is crucial for specific AI applications, particularly those requiring real-time responses. Storage solutions that provide access to data with low latency enable AI systems to respond promptly to user inputs or changing environmental conditions.
This is key for industries such as healthcare diagnostics, autonomous vehicles, and financial trading, which require split-second decisions that can have significant implications.
But it is not just speed and capacity that make storage systems effective. They also require intelligent data management.
The diverse data sets that train AI models can sit in different locations or formats. Therefore, storage systems that provide efficient data indexing, organization, and accessibility can contribute to streamlining training workflows. This enables AI developers to focus on innovation rather than grappling with data logistics.
In AI supercomputing systems, performance bottlenecks can arise from various factors, including slow access to data storage.
Investing in high-performance storage solutions helps avoid these bottlenecks, ensuring that the computational power of the hardware is used to the full. This optimization results in faster training times and decision-making, increasing the efficiency of AI-driven applications.