Training AI models at scale imposes many challenges. As a model grows, the amount of computation required for backpropagation increases, as does the amount of data needed to fit the model. GPUs accelerate this computation; however, as GPUs get faster, data must be delivered to them at higher throughput to keep them busy with computation at all times.
At Reality Labs, our researchers and engineers train various AI models to innovate in the world of spatial computing. To do that, we often need to iterate on ideas many times. It is essential to train our AI models quickly, and to do so, we need to make maximum use of GPUs. However, existing data loading solutions don’t allow us to fine-tune performance, nor do they provide insight into it.
To achieve better utilization of GPUs and improve the speed of model training, we developed a new data loading solution, Scalable and Performant Data Loading (SPDL). SPDL embraces thread-based parallelism, which has a smaller memory footprint than conventional process-based parallelism. SPDL also implements basic media-processing operations that complement this thread-based parallelism in existing Python versions.
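As a rough illustration of the idea (a minimal sketch, not SPDL’s API), the snippet below swaps a process pool for a thread pool. The decode function is a hypothetical stand-in for a media-processing operation:

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib

# Stand-in for a media-decoding step. Real decoders (JPEG, audio, video) are
# implemented in C/C++ and release the GIL, so threads can run them in
# parallel even in existing Python versions. hashlib only simulates the work.
def decode(sample: bytes) -> bytes:
    return hashlib.sha256(sample).digest()

samples = [bytes([i]) * 1_000_000 for i in range(16)]

# Threads share one interpreter and one address space; compared to a
# ProcessPoolExecutor, there are no per-worker copies of the dataset code
# and Python state, which keeps the memory footprint small.
with ThreadPoolExecutor(max_workers=4) as pool:
    decoded = list(pool.map(decode, samples))
```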
Issues in AI model training efficiency
The GPU Efficiency Team at Reality Labs works with various teams to diagnose training inefficiencies and discuss their solutions. The causes of and solutions for these inefficiencies span many different subdomains and are not limited to data loading.
Here we summarize the main issues in data loading we addressed when designing SPDL.
Concurrently performing operations of a different nature
When training models with large amounts of data, the data are retrieved from remote storage, preprocessed by the CPU, and then transferred to GPU devices. The performance of these stages is bound by different factors: data acquisition is mainly bound by network bandwidth, preprocessing by the CPU, and transfer by the memory bus. An ideal pipeline performs these operations concurrently, and each stage should run without waiting on its upstream or downstream stages. This requires adjusting the concurrency of each stage separately, as sketched below.
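The following is a minimal sketch of this idea, not SPDL’s actual API. The download, preprocess, and to_device functions are placeholders, and each stage runs on its own thread pool whose size can be tuned independently of the others:

```python
import queue
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from threading import Thread

def download(key: int) -> bytes:       # bound by network bandwidth
    return f"raw-{key}".encode()

def preprocess(raw: bytes) -> bytes:   # bound by CPU
    return raw.upper()

def to_device(batch: bytes) -> bytes:  # bound by the memory bus
    return batch                       # stand-in for a host-to-device copy

def stage(fn, concurrency: int, inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Apply `fn` to items from `inbox` with bounded concurrency, emitting in order."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        pending: deque = deque()
        while (item := inbox.get()) is not None:
            pending.append(pool.submit(fn, item))
            if len(pending) >= concurrency:   # keep a bounded window in flight
                outbox.put(pending.popleft().result())
        while pending:                        # drain remaining work
            outbox.put(pending.popleft().result())
    outbox.put(None)                          # propagate end-of-stream

q0, q1, q2, q3 = (queue.Queue(maxsize=32) for _ in range(4))
threads = [
    Thread(target=stage, args=(download, 8, q0, q1)),    # many concurrent fetches
    Thread(target=stage, args=(preprocess, 4, q1, q2)),  # roughly one per CPU core
    Thread(target=stage, args=(to_device, 1, q2, q3)),   # a single transfer stream
]
for t in threads:
    t.start()

for key in range(16):
    q0.put(key)
q0.put(None)  # signal end of input

while (batch := q3.get()) is not None:
    pass  # hand `batch` to the training loop here
for t in threads:
    t.join()
```

Because each stage owns its own pool, a network-bound stage can run with high concurrency while the CPU-bound and transfer stages stay small, and all three stages make progress at the same time.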
Tooling to diagnose and optimize data loading
In our experience, almost all training pipelines have data-related issues. What makes these issues difficult to resolve is the lack of insight into how the data loader is behaving and the lack of appropriate parameters for tuning its performance.
PyTorch’s DataLoader class provides a simple user experience by abstracting away the internal mechanics, but this abstraction also makes it difficult to profile performance. The PyTorch Profiler can provide insight into the Python call stack only when there are no worker processes, a configuration that isn’t representative of actual training pipelines.
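For example (a hedged sketch using only standard PyTorch APIs, with a synthetic dataset), capturing Python stack traces for the loading loop requires running it in the main process with num_workers=0:

```python
import torch
from torch.profiler import ProfilerActivity, profile
from torch.utils.data import DataLoader, TensorDataset

# Synthetic dataset standing in for a real one.
dataset = TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))

# With num_workers=0 the loading code runs in the main process, so the
# profiler can record Python stacks; with worker processes it cannot.
loader = DataLoader(dataset, batch_size=32, num_workers=0)

with profile(activities=[ProfilerActivity.CPU], with_stack=True) as prof:
    for images, labels in loader:
        pass  # the training step would go here

print(prof.key_averages(group_by_stack_n=5).table(sort_by="self_cpu_time_total"))
```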
Because the dataset interface completely abstracts away the internal mechanism, the support DataLoader can provide for performance tuning is limited. Oftentimes, increasing the value of num_workers and enabling pin_memory are the only options, as in the sketch below. However, increasing the number of worker processes comes with undesirable side effects.
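The typical tuning surface looks like the following (standard DataLoader parameters on a synthetic dataset):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 3, 224, 224), torch.randint(0, 10, (1024,)))

# The commonly available levers: more worker processes and pinned host memory.
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,      # each worker is a separate process with its own memory
    pin_memory=True,    # page-locked buffers speed up host-to-device copies
    prefetch_factor=2,  # batches prefetched per worker (only with num_workers > 0)
)
```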