Scalable Input Data Processing for Resource-Efficient Machine Learning
Data is the lifeblood of machine learning. Yet, our system infrastructure for managing and preprocessing training data in ML jobs lags behind the vast advancements in hardware accelerators, software frameworks, and algorithms that optimize model training computations. The input data pipeline in an ML job is responsible for extracting data from storage, transforming it on the fly, and loading it to a training node (typically a GPU or TPU). As hardware accelerators continue to provide more FLOPS, feeding data at a sufficient rate to saturate accelerators is increasingly challenging. The high cost of accelerators compared to their CPU hosts makes it particularly important to ensure that they operate at high utilization. Hence, the input pipeline is critical to the end-to-end throughput and cost of ML jobs. In this talk, I will discuss the characteristics of real ML input pipelines from production workloads, which have led to the trend of disaggregating input data processing from model training. I will present recent open-source systems such as the tf.data service and Cachew, which leverage a disaggregated system architecture to scale out and optimize data processing within and across jobs. These systems alleviate input bottlenecks and dramatically improve the training time and cost of ML jobs.
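To make the extract-transform-load structure and the disaggregation idea concrete, below is a minimal sketch of a tf.data input pipeline that offloads preprocessing to remote tf.data service workers. The storage path, dispatcher address, batch size, and preprocess function are illustrative assumptions, not part of any specific production setup.

    import tensorflow as tf

    # Hypothetical address of a tf.data service dispatcher; in practice this
    # points at the disaggregated data-processing cluster.
    DISPATCHER = "grpc://dispatcher.example.com:5000"

    def preprocess(record):
        # Placeholder on-the-fly transformation (decoding, augmentation, etc.).
        return tf.io.parse_tensor(record, out_type=tf.float32)

    # Extract: read serialized records from storage (path is illustrative).
    dataset = tf.data.TFRecordDataset(["gs://bucket/train-00000.tfrecord"])
    # Transform: apply preprocessing with automatically tuned parallelism.
    dataset = dataset.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    # Offload the pipeline to tf.data service workers so preprocessing can
    # scale out independently of the training host's CPUs.
    dataset = dataset.apply(
        tf.data.experimental.service.distribute(
            processing_mode="parallel_epochs",
            service=DISPATCHER))
    # Load: batch and prefetch so the accelerator is not starved.
    dataset = dataset.batch(128).prefetch(tf.data.AUTOTUNE)

In this setup the training node only consumes ready-to-use batches, while the CPU-intensive extraction and transformation run on a separately scaled pool of workers, which is the disaggregated architecture the talk describes.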