Automating Distributed Machine Learning: Algorithms and System Optimization


Grant Data
Project Title
Automating Distributed Machine Learning: Algorithms and System Optimization
Principal Investigator
Professor Wu, Chuan   (Project Coordinator (PC))
Co-Investigator(s)
Professor Yu Yizhou   (Co-principal investigator)
Xu Hong   (Co-Investigator)
Wang Wei   (Co-Investigator)
Li Bo   (Co-Investigator)
Kwok James Tin Yau   (Co-Investigator)
Chu Xiaowen   (Co-Investigator)
Duration
36 months
Start Date
2023-03-01
Amount
7,030,400
Keywords
1) Distributed Machine Learning 2) AutoML 3) Computation Acceleration 4) Communication Scheduling 5) DNN Training
Discipline
Network; Artificial Intelligence and Machine Learning
Panel
Engineering (E)
HKU Project Code
C7004-22G
Grant Type
Collaborative Research Fund (CRF) - Group Research Project 2022/2023
Funding Year
2022
Status
On-going
Objectives
1. [To devise automated hyperparameter optimization and neural architecture search algorithms for deriving optimized deep learning models.] Artificial intelligence (AI), and machine learning (ML) in particular, has led to remarkable breakthroughs across a wide range of application domains. However, building high-quality ML models requires extensive ML expertise, which is not always available in practice. Automated machine learning (AutoML) has recently been proposed to automate this complex learning process. In the context of deep learning, AutoML is commonly used for hyperparameter optimization and neural architecture search, enabling the automatic design of optimized deep neural networks from big data. As deep learning models are routinely trained on clusters of machines, exploiting this distributed training capability offers new opportunities for AutoML. We will study the critical issues of hyperparameter optimization (e.g., batch size and learning rate selection) and neural architecture search in the distributed setting. Through distributed model search and training over multiple devices/machines, we seek to obtain optimized neural network architectures and hyperparameters that achieve the highest model accuracy and learning efficiency.

2. [To design strategies and methods for automatic computation acceleration in distributed machine learning, for the most efficient exploitation of expensive AI hardware.] Deep neural networks (DNNs) have been used extensively in a wide range of applications such as computer vision, natural language processing and speech recognition. DNN training is well known to be time-, resource/energy- and manpower-consuming, due to the large volumes of training data, growing model sizes and the need for manual resource scheduling. Mainstream ML frameworks such as TensorFlow and PyTorch support distributed DNN training in a cluster of worker machines, each equipped with multiple accelerator devices (e.g., GPUs). Given a large DNN model and a set of available devices, several issues remain open: (1) how to automatically assign the ML operators in the model (e.g., MatMul, ReLU) to devices for computation, (2) how to enable low- or mixed-precision computation on the devices, and (3) how to automatically set the execution order of operators on and across the devices, so that the extremely expensive (in cost and energy) AI accelerator devices are maximally utilized while training convergence time and human intervention during training are minimized. We will design efficient strategies to address these issues; a minimal illustration of the mixed-precision aspect is sketched below.
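As a rough, self-contained illustration of the mixed-precision part of Objective 2 (item (2)), and not of the strategies to be developed in this project, the sketch below uses PyTorch's automatic mixed precision on a placeholder model with synthetic data; the model sizes, batch size and learning rate are illustrative assumptions only.

    # Minimal sketch: mixed-precision training with PyTorch AMP (Objective 2, item (2)).
    # The model, data and hyperparameters are placeholders for illustration only.
    import torch
    import torch.nn as nn

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()
    scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

    for step in range(100):
        x = torch.randn(64, 1024, device=device)        # synthetic batch
        y = torch.randint(0, 10, (64,), device=device)  # synthetic labels
        optimizer.zero_grad(set_to_none=True)
        # Run the forward pass in reduced precision (fp16 on GPU) where it is safe to do so.
        with torch.cuda.amp.autocast(enabled=(device == "cuda")):
            loss = criterion(model(x), y)
        # Scale the loss to avoid fp16 gradient underflow, then unscale before the update.
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

In the project, decisions such as which operators run in reduced precision and on which devices would be made automatically rather than hard-coded as above.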
3. [To devise communication plans and computation/communication co-scheduling algorithms for distributed machine learning, for achieving automatic learning acceleration.] Distributed DNN training is also communication-intensive: it requires frequent exchange of locally computed model gradients and updated parameters across workers, as well as transfer of intermediate data across model partitions deployed on different devices. Without careful planning and scheduling, communication alone can account for a significant share, e.g., 50% to 90%, of the overall training time. We aim to accelerate parameter/data communication in distributed training via automatic right-sizing of communication tensors (tensor fusion/partition) and transmission scheduling (maximizing the overlap between operator computation and tensor communication). We will also study joint computation and communication scheduling, with operator/tensor fusion/partition and execution order planned automatically and strategically in concert.

4. [To design and implement an accurate ML simulator to emulate large-scale distributed DNN training, and to implement and evaluate our algorithms and strategies using mainstream ML frameworks in real-world AI scenarios.] We will implement our algorithms and system designs in mainstream open-source ML frameworks such as PyTorch and TensorFlow. We will evaluate our prototypes on AI clusters running on public cloud services such as AWS EC2, as well as on the AI clouds of our industry collaborators (e.g., Alibaba and ByteDance). We will run a variety of representative deep learning workloads to evaluate model accuracy, convergence speed and hardware utilization. Given the inherent scale limitations of testbed evaluation (prohibitive costs), we will also develop a simulator that characterizes the performance of distributed ML systems and emulates the training process without prolonged runs on many expensive devices, to expedite algorithm design and strategy search. We plan to open-source the simulator and our algorithm implementations on mainstream ML frameworks for the community, and to employ them in applications such as learning large-scale neural networks for medical image analysis.
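As a loose illustration of the kind of analytical model the planned simulator could build on (Objectives 3 and 4), and not of the project's actual design, the sketch below estimates the per-iteration time of synchronous data-parallel training under a simple ring all-reduce cost model, overlapping each layer's gradient communication with the remaining backward computation; the layer timings, gradient sizes and bandwidth are made-up assumptions.

    # Illustrative sketch only: a toy analytical model of one data-parallel training
    # iteration, estimating how much gradient communication can hide behind backward
    # computation. All layer timings, sizes and bandwidth below are assumed values.

    def ring_allreduce_time(size_bytes, workers, bandwidth_gbps=25.0):
        """Approximate ring all-reduce time: about 2*(N-1)/N of the tensor crosses each link."""
        volume_bytes = 2.0 * (workers - 1) / workers * size_bytes
        return volume_bytes * 8 / (bandwidth_gbps * 1e9)

    def iteration_time(fwd_ms, bwd_ms, grad_bytes, workers):
        """Backward runs layer by layer (output to input); each layer's gradient
        all-reduce starts once that layer's backward finishes and overlaps with
        the remaining backward computation of earlier layers."""
        t = sum(fwd_ms) / 1e3                       # forward pass, no communication
        comm_free_at = 0.0                          # when the network link becomes free
        bwd_done = t
        for layer in reversed(range(len(bwd_ms))):  # backward in reverse layer order
            bwd_done += bwd_ms[layer] / 1e3
            start = max(bwd_done, comm_free_at)     # gradient ready and link free
            comm_free_at = start + ring_allreduce_time(grad_bytes[layer], workers)
        return max(bwd_done, comm_free_at)          # iteration ends when both finish

    if __name__ == "__main__":
        fwd = [2.0, 3.0, 4.0]       # per-layer forward time (ms), assumed
        bwd = [4.0, 6.0, 8.0]       # per-layer backward time (ms), assumed
        grads = [4e6, 16e6, 64e6]   # per-layer gradient size (bytes), assumed
        est = iteration_time(fwd, bwd, grads, workers=8)
        print(f"estimated iteration time: {est * 1e3:.1f} ms")

A full simulator along the lines of Objective 4 would replace such hand-set numbers with profiled operator timings and a calibrated network cost model.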