Keywords:
Deep Learning, supercomputing, distributed systems
Candidates:
Students must have a strong programming background (C/C++ and Python) and solid machine learning knowledge to begin with. Students should also be comfortable with the Linux command line. Students with good knowledge of TensorFlow programming, linear algebra, optimization, and parallel/distributed programming are preferred. By the end of the research project, in addition to producing a technical report or a paper, the students should have acquired the following skills:
- The process of computer science research: analyzing the pros and cons of an algorithm, designing numerical experiments, and writing a good scientific paper;
- The application of distributed systems and supercomputers to emerging workloads such as deep learning;
- The co-design of the system (supercomputer) and the algorithm (deep learning technique).
Introduction: Deep neural networks (i.e. Deep Learning) are among the most successful artificial intelligence techniques. However, training deep neural networks is extremely slow. For example, finishing 90-epoch ImageNet-1k training with the ResNet-50 model on an NVIDIA M40 GPU takes 14 days, and it can take several months on a Mac laptop. This training requires about 10^18 single-precision operations in total. On the other hand, the world's current fastest supercomputer can perform 2 * 10^17 single-precision operations per second. If we could make full use of such a supercomputer for DNN training, we should be able to finish the 90-epoch ResNet-50 training in about five seconds. However, the current bottleneck for fast DNN training is at the algorithm level. Specifically, the current batch size (e.g. 512) is too small to make efficient use of many processors. In this project, students will focus on designing a new optimization algorithm that can make full use of thousands of computing servers.
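The five-second figure follows directly from the numbers above. A quick back-of-envelope check in Python (illustrative only; it uses just the figures quoted in this introduction):

```python
# Back-of-envelope estimate using the figures quoted above (illustrative only).
total_ops = 1e18                # single-precision ops for 90-epoch ResNet-50 on ImageNet-1k
peak_ops_per_sec = 2e17         # peak single-precision throughput of the fastest supercomputer
m40_seconds = 14 * 24 * 3600    # 14 days on a single NVIDIA M40 GPU

ideal_seconds = total_ops / peak_ops_per_sec
print(f"Ideal time at full supercomputer utilization: {ideal_seconds:.0f} s")   # ~5 s
print(f"Required speedup over a single M40: {m40_seconds / ideal_seconds:,.0f}x")
```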
The students are also welcome to propose their own projects in related areas. Specific ideas cannot be disclosed in this introduction, but broad directions include:
- Explore and explain why extremely large batches often lose accuracy. It would be good if the students could give either a mathematical or an empirical answer.
- Study advanced optimization methods and try to replace Momentum SGD or state-of-the-art adaptive optimization solvers. Ideally, the newly proposed solver should scale the batch size to at least 64K without losing accuracy on ImageNet training (see the sketch after this list).
- Design new parallel machine learning algorithms, such as a model-parallel or an asynchronous approach.
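As a concrete starting point for the second direction, one published line of work on large-batch training uses layer-wise adaptive learning rates (e.g. LARS). The sketch below only illustrates that idea in plain NumPy; the function name, hyperparameter values, and interfaces are assumptions, not a prescribed solution for this project.

```python
import numpy as np

def layerwise_adaptive_update(weights, grads, momentum_buf,
                              lr=1.0, trust_coef=0.001,
                              weight_decay=5e-4, momentum=0.9):
    """One LARS-style update step (hypothetical sketch, not a tuned optimizer).

    weights, grads, momentum_buf: lists of per-layer arrays with matching shapes.
    """
    new_weights = []
    for i, (w, g) in enumerate(zip(weights, grads)):
        g = g + weight_decay * w                      # L2-regularized gradient
        w_norm = np.linalg.norm(w)
        g_norm = np.linalg.norm(g)
        # Layer-wise "trust ratio": keep each layer's step size proportional to
        # its weight norm, which is the mechanism that tolerates very large batches.
        local_lr = trust_coef * w_norm / (g_norm + 1e-9) if w_norm > 0 else 1.0
        momentum_buf[i] = momentum * momentum_buf[i] + lr * local_lr * g
        new_weights.append(w - momentum_buf[i])
    return new_weights, momentum_buf
```

In practice this logic would live inside a TensorFlow optimizer and be combined with learning-rate warmup; the NumPy version only shows the per-layer scaling step.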
Course Features
- Lectures: 1
- Quizzes: 1
- Duration: 10 weeks
- Skill level: All levels
- Language: English
- Students: 3
- Assessments: Yes