Deep Learning at Scale on NVIDIA V100 Accelerators

Rengan Xu, Frank Han and Quy Ta

The recent explosion in the popularity of Deep Learning (DL) is due to a combination of improved algorithms, access to large datasets and increased computational power. This had led to a plethora of open-source DL frameworks, each with varying characteristics and capabilities. End users are then left with the difficult task of determining software and hardware configurations to get optimal performance from each framework.
We share our experiences and develop best practices for DL training with TensorFlow, MXNet and Caffe2. The paper also looks at DL inferencing with TensorRT on NVIDIA V100 Volta GPUs. It focuses on one of the more prominent neural network architectures, Resnet50, combined with Imagenet dataset. We quantify the impact of hardware attributes on DL workloads such as the usage of PCIe vs NVLink GPUs, performance past a single worker node, effect of high speed interconnect such as InfiniBand EDR on training and the implication of utilizing a network attached storage and its advantages.