Fast convolutional neural networks on graphics processing units
Date
2019
Publisher
University of Delaware
Abstract
The Convolutional Neural Network (CNN) architecture is one of the most widely used deep learning tools. The execution time of CNNs is dominated by the convolution steps. Most CNN implementations adopt a simple yet efficient im2col (image-to-column) + GEMM approach: im2col lowers the convolution into a matrix multiplication that can be easily parallelized with highly efficient BLAS libraries. The contribution of this dissertation is the observation of significant but intricately patterned data redundancy in this matrix representation of convolution; we are not aware of earlier work that exploits this redundancy to improve the performance of CNNs. In this work, we analyze the origin of the redundancy generated by the im2col process and reveal a new data pattern that describes the matrix representation of convolution more concisely. Based on this redundancy-minimized matrix representation, we implement an FFT-based convolution with finer FFT granularity. It achieves an average 23% and a maximum 50% speedup on the ILSVRC2017 benchmark over the regular FFT convolution in NVIDIA's cuDNN library, one of the most widely used CNN libraries. Moreover, by replacing the existing convolution methods in Caffe, a popular deep-learning framework, with our new method, we observe an average 74% speedup for multiple synthetic CNNs in closer-to-real-world application scenarios.
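The im2col + GEMM lowering that the abstract describes can be illustrated with a minimal NumPy sketch (this is a generic single-channel, stride-1, no-padding illustration, not the dissertation's implementation). Note how overlapping receptive fields duplicate entries across columns of the lowered matrix; that duplication is the data redundancy the abstract refers to.

```python
import numpy as np

def im2col(x, kh, kw):
    """Lower a single-channel image into a matrix whose columns are the
    flattened kh x kw receptive fields of each output position
    (stride 1, no padding). Overlapping windows copy the same pixels
    into many columns, which is the source of the redundancy."""
    H, W = x.shape
    oh, ow = H - kh + 1, W - kw + 1
    cols = np.empty((kh * kw, oh * ow))
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

def conv_im2col(x, k):
    """Convolution (as cross-correlation) via im2col + GEMM."""
    kh, kw = k.shape
    H, W = x.shape
    cols = im2col(x, kh, kw)
    out = k.ravel() @ cols  # the GEMM step (a GEMV for one filter)
    return out.reshape(H - kh + 1, W - kw + 1)

def conv_direct(x, k):
    """Naive sliding-window reference for checking correctness."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6))
k = rng.standard_normal((3, 3))
assert np.allclose(conv_im2col(x, k), conv_direct(x, k))
```

With a 6x6 image and a 3x3 filter, each interior pixel appears in up to nine columns of the lowered matrix, so the matrix holds roughly nine times the original data.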

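For context, the FFT-based convolution that the dissertation builds on relies on the convolution theorem: pointwise multiplication in the frequency domain equals convolution in the spatial domain. A generic NumPy sketch (again, not the dissertation's finer-granularity method, and ignoring channels and batching) looks like this:

```python
import numpy as np

def conv_fft(x, k):
    """Valid cross-correlation via the convolution theorem:
    zero-pad to the full linear-convolution size, multiply the FFTs,
    inverse-transform, and crop the valid region."""
    H, W = x.shape
    kh, kw = k.shape
    sh, sw = H + kh - 1, W + kw - 1  # size for linear (not circular) convolution
    X = np.fft.rfft2(x, s=(sh, sw))
    K = np.fft.rfft2(np.flip(k), s=(sh, sw))  # flip turns correlation into convolution
    full = np.fft.irfft2(X * K, s=(sh, sw))
    return full[kh - 1:H, kw - 1:W]  # crop to the "valid" output

def conv_direct(x, k):
    """Naive sliding-window reference."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8))
k = rng.standard_normal((3, 3))
assert np.allclose(conv_fft(x, k), conv_direct(x, k))
```

The FFT route trades the GEMM for transforms whose cost grows with the padded size, which is why transform granularity, the knob the dissertation tunes, matters for performance.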