Bandwidth matters. Deep learning algorithms dominate the field of Artificial Intelligence (AI), but existing platforms run them inefficiently. The inefficiency stems largely from the memory system: too much data must be loaded from external memory, and the available bandwidth is insufficient.
To deploy neural networks on customized hardware with high energy efficiency, we propose a complete flow consisting of Deep Compression, compilation, and hardware acceleration. Deep Compression reduces the size of neural networks, and thus the bandwidth requirement, by 35× to 49× without affecting their accuracy. Two fully parameterized accelerators are proposed: the Aristotle architecture for CNNs and the Descartes architecture for sparse DNNs/RNNs, which reduce memory access and exploit sparsity. A compiler converts neural network models into instructions in tens of seconds.
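To illustrate why compression cuts the bandwidth requirement, the following is a minimal sketch of magnitude-based weight pruning, one ingredient of the Deep Compression approach. This is an illustrative toy, not the paper's implementation: the function name, the target sparsity, and the layer shape are all assumptions for the example.

```python
import numpy as np

def prune_by_magnitude(weights, sparsity=0.9):
    """Illustrative sketch (not the paper's implementation): zero out the
    smallest-magnitude weights until `sparsity` fraction of entries are zero.
    Only the surviving values and their indices then need to be stored and
    fetched from external memory, which is where the bandwidth saving comes from."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    # k-th smallest magnitude serves as the pruning threshold
    threshold = np.partition(flat, k)[k] if k < flat.size else np.inf
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

# Hypothetical 256x256 fully connected layer with random weights
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256))
pruned, mask = prune_by_magnitude(w, sparsity=0.9)
kept = mask.sum() / mask.size
print(f"fraction of weights kept: {kept:.2f}")
```

A sparse accelerator such as the Descartes architecture operates directly on the surviving nonzero weights, skipping the multiplications by zero that pruning creates.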
Evaluated on two practical CNN algorithms, object detection and face landmark detection, the Aristotle accelerator on a low-end embedded FPGA achieves 1.20× and 3.93× the performance of a Tegra K1 GPU while reducing energy by 75%. On a medium FPGA, the Descartes accelerator achieves 5× the performance and 15× the energy efficiency of a Maxwell Titan X GPU on an LSTM for speech recognition.