Imperial College London

Dr Aaron Zhao

Faculty of Engineering, Department of Electrical and Electronic Engineering

Lecturer in Computer Engineering
 
 
 
Contact

 

a.zhao

 
 

Location

 

Electrical EngineeringSouth Kensington Campus



 

Publications

3 results found

Zhao Y, Gao X, Liu J, Wang E, Mullins R, Cheung P, Constantinides G, Xu C-Z, et al., 2019, Automatic generation of multi-precision multi-arithmetic CNN accelerators for FPGAs, 2019 International Conference on Field-Programmable Technology, Publisher: IEEE

Modern deep Convolutional Neural Networks (CNNs) are computationally demanding, yet real applications often require high throughput and low latency. To help tackle these problems, we propose Tomato, a framework designed to automate the process of generating efficient CNN accelerators. The generated design is pipelined and each convolution layer uses different arithmetics at various precisions. Using Tomato, we showcase state-of-the-art multi-precision multi-arithmetic networks, including MobileNet-V1, running on FPGAs. To our knowledge, this is the first multi-precision multi-arithmetic auto-generation framework for CNNs. In software, Tomato fine-tunes pretrained networks to use a mixture of short powers-of-2 and fixed-point weights with a minimal loss in classification accuracy. The fine-tuned parameters are combined with the templated hardware designs to automatically produce efficient inference circuits on FPGAs. We demonstrate how our approach significantly reduces model sizes and computation complexities and, for the first time, permits us to pack a complete ImageNet network onto a single FPGA without accessing off-chip memories. Furthermore, we show how Tomato produces implementations of networks of various sizes running on single or multiple FPGAs. To the best of our knowledge, our automatically generated accelerators outperform their closest FPGA-based competitors by at least 2-4x in latency and throughput; the generated accelerator runs ImageNet classification at a rate of more than 3000 frames per second.

Conference paper
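
To make the quantisation idea above concrete, here is a minimal Python sketch of power-of-2 weight rounding, the kind of representation Tomato mixes with fixed-point. It is an illustration only, not code from the framework; the function name quantize_pow2 and the 4-bit exponent budget are assumptions.

    import numpy as np

    def quantize_pow2(w, bits=4):
        # Illustrative power-of-2 quantiser (not the Tomato framework itself).
        # Each weight is replaced by the nearest signed power of two whose
        # exponent fits in a `bits`-wide code, so a multiply becomes a shift.
        sign = np.sign(w)
        mag = np.where(w == 0, 1.0, np.abs(w))   # avoid log2(0); sign is 0 there anyway
        exp = np.round(np.log2(mag))
        max_exp = 0                              # assumes weights have magnitude <= 1
        min_exp = max_exp - (2 ** (bits - 1) - 1)
        exp = np.clip(exp, min_exp, max_exp)
        return sign * 2.0 ** exp

    w = np.float32([0.7, -0.24, 0.031, 0.0])
    print(quantize_pow2(w))                      # -> [ 0.5  -0.25  0.03125  0. ]

In hardware, such weights turn each multiply-accumulate into a shift-and-add, which is what allows a complete network to stay on-chip.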

Su J, Faraone J, Liu J, Zhao Y, Thomas DB, Leong PHW, Cheung PYK, et al., 2018, Redundancy-reduced MobileNet acceleration on reconfigurable logic for ImageNet classification, Pages: 16-28, ISSN: 0302-9743

Modern Convolutional Neural Networks (CNNs) excel in image classification and recognition applications on large-scale datasets such as ImageNet, compared to many conventional feature-based computer vision algorithms. However, the high computational complexity of CNN models can lead to low system performance in power-efficient applications. In this work, we first highlight two levels of model redundancy which widely exist in modern CNNs. Additionally, we use MobileNet as a design example and propose an efficient system design for a Redundancy-Reduced MobileNet (RR-MobileNet) in which off-chip memory traffic is used only for input/output transfer, while parameters and intermediate values are kept in on-chip BRAM blocks. Compared to AlexNet, our RR-MobileNet has 25x fewer parameters and 3.2x fewer operations per image inference, yet 9%/5.2% higher Top-1/Top-5 classification accuracy on the ImageNet classification task. The latency of a single image inference is only 7.85 ms.

Conference paper
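
The parameter saving that makes an on-chip MobileNet feasible comes from factoring each standard convolution into a depthwise stage and a pointwise stage. A short back-of-envelope sketch in Python (the layer shape is an assumed example, not a figure from the paper) shows the effect:

    def conv_params(c_in, c_out, k):
        # Weights in a standard k x k convolution layer.
        return k * k * c_in * c_out

    def dws_params(c_in, c_out, k):
        # Weights in MobileNet's building block: a depthwise k x k
        # convolution followed by a 1 x 1 pointwise convolution.
        return k * k * c_in + c_in * c_out

    c_in, c_out, k = 256, 512, 3          # assumed example layer shape
    std = conv_params(c_in, c_out, k)
    dws = dws_params(c_in, c_out, k)
    print(std, dws, round(std / dws, 1))  # 1179648 133376 8.8

Reducing the two levels of redundancy the paper identifies shrinks the model further, to the point where all parameters fit in BRAM.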

Zhao Y, Wickerson J, Constantinides GA, 2017, An efficient implementation of online arithmetic, IEEE International Conference on Field Programmable Technology (FPT), Publisher: IEEE

We propose the first hardware implementation of standard arithmetic operators – addition, multiplication, and division – that utilises constant compute resource but allows numerical precision to be adjusted arbitrarily at run-time. Traditionally, precision must be set at design-time so that addition and multiplication, which calculate the least significant digit (LSD) of their results first, and division, which calculates the most significant digit (MSD) first, can be chained together. To get around this, we employ online operators, which are always MSD-first, and thus allow successive operations to be pipelined. Even online operators require precision to be fixed at design-time because multiplication and division traditionally involve parallel adders. To avoid this, we propose an architecture, which we have implemented on an FPGA, that reuses a fixed-precision adder and stores residues in on-chip RAM. As such, we can use a single piece of hardware to perform calculations to any precision, limited only by the availability of on-chip RAM. For instance, we obtain an 8x speed-up, compared to the parallel-in-serial-out (PISO) fixed-point method, when executing 100 iterations of Newton's method at a precision of 64 digits, while the product of circuit area and latency stays comparable.

Conference paper
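
The benchmark in the evaluation, Newton's method run at a precision chosen at run-time, can be mimicked in software to show what this flexibility buys. The sketch below is only an analogy built on Python's decimal module, not the paper's online-arithmetic architecture; the helper reciprocal is hypothetical.

    from decimal import Decimal, getcontext

    def reciprocal(a, digits):
        # Approximate 1/a by Newton's method at a digit count chosen at
        # run time (a software analogue of run-time-adjustable precision).
        getcontext().prec = digits + 2        # working precision plus guard digits
        a = Decimal(a)
        x = Decimal(1 / float(a))             # double-precision seed (~16 digits)
        for _ in range(digits):               # quadratic convergence: few passes needed
            x_prev = x
            x = x * (2 - a * x)               # Newton step for f(x) = 1/x - a
            if x == x_prev:
                break
        getcontext().prec = digits
        return +x                             # unary plus rounds to the requested precision

    print(reciprocal("3", 64))                # 1/3 to 64 significant digits

Each Newton step roughly doubles the number of correct digits, so the same loop serves any requested precision; the paper's contribution is hardware that scales the same way without changing the circuit.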

This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.
