Imperial College London

Dr Aaron Zhao

Faculty of Engineering, Department of Electrical and Electronic Engineering

Lecturer in Computer Engineering







Electrical Engineering, South Kensington Campus






2 results found

Zhao Y, Gao X, Liu J, Wang E, Mullins R, Cheung P, Constantinides G, Xu C-Z et al., 2019, Automatic generation of multi-precision multi-arithmetic CNN accelerators for FPGAs, 2019 International Conference on Field-Programmable Technology, Publisher: IEEE

Modern deep Convolutional Neural Networks (CNNs) are computationally demanding, yet real applications often require high throughput and low latency. To help tackle these problems, we propose Tomato, a framework designed to automate the process of generating efficient CNN accelerators. The generated design is pipelined and each convolution layer uses different arithmetics at various precisions. Using Tomato, we showcase state-of-the-art multi-precision multi-arithmetic networks, including MobileNet-V1, running on FPGAs. To our knowledge, this is the first multi-precision multi-arithmetic auto-generation framework for CNNs. In software, Tomato fine-tunes pretrained networks to use a mixture of short powers-of-2 and fixed-point weights with a minimal loss in classification accuracy. The fine-tuned parameters are combined with the templated hardware designs to automatically produce efficient inference circuits in FPGAs. We demonstrate how our approach significantly reduces model sizes and computation complexities, and permits us to pack a complete ImageNet network onto a single FPGA without accessing off-chip memories for the first time. Furthermore, we show how Tomato produces implementations of networks with various sizes running on single or multiple FPGAs. To the best of our knowledge, our automatically generated accelerators outperform closest FPGA-based competitors by at least 2-4x for latency and throughput; the generated accelerator runs ImageNet classification at a rate of more than 3000 frames per second.

Conference paper
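
As a rough illustration of the two weight formats the abstract above mentions, the following Python sketch rounds a weight either to a signed fixed-point value or to a signed power of two. It is not Tomato's fine-tuning procedure: the function names, bit widths, and exponent range are assumptions chosen for the example, and the paper's "short powers-of-2" format may differ from the single-term rounding shown here.

```python
import math

def quantize_fixed_point(w, frac_bits=6, total_bits=8):
    """Round w to a signed fixed-point value (bit widths are assumed, not from the paper)."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))      # most negative representable integer
    hi = (1 << (total_bits - 1)) - 1   # most positive representable integer
    q = max(lo, min(hi, round(w * scale)))  # round, then saturate
    return q / scale

def quantize_power_of_two(w, min_exp=-8, max_exp=0):
    """Round w to sign * 2**e with e nearest in log2 space (exponent range is assumed)."""
    if w == 0:
        return 0.0
    e = round(math.log2(abs(w)))
    e = max(min_exp, min(max_exp, e))  # clamp the exponent to the allowed range
    return math.copysign(2.0 ** e, w)

def quantize_layer(weights, scheme):
    """Apply one scheme ("pow2" or "fixed") to every weight of a layer."""
    fn = quantize_power_of_two if scheme == "pow2" else quantize_fixed_point
    return [fn(w) for w in weights]
```

A power-of-two weight lets a multiplication be replaced by a shift, which is one reason such formats are attractive in hardware; choosing the scheme per layer, as `quantize_layer` does, mirrors the per-layer mixing described in the abstract only loosely.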

Zhao Y, Wickerson J, Constantinides GA, 2017, An efficient implementation of online arithmetic, IEEE International Conference on Field Programmable Technology (FPT), Publisher: IEEE

We propose the first hardware implementation of standard arithmetic operators – addition, multiplication, and division – that utilises constant compute resource but allows numerical precision to be adjusted arbitrarily at run-time. Traditionally, precision must be set at design-time so that addition and multiplication, which calculate the least significant digit (LSD) of their results first, and division, which calculates the most significant digit (MSD) first, can be chained together. To get around this, we employ online operators, which are always MSD-first, and thus allow successive operations to be pipelined. Even online operators require precision to be fixed at design-time because multiplication and division traditionally involve parallel adders. To avoid this, we propose an architecture, which we have implemented on an FPGA, that reuses a fixed-precision adder and stores residues in on-chip RAM. As such, we can use a single piece of hardware to perform calculations to any precision, limited only by the availability of on-chip RAM. For instance, we obtain an 8x speed-up, compared to the parallel-in-serial-out (PISO) fixed-point method, when executing 100 iterations of Newton’s method at a precision of 64 digits, while the product of circuit area and latency stays comparable.

Conference paper
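
The digit-order mismatch described in the abstract above can be seen in a small Python sketch. It is not online arithmetic and not the paper's architecture; it only contrasts conventional addition, whose digits emerge LSD-first because of carry propagation, with long division, whose digits emerge MSD-first. The function names and the decimal base are assumptions made for illustration.

```python
def add_lsd_first(a_digits, b_digits, base=10):
    """Add two equal-length digit lists (MSD at index 0), yielding result digits LSD first."""
    carry = 0
    for a, b in zip(reversed(a_digits), reversed(b_digits)):
        carry, digit = divmod(a + b + carry, base)
        yield digit          # each digit is final only once the lower carry is known
    if carry:
        yield carry          # a final carry becomes the most significant digit

def divide_msd_first(numerator, denominator, num_digits, base=10):
    """Yield num_digits fractional digits of numerator/denominator, MSD first.

    Assumes 0 <= numerator < denominator.
    """
    remainder = numerator
    for _ in range(num_digits):
        remainder *= base
        digit, remainder = divmod(remainder, denominator)
        yield digit          # the leading digits are available before the later ones

print(list(add_lsd_first([4, 7, 8], [2, 5, 6])))   # digits of 478 + 256, LSD first
print(list(divide_msd_first(1, 7, 6)))             # first six digits of 1/7, MSD first
```

Because the adder finishes with its most significant digit while the divider starts with it, feeding one into the other digit by digit requires the full result first. MSD-first online operators, as the abstract notes, remove that mismatch so that operations can be pipelined.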

This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.
