
CNN Benchmark Suite


Towards a Uniform Evaluation Methodology

Existing FPGA toolflows have used ad-hoc evaluation methodologies, targeting different CNN models and reporting the achieved performance in a non-uniform manner. To enable a thorough and comparative evaluation of CNN-to-FPGA toolflows, we propose a uniform evaluation methodology. See the full paper for more details.

The proposed methodology comprises a benchmark suite and guidelines for evaluation metrics.

Benchmark Suite

The proposed benchmark suite includes CNNs that are widely used and whose accuracy has been extensively studied by the deep learning community. Each CNN poses a unique hardware mapping challenge and stresses the FPGA toolflow from a different angle. The main challenges to be addressed are CNNs that are (1) computation-bound, (2) off-chip memory bandwidth-bound, (3) on-chip memory capacity-bound, (4) characterised by high layer dependency, or (5) characterised by irregular and sparse layer connectivity that complicates scheduling.
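To make the first two challenge categories concrete, the sketch below applies the roofline model, a standard analytical model that is not itself part of the proposed methodology, to classify a layer as computation-bound or off-chip memory bandwidth-bound. The `Platform` fields, the function names and all numbers are hypothetical placeholders. This simple model does not capture on-chip memory capacity, layer dependency or irregular connectivity, which is one reason the suite includes models that stress those aspects separately.

```python
from dataclasses import dataclass


@dataclass
class Platform:
    """Hypothetical FPGA platform description (placeholder values, not measurements)."""
    peak_gops: float       # peak compute throughput of the design, in GOp/s
    offchip_bw_gbs: float  # off-chip memory bandwidth, in GB/s


def operational_intensity(layer_gop: float, layer_gb: float) -> float:
    """Operations performed per byte moved to/from off-chip memory (Op/byte)."""
    return layer_gop / layer_gb


def attainable_gops(layer_gop: float, layer_gb: float, p: Platform) -> float:
    """Roofline bound: the lower of peak compute and bandwidth x intensity."""
    return min(p.peak_gops, p.offchip_bw_gbs * operational_intensity(layer_gop, layer_gb))


def bound_type(layer_gop: float, layer_gb: float, p: Platform) -> str:
    """Classify a layer against the ridge point peak_gops / offchip_bw_gbs."""
    ridge = p.peak_gops / p.offchip_bw_gbs
    if operational_intensity(layer_gop, layer_gb) >= ridge:
        return "computation-bound"
    return "memory-bandwidth-bound"


# Hypothetical platform: 400 GOp/s peak compute, 10 GB/s off-chip bandwidth.
platform = Platform(peak_gops=400.0, offchip_bw_gbs=10.0)
# A convolution-like layer with high data reuse (2 GOp, 10 MB of off-chip traffic).
print(bound_type(2.0, 0.01, platform), attainable_gops(2.0, 0.01, platform))
# A fully-connected-like layer with little reuse (0.2 GOp, 200 MB of off-chip traffic).
print(bound_type(0.2, 0.2, platform), attainable_gops(0.2, 0.2, platform))
```

With these placeholder numbers the first layer is limited by the 400 GOp/s compute peak, while the second can reach only 10 GOp/s because every byte fetched supports a single operation.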

Our proposed comprehensive benchmark suite consists of the following CNN models:

CNN models included in the Benchmark Suite, along with their computational challenges
Model (Year)	Depth (layers)	Design Principles	Challenges	Download
AlexNet (2012) [1]	8	Increased depth; increased layer width	Non-uniform filter sizes; grouped convolutions	alexnet.prototxt
ZFNet (2013) [2]	8	Wider layers	Computational load; memory footprint	zfnet.prototxt
VGG16 (2014) [3]	16	Increased depth; uniform filter size	Computational load; memory footprint	vgg16.prototxt
GoogLeNet (2014) [4]	22	Inception module	Irregular computations; irregular layer connectivity	googlenet.prototxt
ResNet-152 (2015) [5]	152	Residual block	Irregular computations; irregular layer connectivity	resnet152.prototxt
Inception-v4 (2016) [6]	72	Residual inception block	Irregular computations; irregular layer connectivity	inception_v4.prototxt
DenseNet-161 (2017) [7]	161	Dense block	Irregular computations; irregular layer connectivity	densenet161.prototxt

Evaluation Metrics

For a complete characterisation of the hardware design points generated by each toolflow, and for a fair comparison across different FPGA platforms, the performance of the system should be reported using metrics that capture the following essential attributes:

Throughput: the primary performance metric of interest in applications such as image recognition and large-scale, multi-user analytics services over large amounts of data. Throughput is measured in GOp/s and is typically increased by processing inputs in large batches.
Latency: the critical factor in applications such as self-driving cars and autonomous systems, as well as in certain real-time cloud-based services. Measured in seconds, latency is the time between an input entering the computing system and the corresponding output being produced. In such scenarios, batch processing adds a prohibitive latency overhead and is often not an option.
Resource Consumption: indicates how efficiently the designs generated by a toolflow utilise the available resources of the target platform, including DSPs, on-chip RAM, logic and flip-flops (FFs).
Application-Level Accuracy: a crucial metric when a toolflow employs approximation techniques for the efficient mapping of CNNs. Such techniques, which may include precision optimisation or lossy compression, can affect the application-level accuracy of the CNN. Any resulting performance-accuracy trade-offs have to be quantified and reported in terms of accuracy degradation.
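The throughput, latency and accuracy metrics above reduce to simple arithmetic. The sketch below is illustrative only: the function names, the `run_inference` callable and all numbers are hypothetical, and the timing helper measures host-side wall-clock time rather than relying on any particular FPGA timing mechanism.

```python
import time
from typing import Any, Callable


def throughput_gops(gop_per_input: float, batch_size: int, batch_time_s: float) -> float:
    """Throughput in GOp/s: operations in one batch divided by the time to process it."""
    return gop_per_input * batch_size / batch_time_s


def measure_latency_s(run_inference: Callable[[Any], Any], single_input: Any) -> float:
    """Wall-clock time from submitting one input (batch size 1) to receiving its output."""
    start = time.perf_counter()
    run_inference(single_input)  # hypothetical callable that runs the accelerator
    return time.perf_counter() - start


def accuracy_degradation_pp(baseline_acc: float, approx_acc: float) -> float:
    """Accuracy loss, in percentage points, of an approximated CNN versus its baseline."""
    return baseline_acc - approx_acc


# Hypothetical design point for a network of 15.5 GOp/input:
# a batch of 1 is processed in 0.05 s, a batch of 16 in 0.5 s.
print(throughput_gops(15.5, 1, 0.05))  # batch 1: low latency, lower throughput
print(throughput_gops(15.5, 16, 0.5))  # batch 16: higher throughput, longer wait per batch
# Hypothetical top-1 accuracy (%) of the baseline and of a reduced-precision mapping.
print(accuracy_degradation_pp(70.0, 68.5))
```

With these placeholder numbers, batching raises throughput from about 310 to 496 GOp/s, but, assuming outputs become available only once the whole batch completes, the wait for a result grows from 0.05 s to 0.5 s. This is the throughput-latency trade-off that the methodology below asks toolflows to report.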

Reporting all of these criteria plays an important role in revealing the strategic trade-offs made by a toolflow. To measure the quality of a CNN-to-FPGA toolflow, we propose the following methodology:

Evaluation reports should include throughput in GOp/s, with the GOp/network, the amount of weights and the batch size explicitly specified, together with latency in seconds/input at a batch size of 1, in order to present the throughput-latency relationship.
Resource-normalised metrics are meaningful when comparing designs that target devices from the same FPGA family, optimised for the same application domain. In this scenario, performance normalised with respect to logic and DSPs allows the comparison of hardware designs for the same network on FPGAs of the same family.
Power-normalised throughput and latency should also be reported for comparison with other parallel architectures such as CPUs, GPUs and DSPs.
Since resource-normalised performance does not capture the effect of off- and on-chip memory bandwidth and capacity, even though these are critical for achieving high performance, reports should include target FPGA platform details that explicitly state the off- and on-chip memory specifications.
All measurements should be made for various CNNs, with emphasis on the proposed benchmark suite of the previous section, in order to demonstrate the quality of results subject to the different mapping challenges posed by each benchmark model.
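A report following this methodology can be captured as a single record. The schema below is a hypothetical sketch, not a format defined by the benchmark suite: all field and function names are illustrative, logic is assumed to be counted in LUTs (vendors report logic differently, e.g. ALMs on Intel devices), and the checks encode the rule that resource-normalised comparisons are only meaningful for the same network on FPGAs of the same family.

```python
from dataclasses import dataclass, replace


@dataclass
class DesignPointReport:
    """Hypothetical report record for one toolflow-generated design point."""
    network: str
    gop_per_input: float      # workload, GOp/network
    weights_m: float          # amount of weights, in millions
    batch_size: int
    throughput_gops: float    # measured at `batch_size`
    latency_s_batch1: float   # seconds/input at batch size 1
    fpga_family: str
    dsps_used: int
    luts_used: int            # assumption: logic counted in LUTs
    power_w: float
    offchip_bw_gbs: float     # platform off-chip memory bandwidth
    onchip_mem_mb: float      # platform on-chip memory capacity

    def gops_per_dsp(self) -> float:
        return self.throughput_gops / self.dsps_used

    def gops_per_klut(self) -> float:
        return self.throughput_gops / (self.luts_used / 1000)

    def gops_per_watt(self) -> float:
        """Power-normalised throughput, for comparison with CPUs, GPUs and DSPs."""
        return self.throughput_gops / self.power_w


def dsp_efficiency_ratio(a: DesignPointReport, b: DesignPointReport) -> float:
    """GOp/s/DSP of design a relative to design b (same network, same FPGA family only)."""
    if a.fpga_family != b.fpga_family:
        raise ValueError("resource-normalised comparison requires the same FPGA family")
    if a.network != b.network:
        raise ValueError("resource-normalised comparison requires the same network")
    return a.gops_per_dsp() / b.gops_per_dsp()


# Two hypothetical design points for the same network on the same (placeholder) family.
a = DesignPointReport(network="net-A", gop_per_input=30.0, weights_m=100.0, batch_size=16,
                      throughput_gops=400.0, latency_s_batch1=0.1, fpga_family="family-X",
                      dsps_used=800, luts_used=200_000, power_w=20.0,
                      offchip_bw_gbs=10.0, onchip_mem_mb=4.0)
b = replace(a, throughput_gops=300.0, dsps_used=1200)
print(a.gops_per_dsp(), a.gops_per_klut(), a.gops_per_watt())
print(dsp_efficiency_ratio(a, b))
```

Keeping the memory specifications in the same record as the normalised metrics reflects the point above that GOp/s per DSP or per LUT says nothing about off- and on-chip memory limits.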

Citation

Please cite the following paper if you find this benchmarking suite useful for your work:
Venieris, S.I., Kouris, A. and Bouganis, C.S., 2018. Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions. arXiv preprint arXiv:1803.05900.

References

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems (NIPS).

[2] Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and Understanding Convolutional Networks. In European Conference on Computer Vision (ECCV).

[3] Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations (ICLR).

[4] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going Deeper with Convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander Alemi. 2017. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In AAAI Conference on Artificial Intelligence.

[7] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely Connected Convolutional Networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
