Wang E, Davis J, Moro D, et al., 2021, Enabling Binary Neural Network Training on the Edge, International Workshop on Embedded and Mobile Deep Learning (EDML), Publisher: ACM, Pages: 37-38
Wang E, Davis J, Moro D, et al., 2021, Enabling Binary Neural Network Training on the Edge, Workshop on Binary Networks for Computer Vision
Wang E, Davis J, Moro D, et al., 2021, Enabling Binary Neural Network Training on the Edge, Publisher: arXiv
The ever-growing computational demands of increasingly complex machine learning models frequently necessitate the use of powerful cloud-based infrastructure for their training. Binary neural networks are known to be promising candidates for on-device inference due to their extreme compute and memory savings over higher-precision alternatives. In this paper, we demonstrate that they are also strongly robust to gradient quantization, thereby making the training of modern models on the edge a practical reality. We introduce a low-cost binary neural network training strategy exhibiting sizable memory footprint reductions and energy savings vs Courbariaux & Bengio's standard approach. Against the latter, we see coincident memory requirement and energy consumption drops of 2-6x, while reaching similar test accuracy in comparable time, across a range of small-scale models trained to classify popular datasets. We also showcase ImageNet training of ResNetE-18, achieving a 3.12x memory reduction over the aforementioned standard. Such savings will allow for unnecessary cloud offloading to be avoided, reducing latency, increasing energy efficiency and safeguarding privacy.
Bin M, Cheung PYK, Crisostomi E, et al., 2021, Post-lockdown abatement of COVID-19 by fast periodic switching, PLOS Computational Biology, Vol: 17, ISSN: 1553-734X
Wang E, Davis JJ, Cheung P, et al., 2020, LUTNet: Learning FPGA Configurations for Highly Efficient Neural Network Inference, IEEE Transactions on Computers, Vol: 69, Pages: 1795-1808, ISSN: 0018-9340
Research has shown that deep neural networks contain significant redundancy, and thus that high classification accuracy can be achieved even when weights and activations are quantized down to binary values. Network binarization on FPGAs greatly increases area efficiency by replacing resource-hungry multipliers with lightweight XNOR gates. However, an FPGA's fundamental building block, the K-LUT, is capable of implementing far more than an XNOR: it can perform any K-input Boolean operation. Inspired by this observation, we propose LUTNet, an end-to-end hardware-software framework for the construction of area-efficient FPGA-based neural network accelerators using the native LUTs as inference operators. We describe the realization of both unrolled and tiled LUTNet architectures, with the latter facilitating smaller, less power-hungry deployment over the former while sacrificing area and energy efficiency along with throughput. For both varieties, we demonstrate that the exploitation of LUT flexibility allows for far heavier pruning than possible in prior works, resulting in significant area savings while achieving comparable accuracy. Against the state-of-the-art binarized neural network implementation, we achieve up to twice the area efficiency for several standard network models when inferencing popular datasets. We also demonstrate that even greater energy efficiency improvements are obtainable.
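The observation that a K-LUT can implement any K-input Boolean function, of which XNOR is only one, can be illustrated with a small truth-table model. This is a minimal sketch for illustration, not LUTNet's implementation; the helper names `make_lut` and `lut_eval` are hypothetical.

```python
from itertools import product

def make_lut(func, k):
    """Build a K-input LUT as a truth table with 2**k one-bit entries."""
    # product() enumerates inputs in order 00..0, 00..1, ..., 11..1,
    # so the first input acts as the most significant index bit.
    return [func(*bits) & 1 for bits in product((0, 1), repeat=k)]

def lut_eval(table, bits):
    """Look up the LUT output for a tuple of input bits."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return table[index]

# XNOR, the operator used by conventional binarised networks, needs 2 inputs.
xnor_lut = make_lut(lambda a, b: 1 - (a ^ b), 2)

# A 4-input LUT can hold any of 2**(2**4) Boolean functions; this one
# (an arbitrary example) outputs 1 when at least three inputs are 1.
at_least_three_lut = make_lut(lambda a, b, c, d: int(a + b + c + d >= 3), 4)

print(xnor_lut)                                   # [1, 0, 0, 1]
print(lut_eval(at_least_three_lut, (1, 1, 1, 0)))  # 1
```

The point of the sketch is only that the table contents are free parameters: a network operator built on the LUT is not restricted to the single function that XNOR provides.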
Zhao Y, Gao X, Liu J, et al., 2019, Automatic generation of multi-precision multi-arithmetic CNN accelerators for FPGAs, 2019 International Conference on Field-Programmable Technology, Publisher: IEEE
Modern deep Convolutional Neural Networks (CNNs) are computationally demanding, yet real applications often require high throughput and low latency. To help tackle these problems, we propose Tomato, a framework designed to automate the process of generating efficient CNN accelerators. The generated design is pipelined and each convolution layer uses different arithmetics at various precisions. Using Tomato, we showcase state-of-the-art multi-precision multi-arithmetic networks, including MobileNet-V1, running on FPGAs. To our knowledge, this is the first multi-precision multi-arithmetic auto-generation framework for CNNs. In software, Tomato fine-tunes pretrained networks to use a mixture of short powers-of-2 and fixed-point weights with a minimal loss in classification accuracy. The fine-tuned parameters are combined with the templated hardware designs to automatically produce efficient inference circuits in FPGAs. We demonstrate how our approach significantly reduces model sizes and computation complexities, and permits us to pack a complete ImageNet network onto a single FPGA without accessing off-chip memories for the first time. Furthermore, we show how Tomato produces implementations of networks with various sizes running on single or multiple FPGAs. To the best of our knowledge, our automatically generated accelerators outperform closest FPGA-based competitors by at least 2-4x for latency and throughput; the generated accelerator runs ImageNet classification at a rate of more than 3000 frames per second.
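Why powers-of-2 weights are attractive in hardware can be shown with a short sketch: a weight of the form sign * 2**e turns a multiplication into a bit shift. The rounding rule (nearest exponent in the log domain), the exponent range and the function names below are assumptions made for illustration; the abstract does not specify how Tomato chooses its weights.

```python
import math

def quantise_pow2(w, min_exp=-8, max_exp=0):
    """Approximate w by sign * 2**e, returning (sign, e).

    The exponent is rounded in the log2 domain and clamped to
    [min_exp, max_exp]. A sign of 0 encodes a zero weight.
    """
    if w == 0:
        return 0, 0
    sign = 1 if w > 0 else -1
    e = round(math.log2(abs(w)))
    e = max(min_exp, min(max_exp, e))
    return sign, e

def multiply_by_shift(x, sign, e):
    """Multiply integer activation x by sign * 2**e using shifts only."""
    shifted = x << e if e >= 0 else x >> -e
    return sign * shifted

sign, e = quantise_pow2(0.3)          # 0.3 is approximated by 2**-2 = 0.25
print(sign, e)                        # 1 -2
print(multiply_by_shift(64, sign, e)) # 16
```

A right shift discards low-order bits, so the result is a floor of the exact product; a real design would pick the fixed-point format of `x` with that in mind.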
Liu J, Bouganis C, Cheung PYK, 2019, Context-based image acquisition from memory in digital systems, Journal of Real-Time Image Processing, Vol: 16, Pages: 1057-1076, ISSN: 1861-8200
A key consideration in the design of image and video processing systems is the ever increasing spatial resolution of the captured images, which has a major impact on the performance requirements of the memory subsystem. This is further amplified by the facts that the memory bandwidth requirements and energy consumption of accessing the captured images have started to become the bottlenecks in the design of high-performance image processing systems. Inspired by the successful application of progressive image sampling techniques in various image processing tasks, this work proposes the concept of Context-based Image Acquisition for hardware systems that efficiently trades image quality for reduced cost of the image acquisition process. Based on the proposed framework, a hardware architecture is developed which alters the conventional memory access pattern, to progressively and adaptively access pixels from a memory subsystem. The sampled pixels are used to reconstruct an approximation to the ground truth, which is stored in a high-performance image buffer for further processing. An instance of the architecture is prototyped on an FPGA and its performance evaluation shows that a saving of up to 85% of memory accessing time and 33%/45% of image acquisition time/energy are achieved on a set of benchmarks while maintaining a high PSNR.
Wang E, Davis J, Cheung P, et al., 2019, LUTNet: Rethinking Inference in FPGA Soft Logic, IEEE Symposium on Field-programmable Custom Computing Machines (FCCM) 2019, Publisher: IEEE, Pages: 26-34, ISSN: 2576-2621
Research has shown that deep neural networks contain significant redundancy, and that high classification accuracies can be achieved even when weights and activations are quantised down to binary values. Network binarisation on FPGAs greatly increases area efficiency by replacing resource-hungry multipliers with lightweight XNOR gates. However, an FPGA's fundamental building block, the K-LUT, is capable of implementing far more than an XNOR: it can perform any K-input Boolean operation. Inspired by this observation, we propose LUTNet, an end-to-end hardware-software framework for the construction of area-efficient FPGA-based neural network accelerators using the native LUTs as inference operators. We demonstrate that the exploitation of LUT flexibility allows for far heavier pruning than possible in prior works, resulting in significant area savings while achieving comparable accuracy. Against the state-of-the-art binarised neural network implementation, we achieve twice the area efficiency for several standard network models when inferencing popular datasets. We also demonstrate that even greater energy efficiency improvements are obtainable.
Wang E, Davis J, Zhao R, et al., 2019, Deep Neural Network Approximation for Custom Hardware: Where We've Been, Where We're Going, ACM Computing Surveys, Vol: 52, Pages: 40:1-40:39, ISSN: 0360-0300
Deep neural networks have proven to be particularly effective in visual and audio recognition tasks. Existing models tend to be computationally expensive and memory intensive, however, and so methods for hardware-oriented approximation have become a hot topic. Research has shown that custom hardware-based neural network accelerators can surpass their general-purpose processor equivalents in terms of both throughput and energy efficiency. Application-tailored accelerators, when co-designed with approximation-based network training methods, transform large, dense and computationally expensive networks into small, sparse and hardware-efficient alternatives, increasing the feasibility of network deployment. In this article, we provide a comprehensive evaluation of approximation methods for high-performance network inference along with in-depth discussion of their effectiveness for custom hardware implementation. We also include proposals for future research based on a thorough analysis of current trends. This article represents the first survey providing detailed comparisons of custom hardware accelerators featuring approximation for both convolutional and recurrent neural networks, through which we hope to inspire exciting new developments in the field.
Li Q, Wang E, Fleming ST, et al., 2019, Accelerating Position-Aware Top-k ListNet for Ranking under Custom Precision Regimes, 29th International Conference on Field-Programmable Logic and Applications (FPL), Publisher: IEEE, Pages: 81-87, ISSN: 1946-1488
Wang E, Davis JJ, Cheung P, 2018, A PYNQ-based Framework for Rapid CNN Prototyping, IEEE Symposium on Field-programmable Custom Computing Machines (FCCM) 2018, Publisher: IEEE, Pages: 223-223
This work presents a self-contained and modifiable framework for fast and easy convolutional neural network prototyping on the Xilinx PYNQ platform. With a Python-based programming interface, the framework combines the convenience of high-level abstraction with the speed of optimised FPGA implementation. Our work is freely available on GitHub for the community to use and build upon.
Zhao R, Liu S, Ng H, et al., 2018, Hardware Compilation of Deep Neural Networks: An Overview (invited), IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP) 2018, Publisher: IEEE, Pages: 1-8
Deploying a deep neural network model on a reconfigurable platform, such as an FPGA, is challenging due to the enormous design spaces of both network models and hardware design. A neural network model has various layer types, connection patterns and data representations, and the corresponding implementation can be customised with different architectural and modular parameters. Rather than manually exploring this design space, it is more effective to automate optimisation throughout an end-to-end compilation process. This paper provides an overview of recent literature proposing novel approaches to achieve this aim. We organise materials to mirror a typical compilation flow: front end, platform-independent optimisation and back end. Design templates for neural network accelerators are studied with a specific focus on their derivation methodologies. We also review previous work on network compilation and optimisation for other hardware platforms to gain inspiration regarding FPGA implementation. Finally, we propose some future directions for related research.
Davis JJ, Levine J, Stott E, et al., 2018, KOCL: Kernel-level Power Estimation for Arbitrary FPGA-SoC-accelerated OpenCL Applications, International Workshop on OpenCL (IWOCL) 2018, Publisher: ACM, Pages: 4:1-4:1
This work presents KOCL, a fully automated tool flow and accompanying software, accessible through a minimalist API, allowing OpenCL developers targeting FPGA-SoC devices to obtain kernel-level power estimates for their applications via function calls in their host code. KOCL is open-source, available with example applications at https://github.com/PRiME-project/KOCL. In order to maximise accessibility, KOCL necessitates no user exposure to hardware whatsoever.
Chen BHK, Cheung PYS, Cheung PYK, et al., 2018, CypherDB: a novel architecture for outsourcing secure database processing, IEEE Transactions on Cloud Computing, Vol: 6, Pages: 372-386, ISSN: 2168-7161
CypherDB addresses the problem of protecting the confidentiality of a database stored externally in a cloud and enabling efficient computation over it to thwart any curious-but-honest cloud computing service provider. It works by encrypting the entire outsourced database and executing queries over the encrypted data using our novel CypherDB secure processor architecture. To optimize computational efficiency, our proposed processor architecture provides tightly-coupled datapaths that avoid information leakage during database access and query execution. Our simulation using a well-known database benchmark TPC-H over a commercial grade Database Management System (SQLite) demonstrates that our proposed architecture incurs an average of about 10 percent overhead when compared with the same set of operations without secure database processing.
Davis JJ, Hung E, Levine JM, et al., 2018, KAPow: High-accuracy, Low-overhead Online Per-module Power Estimation for FPGA Designs, ACM Transactions on Reconfigurable Technology and Systems, Vol: 11, Pages: 2:1-2:22, ISSN: 1936-7406
In an FPGA system-on-chip design, it is often insufficient to merely assess the power consumption of the entire circuit by compile-time estimation or runtime power measurement. Instead, to make better decisions, one must understand the power consumed by each module in the system. In this work, we combine measurements of register-level switching activity and system-level power to build an adaptive online model that produces live breakdowns of power consumption within the design. Online model refinement avoids time-consuming characterisation while also allowing the model to track long-term operating condition changes. Central to our method is an automated flow that selects signals predicted to be indicative of high power consumption, instrumenting them for monitoring. We named this technique KAPow, for 'K'ounting Activity for Power estimation, which we show to be accurate and to have low overheads across a range of representative benchmarks. We also propose a strategy allowing for the identification and subsequent elimination of counters found to be of low significance at runtime, reducing algorithmic complexity without sacrificing significant accuracy. Finally, we demonstrate an application example in which a module-level power breakdown can be used to determine an efficient mapping of tasks to modules and reduce system-wide power consumption by up to 7%.
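The idea of fitting an online model from per-module activity counts and a single system-level power reading can be sketched as follows. The linear model and the normalised least-mean-squares update are assumptions chosen for brevity, and the class and method names are hypothetical; the abstract does not state which estimation algorithm KAPow uses.

```python
class OnlinePowerModel:
    """Linear model: total power ~ static + sum_i weight_i * activity_i."""

    def __init__(self, n_modules, step=0.5, eps=1e-9):
        # One weight per module plus a final weight for static power.
        self.weights = [0.0] * (n_modules + 1)
        self.step = step
        self.eps = eps

    def _features(self, activity):
        # Append a constant input so the last weight models static power.
        return list(activity) + [1.0]

    def predict(self, activity):
        x = self._features(activity)
        return sum(w * xi for w, xi in zip(self.weights, x))

    def update(self, activity, measured_power):
        """Take one normalised LMS step towards the latest measurement."""
        x = self._features(activity)
        error = measured_power - self.predict(activity)
        norm = sum(xi * xi for xi in x) + self.eps
        self.weights = [w + self.step * error * xi / norm
                        for w, xi in zip(self.weights, x)]
        return error

    def breakdown(self, activity):
        """Per-module dynamic power estimates, excluding the static term."""
        return [w * a for w, a in zip(self.weights, activity)]

model = OnlinePowerModel(n_modules=2)
model.update([100.0, 50.0], measured_power=3.0)
print(model.breakdown([100.0, 50.0]))
```

Because the model is refined on every new measurement, it needs no separate characterisation stage and can follow slow changes in operating conditions, which is the property the abstract highlights.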
Li Q, Fleming ST, Thomas DB, et al., 2018, Accelerating Top-k ListNet Training for Ranking Using FPGA, 17th International Conference on Field-Programmable Technology (FPT), Publisher: IEEE Computer Society, Pages: 245-248
Davis J, Levine J, Stott E, et al., 2017, STRIPE: Signal Selection for Runtime Power Estimation, International Conference on Field-programmable Logic and Applications (FPL) 2017, Publisher: IEEE
Knowledge of power consumption at a subsystem level can facilitate adaptive energy-saving techniques such as power gating, runtime task mapping and dynamic voltage and/or frequency scaling. While we have the ability to attribute power to an arbitrary hardware system's modules in real time, the selection of the particular signals to monitor for the purpose of power estimation within any given module has yet to be treated as a primary concern. In this paper, we show how the automatic analysis of circuit structure and behaviour inferred through vectored simulation can be used to produce high-quality rankings of signals' importance, with the resulting selections able to achieve lower power estimation error than those of prior work coupled with decreases in area, power and modelling complexity. In particular, by monitoring just eight signals per module (~0.3% of the total) across the 15 we examined, we demonstrate how to achieve runtime module-level estimation errors 1.5-6.9x lower than when reliant on the signal selections made in accordance with a more straightforward, previously published metric.
Davis JJ, Levine JM, Stott EA, et al., 2017, KOCL: Power Self-awareness for Arbitrary FPGA-SoC-accelerated OpenCL Applications, IEEE Design and Test, Vol: 34, Pages: 36-45, ISSN: 2168-2356
Given the need for developers to rapidly produce complex, high-performance and energy-efficient hardware systems, methods facilitating their intelligent runtime management are of ever-increasing importance. For energy optimization, such control decisions require knowledge of power usage at subsystem granularity. This information must be made accessible to developers now accustomed to creating systems from high-level descriptions, such as those written in OpenCL. To address these challenges, we introduce KOCL, a tool allowing OpenCL developers targeting FPGA-SoC devices to query live kernel-level power consumption using function calls embedded in their host code. KOCL is open-source, available online at https://github.com/PRiME-project/KOCL. To maximize accessibility, its use necessitates zero exposure to hardware.
Hung E, Davis JJ, Levine JM, et al., 2016, KAPow: A System Identification Approach to Online Per-Module Power Estimation in FPGA Designs, IEEE Symposium on Field-programmable Custom Computing Machines (FCCM) 2016, Publisher: IEEE, Pages: 56-63
In a modern FPGA system-on-chip design, it is often insufficient to simply assess the total power consumption of the entire circuit by design-time estimation or runtime power rail measurement. Instead, to make better runtime decisions, it is desirable to understand the power consumed by each individual module in the system. In this work, we combine board-level power measurements with register-level activity counting to build an online model that produces a breakdown of power consumption within the design. Online model refinement avoids the need for a time-consuming characterisation stage and also allows the model to track long-term changes to operating conditions. Our flow is named KAPow, a (loose) acronym for 'K'ounting Activity for Power estimation, which we show to be accurate, with per-module power estimates as close as +/-5mW to true measurements, and to have low overheads. We also demonstrate an application example in which a per-module power breakdown can be used to determine an efficient mapping of tasks to modules and reduce system-wide power consumption by over 8%.
Davis JJ, Cheung PYK, 2016, Reduced-precision Algorithm-based Fault Tolerance for FPGA-implemented Accelerators, International Symposium on Applied Reconfigurable Computing (ARC) 2016, Publisher: Springer, Pages: 361-368, ISSN: 0302-9743
As the threat of fault susceptibility caused by mechanisms including variation and degradation increases, engineers must give growing consideration to error detection and correction. While the use of common fault tolerance strategies frequently causes the incursion of significant overheads in area, performance and/or power consumption, options exist that buck these trends. In particular, algorithm-based fault tolerance embodies a proven family of low-overhead error mitigation techniques able to be built upon to create self-verifying circuitry. In this paper, we present our research into the application of algorithm-based fault tolerance (ABFT) in FPGA-implemented accelerators at reduced levels of precision. This allows for the introduction of a previously unexplored tradeoff: sacrificing the observability of faults associated with low-magnitude errors for gains in area, performance and efficiency by reducing the bit-widths of logic used for error detection. We describe the implementation of a novel checksum truncation technique, analysing its effects upon overheads and allowed error. Our findings include that bit-width reduction of ABFT circuitry within a fault-tolerant accelerator used for multiplying pairs of 32 x 32 matrices resulted in the reduction of incurred area overhead by 16.7% and recovery of 8.27% of timing model Fmax. These came at the cost of introducing average and maximum absolute output errors of 0.430% and 0.927%, respectively, of the maximum absolute output value under transient fault injection.
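The tradeoff described above, giving up visibility of low-magnitude errors in exchange for narrower checking logic, can be sketched with a classic checksum test for matrix multiplication. The check uses the identity that the column sums of A x B equal (column sums of A) x B. Dropping low-order bits before the comparison is an assumed stand-in for the paper's checksum truncation technique, whose details are not given in the abstract; all function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def column_checksums(m):
    """Sum each column of a matrix."""
    return [sum(col) for col in zip(*m)]

def abft_check(a, b, c, drop_bits=0):
    """Return the indices of columns of c whose checksum looks faulty.

    Predicted checksums come from (column sums of a) x b. Both sides are
    shifted right by drop_bits before comparison, so errors smaller than
    2**drop_bits may go unnoticed, while larger ones are always caught.
    """
    predicted = matmul([column_checksums(a)], b)[0]
    observed = column_checksums(c)
    return [j for j, (p, o) in enumerate(zip(predicted, observed))
            if (p >> drop_bits) != (o >> drop_bits)]

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = matmul(a, b)
c[0][1] += 64                           # inject a large error into column 1
print(abft_check(a, b, c, drop_bits=4))  # [1]
```

In hardware terms, a larger `drop_bits` means narrower adders and comparators in the checking path, at the price of the small output errors that the abstract quantifies.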
Davis JJ, Hung E, Levine J, et al., 2016, Knowledge is Power: Module-level Sensing for Runtime Optimisation, ACM/SIGDA International Symposium on Field-programmable Gate Arrays (FPGA) 2016, Publisher: ACM, Pages: 276-276
We propose the compile-time instrumentation of coexisting modules (IP blocks, accelerators, etc.) implemented in FPGAs. The efficient mapping of tasks to execution units can then be achieved, for power and/or timing performance, by tracking dynamic power consumption and/or timing slack online at module-level granularity. Our proposed instrumentation is transparent, thereby not affecting circuit functionality. Power and timing overheads have proven to be small and tend to be outweighed by the exposed runtime benefits.
Su J, Thomas DB, Cheung PYK, 2016, Increasing Network Size and Training Throughput of FPGA Restricted Boltzmann Machines using Dropout, 24th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), Publisher: IEEE, Pages: 48-51
Chen BHK, Cheung PYS, Cheung PYK, et al., 2015, An Efficient Architecture for Zero Overhead Data En-/Decryption using Reconfigurable Cryptographic Engine, International Conference on Field Programmable Technology (FPT), Publisher: IEEE, Pages: 248-251
Chau TCP, Niu X, Eele A, et al., 2014, Mapping Adaptive Particle Filters to Heterogeneous Reconfigurable Systems, ACM Transactions on Reconfigurable Technology and Systems, Vol: 7, ISSN: 1936-7414
This article presents an approach for mapping real-time applications based on particle filters (PFs) to heterogeneous reconfigurable systems, which typically consist of multiple FPGAs and CPUs. A method is proposed to adapt the number of particles dynamically and to utilise runtime reconfigurability of FPGAs for reduced power and energy consumption. A data compression scheme is employed to reduce communication overhead between FPGAs and CPUs. A mobile robot localisation and tracking application is developed to illustrate our approach. Experimental results show that the proposed adaptive PF can reduce up to 99% of computation time. Using runtime reconfiguration, we achieve a 25% to 34% reduction in idle power. A 1U system with four FPGAs is up to 169 times faster than a single-core CPU and 41 times faster than a 1U CPU server with 12 cores. It is also estimated to be 3 times faster than a system with four GPUs.
Davis J, Cheung PYK, 2014, Achieving Low-overhead Fault Tolerance for Parallel Accelerators with Dynamic Partial Reconfiguration, International Conference on Field-programmable Logic and Applications (FPL) 2014, Publisher: IEEE, Pages: 1-6, ISSN: 1946-147X
While allowing for the fabrication of increasingly complex and efficient circuitry, transistor shrinkage and count-per-device expansion have major downsides: chiefly increased variation, degradation and fault susceptibility. For this reason, design-time consideration of fault tolerance will have to be given to increasing numbers of electronic systems in the future to ensure yields, reliabilities and lifetimes remain acceptably high. Many commonly implemented operators are suited to modification resulting in datapath error detection capabilities with low area overheads. FPGAs are uniquely placed to allow further area savings to be made when incorporating fault avoidance mechanisms thanks to their dynamic reconfigurability. In this paper, we examine the practicalities and costs involved in implementing hardware-software fault tolerance on a test platform: a parallel matrix multiplication accelerator in hardware, with controller in software, running on a Xilinx Zynq system-on-chip. A combination of 'bolt-on' error detection logic and software-triggered routing reconfiguration serves to provide low-overhead datapath fault tolerance at runtime. Rapid yet accurate fault diagnoses along with low hardware (area), software (configuration storage) and performance penalties are achieved.
Davis J, Cheung PYK, 2014, Reducing Overheads for Fault-tolerant Datapaths with Dynamic Partial Reconfiguration, IEEE Symposium on Field-programmable Custom Computing Machines (FCCM) 2014, Publisher: IEEE, Pages: 103-103
As process scaling and transistor count inflation continue, silicon chips are becoming increasingly susceptible to faults. Although FPGAs are particularly vulnerable to these effects, their runtime reconfigurability offers unique opportunities for fault tolerance. This work presents an application combining algorithmic-level error detection with dynamic partial reconfiguration (DPR) to allow faults manifested within its datapath at runtime to be circumvented at low cost.
Guan Z, Wong JSJ, Chaudhuri S, et al., 2014, Classification on variation maps: a new placement strategy to alleviate process variation on FPGA, IEICE Electronics Express, Vol: 11, ISSN: 1349-2543
Davis JJ, Cheung PYK, 2014, Datapath Fault Tolerance for Parallel Accelerators, International Conference on Field-programmable Technology (FPT) 2013, Publisher: IEEE, Pages: 366-369
While we reap the benefits of process scaling in terms of transistor density and switching speed, consideration must be given to the negative effects it causes: increased variation, degradation and fault susceptibility. Above device level, such phenomena and the faults they induce can lead to reduced yield, decreased system reliability and, in extreme cases, total failure after a period of successful operation. Although error detection and correction are almost always considered for highly sensitive and susceptible applications such as those in space, for other, more general-purpose applications they are often overlooked. In this paper, we present a parallel matrix multiplication accelerator running in hardware on the Xilinx Zynq system-on-chip platform, along with 'bolt-on' logic for detecting, locating and avoiding faults within its datapath. Designs of various sizes are compared with respect to resource overhead and performance impact. Our largest-implemented fault-tolerant accelerator was found to consume 17.3% more area, run at a 3.95% lower frequency and incur an 18.8% execution time penalty over its equivalent fault-susceptible design during fault-free operation.
Chau TCP, Kurek M, Targett JS, et al., 2014, SMCGen: Generating Reconfigurable Design for Sequential Monte Carlo Applications, 22nd IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Publisher: IEEE, Pages: 141-148