340 results found
Research has shown that deep neural networks contain significant redundancy, and that high classification accuracies can be achieved even when weights and activations are quantised down to binary values. Network binarisation on FPGAs greatly increases area efficiency by replacing resource-hungry multipliers with lightweight XNOR gates. However, an FPGA's fundamental building block, the K-LUT, is capable of implementing far more than an XNOR: it can perform any K-input Boolean operation. Inspired by this observation, we propose LUTNet, an end-to-end hardware-software framework for the construction of area-efficient FPGA-based neural network accelerators using the native LUTs as inference operators. We demonstrate that the exploitation of LUT flexibility allows for far heavier pruning than possible in prior works, resulting in significant area savings while achieving comparable accuracy. Against the state-of-the-art binarised neural network implementation, we achieve twice the area efficiency for several standard network models when inferencing popular datasets. We also demonstrate that even greater energy efficiency improvements are obtainable.
Wang E, Davis J, Zhao R, et al., 2019, Deep Neural Network Approximation for Custom Hardware: Where We've Been, Where We're Going, ACM Computing Surveys, Vol: 52, Pages: 40:1-40:39, ISSN: 0360-0300
Deep neural networks have proven to be particularly effective in visual and audio recognition tasks. Existing models tend to be computationally expensive and memory intensive, however, and so methods for hardware-oriented approximation have become a hot topic. Research has shown that custom hardware-based neural network accelerators can surpass their general-purpose processor equivalents in terms of both throughput and energy efficiency. Application-tailored accelerators, when co-designed with approximation-based network training methods, transform large, dense and computationally expensive networks into small, sparse and hardware-efficient alternatives, increasing the feasibility of network deployment. In this article, we provide a comprehensive evaluation of approximation methods for high-performance network inference along with in-depth discussion of their effectiveness for custom hardware implementation. We also include proposals for future research based on a thorough analysis of current trends. This article represents the first survey providing detailed comparisons of custom hardware accelerators featuring approximation for both convolutional and recurrent neural networks, through which we hope to inspire exciting new developments in the field.
Wang E, Davis JJ, Cheung P, 2018, A PYNQ-based Framework for Rapid CNN Prototyping, IEEE Symposium on Field-programmable Custom Computing Machines (FCCM) 2018, Publisher: IEEE, Pages: 223-223
This work presents a self-contained and modifiable framework for fast and easy convolutional neural network prototyping on the Xilinx PYNQ platform. With a Python-based programming interface, the framework combines the convenience of high-level abstraction with the speed of optimised FPGA implementation. Our work is freely available on GitHub for the community to use and build upon.
Zhao R, Liu S, Ng H, et al., 2018, Hardware Compilation of Deep Neural Networks: An Overview (invited), IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP) 2018, Publisher: IEEE, Pages: 1-8
Deploying a deep neural network model on a reconfigurable platform, such as an FPGA, is challenging due to the enormous design spaces of both network models and hardware design. A neural network model has various layer types, connection patterns and data representations, and the corresponding implementation can be customised with different architectural and modular parameters. Rather than manually exploring this design space, it is more effective to automate optimisation throughout an end-to-end compilation process. This paper provides an overview of recent literature proposing novel approaches to achieve this aim. We organise materials to mirror a typical compilation flow: front end, platform-independent optimisation and back end. Design templates for neural network accelerators are studied with a specific focus on their derivation methodologies. We also review previous work on network compilation and optimisation for other hardware platforms to gain inspiration regarding FPGA implementation. Finally, we propose some future directions for related research.
Davis JJ, Levine J, Stott E, et al., 2018, KOCL: Kernel-level Power Estimation for Arbitrary FPGA-SoC-accelerated OpenCL Applications, International Workshop on OpenCL (IWOCL) 2018, Publisher: ACM, Pages: 4:1-4:1
This work presents KOCL, a fully automated tool flow and accompanying software, accessible through a minimalist API, allowing OpenCL developers targetting FPGA-SoC devices to obtain kernel-level power estimates for their applications via function calls in their host code. KOCL is open-source, available with example applications at https://github.com/PRiME-project/KOCL. In order to maximise accessibility, KOCL necessitates no user exposure to hardware whatsoever.
Chen BHK, Cheung PYS, Cheung PYK, et al., 2018, CypherDB: a novel architecture for outsourcing secure database processing, IEEE Transactions on Cloud Computing, Vol: 6, Pages: 372-386, ISSN: 2168-7161
CypherDB addresses the problem of protecting the confidentiality of database stored externally in a cloud and enabling efficient computation over it to thwart any curious-but-honest cloud computing service provider. It works by encrypting the entire outsourced database and executing queries over the encrypted data using our novel CypherDB secure processor architecture. To optimize computational efficiency, our proposed processor architecture provides tightly-coupled datapaths that avoid information leakage during database access and query execution. Our simulation using a well-known database benchmark TPC-H over a commercial grade Database Management System (SQLite) demonstrates that our proposed architecture incurs an average of about 10 percent overhead when compared with the same set of operations without secure database processing.
Davis JJ, Hung E, Levine JM, et al., 2018, KAPow: High-accuracy, Low-overhead Online Per-module Power Estimation for FPGA Designs, ACM Transactions on Reconfigurable Technology and Systems, Vol: 11, Pages: 2:1-2:22, ISSN: 1936-7406
In an FPGA system-on-chip design, it is often insufficient to merely assess the power consumption of the entire circuit by compile-time estimation or runtime power measurement. Instead, to make better decisions, one must understand the power consumed by each module in the system. In this work, we combine measurements of register-level switching activity and system-level power to build an adaptive online model that produces live breakdowns of power consumption within the design. Online model refinement avoids time-consuming characterisation while also allowing the model to track long-term operating condition changes. Central to our method is an automated flow that selects signals predicted to be indicative of high power consumption, instrumenting them for monitoring. We named this technique KAPow, for 'K'ounting Activity for Power estimation, which we show to be accurate and to have low overheads across a range of representative benchmarks. We also propose a strategy allowing for the identification and subsequent elimination of counters found to be of low significance at runtime, reducing algorithmic complexity without sacrificing significant accuracy. Finally, we demonstrate an application example in which a module-level power breakdown can be used to determine an efficient mapping of tasks to modules and reduce system-wide power consumption by up to 7%.
Davis J, Levine J, Stott E, et al., 2017, STRIPE: Signal Selection for Runtime Power Estimation, International Confererence on Field-programmable Logic and Applications (FPL) 2017, Publisher: IEEE
Knowledge of power consumption at a subsystem level can facilitate adaptive energy-saving techniques such as power gating, runtime task mapping and dynamic voltage and/or frequency scaling. While we have the ability to attribute power to an arbitrary hardware system's modules in real time, the selection of the particular signals to monitor for the purpose of power estimation within any given module has yet to be treated as a primary concern. In this paper, we show how the automatic analysis of circuit structure and behaviour inferred through vectored simulation can be used to produce high-quality rankings of signals' importance, with the resulting selections able to achieve lower power estimation error than those of prior work coupled with decreases in area, power and modelling complexity. In particular, by monitoring just eight signals per module (~0.3% of the total) across the 15 we examined, we demonstrate how to achieve runtime module-level estimation errors 1.5--6.9x lower than when reliant on the signal selections made in accordance with a more straightforward, previously published metric.
Davis JJ, Levine JM, Stott EA, et al., 2017, KOCL: Power Self-awareness for Arbitrary FPGA-SoC-accelerated OpenCL Applications, IEEE Design and Test, Vol: 34, Pages: 36-45, ISSN: 2168-2356
Given the need for developers to rapidly produce complex, high-performance and energy-efficient hardware systems, methods facilitating their intelligent runtime management are of ever-increasing importance. For energy optimization, such control decisions require knowledge of power usage at subsystem granularity. This information must be made accessible to developers now accustomed to creating systems from high-level descriptions, such as those written in OpenCL. To address these challenges, we introduce KOCL, a tool allowing OpenCL developers targeting FPGA-SoC devices to query live kernel-level power consumption using function calls embedded in their host code. KOCL is open-source, available online at https://github.com/PRiME-project/KOCL. To maximize accessibility, its use necessitates zero exposure to hardware.
Hung E, Davis JJ, Levine JM, et al., 2016, KAPow: A System Identification Approach to Online Per-Module Power Estimation in FPGA Designs, IEEE Symposium on Field-programmable Custom Computing Machines (FCCM) 2016, Publisher: IEEE, Pages: 56-63
In a modern FPGA system-on-chip design, it is often insufficient to simply assess the total power consumption of the entire circuit by design-time estimation or runtime power rail measurement. Instead, to make better runtime decisions, it is desirable to understand the power consumed by each individual module in the system. In this work, we combine board-level power measurements with register-level activity counting to build an online model that produces a breakdown of power consumption within the design. Online model refinement avoids the need for a time-consuming characterisation stage and also allows the model to track long-term changes to operating conditions. Our flow is named KAPow, a (loose) acronym for 'K'ounting Activity for Power estimation, which we show to be accurate, with per-module power estimates as close to +/-5mW of true measurements, and to have low overheads. We also demonstrate an application example in which a per-module power breakdown can be used to determine an efficient mapping of tasks to modules and reduce system-wide power consumption by over 8%.
Davis JJ, Cheung PYK, 2016, Reduced-precision Algorithm-based Fault Tolerance for FPGA-implemented Accelerators, International Symposium on Applied Reconfigurable Computing (ARC) 2016, Publisher: Springer, Pages: 361-368, ISSN: 0302-9743
As the threat of fault susceptibility caused by mechanisms including variation and degradation increases, engineers must give growing consideration to error detection and correction. While the use of common fault tolerance strategies frequently causes the incursion of significant overheads in area, performance and/or power consumption, options exist that buck these trends. In particular, algorithm-based fault tolerance embodies a proven family of low-overhead error mitigation techniques able to be built upon to create self-verifying circuitry. In this paper, we present our research into the application of algorithm-based fault tolerance (ABFT) in FPGA-implemented accelerators at reduced levels of precision. This allows for the introduction of a previously unexplored tradeoff: sacrificing the observability of faults associated with low-magnitude errors for gains in area, performance and efficiency by reducing the bit-widths of logic used for error detection. We describe the implementation of a novel checksum truncation technique, analysing its effects upon overheads and allowed error. Our findings include that bit-width reduction of ABFT circuitry within a fault-tolerant accelerator used for multiplying pairs of 32 x 32 matrices resulted in the reduction of incurred area overhead by 16.7% and recovery of 8.27% of timing model Fmax. These came at the cost of introducing average and maximum absolute output errors of 0.430% and 0.927%, respectively, of the maximum absolute output value under transient fault injection.
Davis JJ, Hung E, Levine J, et al., 2016, Knowledge is Power: Module-level Sensing for Runtime Optimisation, ACM/SIGDA International Symposium on Field-programmable Gate Arrays (FPGA) 2016, Publisher: ACM, Pages: 276-276
We propose the compile-time instrumentation of coexisting modules---IP blocks, accelerators, etc.---implemented in FPGAs. The efficient mapping of tasks to execution units can then be achieved, for power and/or timing performance, by tracking dynamic power consumption and/or timing slack online at module-level granularity. Our proposed instrumentation is transparent, thereby not affecting circuit functionality. Power and timing overheads have proven to be small and tend to be outweighed by the exposed runtime benefits.
Su J, Thomas DB, Cheung PYK, 2016, Increasing Network Size and Training Throughput of FPGA Restricted Boltzmann Machines using Dropout, 24th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), Publisher: IEEE, Pages: 48-51
Chen BHK, Cheung PYS, Cheung PYK, et al., 2015, An Efficient Architecture for Zero Overhead Data En-/Decryption using Reconfigurable Cryptographic Engine, International Conference on Field Programmable Technology (FTP), Publisher: IEEE, Pages: 248-251
Chau TCP, Niu X, Eele A, et al., 2014, Mapping Adaptive Particle Filters to Heterogeneous Reconfigurable Systems, ACM Transactions on Reconfigurable Technology and Systems, Vol: 7, ISSN: 1936-7414
This article presents an approach for mapping real-time applications based on particle filters (PFs) toheterogeneous reconfigurable systems, which typically consist of multiple FPGAs and CPUs. A method isproposed to adapt the number of particles dynamically and to utilise runtime reconfigurability of FPGAs forreduced power and energy consumption. A data compression scheme is employed to reduce communicationoverhead between FPGAs and CPUs. A mobile robot localisation and tracking application is developed toillustrate our approach. Experimental results show that the proposed adaptive PF can reduce up to 99% ofcomputation time. Using runtime reconfiguration, we achieve a 25% to 34% reduction in idle power. A 1Usystem with four FPGAs is up to 169 times faster than a single-core CPU and 41 times faster than a 1UCPU server with 12 cores. It is also estimated to be 3 times faster than a system with four GPUs.
Davis J, Cheung PYK, 2014, Achieving Low-overhead Fault Tolerance for Parallel Accelerators with Dynamic Partial Reconfiguration, International Conference on Field-programmable Logic and Applications (FPL) 2014, Publisher: IEEE, Pages: 1-6, ISSN: 1946-147X
While allowing for the fabrication of increasingly complex and efficient circuitry, transistor shrinkage and count-per-device expansion have major downsides: chiefly increased variation, degradation and fault susceptibility. For this reason, design-time consideration of fault tolerance will have to be given to increasing numbers of electronic systems in the future to ensure yields, reliabilities and lifetimes remain acceptably high. Many commonly implemented operators are suited to modification resulting in datapath error detection capabilities with low area overheads. FPGAs are uniquely placed to allow further area savings to be made when incorporating fault avoidance mechanisms thanks to their dynamic reconfigurability. In this paper, we examine the practicalities and costs involved in implementing hardware-software fault tolerance on a test platform: a parallel matrix multiplication accelerator in hardware, with controller in software, running on a Xilinx Zynq system-on-chip. A combination of `bolt-on' error detection logic and software-triggered routing reconfiguration serve to provide low-overhead datapath fault tolerance at runtime. Rapid yet accurate fault diagnoses along with low hardware (area), software (configuration storage) and performance penalties are achieved.
Davis J, Cheung PYK, 2014, Reducing Overheads for Fault-tolerant Datapaths with Dynamic Partial Reconfiguration, IEEE Symposium on Field-programmable Custom Computing Machines (FCCM) 2014, Publisher: IEEE, Pages: 103-103
As process scaling and transistor count inflation continue, silicon chips are becoming increasingly susceptible to faults. Although FPGAs are particularly vulnerable to these effects, their runtime reconfigurability offers unique opportunities for fault tolerance. This work presents an application combining algorithmic-level error detection with dynamic partial reconfiguration (DPR) to allow faults manifested within its datapath at runtime to be circumvented at low cost.
Guan Z, Wong JSJ, Chaudhuri S, et al., 2014, Classification on variation maps: a new placement strategy to alleviate process variation on FPGA, IEICE ELECTRONICS EXPRESS, Vol: 11, ISSN: 1349-2543
Davis JJ, Cheung PYK, 2014, Datapath Fault Tolerance for Parallel Accelerators, International Conference on Field-programmable Technology (FPT) 2013, Publisher: IEEE, Pages: 366-369
While we reap the benefits of process scaling in terms of transistor density and switching speed, consideration must be given to the negative effects it causes: increased variation, degradation and fault susceptibility. Above device level, such phenomena and the faults they induce can lead to reduced yield, decreased system reliability and, in extreme cases, total failure after a period of successful operation. Although error detection and correction are almost always considered for highly sensitive and susceptible applications such as those in space, for other, more general-purpose applications they are often overlooked. In this paper, we present a parallel matrix multiplication accelerator running in hardware on the Xilinx Zynq system-on-chip platform, along with 'bolt-on' logic for detecting, locating and avoiding faults within its datapath. Designs of various sizes are compared with respect to resource overhead and performance impact. Our largest-implemented fault-tolerant accelerator was found to consume 17.3% more area, run at a 3.95% lower frequency and incur an 18.8% execution time penalty over its equivalent fault-susceptible design during fault-free operation.
Chau TCP, Kurek M, Targett JS, et al., 2014, SMCGen: Generating Reconfigurable Design for Sequential Monte Carlo Applications, 22nd IEEE Annual International Symposium on Field-Programmable Custom Computing Machines ((FCCM), Publisher: IEEE, Pages: 141-148
Stott E, Levine JM, Cheung PYK, et al., 2014, Timing Fault Detection in FPGA-based Circuits, 22nd IEEE Annual International Symposium on Field-Programmable Custom Computing Machines ((FCCM), Publisher: IEEE, Pages: 96-99
Guan Z, Wong JSJ, Chaudhuri S, et al., 2014, Mitigation of process variation effect in FPGAs with partial rerouting method, IEICE ELECTRONICS EXPRESS, Vol: 11, ISSN: 1349-2543
Liu J, Bouganis C, Cheung PYK, 2014, Image Progressive Acquisition for Hardware Systems, Design, Automation and Test in Europe Conference and Exhibition (DATE), Publisher: IEEE, ISSN: 1530-1591
Liu J, Bouganis C, Cheung PYK, 2014, Kernel-based Adaptive Image Sampling, 9th International Conference on Computer Vision Theory and Applications (VISAPP), Publisher: IEEE, Pages: 25-32
Wong JSJ, Cheung PYK, 2013, Timing Measurement Platform for Arbitrary Black-Box Circuits Based on Transition Probability, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol: 21, Pages: 2307-2320, ISSN: 1063-8210
Powell A, Savvas-Bouganis C, Cheung PYK, 2013, High-level power and performance estimation of FPGA-based soft processors and its application to design space exploration, JOURNAL OF SYSTEMS ARCHITECTURE, Vol: 59, Pages: 1144-1156, ISSN: 1383-7621
Guan Z, Wong JSJ, Chaudhuri S, et al., 2013, A VARIATION-ADAPTIVE RETIMING METHOD EXPLOITING RECONFIGURABILITY, 23rd International Conference on Field Programmable Logic and Applications (FPL), Publisher: IEEE, ISSN: 1946-1488
Levine JM, Stott E, Constantinides GA, et al., 2013, SMI: SLACK MEASUREMENT INSERTION FOR ONLINE TIMING MONITORING IN FPGAS, 23rd International Conference on Field Programmable Logic and Applications (FPL), Publisher: IEEE, ISSN: 1946-1488
Guan Z, Wong JSJ, Chaudhuri S, et al., 2013, Exploiting Stochastic Delay Variability on FPGAs with Adaptive Partial Rerouting, 12th International Conference on Field-Programmable Technology (FPT), Publisher: IEEE, Pages: 254-261
This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.