23 results found
Wang E, Davis J, Moro D, et al., 2021, Enabling Binary Neural Network Training on the Edge
The ever-growing computational demands of increasingly complex machine learning models frequently necessitate the use of powerful cloud-based infrastructure for their training. Binary neural networks are known to be promising candidates for on-device inference due to their extreme compute and memory savings over higher-precision alternatives. In this paper, we demonstrate that they are also strongly robust to gradient quantization, thereby making the training of modern models on the edge a practical reality. We introduce a low-cost binary neural network training strategy exhibiting sizable memory footprint reductions and energy savings vs Courbariaux & Bengio's standard approach. Against the latter, we see coincident memory requirement and energy consumption drops of 2--6x, while reaching similar test accuracy in comparable time, across a range of small-scale models trained to classify popular datasets. We also showcase ImageNet training of ResNetE-18, achieving a 3.12x memory reduction over the aforementioned standard. Such savings will allow for unnecessary cloud offloading to be avoided, reducing latency, increasing energy efficiency and safeguarding privacy.
Wang E, Davis JJ, Cheung P, et al., 2020, LUTNet: learning FPGA configurations for highly efficient neural network inference, IEEE Transactions on Computers, Vol: 69, Pages: 1795-1808, ISSN: 0018-9340
Research has shown that deep neural networks contain significant redundancy, and thus that high classification accuracy can be achieved even when weights and activations are quantized down to binary values. Network binarization on FPGAs greatly increases area efficiency by replacing resource-hungry multipliers with lightweight XNOR gates. However, an FPGA's fundamental building block, the K-LUT, is capable of implementing far more than an XNOR: it can perform any K-input Boolean operation. Inspired by this observation, we propose LUTNet, an end-to-end hardware-software framework for the construction of area-efficient FPGA-based neural network accelerators using the native LUTs as inference operators. We describe the realization of both unrolled and tiled LUTNet architectures, with the latter facilitating smaller, less power-hungry deployment over the former while sacrificing area and energy efficiency along with throughput. For both varieties, we demonstrate that the exploitation of LUT flexibility allows for far heavier pruning than possible in prior works, resulting in significant area savings while achieving comparable accuracy. Against the state-of-the-art binarized neural network implementation, we achieve up to twice the area efficiency for several standard network models when inferencing popular datasets. We also demonstrate that even greater energy efficiency improvements are obtainable.
Li H, McInerney I, Davis J, et al., 2020, Digit stability inference for iterative methods using redundant number representation, IEEE Transactions on Computers, Pages: 1-8, ISSN: 0018-9340
In our recent work on iterative computation in hardware, we showed that arbitrary-precision solvers can perform more favorably than their traditional arithmetic equivalents when the latter's precisions are either under- or over-budgeted for the solution of the problem at hand. Significant proportions of these performance improvements stem from the ability to infer the existence of identical most-significant digits between iterations. This technique uses properties of algorithms operating on redundantly represented numbers to allow the generation of those digits to be skipped, increasing efficiency. It is unable, however, to guarantee that digits will stabilize, i.e., never change in any future iteration. In this article, we address this shortcoming, using interval and forward error analyses to prove that digits of high significance will become stable when computing the approximants of systems of linear equations using stationary iterative methods. We formalize the relationship between matrix conditioning and the rate of growth in most-significant digit stability, using this information to converge to our desired results more quickly. Versus our previous work, an exemplary hardware realization of this new technique achieves an up-to 2.2x speedup in the solution of a set of variously conditioned systems using the Jacobi method.
Li H, Davis J, Wickerson J, et al., 2019, ARCHITECT: Arbitrary-precision Hardware with Digit Elision for Efficient Iterative Compute, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol: 28, Pages: 516-529, ISSN: 1063-8210
Many algorithms feature an iterative loop that converges to the result of interest. The numerical operations in such algorithms are generally implemented using finite-precision arithmetic, either fixed- or floating-point, most of which operate least-significant digit first. This results in a fundamental problem: if, after some time, the result has not converged, is this because we have not run the algorithm for enough iterations or because the arithmetic in some iterations was insufficiently precise? There is no easy way to answer this question, so users will often over-budget precision in the hope that the answer will always be to run for a few more iterations. We propose a fundamentally new approach: with the appropriate arithmetic able to generate results from most-significant digit first, we show that fixed compute-area hardware can be used to calculate an arbitrary number of algorithmic iterations to arbitrary precision, with both precision and approximant index increasing in lockstep. Consequently, datapaths constructed following our principles demonstrate efficiency over their traditional arithmetic equivalents where the latter's precisions are either under- or over-budgeted for the computation of a result to a particular accuracy. Use of most-significant digit-first arithmetic additionally allows us to declare certain digits to be stable at runtime, avoiding their recalculation in subsequent iterations and thereby increasing performance and decreasing memory footprints. Versus arbitrary-precision iterative solvers without the optimisations we detail herein, we achieve up-to 16x performance speedups and 1.9x memory savings for the evaluated benchmarks.
Wang E, Davis J, Cheung P, et al., 2019, LUTNet: Rethinking Inference in FPGA Soft Logic, IEEE Symposium on Field-programmable Custom Computing Machines (FCCM) 2019, Publisher: IEEE, Pages: 26-34, ISSN: 2576-2621
Research has shown that deep neural networks contain significant redundancy, and that high classification accuracies can be achieved even when weights and activations are quantised down to binary values. Network binarisation on FPGAs greatly increases area efficiency by replacing resource-hungry multipliers with lightweight XNOR gates. However, an FPGA's fundamental building block, the K-LUT, is capable of implementing far more than an XNOR: it can perform any K-input Boolean operation. Inspired by this observation, we propose LUTNet, an end-to-end hardware-software framework for the construction of area-efficient FPGA-based neural network accelerators using the native LUTs as inference operators. We demonstrate that the exploitation of LUT flexibility allows for far heavier pruning than possible in prior works, resulting in significant area savings while achieving comparable accuracy. Against the state-of-the-art binarised neural network implementation, we achieve twice the area efficiency for several standard network models when inferencing popular datasets. We also demonstrate that even greater energy efficiency improvements are obtainable.
Wang E, Davis J, Zhao R, et al., 2019, Deep Neural Network Approximation for Custom Hardware: Where We've Been, Where We're Going, ACM Computing Surveys, Vol: 52, Pages: 40:1-40:39, ISSN: 0360-0300
Deep neural networks have proven to be particularly effective in visual and audio recognition tasks. Existing models tend to be computationally expensive and memory intensive, however, and so methods for hardware-oriented approximation have become a hot topic. Research has shown that custom hardware-based neural network accelerators can surpass their general-purpose processor equivalents in terms of both throughput and energy efficiency. Application-tailored accelerators, when co-designed with approximation-based network training methods, transform large, dense and computationally expensive networks into small, sparse and hardware-efficient alternatives, increasing the feasibility of network deployment. In this article, we provide a comprehensive evaluation of approximation methods for high-performance network inference along with in-depth discussion of their effectiveness for custom hardware implementation. We also include proposals for future research based on a thorough analysis of current trends. This article represents the first survey providing detailed comparisons of custom hardware accelerators featuring approximation for both convolutional and recurrent neural networks, through which we hope to inspire exciting new developments in the field.
Li H, Davis JJ, Wickerson J, et al., 2018, Digit Elision for Arbitrary-accuracy Iterative Computation, IEEE Symposium on Computer Arithmetic (ARITH) 2018, Publisher: IEEE, Pages: 107-114, ISSN: 2576-2265
Recently, a fixed compute-resource hardware architecture was proposed to enable the iterative solution of systems of linear equations to arbitrary accuracies. This technique, named ARCHITECT, achieves exact numeric computation by using online arithmetic to allow the refinement of results from earlier iterations over time, eschewing rounding error. ARCHITECT has a key drawback, however: often, many more digits than strictly necessary are generated, with this problem exacerbating the more accurate a solution is sought. In this paper, we infer the locations of these superfluous digits within stationary iterative calculations by exploiting online arithmetic’s digit dependencies and using forward error analysis. We demonstrate that their lack of computation is guaranteed not to affect the ability to reach a solution of any accuracy. Versus ARCHITECT, our illustrative hardware implementation achieves a geometric mean 20.1x speedup in the solution of a set of representative linear systems through the avoidance of redundant digit calculation. For the calculation of high-precision results, we also obtain an up-to 22.4x memory requirement reduction over the same architecture. Finally, we demonstrate that iterative solvers implemented following our proposals show superiority over conventional arithmetic implementations by virtue of their runtime-tunable precisions.
Wang E, Davis JJ, Cheung P, 2018, A PYNQ-based Framework for Rapid CNN Prototyping, IEEE Symposium on Field-programmable Custom Computing Machines (FCCM) 2018, Publisher: IEEE, Pages: 223-223
This work presents a self-contained and modifiable framework for fast and easy convolutional neural network prototyping on the Xilinx PYNQ platform. With a Python-based programming interface, the framework combines the convenience of high-level abstraction with the speed of optimised FPGA implementation. Our work is freely available on GitHub for the community to use and build upon.
Zhao R, Liu S, Ng H, et al., 2018, Hardware Compilation of Deep Neural Networks: An Overview (invited), IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP) 2018, Publisher: IEEE, Pages: 1-8
Deploying a deep neural network model on a reconfigurable platform, such as an FPGA, is challenging due to the enormous design spaces of both network models and hardware design. A neural network model has various layer types, connection patterns and data representations, and the corresponding implementation can be customised with different architectural and modular parameters. Rather than manually exploring this design space, it is more effective to automate optimisation throughout an end-to-end compilation process. This paper provides an overview of recent literature proposing novel approaches to achieve this aim. We organise materials to mirror a typical compilation flow: front end, platform-independent optimisation and back end. Design templates for neural network accelerators are studied with a specific focus on their derivation methodologies. We also review previous work on network compilation and optimisation for other hardware platforms to gain inspiration regarding FPGA implementation. Finally, we propose some future directions for related research.
Bragg G, Leech C, Balsamo D, et al., 2018, An Application- and Platform-agnostic Runtime Management Framework for Multicore Systems, International Joint Conference on Pervasive and Embedded Computing and Communication Systems (PECCS) 2018, Publisher: SciTePress, Pages: 57-66
Heterogeneous multiprocessor systems have increased in complexity to provide both high performance and energy efficiency for a diverse range of applications. This motivates the need for a standard framework that enables the management, at runtime, of software applications executing on these processors. This paper proposes the first fully application- and platform-agnostic framework for runtime management approaches that control and optimise software applications and hardware resources. This is achieved by separating the system into three distinct layers connected by an API and cross-layer constructs called knobs and monitors. The proposed framework also supports the management of applications that are executing concurrently on heterogeneous platforms. The operation of the proposed framework is experimentally validated using a basic runtime controller and two heterogeneous platforms, to show how it is application- and platform-agnostic and easy to use. Furthermore, the management of concurrently executing applications through the framework is demonstrated. Finally, two recently reported runtime management approaches are implemented to demonstrate how the framework enables their operation and comparison. The energy and latency overheads introduced by the framework have been quantified and an open-source implementation has been released.
Davis JJ, Levine J, Stott E, et al., 2018, KOCL: Kernel-level Power Estimation for Arbitrary FPGA-SoC-accelerated OpenCL Applications, International Workshop on OpenCL (IWOCL) 2018, Publisher: ACM, Pages: 4:1-4:1
This work presents KOCL, a fully automated tool flow and accompanying software, accessible through a minimalist API, allowing OpenCL developers targetting FPGA-SoC devices to obtain kernel-level power estimates for their applications via function calls in their host code. KOCL is open-source, available with example applications at https://github.com/PRiME-project/KOCL. In order to maximise accessibility, KOCL necessitates no user exposure to hardware whatsoever.
Davis JJ, Hung E, Levine JM, et al., 2018, KAPow: high-accuracy, low-overhead online per-module power estimation for FPGA designs, ACM Transactions on Reconfigurable Technology and Systems, Vol: 11, Pages: 2:1-2:22, ISSN: 1936-7406
In an FPGA system-on-chip design, it is often insufficient to merely assess the power consumption of the entire circuit by compile-time estimation or runtime power measurement. Instead, to make better decisions, one must understand the power consumed by each module in the system. In this work, we combine measurements of register-level switching activity and system-level power to build an adaptive online model that produces live breakdowns of power consumption within the design. Online model refinement avoids time-consuming characterisation while also allowing the model to track long-term operating condition changes. Central to our method is an automated flow that selects signals predicted to be indicative of high power consumption, instrumenting them for monitoring. We named this technique KAPow, for 'K'ounting Activity for Power estimation, which we show to be accurate and to have low overheads across a range of representative benchmarks. We also propose a strategy allowing for the identification and subsequent elimination of counters found to be of low significance at runtime, reducing algorithmic complexity without sacrificing significant accuracy. Finally, we demonstrate an application example in which a module-level power breakdown can be used to determine an efficient mapping of tasks to modules and reduce system-wide power consumption by up to 7%.
Li H, Davis JJ, Wickerson JP, et al., 2018, ARCHITECT: Arbitrary-precision Constant-hardware Iterative Compute, International Conference on Field-programmable Technology (FPT) 2017, Publisher: IEEE, Pages: 73-79
Many algorithms feature an iterative loop that converges to the result of interest. The numerical operations in such algorithms are generally implemented using finite-precision arithmetic, either fixed or floating point, most of which operate least-significant digit first. This results in a fundamental problem: if, after some time, the result has not converged, is this because we have not run the algorithm for enough iterations or because the arithmetic in some iterations was insufficiently precise? There is no easy way to answer this question, so users will often over-budget precision in the hope that the answer will always be to run for a few more iterations. We propose a fundamentally new approach: armed with the appropriate arithmetic able to generate results from most-significant digit first, we show that fixed compute-area hardware can be used to calculate an arbitrary number of algorithmic iterations to arbitrary precision, with both precision and iteration index increasing in lockstep. Thus, datapaths constructed following our principles demonstrate efficiency over their traditional arithmetic equivalents where the latter’s precisions are either under- or over-budgeted for the computation of a result to a particular accuracy. For the execution of 100 iterations of the Jacobi method, we obtain a 1.60x increase in frequency and 15.7x LUT and 50.2x flip-flop reductions over a 2048-bit parallel-in, serial-out traditional arithmetic equivalent, along with 46.2x LUT and 83.3x flip-flop decreases versus the state-of-the-art online arithmetic implementation.
Davis J, Levine J, Stott E, et al., 2017, STRIPE: Signal Selection for Runtime Power Estimation, International Confererence on Field-programmable Logic and Applications (FPL) 2017, Publisher: IEEE
Knowledge of power consumption at a subsystem level can facilitate adaptive energy-saving techniques such as power gating, runtime task mapping and dynamic voltage and/or frequency scaling. While we have the ability to attribute power to an arbitrary hardware system's modules in real time, the selection of the particular signals to monitor for the purpose of power estimation within any given module has yet to be treated as a primary concern. In this paper, we show how the automatic analysis of circuit structure and behaviour inferred through vectored simulation can be used to produce high-quality rankings of signals' importance, with the resulting selections able to achieve lower power estimation error than those of prior work coupled with decreases in area, power and modelling complexity. In particular, by monitoring just eight signals per module (~0.3% of the total) across the 15 we examined, we demonstrate how to achieve runtime module-level estimation errors 1.5--6.9x lower than when reliant on the signal selections made in accordance with a more straightforward, previously published metric.
Davis JJ, Levine JM, Stott EA, et al., 2017, KOCL: Power Self-awareness for Arbitrary FPGA-SoC-accelerated OpenCL Applications, IEEE Design and Test, Vol: 34, Pages: 36-45, ISSN: 2168-2356
Given the need for developers to rapidly produce complex, high-performance and energy-efficient hardware systems, methods facilitating their intelligent runtime management are of ever-increasing importance. For energy optimization, such control decisions require knowledge of power usage at subsystem granularity. This information must be made accessible to developers now accustomed to creating systems from high-level descriptions, such as those written in OpenCL. To address these challenges, we introduce KOCL, a tool allowing OpenCL developers targeting FPGA-SoC devices to query live kernel-level power consumption using function calls embedded in their host code. KOCL is open-source, available online at https://github.com/PRiME-project/KOCL. To maximize accessibility, its use necessitates zero exposure to hardware.
Xia F, Rafiev A, Aalsaud A, et al., 2017, Voltage, Throughput, Power, Reliability, and Multicore Scaling, Computer, Vol: 50, Pages: 34-45, ISSN: 0018-9162
This article studies the interplay between the performance, energy, and reliability (PER) of parallel-computing systems. It describes methods supporting the meaningful crossplatform analysis of this interplay. These methods lead to the PER software tool, which helps designers analyze, compare, and explore these properties.
Hung E, Davis JJ, Levine JM, et al., 2016, KAPow: A System Identification Approach to Online Per-Module Power Estimation in FPGA Designs, IEEE Symposium on Field-programmable Custom Computing Machines (FCCM) 2016, Publisher: IEEE, Pages: 56-63
In a modern FPGA system-on-chip design, it is often insufficient to simply assess the total power consumption of the entire circuit by design-time estimation or runtime power rail measurement. Instead, to make better runtime decisions, it is desirable to understand the power consumed by each individual module in the system. In this work, we combine board-level power measurements with register-level activity counting to build an online model that produces a breakdown of power consumption within the design. Online model refinement avoids the need for a time-consuming characterisation stage and also allows the model to track long-term changes to operating conditions. Our flow is named KAPow, a (loose) acronym for 'K'ounting Activity for Power estimation, which we show to be accurate, with per-module power estimates as close to +/-5mW of true measurements, and to have low overheads. We also demonstrate an application example in which a per-module power breakdown can be used to determine an efficient mapping of tasks to modules and reduce system-wide power consumption by over 8%.
Davis JJ, Cheung PYK, 2016, Reduced-precision Algorithm-based Fault Tolerance for FPGA-implemented Accelerators, International Symposium on Applied Reconfigurable Computing (ARC) 2016, Publisher: Springer, Pages: 361-368, ISSN: 0302-9743
As the threat of fault susceptibility caused by mechanisms including variation and degradation increases, engineers must give growing consideration to error detection and correction. While the use of common fault tolerance strategies frequently causes the incursion of significant overheads in area, performance and/or power consumption, options exist that buck these trends. In particular, algorithm-based fault tolerance embodies a proven family of low-overhead error mitigation techniques able to be built upon to create self-verifying circuitry. In this paper, we present our research into the application of algorithm-based fault tolerance (ABFT) in FPGA-implemented accelerators at reduced levels of precision. This allows for the introduction of a previously unexplored tradeoff: sacrificing the observability of faults associated with low-magnitude errors for gains in area, performance and efficiency by reducing the bit-widths of logic used for error detection. We describe the implementation of a novel checksum truncation technique, analysing its effects upon overheads and allowed error. Our findings include that bit-width reduction of ABFT circuitry within a fault-tolerant accelerator used for multiplying pairs of 32 x 32 matrices resulted in the reduction of incurred area overhead by 16.7% and recovery of 8.27% of timing model Fmax. These came at the cost of introducing average and maximum absolute output errors of 0.430% and 0.927%, respectively, of the maximum absolute output value under transient fault injection.
Davis JJ, Hung E, Levine J, et al., 2016, Knowledge is Power: Module-level Sensing for Runtime Optimisation, ACM/SIGDA International Symposium on Field-programmable Gate Arrays (FPGA) 2016, Publisher: ACM, Pages: 276-276
We propose the compile-time instrumentation of coexisting modules---IP blocks, accelerators, etc.---implemented in FPGAs. The efficient mapping of tasks to execution units can then be achieved, for power and/or timing performance, by tracking dynamic power consumption and/or timing slack online at module-level granularity. Our proposed instrumentation is transparent, thereby not affecting circuit functionality. Power and timing overheads have proven to be small and tend to be outweighed by the exposed runtime benefits.
Yang S, Shafik R, Merrett G, et al., 2015, Adaptive Energy Minimization of Embedded Heterogeneous Systems using Regression-based Learning, International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), Publisher: IEEE, Pages: 103-110
Modern embedded systems consist of heterogeneous computing resources with diverse energy and performance trade-offs. This is because these resources exercise the application tasks differently, generating varying workloads and energy consumption. As a result, minimizing energy consumption in these systems is challenging as continuous adaptation between application task mapping (i.e. allocating tasks among the computing resources) and dynamic voltage/frequency scaling (DVFS) is required. Existing approaches have limitations due to lack of such adaptation with practical validation (Table I). This paper addresses such limitation and proposes a novel adaptive energy minimization approach for embedded heterogeneous systems. Fundamental to this approach is a runtime model, generated through regression-based learning of energy/performance trade-offs between different computing resources in the system. Using this model, an application task is suitably mapped on a computing resource during runtime, ensuring minimum energy consumption for a given application performance requirement. Such mapping is also coupled with a DVFS control to adapt to performance and workload variations. The proposed approach is designed, engineered and validated on a Zynq-ZC702 platform, consisting of CPU, DSP and FPGA cores. Using several image processing applications as case studies, it was demonstrated that our proposed approach can achieve significant energy savings (>70%), when compared to the existing approaches.
Davis J, Cheung PYK, 2014, Achieving Low-overhead Fault Tolerance for Parallel Accelerators with Dynamic Partial Reconfiguration, International Conference on Field-programmable Logic and Applications (FPL) 2014, Publisher: IEEE, Pages: 1-6, ISSN: 1946-147X
While allowing for the fabrication of increasingly complex and efficient circuitry, transistor shrinkage and count-per-device expansion have major downsides: chiefly increased variation, degradation and fault susceptibility. For this reason, design-time consideration of fault tolerance will have to be given to increasing numbers of electronic systems in the future to ensure yields, reliabilities and lifetimes remain acceptably high. Many commonly implemented operators are suited to modification resulting in datapath error detection capabilities with low area overheads. FPGAs are uniquely placed to allow further area savings to be made when incorporating fault avoidance mechanisms thanks to their dynamic reconfigurability. In this paper, we examine the practicalities and costs involved in implementing hardware-software fault tolerance on a test platform: a parallel matrix multiplication accelerator in hardware, with controller in software, running on a Xilinx Zynq system-on-chip. A combination of `bolt-on' error detection logic and software-triggered routing reconfiguration serve to provide low-overhead datapath fault tolerance at runtime. Rapid yet accurate fault diagnoses along with low hardware (area), software (configuration storage) and performance penalties are achieved.
Davis J, Cheung PYK, 2014, Reducing Overheads for Fault-tolerant Datapaths with Dynamic Partial Reconfiguration, IEEE Symposium on Field-programmable Custom Computing Machines (FCCM) 2014, Publisher: IEEE, Pages: 103-103
As process scaling and transistor count inflation continue, silicon chips are becoming increasingly susceptible to faults. Although FPGAs are particularly vulnerable to these effects, their runtime reconfigurability offers unique opportunities for fault tolerance. This work presents an application combining algorithmic-level error detection with dynamic partial reconfiguration (DPR) to allow faults manifested within its datapath at runtime to be circumvented at low cost.
Davis JJ, Cheung PYK, 2014, Datapath Fault Tolerance for Parallel Accelerators, International Conference on Field-programmable Technology (FPT) 2013, Publisher: IEEE, Pages: 366-369
While we reap the benefits of process scaling in terms of transistor density and switching speed, consideration must be given to the negative effects it causes: increased variation, degradation and fault susceptibility. Above device level, such phenomena and the faults they induce can lead to reduced yield, decreased system reliability and, in extreme cases, total failure after a period of successful operation. Although error detection and correction are almost always considered for highly sensitive and susceptible applications such as those in space, for other, more general-purpose applications they are often overlooked. In this paper, we present a parallel matrix multiplication accelerator running in hardware on the Xilinx Zynq system-on-chip platform, along with 'bolt-on' logic for detecting, locating and avoiding faults within its datapath. Designs of various sizes are compared with respect to resource overhead and performance impact. Our largest-implemented fault-tolerant accelerator was found to consume 17.3% more area, run at a 3.95% lower frequency and incur an 18.8% execution time penalty over its equivalent fault-susceptible design during fault-free operation.
This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.