Zhao R, Todman T, Luk W, et al., 2017, DeepPump: Multi-pumping deep Neural Networks, 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP), Publisher: IEEE, Pages: 206-206, ISSN: 1063-6862
This paper presents DeepPump, an approach that generates CNN hardware designs with multi-pumping, which have competitive performance when compared with previous designs. Future work includes integrating DeepPump with other optimisations, and providing further evaluations on various FPGA platforms.
This paper presents an approach to enhance the performance of machine learning applications based on hardware acceleration. This approach is based on parameterised architectures designed for Convolutional Neural Network (CNN) and Support Vector Machine (SVM), and the associated design flow common to both. This approach is illustrated by two case studies including object detection and satellite data analysis. The potential of the proposed approach is presented.
Gan L, Fu H, Luk W, et al., 2017, Solving Mesoscale Atmospheric Dynamics Using a Reconfigurable Dataflow Architecture, IEEE Micro, Vol: 37, Pages: 40-50, ISSN: 0272-1732
Yan J, Yuan J, Leong PHW, et al., 2017, Lossless Compression Decoders for Bitstreams and Software Binaries Based on High-Level Synthesis, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol: 25, Pages: 2842-2855, ISSN: 1063-8210
As the density of field-programmable gate arrays continues to increase, the size of configuration bitstreams grows accordingly. Compression techniques can reduce memory size and save external memory bandwidth. To accelerate the configuration process and reduce software startup time, four open-source lossless compression decoders developed using high-level synthesis techniques are presented. Moreover, in order to balance the objectives of compression ratio, decompression throughput, and hardware resource overhead, various improvements and optimizations are proposed. Full bitstreams and software binaries have been collected as a benchmark, and 33 partial bitstreams have also been developed and integrated into the benchmark. Evaluations of the synthesizable compression decoders are demonstrated on a Xilinx ZC706 board, showing higher decompression throughput on our benchmark than existing lossless compression decoders. The proposed decoders can reduce software startup time by up to 31.23% in embedded systems, and reconfiguration time by up to 69.83% in partially reconfigurable systems.
Chau TCP, Burovskiy P, Flynn MJ, et al., 2017, Chapter Two - Advances in Dataflow Systems, Advances in Computers, Vol: 105, Pages: 21-62
He C, Fu H, Guo C, et al., 2017, A Fully-Pipelined Hardware Design for Gaussian Mixture Models, IEEE Transactions on Computers, Vol: 66, Pages: 1837-1850, ISSN: 0018-9340
Gaussian Mixture Models (GMMs) are widely used in many applications such as data mining, signal processing and computer vision, for probability density modeling and soft clustering. However, the parameters of a GMM need to be estimated from data by, for example, the Expectation-Maximization algorithm for Gaussian Mixture Models (EM-GMM), which is computationally demanding. This paper presents a novel design for the EM-GMM algorithm targeting reconfigurable platforms, with five main contributions. First, a pipeline-friendly EM-GMM with diagonal covariance matrices that can easily be mapped to hardware architectures. Second, a function evaluation unit for Gaussian probability density based on fixed-point arithmetic. Third, our approach is extended to support a wide range of dimensions or/and components by fitting multiple pieces of smaller dimensions onto an FPGA chip. Fourth, we derive a cost and performance model that estimates logic resources. Fifth, our dataflow design targeting the Maxeler MPCX2000 with a Stratix-5SGSD8 FPGA can run over 200 times faster than a 6-core Xeon E5645 processor, and over 39 times faster than a Pascal TITAN-X GPU. Our design provides a practical solution for training GMMs with hundreds of millions of high-dimensional input instances and for exploring better parameters, serving low-latency and high-performance applications.
Funie AI, Grigoras P, Burovskiy P, et al., 2017, Run-time reconfigurable acceleration for genetic programming fitness evaluation in trading strategies, Journal of Signal Processing Systems, Vol: 90, Pages: 39-52, ISSN: 1939-8018
Genetic programming can be used to identify complex patterns in financial markets which may lead to more advanced trading strategies. However, the computationally intensive nature of genetic programming makes it difficult to apply to real world problems, particularly in real-time constrained scenarios. In this work we propose the use of Field Programmable Gate Array technology to accelerate the fitness evaluation step, one of the most computationally demanding operations in genetic programming. We propose to develop a fully-pipelined, mixed precision design using run-time reconfiguration to accelerate fitness evaluation. We show that run-time reconfiguration can reduce resource consumption by a factor of 2 compared to previous solutions on certain configurations. The proposed design is up to 22 times faster than an optimised, multithreaded software implementation while achieving comparable financial returns.
Burovskiy P, Grigoras P, Sherwin S, et al., 2017, Efficient Assembly for High-Order Unstructured FEM Meshes (FPL 2015), ACM Transactions on Reconfigurable Technology and Systems, Vol: 10, ISSN: 1936-7406
Zhao R, Niu X, Wu Y, et al., 2017, Optimizing CNN-based object detection algorithms on embedded FPGA platforms, 13th International Symposium, ARC 2017, Publisher: Springer, Pages: 255-267, ISSN: 0302-9743
Algorithms based on Convolutional Neural Network (CNN) have recently been applied to object detection applications, greatly improving their performance. However, many devices intended for these algorithms have limited computation resources and strict power consumption constraints, and are not suitable for algorithms designed for GPU workstations. This paper presents a novel method to optimise CNN-based object detection algorithms targeting embedded FPGA platforms. Given parameterised CNN hardware modules, an optimisation flow takes network architectures and resource constraints as input, and tunes hardware parameters with algorithm-specific information to explore the design space and achieve high performance. The evaluation shows that our design model accuracy is above 85% and, with optimised configuration, our design can achieve 49.6 times speed-up compared with a software implementation.
A trading strategy is generally optimised for a given market regime. If it takes too long to switch from one trading strategy to another, then a sub-optimal trading strategy may be adopted. This paper proposes the first FPGA-based framework which supports multiple trend-following trading strategies to obtain accurate market characterisation for various financial market regimes. The framework contains a trading strategy kernel library covering a number of well-known trend-following strategies, such as “triple moving average”. Three types of design are targeted: a static reconfiguration trading strategy (SRTS), a full reconfiguration trading strategy (FRTS), and a partial reconfiguration trading strategy (PRTS). Our approach is evaluated using both synthetic and historical market data. Compared to a fully optimised CPU implementation, the SRTS design achieves 11 times speedup, the FRTS design achieves 2 times speedup, while the PRTS design achieves 7 times speedup. The FRTS and PRTS designs also reduce the amount of resources used on chip by 29% and 15% respectively, when compared to the SRTS design.
Grigoras P, Burovskiy P, Arram J, et al., 2017, Dfesnippets: An open-source library for dataflow acceleration on FPGAs, 13th International Symposium, ARC 2017, Publisher: Springer, Pages: 299-310, ISSN: 0302-9743
Highly-tuned FPGA implementations can achieve significant performance and power efficiency gains over general purpose hardware. However, limited development productivity has prevented mainstream adoption of FPGAs in many areas such as High Performance Computing. High-level standard development libraries are increasingly adopted to improve productivity. We propose an approach for performance-critical applications including standard library modules, benchmarking facilities and application benchmarks to support a variety of use cases. We implement the proposed approach as an open-source library for a commercially available FPGA system and highlight applications and productivity gains.
Fu H, He C, Luk W, et al., A nanosecond-level hybrid table design for financial market data generators, The 25th IEEE International Symposium on Field-Programmable Custom Computing Machines, Publisher: IEEE
This paper proposes a hybrid sorted table design for minimizing electronic trading latency, with three main contributions. First, a hierarchical sorted table with two levels: a fast cache table in reconfigurable hardware storing megabytes of data items, and a master table in software storing gigabytes of data items. Second, a full set of operations, including insertion, deletion, selection and sorting, for the hybrid table with latency of a few cycles. Third, an on-demand synchronization scheme between the cache table and the master table. An implementation has been developed that targets an FPGA-based network card in the environment of the China Financial Futures Exchange (CFFEX), which sustains 1-10 Gb/s bandwidth with latency of 400 to 700 nanoseconds, providing an 80- to 125-fold latency reduction compared to a fully optimized CPU-based solution, and a 2.2-fold reduction over an existing FPGA-based solution.
Leong PHW, Amano H, Anderson J, et al., 2017, The First 25 Years of the FPL Conference: Significant Papers, ACM Transactions on Reconfigurable Technology and Systems, Vol: 10, ISSN: 1936-7406
A summary of contributions made by significant papers from the first 25 years of the Field-Programmable Logic and Applications conference (FPL) is presented. The 27 papers chosen represent those which have most strongly influenced theory and practice in the field.
Li T, Heinis T, Luk W, 2017, ADvaNCE - Efficient and Scalable Approximate Density-Based Clustering Based on Hashing, Informatica, Lith. Acad. Sci., Vol: 28, Pages: 105-130
Fu H, He C, Ruan H, et al., 2017, Accelerating Financial Market Server through Hybrid List Design, ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Publisher: ACM, Pages: 289-290
Gan L, Fu H, Mencer O, et al., 2017, Data Flow Computing in Geoscience Applications, Creativity in Computing and Dataflow Supercomputing, Editors: Hurson, Milutinovic, Publisher: Elsevier Academic Press, Pages: 125-158, ISBN: 978-0-12-811955-6
Chau T, Burovskiy P, Flynn M, et al., 2017, Advances in Dataflow Systems, Advances in Computers, Vol: 106, Editors: Hurson, Milutinovic, Publisher: Elsevier Academic Press, Pages: 21-62, ISBN: 978-0-12-812230-3
Todman T, Luk W, 2017, In-Circuit Assertions and Exceptions for Reconfigurable Hardware Design, Provably Correct Systems, Editors: Hinchey, Bowen, Olderog, Publisher: Springer, Pages: 265-281, ISBN: 978-3-319-48627-7
Zhao W, Fu H, Luk W, et al., 2016, F-CNN: An FPGA-based Framework for Training Convolutional Neural Networks, 27th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), Publisher: IEEE, Pages: 107-114, ISSN: 2160-052X
This paper presents a novel reconfigurable framework for training Convolutional Neural Networks (CNNs). The proposed framework is based on reconfiguring a streaming datapath at runtime to cover the training cycle for the various layers in a CNN. The streaming datapath can support various parameterized modules which can be customized to produce implementations with different trade-offs in performance and resource usage. The modules follow the same input and output data layout, simplifying configuration scheduling. For different layers, instances of the modules contain different computation kernels in parallel, which can be customized with different layer configurations and data precision. The associated models on performance, resource and bandwidth can be used in deriving parameters for the datapath to guide the analysis of design trade-offs to meet application requirements or platform constraints. They enable estimation of the implementation specifications given different layer configurations, to maximize performance under the constraints on bandwidth and hardware resources. Experimental results indicate that the proposed module design targeting Maxeler technology can achieve a performance of 62.06 GFLOPS for 32-bit floating-point arithmetic, outperforming existing accelerators. Further evaluation based on training LeNet-5 shows that the proposed framework is about 4 times faster than the CPU implementation of Caffe and about 7.5 times more energy efficient than the GPU implementation of Caffe.
Yu T, Feng B, Stillwell M, et al., 2016, Relation-oriented resource allocation for multi-accelerator systems, 27th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), Publisher: IEEE, Pages: 243-244, ISSN: 2160-052X
This paper presents a novel approach for allocating resources in systems with multiple accelerators. It has three main contributions. First, a new model based on Birkhoff's representation theory for capturing the ordering properties of resource allocation requests (RARs). Second, an effective technique for resource allocation based on this model, targeting systems with multiple accelerators. Third, the evaluation of the proposed approach for Maxeler MPC-X multi-accelerator systems, demonstrating time-efficiency and a 30%-50% failure-rate decrease (FRD) on random input datasets.
Lindsey B, Leslie M, Luk W, 2016, A domain specific language for accelerated multilevel Monte Carlo simulations, 27th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), Publisher: IEEE, Pages: 99-106, ISSN: 1063-6862
Monte Carlo simulations are used to tackle a wide range of exciting and complex problems, such as option pricing and biophotonic modelling. Since Monte Carlo simulations are both computationally expensive and highly parallelizable, they are ideally suited for acceleration through GPUs and FPGAs. Alongside these accelerators, Multilevel Monte Carlo techniques can be harnessed to further hasten simulations. However, researchers and application developers must invest a great deal of effort to design, optimise and test such Monte Carlo simulations. Furthermore, these models often have to be rewritten from scratch to target new hardware accelerators. This paper presents Neb, a Domain Specific Language for describing and generating Multilevel Monte Carlo simulations for a variety of hardware architectures. Neb compiles equations written in LaTeX to C++, OpenCL or Maxeler's MaxJ language, allowing acceleration through GPUs or FPGAs. Neb can be used to solve stochastic equations or to generate paths for analysis with other tools. To evaluate the performance of Neb, a variety of financial models are executed on CPUs, GPUs and FPGAs, demonstrating peak acceleration of 3.7 times with FPGAs in 40nm transistor technology, and 14.4 times with GPUs in 28nm transistor technology. Furthermore, the energy efficiency of these accelerators is compared, revealing FPGAs to be 8.73 times and GPUs 2.52 times more efficient than CPUs.
Gan L, Fu H, Mencer O, et al., 2016, Chapter Four - Data Flow Computing in Geoscience Applications, Advances in Computers, Vol: 104, Pages: 125-158
Hung E, Todman T, Luk W, 2016, Transparent In-Circuit Assertions for FPGAs, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol: 36, Pages: 1193-1202, ISSN: 0278-0070
Commonly used in software design, assertions are statements placed into a design to ensure that its behaviour matches that expected by a designer. Although assertions apply equally to hardware design, they are typically supported only for logic simulation, and discarded prior to physical implementation. We propose a new HDL-agnostic language for describing latency-insensitive assertions and novel methods to add such assertions transparently to an already placed-and-routed circuit without affecting the existing design. We also describe how this language and associated methods can be used to implement semi-transparent exception handling. The key to our work is that by treating hardware assertions and exceptions as being oblivious or less sensitive to latency, assertion logic need only use spare FPGA resources. We use network-flow techniques to route necessary signals to assertions via spare flip-flops, eliminating any performance degradation, even on large designs (92% of slices in one test). Experimental evaluation shows zero impact on critical-path delay, even on large benchmarks operating above 200MHz, at the cost of a small power penalty.
Grigoras P, Burovskiy P, Luk W, et al., 2016, Optimising Sparse Matrix Vector multiplication for large scale FEM problems on FPGA, 2016 26th International Conference on Field Programmable Logic and Applications (FPL), ISSN: 1946-1488
Sparse Matrix Vector multiplication (SpMV) is an important kernel in many scientific applications. In this work we propose an architecture and an automated customisation method to detect and optimise the architecture for block diagonal sparse matrices. We evaluate the proposed approach in the context of the spectral/hp Finite Element Method, using the local matrix assembly approach. This problem leads to a large sparse system of linear equations with block diagonal matrix which is typically solved using an iterative method such as the Preconditioned Conjugate Gradient. The efficiency of the proposed architecture combined with the effectiveness of the proposed customisation method reduces BRAM resource utilisation by as much as 10 times, while achieving throughput identical to existing state-of-the-art designs and requiring minimal development effort from the end user. In the context of the Finite Element Method, our approach enables the solution of larger problems than previously possible, extending the applicability of FPGAs to more interesting HPC problems.
Kurek M, Deisenroth MP, Luk W, et al., 2016, Knowledge Transfer in Automatic Optimisation of Reconfigurable Designs, 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Publisher: IEEE
This paper presents a novel approach for automatic optimisation of reconfigurable design parameters based on knowledge transfer. The key idea is to make use of insights derived from optimising related designs to benefit future optimisations. We show how to use designs targeting one device to speed up optimisation of another device. The proposed approach is evaluated based on various applications including computational finance and seismic imaging. It is capable of achieving up to 35% reduction in optimisation time in producing designs with similar performance, compared to alternative optimisation methods.
Stroobandt D, Varbanescu AL, Ciobanu CB, et al., 2016, EXTRA: Towards the exploitation of eXascale technology for reconfigurable architectures, ReCoSoC 2016, Publisher: IEEE
To handle the stringent performance requirements of future exascale-class applications, High Performance Computing (HPC) systems need ultra-efficient heterogeneous compute nodes. To reduce power and increase performance, such compute nodes will require hardware accelerators with a high degree of specialization. Ideally, dynamic reconfiguration will be an intrinsic feature, so that specific HPC application features can be optimally accelerated, even if they regularly change over time. In the EXTRA project, we create a new and flexible exploration platform for developing reconfigurable architectures, design tools and HPC applications with run-time reconfiguration built-in as a core fundamental feature instead of an add-on. EXTRA covers the entire stack from architecture up to the application, focusing on the fundamental building blocks for run-time reconfigurable exascale HPC systems: new chip architectures with very low reconfiguration overhead, new tools that truly take reconfiguration as a central design concept, and applications that are tuned to maximally benefit from the proposed run-time reconfiguration techniques. Ultimately, this open platform will improve Europe's competitive advantage and leadership in the field.
Zhou H, Niu X, Yuan J, et al., Connect on the fly: enhancing and prototyping of cycle-reconfigurable modules, 26th International Conference on Field-Programmable Logic and Applications, Publisher: IEEE
This paper introduces cycle-reconfigurable modules that enhance FPGA architectures with efficient support for dynamic data accesses: data accesses with accessed data size and location known only at runtime. The proposed module adopts new reconfiguration strategies based on dynamic FIFOs, dynamic caches, and dynamic shared memories to significantly reduce configuration generation and routing complexity. We develop a prototype FPGA chip with the proposed cycle-reconfigurable module in the SMIC 130-nm technology. The integrated module takes less than the chip area of 39 CLBs, and reconfigures thousands of runtime connections in 1.2 ns. Applications for large-scale sorting, sparse matrix-vector multiplication, and Memcached are developed. The proposed modules enable 1.4 and 11 times reductions in area-delay product compared with those applications mapped to previous architectures and conventional FPGAs.
Niu X, Ng C, Yumi T, et al., EURECA Compilation: Automatic Optimisation of Cycle-Reconfigurable Circuits, 26th International Conference on Field-Programmable Logic and Applications, Publisher: IEEE
EURECA architectures have been proposed as an enhancement to existing FPGAs, to enable cycle-by-cycle reconfiguration. Applications with irregular data accesses, which previously could not be efficiently supported in hardware, can be efficiently mapped onto EURECA architectures. One major challenge in applying EURECA architectures to practical applications is the intensive design effort required to analyse and optimise cycle-reconfigurable operations, in order to obtain accurate and high-performance results while the underlying circuits reconfigure cycle by cycle. In this work, we propose compiler support for EURECA-based designs. The compiler support adopts techniques based on session types to automatically derive a runtime reconfiguration scheduler that guarantees design correctness, and a streaming circuit model to ensure high-performance circuits. Three benchmark applications (large-scale sorting, Memcached, and SpMV) developed with the proposed compiler support show up to 11.2 times (21.8 times when the architecture scales) reduction in area-delay product when compared with conventional architectures, and achieve up to 39% improvements compared with manually optimised EURECA designs.
Thomas DB, Inggs G, Luk W, 2016, A domain specific approach to high performance heterogeneous computing, IEEE Transactions on Parallel and Distributed Systems, Vol: 28, Pages: 2-15, ISSN: 1045-9219
Users of heterogeneous computing systems face two problems: firstly, understanding the trade-off relationships between the observable characteristics of their applications, such as latency and quality of the result; and secondly, how to exploit knowledge of these characteristics to allocate work to distributed computing platforms efficiently. A domain specific approach addresses both of these problems. By considering a subset of operations or functions, models of the observable characteristics or domain metrics may be formulated in advance, and populated at run-time for task instances. These metric models can then be used to express the allocation of work as a constrained integer program. These claims are illustrated using the domain of derivatives pricing in computational finance, with the domain metrics of workload latency and pricing accuracy. For a large, varied workload of 128 Black-Scholes and Heston model-based option pricing tasks, running upon a diverse array of 16 multicore CPU, GPU and FPGA platforms, predictions made by models of both the makespan and accuracy are generally within 10% of the run-time performance. When these models are used as inputs to machine learning and MILP-based workload allocation approaches, a latency improvement of up to 24 and 270 times over the heuristic approach is seen.
Hmid SN, Coutinho JGF, Luk W, 2016, A Transfer-Aware Runtime System for Heterogeneous Asynchronous Parallel Execution, ACM SIGARCH Computer Architecture News, Vol: 43, Pages: 40-45, ISSN: 0163-5964
This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.