Luk W, 2015, Analysing reconfigurable computing systems, Transforming Reconfigurable Systems: A Festschrift Celebrating the 60th Birthday of Professor Peter Cheung, Pages: 101-116, ISBN: 9781783266968
© 2015 by Imperial College Press. All rights reserved. The distinguishing feature of a reconfigurable computing system is that the function and the interconnection of its processing elements can be changed, in some cases during run-time. However, reconfigurability is a double-edged sword: it only produces attractive results if used judiciously, since there are various overheads associated with exploiting reconfigurability in computing systems. This chapter introduces a simple approach for analysing the performance, resource usage and energy consumption of reconfigurable computing systems, and explains how it can be used in analysing some recent advances in design techniques for various applications that produce run-time reconfigurable implementations. Directions for future development of this approach are also explored.
Inggs G, Thomas DB, Luk W, 2015, An Efficient, Automatic Approach to High Performance Heterogeneous Computing., CoRR, Vol: abs/1505.04417
Inggs G, Thomas DB, Constantinides GA, et al., 2015, Seeing Shapes in Clouds: On the Performance-Cost trade-off for Heterogeneous Infrastructure-as-a-Service.
Chau TCP, Niu X, Eele A, et al., 2014, Mapping Adaptive Particle Filters to Heterogeneous Reconfigurable Systems, ACM Transactions on Reconfigurable Technology and Systems, Vol: 7, ISSN: 1936-7414
This article presents an approach for mapping real-time applications based on particle filters (PFs) to heterogeneous reconfigurable systems, which typically consist of multiple FPGAs and CPUs. A method is proposed to adapt the number of particles dynamically and to utilise runtime reconfigurability of FPGAs for reduced power and energy consumption. A data compression scheme is employed to reduce communication overhead between FPGAs and CPUs. A mobile robot localisation and tracking application is developed to illustrate our approach. Experimental results show that the proposed adaptive PF can reduce up to 99% of computation time. Using runtime reconfiguration, we achieve a 25% to 34% reduction in idle power. A 1U system with four FPGAs is up to 169 times faster than a single-core CPU and 41 times faster than a 1U CPU server with 12 cores. It is also estimated to be 3 times faster than a system with four GPUs.
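The run-time adaptation of particle count can be illustrated with a minimal sketch. The snippet below is not the authors' design: it uses effective sample size (ESS) as the adaptation signal and systematic resampling, both standard choices, purely to show the shape of an adaptive PF step; the function name, thresholds, and bounds are illustrative assumptions.

```python
import random
import math

def adaptive_pf_step(particles, weights, observation, likelihood,
                     n_min=50, n_max=500):
    """One reweight-and-resample step with an adaptive particle count.
    Illustrative sketch only; not the architecture from the article."""
    # Reweight particles by the likelihood of the new observation.
    w = [wi * likelihood(p, observation) for p, wi in zip(particles, weights)]
    total = sum(w) or 1e-300
    w = [wi / total for wi in w]

    # Effective sample size (ESS) measures weight degeneracy.
    ess = 1.0 / sum(wi * wi for wi in w)
    ratio = ess / len(particles)

    # Adapt the particle count: grow when degenerate, shrink when healthy.
    if ratio < 0.5:
        n_new = min(n_max, 2 * len(particles))
    elif ratio > 0.9:
        n_new = max(n_min, len(particles) // 2)
    else:
        n_new = len(particles)

    # Systematic resampling to n_new equally weighted particles.
    cum, acc = [], 0.0
    for wi in w:
        acc += wi
        cum.append(acc)
    u = random.random() / n_new
    new_particles, j = [], 0
    for i in range(n_new):
        pos = u + i / n_new
        while j < len(cum) - 1 and cum[j] < pos:
            j += 1
        new_particles.append(particles[j])
    return new_particles, [1.0 / n_new] * n_new
```

In a hardware mapping such as the one described, shrinking the particle count is what creates the idle resources that runtime reconfiguration can then power down.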
Guo C, Luk W, 2014, Pipelined HAC Estimation Engines for Multivariate Time Series, JOURNAL OF SIGNAL PROCESSING SYSTEMS FOR SIGNAL IMAGE AND VIDEO TECHNOLOGY, Vol: 77, Pages: 117-129, ISSN: 1939-8018
Shan Y, Hao Y, Wang W, et al., 2014, Hardware Acceleration for an Accurate Stereo Vision System Using Mini-Census Adaptive Support Region, ACM TRANSACTIONS ON EMBEDDED COMPUTING SYSTEMS, Vol: 13, ISSN: 1539-9087
Guo C, Luk W, Vinkovskaya E, et al., 2014, Customisable pipelined engine for intensity evaluation in multivariate hawkes point processes, ACM SIGARCH Computer Architecture News, Vol: 41, Pages: 59-64, ISSN: 0163-5964
Guo L, Thomas DB, Luk W, 2014, Customisable architectures for the set covering problem, ACM SIGARCH Computer Architecture News, Vol: 41, Pages: 101-106, ISSN: 0163-5964
Chau TCP, Targett JS, Wijeyasinghe M, et al., 2014, Accelerating sequential Monte Carlo method for real-time air traffic management, Publisher: Association for Computing Machinery (ACM), Pages: 35-40, ISSN: 0163-5964
Niu X, Jin Q, Luk W, et al., 2014, A Self-Aware Tuning and Self-Aware Evaluation Method for Finite-Difference Applications in Reconfigurable Systems, ACM TRANSACTIONS ON RECONFIGURABLE TECHNOLOGY AND SYSTEMS, Vol: 7, ISSN: 1936-7406
Le Masle A, Luk W, 2014, Mapping Loop Structures onto Parametrized Hardware Pipelines, IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, Vol: 22, Pages: 631-640, ISSN: 1063-8210
Spada F, Scolari A, Durelli GC, et al., 2014, FPGA-based design using the FASTER toolchain: the case of STM Spear development board, 12th IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA), Publisher: IEEE, Pages: 134-141, ISSN: 2158-9178
Grigoras P, Tottenham M, Niu X, et al., 2014, Elastic Management of Reconfigurable Accelerators, 12th IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA), Publisher: IEEE, Pages: 174-181, ISSN: 2158-9178
Zhao W, Fu H, Yang G, et al., 2014, Patra: Parallel tree-reweighted message passing architecture
© 2014 Technical University of Munich (TUM). Maximum a posteriori probability inference algorithms for Markov Random Fields are widely used in many applications, such as computer vision and machine learning. Sequential tree-reweighted message passing (TRW-S) is an inference algorithm which shows good quality in finding optimal solutions. However, the performance of TRW-S in software cannot meet the requirements of many real-time applications, due to its sequential scheme and its high memory, bandwidth and computational costs. This paper proposes Patra, a novel parallel tree-reweighted message passing architecture, which involves a fully pipelined design targeting FPGA technology. We build a hybrid CPU/FPGA system to test the performance of Patra for stereo matching. Experimental results show that Patra is about 100 times faster than a software implementation of TRW-S and 12 times faster than a GPU-based message passing algorithm. Compared with an existing design using four FPGAs, we achieve a 2 times speedup on a single FPGA. Moreover, Patra can work at video rate in many cases, such as 167 frames/sec for a standard stereo matching test case, which makes it promising for many real-time applications.
Gan L, Fu H, Yang C, et al., 2014, A highly-efficient and green data flow engine for solving Euler atmospheric equations
© 2014 Technical University of Munich (TUM). Atmospheric modeling is an essential issue in the study of climate change. However, due to the complicated algorithmic and communication models, scientists and researchers are facing tough challenges in finding efficient solutions to solve the atmospheric equations. In this paper, we accelerate a solver for the three-dimensional Euler atmospheric equations through reconfigurable data flow engines. We first propose a hybrid design that achieves efficient resource allocation and data reuse. Furthermore, through algorithmic offsetting, fast memory table, and customizable-precision arithmetic, we map a complex Euler kernel into a single FPGA chip, which can perform 956 floating point operations per cycle. In a 1U-chassis, our CPU-DFE unit with 8 FPGA chips is 18.5 times faster and 8.3 times more power efficient than a multicore system based on two 12-core Intel E5-2697 (Ivy Bridge) CPUs, and is 6.2 times faster and 5.2 times more power efficient than a hybrid unit equipped with two 12-core Intel E5-2697 (Ivy Bridge) CPUs and three Intel Xeon Phi 5120d (MIC) cards.
Guo L, Thomas DBJ, Guo C, et al., 2014, Automated framework for FPGA-based parallel genetic algorithms, 24th International Conference on Field Programmable Logic and Applications, Publisher: IEEE
Parallel genetic algorithms (pGAs) are a variant of genetic algorithms which promise substantial gains in both efficiency of execution and quality of results. pGAs have attracted researchers to implement them in FPGAs, but such implementations typically require substantial manual effort. To simplify the implementation process and make hardware pGA designs accessible to potential non-expert users, this paper proposes a general-purpose framework, which takes in a high-level description of the optimisation target and automatically generates pGA designs for FPGAs. Our pGA system exploits the two levels of parallelism found in GA instances and genetic operations, allowing users to tailor the architecture for resource constraints at compile-time. The framework also enables users to tune a subset of parameters at run-time without time-consuming recompilation. Our pGA design is more flexible than previous ones, and has an average speedup of 26 times compared to multi-core counterparts over five combinatorial and numerical optimisation problems. When compared with a GPU, it also shows a 6.8 times speedup on a combinatorial application.
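The two levels of parallelism the framework exploits — across GA instances and within genetic operations — can be sketched in sequential software with an island-model GA. This is a plain-Python illustration under stated assumptions (bitstring genomes, ring migration, one-point crossover, 10% mutation), not the generated hardware architecture; all parameter names and values are illustrative.

```python
import random

def run_island_ga(fitness, genome_len, n_islands=4, pop_size=20,
                  generations=50, migrate_every=10, seed=0):
    """Island-model parallel GA sketch: each island evolves
    independently (instance-level parallelism in hardware); every
    `migrate_every` generations the best individual of each island
    migrates to its ring neighbour."""
    rng = random.Random(seed)
    islands = [[[rng.randint(0, 1) for _ in range(genome_len)]
                for _ in range(pop_size)] for _ in range(n_islands)]

    def evolve(pop):
        # Elitism: keep the best half, refill with crossover + mutation.
        elite = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, genome_len)   # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.1:               # bit-flip mutation
                i = rng.randrange(genome_len)
                child[i] ^= 1
            children.append(child)
        return elite + children

    for g in range(1, generations + 1):
        islands = [evolve(pop) for pop in islands]
        if g % migrate_every == 0:               # ring migration
            bests = [max(pop, key=fitness) for pop in islands]
            for i, pop in enumerate(islands):
                pop[-1] = bests[i - 1][:]
    return max((ind for pop in islands for ind in pop), key=fitness)
```

In the hardware framework, each island and each genetic operator would map to dedicated pipelines, which is where the reported speedups come from.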
Chow G, Grigoras P, Burovskiy PA, et al., 2014, An efficient sparse conjugate gradient solver using a Beneš permutation network, 24th International Conference on Field Programmable Logic and Applications, Publisher: IEEE
The conjugate gradient (CG) is one of the most widely used iterative methods for solving systems of linear equations. However, parallelizing CG for large sparse systems is difficult due to the inherent irregularity in memory access patterns. We propose a novel processor architecture for the sparse conjugate gradient method. The architecture consists of multiple processing elements and memory banks, and is able to compute efficiently both sparse matrix-vector multiplication and other dense vector operations. A Beneš permutation network with an optimised control scheme is introduced to reduce memory bank conflicts without expensive logic. We describe a heuristic for offline scheduling, the effect of which is captured in a parametric model for estimating the performance of designs generated from our approach.
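A software baseline makes the accelerated kernels concrete. The following is the standard textbook CG algorithm over a CSR-format sparse matrix in plain Python — not the proposed architecture — showing the sparse matrix-vector multiply and dense vector operations that the processing elements and Beneš network accelerate.

```python
def csr_matvec(vals, cols, row_ptr, x):
    """y = A @ x for a matrix stored in CSR (compressed sparse row)."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[cols[k]]
    return y

def conjugate_gradient(vals, cols, row_ptr, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite sparse A."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                  # residual b - A x, with x = 0 initially
    p = r[:]                  # search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = csr_matvec(vals, cols, row_ptr, p)   # the SpMV kernel
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        beta = rs_new / rs
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```

The irregular indexed gather `x[cols[k]]` in the SpMV is the source of the memory bank conflicts that the Beneš permutation network is introduced to resolve.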
Fidjeland AK, Luk W, Muggleton SH, 2014, Customisable multi-processor acceleration of inductive logic programming, Latest Advances in Inductive Logic Programming, Pages: 123-141, ISBN: 9781783265084
© 2015 Imperial College Press. All rights reserved. Parallel approaches to Inductive Logic Programming (ILP) are adopted to address the computational complexity in the learning process. Existing parallel ILP implementations build on conventional general-purpose processors. This chapter describes a different approach, by exploiting user-customisable parallelism available in advanced reconfigurable devices such as Field-Programmable Gate Arrays (FPGAs). Our customisable parallel architecture for ILP has three elements: a customisable logic programming processor, a multi-processor for parallel hypothesis evaluation, and an architecture generation framework for creating such multi-processors. Our approach offers a means of achieving high performance by producing parallel architectures adapted both to the problem domain and to specific problem instances. The coverage test in Progol 4.4 is performed up to 56 times faster using our multi-processor.
Burovskiy P, Girdlestone S, Davies C, et al., 2014, Dataflow acceleration of Krylov subspace sparse banded problems
© 2014 Technical University of Munich (TUM). Most of the efforts in the FPGA community related to sparse linear algebra focus on increasing the degree of internal parallelism in matrix-vector multiply kernels. We propose a parametrisable dataflow architecture presenting an alternative and complementary approach to support acceleration of banded sparse linear algebra problems which benefit from building a Krylov subspace. We use the banded structure of a matrix A to overlap the computations Ax, A^2 x, ..., A^k x by building a pipeline of matrix-vector multiplication processing elements (PEs), each performing A^i x. Due to on-chip data locality, the FLOPS rate sustainable by such a pipeline scales linearly with k. Our approach enables a trade-off between the number k of overlapped matrix power actions and the level of parallelism in a PE. We illustrate our approach for Google PageRank computation by power iteration for large banded single-precision sparse matrices. Our design scales up to 32 sequential PEs with floating-point accumulation and 80 PEs with fixed-point accumulation on a Stratix V D8 FPGA. With 80 single-pipe fixed-point PEs clocked at 160 MHz, our design sustains 12.7 GFLOPS.
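The overlapped matrix-power scheme can be sketched in software: each stage consumes the output of the previous one, mirroring (sequentially) the pipeline of PEs described above. The diagonal-based banded storage format here is an illustrative assumption, not the paper's representation.

```python
def banded_matvec(diags, offsets, x):
    """y = A x for a banded matrix stored by diagonals:
    diags[d][i] is entry A[i, i + offsets[d]] (zero outside the band)."""
    n = len(x)
    y = [0.0] * n
    for off, diag in zip(offsets, diags):
        for i in range(n):
            j = i + off
            if 0 <= j < n:
                y[i] += diag[i] * x[j]
    return y

def krylov_powers(diags, offsets, x, k):
    """Return [A x, A^2 x, ..., A^k x] by chaining k matvec stages.
    In the dataflow design these stages run as a pipeline, so the
    sustainable FLOPS rate scales with k; here they run in sequence."""
    out, v = [], x
    for _ in range(k):
        v = banded_matvec(diags, offsets, v)
        out.append(v)
    return out
```

The banded structure is what makes the overlap work: stage i only needs a bounded window of stage i-1's output, so all intermediate vectors stay on-chip.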
Hung E, Todman T, Luk W, 2014, Transparent insertion of latency-oblivious logic onto FPGAs
© 2014 Technical University of Munich (TUM). We present an approach for inserting latency-oblivious functionality into pre-existing FPGA circuits transparently. To ensure transparency - that such modifications do not affect the design's maximum clock frequency - we insert any additional logic post place-and-route, using only the spare resources that were not consumed by the pre-existing circuit. The typical challenge with adding new functionality into existing circuits incrementally is that spare FPGA resources to host this functionality must be located close to the input signals that it requires, in order to minimise the impact of routing delays. In congested designs, however, such co-location is often not possible. We overcome this challenge by using flow techniques to pipeline and route signals from where they originate, potentially in a region of high resource congestion, into a region of low congestion capable of hosting new circuitry, at the expense of latency. We demonstrate and evaluate our approach by augmenting realistic designs with self-monitoring circuitry, which is not sensitive to latency. We report results on circuits operating over 200 MHz and show that our insertions have no impact on timing, are 2-4 times faster than compile-time insertion, and incur only a small power overhead.
Guo L, Thomas DB, Luk W, 2014, Automated Framework for General-Purpose Genetic Algorithms in FPGAs, 17th European Conference on Applications of Evolutionary Computation (EvoApplications), Publisher: SPRINGER-VERLAG BERLIN, Pages: 714-725, ISSN: 0302-9743
Kurek M, Becker T, Chau TCP, et al., 2014, Automating Optimization of Reconfigurable Designs, 22nd IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Publisher: IEEE, Pages: 210-213
Chau TCP, Kurek M, Targett JS, et al., 2014, SMCGen: Generating Reconfigurable Design for Sequential Monte Carlo Applications, 22nd IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Publisher: IEEE, Pages: 141-148
Ma Y, Liu J, Zhang C, et al., 2014, HW/SW Partitioning For Region-based Dynamic Partial Reconfigurable FPGAs, 32nd IEEE International Conference on Computer Design (ICCD), Publisher: IEEE, Pages: 470-476, ISSN: 1063-6404
Funie A-I, Salmon M, Luk W, 2014, A Hybrid Genetic-Programming Swarm-Optimisation Approach for Examining the Nature and Stability of High Frequency Trading Strategies, 13th International Conference on Machine Learning and Applications (ICMLA), Publisher: IEEE, Pages: 29-34
Yang J, Guo C, Luk W, et al., 2014, Collaborative processing of Least-Square Monte Carlo for American Options, International Conference on Field Programmable Technology, Publisher: IEEE, Pages: 52-59
Bara A, Niu X, Luk W, 2014, A Dataflow System for Anomaly Detection and Analysis, International Conference on Field Programmable Technology, Publisher: IEEE, Pages: 276-279
Inggs G, Fleming S, Thomas D, et al., 2014, Is High Level Synthesis ready for business? A computational finance case study, International Conference on Field Programmable Technology, Publisher: IEEE, Pages: 12-19
Shao S, Guo C, Luk W, et al., 2014, Accelerating Transfer Entropy Computation, International Conference on Field Programmable Technology, Publisher: IEEE, Pages: 60-67
Guo C, Luk W, Weston S, 2014, Pipelined Reconfigurable Accelerator for Ordinal Pattern Encoding, IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), Publisher: IEEE, Pages: 194-201, ISSN: 2160-0511
This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.