Arram J, Kaplan T, Luk W, et al., 2017, Leveraging FPGAs for Accelerating Short Read Alignment, IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, Vol: 14, Pages: 668-677, ISSN: 1545-5963
Burovskiy P, Grigoras P, Sherwin S, et al., 2017, Efficient Assembly for High-Order Unstructured FEM Meshes (FPL 2015), ACM TRANSACTIONS ON RECONFIGURABLE TECHNOLOGY AND SYSTEMS, Vol: 10, ISSN: 1936-7406
Fu H, He C, Luk W, et al., 2017, A nanosecond-level hybrid table design for financial market data generators, Pages: 227-234
© 2017 IEEE. This paper proposes a hybrid sorted table design for minimizing electronic trading latency, with three main contributions. First, a hierarchical sorted table with two levels, a fast cache table in reconfigurable hardware storing megabytes of data items and a master table in software storing gigabytes of data items. Second, a full set of operations, including insertion, deletion, selection and sorting, for the hybrid table with latency in a few cycles. Third, an on-demand synchronization scheme between the cache table and the master table. An implementation has been developed that targets an FPGA-based network card in the environment of the China Financial Futures Exchange (CFFEX) which sustains 1-10Gb/s bandwidth with latency of 400 to 700 nanoseconds, providing an 80- to 125-fold latency reduction compared to a fully optimized CPU-based solution, and a 2.2-fold reduction over an existing FPGA-based solution.
© Springer International Publishing AG 2017. A trading strategy is generally optimised for a given market regime. If it takes too long to switch from one trading strategy to another, then a sub-optimal trading strategy may be adopted. This paper proposes the first FPGA-based framework which supports multiple trend-following trading strategies to obtain accurate market characterisation for various financial market regimes. The framework contains a trading strategy kernel library covering a number of well-known trend-following strategies, such as “triple moving average”. Three types of design are targeted: a static reconfiguration trading strategy (SRTS), a full reconfiguration trading strategy (FRTS), and a partial reconfiguration trading strategy (PRTS). Our approach is evaluated using both synthetic and historical market data. Compared to a fully optimised CPU implementation, the SRTS design achieves 11 times speedup, the FRTS design achieves 2 times speedup, while the PRTS design achieves 7 times speedup. The FRTS and PRTS designs also reduce the amount of resources used on chip by 29% and 15% respectively, when compared to the SRTS design.
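For readers unfamiliar with the "triple moving average" strategy named in the abstract above, a minimal software sketch follows. It assumes the common textbook form of the rule (go long when the short average is above the medium average and the medium is above the long one, go short in the mirrored case), and it is not taken from the paper's kernel library. The function names, window lengths and signal encoding are illustrative assumptions.

```python
def sma(prices, window):
    """Simple moving average; entries are None until `window` samples exist."""
    out = []
    total = 0.0
    for i, p in enumerate(prices):
        total += p
        if i >= window:
            total -= prices[i - window]  # drop the sample leaving the window
        out.append(total / window if i >= window - 1 else None)
    return out


def triple_ma_signals(prices, short=2, medium=3, long=4):
    """Return +1 (long), -1 (short) or 0 (flat) for each price sample.

    Window lengths are hypothetical defaults chosen for a tiny example.
    """
    s, m, l = sma(prices, short), sma(prices, medium), sma(prices, long)
    signals = []
    for a, b, c in zip(s, m, l):
        if a is None or b is None or c is None:
            signals.append(0)   # not enough history yet
        elif a > b > c:
            signals.append(1)   # averages stacked upwards: uptrend
        elif a < b < c:
            signals.append(-1)  # averages stacked downwards: downtrend
        else:
            signals.append(0)   # no clear trend
    return signals


if __name__ == "__main__":
    print(triple_ma_signals([1, 2, 3, 4, 5, 6]))
```

On a steadily rising series the signal turns to +1 as soon as the longest window has filled; on a falling series it turns to -1.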
Grigoras P, Burovskiy P, Arram J, et al., 2017, Dfesnippets: An open-source library for dataflow acceleration on FPGAs, Pages: 299-310, ISSN: 0302-9743
© Springer International Publishing AG 2017. Highly-tuned FPGA implementations can achieve significant performance and power efficiency gains over general purpose hardware. However, the limited development productivity has prevented mainstream adoption of FPGAs in many areas such as High Performance Computing. High-level standard development libraries are increasingly adopted to improve productivity. We propose an approach for performance-critical applications including standard library modules, benchmarking facilities and application benchmarks to support a variety of use cases. We implement the proposed approach as an open-source library for a commercially available FPGA system and highlight applications and productivity gains.
Hung E, Todman T, Luk W, 2017, Transparent In-Circuit Assertions for FPGAs, IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, Vol: 36, Pages: 1193-1202, ISSN: 0278-0070
Inggs G, Thomas DB, Luk W, 2017, A Domain Specific Approach to High Performance Heterogeneous Computing, IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, Vol: 28, Pages: 2-15, ISSN: 1045-9219
Leong PHW, Amano H, Anderson J, et al., 2017, The First 25 Years of the FPL Conference: Significant Papers, ACM TRANSACTIONS ON RECONFIGURABLE TECHNOLOGY AND SYSTEMS, Vol: 10, ISSN: 1936-7406
Li T, Heinis T, Luk W, 2017, ADvaNCE - Efficient and Scalable Approximate Density-Based Clustering Based on Hashing, INFORMATICA, Vol: 28, Pages: 105-130, ISSN: 0868-4952
Zhao R, Niu X, Wu Y, et al., 2017, Optimizing CNN-based object detection algorithms on embedded FPGA platforms, Pages: 255-267, ISSN: 0302-9743
© Springer International Publishing AG 2017. Algorithms based on Convolutional Neural Networks (CNNs) have recently been applied to object detection applications, greatly improving their performance. However, many devices intended for these algorithms have limited computation resources and strict power consumption constraints, and are not suitable for algorithms designed for GPU workstations. This paper presents a novel method to optimise CNN-based object detection algorithms targeting embedded FPGA platforms. Given parameterised CNN hardware modules, an optimisation flow takes network architectures and resource constraints as input, and tunes hardware parameters with algorithm-specific information to explore the design space and achieve high performance. The evaluation shows that our design model accuracy is above 85% and, with optimised configuration, our design can achieve 49.6 times speed-up compared with software implementation.
Cardoso JMP, Coutinho JGF, Carvalho T, et al., 2016, Performance-driven instrumentation and mapping strategies using the LARA aspect-oriented programming approach, SOFTWARE-PRACTICE & EXPERIENCE, Vol: 46, Pages: 251-287, ISSN: 0038-0644
Grigoras P, Burovskiy P, Luk W, 2016, CASK - Open-Source Custom Architectures for Sparse Kernels, ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), Publisher: ASSOC COMPUTING MACHINERY, Pages: 179-184
Grigoras P, Burovskiy P, Luk W, et al., 2016, Optimising Sparse Matrix Vector Multiplication for Large Scale FEM problems on FPGA, 26th International Conference on Field-Programmable Logic and Applications (FPL), Publisher: IEEE, ISSN: 1946-1488
Hmid SN, Coutinho JGF, Luk W, 2016, A Transfer-Aware Runtime System for Heterogeneous Asynchronous Parallel Execution, ACM SIGARCH Computer Architecture News, Vol: 43, Pages: 40-45, ISSN: 0163-5964
Kurek M, Becker T, Guo C, et al., 2016, Self-aware hardware acceleration of financial applications on a heterogeneous cluster, Natural Computing Series, Pages: 241-260
© Springer International Publishing Switzerland 2016. This chapter describes self-awareness in four financial applications. We apply some of the design patterns of Chapter 5 and techniques of Chapter 7. We describe three applications briefly, highlighting the links to self-awareness and self-expression. The applications are (i) a hybrid genetic programming and particle swarm optimisation approach for high-frequency trading, with fitness function evaluation accelerated by FPGA; (ii) an adaptive point process model for currency trading, accelerated by FPGA hardware; (iii) an adaptive line arbitrator synthesising high-reliability and low-latency feeds from redundant data feeds (A/B feeds) using FPGA hardware. Finally, we describe in more detail a generic optimisation approach for reconfigurable designs automating design optimisation, using reconfigurable hardware to speed up the optimisation process, applied to applications including a quadrature-based financial application. In each application, the hardware-accelerated self-aware approaches give significant benefits: up to 55× speedup for hardware-accelerated design optimisation compared to software hill climbing.
Kurek M, Deisenroth MP, Luk W, et al., 2016, Knowledge Transfer in Automatic Optimisation of Reconfigurable Designs, 24th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), Publisher: IEEE, Pages: 84-87
Li T, Heinis T, Luk W, 2016, Hashing-based approximate DBSCAN, Pages: 31-45, ISSN: 0302-9743
© Springer International Publishing Switzerland 2016. Analyzing massive amounts of data and extracting value from it has become key across different disciplines. As the amounts of data grow rapidly, however, current approaches for data analysis struggle. This is particularly true for clustering algorithms where distance calculations between pairs of points dominate overall time. Crucial to the data analysis and clustering process, however, is that it is rarely straightforward. Instead, parameters need to be determined through several iterations. Entirely accurate results are thus rarely needed and instead we can sacrifice precision of the final result to accelerate the computation. In this paper we develop ADvaNCE, a new approach to approximating DBSCAN. ADvaNCE uses two measures to reduce distance calculation overhead: (1) locality sensitive hashing to approximate and speed up distance calculations and (2) representative point selection to reduce the number of distance calculations. Our experiments show that our approach is in general one order of magnitude faster (at most 30x in our experiments) than the state of the art.
Lindsey B, Leslie M, Luk W, 2016, A Domain Specific Language for Accelerated Multilevel Monte Carlo Simulations, 27th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), Publisher: IEEE, Pages: 99-106, ISSN: 1063-6862
© 2006 IEEE. This article consists of a collection of slides from the author's conference presentation on optimal custom instruction processors. Some of the specific topics discussed include: the special features and system specifications of extensible processors; design flow capabilities; instruction set selection and bandwidth considerations; application-specific processor synthesis; and both current and future areas of processor technology development.
Ma Y, Zhang C, Luk W, 2016, Hybrid two-stage HW/SW partitioning algorithm for dynamic partial reconfigurable FPGAs, Qinghua Daxue Xuebao/Journal of Tsinghua University, Vol: 56, ISSN: 1000-0054
© 2016, Press of Tsinghua University. All rights reserved. More and more hardware platforms are providing dynamic partial reconfiguration; thus, traditional hardware/software partitioning algorithms are no longer applicable. Some studies have analyzed the dynamic partial reconfiguration as mixed-integer linear programming (MILP) models to get solutions. However, the MILP models are slow and can only handle small problems. This paper uses heuristic algorithms to determine the status of some critical tasks to reduce the scale of the MILP problem for large problems. Tests show that this method is about 200 times faster with the same solution quality as the traditional mathematical programming method.
Niu X, Ng N, Yuki T, et al., 2016, EURECA Compilation: Automatic Optimisation of Cycle-Reconfigurable Circuits, 26th International Conference on Field-Programmable Logic and Applications (FPL), Publisher: IEEE, ISSN: 1946-1488
Niu X, Todman T, Luk W, 2016, Self-adaptive hardware acceleration on a heterogeneous cluster, Natural Computing Series, Pages: 167-192
© Springer International Publishing Switzerland 2016. Building a cluster of computers is a common technique to significantly improve the throughput of computationally intensive applications. Communication networks connect hundreds to thousands of compute nodes to form a cluster system, where a parallelisable application workload is distributed into the compute nodes. Theoretically, heterogeneous clusters with various types of processing units are more efficient than homogeneous clusters, since some types of processing units perform better than others on certain applications. A heterogeneous cluster can achieve better cluster performance by adapting cluster configurations to assign applications to processing elements that fit well with the applications. In this chapter we describe how to build a heterogeneous cluster that can adapt to application requirements. Section 9.1 provides an overview of heterogeneous computing. Section 9.2 presents the commonly used hardware and software architectures of heterogeneous clusters. Section 9.3 discusses the use of self-awareness and self-adaptivity in two runtime scenarios of a heterogeneous cluster, and Section 9.4 presents the experimental results. Finally, Section 9.5 discusses approaches to formally verify the developed applications.
Stroobandt D, Varbanescu AL, Ciobanu CB, et al., 2016, EXTRA: Towards the Exploitation of eXascale Technology for Reconfigurable Architectures, 11th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), Publisher: IEEE
Wang S, Niu X, Ma N, et al., 2016, A scalable dataflow accelerator for real time onboard hyperspectral image classification, Pages: 105-116, ISSN: 0302-9743
© Springer International Publishing Switzerland 2016. Real-time hyperspectral image classification is a necessary primitive in many remotely sensed image analysis applications. Previous work has shown that Support Vector Machines (SVMs) can achieve high classification accuracy, but unfortunately they are very computationally expensive. This paper presents a scalable dataflow accelerator on FPGA for real-time SVM classification of hyperspectral images. To address data dependencies, we adapt a multi-class classifier based on Hamming distance. The architecture is scalable to high problem dimensionality and available hardware resources. Implementation results show that the FPGA design achieves speedups of 26x, 1335x, 66x and 14x compared with implementations on ZYNQ, ARM, DSP and Xeon processors respectively. Moreover, one to two orders of magnitude reduction in power consumption is achieved for the AVIRIS hyperspectral image datasets.
Yu T, Feng B, Stillwell M, et al., 2016, Relation-Oriented Resource Allocation for Multi-Accelerator Systems, 27th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), Publisher: IEEE, Pages: 243-244, ISSN: 1063-6862
Zhao W, Fu H, Luk W, et al., 2016, F-CNN: An FPGA-based Framework for Training Convolutional Neural Networks, 27th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), Publisher: IEEE, Pages: 107-114, ISSN: 1063-6862
Zhou H, Niu X, Yuan J, et al., 2016, Connect On the Fly: Enhancing and Prototyping of Cycle-Reconfigurable Modules, 26th International Conference on Field-Programmable Logic and Applications (FPL), Publisher: IEEE, ISSN: 1946-1488
Arram J, Luk W, Jiang P, 2015, Ramethy: Reconfigurable acceleration of bisulfite sequence alignment, Pages: 250-259
This paper proposes a novel reconfigurable architecture for accelerating DNA sequence alignment. This architecture is applied to bisulfite sequence alignment, a stage in recently developed bioinformatics pipelines for cancer and non-invasive prenatal diagnosis. Alignment is currently the bottleneck in such pipelines, accounting for over 50% of the total analysis time. Our design, Ramethy (Reconfigurable Acceleration of METHYlation data analysis), performs alignment of short reads with up to two mismatches. Ramethy is based on the FM-index, which we optimise to reduce the number of search steps and improve approximate matching performance. We implement Ramethy on a 1U Maxeler MPC-X1000 dataflow node consisting of 8 Altera Stratix-V FPGAs. Measured results show a 14.9 times speedup compared to soap2 running with 16 threads on dual Intel Xeon E5-2650 CPUs, and 3.8 times speedup compared to soap3-dp running on an NVIDIA GTX 580 GPU. Upper-bound performance estimates for the MPC-X1000 indicate a maximum speedup of 88.4 times and 22.6 times compared to soap2 and soap3-dp respectively. In addition to runtime, Ramethy consumes over an order of magnitude lower energy while having accuracy identical to soap2 and soap3-dp, making it a strong candidate for integration into bioinformatics pipelines.
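As background for the FM-index mentioned in the abstract above, the following is a minimal software sketch of exact-match backward search. It is a plain textbook version and does not reproduce Ramethy's optimised index, its two-mismatch search or its hardware design. The function names, the `$` terminator and the naive suffix-array construction are assumptions chosen for brevity.

```python
def build_fm_index(text):
    """Build a toy FM-index: suffix array, C table and occurrence table.

    Assumes "$" does not occur in `text` and sorts before every other symbol.
    The suffix array is built naively, which is only suitable for small inputs.
    """
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)

    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1

    # C[c]: number of characters in text that are strictly smaller than c.
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, C, occ


def locate(pattern, sa, C, occ):
    """Return sorted start positions of exact occurrences of `pattern`."""
    lo, hi = 0, len(sa)  # current suffix-array interval [lo, hi)
    for ch in reversed(pattern):  # one search step per pattern character
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    sa, C, occ = build_fm_index("ACGTACGA")
    print(locate("ACG", sa, C, occ))
```

Each pattern character costs one interval update, which is why reducing the number of search steps, as the abstract describes, bears directly on alignment time.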
This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.