Imperial College London

Professor Wayne Luk

Faculty of Engineering, Department of Computing

Professor of Computer Engineering
 
 
 

Contact

 

+44 (0)20 7594 8313 | w.luk | Website

 
 

Location

 

434 Huxley Building, South Kensington Campus



 

Publications


555 results found

Fu H, He C, Luk W, Li W, Yang G et al., A nanosecond-level hybrid table design for financial market data generators, The 25th IEEE International Symposium on Field-Programmable Custom Computing Machines, Publisher: IEEE

This paper proposes a hybrid sorted table design for minimizing electronic trading latency, with three main contributions. First, a hierarchical sorted table with two levels: a fast cache table in reconfigurable hardware storing megabytes of data items, and a master table in software storing gigabytes of data items. Second, a full set of operations, including insertion, deletion, selection and sorting, for the hybrid table with latency of a few cycles. Third, an on-demand synchronization scheme between the cache table and the master table. An implementation has been developed that targets an FPGA-based network card in the environment of the China Financial Futures Exchange (CFFEX), which sustains 1-10 Gb/s bandwidth with latency of 400 to 700 nanoseconds, providing an 80- to 125-fold latency reduction compared to a fully optimized CPU-based solution, and a 2.2-fold reduction over an existing FPGA-based solution.

CONFERENCE PAPER
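
The abstract above describes a two-level design: a small, fast cache table backed by a larger master table, with on-demand synchronisation between them. The Python sketch below is purely illustrative of that idea in software terms; the class name, the `cache_capacity` parameter and the eviction policy are assumptions for illustration, not details from the paper, whose cache level runs in FPGA hardware.

```python
import bisect

class HybridSortedTable:
    """Illustrative two-level sorted table: a small, always-sorted cache
    backed by a larger master store, synchronised on demand.
    (Software analogue only; the paper's cache level lives in FPGA hardware.)"""

    def __init__(self, cache_capacity=4):
        self.cache = []                 # small, sorted list of (key, value) pairs
        self.master = {}                # large, unsorted backing store
        self.capacity = cache_capacity  # hypothetical cache size limit

    def insert(self, key, value):
        bisect.insort(self.cache, (key, value))
        if len(self.cache) > self.capacity:
            # On-demand synchronisation: spill the largest entry to the master.
            k, v = self.cache.pop()
            self.master[k] = v

    def delete(self, key):
        i = bisect.bisect_left(self.cache, (key,))
        if i < len(self.cache) and self.cache[i][0] == key:
            del self.cache[i]
        self.master.pop(key, None)

    def select(self, key):
        i = bisect.bisect_left(self.cache, (key,))
        if i < len(self.cache) and self.cache[i][0] == key:
            return self.cache[i][1]
        return self.master.get(key)     # miss in the cache: consult the master

    def sorted_items(self):
        # Merge the sorted cache with a sorted view of the master table.
        return sorted(list(self.cache) + list(self.master.items()))

table = HybridSortedTable()
for k in [5, 1, 9, 3, 7, 2]:
    table.insert(k, str(k))
print(table.select(9), table.sorted_items())
```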

Shao S, Guo L, Guo C, Chau T, Thomas DB, Luk W, Weston S et al., Recursive pipelined genetic propagation for bilevel optimisation, FPL

CONFERENCE PAPER

Burovskiy P, Grigoras P, Sherwin S, Luk W et al., 2017, Efficient assembly for high-order unstructured FEM meshes (FPL 2015), ACM Transactions on Reconfigurable Technology and Systems, Vol: 10, ISSN: 1936-7406

© 2017 ACM. The Finite Element Method (FEM) is a common numerical technique used for solving Partial Differential Equations on large and unstructured domain geometries. Numerical methods for FEM typically use algorithms and data structures which exhibit an unstructured memory access pattern. This makes acceleration of FEM on Field-Programmable Gate Arrays using an efficient, deeply pipelined architecture particularly challenging. In this work, we focus on implementing and optimising a vector assembly operation which, in the context of FEM, induces the unstructured memory access. We propose a dataflow architecture, graph-based theoretical model, and design flow for optimising the assembly operation for the spectral/hp finite element method on reconfigurable accelerators. We evaluate the proposed approach on two benchmark meshes and show that the graph-theoretic method of generating a static data access schedule results in a significant improvement in resource utilisation compared to prior work. This enables supporting larger FEM meshes on FPGA than previously possible.

JOURNAL ARTICLE

Funie AI, Guo L, Niu X, Luk W, Salmon M et al., 2017, Custom framework for run-time trading strategies, Pages: 154-167, ISSN: 0302-9743

© Springer International Publishing AG 2017. A trading strategy is generally optimised for a given market regime. If it takes too long to switch from one trading strategy to another, then a sub-optimal trading strategy may be adopted. This paper proposes the first FPGA-based framework which supports multiple trend-following trading strategies to obtain accurate market characterisation for various financial market regimes. The framework contains a trading strategy kernel library covering a number of well-known trend-following strategies, such as “triple moving average”. Three types of design are targeted: a static reconfiguration trading strategy (SRTS), a full reconfiguration trading strategy (FRTS), and a partial reconfiguration trading strategy (PRTS). Our approach is evaluated using both synthetic and historical market data. Compared to a fully optimised CPU implementation, the SRTS design achieves 11 times speedup, the FRTS design achieves 2 times speedup, while the PRTS design achieves 7 times speedup. The FRTS and PRTS designs also reduce the amount of resources used on chip by 29% and 15% respectively, when compared to the SRTS design.

CONFERENCE PAPER
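
Since the abstract above names the “triple moving average” strategy as one kernel in the library, here is a minimal, hypothetical Python illustration of how such a trend-following rule can be expressed in software; the window lengths and signal convention are assumptions for illustration, not the paper's parameters or implementation.

```python
def moving_average(prices, window):
    """Simple moving average of the last `window` prices at each time step."""
    return [sum(prices[max(0, i - window + 1):i + 1]) / min(i + 1, window)
            for i in range(len(prices))]

def triple_moving_average_signal(prices, short=5, medium=20, long=50):
    """Illustrative trend-following rule: +1 (go long) when the short MA is
    above the medium MA and the medium MA is above the long MA, -1 when the
    ordering is reversed, 0 otherwise."""
    s, m, l = (moving_average(prices, w) for w in (short, medium, long))
    signals = []
    for a, b, c in zip(s, m, l):
        if a > b > c:
            signals.append(+1)
        elif a < b < c:
            signals.append(-1)
        else:
            signals.append(0)
    return signals

# Example: a steadily rising price series eventually produces +1 signals.
print(triple_moving_average_signal([float(p) for p in range(1, 101)])[-1])
```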

Gan L, Fu H, Mencer O, Luk W, Yang G et al., 2017, Data Flow Computing in Geoscience Applications, Advances in Computers, Pages: 125-158

© 2017 Elsevier Inc. Geoscience research is one of the major fields that calls for the support of high-performance computing (HPC). With the algorithms of geoscience applications becoming more complex, and the ever-increasing demands for better performance and finer resolutions, technical innovations from both algorithmic and architectural perspectives are highly desired. In recent years, data flow computing engines based on reconfigurable computing systems such as FPGAs have been introduced into the HPC area, and have started to show encouraging results in many important applications. In this chapter, we summarize our initial efforts and experiences of using Maxeler Data Flow Engines as high-performance platforms, aiming to eliminate the main bottlenecks and obtain higher efficiency for solving geoscience problems. Choosing three computing kernels from two popular geoscience application domains (climate modeling and exploration geophysics), we present a set of customization and optimization techniques based on the reconfigurable hardware platforms. Through building highly efficient computing pipelines that fit well with both the algorithm and the architecture, we manage to achieve better results in both performance and power efficiency over traditional multi-core and many-core architectures. Our work demonstrates that data flow computing engines are promising candidates for contributing to the development of geoscience applications.

BOOK CHAPTER

Grigoras P, Burovskiy P, Arram J, Niu X, Cheung K, Xie J, Luk W et al., 2017, Dfesnippets: An open-source library for dataflow acceleration on FPGAs, Pages: 299-310, ISSN: 0302-9743

© Springer International Publishing AG 2017. Highly-tuned FPGA implementations can achieve significant performance and power efficiency gains over general purpose hardware. However, limited development productivity has prevented mainstream adoption of FPGAs in many areas, such as High Performance Computing. High-level standard development libraries are increasingly adopted to improve productivity. We propose an approach for performance-critical applications, including standard library modules, benchmarking facilities and application benchmarks, to support a variety of use cases. We implement the proposed approach as an open-source library for a commercially available FPGA system and highlight applications and productivity gains.

CONFERENCE PAPER

Inggs G, Thomas DB, Luk W, 2017, A Domain Specific Approach to High Performance Heterogeneous Computing, IEEE Transactions on Parallel and Distributed Systems, Vol: 28, Pages: 2-15, ISSN: 1045-9219

JOURNAL ARTICLE

Leong PHW, Amano H, Anderson J, Bertels K, Cardoso JMP, Diessel O, Gogniat G, Hutton M, Lee J, Luk W, Lysaght P, Platzner M, Prasanna VK, Rissa T, Silvano C, So HK-H, Wang Y et al., 2017, The First 25 Years of the FPL Conference: Significant Papers, Publisher: Association for Computing Machinery, ISSN: 1936-7406

CONFERENCE PAPER

Li T, Heinis T, Luk W, 2017, ADvaNCE - Efficient and Scalable Approximate Density-Based Clustering Based on Hashing, Informatica, Vol: 28, Pages: 105-130, ISSN: 0868-4952

JOURNAL ARTICLE

Zhao R, Niu X, Wu Y, Luk W, Liu Q et al., 2017, Optimizing CNN-based object detection algorithms on embedded FPGA platforms, Pages: 255-267, ISSN: 0302-9743

© Springer International Publishing AG 2017. Algorithms based on Convolutional Neural Network (CNN) have recently been applied to object detection applications, greatly improving their performance. However, many devices intended for these algorithms have limited computation resources and strict power consumption constraints, and are not suitable for algorithms designed for GPU workstations. This paper presents a novel method to optimise CNN-based object detection algorithms targeting embedded FPGA platforms. Given parameterised CNN hardware modules, an optimisation flow takes network architectures and resource constraints as input, and tunes hardware parameters with algorithm-specific information to explore the design space and achieve high performance. The evaluation shows that our design model accuracy is above 85% and, with optimised configuration, our design can achieve 49.6 times speed-up compared with software implementation.

CONFERENCE PAPER
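
The abstract above describes an optimisation flow that tunes hardware parameters under resource constraints to maximise performance. As a rough software analogue only, the sketch below enumerates a toy parameter space and keeps the best configuration that fits a resource budget; the parameter names, resource model and performance model are all invented for illustration and are not the paper's models.

```python
from itertools import product

# Hypothetical tunable parameters of a CNN accelerator template.
PARALLELISM = [1, 2, 4, 8, 16]      # multipliers working in parallel
BUFFER_DEPTH = [256, 512, 1024]     # on-chip buffer words per channel

def resources(par, depth):
    # Toy resource model: DSPs scale with parallelism, BRAM with buffer depth.
    return {"dsp": 32 * par, "bram": depth // 64}

def throughput(par, depth):
    # Toy performance model: more parallelism helps until buffers starve it.
    return par * min(1.0, depth / 512)

def explore(budget):
    """Exhaustively search the toy design space within the resource budget."""
    best = None
    for par, depth in product(PARALLELISM, BUFFER_DEPTH):
        use = resources(par, depth)
        if all(use[r] <= budget[r] for r in budget):
            cand = (throughput(par, depth), par, depth)
            best = max(best, cand) if best else cand
    return best

print(explore({"dsp": 256, "bram": 16}))   # -> best (throughput, par, depth)
```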

Arram J, Kaplan T, Luk W, Jiang P et al., 2016, Leveraging FPGAs for Accelerating Short Read Alignment, IEEE/ACM Transactions on Computational Biology and Bioinformatics, Pages: 1-1, ISSN: 1545-5963

JOURNAL ARTICLE

Arram J, Pflanzer M, Kaplan T, Luk W et al., 2016, FPGA acceleration of reference-based compression for genomic data, Pages: 9-16

© 2015 IEEE. One of the key challenges facing genomics today is efficiently storing the massive amounts of data generated by next-generation sequencing platforms. Reference-based compression is a popular strategy for reducing the size of genomic data, whereby sequence information is encoded as a mapping to a known reference sequence. Determining the mapping is a computationally intensive problem, and is the bottleneck of most reference-based compression tools currently available. This paper presents the first FPGA acceleration of reference-based compression for genomic data. We develop a new mapping algorithm based on the FM-index search operation which includes optimisations targeting the compression ratio and speed. Our hardware design is implemented on a Maxeler MPC-X2000 node comprising 8 Altera Stratix V FPGAs. When evaluated against compression tools currently available, our tool achieves a superior compression ratio, compression time, and energy consumption for both FASTA and FASTQ formats. For example, our tool achieves a 30% higher compression ratio and is 71.9 times faster than the fastqz tool.

CONFERENCE PAPER
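
The mapping algorithm above is built on the FM-index search operation. As background for the kind of operation being accelerated, the sketch below shows standard FM-index backward search over a small text in Python; it follows the textbook formulation with a naive index construction, not the paper's hardware-optimised variant.

```python
def build_fm_index(text):
    """Build a toy FM-index (suffix array, C array, occurrence counts)."""
    text += "$"                                              # unique, smallest terminator
    sa = sorted(range(len(text)), key=lambda i: text[i:])    # suffix array (naive)
    bwt = "".join(text[i - 1] for i in sa)                   # Burrows-Wheeler transform
    alphabet = sorted(set(text))
    # C[c] = number of characters in the text strictly smaller than c.
    C = {c: sum(text.count(d) for d in alphabet if d < c) for c in alphabet}
    # occ[c][i] = occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, C, occ

def backward_search(pattern, sa, C, occ):
    """Return the positions of `pattern` using FM-index backward search."""
    lo, hi = 0, len(sa)                  # current suffix-array interval
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])

sa, C, occ = build_fm_index("mississippi")
print(backward_search("issi", sa, C, occ))   # -> [1, 4]
```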

Cardoso JMP, Coutinho JGF, Carvalho T, Diniz PC, Petrov Z, Luk W, Gonçalves F et al., 2016, Performance-driven instrumentation and mapping strategies using the LARA aspect-oriented programming approach, Software - Practice and Experience, Vol: 46, Pages: 251-287, ISSN: 0038-0644

Copyright © 2014 John Wiley & Sons, Ltd. The development of applications for high-performance embedded systems is a long and error-prone process because, in addition to the required functionality, developers must consider various and often conflicting nonfunctional requirements such as performance and/or energy efficiency. The complexity of this process is further exacerbated by the multitude of target architectures and mapping tools. This article describes LARA, an aspect-oriented programming language that allows programmers to convey domain-specific knowledge and nonfunctional requirements to a toolchain composed of source-to-source transformers, compiler optimizers, and mapping/synthesis tools. LARA is sufficiently flexible to target different tools and host languages while also allowing the specification of compilation strategies to enable efficient generation of software code and hardware cores (using hardware description languages) for hybrid target architectures, a unique feature that, to the best of our knowledge, is not found in any other aspect-oriented programming language. A key feature of LARA is its ability to deal with different models of join points, actions, and attributes. In this article, we describe the LARA approach and evaluate its impact on code instrumentation and analysis and on selecting critical code sections to be migrated to hardware accelerators for two embedded applications from industry.

JOURNAL ARTICLE

Grigoras P, Burovskiy P, Luk W, 2016, CASK - Open-source custom architectures for sparse kernels, Pages: 179-184

© 2016 ACM. Sparse matrix vector multiplication (SpMV) is an important kernel in many scientific applications. To improve the performance and applicability of FPGA based SpMV, we propose an approach for exploiting properties of the input matrix to generate optimised custom architectures. The architectures generated by our approach are between 3.8 and 48 times faster than the worst case architectures for each matrix, showing the benefits of instance specific design for SpMV.

CONFERENCE PAPER
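
For readers unfamiliar with the kernel being customised above, the sketch below is a plain CSR (compressed sparse row) sparse matrix-vector multiplication in Python; the custom architectures in the paper specialise this computation to properties of the input matrix, which the sketch does not attempt to show.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a sparse matrix A stored in CSR form.

    values  - non-zero entries, stored row by row
    col_idx - column index of each non-zero
    row_ptr - index into `values` where each row starts (length: rows + 1)
    """
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y

# A = [[10, 0, 2],
#      [ 0, 3, 0],
#      [ 1, 0, 4]]
values  = [10.0, 2.0, 3.0, 1.0, 4.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(spmv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))   # -> [12.0, 3.0, 5.0]
```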

Grigoras P, Burovskiy P, Luk W, Sherwin S et al., 2016, Optimising Sparse Matrix Vector multiplication for large scale FEM problems on FPGA

© 2016 EPFL. Sparse Matrix Vector multiplication (SpMV) is an important kernel in many scientific applications. In this work we propose an architecture and an automated customisation method to detect and optimise the architecture for block diagonal sparse matrices. We evaluate the proposed approach in the context of the spectral/hp Finite Element Method, using the local matrix assembly approach. This problem leads to a large sparse system of linear equations with block diagonal matrix which is typically solved using an iterative method such as the Preconditioned Conjugate Gradient. The efficiency of the proposed architecture combined with the effectiveness of the proposed customisation method reduces BRAM resource utilisation by as much as 10 times, while achieving identical throughput with existing state of the art designs and requiring minimal development effort from the end user. In the context of the Finite Element Method, our approach enables the solution of larger problems than previously possible, enabling the applicability of FPGAs to more interesting HPC problems.

CONFERENCE PAPER

Guo L, Funie AI, Thomas DB, Fu H, Luk W et al., 2016, Parallel Genetic Algorithms on Multiple FPGAs, ACM SIGARCH Computer Architecture News, Vol: 43, Pages: 86-93, ISSN: 0163-5964

JOURNAL ARTICLE

Hmid SN, Coutinho JGF, Luk W, 2016, A Transfer-Aware Runtime System for Heterogeneous Asynchronous Parallel Execution, ACM SIGARCH Computer Architecture News, Vol: 43, Pages: 40-45, ISSN: 0163-5964

JOURNAL ARTICLE

Hung E, Todman T, Luk W, 2016, Transparent In-Circuit Assertions for FPGAs, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Pages: 1-1, ISSN: 0278-0070

JOURNAL ARTICLE

Kurek M, Becker T, Guo C, Denholm S, Funie AI, Salmon M, Todman T, Luk W et al., 2016, Self-aware hardware acceleration of financial applications on a heterogeneous cluster, Natural Computing Series, Pages: 241-260

© Springer International Publishing Switzerland 2016. This chapter describes self-awareness in four financial applications. We apply some of the design patterns of Chapter 5 and techniques of Chapter 7. We describe three applications briefly, highlighting the links to self-awareness and self-expression. The applications are (i) a hybrid genetic programming and particle swarm optimisation approach for high-frequency trading, with fitness function evaluation accelerated by FPGA; (ii) an adaptive point process model for currency trading, accelerated by FPGA hardware; (iii) an adaptive line arbitrator synthesising high-reliability and low-latency feeds from redundant data feeds (A/B feeds) using FPGA hardware. Finally, we describe in more detail a generic optimisation approach for reconfigurable designs automating design optimisation, using reconfigurable hardware to speed up the optimisation process, applied to applications including a quadrature-based financial application. In each application, the hardware-accelerated self-aware approaches give significant benefits: up to 55× speedup for hardware-accelerated design optimisation compared to software hill climbing.

BOOK CHAPTER

Kurek M, Deisenroth MP, Luk W, Todman T et al., 2016, Knowledge Transfer in Automatic Optimisation of Reconfigurable Designs, Pages: 84-87

© 2016 IEEE. This paper presents a novel approach for automatic optimisation of reconfigurable design parameters based on knowledge transfer. The key idea is to make use of insights derived from optimising related designs to benefit future optimisations. We show how to use designs targeting one device to speed up optimisation of another device. The proposed approach is evaluated based on various applications including computational finance and seismic imaging. It is capable of achieving up to 35% reduction in optimisation time in producing designs with similar performance, compared to alternative optimisation methods.

CONFERENCE PAPER

Li T, Heinis T, Luk W, 2016, Hashing-based approximate DBSCAN, Pages: 31-45, ISSN: 0302-9743

© Springer International Publishing Switzerland 2016. Analyzing massive amounts of data and extracting value from it has become key across different disciplines. As the amounts of data grow rapidly, however, current approaches for data analysis struggle. This is particularly true for clustering algorithms where distance calculations between pairs of points dominate overall time. Crucially, however, the data analysis and clustering process is rarely straightforward: parameters need to be determined through several iterations. Entirely accurate results are thus rarely needed, and we can instead sacrifice precision of the final result to accelerate the computation. In this paper we develop ADvaNCE, a new approach to approximating DBSCAN. ADvaNCE uses two measures to reduce distance calculation overhead: (1) locality sensitive hashing to approximate and speed up distance calculations and (2) representative point selection to reduce the number of distance calculations. Our experiments show that our approach is in general one order of magnitude faster (up to 30x in our experiments) than the state of the art.

CONFERENCE PAPER
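
The first of the two measures described above, locality sensitive hashing, groups nearby points into the same buckets so that exact distance calculations can be restricted to bucket-mates. The sketch below shows that idea with a single random-projection hash table in Python; the projection-based hash family, the bucket width and the use of only one table are illustrative simplifications and are not ADvaNCE's actual configuration.

```python
import random
from collections import defaultdict

def lsh_buckets(points, bucket_width=1.0, seed=0):
    """Hash d-dimensional points with one random projection, so that points
    falling into the same bucket are likely to be close to each other."""
    rng = random.Random(seed)
    dim = len(points[0])
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]    # random projection direction
    b = rng.uniform(0.0, bucket_width)               # random offset
    buckets = defaultdict(list)
    for idx, p in enumerate(points):
        h = int((sum(ai * pi for ai, pi in zip(a, p)) + b) // bucket_width)
        buckets[h].append(idx)
    return buckets

def candidate_pairs(points, **kwargs):
    """Only pairs sharing a bucket are considered for exact distance checks."""
    pairs = set()
    for members in lsh_buckets(points, **kwargs).values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add((members[i], members[j]))
    return pairs

pts = [(0.0, 0.0), (0.1, 0.1), (5.0, 5.0), (5.1, 5.0)]
print(candidate_pairs(pts))   # nearby points tend to land in the same bucket
```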

Lindsey B, Leslie M, Luk W, 2016, A Domain Specific Language for accelerated Multilevel Monte Carlo simulations, Pages: 99-106, ISSN: 1063-6862

© 2016 IEEE. Monte Carlo simulations are used to tackle a wide range of exciting and complex problems, such as option pricing and biophotonic modelling. Since Monte Carlo simulations are both computationally expensive and highly parallelizable, they are ideally suited for acceleration through GPUs and FPGAs. Alongside these accelerators, Multilevel Monte Carlo techniques can be harnessed to further hasten simulations. However, researchers and application developers must invest a great deal of effort to design, optimise and test such Monte Carlo simulations. Furthermore, these models often have to be rewritten from scratch to target new hardware accelerators. This paper presents Neb, a Domain Specific Language for describing and generating Multilevel Monte Carlo simulations for a variety of hardware architectures. Neb compiles equations written in LaTeX to C++, OpenCL or Maxeler's MaxJ language, allowing acceleration through GPUs or FPGAs. Neb can be used to solve stochastic equations or to generate paths for analysis with other tools. To evaluate the performance of Neb, a variety of financial models are executed on CPUs, GPUs and FPGAs, demonstrating peak acceleration of 3.7 times with FPGAs in 40nm transistor technology, and 14.4 times with GPUs in 28nm transistor technology. Furthermore, the energy efficiency of these accelerators is compared, revealing FPGAs to be 8.73 times and GPUs 2.52 times more efficient than CPUs.

CONFERENCE PAPER
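
As background for the multilevel technique that Neb targets, the sketch below implements a basic multilevel Monte Carlo estimator, which estimates E[P] as the sum of the level-wise corrections E[P_l - P_{l-1}], with fine and coarse paths on each level driven by the same random increments. The geometric-Brownian-motion option example, level count and sample counts are illustrative assumptions rather than anything produced by Neb.

```python
import math
import random

def euler_payoff(increments, dt, s0=100.0, k=100.0, r=0.05, sigma=0.2, t=1.0):
    """Discounted European call payoff from one Euler path driven by `increments`."""
    s = s0
    for dw in increments:
        s += r * s * dt + sigma * s * dw
    return math.exp(-r * t) * max(s - k, 0.0)

def mlmc_estimate(max_level=4, samples_per_level=5000, t=1.0, seed=1):
    """MLMC telescoping sum: E[P_L] ~ sum over levels of E[P_l - P_{l-1}],
    where the coarse path on each level reuses the fine path's Brownian noise."""
    rng = random.Random(seed)
    estimate = 0.0
    for level in range(max_level + 1):
        steps = 2 ** level
        dt = t / steps
        total = 0.0
        for _ in range(samples_per_level):
            dws = [math.sqrt(dt) * rng.gauss(0.0, 1.0) for _ in range(steps)]
            fine = euler_payoff(dws, dt)
            if level == 0:
                coarse = 0.0
            else:
                # Coarse path uses pairwise sums of the same increments.
                coarse_dws = [dws[2 * i] + dws[2 * i + 1] for i in range(steps // 2)]
                coarse = euler_payoff(coarse_dws, 2 * dt)
            total += fine - coarse
        estimate += total / samples_per_level
    return estimate

print(mlmc_estimate())   # rough estimate; the Black-Scholes value is about 10.45
```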

Luk W, Atasu K, Dimond R, Mencer O et al., 2016, Towards optimal custom instruction processors

© 2006 IEEE. This article consists of a collection of slides from the author's conference presentation on optimal custom instruction processors. Some of the specific topics discussed include: the special features and system specifications of extensible processors; design flow capabilities; instruction set selection and bandwidth considerations; application-specific processor synthesis; and both current and future areas of processor technology development.

CONFERENCE PAPER

Ma Y, Zhang C, Luk W, 2016, Hybrid two-stage HW/SW partitioning algorithm for dynamic partial reconfigurable FPGAs, Qinghua Daxue Xuebao/Journal of Tsinghua University, Vol: 56, ISSN: 1000-0054

© 2016, Press of Tsinghua University. All rights reserved. More and more hardware platforms are providing dynamic partial reconfiguration; thus, traditional hardware/software partitioning algorithms are no longer applicable. Some studies have formulated dynamic partial reconfiguration as mixed-integer linear programming (MILP) models to obtain solutions. However, the MILP models are slow and can only handle small problems. This paper uses heuristic algorithms to determine the status of some critical tasks to reduce the scale of the MILP problem for large problems. Tests show that this method is about 200 times faster with the same solution quality as the traditional mathematical programming method.

JOURNAL ARTICLE

Niu X, Ng N, Yuki T, Wang S, Yoshida N, Luk W et al., 2016, EURECA Compilation: Automatic Optimisation of Cycle-Reconfigurable Circuits, 26th International Conference on Field-Programmable Logic and Applications (FPL), Publisher: IEEE, ISSN: 1946-1488

CONFERENCE PAPER

Niu X, Todman T, Luk W, 2016, Self-adaptive hardware acceleration on a heterogeneous cluster, Natural Computing Series, Pages: 167-192

© Springer International Publishing Switzerland 2016. Building a cluster of computers is a common technique to significantly improve the throughput of computationally intensive applications. Communication networks connect hundreds to thousands of compute nodes to form a cluster system, where a parallelisable application workload is distributed into the compute nodes. Theoretically, heterogeneous clusters with various types of processing units are more efficient than homogeneous clusters, since some types of processing units perform better than others on certain applications. A heterogeneous cluster can achieve better cluster performance by adapting cluster configurations to assign applications to processing elements that fit well with the applications. In this chapter we describe how to build a heterogeneous cluster that can adapt to application requirements. Section 9.1 provides an overview of heterogeneous computing. Section 9.2 presents the commonly used hardware and software architectures of heterogeneous clusters. Section 9.3 discusses the use of self-awareness and self-adaptivity in two runtime scenarios of a heterogeneous cluster, and Section 9.4 presents the experimental results. Finally, Section 9.5 discusses approaches to formally verify the developed applications.

BOOK CHAPTER

Stroobandt D, Varbanescu AL, Ciobanu CB, Al Kadi M, Brokalakis A, Charitopoulos G, Todman T, Niu X, Pnevmatikatos D, Kulkarni A, Vansteenkiste E, Luk W, Santambrogio MD, Sciuto D, Huebner M, Becker T, Gaydadjiev G, Nikitakis A, Thom AJW et al., 2016, EXTRA: Towards the Exploitation of eXascale Technology for Reconfigurable Architectures, 11th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), Publisher: IEEE

CONFERENCE PAPER

Targett JS, Niu X, Russell F, Luk W, Jeffress S, Duben P et al., 2016, Lower precision for higher accuracy: Precision and resolution exploration for shallow water equations, Pages: 208-211

© 2015 IEEE. Accurate forecasts of future climate with numerical models of atmosphere and ocean are of vital importance. However, forecast quality is often limited by the available computational power. This paper investigates the acceleration of a C-grid shallow water model through the use of reduced precision targeting FPGA technology. Using a double-gyre scenario, we show that the mantissa length of variables can be reduced to 14 bits without affecting the accuracy beyond the error inherent in the model. Our reduced precision FPGA implementation runs 5.4 times faster than a double precision FPGA implementation, and 12 times faster than a multi-threaded CPU implementation. Moreover, our reduced precision FPGA implementation uses 39 times less energy than the CPU implementation and can compute a 100×100 grid for the same energy that the CPU implementation would take for a 29×29 grid.

CONFERENCE PAPER
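
The study above reduces the mantissa length of model variables to 14 bits. A quick way to emulate that kind of reduced precision in software is to zero out the low-order mantissa bits of an IEEE 754 double, as sketched below; this bit-masking emulation is a common technique for such experiments and is an assumption here, not necessarily the mechanism used in the paper's FPGA design.

```python
import math
import struct

def truncate_mantissa(x, bits):
    """Keep only the `bits` most significant bits of a float64 mantissa
    (IEEE 754 doubles carry 52 explicit mantissa bits)."""
    raw = struct.unpack("<Q", struct.pack("<d", x))[0]      # float -> 64-bit integer
    mask = ~((1 << (52 - bits)) - 1) & 0xFFFFFFFFFFFFFFFF   # clear low mantissa bits
    return struct.unpack("<d", struct.pack("<Q", raw & mask))[0]

print(truncate_mantissa(math.pi, 14))   # pi represented with a 14-bit mantissa
```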

Wang S, Niu X, Ma N, Luk W, Leong P, Peng Y et al., 2016, A scalable dataflow accelerator for real time onboard hyperspectral image classification, Pages: 105-116, ISSN: 0302-9743

© Springer International Publishing Switzerland 2016. Real-time hyperspectral image classification is a necessary primitive in many remotely sensed image analysis applications. Previous work has shown that Support Vector Machines (SVMs) can achieve high classification accuracy, but unfortunately they are very computationally expensive. This paper presents a scalable dataflow accelerator on FPGA for real-time SVM classification of hyperspectral images. To address data dependencies, we adapt a multi-class classifier based on Hamming distance. The architecture is scalable to high problem dimensionality and available hardware resources. Implementation results show that the FPGA design achieves speedups of 26x, 1335x, 66x and 14x compared with implementations on ZYNQ, ARM, DSP and Xeon processors. Moreover, one to two orders of magnitude reduction in power consumption is achieved for the AVIRIS hyperspectral image datasets.

CONFERENCE PAPER
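
The multi-class step above selects a class by Hamming distance. The sketch below shows the general error-correcting-output-codes flavour of that idea in Python: each class is assigned a binary codeword, the binary classifiers produce a predicted bit vector, and the class whose codeword is closest in Hamming distance wins. The codewords, class names and five-classifier setup are invented for illustration and are not the paper's codebook.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def decode(predicted_bits, codebook):
    """Pick the class whose codeword is nearest to the predicted bit vector."""
    return min(codebook, key=lambda label: hamming(predicted_bits, codebook[label]))

# Hypothetical 3-class codebook over 5 binary classifiers.
codebook = {
    "water":      (0, 0, 0, 0, 0),
    "vegetation": (1, 1, 1, 0, 0),
    "soil":       (0, 0, 1, 1, 1),
}
# Suppose the binary classifiers output (1, 1, 0, 0, 0) for one pixel:
print(decode((1, 1, 0, 0, 0), codebook))   # -> "vegetation" (Hamming distance 1)
```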

Yu T, Feng B, Stillwell M, Coutinho JGF, Zhao W, Liang S, Luk W, Wolf AL, Ma Y et al., 2016, Relation-Oriented Resource Allocation for Multi-Accelerator Systems, 27th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), Publisher: IEEE, Pages: 243-244, ISSN: 1063-6862

CONFERENCE PAPER

This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.
