Imperial College London

Professor Wayne Luk

Faculty of Engineering, Department of Computing

Professor of Computer Engineering
 
 
 

Contact

 

+44 (0)20 7594 8313
w.luk

 
 

Location

 

434 Huxley Building, South Kensington Campus



 

Publications


620 results found

Shao S, Guo L, Guo C, Chau TCP, Thomas DB, Luk W, Weston S et al., 2015, Recursive Pipelined Genetic Propagation for Bilevel Optimisation, 25th International Conference on Field Programmable Logic and Applications, Publisher: IEEE, ISSN: 1946-1488

Conference paper

Leong PHW, Amano H, Anderson J, Bertels K, Cardoso JMP, Diessel O, Gogniat G, Hutton M, Lee J, Luk W, Lysaght P, Platzner M, Prasanna VK, Rissa T, Silvano C, So H, Wang Y et al., 2015, Significant Papers from the First 25 Years of the FPL Conference, 25th International Conference on Field Programmable Logic and Applications, Publisher: IEEE, ISSN: 1946-1488

Conference paper

Rabozzi M, Cattaneo R, Becker T, Luk W, Santambrogio MD et al., 2015, Relocation-aware Floorplanning for Partially-Reconfigurable FPGA-based Systems, 29th IEEE International Parallel and Distributed Processing Symposium (IPDPS), Publisher: IEEE, Pages: 97-104

Conference paper

Guo L, Funie AI, Xie Z, Thomas D, Luk W et al., 2015, A general-purpose framework for FPGA-accelerated genetic algorithms, INTERNATIONAL JOURNAL OF BIO-INSPIRED COMPUTATION, Vol: 7, Pages: 361-375, ISSN: 1758-0366

Journal article

Ciobanu CB, Varbanescu AL, Pnevmatikatos D, Charitopoulos G, Niu X, Luk W, Santambrogio MD, Sciuto D, Al Kadi M, Huebner M, Becker T, Gaydadjiev G, Brokalakis A, Nikitakis A, Thom AJW, Vansteenkiste E, Stroobandt D et al., 2015, EXTRA: Towards an Efficient Open Platform for Reconfigurable High Performance Computing, 18th IEEE International Conference on Computational Science and Engineering (CSE), Publisher: IEEE, Pages: 339-342

Conference paper

Zhang C, Ma Y, Luk W, 2015, HW/SW Partitioning Algorithm Targeting MPSOC With Dynamic Partial Reconfigurable Fabric, 14th International Conference on Computer-Aided Design and Computer Graphics, Publisher: IEEE, Pages: 240-241

Conference paper

Inggs G, Thomas DB, Luk W, 2015, An Efficient, Automatic Approach to High Performance Heterogeneous Computing., CoRR, Vol: abs/1505.04417

Journal article

Burovskiy P, Grigoras P, Sherwin S, Luk W et al., 2015, Efficient Assembly for High Order Unstructured FEM Meshes, 25th International Conference on Field Programmable Logic and Applications, Publisher: IEEE, ISSN: 1946-1488

Conference paper

Grigoras P, Burovskiy P, Hung E, Luk W et al., 2015, Accelerating SpMV on FPGAs by Compressing Nonzero Values, 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Publisher: IEEE, Pages: 64-67, ISSN: 2576-2613

Conference paper

Todman T, Stilkerich S, Luk W, 2015, In-circuit temporal monitors for runtime verification of reconfigurable designs, 52nd ACM/EDAC/IEEE Design Automation Conference (DAC), Publisher: IEEE COMPUTER SOC, ISSN: 0738-100X

Conference paper

Lee K-H, Guo Z, Chow GCT, Chen Y, Luk W, Kwok K-W et al., 2015, GPU-based Proximity Query Processing on Unstructured Triangular Mesh Model, IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE COMPUTER SOC, Pages: 4405-4411, ISSN: 1050-4729

Conference paper

Chau TCP, Niu X, Eele A, Maciejowski J, Cheung PYK, Luk W et al., 2014, Mapping Adaptive Particle Filters to Heterogeneous Reconfigurable Systems, ACM Transactions on Reconfigurable Technology and Systems, Vol: 7, ISSN: 1936-7414

This article presents an approach for mapping real-time applications based on particle filters (PFs) to heterogeneous reconfigurable systems, which typically consist of multiple FPGAs and CPUs. A method is proposed to adapt the number of particles dynamically and to utilise runtime reconfigurability of FPGAs for reduced power and energy consumption. A data compression scheme is employed to reduce communication overhead between FPGAs and CPUs. A mobile robot localisation and tracking application is developed to illustrate our approach. Experimental results show that the proposed adaptive PF can reduce up to 99% of computation time. Using runtime reconfiguration, we achieve a 25% to 34% reduction in idle power. A 1U system with four FPGAs is up to 169 times faster than a single-core CPU and 41 times faster than a 1U CPU server with 12 cores. It is also estimated to be 3 times faster than a system with four GPUs.

Journal article
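As an aside, the adaptive-particle-count idea in the abstract above can be sketched in plain software. The following is a minimal illustrative sketch, not the paper's implementation: it tracks a scalar state with a random-walk model and resizes the particle population each step using an effective-sample-size rule (an assumption on my part; the paper's adaptation criterion may differ).

```python
import numpy as np

def adaptive_particle_filter(observations, n_min=100, n_max=5000,
                             process_noise=0.1, obs_noise=0.5):
    """Bootstrap particle filter that adapts its particle count.

    The adaptation criterion (effective sample size) is an illustrative
    stand-in for the paper's dynamic particle-count scheme.
    """
    rng = np.random.default_rng(0)
    n = n_max
    particles = rng.normal(0.0, 1.0, size=n)
    estimates = []
    for y in observations:
        # Propagate particles through a simple random-walk model.
        particles = particles + rng.normal(0.0, process_noise, size=n)
        # Weight by the Gaussian likelihood of the observation.
        w = np.exp(-0.5 * ((y - particles) / obs_noise) ** 2)
        w /= w.sum()
        estimates.append(float(np.dot(w, particles)))
        # Effective sample size drives the next particle count:
        # fewer particles when the weights are evenly spread.
        ess = 1.0 / np.sum(w ** 2)
        n = int(np.clip(2 * ess, n_min, n_max))
        # Resample (with replacement) down or up to the new count.
        idx = rng.choice(len(particles), size=n, p=w)
        particles = particles[idx]
    return estimates
```

On an FPGA, the propagate/weight/resample loop above is what gets pipelined, and shrinking `n` is what enables the reported power savings via runtime reconfiguration.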

Zhao W, Fu H, Yang G, Luk W et al., 2014, Patra: Parallel tree-reweighted message passing architecture

Maximum a posteriori probability inference algorithms for Markov Random Fields are widely used in many applications, such as computer vision and machine learning. Sequential tree-reweighted message passing (TRW-S) is an inference algorithm which shows good quality in finding optimal solutions. However, the performance of TRW-S in software cannot meet the requirements of many real-time applications, due to the sequential scheme and the high memory, bandwidth and computational costs. This paper proposes Patra, a novel parallel tree-reweighted message passing architecture, which involves a fully pipelined design targeting FPGA technology. We build a hybrid CPU/FPGA system to test the performance of Patra for stereo matching. Experimental results show that Patra is about 100 times faster than a software implementation of TRW-S, and 12 times faster than a GPU-based message passing algorithm. Compared with an existing design in four FPGAs, we can achieve 2 times speedup in a single FPGA. Moreover, Patra can work at video rate in many cases, such as a rate of 167 frame/sec for a standard stereo matching test case, which makes it promising for many real-time applications.

Conference paper

Gan L, Fu H, Yang C, Luk W, Xue W, Mencer O, Huang X, Yang G et al., 2014, A highly-efficient and green data flow engine for solving Euler atmospheric equations

Atmospheric modeling is an essential issue in the study of climate change. However, due to the complicated algorithmic and communication models, scientists and researchers are facing tough challenges in finding efficient solutions to solve the atmospheric equations. In this paper, we accelerate a solver for the three-dimensional Euler atmospheric equations through reconfigurable data flow engines. We first propose a hybrid design that achieves efficient resource allocation and data reuse. Furthermore, through algorithmic offsetting, fast memory table, and customizable-precision arithmetic, we map a complex Euler kernel into a single FPGA chip, which can perform 956 floating point operations per cycle. In a 1U-chassis, our CPU-DFE unit with 8 FPGA chips is 18.5 times faster and 8.3 times more power efficient than a multicore system based on two 12-core Intel E5-2697 (Ivy Bridge) CPUs, and is 6.2 times faster and 5.2 times more power efficient than a hybrid unit equipped with two 12-core Intel E5-2697 (Ivy Bridge) CPUs and three Intel Xeon Phi 5120d (MIC) cards.

Conference paper

Guo C, Luk W, 2014, Pipelined HAC Estimation Engines for Multivariate Time Series, JOURNAL OF SIGNAL PROCESSING SYSTEMS FOR SIGNAL IMAGE AND VIDEO TECHNOLOGY, Vol: 77, Pages: 117-129, ISSN: 1939-8018

Journal article

Burovskiy PA, Girdlestone S, Davies C, Sherwin S, Luk W et al., 2014, Dataflow acceleration of Krylov subspace sparse banded problems, 24th International Conference on Field Programmable Logic and Applications (FPL), 2014, Publisher: IEEE

Most of the efforts in the FPGA community related to sparse linear algebra focus on increasing the degree of internal parallelism in matrix-vector multiply kernels. We propose a parametrisable dataflow architecture presenting an alternative and complementary approach to support acceleration of banded sparse linear algebra problems which benefit from building a Krylov subspace. We use the banded structure of a matrix A to overlap the computations Ax, A²x, ..., Aᵏx by building a pipeline of matrix-vector multiplication processing elements (PEs), each performing Aⁱx. Due to on-chip data locality, the FLOPS rate sustainable by such a pipeline scales linearly with k. Our approach enables a trade-off between the number k of overlapped matrix power actions and the level of parallelism in a PE. We illustrate our approach with Google PageRank computation by power iteration for large banded single precision sparse matrices. Our design scales up to 32 sequential PEs with floating point accumulation and 80 PEs with fixed point accumulation on a Stratix V D8 FPGA. With 80 single-pipe fixed point PEs clocked at 160MHz, our design sustains 12.7 GFLOPS.

Conference paper
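The chained-PE idea in the abstract above has a simple software analogue: each PE consumes the vector produced by the previous one, so the Krylov basis is built by chaining matrix-vector multiplies rather than by forming matrix powers. A minimal sketch (the function names and the PageRank use are illustrative, not from the paper):

```python
import numpy as np

def krylov_pipeline(A, x, k):
    """Emulate a pipeline of k matrix-vector PEs: PE i consumes the
    output of PE i-1, yielding the basis {x, Ax, A^2 x, ..., A^k x}."""
    basis = [x]
    for _ in range(k):
        basis.append(A @ basis[-1])  # PE i computes A * (A^{i-1} x)
    return basis

def pagerank_power(A, iters=50, damping=0.85):
    """Power iteration for PageRank on a column-stochastic matrix A.
    A software model of the design's use case; the FPGA version streams
    the banded matrix through the PE pipeline instead."""
    n = A.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = damping * (A @ r) + (1 - damping) / n
    return r
```

The hardware benefit comes from keeping each intermediate vector on chip between PEs, which is why the sustainable FLOPS rate scales with k.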

Yang J, Lin B, Luk W, Nahar T et al., 2014, Particle filtering-based maximum likelihood estimation for financial parameter estimation, 24th International Conference on Field Programmable Logic and Applications (FPL), 2014, Publisher: IEEE

This paper presents a novel method for estimating parameters of financial models with jump diffusions. It is a Particle Filter-based Maximum Likelihood Estimation process, which uses particle streams to enable efficient evaluation of constraints and weights. We also provide a CPU-FPGA collaborative design for parameter estimation of the Stochastic Volatility with Correlated and Contemporaneous Jumps model as a case study. The result is evaluated by comparing with a CPU and a cloud computing platform. We show a 14 times speedup for the FPGA design compared with the CPU, and a similar speedup but better convergence compared with an alternative parallelisation scheme using Techila Middleware in a multi-CPU environment.

Conference paper

Hung E, Todman T, Luk W, 2014, Transparent Insertion of Latency-Oblivious Logic onto FPGAs, 24th International Conference on Field Programmable Logic and Applications (FPL), 2014, Publisher: IEEE

We present an approach for inserting latency-oblivious functionality into pre-existing FPGA circuits transparently. To ensure transparency — that such modifications do not affect the design’s maximum clock frequency — we insert any additional logic post place-and-route, using only the spare resources that were not consumed by the pre-existing circuit. The typical challenge with adding new functionality into existing circuits incrementally is that spare FPGA resources to host this functionality must be located close to the input signals that it requires, in order to minimise the impact of routing delays. In congested designs, however, such co-location is often not possible. We overcome this challenge by using flow techniques to pipeline and route signals from where they originate, potentially in a region of high resource congestion, into a region of low congestion capable of hosting new circuitry, at the expense of latency. We demonstrate and evaluate our approach by augmenting realistic designs with self-monitoring circuitry, which is not sensitive to latency. We report results on circuits operating over 200MHz and show that our insertions have no impact on timing, are 2–4 times faster than compile-time insertion, and incur only a small power overhead.

Conference paper

Liu Q, Mak T, Zhang T, Niu X, Luk W, Yakovlev AV et al., 2014, Power-Adaptive Computing System Design for Solar-Energy-Powered Embedded Systems, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol: 23, Pages: 1402-1414, ISSN: 1557-9999

Through energy harvesting systems, new energy sources are made available for many advanced applications based on environmentally embedded systems. However, the harvested power, such as solar energy, varies significantly under different ambient conditions, which in turn affects the energy conversion efficiency. In this paper, we propose an approach for designing power-adaptive computing systems to maximize the energy utilization under a variable solar power supply. Using the geometric programming technique, the proposed approach can generate a customized parallel computing structure effectively. Then, based on the prediction of the solar energy in future time slots by a multilayer perceptron neural network, a convex model-based adaptation strategy is used to modulate the power behavior of the real-time computing system. The developed power-adaptive computing system is implemented in hardware and evaluated by a solar harvesting system simulation framework for five applications. The results show that the developed power-adaptive systems track the variable power supply better. The harvested solar energy utilization efficiency is 2.46 times better than that of conventional static designs and rule-based adaptation approaches. Taken together, the proposed design approach for self-powered embedded computing systems achieves better utilization of ambient energy sources.

Journal article

Denholm S, Inoue H, Takenaka T, Becker T, Luk W et al., 2014, Low latency FPGA acceleration of market data feed arbitration, IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), Publisher: IEEE, Pages: 36-40, ISSN: 2160-0511

A critical source of information in automated trading is provided by market data feeds from financial exchanges. Two identical feeds, known as the A and B feeds, are used in reducing message loss. This paper presents a reconfigurable acceleration approach to A/B arbitration, operating at the network level, and supporting any messaging protocol. The key challenges are: providing efficient, low latency operations; supporting any market data protocol; and meeting the requirements of downstream applications. To facilitate a range of downstream applications, one windowing mode prioritising low latency, and three dynamically configurable windowing methods prioritising high reliability are provided. We implement a new low latency, high throughput architecture and compare the performance of the NASDAQ TotalView-ITCH, OPRA and ARCA market data feed protocols using a Xilinx Virtex-6 FPGA. The most resource intensive protocol, TotalView-ITCH, is also implemented in a Xilinx Virtex-5 FPGA within a network interface card. We offer latencies 10 times lower than an FPGA-based commercial design and 4.1 times lower than the hardware-accelerated IBM PowerEN processor, with throughputs more than double the required 10Gbps line rate.

Conference paper
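The core of A/B arbitration described above is deduplication by sequence number: each message arrives on two redundant feeds, and the arbitrator forwards whichever copy it sees first. A minimal software model of the low-latency mode (windowing and gap recovery, which the paper's hardware provides, are omitted; the feed representation is my assumption):

```python
import itertools

def arbitrate(feed_a, feed_b):
    """Merge two redundant market data feeds, each a list of
    (sequence_number, payload) tuples, forwarding every message the
    first time it is seen on either feed. Losses on one feed are
    masked by the other."""
    seen = set()
    out = []
    # Interleave the two feeds roughly as they would arrive.
    for msg in itertools.chain.from_iterable(
            itertools.zip_longest(feed_a, feed_b)):
        if msg is None:  # one feed shorter (e.g. dropped packets)
            continue
        seq, payload = msg
        if seq not in seen:
            seen.add(seq)
            out.append((seq, payload))
    return out
```

The hardware design performs the same first-seen test at line rate on the network interface, which is where its latency advantage over software arbitration comes from.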

Shan Y, Hao Y, Wang W, Wang Y, Chen X, Yang H, Luk W et al., 2014, Hardware Acceleration for an Accurate Stereo Vision System Using Mini-Census Adaptive Support Region, ACM TRANSACTIONS ON EMBEDDED COMPUTING SYSTEMS, Vol: 13, ISSN: 1539-9087

Journal article

Chau TCP, Targett JS, Wijeyasinghe M, Luk W, Cheung PYK, Cope B, Eele A, Maciejowski J et al., 2014, Accelerating Sequential Monte Carlo Method for Real-time Air Traffic Management, Publisher: ACM, Pages: 35-40, ISSN: 0163-5964

Conference paper

Niu X, Jin Q, Luk W, Weston S et al., 2014, A Self-Aware Tuning and Self-Aware Evaluation Method for Finite-Difference Applications in Reconfigurable Systems, ACM TRANSACTIONS ON RECONFIGURABLE TECHNOLOGY AND SYSTEMS, Vol: 7, ISSN: 1936-7406

Journal article

Le Masle A, Luk W, 2014, Mapping Loop Structures onto Parametrized Hardware Pipelines, IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, Vol: 22, Pages: 631-640, ISSN: 1063-8210

Journal article

Denholm S, Inoue H, Takenaka T, Luk W et al., 2014, Application-specific customisation of market data feed arbitration, 12th International Conference on Field-Programmable Technology (FPT), Publisher: IEEE, Pages: 322-325

Messages are transmitted from financial exchanges to update their members about changes in the market. As UDP packets are used for message transmission, members subscribe to two identical message feeds from the exchange to lower the risk of message loss or delay. As financial trades can be time sensitive, low latency arbitration between these market data feeds is of particular importance. Members must either provide generic arbitration for all of their financial applications, increasing latency, or arbitrate within each application which wastes resources and scales poorly. We present a reconfigurable accelerated approach for market feed arbitration operating at the network level. Multiple arbitrators can operate within a single FPGA to output customised feeds to downstream financial applications. Application-specific customisations are supported by each core, allowing different market feed messaging protocols, windowing operations and message buffering parameters. We model multiple-core arbitration and explore the scalability and performance improvements within and between cores. We demonstrate our design within a Xilinx Virtex-6 FPGA using the NASDAQ TotalView-ITCH 4.1 messaging standard. Our implementation operates at 16Gbps throughput, and with resource sharing, supports 12 independent cores, 33% more than simple core replication. A 56ns (7 clock cycles) windowing latency is achieved, 2.6 times lower than a hardware-accelerated CPU approach.

Conference paper

Coutinho JGF, Pell O, O'Neill E, Sanders P, McGlone J, Grigoras P, Luk W, Ragusa C et al., 2014, HARNESS project: Managing heterogeneous computing resources for a cloud platform, Pages: 324-329, ISSN: 0302-9743

Most cloud service offerings are based on homogeneous commodity resources, such as large numbers of inexpensive machines interconnected by off-the-shelf networking equipment and disk drives, to provide low-cost application hosting. However, cloud service providers have reached a limit in satisfying performance and cost requirements for important classes of applications, such as geo-exploration and real-time business analytics. The HARNESS project aims to fill this gap by developing architectural principles that enable the next generation cloud platforms to incorporate heterogeneous technologies such as reconfigurable Dataflow Engines (DFEs), programmable routers, and SSDs, and provide as a result vastly increased performance, reduced energy consumption, and lower cost profiles. In this paper we focus on three challenges for supporting heterogeneous computing resources in the context of a cloud platform, namely: (1) cross-optimisation of heterogeneous computing resources, (2) resource virtualisation and (3) programming heterogeneous platforms. © 2014 Springer International Publishing Switzerland.

Conference paper

Pnevmatikatos DN, Becker T, Brokalakis A, Gaydadjiev GN, Luk W, Papadimitriou K, Papaefstathiou I, Pau D, Pell O, Pilato C, Santambrogio MD, Sciuto D, Stroobandt D et al., 2014, Effective reconfigurable design: The FASTER approach, Pages: 318-323, ISSN: 0302-9743

While fine-grain, reconfigurable devices have been available for years, they are mostly used in a fixed functionality, "asic-replacement" manner. To exploit opportunities for flexible and adaptable run-time exploitation of fine grain reconfigurable resources (as implemented currently in dynamic, partial reconfiguration), better tool support is needed. The FASTER project aims to provide a methodology and a tool-chain that will enable designers to efficiently implement a reconfigurable system on a platform combining software and reconfigurable resources. Starting from a high-level application description and a target platform, our tools analyse the application, evaluate reconfiguration options, and implement the designer choices on underlying vendor tools. In addition, FASTER addresses micro-reconfiguration, verification, and the run-time management of system resources. We use industrial applications to demonstrate the effectiveness of the proposed framework and identify new opportunities for reconfigurable technologies. © 2014 Springer International Publishing Switzerland.

Conference paper

Guo C, Luk W, 2014, Accelerating parameter estimation for multivariate self-exciting point processes, Pages: 181-184

Self-exciting point processes are stochastic processes capturing occurrence patterns of random events. They offer powerful tools to describe and predict temporal distributions of random events like stock trading and neurone spiking. A critical calculation in self-exciting point process models is parameter estimation, which fits a model to a data set. This calculation is computationally demanding when the number of data points is large and when the data dimension is high. This paper proposes the first reconfigurable computing solution to accelerate this calculation. We derive an acceleration strategy in a mathematical specification by eliminating complex data dependency, by cutting hardware resource requirement, and by parallelising arithmetic operations. In our experimental evaluation, an FPGA-based implementation of the proposed solution is up to 79 times faster than one CPU core, and 13 times faster than the same CPU with eight cores.

Conference paper
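The "eliminating complex data dependency" step mentioned above has a well-known univariate analogue: for a Hawkes (self-exciting) process with an exponential kernel, the log-likelihood admits an O(n) recursion in place of the naive O(n²) double sum. A sketch of that recursion (univariate, whereas the paper handles the multivariate case):

```python
import math

def hawkes_loglik(times, mu, alpha, beta, T):
    """Log-likelihood of a univariate Hawkes process with intensity
    lambda(t) = mu + alpha * sum_{t_i < t} exp(-beta * (t - t_i)),
    observed on [0, T]. The recursion R_i = exp(-beta*dt) * (1 + R_{i-1})
    replaces the naive O(n^2) inner sum with O(n) work."""
    ll = 0.0
    R = 0.0       # running sum of exp(-beta * (t_i - t_j)) over j < i
    prev = None
    for t in times:
        if prev is not None:
            R = math.exp(-beta * (t - prev)) * (1.0 + R)
        ll += math.log(mu + alpha * R)
        prev = t
    # Compensator: the integral of the intensity over [0, T].
    comp = mu * T + (alpha / beta) * sum(
        1.0 - math.exp(-beta * (T - t)) for t in times)
    return ll - comp
```

Removing the quadratic dependency in this way is also what makes the computation amenable to a deep hardware pipeline: each event's contribution depends only on the previous running state.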

Guo L, Thomas DBJ, Guo C, Luk W et al., 2014, Automated framework for FPGA-based parallel genetic algorithms, 24th International Conference on Field Programmable Logic and Applications, Publisher: IEEE

Parallel genetic algorithms (pGAs) are a variant of genetic algorithms which can promise substantial gains in both efficiency of execution and quality of results. pGAs have attracted researchers to implement them in FPGAs, but the implementation always needs large human effort. To simplify the implementation process and make the hardware pGA designs accessible to potential non-expert users, this paper proposes a general-purpose framework, which takes in a high-level description of the optimisation target and automatically generates pGA designs for FPGAs. Our pGA system exploits the two levels of parallelism found in GA instances and genetic operations, allowing users to tailor the architecture for resource constraints at compile-time. The framework also enables users to tune a subset of parameters at run-time without time-consuming recompilation. Our pGA design is more flexible than previous ones, and has an average speedup of 26 times compared to the multi-core counterparts over five combinatorial and numerical optimisation problems. When compared with a GPU, it also shows a 6.8 times speedup over a combinatorial application.

Conference paper
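The two levels of parallelism mentioned above (across GA instances, and within genetic operations) are easiest to see in an island-model sketch, where several populations evolve independently and periodically exchange their best individuals. This is an illustrative software model under my own parameter choices, not the framework's generated design; operator-level parallelism stays implicit here (on the FPGA it maps to pipelined genetic operators).

```python
import random

def island_ga(fitness, genome_len, islands=4, pop=20,
              generations=50, migrate_every=10, seed=0):
    """Island-model parallel GA over bitstring genomes: each island
    runs truncation selection, one-point crossover and point mutation,
    with ring migration of the best individual every few generations."""
    rng = random.Random(seed)
    pops = [[[rng.randint(0, 1) for _ in range(genome_len)]
             for _ in range(pop)] for _ in range(islands)]
    for gen in range(generations):
        for i, p in enumerate(pops):
            p.sort(key=fitness, reverse=True)
            elite = p[:pop // 2]                   # truncation selection
            children = []
            while len(elite) + len(children) < pop:
                a, b = rng.sample(elite, 2)
                cut = rng.randrange(1, genome_len)
                child = a[:cut] + b[cut:]          # one-point crossover
                j = rng.randrange(genome_len)
                child[j] ^= 1                      # point mutation
                children.append(child)
            pops[i] = elite + children
        if gen % migrate_every == 0:
            # Ring migration: each island sends its best to the next,
            # replacing that island's worst member.
            best = [max(p, key=fitness) for p in pops]
            for i in range(islands):
                pops[(i + 1) % islands][-1] = best[i][:]
    return max((max(p, key=fitness) for p in pops), key=fitness)
```

For example, on the OneMax problem (`fitness=sum`), the islands converge on the all-ones genome; each island's loop body is independent, which is what makes the instance-level parallelism map naturally onto replicated hardware cores.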


This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.
