Publications

Jin Q, Becker T, Luk W, Thomas D et al., 2012, Optimising explicit finite difference option pricing for dynamic constant reconfiguration, Pages: 165-172

This paper demonstrates a novel optimisation methodology to adjust stencil-based numerical procedures at the algorithm level, so as to reduce not only the amount of hardware resource consumption per kernel but also the amount of computation required to achieve desired result accuracy, when mapping the algorithm to reconfigurable hardware using dynamic constant reconfiguration. As a result, less area is needed to support run-time reconfiguration, and fewer computational steps are required in the numerical procedure to obtain a result with a given error tolerance. We analyse one thousand fixed-point implementations on a Virtex-6 XC6VLX760 FPGA for randomly generated option pricing problems, which are representative of industrial computation. When comparing optimised implementations to the un-optimised ones, the reconfiguration area upper bound is reduced by 22%; the average number of computational steps is reduced by 23%; and the area-computation-time product is reduced by 40%; while the numerical errors of the results are kept below the error tolerance level used in industry. © 2012 IEEE.

Citations: 5

Conference paper

Niu X, Jin Q, Luk W, Liu Q, Pell O et al., 2012, Exploiting run-time reconfiguration in stencil computation, Pages: 173-180

Stencil computation is computationally intensive and required by many applications. This paper proposes an approach to exploit run-time reconfigurability of field-programmable accelerators for stencil computation. System throughput is optimized by partitioning, analysing and scheduling tasks in applications to remove idle functions. To evaluate the proposed approach, Reverse Time Migration (RTM), a high performance application, is developed. Our optimized run-time reconfigurable solution, which targets a Virtex-6 FPGA in a Maxeler MAX3424A system, achieves an improved throughput of 102.8 GFlop/s, up to two orders of magnitude faster than the CPU reference designs, 1.59 times faster than the best published GPU and FPGA results, and 1.45 times faster than an optimized static implementation. © 2012 IEEE.

Citations: 20

Conference paper

Chau TCP, Luk W, Cheung PYK, Eele A, Maciejowski J et al., 2012, Adaptive sequential Monte Carlo approach for real-time applications, Pages: 527-530

This paper presents an adaptive Sequential Monte Carlo approach for real-time applications. The Sequential Monte Carlo method is employed to estimate the states of dynamic systems using weighted particles. The proposed approach reduces the run-time computation complexity by adapting the size of the particle set. Multiple processing elements on FPGAs are dynamically allocated for improved energy efficiency without violating real-time constraints. A robot localisation application is developed based on the proposed approach. Compared to a non-adaptive implementation, the dynamic energy consumption is reduced by up to 70% without affecting the quality of solutions. © 2012 IEEE.

Citations: 6

Conference paper

Le Masle A, Luk W, 2012, Detecting power attacks on reconfigurable hardware, Pages: 14-19

We present a novel framework to detect power attacks on crypto-systems implemented on reconfigurable hardware. We monitor the device supply voltage with a ring oscillator-based on-chip power monitor. In order to detect the insertion of power measurement circuits onto a device's power rail, a power attack detection strategy taking into account abnormal supply voltages and power rail resistance values is developed. Our strategy is integrated into an on-chip attack detector. The entire framework implementation only takes 3300 LUTs of a Spartan-6 LX45 FPGA, which is 12% of the total area available. Our results on an AES and RSA crypto-system show that our attack detection framework can reach false-positive and false-negative rates as low as 0% over all our test cases if proper operating margins are set. © 2012 IEEE.

Citations: 19

Conference paper

Cardoso JMP, Carvalho T, Coutinho JGF, Diniz PC, Petrov Z, Luk W et al., 2012, Controlling hardware synthesis with aspects, Pages: 226-233

The synthesis and mapping of applications to configurable embedded systems is a notoriously hard process. Tools have a wide range of parameters, which interact in very unpredictable ways, thus creating a large and complex design space. When exploring this space, designers must understand the interfaces to the various tools and apply, often manually, a sequence of tool-specific transformations making this an extremely cumbersome and error-prone process. This paper describes the use of aspect-oriented techniques for capturing synthesis strategies for tuning the performance of applications' kernels. We illustrate the use of this approach when designing application-specific architectures generated by a high-level synthesis tool. The results highlight the impact of the various strategies when targeting custom hardware and expose the difficulties in devising these strategies. © 2012 IEEE.

Citations: 2

Conference paper

Pnevmatikatos D, Becker T, Brokalakis A, Bruneel K, Gaydadjiev G, Luk W, Papadimitriou K, Papaefstathiou I, Pell O, Pilato C, Robart M, Santambrogio MD, Sciuto D, Stroobandt D, Todman T et al., 2012, FASTER: Facilitating analysis and synthesis technologies for effective reconfiguration, Pages: 234-241

The FASTER project aims to ease the definition, implementation and use of dynamically changing hardware systems. Our motivation stems from the promise reconfigurable systems hold for achieving better performance and extending product functionality and lifetime via the addition of new features that work at hardware speed. This is a clear advantage over the more straightforward software component adaptivity. However, designing a changing hardware system is both challenging and time consuming. The FASTER project will facilitate the use of reconfigurable technology by providing a complete methodology that enables designers to easily specify, analyse, implement and verify applications on platforms with general-purpose processors and acceleration modules implemented in the latest reconfigurable technology. To better adapt to different application requirements, the tool-chain will support both region-based and micro-reconfiguration and provide a flexible run-time system that will efficiently manage the reconfigurable resources. We will use applications from the embedded, high performance computing, and desktop domains to demonstrate the potential benefits of the FASTER tools on metrics such as performance, power consumption and total ownership cost. © 2012 IEEE.

Citations: 5

Conference paper

Liu Q, Todman T, Luk W, Constantinides GA et al., 2012, Optimizing Hardware Design by Composing Utility-Directed Transformations, IEEE TRANSACTIONS ON COMPUTERS, Vol: 61, Pages: 1800-1812, ISSN: 0018-9340

Citations: 3

Journal article

Spacey S, Luk W, Kelly PHJ, Kuhn D et al., 2012, Improving communication latency with the write-only architecture, JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, Vol: 72, Pages: 1617-1627, ISSN: 0743-7315

Citations: 4

Journal article

Cheung K, Schultz SR, Luk W, 2012, A large-scale spiking neural network accelerator for FPGA systems, Pages: 113-120, ISSN: 0302-9743

Spiking neural networks (SNN) aim to mimic membrane potential dynamics of biological neurons. They have been used widely in neuromorphic applications and neuroscience modeling studies. We design a parallel SNN accelerator for producing large-scale cortical simulation targeting an off-the-shelf Field-Programmable Gate Array (FPGA)-based system. The accelerator parallelizes synaptic processing with run time proportional to the firing rate of the network. Using only one FPGA, this accelerator is estimated to support simulation of 64K neurons 2.5 times real-time, and achieves a spike delivery rate which is at least 1.4 times faster than a recent GPU accelerator with a benchmark toroidal network. © 2012 Springer-Verlag.

Citations: 46

Conference paper

Yu C, Smith AM, Luk W, Leong PHW, Wilton SJE et al., 2012, Optimizing Floating Point Units in Hybrid FPGAs, IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, Vol: 20, Pages: 1295-1303, ISSN: 1063-8210

Citations: 4

Journal article

Cardoso JMP, Carvalho T, Coutinho JGF, Luk W, Nobre R, Diniz PC, Petrov Z et al., 2012, LARA: An aspect-oriented programming language for embedded systems, Pages: 179-190

The development of applications for high-performance embedded systems is typically a long and error-prone process. In addition to the required functions, developers must consider various and often conflicting non-functional application requirements such as performance and energy efficiency. The complexity of this process is exacerbated by the multitude of target architectures and the associated retargetable mapping tools. This paper introduces an Aspect-Oriented Programming (AOP) approach that conveys domain knowledge and non-functional requirements to optimizers and mapping tools. We describe a novel AOP language, LARA, which allows the specification of compilation strategies to enable efficient generation of software code and hardware cores for alternative target architectures. We illustrate the use of LARA for code instrumentation and analysis, and for guiding the application of compiler and hardware synthesis optimizations. An important LARA feature is its capability to deal with different join points, action models, and attributes, and to generate an aspect intermediate representation. We present examples of our aspect-oriented hardware/software design flow for mapping real-life application codes to embedded platforms based on Field Programmable Gate Array (FPGA) technology. © 2012 ACM.

Citations: 77

Conference paper

Coutinho JGF, Carvalho T, Durand S, Cardoso JMP, Nobre R, Diniz PC, Luk W et al., 2012, Experiments with the LARA aspect-oriented approach, Pages: 27-30

This demonstration presents a novel design-flow and aspect-oriented language called LARA [1], which is currently used to guide the mapping of high-level C application codes to heterogeneous high-performance embedded systems. In particular, LARA is capable of capturing complex strategies and schemes involving: hardware/software partitioning, code specialization, source code transformations and code instrumentation. A key element of LARA, and a distinguishing feature from existing approaches, is its ability to support the specification of non-functional requirements and user knowledge in a non-invasive way in the exploration of suitable implementations. The design-flow incorporates several tools, such as a LARA frontend, a hardware/software partitioning tool, an aspect weaver, cost estimator, and a source-level transformation engine. All these components can be coordinated as part of an elaborate application mapping strategy using LARA. In this demonstration, we illustrate how non-functional cross-cutting concerns such as runtime monitorization and performance are codified and described in LARA and how the weaving process affects selected applications. Furthermore, we also explain how third-party tools, such as gprof, can be incorporated into the design-flow and aspect description, for instance, to affect the hardware/software partitioning process. We demonstrate how LARA can be used to extract run-time information, such as range values of variables, and can control code transformations and compiler optimizations addressing customized implementations of the corresponding computations on FPGAs. © 2012 ACM.

Citations: 4

Conference paper

Tse AHT, Thomas D, Luk W, 2012, Design Exploration of Quadrature Methods in Option Pricing, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol: 20, Pages: 818-826, ISSN: 1063-8210

Journal article

Thomas DB, Luk W, 2012, The LUT-SR Family of Uniform Random Number Generators for FPGA Architectures, IEEE Transactions on Very Large Scale Integration Systems, Vol: PP

Field-programmable gate array (FPGA) optimized random number generators (RNGs) are more resource-efficient than software-optimized RNGs because they can take advantage of bitwise operations and FPGA-specific features. However, it is difficult to concisely describe FPGA-optimized RNGs, so they are not commonly used in real-world designs. This paper describes a type of FPGA RNG called a LUT-SR RNG, which takes advantage of bitwise XOR operations and the ability to turn lookup tables (LUTs) into shift registers of varying lengths. This provides a good resource-quality balance compared to previous FPGA-optimized generators, between the previous high-resource high-period LUT-FIFO RNGs and low-resource low-quality LUT-OPT RNGs, with quality comparable to the best software generators. The LUT-SR generators can also be expressed using a simple C++ algorithm contained within this paper, allowing 60 fully-specified LUT-SR RNGs with different characteristics to be embedded in this paper, backed up by an online set of very high speed integrated circuit hardware description language (VHDL) generators and test benches.

Journal article

Jin Q, Dong D, Tse AHT, Chow GCT, Thomas DB, Luk W, Weston S et al., 2012, Multi-level customisation framework for curve based Monte Carlo financial simulations, Pages: 187-201, ISSN: 0302-9743

One of the main challenges when accelerating financial applications using reconfigurable hardware is the management of design complexity. This paper proposes a multi-level customisation framework for automatic generation of complex yet highly efficient curve based financial Monte Carlo simulators on reconfigurable hardware. By identifying multiple levels of functional specialisations and the optimal data format for the Monte Carlo simulation, we allow different levels of programmability in our framework to retain good performance and support multiple applications. Designs targeting a Virtex-6 SX475T FPGA generated by our framework are about 40 times faster than single-core software implementations on an i7-870 quad-core CPU at 2.93 GHz; they are over 10 times faster and 20 times more energy efficient than 4-core implementations on the same i7-870 quad-core CPU, and are over three times more energy efficient and 36% faster than a highly optimised implementation on an NVIDIA Tesla C2070 GPU at 1.15 GHz. In addition, our framework is platform independent and can be extended to support CPU and GPU applications. © 2012 Springer-Verlag.

Citations: 8

Conference paper

Tse AHT, Chow GCT, Jin Q, Thomas DB, Luk W et al., 2012, Optimising performance of quadrature methods with reduced precision, Pages: 251-263, ISSN: 0302-9743

This paper presents a generic precision optimisation methodology for quadrature computation targeting reconfigurable hardware to maximise performance at a given error tolerance level. The proposed methodology optimises performance by considering integration grid density versus mantissa size of floating-point operators. The optimisation provides the number of integration points and mantissa size with maximised throughput while meeting given error tolerance requirement. Three case studies show that the proposed reduced precision designs on a Virtex-6 SX475T FPGA are up to 6 times faster than comparable FPGA designs with double precision arithmetic. They are up to 15.1 times faster and 234.9 times more energy efficient than an i7-870 quad-core CPU, and are 1.2 times faster and 42.2 times more energy efficient than a Tesla C2070 GPU. © 2012 Springer-Verlag.

Citations: 8

Conference paper

Liu Q, Luk W, 2012, Heterogeneous systems for energy efficient scientific computing, Pages: 64-75, ISSN: 0302-9743

This paper introduces a novel approach for exploring heterogeneous computing engines which include GPUs and FPGAs as accelerators. Our goal is to systematically automate finding solutions for such engines that maximize energy efficiency while meeting requirements in throughput and in resource constraints. The proposed approach, based on a linear programming model, enables optimization of system throughput and energy efficiency, and analysis of energy efficiency sensitivity and power consumption issues. It can be used in evaluating current and future computing hardware and interfaces to identify appropriate combinations. A heterogeneous system containing a CPU, a GPU and an FPGA with a PCI Express interface is studied based on the High Performance Linpack application. Results indicate that such a heterogeneous computing system is able to provide energy-efficient solutions to scientific computing with various performance demands. The improvement of system energy efficiency is more sensitive to some of the system components, for example in the studied system concurrently improving the energy efficiency of the interface and the GPU by 10 times could lead to over 10 times improvement of the system energy efficiency. © 2012 Springer-Verlag.

Citations: 17

Conference paper

Liu Q, Todman T, Luk W, Constantinides GA et al., 2012, Automated Mapping of the MapReduce Pattern onto Parallel Computing Platforms, JOURNAL OF SIGNAL PROCESSING SYSTEMS FOR SIGNAL IMAGE AND VIDEO TECHNOLOGY, Vol: 67, Pages: 65-78, ISSN: 1939-8018

Citations: 1

Journal article

Denholm S, Tsoi KH, Pietzuch P, Luk W et al., 2012, Efficient communication for FPGA clusters, 8th International Symposium, ARC 2012, Publisher: Springer Berlin Heidelberg, Pages: 335-341, ISSN: 0302-9743

Efficient communication between nodes is critical for achieving high performance in a computer cluster. Based on a dedicated inter-accelerator network, we enhance this communication with advanced networking functions, such as broadcasting and priority routing. This work enables decoupling user applications from physical network implementations, improving overall communication efficiency and modularity. A performance model is introduced taking into account application and platform specific parameters. Experiments are performed for various network configurations and application patterns. The results show up to a 55% reduction of communication time when employing our approach.

Conference paper

Ng N, Yoshida N, Niu XY, Tsoi KH, Luk W et al., 2012, Session types: towards safe and fast reconfigurable programming, ACM SIGARCH Computer Architecture News, Vol: 40, Pages: 22-22, ISSN: 0163-5964

This paper introduces a new programming framework based on the theory of session types for safe, reconfigurable parallel designs. We apply the session type theory to C and Java programming languages and demonstrate that the session-based languages can offer a clear and tractable framework to describe communications between parallel components and guarantee communication-safety and deadlock-freedom by compile-time type checking. Many representative communication topologies such as a ring or scatter-gather can be programmed and verified in session-based programming languages. Case studies involving N-body simulation and K-means clustering are used to illustrate the session-based programming style and to demonstrate that the session-based languages perform competitively against MPI counterparts in an FPGA-based heterogeneous cluster, as well as the potential of integrating them with FPGA acceleration.

Journal article

Yiu KFC, Lu Y, Ho CH, Luk W, Huo J, Nordholm S et al., 2012, Reconfigurable FPGA-based switching path frequency-domain echo canceller with applications to voice control device, DIGITAL SIGNAL PROCESSING, Vol: 22, Pages: 376-390, ISSN: 1051-2004

Citations: 3

Journal article

Atasu K, Luk W, Mencer O, Ozturan C, Dundar G et al., 2012, FISH: Fast Instruction SyntHesis for Custom Processors, IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, Vol: 20, Pages: 52-65, ISSN: 1063-8210

Citations: 20

Journal article

Sato Y, Inoguchi Y, Luk W, Nakamura T et al., 2012, Evaluating Reconfigurable Dataflow Computing Using the Himeno Benchmark, International Conference on Reconfigurable Computing and FPGAs (ReConFig), Publisher: IEEE, ISSN: 2325-6532

Citations: 2

Conference paper

Niu X, Tsoi KH, Luk W, 2012, Self-Adaptive Heterogeneous Cluster with Wireless Network, 26th IEEE International Parallel and Distributed Processing Symposium (IPDPS) / Workshop on High Performance Data Intensive Computing, Publisher: IEEE, Pages: 306-311, ISSN: 2164-7062

Citations: 1

Conference paper

Guo C, Fu H, Luk W, 2012, A Fully-Pipelined Expectation-Maximization Engine for Gaussian Mixture Models, 11th International Conference on Field-Programmable Technology (FPT), Publisher: IEEE, Pages: 182-189

Citations: 20

Conference paper

Chow GCT, Luk W, Leong PHW, 2012, A Mixed Precision Methodology for Mathematical Optimisation, 20th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), Publisher: IEEE, Pages: 33-36

Citations: 1

Conference paper

Todman T, Luk W, 2012, Reconfigurable design automation by high-level exploration, 23rd IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP), Publisher: IEEE, Pages: 185-188, ISSN: 2160-0511

Citations: 2

Conference paper

Kurek M, Luk W, 2012, Parametric Reconfigurable Designs with Machine Learning Optimizer, 11th International Conference on Field-Programmable Technology (FPT), Publisher: IEEE, Pages: 109-112

Citations: 2

Conference paper

Papadimitriou K, Pilato C, Pnevmatikatos D, Santambrogio MD, Ciobanu C, Todman T, Becker T, Davidson T, Niu X, Gaydadjiev G, Luk W, Stroobandt D et al., 2012, Novel Design Methods and a Tool Flow for Unleashing Dynamic Reconfiguration, 15th IEEE International Conference on Computational Science and Engineering (CSE) / 10th IEEE/IFIP International Conference on Embedded and Ubiquitous Computing (EUC), Publisher: IEEE, Pages: 391-398, ISSN: 1949-0828

Citations: 1

Conference paper

Shan Y, Wang Z, Wang W, Hao Y, Wang Y, Tsoi K, Luk W, Yang H et al., 2012, FPGA based Memory Efficient High Resolution Stereo Vision System for Video Tolling, 11th International Conference on Field-Programmable Technology (FPT), Publisher: IEEE, Pages: 29-32

Citations: 17

Conference paper

Professor Wayne Luk
