Imperial College London

Professor Wayne Luk

Faculty of Engineering, Department of Computing

Professor of Computer Engineering

Contact

 

+44 (0)20 7594 8313, w.luk, Website

Location

 

434 Huxley Building, South Kensington Campus


Publications


611 results found

Bertels K, Lattanzi A, Ciavattini E, Bettarelli F, Chiaradia MT, Nutricato R, Morea A, Antola A, Ferrandi F, Lattuada M, Pilato C, Sciuto D, Meeuws RJ, Yankova Y, Sima VM, Sigdel K, Luk W, De Figueiredo Coutinho JG, Ming Lam Y, Todman T, Michelotti A, Cerruto A et al., 2012, The hArtes tool chain, Hardware/Software Co-design for Heterogeneous Multi-core Platforms: The hArtes Toolchain, Pages: 9-109, ISBN: 9789400714052

© Springer Science+Business Media B.V. 2012. This chapter describes the different design steps needed to go from legacy code to a transformed application that can be efficiently mapped on the hArtes platform.

BOOK CHAPTER

Atasu K, Luk W, Mencer O, Ozturan C, Dundar G et al., 2012, FISH: Fast Instruction SyntHesis for Custom Processors, IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, Vol: 20, Pages: 52-65, ISSN: 1063-8210

JOURNAL ARTICLE

Tsoi KH, Luk W, 2011, Power profiling and optimization for heterogeneous multi-core systems, ACM SIGARCH Computer Architecture News, Vol: 39, Pages: 8-8, ISSN: 0163-5964

JOURNAL ARTICLE

Liu Q, Luk W, 2011, Objective-driven workload allocation in heterogeneous computing systems

In this work, we explore heterogeneous computing hardware, including CPUs, GPUs and FPGAs, for scientific computing. We study system metrics such as throughput, energy efficiency and temperature, and formulate the problem of workload allocation among computing hardware as mathematical models with regard to these three metrics. The workload allocation approach is evaluated using Linpack on a hardware platform containing one CPU, one GPU and one FPGA. Results show that the heterogeneous computing system with appropriate workload allocation provides high energy efficiency, with a peak value of 1.1 GFLOPS/W, and reduces power consumption by 56.54%; and that workload allocation schemes differ significantly depending on the system metric being optimised. © 2011 IEEE.

CONFERENCE PAPER

Das J, Lam A, Wilton SJE, Leong PHW, Luk W et al., 2011, An Analytical Model Relating FPGA Architecture to Logic Density and Depth, IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, Vol: 19, Pages: 2229-2242, ISSN: 1063-8210

JOURNAL ARTICLE

Le Masle A, Chow GCT, Luk W, 2011, Constant power reconfigurable computing

We present Constant Power Reconfigurable Computing, a general and device-independent framework based on a closed-loop control system used to keep the power consumption constant for any reconfigurable computing design targeting FPGA implementation. We develop an on-chip power consumer, an on-chip power monitor and a proportional-integral-derivative controller with circuit primitives available in most commercial FPGAs. We demonstrate the effectiveness of the proposed methodology on a square-and-multiply exponentiation circuit implemented on a Spartan-6 LX45 FPGA board. By reducing the peak autocorrelation values by a factor of 2.7 on average, the proposed Constant Power Reconfigurable Computing approach decreases the information leaked by the power consumption of this system with only 26% area overhead and 28% power overhead. © 2011 IEEE.

CONFERENCE PAPER

Susanto KW, Luk W, 2011, Automating formal verification of customized soft-processors

Soft-processors, instruction processors implemented in FPGA technology, are often customizable to support domain-specific optimization. However, the correctness of customized soft-processors, executing the associated machine code, is often not obvious. This paper proposes a novel approach for verifying the implementation of an application program for a customized soft-processor, based on the ACL2 theorem prover. The correctness proof involves verifying a machine code program executing on the target hardware device against a high-level specification of the application program. We illustrate the proposed approach with several case studies, showing how processors with different custom instructions and with different numbers of pipeline stages can be automatically produced and verified; such processors have a range of trade-offs in performance, size, power and energy consumption to meet different requirements. © 2011 IEEE.

CONFERENCE PAPER

Osborne WG, Luk W, Coutinho JGF, Mencer O et al., 2011, Energy reduction by systematic run-time reconfigurable hardware deactivation, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol: 6760 LNCS, Pages: 354-369, ISSN: 0302-9743

This paper describes a method of developing energy-efficient run-time reconfigurable hardware designs. The key idea is to systematically deactivate part of the hardware using word-length optimisation techniques, and then select the best reconfiguration strategy: multiple bitstream reconfiguration or component multiplexing. When multiplexing between different parts of the circuit, it may not always be possible to gate the clock to the unwanted components in FPGAs. Different methods of achieving the same effect while minimising the area used for the control logic are investigated. A model is used to determine the conditions under which reconfiguring the bitstream is more energy-efficient than multiplexing part of the design, based on power measurements taken on 130nm and 90nm devices. Various case studies, such as ray tracing, B-Splines, vector multiplication and inner product, are used to illustrate this approach. © 2011 Springer-Verlag Berlin Heidelberg.

JOURNAL ARTICLE

Cope B, Cheung PYK, Luk W, Howes L et al., 2011, A systematic design space exploration approach to customising multi-processor architectures: Exemplified using graphics processors, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol: 6760 LNCS, Pages: 63-83, ISSN: 0302-9743

A systematic approach to customising Homogeneous Multi-Processor (HoMP) architectures is described. The approach involves a novel design space exploration tool and a parameterisable system model. Post-fabrication customisation options for using reconfigurable logic with a HoMP are classified. The adoption of the approach in exploring pre- and post-fabrication customisation options to optimise an architecture's critical paths is then described. The approach and steps are demonstrated using the architecture of a graphics processor. We also analyse on-chip and off-chip memory access for systems with one or more processing elements (PEs), and study the impact of the number of threads per PE on the amount of off-chip memory access and the number of cycles for each output. It is shown that post-fabrication customisation of a graphics processor can provide up to four times performance improvement for negligible area cost. © 2011 Springer-Verlag Berlin Heidelberg.

JOURNAL ARTICLE

Becker T, Jin Q, Luk W, Weston S et al., 2011, Dynamic constant reconfiguration for explicit finite difference option pricing, Pages: 176-181

This paper explores the reconfiguration of slowly changing constants in an explicit finite difference solver for option pricing. Numerical methods for option pricing, such as finite difference, are computationally very complex and can be aided by hardware acceleration. Such hardware implementations can be further improved by specialising the circuit for constants, and reconfiguring the circuit when the constants change. In this paper we demonstrate how this concept can be applied to the pricing of European and American options. We present an analytical optimisation approach that explores the benefit of specialised designs over a static one. The key to this approach is the performance and area estimation of kernels that is based on the parameters of arithmetic operators inside the kernel. This allows us to quickly explore several design options without building full designs. Our experimental results on a Xilinx XC6VLX760 FPGA show that with a partially reconfigurable design performance can be improved by a factor of 4.7 over a design without reconfiguration. © 2011 IEEE.

CONFERENCE PAPER

Yu C, Cox F, Luk W, Cheung RCC et al., 2011, Hydrate: Hybrid reconfigurable architecture expressions

This paper presents Hydrate (HYbriD Reconfigurable ArchiTecture Expressions), a generic architecture description language for exploring hybrid FPGA designs. Hydrate is based on the XML architecture description for the VPR tool. These expressions consist of variable, repeat and conditional statements to allow flexible, reusable and readable description for FPGA architectures, without modifying the VPR core. Two case studies, involving modern FPGA and floating-point applications, are used to illustrate our approach. Compared with VPR5.0 and VPR6.0, the architecture file format adopted by Hydrate is often much smaller and easier to understand, and over 90% of file size reduction for complex FPGAs can be achieved. Moreover, Hydrate is compatible with different versions of VPR, so that the powerful VPR tool flow can be used to explore future architectures. © 2011 IEEE.

CONFERENCE PAPER

Betkaoui B, Thomas DB, Luk W, Przulj N et al., 2011, A framework for FPGA acceleration of large graph problems: Graphlet counting case study

In many application domains, data are represented using large graphs involving millions of vertices and edges. Graph analysis algorithms, such as finding short paths and isomorphic subgraphs, are largely dominated by memory latency. Large cluster-based computing platforms can process graphs efficiently if the graph data can be partitioned, and on a smaller scale partitioning can be used to allocate graphs to low-latency on-chip RAMs in reconfigurable devices. However, there are many graph classes, such as scale-free social networks, which lack the locality to make partitioning graph data an efficient solution to the latency problem and are far too large to fit in on-chip RAMs and caches. In this paper, we present a framework for reconfigurable hardware acceleration of these large-scale graph problems that are difficult to partition and require high-latency off-chip memory storage. Our reconfigurable architecture tolerates off-chip memory latency by using a memory crossbar that connects many parallel identical processing elements to shared off-chip memory, without a traditional cached memory hierarchy. Quantitative comparison between the software and hardware performance of a graphlet counting case-study shows that our hardware implementation outperforms a quad-core software implementation by 10 times for large graphs. This speedup includes all software and IO overhead required, and reduces execution time for this common bioinformatics algorithm from about 2 hours to just 12 minutes. These results demonstrate that our methodology for accelerating graph algorithms is a promising approach for efficient parallel graph processing. © 2011 IEEE.

CONFERENCE PAPER

Niu XY, Tsoi KH, Luk W, 2011, Reconfiguring distributed applications in FPGA accelerated cluster with wireless networking, Pages: 545-550

FPGA accelerators are capable of improving computation and energy efficiency of many applications targeting a cluster of machines. In this work, we focus on an FPGA-accelerated cluster system coupled with a wireless network. Compared with conventional Ethernet-based approaches, the proposed system with a wireless network enables a lightweight and efficient method for the FPGA devices to exchange information directly. Customisable monitoring facilities are developed to support changing a distributed application dynamically at run time. The N-Body simulation application is used to demonstrate the effectiveness and potential of the proposed system. Experiments show that this approach can achieve up to 4.2 times improvement in latency. By applying the proposed inter-FPGA wireless network to the N-Body application, we achieve enhanced power efficiency while fulfilling thermal constraints in all nodes. © 2011 IEEE.

CONFERENCE PAPER

Jin Q, Luk W, Thomas DB, 2011, Unifying finite difference option-pricing for hardware acceleration, Pages: 6-9

The explicit finite difference method is widely used in finance for pricing many kinds of options. Its regular computational pattern makes it an ideal candidate for acceleration using reconfigurable hardware. However, because the corresponding hardware designs must be optimised both for the specific option and for the target platform, it is challenging and time consuming to develop designs efficiently and productively. This paper presents a unifying framework for describing and automatically implementing financial explicit finite difference procedures in reconfigurable hardware, allowing parallelised and pipelined implementations to be created from high-level mathematical expressions. The proposed framework is demonstrated using three option pricing problems. Our results show that an implementation from our framework targeting a Virtex-6 device at 310MHz is more than 24 times faster than a software implementation fully optimised by the Intel compiler on a four-core Xeon CPU at 2.66GHz. In addition, the latency of the FPGA solvers is up to 90 times lower than the corresponding software solvers. © 2011 IEEE.

CONFERENCE PAPER
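
For readers unfamiliar with the explicit finite difference method named in the abstract above, the following is a minimal software sketch of a textbook scheme for a put option under the Black-Scholes model. It is not the authors' hardware framework. The function name `explicit_fd_put`, the grid sizes, the choice of an upper grid bound of three times the strike, and the boundary conditions are all assumptions made for illustration.

```python
import math


def explicit_fd_put(s0, k, r, sigma, t, m=150, n=2000, american=False):
    """Price a put with a textbook explicit finite difference scheme.

    Illustrative sketch only: the grid layout, boundary conditions and
    parameter names are assumptions, not taken from the paper.
    """
    s_max = 3.0 * k            # assumed upper bound of the asset-price grid
    ds = s_max / m             # asset-price step
    dt = t / n                 # time step
    # The explicit scheme is only conditionally stable; reject steps that
    # would make the central coefficient negative at the top of the grid.
    if dt > 1.0 / (sigma ** 2 * m ** 2 + r):
        raise ValueError("time step too large for a stable explicit scheme")

    payoff = [max(k - i * ds, 0.0) for i in range(m + 1)]
    v = payoff[:]              # option values at expiry

    # Coefficients of the three-point stencil; they depend only on the
    # grid index i, which is what gives the regular computational pattern.
    a = [0.5 * dt * (sigma ** 2 * i * i - r * i) for i in range(m + 1)]
    b = [1.0 - dt * (sigma ** 2 * i * i + r) for i in range(m + 1)]
    c = [0.5 * dt * (sigma ** 2 * i * i + r * i) for i in range(m + 1)]

    for step in range(1, n + 1):
        tau = step * dt        # time remaining to expiry after this step
        new = [0.0] * (m + 1)
        # Boundary at S = 0: strike (American) or discounted strike (European).
        new[0] = k if american else k * math.exp(-r * tau)
        new[m] = 0.0           # put is treated as worthless at s_max
        for i in range(1, m):
            new[i] = a[i] * v[i - 1] + b[i] * v[i] + c[i] * v[i + 1]
            if american:
                # Early exercise: value cannot fall below intrinsic value.
                new[i] = max(new[i], payoff[i])
        v = new

    # Linear interpolation between the two grid nodes around s0.
    idx = s0 / ds
    lo = min(int(idx), m - 1)
    w = idx - lo
    return (1.0 - w) * v[lo] + w * v[lo + 1]
```

Each new grid value depends only on three neighbouring values from the previous time step, so all interior points of a step can be computed independently. This is the kind of regular pattern that the abstract describes as suited to parallelised and pipelined hardware.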

Le Masle A, Luk W, Moritz CA, 2011, Parametrized hardware architectures for the lucas primality test, Pages: 124-131

We present our parametric hardware architecture of the NIST-approved Lucas probabilistic primality test. To our knowledge, our work is the first hardware architecture for the Lucas test. Our main contributions are a hardware architecture for calculating the Jacobi symbol based on the binary Jacobi algorithm, a pipelined modular add-shift module for calculating the Lucas sequences, and methods for dependence analysis and for scheduling of the Lucas sequences computation. Our architecture implemented on a Virtex-5 FPGA is 30% slower but 3 times more energy efficient than the software version running on an Intel Xeon W3505. Our fastest 45 nm ASIC implementation is 3.6 times faster and 400 times more energy efficient than the optimised software implementation in comparable technology. The performance scaling of our architecture is much better than linear in area. Different speed/area/energy trade-offs are available through parametrization. The cell count and the power consumption of our ASIC implementations make them suitable for integration into an embedded system whereas our FPGA implementation would more likely benefit server applications. © 2011 IEEE.

CONFERENCE PAPER
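
As background to the Jacobi symbol computation mentioned in the abstract above, here is a minimal software sketch of the standard textbook algorithm, which repeatedly strips factors of two and applies quadratic reciprocity. It is an assumption that this resembles the binary Jacobi algorithm the paper builds on; the paper's hardware variant may differ, for example by replacing the modulo reductions with shifts and subtractions. The function name `jacobi` is chosen here for illustration.

```python
def jacobi(a, n):
    """Return the Jacobi symbol (a/n) for a positive odd integer n.

    Textbook algorithm, shown for illustration; not the paper's
    hardware architecture.
    """
    if n <= 0 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    a %= n
    result = 1
    while a != 0:
        # Remove factors of two: (2/n) = -1 exactly when n = 3 or 5 mod 8.
        while a % 2 == 0:
            a //= 2
            if n % 8 in (3, 5):
                result = -result
        # Quadratic reciprocity: swapping flips the sign only when both
        # values are congruent to 3 mod 4.
        a, n = n, a
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    # If the loop ends with n > 1, the inputs shared a common factor.
    return result if n == 1 else 0
```

For a prime modulus the result agrees with Euler's criterion, which gives a simple way to cross-check the sketch in software.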

Liu Q, Mak T, Luo J, Luk W, Yakovlev A et al., 2011, Power adaptive computing system design in energy harvesting environment, Pages: 33-40

Energy harvesting systems provide a promising alternative to battery-powered systems and create an opportunity for architecture and design method innovation for the exploitation of ambient energy sources. In this paper, we propose a two-stage optimization approach to develop power adaptive computing systems which can efficiently use energy harvested from a solar source. At design time, an SPMD (single process, multiple data) computation structure with multiple parallel processing units is generated, and a convex optimizer runs at run-time to decide how many processing units can operate simultaneously subject to the instant power supplied from the harvester. The approach is evaluated on three embedded applications. The results show that the proposed approach can develop and manage a computing system for each application to adjust its power consumption with respect to the power supply while maximizing speed. Compared to static systems without adaptability, our power adaptive computing system improves the harvested energy utilization efficiency by up to 28.8%. These computation systems can be applied to distributed monitor networks to improve computation capability at nodes. In our experiments, the throughput per watt in a node with an ARM9 processor can be improved 19 times by adding the developed adaptive computing system to the node. © 2011 IEEE.

CONFERENCE PAPER

Cecchi S, Primavera A, Piazza F, Bettarelli F, Ciavattini E, Toppi R, Coutinho JGF, Luk W, Pilato C, Ferrandi F, Sima V-M, Bertels K et al., 2011, The hArtes CarLab: A New Approach to Advanced Algorithms Development for Automotive Audio, JOURNAL OF THE AUDIO ENGINEERING SOCIETY, Vol: 59, Pages: 858-869, ISSN: 1549-4950

JOURNAL ARTICLE

Mak T, Cheung PYK, Lam K-P, Luk W et al., 2011, Adaptive Routing in Network-on-Chips Using a Dynamic-Programming Network, IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, Vol: 58, Pages: 3701-3716, ISSN: 0278-0046

JOURNAL ARTICLE

Flynn MJ, Luk W, 2011, Computer System Design: System-on-Chip, Computer System Design: System-on-Chip

The next generation of computer system designers will be less concerned about details of processors and memories, and more concerned about the elements of a system tailored to particular applications. These designers will have a fundamental knowledge of processors and other elements in the system, but the success of their design will depend on the skills in making system-level tradeoffs that optimize the cost, performance and other attributes to meet application requirements. This book provides a new treatment of computer system design, particularly for System-on-Chip (SOC), which addresses the issues mentioned above. It begins with a global introduction, from the high-level view to the lowest common denominator (the chip itself), then moves on to the three main building blocks of an SOC (processor, memory, and interconnect). Next is an overview of what makes SOC unique (its customization ability and the applications that drive it). The final chapter presents future challenges for system design and SOC possibilities. © 2011 John Wiley & Sons, Inc.

JOURNAL ARTICLE

Kurek M, Ilkos I, Luk W, 2011, Customizable security-aware cache for FPGA-based soft processors, Proceedings of the 2011 7th Southern Conference on Programmable Logic, SPL 2011, Pages: 45-50

This paper describes a security-aware cache targeting field-programmable gate array (FPGA) technology. Our design is based on an architecture with a remapping table, which provides resilience against side-channel timing attacks. We show how this cache design can be optimised for FPGA resources by an index decoder with content addressable memory structure, which can be customized to meet various requirements. We show, for the first time, how our security-aware cache can be included in the Leon 3 processor, and its performance and resource usage are evaluated. © 2011 IEEE.

JOURNAL ARTICLE

Koester M, Luk W, Hagemeyer J, Porrmann M, Rueckert U et al., 2011, Design Optimizations for Tiled Partially Reconfigurable Systems, IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, Vol: 19, Pages: 1048-1061, ISSN: 1063-8210

JOURNAL ARTICLE

Tsoi KH, Tse AHT, Pietzuch P, Luk W et al., 2011, Programming framework for clusters with heterogeneous accelerators, ACM SIGARCH Computer Architecture News, Vol: 38, Pages: 53-53, ISSN: 0163-5964

JOURNAL ARTICLE

Tse AHT, Thomas DB, Tsoi KH, Luk W et al., 2011, Efficient reconfigurable design for pricing asian options, ACM SIGARCH Computer Architecture News, Vol: 38, Pages: 14-14, ISSN: 0163-5964

JOURNAL ARTICLE

Yamaguchi Y, Tsoi HK, Luk W, 2011, FPGA-Based Smith-Waterman Algorithm: Analysis and Novel Design, 7th International Symposium on Applied Reconfigurable Computing, ARC 2011, Publisher: SPRINGER-VERLAG BERLIN, Pages: 181-+, ISSN: 0302-9743

CONFERENCE PAPER

Denholm S, Tsoi KH, Pietzuch P, Luk W et al., 2011, CusComNet: A Customisable Network for Reconfigurable Heterogeneous Clusters, 22nd IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP), Publisher: IEEE, Pages: 9-16, ISSN: 2160-0511

CONFERENCE PAPER

Chow GCT, Kwok KW, Luk W, Leong P et al., 2011, Mixed Precision Comparison in Reconfigurable Systems, IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Publisher: IEEE COMPUTER SOC, Pages: 17-24

CONFERENCE PAPER

Yamaguchi Y, Kuen HT, Luk W, 2011, A Comparison of FPGAs, GPUs and CPUs for Smith-Waterman Algorithm, 19th Annual ACM International Symposium on Field-Programmable Gate Arrays, Publisher: ASSOC COMPUTING MACHINERY, Pages: 282-282

CONFERENCE PAPER

Jin Q, Luk W, Thomas DB, 2011, On Comparing Financial Option Price Solvers on FPGA, IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Publisher: IEEE COMPUTER SOC, Pages: 89-92

CONFERENCE PAPER

This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.
