Publications

Cardoso JMP, Teixeira J, Alves JC, Nobre R, Diniz PC, Coutinho JGF, Luk W et al., 2012, Specifying Compiler Strategies for FPGA-based Systems, 20th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), Publisher: IEEE, Pages: 192-199, ISSN: 2576-2613

Citations: 7

Conference paper

Wang Y, Yan J, Zhou X, Wang L, Luk W, Peng C, Tong J et al., 2012, A Partially Reconfigurable Architecture Supporting Hardware Threads, 11th International Conference on Field-Programmable Technology (FPT), Publisher: IEEE, Pages: 269-276

Citations: 5

Conference paper

Santambrogio MD, Pnevmatikatos D, Papadimitriou K, Pilato C, Gaydadjiev G, Stroobandt D, Davidson T, Becker T, Todman T, Luk W, Bonetto A, Cazzaniga A, Durelli GC, Sciuto D et al., 2012, Smart Technologies for Effective Reconfiguration: The FASTER approach, 2012 7th International Workshop on Reconfigurable and Communication-Centric Systems-on-Chip (ReCoSoC)

Journal article

Coutinho JGF, Bhattacharya S, Luk W, Constantinides GA, Cardoso JMP, Carvalho T, Diniz PC, Petrov Z et al., 2012, Resource-Efficient Designs using an Aspect-Oriented Approach, 15th IEEE International Conference on Computational Science and Engineering (CSE) / 10th IEEE/IFIP International Conference on Embedded and Ubiquitous Computing (EUC), Publisher: IEEE, Pages: 399-406, ISSN: 1949-0828

Citations: 1

Conference paper

Betkaoui B, Wang Y, Thomas DB, Luk W et al., 2012, A Reconfigurable Computing Approach for Efficient and Scalable Parallel Graph Exploration, 23rd IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP), Publisher: IEEE, Pages: 8-15, ISSN: 2160-0511

Citations: 41

Conference paper

Chow GCT, Tse AHT, Jin Q, Luk W, Leong PHW, Thomas DB et al., 2012, A Mixed Precision Monte Carlo Methodology for Reconfigurable Accelerator Systems, 20th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), Publisher: Association for Computing Machinery, Pages: 57-66

Citations: 20

Conference paper

Bertels K, Lattanzi A, Ciavattini E, Bettarelli F, Chiaradia MT, Nutricato R, Morea A, Antola A, Ferrandi F, Lattuada M, Pilato C, Sciuto D, Meeuws RJ, Yankova Y, Sima VM, Sigdel K, Luk W, Coutinho JGDF, Lam YM, Todman T, Michelotti A, Cerruto A et al., 2012, The hArtes Tool Chain, Hardware/Software Co-design for Heterogeneous Multi-core Platforms: The hArtes Toolchain, Editors: Bertels, Publisher: Springer, Pages: 9-109, ISBN: 978-94-007-1405-2

Book chapter

Heinrich G, Logemann F, Hahn V, Jung C, Coutinho JGDF, Luk W et al., 2012, Audio Array Processing for Telepresence, Hardware/Software Co-design for Heterogeneous Multi-core Platforms: The hArtes Toolchain, Editors: Bertels, Publisher: Springer, Pages: 125-153, ISBN: 978-94-007-1405-2

Book chapter

Cecchi S, Palestini L, Peretti P, Primavera A, Piazza F, Capman F, Thabuteau S, Levy C, Bonastre J-F, Lattanzi A, Ciavattini E, Bettarelli F, Toppi R, Capucci E, Ferrandi F, Lattuada M, Pilato C, Sciuto D, Luk W, Coutinho JGDF et al., 2012, In Car Audio, Hardware/Software Co-design for Heterogeneous Multi-core Platforms: The hArtes Toolchain, Editors: Bertels, Publisher: Springer, Pages: 155-192, ISBN: 978-94-007-1405-2

Book chapter

Todman T, Boehm P, Luk W, 2012, Verification of streaming hardware and software codesigns, 11th International Conference on Field-Programmable Technology (FPT), Publisher: IEEE, Pages: 147-150

Citations: 3

Conference paper

Tsoi KH, Luk W, 2011, Power profiling and optimization for heterogeneous multi-core systems, ACM SIGARCH Computer Architecture News, Vol: 39, Pages: 8-13, ISSN: 0163-5964

Processing speed and energy efficiency are two of the most critical issues for computer systems. This paper presents a systematic approach for profiling the power and performance characteristics of applications targeting heterogeneous multi-core computing platforms. Our approach enables rapid and automated design space exploration involving optimisation of workload distribution for systems with accelerators such as FPGAs and GPUs. We demonstrate that, with minor modification to the design, it is possible to estimate the performance and power efficiency trade-off to identify an optimized workload distribution. Our approach shows that, for N-body computation, the fastest design, which involves 2 CPU cores, 10 FPGA cores and 40960 GPU threads, is 2 times faster than a design with only FPGAs while achieving better overall energy efficiency.

Journal article

Liu Q, Luk W, 2011, Objective-driven workload allocation in heterogeneous computing systems

In this work, we explore heterogeneous computing hardware, including CPUs, GPUs and FPGAs, for scientific computing. We study system metrics such as throughput, energy efficiency and temperature, and formulate the problem of workload allocation among computing hardware in mathematical models with regard to the three metrics. The workload allocation approach is evaluated using Linpack on a hardware platform containing one CPU, one GPU and one FPGA. Results show that the heterogeneous computing system with appropriate workload allocation provides high energy efficiency with a peak value of 1.1 GFLOPs/W and reduces power consumption by 56.54%, and that workload allocation schemes differ significantly depending on the system metric targeted. © 2011 IEEE.

Citations: 2

Conference paper

Yu C, Cox F, Luk W, Cheung RCC et al., 2011, Hydrate: Hybrid reconfigurable architecture expressions

This paper presents Hydrate (HYbriD Reconfigurable ArchiTecture Expressions), a generic architecture description language for exploring hybrid FPGA designs. Hydrate is based on the XML architecture description for the VPR tool. These expressions consist of variable, repeat and conditional statements to allow flexible, reusable and readable description for FPGA architectures, without modifying the VPR core. Two case studies, involving modern FPGA and floating-point applications, are used to illustrate our approach. Compared with VPR5.0 and VPR6.0, the architecture file format adopted by Hydrate is often much smaller and easier to understand, and over 90% of file size reduction for complex FPGAs can be achieved. Moreover, Hydrate is compatible with different versions of VPR, so that the powerful VPR tool flow can be used to explore future architectures. © 2011 IEEE.


Conference paper

Le Masle A, Chow GCT, Luk W, 2011, Constant power reconfigurable computing

We present Constant Power Reconfigurable Computing, a general and device-independent framework based on a closed-loop control system used to keep the power consumption constant for any reconfigurable computing design targeting FPGA implementation. We develop an on-chip power consumer, an on-chip power monitor and a proportional-integral-derivative controller with circuit primitives available in most commercial FPGAs. We demonstrate the effectiveness of the proposed methodology on a square-and-multiply exponentiation circuit implemented on a Spartan-6 LX45 FPGA board. By reducing the peak autocorrelation values by a factor of 2.7 on average, the proposed Constant Power Reconfigurable Computing approach decreases the information leaked by the power consumption of this system with only 26% area overhead and 28% power overhead. © 2011 IEEE.

Citations: 10

Conference paper

Susanto KW, Luk W, 2011, Automating formal verification of customized soft-processors

Soft-processors, instruction processors implemented in FPGA technology, are often customizable to support domain-specific optimization. However, the correctness of customized soft-processors, executing the associated machine code, is often not obvious. This paper proposes a novel approach for verifying the implementation of an application program for a customized soft-processor, based on the ACL2 theorem prover. The correctness proof involves verifying a machine code program executing on the target hardware device against a high-level specification of the application program. We illustrate the proposed approach with several case studies, showing how processors with different custom instructions and with different numbers of pipeline stages can be automatically produced and verified; such processors have a range of trade-offs in performance, size, power and energy consumption to meet different requirements. © 2011 IEEE.

Citations: 3

Conference paper

Becker T, Jin Q, Luk W, Weston S et al., 2011, Dynamic constant reconfiguration for explicit finite difference option pricing, Pages: 176-181

This paper explores the reconfiguration of slowly changing constants in an explicit finite difference solver for option pricing. Numerical methods for option pricing, such as finite difference, are computationally very complex and can be aided by hardware acceleration. Such hardware implementations can be further improved by specialising the circuit for constants, and reconfiguring the circuit when the constants change. In this paper we demonstrate how this concept can be applied to the pricing of European and American options. We present an analytical optimisation approach that explores the benefit of specialised designs over a static one. The key to this approach is the performance and area estimation of kernels that is based on the parameters of arithmetic operators inside the kernel. This allows us to quickly explore several design options without building full designs. Our experimental results on a Xilinx XC6VLX760 FPGA show that with a partially reconfigurable design performance can be improved by a factor of 4.7 over a design without reconfiguration. © 2011 IEEE.

Citations: 7

Conference paper

Cope B, Cheung PYK, Luk W, Howes L et al., 2011, A systematic design space exploration approach to customising multi-processor architectures: Exemplified using graphics processors, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol: 6760 LNCS, Pages: 63-83, ISSN: 0302-9743

A systematic approach to customising Homogeneous Multi-Processor (HoMP) architectures is described. The approach involves a novel design space exploration tool and a parameterisable system model. Post-fabrication customisation options for using reconfigurable logic with a HoMP are classified. The adoption of the approach in exploring pre- and post-fabrication customisation options to optimise an architecture's critical paths is then described. The approach and steps are demonstrated using the architecture of a graphics processor. We also analyse on-chip and off-chip memory access for systems with one or more processing elements (PEs), and study the impact of the number of threads per PE on the amount of off-chip memory access and the number of cycles for each output. It is shown that post-fabrication customisation of a graphics processor can provide up to four times performance improvement for negligible area cost. © 2011 Springer-Verlag Berlin Heidelberg.


Journal article

Das J, Lam A, Wilton SJE, Leong PHW, Luk W et al., 2011, An Analytical Model Relating FPGA Architecture to Logic Density and Depth, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol: 19, Pages: 2229-2242, ISSN: 1063-8210

Citations: 12

Journal article

Jin Q, Luk W, Thomas DB, 2011, Unifying finite difference option-pricing for hardware acceleration, Pages: 6-9

The explicit finite difference method is widely used in finance for pricing many kinds of options. Its regular computational pattern makes it an ideal candidate for acceleration using reconfigurable hardware. However, because the corresponding hardware designs must be optimised both for the specific option and for the target platform, it is challenging and time consuming to develop designs efficiently and productively. This paper presents a unifying framework for describing and automatically implementing financial explicit finite difference procedures in reconfigurable hardware, allowing parallelised and pipelined implementations to be created from high-level mathematical expressions. The proposed framework is demonstrated using three option pricing problems. Our results show that an implementation from our framework targeting a Virtex-6 device at 310MHz is more than 24 times faster than a software implementation fully optimised by the Intel compiler on a four-core Xeon CPU at 2.66GHz. In addition, the latency of the FPGA solvers is up to 90 times lower than the corresponding software solvers. © 2011 IEEE.

Citations: 2

Conference paper

Niu XY, Tsoi KH, Luk W, 2011, Reconfiguring distributed applications in FPGA accelerated cluster with wireless networking, Pages: 545-550

FPGA accelerators are capable of improving computation and energy efficiency of many applications targeting a cluster of machines. In this work, we focus on an FPGA accelerated cluster system coupled with a wireless network. Compared with conventional Ethernet-based approaches, the proposed system with wireless network enables a lightweight and efficient method for the FPGA devices to exchange information directly. Customisable monitoring facilities are developed to support changing a distributed application dynamically at run time. The N-Body simulation application is used to demonstrate the effectiveness and potential of the proposed system. Experiments show that this approach can achieve up to 4.2 times improvement in latency. By applying the proposed inter-FPGA wireless network to the N-Body application, we achieve enhanced power efficiency while fulfilling thermal constraints in all nodes. © 2011 IEEE.

Citations: 15

Conference paper

Le Masle A, Luk W, Moritz CA, 2011, Parametrized hardware architectures for the lucas primality test, Pages: 124-131

We present our parametric hardware architecture of the NIST-approved Lucas probabilistic primality test. To our knowledge, our work is the first hardware architecture for the Lucas test. Our main contributions are a hardware architecture for calculating the Jacobi symbol based on the binary Jacobi algorithm, a pipelined modular add-shift module for calculating the Lucas sequences, and methods for dependence analysis and for scheduling of the Lucas sequences computation. Our architecture implemented on a Virtex-5 FPGA is 30% slower but 3 times more energy efficient than the software version running on an Intel Xeon W3505. Our fastest 45 nm ASIC implementation is 3.6 times faster and 400 times more energy efficient than the optimised software implementation in comparable technology. The performance scaling of our architecture is much better than linear in area. Different speed/area/energy trade-offs are available through parametrization. The cell count and the power consumption of our ASIC implementations make them suitable for integration into an embedded system whereas our FPGA implementation would more likely benefit server applications. © 2011 IEEE.

Citations: 5

Conference paper

Liu Q, Mak T, Luo J, Luk W, Yakovlev A et al., 2011, Power adaptive computing system design in energy harvesting environment, Pages: 33-40

Energy harvesting systems provide a promising alternative to battery-powered systems and create an opportunity for architecture and design method innovation for the exploitation of ambient energy sources. In this paper, we propose a two-stage optimization approach to develop power adaptive computing systems which can efficiently use energy harvested from a solar source. At design time, an SPMD (single process, multiple data) computation structure with multiple parallel processing units is generated, and a convex optimizer runs at run-time to decide how many processing units can operate simultaneously subject to the instant power supplied from the harvester. The approach is evaluated on three embedded applications. The results show that the proposed approach can develop and manage a computing system for each application to adjust its power consumption with respect to the power supply while maximizing speed. Compared to static systems without adaptability, our power adaptive computing system improves the harvested energy utilization efficiency by up to 28.8%. These computation systems can be applied to distributed monitor networks to improve computation capability at nodes. In our experiments, the throughput per watt in a node with an ARM9 processor can be improved 19 times by adding the developed adaptive computing system to the node. © 2011 IEEE.

Citations: 12

Conference paper

Cecchi S, Primavera A, Piazza F, Bettarelli F, Ciavattini E, Toppi R, Coutinho JGF, Luk W, Pilato C, Ferrandi F, Sima V-M, Bertels K et al., 2011, The hArtes CarLab: A New Approach to Advanced Algorithms Development for Automotive Audio, Journal of the Audio Engineering Society, Vol: 59, Pages: 858-869, ISSN: 1549-4950

Citations: 3

Journal article

Flynn MJ, Luk W, 2011, Computer System Design: System-on-Chip, Computer System Design: System-on-Chip

The next generation of computer system designers will be less concerned about details of processors and memories, and more concerned about the elements of a system tailored to particular applications. These designers will have a fundamental knowledge of processors and other elements in the system, but the success of their design will depend on the skills in making system-level tradeoffs that optimize the cost, performance and other attributes to meet application requirements. This book provides a new treatment of computer system design, particularly for System-on-Chip (SOC), which addresses the issues mentioned above. It begins with a global introduction, from the high-level view to the lowest common denominator (the chip itself), then moves on to the three main building blocks of an SOC (processor, memory, and interconnect). Next is an overview of what makes SOC unique (its customization ability and the applications that drive it). The final chapter presents future challenges for system design and SOC possibilities. © 2011 John Wiley & Sons, Inc.

Citations: 21

Journal article

Koester M, Luk W, Hagemeyer J, Porrmann M, Rueckert U et al., 2011, Design Optimizations for Tiled Partially Reconfigurable Systems, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol: 19, Pages: 1048-1061, ISSN: 1063-8210

Citations: 29

Journal article

Spacey SA, Wiesemann W, Kuhn D, Luk W et al., 2011, Robust Software Partitioning with Multiple Instantiation, INFORMS Journal on Computing


Journal article

Kurek M, Ilkos I, Luk W, 2011, Customizable security-aware cache for FPGA-based soft processors, Proceedings of the 2011 7th Southern Conference on Programmable Logic, SPL 2011, Pages: 44-50

This paper describes a security-aware cache targeting field-programmable gate array (FPGA) technology. Our design is based on an architecture with a remapping table, which provides resilience against side-channel timing attacks. We show how this cache design can be optimised for FPGA resources by an index decoder with content addressable memory structure, which can be customized to meet various requirements. We show, for the first time, how our security-aware cache can be included in the Leon 3 processor, and its performance and resource usage are evaluated. © 2011 IEEE.


Journal article

Osborne WG, Luk W, Coutinho JGF, Mencer O et al., 2011, Energy reduction by systematic run-time reconfigurable hardware deactivation, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol: 6760 LNCS, Pages: 354-369, ISSN: 0302-9743

This paper describes a method of developing energy-efficient run-time reconfigurable hardware designs. The key idea is to systematically deactivate part of the hardware using word-length optimisation techniques, and then select the most optimal reconfiguration strategy: multiple bitstream reconfiguration or component multiplexing. When multiplexing between different parts of the circuit, it may not always be possible to gate the clock to the unwanted components in FPGAs. Different methods of achieving the same effect while minimising the area used for the control logic are investigated. A model is used to determine the conditions under which reconfiguring the bitstream is more energy-efficient than multiplexing part of the design, based on power measurements taken on 130nm and 90nm devices. Various case studies, such as ray tracing, B-Splines, vector multiplication and inner product are used to illustrate this approach. © 2011 Springer-Verlag Berlin Heidelberg.


Journal article

Professor Wayne Luk
