Imperial College London

Professor Peter Y. K. Cheung

Faculty of Engineering, Dyson School of Design Engineering

Professor of Digital Systems
 
 
 

Contact

 

+44 (0)20 7594 6200, p.cheung

 
 

Assistant

 

Mrs Wiesia Hsissen, +44 (0)20 7594 6261

 

Location

 

910B, Electrical Engineering, South Kensington Campus



 

Publications


349 results found

Mak T, Sedcole P, Cheung PYK, Luk W et al., 2010, Wave-pipelined intra-chip signaling for on-FPGA communications, Integration, the VLSI Journal, Vol: 43, Pages: 188-201

On-FPGA communication is becoming more problematic as the long interconnection performance is deteriorating in technology scaling. In this paper, we address this issue by proposing a novel wave-pipelined signaling scheme to achieve substantial throughput improvement in FPGAs. A new analytical model capturing the electrical characteristics in FPGA interconnects is presented. Based on the model, throughput and power consumption of a wave-pipelined link have been derived analytically and compared to the conventional synchronous links. Two circuit designs are proposed to realize wave-pipelined link using FPGA fabrics. The proposed approaches are also compared with conventional synchronous and asynchronous pipelining techniques. It is shown that the wave-pipelined approach can achieve up to 5.7 times improvement in throughput and 13% improvement in power consumption versus conventional delay-based on-chip communication schemes. Also, trade-offs between power, throughput and area consumption between the proposed and conventional designs are studied. The wave-pipelining approach provides a new alternative for on-FPGA communications and can potentially become a promising solution to mitigate the future interconnect scaling challenge.

Journal article

Becker T, Jamieson P, Luk W, Cheung PYK, Rissa T et al., 2010, Power characterisation for fine-grain reconfigurable fabrics, International Journal of Reconfigurable Computing, Vol: 2010

This paper proposes a benchmarking methodology for characterising the power consumption of the fine-grain fabric in reconfigurable architectures. This methodology is part of the GroundHog 2009 power benchmarking suite. It covers active and inactive power as well as advanced low-power modes. A method based on random number generators is adopted for comparing activity modes. We illustrate our approach using five field-programmable gate arrays (FPGAs) that span a range of process technologies: Xilinx Virtex-II Pro, Spartan-3E, Spartan-3AN, Virtex-5, and Silicon Blue iCE65. We find that, despite improvements through process technology and low-power modes, current devices need further improvements to be sufficiently power efficient for mobile applications. The Silicon Blue device demonstrates that performance can be traded off to achieve lower leakage.

Journal article

Stott EA, Sedcole NP, Cheung PYK, 2010, Fault tolerance and reliability in field-programmable gate arrays, Pages: 196-210

Conference paper

Jones DH, Powell A, Bouganis C-S, Cheung PYK et al., 2010, A Salient Region Detector for GPU Using a Cellular Automata Architecture, 17th International Conference on Neural Information Processing, Publisher: SPRINGER-VERLAG BERLIN, Pages: 501-508, ISSN: 0302-9743

Conference paper

Becker T, Luk W, Cheung PYK, 2010, Energy-aware optimisation for run-time reconfiguration, 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Publisher: IEEE COMPUTER SOC, Pages: 55-62

Conference paper

Lopez S, Sarmiento R, Potter PG, Luk W, Cheung PYK et al., 2010, Exploration of Hardware Sharing for Image Encoders, Design, Automation and Test in Europe Conference and Exhibition (DATE), Publisher: IEEE, Pages: 1737-1742, ISSN: 1530-1591

Conference paper

Stott EA, Wong JS, Sedcole NP, Cheung PYK et al., 2010, Degradation in FPGAs: measurement and modelling, International symposium on field programmable gate arrays, Pages: 229-238

Conference paper

Mak T, Cheung PYK, Luk W, Lam KP et al., 2009, A DP-network for optimal dynamic routing in network-on-chip, Pages: 119-127

Dynamic routing is desirable because of its substantial improvement in communication bandwidth and intelligent adaptation to faulty links and congested traffic. However, implementation of adaptive routing in a network-on-chip (NoC) system is not trivial and is further complicated by the requirements of deadlock freedom and real-time optimal decision making. In this paper, we present a deadlock-free routing architecture which employs a dynamic programming (DP) network to provide on-the-fly optimal path planning and network monitoring for packet switching. Also, a new routing strategy called k-step look ahead is introduced. This new strategy can substantially reduce the size of the routing table while maintaining a high quality of adaptation, which leads to a scalable dynamic routing solution with minimal hardware overhead. Our results based on a cycle-accurate simulator demonstrate the effectiveness of the DP-network, which outperforms both the deterministic and adaptive routing algorithms in average delay on various traffic scenarios by 22.3%. Moreover, the hardware overhead for DP-network is insignificant based on the results obtained from the hardware implementations. Copyright 2009 ACM.

Conference paper

Angelopoulou M, Bouganis CS, Cheung PYK, Constantinides GA et al., 2009, Robust Real-Time Super-Resolution on FPGA and an Application to Video Enhancement, ACM Transactions on Reconfigurable Technology and Systems (TRETS), Vol: 2

The high density image sensors of state-of-the-art imaging systems provide outputs with high spatial resolution, but require long exposure times. This limits their applicability, due to the motion blur effect. Recent technological advances have led to adaptive image sensors that can combine several pixels together in real time to form a larger pixel. Larger pixels require shorter exposure times and produce high-frame-rate samples with reduced motion blur. This work proposes combining an FPGA with an adaptive image sensor to produce an output of high resolution both in space and time. The FPGA is responsible for the spatial resolution enhancement of the high-frame-rate samples using super-resolution (SR) techniques in real time. To achieve it, this article proposes utilizing the Iterative Back Projection (IBP) SR algorithm. The original IBP method is modified to account for the presence of noise, leading to an algorithm more robust to noise. An FPGA implementation of this algorithm is presented. The proposed architecture can serve as a general purpose real-time resolution enhancement system, and its performance is evaluated under various noise levels.

Journal article

Fahmy SA, Cheung PYK, Luk W, 2009, High-throughput one-dimensional median and weighted median filters on FPGA, Computers & Digital Techniques, IET, Vol: 3, Pages: 384-394

Most effort in designing median filters has focused on two-dimensional filters with small window sizes, used for image processing. However, recent work on novel image processing algorithms, such as the trace transform, has highlighted the need for architectures that can compute the median and weighted median of large one-dimensional windows, to which the optimisations in the aforementioned architectures do not apply. A set of architectures for computing both the median and weighted median of large, flexibly sized windows through parallel cumulative histogram construction is presented. The architecture uses embedded memories to control the highly parallel bank of histogram nodes, and can implicitly determine window sizes for median and weighted median calculations. The architecture is shown to perform at 72 Msamples, and has been integrated within a trace transform architecture.

Journal article

Wong JSJ, Sedcole P, Cheung PYK, 2009, Self-Measurement of Combinatorial Circuit Delays in FPGAs, ACM Transactions on Reconfigurable Technology and Systems (TRETS), Vol: 2, Pages: 1-22

This article proposes a Built-In Self-Test (BIST) method to accurately measure the combinatorial circuit delays on an FPGA. The flexibility of the on-chip clock generation capability found in modern FPGAs is employed to step through a range of frequencies until timing failure in the combinatorial circuit is detected. In this way, the delay of any combinatorial circuit can be determined with a timing resolution of the order of picoseconds. Parallel and optimized implementations of the method for self-characterization of the delay of all the LUTs on an FPGA are also proposed. The method was applied to Altera Cyclone II and III FPGAs. A complete self-characterization of LUTs on a Cyclone II was achieved in 2.5 seconds, utilizing only 13kbit of block RAM to store the results. More extensive tests were carried out on the Cyclone III and the delays of adder circuits and embedded multiplier blocks were successfully measured. This self-measurement method paves the way for matching timing requirements in designs to FPGAs as a means of combating the problem of process variations.

Journal article

Liu Q, Constantinides GA, Masselos K, Cheung PYK et al., 2009, Data-reuse exploration under an on-chip memory constraint for low-power FPGA-based systems, Computers & Digital Techniques, IET, Vol: 3, Pages: 235-246

Contemporary FPGA-based reconfigurable systems have been widely used to implement data-dominated applications. In these applications, data transfer and storage consume a large proportion of the system energy. Exploiting data-reuse can introduce significant power savings, but also introduces the extra requirement for on-chip memory. To aid data-reuse design exploration early during the design cycle, the authors present an optimisation approach to achieve a power-optimal design satisfying an on-chip memory constraint in a targeted FPGA-based platform. The data-reuse exploration problem is mathematically formulated and shown to be equivalent to the multiple-choice knapsack problem. The solution to this problem for an application code corresponds to the decision of which array references are to be buffered on-chip and where in the code the reused data of those array references are loaded into on-chip memory, in order to minimise power consumption for a fixed on-chip memory size. The authors also present an experimentally verified power model, capable of providing the relative power information between different data-reuse design options of an application, resulting in a fast and efficient design-space exploration. The experimental results demonstrate that the approach enables us to find the most power-efficient design for all the benchmark circuits tested.

Journal article

Liu Q, Constantinides GA, Masselos K, Cheung P et al., 2009, Combining Data Reuse With Data-Level Parallelization for FPGA-Targeted Hardware Compilation: A Geometric Programming Framework, Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, Vol: 28, Pages: 305-315

A nonlinear optimization framework is proposed in this paper to automate exploration of the design space consisting of data-reuse (buffering) decisions and loop-level parallelization, in the context of field-programmable-gate-array-targeted hardware compilation. Buffering frequently accessed data in on-chip memories can reduce off-chip memory accesses and open avenues for parallelization. However, the exploitation of both data reuse and parallelization is limited by the memory resources available on-chip. As a result, considering these two problems separately, e.g., first exploring data reuse and then exploring data-level parallelization, based on the data-reuse options determined in the first step, may not yield the performance-optimal designs for limited on-chip memory resources. We consider both problems at the same time, exposing the dependence between the two. We show that this combined problem can be formulated as a nonlinear program and further show that efficient solution techniques exist for this problem, based on recent advances in optimization of so-called geometric programming problems. The results from applying this framework to several real benchmarks implemented on a Xilinx device demonstrate that given different constraints on on-chip memory utilization, the corresponding performance-optimal designs are automatically determined by the framework. We have also implemented designs determined by a two-stage optimization method that first explores data reuse and then explores parallelization on the same platform, and by comparison, the performance-optimal designs proposed by our framework are faster than the designs determined by the two-stage method by up to 5.7 times.

Journal article

M Angelopoulou CB, Cheung PYK, 2009, A sensor-based approach to linear blur identification for real-time video enhancement

Conference paper

Bouganis CS, Park SB, Constantinides GA, Cheung PYK et al., 2009, Synthesis and Optimization of 2D Filter Designs for Heterogeneous FPGAs, ACM Transactions on Reconfigurable Technology and Systems (TRETS), Vol: 1, ISSN: 1936-7406

Many image processing applications require fast convolution of an image with one or more 2D filters. Field-Programmable Gate Arrays (FPGAs) are often used to achieve this goal due to their fine grain parallelism and reconfigurability. However, the heterogeneous nature of modern reconfigurable devices is not usually considered during design optimization. This article proposes an algorithm that explores the space of possible implementation architectures of 2D filters, targeting the minimization of the required area, by optimizing the usage of the different components in a heterogeneous device. This is achieved by exploring the heterogeneous nature of modern reconfigurable devices using a Singular Value Decomposition based algorithm, which provides an efficient mapping of filter's implementation requirements to the heterogeneous components of modern FPGAs. In the case of multiple 2D filters, the proposed algorithm also exploits any redundancy that exists within each filter and between different filters in the set, leading to designs with minimized area. Experiments with real filter sets from computer vision applications demonstrate an average of up to 38% reduction in the required area.

Journal article

Liu Y, Bouganis CS, Cheung PYK, 2009, Hardware architectures for eigenvalue computation of real symmetric matrices, IET Proceeding on Computers & Digital Techniques, Vol: 3, Pages: 72-84

Computation of eigenvalues is essential in many applications in the fields of science and engineering. When the application of interest requires the computation of eigenvalues at high throughput or with real-time performance, a hardware implementation of an eigenvalue computation block is often employed. The problem of eigenvalue computation of real symmetric matrices is focused upon. For the general case of a symmetric matrix eigenvalue problem, the approximate Jacobi method is proposed, whereas for the special case of a 3×3 symmetric matrix, an algebraic-based method is introduced. The proposed methods are compared with various other approaches reported in the literature. Results obtained by mapping the above architectures on a field programmable gate array device illustrate the advantages of the proposed methods over the existing ones.

Journal article

Potter PG, Luk W, Cheung P, 2009, Partition-based exploration for reconfigurable JPEG designs, Design, Automation and Test in Europe Conference and Exhibition, Publisher: IEEE, Pages: 886-889, ISSN: 1530-1591

Conference paper

Jamieson P, Becker T, Luk W, Cheung PYK, Rissa T, Pitkaenen T et al., 2009, Benchmarking Reconfigurable Architectures in the Mobile Domain, 17th Annual IEEE Symposium on Field Programmable Custom Computing Machines, Publisher: IEEE COMPUTER SOC, Pages: 131-+

Conference paper

Becker T, Luk W, Cheung PYK, 2009, Parametric Design for Reconfigurable Software-Defined Radio, 5th International Workshop on Applied Reconfigurable Computing, Publisher: SPRINGER-VERLAG BERLIN, Pages: 15-+, ISSN: 0302-9743

Conference paper

Becker T, Jamieson P, Luk W, Cheung PYK, Rissa T et al., 2009, Power Characterisation for the Fabric in Fine-Grain Reconfigurable Architectures, 5th Southern Conference on Programmable Logic, Publisher: IEEE, Pages: 77-+

Conference paper

Sedcole NP, Stott EA, Cheung PYK, 2009, Compensating for variability in FPGAs by re-mapping and re-placement, Pages: 613-616

Conference paper

Smith AM, Constantinides GA, Cheung PYK, 2009, Area Estimation and Optimisation of FPGA Routing Fabrics

Conference paper

Liu Q, Constantinides GA, Masselos K, Cheung PYK et al., 2009, Compiling C-like Languages to FPGA Hardware: Some Novel Approaches Targeting Data Memory Organization, The Computer Journal, Pages: bxp020

This paper describes our approaches to raise the level of abstraction at which hardware suitable for accelerating computationally intensive applications can be specified. Field-programmable gate arrays are becoming adopted as a computational platform by the high-performance computing community, but there are challenges to extract maximum performance from these devices. Unlike other approaches, our focus is on data memory organization and input-output bandwidth considerations, which are the typical stumbling block of existing hardware compilation schemes. We describe our approaches, which are based on formal optimization techniques, and present some results showing the advantage of exposing the interaction between data memory system design and parallelism extraction to the compiler.

Journal article

Wang L, Mak T, Sedcole P, Cheung PYK et al., 2009, Throughput Maximization for Wave-Pipelined Interconnects Using Cascaded Buffers and Transistor Sizing, IEEE International Symposium on Circuits and Systems (ISCAS 2009), Publisher: IEEE, Pages: 1293-1296, ISSN: 0271-4302

Conference paper

Kahoul A, Smith AM, Constantinides GA, 2009, Heterogeneous Architecture Evaluation: Analysis versus Parameter Sweep, Pages: 133-144

Conference paper

Clarke JA, Constantinides GA, Cheung PYK, 2009, Word-length selection for power minimization via non-linear optimization, ACM Transactions on Design Automation of Electronic Systems, Vol: 14

Journal article

Arifin S, Cheung PYK, 2008, Affective Level Video Segmentation by Utilizing the Pleasure-Arousal-Dominance Information, IEEE TRANSACTIONS ON MULTIMEDIA, Vol: 10, Pages: 1325-1341, ISSN: 1520-9210

Journal article

Turkington K, Constantinides GA, Masselos K, Cheung PYK et al., 2008, Outer Loop Pipelining for Application Specific Datapaths in FPGAs, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol: 16, Pages: 1268-1280

Most hardware compilers apply loop pipelining to increase the parallelism achieved, but pipelining is restricted to only the innermost level in a nested loop. In this work we extend and adapt an existing outer loop pipelining approach known as single dimension software pipelining to generate schedules for field-programmable gate-array (FPGA) hardware coprocessors. Each loop level in nine test loops is pipelined and the resulting schedules are implemented in VHDL and targeted to an Altera Stratix II FPGA. The results show that the fastest solution for all but one of the loops occurs when pipelining is applied one to three levels above the innermost loop. Across the nine test loops we achieve an acceleration over the innermost loop solution of up to seven times, with a mean speedup of 3.2 times. The results suggest that inclusion of outer loop pipelining in future hardware compilers may be worthwhile as it can allow significantly improved results to be achieved at the cost of a small increase in compile time.

Journal article

Becker T, Jamieson P, Luk W, Cheung PYK, Rissa T et al., 2008, Towards benchmarking energy efficiency of reconfigurable architectures, International Conference on Field Programmable Logic and Applications, Publisher: IEEE, Pages: 691-694

Energy research in reconfigurable architectures often involves legacy benchmarks such as the MCNC benchmarks. These benchmarks, however, are not well-suited for assessing energy consumption of reconfigurable technology, since they lack realistic input stimuli. This paper reviews and categorises a range of computation system benchmarks, and shows that there are no comprehensive benchmarks targeting reconfigurable architectures that would stimulate energy or power research. We review existing energy research in the field which involves microbenchmarks, in-house designs, or legacy benchmark suites used to evaluate power optimisations.

Conference paper

Smith AM, Constantinides GA, Cheung PYK, 2008, Integrated Floorplanning, Module-Selection and Architecture Generation for Reconfigurable Devices, IEEE Transactions on VLSI Systems, Vol: 16, Pages: 733-744

Journal article

This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.
