Becker T, Jamieson P, Luk W, et al., 2010, Power characterisation for fine-grain reconfigurable fabrics, International Journal of Reconfigurable Computing, Vol: 2010
This paper proposes a benchmarking methodology for characterising the power consumption of the fine-grain fabric in reconfigurable architectures. This methodology is part of the GroundHog 2009 power benchmarking suite. It covers active and inactive power as well as advanced low-power modes. A method based on random number generators is adopted for comparing activity modes. We illustrate our approach using five field-programmable gate arrays (FPGAs) that span a range of process technologies: Xilinx Virtex-II Pro, Spartan-3E, Spartan-3AN, Virtex-5, and Silicon Blue iCE65. We find that, despite improvements through process technology and low-power modes, current devices need further improvements to be sufficiently power efficient for mobile applications. The Silicon Blue device demonstrates that performance can be traded off to achieve lower leakage.
Stott EA, Sedcole NP, Cheung PYK, 2010, Fault tolerance and reliability in field-programmable gate arrays, Pages: 196-210
Mak T, Cheung PYK, Luk W, et al., 2009, A DP-network for optimal dynamic routing in network-on-chip, Pages: 119-127
Dynamic routing is desirable because of its substantial improvement in communication bandwidth and intelligent adaptation to faulty links and congested traffic. However, implementation of adaptive routing in a network-on-chip (NoC) system is not trivial and is further complicated by the requirements of deadlock freedom and real-time optimal decision making. In this paper, we present a deadlock-free routing architecture which employs a dynamic programming (DP) network to provide on-the-fly optimal path planning and network monitoring for packet switching. Also, a new routing strategy called k-step look ahead is introduced. This new strategy can substantially reduce the size of the routing table while maintaining a high quality of adaptation, which leads to a scalable dynamic routing solution with minimal hardware overhead. Our results based on a cycle-accurate simulator demonstrate the effectiveness of the DP-network, which outperforms both the deterministic and adaptive routing algorithms in average delay on various traffic scenarios by 22.3%. Moreover, the hardware overhead for the DP-network is insignificant based on the results obtained from the hardware implementations.
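The path-planning idea in this abstract can be illustrated in software. The sketch below is not the paper's hardware DP-network or its k-step look-ahead strategy. It is a plain Bellman-style relaxation in which every node repeatedly updates its cost-to-destination from its neighbours' costs, assuming a hypothetical four-node mesh with made-up link costs and one congested link.

```python
def dp_routing(links, dest):
    """Relax cost-to-destination values until they stop changing.

    links: {node: {neighbour: link_cost}} with positive costs (hypothetical data).
    Returns (cost-to-dest per node, chosen next hop per node).
    """
    value = {node: float("inf") for node in links}
    value[dest] = 0
    next_hop = {}
    changed = True
    while changed:
        changed = False
        for node, neighbours in links.items():
            if node == dest:
                continue
            for neighbour, cost in neighbours.items():
                # Bellman update: keep the cheapest route seen so far.
                if cost + value[neighbour] < value[node]:
                    value[node] = cost + value[neighbour]
                    next_hop[node] = neighbour
                    changed = True
    return value, next_hop


# Hypothetical 2x2 mesh; the B-D link is congested (cost 5).
mesh = {
    "A": {"B": 1, "C": 1},
    "B": {"A": 1, "D": 5},
    "C": {"A": 1, "D": 1},
    "D": {"B": 5, "C": 1},
}
costs, hops = dp_routing(mesh, "D")
print(costs["B"], hops["B"])  # → 3 A
```

Node B routes around the congested link through A and C, which is the kind of adaptation the abstract attributes to dynamic routing.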
Angelopoulou M, Bouganis CS, Cheung PYK, et al., 2009, Robust Real-Time Super-Resolution on FPGA and an Application to Video Enhancement, ACM Transactions on Reconfigurable Technology and Systems (TRETS), Vol: 2
The high density image sensors of state-of-the-art imaging systems provide outputs with high spatial resolution, but require long exposure times. This limits their applicability, due to the motion blur effect. Recent technological advances have led to adaptive image sensors that can combine several pixels together in real time to form a larger pixel. Larger pixels require shorter exposure times and produce high-frame-rate samples with reduced motion blur. This work proposes combining an FPGA with an adaptive image sensor to produce an output of high resolution both in space and time. The FPGA is responsible for the spatial resolution enhancement of the high-frame-rate samples using super-resolution (SR) techniques in real time. To achieve this, the article proposes utilizing the Iterative Back Projection (IBP) SR algorithm. The original IBP method is modified to account for the presence of noise, leading to an algorithm more robust to noise. An FPGA implementation of this algorithm is presented. The proposed architecture can serve as a general purpose real-time resolution enhancement system, and its performance is evaluated under various noise levels.
Fahmy SA, Cheung PYK, Luk W, 2009, High-throughput one-dimensional median and weighted median filters on FPGA, Computers & Digital Techniques, IET, Vol: 3, Pages: 384-394
Most effort in designing median filters has focused on two-dimensional filters with small window sizes, used for image processing. However, recent work on novel image processing algorithms, such as the trace transform, has highlighted the need for architectures that can compute the median and weighted median of large one-dimensional windows, to which the optimisations in the aforementioned architectures do not apply. A set of architectures for computing both the median and weighted median of large, flexibly sized windows through parallel cumulative histogram construction is presented. The architecture uses embedded memories to control the highly parallel bank of histogram nodes, and can implicitly determine window sizes for median and weighted median calculations. The architecture is shown to perform at 72 Msamples, and has been integrated within a trace transform architecture.
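The histogram-based selection described in this abstract can be sketched in a few lines of software. The function below is a sequential illustration only, not the parallel FPGA architecture: it assumes integer samples in a fixed range (the `levels` parameter is a hypothetical choice), accumulates a histogram, and walks the cumulative counts until half the total weight is reached. For even-sized windows it returns the lower median.

```python
def histogram_median(window, levels=256, weights=None):
    """Median (or weighted median) of integer samples in range(levels)."""
    if not window:
        raise ValueError("empty window")
    if weights is None:
        weights = [1] * len(window)  # unweighted case
    hist = [0] * levels
    for sample, weight in zip(window, weights):
        hist[sample] += weight
    total = sum(weights)
    cumulative = 0
    for value, count in enumerate(hist):
        cumulative += count
        # First bin where the cumulative weight reaches half the total.
        if 2 * cumulative >= total:
            return value


print(histogram_median([5, 1, 9, 3, 7]))              # → 5
print(histogram_median([1, 2, 3], weights=[1, 1, 5]))  # → 3
```

Because the cost depends on the number of histogram bins and not on sorting the window, the approach suits the large one-dimensional windows the abstract mentions.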
Wong JSJ, Sedcole P, Cheung PYK, 2009, Self-Measurement of Combinatorial Circuit Delays in FPGAs, ACM Transactions on Reconfigurable Technology and Systems (TRETS), Vol: 2, Pages: 1-22
This article proposes a Built-In Self-Test (BIST) method to accurately measure the combinatorial circuit delays on an FPGA. The flexibility of the on-chip clock generation capability found in modern FPGAs is employed to step through a range of frequencies until timing failure in the combinatorial circuit is detected. In this way, the delay of any combinatorial circuit can be determined with a timing resolution of the order of picoseconds. Parallel and optimized implementations of the method for self-characterization of the delay of all the LUTs on an FPGA are also proposed. The method was applied to Altera Cyclone II and III FPGAs. A complete self-characterization of LUTs on a Cyclone II was achieved in 2.5 seconds, utilizing only 13kbit of block RAM to store the results. More extensive tests were carried out on the Cyclone III and the delays of adder circuits and embedded multiplier blocks were successfully measured. This self-measurement method paves the way for matching timing requirements in designs to FPGAs as a means of combating the problem of process variations.
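The frequency-stepping principle can be mimicked with a toy software model. Everything below is hypothetical: `passes_at` stands in for running the circuit under test at a given clock period and checking for timing failure, and the 1234 ps delay, 5000 ps starting period and 10 ps step are made-up values. The sketch only shows the search loop, not the on-chip measurement circuitry.

```python
def measure_delay(passes_at, start_ps, step_ps):
    """Shorten the clock period until the circuit first fails.

    Returns the shortest tested period (in ps) that still passed,
    which bounds the circuit delay to within one step.
    """
    if not passes_at(start_ps):
        raise ValueError("circuit already fails at the starting period")
    period = start_ps
    while period - step_ps > 0 and passes_at(period - step_ps):
        period -= step_ps
    return period


# Hypothetical circuit whose true delay is 1234 ps.
def circuit_ok(period_ps):
    return period_ps >= 1234


print(measure_delay(circuit_ok, start_ps=5000, step_ps=10))  # → 1240
```

The step size sets the measurement resolution, which is why the fine-grained clock generation mentioned in the abstract matters.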
Liu Q, Constantinides GA, Masselos K, et al., 2009, Data-reuse exploration under an on-chip memory constraint for low-power FPGA-based systems, Computers & Digital Techniques, IET, Vol: 3, Pages: 235-246
Contemporary FPGA-based reconfigurable systems have been widely used to implement data-dominated applications. In these applications, data transfer and storage consume a large proportion of the system energy. Exploiting data-reuse can introduce significant power savings, but also introduces the extra requirement for on-chip memory. To aid data-reuse design exploration early during the design cycle, the authors present an optimisation approach to achieve a power-optimal design satisfying an on-chip memory constraint in a targeted FPGA-based platform. The data-reuse exploration problem is mathematically formulated and shown to be equivalent to the multiple-choice knapsack problem. The solution to this problem for an application code corresponds to the decision of which array references are to be buffered on-chip and where in the code the reused data of those array references is loaded into on-chip memory, in order to minimise power consumption for a fixed on-chip memory size. The authors also present an experimentally verified power model, capable of providing the relative power information between different data-reuse design options of an application, resulting in a fast and efficient design-space exploration. The experimental results demonstrate that the approach enables us to find the most power-efficient design for all the benchmark circuits tested.
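To make the multiple-choice knapsack formulation concrete, here is a minimal dynamic-programming sketch. It is an illustration of the problem class, not the authors' formulation or solver. Each array reference is modelled as a group of hypothetical (memory, power) options, exactly one option is chosen per group, and total power is minimised subject to an on-chip memory capacity. All numbers are invented.

```python
def min_power(groups, capacity):
    """Multiple-choice knapsack: choose one (memory, power) option per group.

    Returns the minimum total power using at most `capacity` memory units,
    or infinity if no combination fits.
    """
    # best[m]: minimum power of the groups processed so far within m units.
    best = [0] * (capacity + 1)
    for options in groups:
        updated = [float("inf")] * (capacity + 1)
        for m in range(capacity + 1):
            for mem, power in options:
                if mem <= m and best[m - mem] + power < updated[m]:
                    updated[m] = best[m - mem] + power
        best = updated
    return best[capacity]


# Two hypothetical array references; option (0, p) means "do not buffer".
groups = [
    [(0, 10), (2, 6), (4, 3)],
    [(0, 8), (3, 2)],
]
print(min_power(groups, 5))  # → 8
```

With 5 memory units, the cheapest choice buffers the first reference with its 2-unit option and the second with its 3-unit option. With only 4 units, it is better to spend all memory on the first reference.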
Liu Q, Constantinides GA, Masselos K, et al., 2009, Combining Data Reuse With Data-Level Parallelization for FPGA-Targeted Hardware Compilation: A Geometric Programming Framework, Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, Vol: 28, Pages: 305-315
A nonlinear optimization framework is proposed in this paper to automate exploration of the design space consisting of data-reuse (buffering) decisions and loop-level parallelization, in the context of field-programmable-gate-array-targeted hardware compilation. Buffering frequently accessed data in on-chip memories can reduce off-chip memory accesses and open avenues for parallelization. However, the exploitation of both data reuse and parallelization is limited by the memory resources available on-chip. As a result, considering these two problems separately, e.g., first exploring data reuse and then exploring data-level parallelization, based on the data-reuse options determined in the first step, may not yield the performance-optimal designs for limited on-chip memory resources. We consider both problems at the same time, exposing the dependence between the two. We show that this combined problem can be formulated as a nonlinear program and further show that efficient solution techniques exist for this problem, based on recent advances in optimization of so-called geometric programming problems. The results from applying this framework to several real benchmarks implemented on a Xilinx device demonstrate that given different constraints on on-chip memory utilization, the corresponding performance-optimal designs are automatically determined by the framework. We have also implemented designs determined by a two-stage optimization method that first explores data reuse and then explores parallelization on the same platform, and by comparison, the performance-optimal designs proposed by our framework are faster than the designs determined by the two-stage method by up to 5.7 times.
Wang L, Mak T, Sedcole P, et al., 2009, Throughput Maximization for Wave-Pipelined Interconnects Using Cascaded Buffers and Transistor Sizing, IEEE International Symposium on Circuits and Systems (ISCAS 2009), Publisher: IEEE, Pages: 1293-1296
Jamieson P, Becker T, Luk W, et al., 2009, Benchmarking Reconfigurable Architectures in the Mobile Domain, 17th Annual IEEE Symposium on Field Programmable Custom Computing Machines, Publisher: IEEE COMPUTER SOC, Pages: 131-+
Angelopoulou M, Bouganis CS, Cheung PYK, 2009, A sensor-based approach to linear blur identification for real-time video enhancement
Becker T, Jamieson P, Luk W, et al., 2009, Power Characterisation for the Fabric in Fine-Grain Reconfigurable Architectures, 5th Southern Conference on Programmable Logic, Publisher: IEEE, Pages: 77-+
Becker T, Luk W, Cheung PYK, 2009, Parametric Design for Reconfigurable Software-Defined Radio, 5th International Workshop on Applied Reconfigurable Computing, Publisher: SPRINGER-VERLAG BERLIN, Pages: 15-+, ISSN: 0302-9743
Potter PG, Luk W, Cheung P, 2009, Partition-based exploration for reconfigurable JPEG designs, Design, Automation and Test in Europe Conference and Exhibition, Publisher: IEEE, Pages: 886-889, ISSN: 1530-1591
Liu Y, Bouganis CS, Cheung PYK, 2009, Hardware architectures for eigenvalue computation of real symmetric matrices, IET Proceeding on Computers & Digital Techniques, Vol: 3, Pages: 72-84
Computation of eigenvalues is essential in many applications in the fields of science and engineering. When the application of interest requires the computation of eigenvalues with high throughput or real-time performance, a hardware implementation of an eigenvalue computation block is often employed. The problem of eigenvalue computation of real symmetric matrices is focused upon. For the general case of a symmetric matrix eigenvalue problem, the approximate Jacobi method is proposed, whereas for the special case of a 3×3 symmetric matrix, an algebraic-based method is introduced. The proposed methods are compared with various other approaches reported in the literature. Results obtained by mapping the above architectures on a field programmable gate array device illustrate the advantages of the proposed methods over the existing ones.
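For readers unfamiliar with Jacobi-type methods, the sketch below shows the classical cyclic Jacobi iteration for a real symmetric matrix in software. It is not the approximate Jacobi variant or the 3×3 algebraic method proposed in the paper. The sweep limit and tolerance are arbitrary assumptions.

```python
import math


def jacobi_eigenvalues(matrix, sweeps=50, tol=1e-18):
    """Eigenvalues of a real symmetric matrix by cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]  # work on a copy
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break  # off-diagonal part is negligible
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes a[p][q]; |theta| <= pi/4.
                diff = a[q][q] - a[p][p]
                if diff == 0:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # update columns p and q (A @ J)
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # update rows p and q (J.T @ A)
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))


# The result is close to 1, 2 and 11 for this example matrix.
print(jacobi_eigenvalues([[2, 0, 0], [0, 3, 4], [0, 4, 9]]))
```

Each rotation zeroes one off-diagonal pair, and repeated sweeps drive the matrix towards diagonal form, at which point the diagonal holds the eigenvalues.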
Bouganis CS, Park SB, Constantinides GA, et al., 2009, Synthesis and Optimization of 2D Filter Designs for Heterogeneous FPGAs, ACM Transactions on Reconfigurable Technology and Systems (TRETS), Vol: 1, ISSN: 1936-7406
Many image processing applications require fast convolution of an image with one or more 2D filters. Field-Programmable Gate Arrays (FPGAs) are often used to achieve this goal due to their fine grain parallelism and reconfigurability. However, the heterogeneous nature of modern reconfigurable devices is not usually considered during design optimization. This article proposes an algorithm that explores the space of possible implementation architectures of 2D filters, targeting the minimization of the required area, by optimizing the usage of the different components in a heterogeneous device. This is achieved by exploring the heterogeneous nature of modern reconfigurable devices using a Singular Value Decomposition based algorithm, which provides an efficient mapping of the filter's implementation requirements to the heterogeneous components of modern FPGAs. In the case of multiple 2D filters, the proposed algorithm also exploits any redundancy that exists within each filter and between different filters in the set, leading to designs with minimized area. Experiments with real filter sets from computer vision applications demonstrate an average reduction of up to 38% in the required area.
Sedcole NP, Stott EA, Cheung PYK, 2009, Compensating for variability in FPGAs by re-mapping and re-placement, Pages: 613-616
Smith AM, Constantinides GA, Cheung PYK, 2009, Area Estimation and Optimisation of FPGA Routing Fabrics
Liu Q, Constantinides GA, Masselos K, et al., 2009, Compiling C-like Languages to FPGA Hardware: Some Novel Approaches Targeting Data Memory Organization, The Computer Journal, Pages: bxp020-bxp020
This paper describes our approaches to raise the level of abstraction at which hardware suitable for accelerating computationally intensive applications can be specified. Field-programmable gate arrays are becoming adopted as a computational platform by the high-performance computing community, but there are challenges to extract maximum performance from these devices. Unlike other approaches, our focus is on data memory organization and input-output bandwidth considerations, which are the typical stumbling block of existing hardware compilation schemes. We describe our approaches, which are based on formal optimization techniques, and present some results showing the advantage of exposing the interaction between data memory system design and parallelism extraction to the compiler.
Kahoul A, Smith AM, Constantinides GA, 2009, Heterogeneous Architecture Evaluation: Analysis versus Parameter Sweep, Pages: 133-144
Clarke JA, Constantinides GA, Cheung PYK, 2009, Word-length selection for power minimization via non-linear optimization, ACM Transactions on Design Automation of Electronic Systems, Vol: 14
Arifin S, Cheung PYK, 2008, Affective Level Video Segmentation by Utilizing the Pleasure-Arousal-Dominance Information, IEEE Transactions on Multimedia, Vol: 10, Pages: 1325-1341, ISSN: 1520-9210
Turkington K, Constantinides GA, Masselos K, et al., 2008, Outer Loop Pipelining for Application Specific Datapaths in FPGAs, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol: 16, Pages: 1268-1280
Most hardware compilers apply loop pipelining to increase the parallelism achieved, but pipelining is restricted to only the innermost level in a nested loop. In this work we extend and adapt an existing outer loop pipelining approach known as single dimension software pipelining to generate schedules for field-programmable gate-array (FPGA) hardware coprocessors. Each loop level in nine test loops is pipelined and the resulting schedules are implemented in VHDL and targeted to an Altera Stratix II FPGA. The results show that the fastest solution for all but one of the loops occurs when pipelining is applied one to three levels above the innermost loop. Across the nine test loops we achieve an acceleration over the innermost loop solution of up to seven times, with a mean speedup of 3.2 times. The results suggest that inclusion of outer loop pipelining in future hardware compilers may be worthwhile as it can allow significantly improved results to be achieved at the cost of a small increase in compile time.
Becker T, Jamieson P, Luk W, et al., 2008, Towards benchmarking energy efficiency of reconfigurable architectures, International Conference on Field Programmable Logic and Applications, Publisher: IEEE, Pages: 691-694
Energy research in reconfigurable architectures often involves legacy benchmarks such as the MCNC benchmarks. These benchmarks, however, are not well-suited for assessing energy consumption of reconfigurable technology, since they lack realistic input stimuli. This paper reviews and categorises a range of computation system benchmarks, and shows that there are no comprehensive benchmarks targeting reconfigurable architectures that would stimulate energy or power research. We review existing energy research in the field which involves microbenchmarks, in-house designs, or legacy benchmark suites used to evaluate power optimisations.
Smith AM, Constantinides GA, Cheung PYK, 2008, Integrated Floorplanning, Module-Selection and Architecture Generation for Reconfigurable Devices, IEEE Transactions on VLSI Systems, Vol: 16, Pages: 733-744
Sedcole P, Cheung PYK, 2008, Parametric Yield Modeling and Simulations of FPGA Circuits Considering Within-Die Delay Variations, ACM Transactions on Reconfigurable Technology and Systems (TRETS), Vol: 1, ISSN: 1936-7406
Variations in the semiconductor fabrication process result in differences in parameters between transistors on the same die, a problem exacerbated by lithographic scaling. Field-Programmable Gate Arrays may be able to compensate for within-die delay variability, by judicious use of reconfigurability. This article presents two strategies for compensating within-die stochastic delay variability by using reconfiguration: reconfiguring the entire FPGA, and relocating subcircuits within an FPGA. Analytical models for the theoretical bounds on the achievable gains are derived for both strategies and compared to models for worst-case design as well as statistical static timing analysis (SSTA). All models are validated by comparison to circuit-level Monte Carlo simulations. It is demonstrated that significant improvements in circuit yield and timing are possible using SSTA alone, and these improvements can be enhanced by employing reconfiguration-based techniques.
Mak STS, Sedcole P, Cheung PYK, et al., 2008, Interconnection lengths and delays estimation for communication links in FPGAs, The 2008 international workshop on System level interconnect prediction, Publisher: ACM, Pages: 1-10
This paper presents a new stochastic model to predict interconnection lengths of communication links in FPGAs. Based on a stochastic inter-module routing model, expected length and variance of interconnections have been rigorously derived and, thus, delay can be computed based on the length estimate. The theoretical results are compared with experimental results of lengths and delays, which are obtained from implementations of link circuits in an FPGA. The stochastic model provides an accurate prediction of length with an average error of 6.3%. Results also show that the proposed model produces reliable predictions of delay and therefore the methodology can be applied to early stage planning and design optimization for communication links. Moreover, as a byproduct of this work, we also present in this paper an interesting phenomenon which we term "interconnection fringing". The fringing effect is attributed to the competition for routing resources in a communication link and will lengthen interconnections and, therefore, increase the delay.
Mak T, D'Alessandro C, Sedcole P, et al., 2008, Implementation of Wave-Pipelined Interconnects in FPGAs, Publisher: IEEE, Pages: 213-214
Global interconnection and communication at high clock frequencies are becoming more problematic in FPGAs. In this paper, we address this problem by presenting an interconnect wave-pipelining strategy, which utilizes the existing programmable interconnect fabric to provide high-throughput communication in FPGAs. Two design approaches for interconnect wave-pipelining, using simple clock phase shifting and asynchronous phase encoding, are presented in this paper. Experimental results from a Xilinx Virtex-5 FPGA device are also presented.
Clarke JA, Constantinides GA, Cheung PYK, 2008, Glitch-Aware Output Switching Activity from Word-Level Statistics, Proc. IEEE International Symposium on Circuits and Systems, Pages: 1792-1795
This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.