Ang SS, Constantinides GA, Luk W, et al., 2008, Custom parallel caching schemes for hardware-accelerated image compression, Journal of Real-Time Image Processing, Vol: 3, Pages: 289-302
In an effort to achieve lower bandwidth requirements, video compression algorithms have become increasingly complex. Consequently, the deployment of these algorithms on field programmable gate arrays (FPGAs) is becoming increasingly desirable, because of the computational parallelism on these platforms as well as the measure of flexibility afforded to designers. Typically, video data are stored in large and slow external memory arrays, but the impact of the memory access bottleneck may be reduced by buffering frequently used data in fast on-chip memories. The order of the memory accesses resulting from many compression algorithms is dependent on the input data (Jain in Proceedings of the IEEE, pp. 349–389, 1981). These data-dependent memory accesses complicate the exploitation of data re-use, and subsequently reduce the extent to which an application may be accelerated. In this paper, we present a hybrid memory sub-system which is able to capture data re-use effectively in spite of data-dependent memory accesses. This memory sub-system is made up of a custom parallel cache and a scratchpad memory (SPM). Further, the framework is capable of exploiting 2D spatial locality, which is frequently exhibited in the access patterns of image processing applications. In a case study involving the quad-tree structured difference pulse code modulation (QSDPCM) application, the impact of data dependence on memory accesses is shown to be significant. In comparison with an implementation which only employs an SPM, performance improvements of up to 1.7× and 1.4× are observed through actual implementation on two modern FPGA platforms. These performance improvements are more pronounced for image sequences exhibiting greater inter-frame movements. In addition, reductions of on-chip memory resources by up to 3.2× are achievable using this framework. These results indicate that, on custom hardware platforms, there is substantial scope for improvement in the capture of data re-us
Atasu K, Ozturan C, Dundar G, et al., 2008, CHIPS: Custom Hardware Instruction Processor Synthesis, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol: 27, Pages: 528-541
This paper describes an integer linear programming (ILP) based system called CHIPS that identifies custom instructions for critical code segments, given the available data bandwidth and transfer latencies between custom logic and a baseline processor with architecturally visible state registers. Our approach enables designers to optionally constrain the number of input and output operands for custom instructions. We describe a design flow to identify promising area, performance and code size tradeoffs. We study the effect of input/output constraints, register file ports, and compiler transformations such as if-conversion. Our experiments show that, in most cases, the solutions with the highest performance are identified when the input/output constraints are removed. However, input/output constraints help our algorithms identify frequently used code segments, reducing the overall area overhead. Results for 11 benchmarks covering cryptography and multimedia are shown, with speed-ups between 1.7 and 6.6 times, code size reductions between 6% and 72%, and area costs ranging between 12 and 256 adders for maximum speed-up. Our ILP based approach scales well: benchmarks with basic blocks consisting of more than 1000 instructions can be optimally solved, most of the time within a few seconds.
Bower JA, Cho WN, Luk W, 2007, Unifying FPGA hardware development, Pages: 113-120
In current FPGA development environments, complex projects often end up in an ad-hoc tangle of programming systems; examples include Perl, Makefiles, and Verilog and/or VHDL. To combat this we develop an approach to FPGA development in which a single specification is used to combine: high- and low-level description of custom hardware, parameterisation of existing IP and project build. In this paper we present an abstract overview of our unified approach and a prototype implementation called YAHDL, composed of a set of libraries written in the object-oriented software language Ruby. To explore YAHDL's effectiveness we apply it to an existing project, creating FPGA hardware designs for floating-point Monte Carlo simulations. With this case-study we show it is possible to use YAHDL to simplify the generation of application specific instances of our Monte Carlo architectures while achieving performance in the 200-300MHz range. © 2007 IEEE.
Luk W, 2007, Field-programmable technology: Today's and tomorrow's, Proceedings of the Topical Workshop on Electronics for Particle Physics, TWEPP 2007, Pages: 47-53
Good: Moore's Law; bad: productivity gap. Vision: unified design synthesis and analysis. Devices and design today: growing gap between amount of I/O and amount of logic; enhance optimality and re-use (I/O driven). Devices tomorrow: hybrid FPGA (multi-granularity fabric); 3D FPGA (customisable system-in-package). Design tomorrow: guided synthesis (optimised and portable design); data representation optimisation; upgradable and self-tuned design.
Fahmy SA, Bouganis C, Cheung PYK, et al., 2007, Real-time hardware acceleration of the trace transform, Journal of Real-Time Image Processing, Vol: 2, Pages: 235-248, ISSN: 1861-8200
Sedcole P, Cheung PYK, Constantinides GA, et al., 2007, Run-Time Integration of Reconfigurable Video Processing Systems, IEEE Trans VLSI Systems, Vol: 15, Pages: 1003-1016
Ho C, Yu C, Leong P, et al., 2007, Domain-Specific FPGA: Architecture and Floating Point Applications, International Conference on Field Programmable Logic and Applications (FPL), Pages: 196-201
This paper presents a novel architecture for domain-specific FPGA devices. This architecture can be optimised for both speed and density by exploiting domain-specific information to produce efficient reconfigurable logic with multiple granularity. In the reconfigurable logic, general-purpose fine-grained units are used for implementing control logic and bit-oriented operations, while domain-specific coarse-grained units and heterogeneous blocks are used for implementing datapaths; the precise amount of each type of resources can be customised to suit specific application domains. Issues and challenges associated with the design flow and the architecture modelling are addressed. Examples of the proposed architecture for speeding up floating point applications are illustrated. Current results indicate that the proposed architecture can achieve 2.5 times improvement in speed and 18 times reduction in area on average, when compared with traditional FPGA devices on selected floating point benchmark circuits.
Cheung RCC, Lee D-U, Luk W, et al., 2007, Hardware generation of arbitrary random number distributions from uniform distributions via the inversion method, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol: 15, Pages: 952-962, ISSN: 1063-8210
Mak TST, Sedcole P, Cheung PYK, et al., 2007, Average interconnection delay estimation for on-FPGA communication links, Electronics Letters, Vol: 43, Pages: 918-919
A new method is presented and an analytical expression is derived for average interconnection delay estimation. This method is directly applicable to predicting the average delay for high-bandwidth communication links implemented on FPGAs. The theoretical results are compared with the measured data from the actual circuits and an average error of 4.6% is reported.
Thomas DB, Luk W, 2007, Non-uniform random number generation through piecewise linear approximations, 16th International Conference on Field Programmable Logic and Applications, Publisher: INST ENGINEERING TECHNOLOGY-IET, Pages: 312-321, ISSN: 1751-8601
Thomas DB, Luk W, 2007, High quality uniform random number generation using LUT optimised state-transition matrices, 4th IEEE International Conference on Field Programmable Technology, Publisher: SPRINGER, Pages: 77-92, ISSN: 0922-5773
Becker T, Luk W, Cheung PYK, 2007, Enhancing relocatability of partial bitstreams for run-time reconfiguration, 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, Publisher: IEEE COMPUTER SOC, Pages: 35-+
Cope B, Cheung PYK, Luk W, 2007, Bridging the gap between FPGAs and multi-processor architectures: A video processing perspective, 18th IEEE International Conference on Application-Specific Systems, Architectures and Processors, Publisher: IEEE, Pages: 308-+, ISSN: 1063-6862
Atasu K, Dimond RG, Mencer O, et al., 2007, Optimizing Instruction-set Extensible Processors under Data Bandwidth Constraints, Design, Automation and Test in Europe Conference and Exhibition (DATE), Pages: 588-593
Juvonen MPT, Coutinho JGF, Luk W, 2007, Hardware architectures for adaptive background modelling, 2007 3rd Southern Conference on Programmable Logic, Proceedings, Pages: 149-+
Todman T, Luk W, 2007, Domain Specific Transformations for Hardware Ray Tracing, 30th WoTUG Technical Meeting 2007, Publisher: IOS PRESS, Pages: 479-+, ISSN: 1383-7575
De Bosschere K, Luk W, Martorell X, et al., 2007, High-performance embedded architecture and compilation roadmap, Transactions on High-Performance Embedded Architectures and Compilers I, Vol: 4050, Pages: 5-29, ISSN: 0302-9743
Osborne WG, Cheung RCC, Coutinho JGF, et al., 2007, Automatic accuracy-guaranteed bit-width optimization for fixed and floating-point systems, 17th International Conference on Field Programmable Logic and Applications, Publisher: IEEE, Pages: 617-620, ISSN: 1946-1488
Thomas DB, Luk W, 2007, Sampling from the Multivariate Gaussian distribution using reconfigurable hardware, 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, Publisher: IEEE COMPUTER SOC, Pages: 3-+
Thomas DB, Luk W, 2007, A domain specific language for reconfigurable path-based Monte Carlo simulations, Annual International Conference on Field Programmable Technology, Publisher: IEEE, Pages: 97-104
Todman T, Fu H, Mencer O, et al., 2007, Improving bounds for FPGA logic minimization, Annual International Conference on Field Programmable Technology, Publisher: IEEE, Pages: 245-248
Osborne WG, Coutinho JGF, Cheung RCC, et al., 2007, Instrumented multi-stage word-length optimization, Annual International Conference on Field Programmable Technology, Publisher: IEEE, Pages: 89-96
Thomas DB, Bower JA, Luk W, 2007, Automatic generation and optimisation of reconfigurable financial Monte-Carlo simulations, 18th IEEE International Conference on Application-Specific Systems, Architectures and Processors, Publisher: IEEE, Pages: 168-173, ISSN: 2160-0511
Mak TST, Sedcole P, Cheung PYK, et al., 2007, A hybrid analog-digital routing network for NoC dynamic routing, 1st International Symposium on Networks-on-Chip, Publisher: IEEE COMPUTER SOC, Pages: 173-+
Thomas DB, Luk W, Stumpf M, 2007, Reconfigurable hardware acceleration of canonical graph labelling, 3rd International Workshop on Applied Reconfigurable Computing, Publisher: SPRINGER-VERLAG BERLIN, Pages: 302-+, ISSN: 0302-9743
Wilton SJE, Ho CH, Leong PHW, et al., 2007, A Synthesizable Datapath-Oriented Embedded FPGA Fabric, 15th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Publisher: ASSOC COMPUTING MACHINERY, Pages: 33-41
Sano K, Pell O, Luk W, et al., 2007, FPGA-based streaming computation for lattice Boltzmann method, Annual International Conference on Field Programmable Technology, Publisher: IEEE, Pages: 233-+
Mencer O, Fu H, Luk W, 2007, Optimizing Logarithmic Arithmetic on FPGAs, IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), Pages: 163-172
This paper proposes optimizations of the methods and parameters used in both mathematical approximation and hardware design for logarithmic number system (LNS) arithmetic. First, we introduce a general polynomial approximation approach with an adaptive divide-in-halves segmentation method for evaluation of LNS arithmetic functions. Second, we develop a library generator that automatically generates optimized LNS arithmetic units with a wide bit-width range from 21 to 64 bits, to support LNS application development and design exploration. The basic arithmetic units are tested on practical FPGA boards as well as software simulation. When compared with existing LNS designs, our generated units provide in most cases 6% to 37% reduction in area and 20% to 50% reduction in latency. The key challenge for LNS remains on the application level. We show the performance of LNS versus floating-point for realistic applications: digital sine/cosine waveform generator, matrix multiplication and radiative Monte Carlo simulation. Our infrastructure for fast prototyping LNS FPGA applications allows us to efficiently study LNS number representation and its tradeoffs in speed and size when compared with floating-point designs.
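For readers unfamiliar with LNS, the following is a minimal software sketch of the general idea, not the paper's hardware design or library generator. It assumes positive values only and base-2 logarithms. All function names are hypothetical. The `log2(1 + 2^d)` term in `lns_add` is the kind of nonlinear function that an LNS unit has to approximate, for example with segmented polynomials. Here it is simply computed with the standard math library.

```python
import math

def to_lns(x: float) -> float:
    """Encode a positive real as its base-2 logarithm (sign handling omitted)."""
    if x <= 0:
        raise ValueError("this sketch only handles positive values")
    return math.log2(x)

def from_lns(e: float) -> float:
    """Decode an LNS value back to an ordinary real number."""
    return 2.0 ** e

def lns_mul(a: float, b: float) -> float:
    """Multiplication in LNS is an addition of the logarithms."""
    return a + b

def lns_add(a: float, b: float) -> float:
    """Addition in LNS: max(a, b) + log2(1 + 2^(min - max)).

    The log2(1 + 2^d) term, with d <= 0, is the costly nonlinear part.
    """
    hi, lo = max(a, b), min(a, b)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))

# Example: 3 * 5 and 3 + 5 computed entirely on the logarithms.
product = from_lns(lns_mul(to_lns(3.0), to_lns(5.0)))
total = from_lns(lns_add(to_lns(3.0), to_lns(5.0)))
```

The sketch shows why multiplication is cheap in LNS while addition is the hard case, which is consistent with the abstract's focus on approximating LNS arithmetic functions.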
Ang S-S, Constantinides GA, Luk W, et al., 2007, A Hybrid Memory Sub-system for Video Coding Applications