Jin Q, Becker T, Luk W, et al., 2012, Optimising explicit finite difference option pricing for dynamic constant reconfiguration, Pages: 165-172
This paper demonstrates a novel optimisation methodology to adjust stencil based numerical procedures from the algorithm level, so as to reduce not only the amount of hardware resource consumption per kernel but also the amount of computation required to achieve desired result accuracy, when mapping the algorithm to reconfigurable hardware using dynamic constant reconfiguration. As a result, less area is needed to support run-time reconfiguration, and fewer computational steps are required in the numerical procedure to obtain a result with a given error tolerance. We analyse one thousand fixed point implementations on a Virtex-6 XC6VLX760 FPGA for randomly generated option pricing problems, which are representative of industrial computation. When comparing optimised implementations to the un-optimised ones, the reconfiguration area upper bound is reduced by 22%; the average number of computational steps is reduced by 23%; and the area-computation-time product is reduced by 40%; while the numerical errors of the results are kept below the error tolerance level used in industry. © 2012 IEEE.
Niu X, Jin Q, Luk W, et al., 2012, Exploiting run-time reconfiguration in stencil computation, Pages: 173-180
Stencil computation is computationally intensive and required by many applications. This paper proposes an approach to exploit run-time reconfigurability of field-programmable accelerators for stencil computation. System throughput is optimized by partitioning, analysing and scheduling tasks in applications to remove idle functions. To evaluate the proposed approach, Reverse Time Migration (RTM), a high performance application, is developed. Our optimized runtime reconfigurable solution, which targets a Virtex-6 FPGA in a Maxeler MAX3424A system, achieves an improved throughput of 102.8 GFlop/s, up to two orders of magnitude faster than the CPU reference designs, 1.59 times faster than the best published GPU and FPGA results, and 1.45 times faster than an optimized static implementation. © 2012 IEEE.
Chau TCP, Luk W, Cheung PYK, et al., 2012, Adaptive sequential Monte Carlo approach for real-time applications, Pages: 527-530
This paper presents an adaptive Sequential Monte Carlo approach for real-time applications. The Sequential Monte Carlo method is employed to estimate the states of dynamic systems using weighted particles. The proposed approach reduces the run-time computation complexity by adapting the size of the particle set. Multiple processing elements on FPGAs are dynamically allocated for improved energy efficiency without violating real-time constraints. A robot localisation application is developed based on the proposed approach. Compared to a non-adaptive implementation, the dynamic energy consumption is reduced by up to 70% without affecting the quality of solutions. © 2012 IEEE.
Todman T, Luk W, 2012, Verification of streaming designs by combining symbolic simulation and equivalence checking, Pages: 203-208
As design complexity grows, verification becomes a bottleneck in design development and implementation. This paper describes a novel approach for verifying reconfigurable streaming designs, based on symbolic simulation and equivalence checking. Compared with numerical simulation, symbolic simulation provides a more informative way of showing that a design behaves as expected; equivalence checking enables automatic checking of equivalence of symbolic expressions. Our approach has been implemented for designs targeting Maxeler technologies, using an easy-to-use symbolic simulator and the Yices equivalence checker, together with other facilities such as an output combiner to support an automated verification flow. Several benchmarks, including one-dimensional convolution and finite difference computation, are used to evaluate the proposed approach. © 2012 IEEE.
The synthesis and mapping of applications to configurable embedded systems is a notoriously hard process. Tools have a wide range of parameters, which interact in very unpredictable ways, thus creating a large and complex design space. When exploring this space, designers must understand the interfaces to the various tools and apply, often manually, a sequence of tool-specific transformations making this an extremely cumbersome and error-prone process. This paper describes the use of aspect-oriented techniques for capturing synthesis strategies for tuning the performance of applications' kernels. We illustrate the use of this approach when designing application-specific architectures generated by a high-level synthesis tool. The results highlight the impact of the various strategies when targeting custom hardware and expose the difficulties in devising these strategies. © 2012 IEEE.
Pnevmatikatos D, Becker T, Brokalakis A, et al., 2012, FASTER: Facilitating analysis and synthesis technologies for effective reconfiguration, Pages: 234-241
The FASTER project aims to ease the definition, implementation and use of dynamically changing hardware systems. Our motivation stems from the promise reconfigurable systems hold for achieving better performance and extending product functionality and lifetime via the addition of new features that work at hardware speed. This is a clear advantage over the more straightforward software component adaptivity. However, designing a changing hardware system is both challenging and time consuming. The FASTER project will facilitate the use of reconfigurable technology by providing a complete methodology that enables designers to easily specify, analyse, implement and verify applications on platforms with general-purpose processors and acceleration modules implemented in the latest reconfigurable technology. To better adapt to different application requirements, the tool-chain will support both region-based and micro-reconfiguration and provide a flexible run-time system that will efficiently manage the reconfigurable resources. We will use applications from the embedded, high performance computing, and desktop domains to demonstrate the potential benefits of the FASTER tools on metrics such as performance, power consumption and total ownership cost. © 2012 IEEE.
Spacey S, Luk W, Kelly PHJ, et al., 2012, Improving communication latency with the write-only architecture, JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, Vol: 72, Pages: 1617-1627, ISSN: 0743-7315
Liu Q, Todman T, Luk W, et al., 2012, Optimizing Hardware Design by Composing Utility-Directed Transformations, IEEE TRANSACTIONS ON COMPUTERS, Vol: 61, Pages: 1800-1812, ISSN: 0018-9340
Cheung K, Schultz SR, Luk W, 2012, A large-scale spiking neural network accelerator for FPGA systems, Pages: 113-120, ISSN: 0302-9743
Spiking neural networks (SNN) aim to mimic membrane potential dynamics of biological neurons. They have been used widely in neuromorphic applications and neuroscience modeling studies. We design a parallel SNN accelerator for producing large-scale cortical simulation targeting an off-the-shelf Field-Programmable Gate Array (FPGA)-based system. The accelerator parallelizes synaptic processing with run time proportional to the firing rate of the network. Using only one FPGA, this accelerator is estimated to support simulation of 64K neurons 2.5 times real-time, and achieves a spike delivery rate which is at least 1.4 times faster than a recent GPU accelerator with a benchmark toroidal network. © 2012 Springer-Verlag.
Coutinho JGF, Carvalho T, Durand S, et al., 2012, Experiments with the LARA aspect-oriented approach, Pages: 27-30
This demonstration presents a novel design-flow and aspect-oriented language called LARA, which is currently used to guide the mapping of high-level C application codes to heterogeneous high-performance embedded systems. In particular, LARA is capable of capturing complex strategies and schemes involving: hardware/software partitioning, code specialization, source code transformations and code instrumentation. A key element of LARA, and a distinguishing feature from existing approaches, is its ability to support the specification of non-functional requirements and user knowledge in a non-invasive way in the exploration of suitable implementations. The design-flow incorporates several tools, such as a LARA frontend, a hardware/software partitioning tool, an aspect weaver, cost estimator, and a source-level transformation engine. All these components can be coordinated as part of an elaborate application mapping strategy using LARA. In this demonstration, we illustrate how non-functional cross-cutting concerns such as runtime monitorization and performance are codified and described in LARA and how the weaving process affects selected applications. Furthermore, we also explain how third-party tools, such as gprof, can be incorporated into the design-flow and aspect description, for instance, to affect the hardware/software partitioning process. We demonstrate how LARA can be used to extract run-time information, such as range values of variables, and can control code transformations and compiler optimizations addressing customized implementations of the corresponding computations on FPGAs. © 2012 ACM.
Cardoso JMP, Carvalho T, Coutinho JGF, et al., 2012, LARA: An aspect-oriented programming language for embedded systems, Pages: 179-190
The development of applications for high-performance embedded systems is typically a long and error-prone process. In addition to the required functions, developers must consider various and often conflicting non-functional application requirements such as performance and energy efficiency. The complexity of this process is exacerbated by the multitude of target architectures and the associated retargetable mapping tools. This paper introduces an Aspect-Oriented Programming (AOP) approach that conveys domain knowledge and non-functional requirements to optimizers and mapping tools. We describe a novel AOP language, LARA, which allows the specification of compilation strategies to enable efficient generation of software code and hardware cores for alternative target architectures. We illustrate the use of LARA for code instrumentation and analysis, and for guiding the application of compiler and hardware synthesis optimizations. An important LARA feature is its capability to deal with different join points, action models, and attributes, and to generate an aspect intermediate representation. We present examples of our aspect-oriented hardware/software design flow for mapping real-life application codes to embedded platforms based on Field Programmable Gate Array (FPGA) technology. © 2012 ACM.
Tse AHT, Thomas D, Luk W, 2012, Design Exploration of Quadrature Methods in Option Pricing, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol: 20, Pages: 818-826, ISSN: 1063-8210
Thomas DB, Luk W, 2012, The LUT-SR Family of Uniform Random Number Generators for FPGA Architectures, IEEE Transactions on Very Large Scale Integration Systems, Vol: PP
Field-programmable gate array (FPGA) optimized random number generators (RNGs) are more resource-efficient than software-optimized RNGs because they can take advantage of bitwise operations and FPGA-specific features. However, it is difficult to concisely describe FPGA-optimized RNGs, so they are not commonly used in real-world designs. This paper describes a type of FPGA RNG called a LUT-SR RNG, which takes advantage of bitwise XOR operations and the ability to turn lookup tables (LUTs) into shift registers of varying lengths. This provides a good resource-quality balance compared to previous FPGA-optimized generators, between the previous high-resource high-period LUT-FIFO RNGs and low-resource low-quality LUT-OPT RNGs, with quality comparable to the best software generators. The LUT-SR generators can also be expressed using a simple C++ algorithm contained within this paper, allowing 60 fully-specified LUT-SR RNGs with different characteristics to be embedded in this paper, backed up by an online set of very high speed integrated circuit hardware description language (VHDL) generators and test benches.
Tse AHT, Chow GCT, Jin Q, et al., 2012, Optimising performance of quadrature methods with reduced precision, Pages: 251-263, ISSN: 0302-9743
This paper presents a generic precision optimisation methodology for quadrature computation targeting reconfigurable hardware to maximise performance at a given error tolerance level. The proposed methodology optimises performance by considering integration grid density versus mantissa size of floating-point operators. The optimisation provides the number of integration points and mantissa size with maximised throughput while meeting a given error tolerance requirement. Three case studies show that the proposed reduced precision designs on a Virtex-6 SX475T FPGA are up to 6 times faster than comparable FPGA designs with double precision arithmetic. They are up to 15.1 times faster and 234.9 times more energy efficient than an i7-870 quad-core CPU, and are 1.2 times faster and 42.2 times more energy efficient than a Tesla C2070 GPU. © 2012 Springer-Verlag.
Liu Q, Luk W, 2012, Heterogeneous systems for energy efficient scientific computing, Pages: 64-75, ISSN: 0302-9743
This paper introduces a novel approach for exploring heterogeneous computing engines which include GPUs and FPGAs as accelerators. Our goal is to systematically automate finding solutions for such engines that maximize energy efficiency while meeting requirements in throughput and in resource constraints. The proposed approach, based on a linear programming model, enables optimization of system throughput and energy efficiency, and analysis of energy efficiency sensitivity and power consumption issues. It can be used in evaluating current and future computing hardware and interfaces to identify appropriate combinations. A heterogeneous system containing a CPU, a GPU and an FPGA with a PCI Express interface is studied based on the High Performance Linpack application. Results indicate that such a heterogeneous computing system is able to provide energy-efficient solutions to scientific computing with various performance demands. The improvement of system energy efficiency is more sensitive to some of the system components; for example, in the studied system, concurrently improving the energy efficiency of the interface and the GPU by 10 times could lead to over 10 times improvement of the system energy efficiency. © 2012 Springer-Verlag.
Efficient communication between nodes is critical for achieving high performance in a computer cluster. Based on a dedicated inter-accelerator network, we enhance this communication with advanced networking functions, such as broadcasting and priority routing. This work enables decoupling user applications from physical network implementations, improving overall communication efficiency and modularity. A performance model is introduced taking into account application and platform specific parameters. Experiments are performed for various network configurations and application patterns. The results show up to a 55% reduction of communication time when employing our approach. © 2012 Springer-Verlag.
Jin Q, Dong D, Tse AHT, et al., 2012, Multi-level customisation framework for curve based Monte Carlo financial simulations, Pages: 187-201, ISSN: 0302-9743
One of the main challenges when accelerating financial applications using reconfigurable hardware is the management of design complexity. This paper proposes a multi-level customisation framework for automatic generation of complex yet highly efficient curve based financial Monte Carlo simulators on reconfigurable hardware. By identifying multiple levels of functional specialisations and the optimal data format for the Monte Carlo simulation, we allow different levels of programmability in our framework to retain good performance and support multiple applications. Designs targeting a Virtex-6 SX475T FPGA generated by our framework are about 40 times faster than single-core software implementations on an i7-870 quad-core CPU at 2.93 GHz; they are over 10 times faster and 20 times more energy efficient than 4-core implementations on the same i7-870 quad-core CPU, and are over three times more energy efficient and 36% faster than a highly optimised implementation on an NVIDIA Tesla C2070 GPU at 1.15 GHz. In addition, our framework is platform independent and can be extended to support CPU and GPU applications. © 2012 Springer-Verlag.
Liu Q, Todman T, Luk W, et al., 2012, Automated Mapping of the MapReduce Pattern onto Parallel Computing Platforms, JOURNAL OF SIGNAL PROCESSING SYSTEMS FOR SIGNAL IMAGE AND VIDEO TECHNOLOGY, Vol: 67, Pages: 65-78, ISSN: 1939-8018
Yiu KFC, Lu Y, Ho CH, et al., 2012, Reconfigurable FPGA-based switching path frequency-domain echo canceller with applications to voice control device, DIGITAL SIGNAL PROCESSING, Vol: 22, Pages: 376-390, ISSN: 1051-2004
Ng N, Yoshida N, Niu XY, et al., 2012, Session types: towards safe and fast reconfigurable programming, ACM SIGARCH Computer Architecture News, Vol: 40, Pages: 22-22, ISSN: 0163-5964
This paper introduces a new programming framework based on the theory of session types for safe, reconfigurable parallel designs. We apply the session type theory to C and Java programming languages and demonstrate that the session-based languages can offer a clear and tractable framework to describe communications between parallel components and guarantee communication-safety and deadlock-freedom by compile-time type checking. Many representative communication topologies such as a ring or scatter-gather can be programmed and verified in session-based programming languages. Case studies involving N-body simulation and K-means clustering are used to illustrate the session-based programming style and to demonstrate that the session-based languages perform competitively against MPI counterparts in an FPGA-based heterogeneous cluster, as well as the potential of integrating them with FPGA acceleration.
© Springer Science+Business Media B.V. 2012. In the last decade automotive audio has been gaining great attention from the scientific and industrial communities. In this context, a new approach to test and develop advanced audio algorithms for a heterogeneous embedded platform has been proposed within the European hArtes project. A real audio laboratory installed in a real car (hArtes CarLab) has been developed employing professional audio equipment. The algorithms can be tested and validated on a PC exploiting each application as a plug-in of the real time NU-Tech framework. Then a set of tools (hArtes Toolchain) can be used to generate code for the embedded platform starting from the plug-in implementation. An overview of the whole system is presented here, taking into consideration a complete set of audio algorithms developed for the advanced car infotainment system (ACIS) that is composed of three main different applications regarding the In Car listening and communication experience. Starting from a high level description of the algorithms, several implementations on different levels of hardware abstraction are presented, along with empirical results on both the design process undergone and the performance results achieved.
Heinrich G, Logemann F, Hahn V, et al., 2012, Audio array processing for telepresence, Hardware/Software Co-design for Heterogeneous Multi-core Platforms: The hArtes Toolchain, Pages: 125-153, ISBN: 9789400714052
© Springer Science+Business Media B.V. 2012. This chapter presents embedded implementations of two audio array processing algorithms for a telepresence application as usage examples of the hArtes tool-chain and platform. The first algorithm, multi-channel wide-band beamforming, may be used to record an acoustic field in a room with an array of microphones; the second, wave-field synthesis, may be used to render an acoustic field with an array of loudspeakers. While these algorithms have parallelisms and kernel functions typical for their algorithm class, they are chosen to be simple in structure, which makes it easier to follow implementation considerations. Starting from an overview of the application and structure of the algorithms in question, several implementations on different levels of hardware abstraction are presented, along with empirical results on both the design process supported and the processing performance achieved.
Santambrogio MD, Pnevmatikatos D, Papadimitriou K, et al., 2012, Smart Technologies for Effective Reconfiguration: The FASTER approach, 2012 7TH INTERNATIONAL WORKSHOP ON RECONFIGURABLE AND COMMUNICATION-CENTRIC SYSTEMS-ON-CHIP (RECOSOC)
Cardoso JMP, Teixeira J, Alves JC, et al., 2012, Specifying Compiler Strategies for FPGA-based Systems, 20th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), Publisher: IEEE, Pages: 192-199
Shan Y, Wang Z, Wang W, et al., 2012, FPGA based Memory Efficient High Resolution Stereo Vision System for Video Tolling, 11th International Conference on Field-Programmable Technology (FPT), Publisher: IEEE, Pages: 29-32
Papadimitriou K, Pilato C, Pnevmatikatos D, et al., 2012, Novel Design Methods and a Tool Flow for Unleashing Dynamic Reconfiguration, 15th IEEE International Conference on Computational Science and Engineering (CSE) / 10th IEEE/IFIP International Conference on Embedded and Ubiquitous Computing (EUC), Publisher: IEEE, Pages: 391-398, ISSN: 1949-0828
Niu X, Tsoi KH, Luk W, 2012, Self-Adaptive Heterogeneous Cluster with Wireless Network, 26th IEEE International Parallel and Distributed Processing Symposium (IPDPS) / Workshop on High Performance Data Intensive Computing, Publisher: IEEE, Pages: 306-311, ISSN: 2164-7062
Sato Y, Inoguchi Y, Luk W, et al., 2012, Evaluating Reconfigurable Dataflow Computing Using the Himeno Benchmark, International Conference on Reconfigurable Computing and FPGAs (ReConFig), Publisher: IEEE, ISSN: 2325-6532
Todman T, Boehm P, Luk W, 2012, Verification of streaming hardware and software codesigns, 11th International Conference on Field-Programmable Technology (FPT), Publisher: IEEE, Pages: 147-150
This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.