Hamid SN, Coutinho JGF, Luk W, 2016, A Transfer-Aware Runtime System for Heterogeneous Asynchronous Parallel Execution, ACM SIGARCH Computer Architecture News, Vol: 43, Pages: 40-45, ISSN: 0163-5964
Wang S, Niu X, Ma N, et al., 2016, A scalable dataflow accelerator for real time onboard hyperspectral image classification, Rio de Janeiro, Brazil, Publisher: Springer International Publishing, Pages: 105-116, ISSN: 0302-9743
Real-time hyperspectral image classification is a necessary primitive in many remotely sensed image analysis applications. Previous work has shown that Support Vector Machines (SVMs) can achieve high classification accuracy, but they are very computationally expensive. This paper presents a scalable dataflow accelerator on FPGA for real-time SVM classification of hyperspectral images. To address data dependencies, we adapt a multi-class classifier based on Hamming distance. The architecture is scalable to high problem dimensionality and available hardware resources. Implementation results show that the FPGA design achieves speedups of 26x, 1335x, 66x and 14x compared with implementations on ZYNQ, ARM, DSP and Xeon processors. Moreover, one to two orders of magnitude reduction in power consumption is achieved for the AVIRIS hyperspectral image datasets.
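The Hamming-distance multi-class scheme can be sketched in software as output-code decoding: each class is assigned a binary codeword, and the predicted class is the codeword closest to the observed binary classifier outputs. This is a hypothetical illustration only, not the paper's FPGA design; the codebook and class names are invented.

```python
# Multi-class decision via Hamming distance over binary classifier outputs.
# Each class has a codeword; the prediction is the codeword with minimal
# Hamming distance to the bits produced by the binary SVMs.

def hamming(a, b):
    """Number of positions where bit tuples a and b differ."""
    return sum(x != y for x, y in zip(a, b))

def classify(bits, codebook):
    """Return the class whose codeword is closest to the observed bits."""
    return min(codebook, key=lambda c: hamming(bits, codebook[c]))

# Hypothetical 3-class codebook over four binary classifier outputs.
codebook = {
    "soil":       (0, 0, 0, 0),
    "vegetation": (1, 1, 1, 0),
    "water":      (0, 1, 1, 1),
}

print(classify((1, 1, 0, 0), codebook))  # → vegetation (distance 1)
```

Because the decision is a nearest-codeword lookup, it avoids the data dependencies of chained one-vs-one voting, which is what makes such schemes attractive for pipelined hardware.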
Ma Y, Zhang C, Luk W, 2016, Hybrid two-stage HW/SW partitioning algorithm for dynamic partial reconfigurable FPGAs, Qinghua Daxue Xuebao/Journal of Tsinghua University, Vol: 56, ISSN: 1000-0054
More and more hardware platforms provide dynamic partial reconfiguration, so traditional hardware/software partitioning algorithms are no longer applicable. Some studies have modelled partitioning for dynamic partial reconfiguration as mixed-integer linear programming (MILP) problems to obtain solutions. However, MILP models are slow and can only handle small problems. This paper uses heuristic algorithms to determine the status of some critical tasks, reducing the scale of the MILP model for large problems. Tests show that this method is about 200 times faster than the traditional mathematical programming method, with the same solution quality.
Arram J, Kaplan T, Luk W, et al., 2016, Leveraging FPGAs for accelerating short read alignment, IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol: 14, Pages: 668-677, ISSN: 1545-5963
One of the key challenges facing genomics today is how to efficiently analyze the massive amounts of data produced by next-generation sequencing platforms. With general-purpose computing systems struggling to address this challenge, specialized processors such as the Field-Programmable Gate Array (FPGA) are receiving growing interest. The means by which to leverage this technology for accelerating genomic data analysis is however largely unexplored. In this paper, we present a runtime reconfigurable architecture for accelerating short read alignment using FPGAs. This architecture exploits the reconfigurability of FPGAs to allow the development of fast yet flexible alignment designs. We apply this architecture to develop an alignment design which supports exact and approximate alignment with up to two mismatches. Our design is based on the FM-index, with optimizations to improve the alignment performance. In particular, the n-step FM-index, index oversampling, a seed-and-compare stage, and bi-directional backtracking are included. Our design is implemented and evaluated on a 1U Maxeler MPC-X2000 dataflow node with eight Altera Stratix-V FPGAs. Measurements show that our design is 28 times faster than Bowtie2 running with 16 threads on dual Intel Xeon E5-2640 CPUs, and nine times faster than Soap3-dp running on an NVIDIA Tesla C2070 GPU.
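The FM-index search at the core of such designs can be illustrated by the classic backward-search algorithm over the Burrows-Wheeler transform. This is a minimal software sketch of the standard algorithm; the paper's n-step index, oversampling, seeding and backtracking optimisations are omitted.

```python
# FM-index backward search: count occurrences of a pattern in a text by
# repeatedly narrowing a suffix-array interval using the BWT, C-table and
# rank (occ) queries, processing the pattern from last character to first.

def bwt(text):
    """Burrows-Wheeler transform of text (must end with sentinel '$')."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)

def fm_count(pattern, text):
    """Count occurrences of pattern in text via backward search."""
    b = bwt(text)
    first = sorted(b)
    # C[c]: number of characters in the text strictly smaller than c.
    C = {c: first.index(c) for c in set(b)}
    def occ(c, i):  # occurrences of c in b[:i] (naive rank query)
        return b[:i].count(c)
    lo, hi = 0, len(b)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

print(fm_count("ana", "banana$"))  # → 2
```

Each pattern character costs one interval update, which is what makes the search loop a natural fit for a deep hardware pipeline; production indexes replace the naive `occ` scan with sampled rank tables.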
Grigoras P, Burovskiy P, Luk W, 2016, CASK - Open-source custom architectures for sparse kernels, 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '16), Publisher: ACM, Pages: 179-184
Sparse matrix vector multiplication (SpMV) is an important kernel in many scientific applications. To improve the performance and applicability of FPGA based SpMV, we propose an approach for exploiting properties of the input matrix to generate optimised custom architectures. The architectures generated by our approach are between 3.8 to 48 times faster than the worst case architectures for each matrix, showing the benefits of instance specific design for SpMV.
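For context, the SpMV kernel being specialised can be sketched over the common compressed sparse row (CSR) format. This is a generic software illustration, not the custom architectures the paper generates.

```python
# SpMV y = A @ x with A stored in CSR form: a flat array of nonzero
# values, their column indices, and row pointers delimiting each row.
# Instance-specific FPGA designs specialise this loop per input matrix.

def spmv_csr(values, col_idx, row_ptr, x):
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

# A = [[2, 0, 1],
#      [0, 3, 0]]
values  = [2.0, 1.0, 3.0]
col_idx = [0, 2, 1]
row_ptr = [0, 2, 3]
print(spmv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # → [3.0, 3.0]
```

The irregular, matrix-dependent access pattern through `col_idx` is precisely the property that instance-specific hardware can exploit.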
Cardoso JMP, Coutinho JGF, Carvalho T, et al., 2016, Performance-driven instrumentation and mapping strategies using the LARA aspect-oriented programming approach, SOFTWARE-PRACTICE & EXPERIENCE, Vol: 46, Pages: 251-287, ISSN: 0038-0644
Targett JS, Niu X, Russell F, et al., 2016, Lower precision for higher accuracy: precision and resolution exploration for shallow water equations, FPT 2015, Publisher: IEEE, Pages: 208-211
Accurate forecasts of future climate with numerical models of atmosphere and ocean are of vital importance. However, forecast quality is often limited by the available computational power. This paper investigates the acceleration of a C-grid shallow water model through the use of reduced precision targeting FPGA technology. Using a double-gyre scenario, we show that the mantissa length of variables can be reduced to 14 bits without affecting the accuracy beyond the error inherent in the model. Our reduced precision FPGA implementation runs 5.4 times faster than a double precision FPGA implementation, and 12 times faster than a multi-threaded CPU implementation. Moreover, our reduced precision FPGA implementation uses 39 times less energy than the CPU implementation and can compute a 100×100 grid for the same energy that the CPU implementation would take for a 29×29 grid.
Arram J, Pflanzer M, Kaplan T, et al., 2016, FPGA acceleration of reference-based compression for genomic data, 2015 International Conference on Field Programmable Technology, Publisher: IEEE
One of the key challenges facing genomics today is efficiently storing the massive amounts of data generated by next-generation sequencing platforms. Reference-based compression is a popular strategy for reducing the size of genomic data, whereby sequence information is encoded as a mapping to a known reference sequence. Determining the mapping is a computationally intensive problem, and is the bottleneck of most reference-based compression tools currently available. This paper presents the first FPGA acceleration of reference-based compression for genomic data. We develop a new mapping algorithm based on the FM-index search operation which includes optimisations targeting the compression ratio and speed. Our hardware design is implemented on a Maxeler MPC-X2000 node comprising 8 Altera Stratix V FPGAs. When evaluated against compression tools currently available, our tool achieves a better compression ratio, faster compression, and lower energy consumption for both FASTA and FASTQ formats. For example, our tool achieves a 30% higher compression ratio and is 71.9 times faster than the fastqz tool.
Cheung K, Schultz SR, Luk W, 2016, NeuroFlow: A General Purpose Spiking Neural Network Simulation Platform using Customizable Processors, Frontiers in Neuroscience, Vol: 9, ISSN: 1662-4548
NeuroFlow is a scalable spiking neural network simulation platform for off-the-shelf high performance computing systems using customizable hardware processors such as Field-Programmable Gate Arrays (FPGAs). Unlike multi-core processors and application-specific integrated circuits, the processor architecture of NeuroFlow can be redesigned and reconfigured to suit a particular simulation to deliver optimized performance, such as the degree of parallelism to employ. The compilation process supports using PyNN, a simulator-independent neural network description language, to configure the processor. NeuroFlow supports a number of commonly used current or conductance based neuronal models such as integrate-and-fire and Izhikevich models, and the spike-timing-dependent plasticity (STDP) rule for learning. A 6-FPGA system can simulate a network of up to ~600,000 neurons and can achieve a real-time performance of 400,000 neurons. Using one FPGA, NeuroFlow delivers a speedup of up to 33.6 times the speed of an 8-core processor, or 2.83 times the speed of GPU-based platforms. With high flexibility and throughput, NeuroFlow provides a viable environment for large-scale neural network simulation.
Kurek M, Becker T, Guo C, et al., 2016, Self-aware hardware acceleration of financial applications on a heterogeneous cluster, Natural Computing Series, Pages: 241-260
This chapter describes self-awareness in four financial applications. We apply some of the design patterns of Chapter 5 and techniques of Chapter 7. We describe three applications briefly, highlighting the links to self-awareness and self-expression. The applications are (i) a hybrid genetic programming and particle swarm optimisation approach for high-frequency trading, with fitness function evaluation accelerated by FPGA; (ii) an adaptive point process model for currency trading, accelerated by FPGA hardware; (iii) an adaptive line arbitrator synthesising high-reliability and low-latency feeds from redundant data feeds (A/B feeds) using FPGA hardware. Finally, we describe in more detail a generic optimisation approach for reconfigurable designs automating design optimisation, using reconfigurable hardware to speed up the optimisation process, applied to applications including a quadrature-based financial application. In each application, the hardware-accelerated self-aware approaches give significant benefits: up to 55× speedup for hardware-accelerated design optimisation compared to software hill climbing.
Niu X, Todman T, Luk W, 2016, Self-adaptive hardware acceleration on a heterogeneous cluster, Natural Computing Series, Pages: 167-192
Building a cluster of computers is a common technique to significantly improve the throughput of computationally intensive applications. Communication networks connect hundreds to thousands of compute nodes to form a cluster system, where a parallelisable application workload is distributed into the compute nodes. Theoretically, heterogeneous clusters with various types of processing units are more efficient than homogeneous clusters, since some types of processing units perform better than others on certain applications. A heterogeneous cluster can achieve better cluster performance by adapting cluster configurations to assign applications to processing elements that fit well with the applications. In this chapter we describe how to build a heterogeneous cluster that can adapt to application requirements. Section 9.1 provides an overview of heterogeneous computing. Section 9.2 presents the commonly used hardware and software architectures of heterogeneous clusters. Section 9.3 discusses the use of self-awareness and self-adaptivity in two runtime scenarios of a heterogeneous cluster, and Section 9.4 presents the experimental results. Finally, Section 9.5 discusses approaches to formally verify the developed applications.
Li T, Heinis T, Luk W, 2016, Hashing-Based Approximate DBSCAN, Pages: 31-45
Düben PD, Russell FP, Niu X, et al., 2015, On the use of programmable hardware and reduced numerical precision in earth-system modeling, Journal of Advances in Modeling Earth Systems, Vol: 7, Pages: 1393-1408, ISSN: 1942-2466
Programmable hardware, in particular Field Programmable Gate Arrays (FPGAs), promises a significant increase in computational performance for simulations in geophysical fluid dynamics compared with CPUs of similar power consumption. FPGAs allow adjusting the representation of floating-point numbers to specific application needs. We analyze the performance-precision trade-off on FPGA hardware for the two-scale Lorenz '95 model. We scale the size of this toy model to that of a high-performance computing application in order to make meaningful performance tests. We identify the minimal level of precision at which changes in model results are not significant compared with a maximal precision version of the model and find that this level is very similar for cases where the model is integrated for very short or long intervals. It is therefore a useful approach to investigate model errors due to rounding errors for very short simulations (e.g., 50 time steps) to obtain a range for the level of precision that can be used in expensive long-term simulations. We also show that an approach to reduce precision with increasing forecast time, when model errors are already accumulated, is very promising. We show that a speed-up of 1.9 times is possible in comparison to FPGA simulations in single precision if precision is reduced with no strong change in model error. The single-precision FPGA setup shows a speed-up of 2.8 times in comparison to our model implementation on two 6-core CPUs for large model setups.
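The effect of shortening the mantissa can be emulated in software by zeroing the low-order mantissa bits of a double. This is a sketch of the idea only; the FPGA designs described above use genuinely narrower arithmetic units rather than post-hoc truncation.

```python
# Emulate reduced floating-point precision by clearing low-order bits of
# the 52-bit float64 mantissa, mimicking a shortened-mantissa format.

import struct

def truncate_mantissa(x, bits):
    """Keep only the top `bits` bits of the float64 mantissa of x."""
    (i,) = struct.unpack("<Q", struct.pack("<d", x))   # raw 64-bit pattern
    mask = ~((1 << (52 - bits)) - 1) & 0xFFFFFFFFFFFFFFFF
    (y,) = struct.unpack("<d", struct.pack("<Q", i & mask))
    return y

x = 1.0 / 3.0
print(truncate_mantissa(x, 14))  # close to 1/3, relative error below 2**-14
```

Running a model with such an emulator for a short integration (e.g., 50 time steps, as suggested above) indicates how few mantissa bits the long, expensive simulations can tolerate.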
Inggs G, Thomas DB, Constantinides G, et al., 2015, Seeing shapes in clouds: On the performance-cost trade-off for heterogeneous infrastructure-as-a-service, Second International Workshop on FPGAs for Software Programmers (FSP 2015)
In the near future FPGAs will be available by the hour, however this new Infrastructure as a Service (IaaS) usage mode presents both an opportunity and a challenge: The opportunity is that programmers can potentially trade resources for performance on a much larger scale, for much shorter periods of time than before. The challenge is in finding and traversing the trade-off for heterogeneous IaaS that guarantees increased resources result in the greatest possible increased performance. Such a trade-off is Pareto optimal. The Pareto optimal trade-off for clusters of heterogeneous resources can be found by solving multiple, multi-objective optimisation problems, resulting in an optimal allocation of tasks to the available platforms. Solving these optimisation programs can be done using simple heuristic approaches or formal Mixed Integer Linear Programming (MILP) techniques. When pricing 128 financial options using a Monte Carlo algorithm upon a heterogeneous cluster of Multicore CPU, GPU and FPGA platforms, the MILP approach produces a trade-off that is up to 110% faster than a heuristic approach, and over 50% cheaper. These results suggest that high quality performance-resource trade-offs of heterogeneous IaaS are best realised through a formal optimisation approach.
Russell FP, Düben PD, Niu X, et al., 2015, Architectures and Precision Analysis for Modelling Atmospheric Variables with Chaotic Behaviour, FCCM 2015, Publisher: IEEE, Pages: 171-178
The computationally intensive nature of atmospheric modelling is an ideal target for hardware acceleration. Performance of hardware designs can be improved through the use of reduced precision arithmetic, but maintaining appropriate accuracy is essential. We explore reduced precision optimisation for simulating chaotic systems, targeting atmospheric modelling in which even minor changes in arithmetic behaviour can have a significant impact on system behaviour. Hence, standard techniques for comparing numerical accuracy are inappropriate. We use the Hellinger distance to compare statistical behaviour between reduced-precision CPU implementations to guide FPGA designs of a chaotic system, and analyse accuracy, performance and power efficiency of the resulting implementations. Our results show that with only a limited loss in accuracy corresponding to less than 10% uncertainty in input parameters, a single Xilinx Virtex 6 SXT475 FPGA can be 13 times faster and 23 times more power efficient than a 6-core Intel Xeon X5650 processor.
Pnevmatikatos D, Papadimitriou K, Becker T, et al., 2015, FASTER: Facilitating Analysis and Synthesis Technologies for Effective Reconfiguration, MICROPROCESSORS AND MICROSYSTEMS, Vol: 39, Pages: 321-338, ISSN: 0141-9331
Thomas DB, Guo L, Guo C, et al., 2015, Pipelined Genetic Propagation, IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM)
Genetic Algorithms (GAs) are a class of numerical and combinatorial optimisers which are especially useful for solving complex non-linear and non-convex problems. However, the required execution time often limits their application to small-scale or latency-insensitive problems, so techniques to increase the computational efficiency of GAs are needed. FPGA-based acceleration has significant potential for speeding up genetic algorithms, but existing FPGA GAs are limited by the generational approaches inherited from software GAs. Many parts of the generational approach do not map well to hardware, such as the large shared population memory and intrinsic loop-carried dependency. To address this problem, this paper proposes a new hardware-oriented approach to GAs, called Pipelined Genetic Propagation (PGP), which is intrinsically distributed and pipelined. PGP represents a GA solver as a graph of loosely coupled genetic operators, which allows the solution to be scaled to the available resources, and also to dynamically change topology at run-time to explore different solution strategies. Experiments show that pipelined genetic propagation is effective in solving seven different applications. Our PGP design is 5 times faster than a recent FPGA-based GA system, and 90 times faster than a CPU-based GA system.
Niu X, Chau TCP, Jin Q, et al., 2015, Automating Elimination of Idle Functions by Runtime Reconfiguration, ACM TRANSACTIONS ON RECONFIGURABLE TECHNOLOGY AND SYSTEMS, Vol: 8, ISSN: 1936-7406
Gan L, Fu H, Luk W, et al., 2015, Solving the global atmospheric equations through heterogeneous reconfigurable platforms, ACM Transactions on Reconfigurable Technology and Systems, Vol: 8, ISSN: 1936-7414
One of the most essential and challenging components in climate modeling is the atmospheric model. To solve multiphysical atmospheric equations, developers have to face extremely complex stencil kernels that are costly in terms of both computing and memory resources. This article aims to accelerate the solution of global shallow water equations (SWEs), which is one of the most essential equation sets describing atmospheric dynamics. We first design a hybrid methodology that employs both the host CPU cores and the field-programmable gate array (FPGA) accelerators to work in parallel. Through a careful adjustment of the computational domains, we achieve a balanced resource utilization and a further improvement of the overall performance. By decomposing the resource-demanding SWE kernel, we manage to map the double-precision algorithm into three FPGAs. Moreover, by using fixed-point and reduced-precision floating point arithmetic, we manage to build a fully pipelined mixed-precision design on a single FPGA, which can perform 428 floating-point and 235 fixed-point operations per cycle. The mixed-precision design with four FPGAs running together can achieve a speedup of 20 over a fully optimized design on a CPU rack with two eight-core processors, and is 8 times faster than the fully optimized Kepler GPU design. As for power efficiency, the mixed-precision design with four FPGAs is 10 times more power efficient than a Tianhe-1A supercomputer node.
Hung E, Levine J, Stott E, et al., 2015, Delay-Bounded Routing for Shadow Registers, 23rd ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Publisher: Association for Computing Machinery., Pages: 56-65
The on-chip timing behaviour of synchronous circuits can be quantified at run-time by adding shadow registers, which allow designers to sample the most critical paths of a circuit at a different point in time than the user register would normally. In order to sample these paths precisely, the path skew between the user and the shadow register must be tightly controlled and consistent across all paths that are shadowed. Unlike a custom IC, FPGAs contain prefabricated resources from which composing an arbitrary routing delay is not trivial. This paper presents a method for inserting shadow registers with a minimum skew bound, whilst also reducing the maximum skew. To preserve circuit timing, we apply this to FPGA circuits post place-and-route, using only the spare resources left behind. We find that our techniques can achieve an average STA reported delay bound of ± 200ps on a Xilinx device despite incomplete timing information, and achieve <1ps accuracy against our own delay model.
Arram J, Luk W, Jiang P, 2015, Ramethy: Reconfigurable acceleration of bisulfite sequence alignment, Pages: 250-259
This paper proposes a novel reconfigurable architecture for accelerating DNA sequence alignment. This architecture is applied to bisulfite sequence alignment, a stage in recently developed bioinformatics pipelines for cancer and non-invasive prenatal diagnosis. Alignment is currently the bottleneck in such pipelines, accounting for over 50% of the total analysis time. Our design, Ramethy (Reconfigurable Acceleration of METHYlation data analysis), performs alignment of short reads with up to two mismatches. Ramethy is based on the FM-index, which we optimise to reduce the number of search steps and improve approximate matching performance. We implement Ramethy on a 1U Maxeler MPC-X1000 dataflow node consisting of 8 Altera Stratix-V FPGAs. Measured results show a 14.9 times speedup compared to soap2 running with 16 threads on dual Intel Xeon E5-2650 CPUs, and 3.8 times speedup compared to soap3-dp running on an NVIDIA GTX 580 GPU. Upper-bound performance estimates for the MPC-X1000 indicate a maximum speedup of 88.4 times and 22.6 times compared to soap2 and soap3-dp respectively. In addition to runtime, Ramethy consumes over an order of magnitude lower energy while having accuracy identical to soap2 and soap3-dp, making it a strong candidate for integration into bioinformatics pipelines.
Niu X, Luk W, Wang Y, 2015, EURECA: On-chip configuration generation for effective dynamic data access, Pages: 74-83
This paper describes Effective Utilities for Run-timE Configuration Adaptation (EURECA), a novel memory architecture for supporting effective dynamic data access in reconfigurable devices. EURECA exploits on-chip configuration generation to reconfigure active connections in such devices cycle by cycle. When integrated into a baseline architecture based on the Virtex-6 SX475T, the EURECA memory architecture introduces small area, delay and power overhead. Three benchmark applications are developed with the proposed architecture targeting social networking (Memcached), scientific computing (sparse matrix-vector multiplication), and in-memory database (large-scale sorting). Compared with conventional static designs, up to 14.9 times reduction in area, 2.2 times reduction in critical-path delay, and 32.1 times reduction in area-delay product are achieved.
Bsoul AAM, Wilton SJE, Tsoi KH, et al., 2015, An FPGA Architecture and CAD Flow Supporting Dynamically Controlled Power Gating, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol: 24, Pages: 178-191, ISSN: 1063-8210
Leakage power is an important component of the total power consumption in field-programmable gate arrays (FPGAs) built using 90-nm and smaller technology nodes. Power gating was shown to be effective at reducing the leakage power. Previous techniques focus on turning OFF unused FPGA resources at configuration time; the benefit of this approach depends on resource utilization. In this paper, we present an FPGA architecture that enables dynamically controlled power gating, in which FPGA resources can be selectively powered down at run-time. This could lead to significant overall energy savings for applications having modules with long idle times. We also present a CAD flow that can be used to map applications to the proposed architecture. We study the area and power tradeoffs by varying the different FPGA architecture parameters and power gating granularity. The proposed CAD flow is used to map a set of benchmark circuits that have multiple power-gated modules to the proposed architecture. Power savings of up to 83% are achievable for these circuits. Finally, we study a control system of a robot that is used in endoscopy. Using the proposed architecture combined with clock gating results in up to 19% energy savings in this application.
Denholm S, Inoue H, Takenaka T, et al., 2015, Network-level FPGA acceleration of low latency market data feed arbitration, IEICE Transactions on Information and Systems, Vol: E98D, Pages: 288-297, ISSN: 0916-8532
Financial exchanges provide market data feeds to update their members about changes in the market. Feed messages are often used in time-critical automated trading applications, and two identical feeds (A and B feeds) are provided in order to reduce message loss. A key challenge is to support A/B line arbitration efficiently to compensate for missing packets, while offering flexibility for various operational modes such as prioritising for low latency or for high data reliability. This paper presents a reconfigurable acceleration approach for A/B arbitration operating at the network level, capable of supporting any messaging protocol. Two modes of operation are provided simultaneously: one prioritising low latency, and one prioritising high reliability with three dynamically configurable windowing methods. We also present a model for message feed processing latencies that is useful for evaluating scalability in future applications. We outline a new low latency, high throughput architecture and demonstrate a cycle-accurate testing framework to measure the actual latency of packets within the FPGA. We implement and compare the performance of the NASDAQ TotalView-ITCH, OPRA and ARCA market data feed protocols using a Xilinx Virtex-6 FPGA. For high reliability messages we achieve latencies of 42ns for TotalView-ITCH and 36.75ns for OPRA and ARCA. 6ns and 5.25ns are obtained for low latency messages. The most resource intensive protocol, TotalView-ITCH, is also implemented in a Xilinx Virtex-5 FPGA within a network interface card; it is used to validate our approach with real market data. We offer latencies 10 times lower than an FPGA-based commercial design and 4.1 times lower than the hardware-accelerated IBM PowerEN processor, with throughputs more than double the required 10Gbps line rate.
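The low-latency arbitration mode can be sketched as forwarding each sequence number the first time it arrives on either feed and dropping the later duplicate. This is a simplified software illustration; the actual design operates at the network level in hardware, with configurable windowing for the high-reliability mode.

```python
# A/B line arbitration, low-latency mode: interleaved events from the two
# redundant feeds arrive as (feed, sequence number, payload); each sequence
# number is forwarded exactly once, whichever feed delivers it first.

def arbitrate(feed_events):
    """Yield each payload the first time its sequence number is seen."""
    seen = set()
    for feed, seq, payload in feed_events:
        if seq not in seen:
            seen.add(seq)
            yield payload

events = [
    ("A", 1, "m1"), ("B", 1, "m1"),   # duplicate from B is dropped
    ("B", 2, "m2"), ("A", 2, "m2"),   # B happened to arrive first here
    ("A", 4, "m4"),                   # seq 3 was lost on both feeds
]
print(list(arbitrate(events)))  # → ['m1', 'm2', 'm4']
```

A reliability-prioritising arbitrator would instead hold back out-of-order messages for a bounded window in the hope that the missing sequence number (3 above) still arrives on the other feed, trading latency for completeness.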
Grigoras P, Burovskiy P, Hung E, et al., 2015, Accelerating SpMV on FPGAs by Compressing Nonzero Values, 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Publisher: IEEE, Pages: 64-67
Burovskiy P, Grigoras P, Sherwin S, et al., 2015, Efficient Assembly for High Order Unstructured FEM Meshes, 25th International Conference on Field Programmable Logic and Applications, Publisher: IEEE, ISSN: 1946-1488
Funie AI, Grigoras P, Burovskiy P, et al., 2015, Reconfigurable Acceleration of Fitness Evaluation in Trading Strategies, 26th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), Publisher: IEEE, Pages: 210-217, ISSN: 2160-0511
Shao S, Guo L, Guo C, et al., 2015, Recursive Pipelined Genetic Propagation for Bilevel Optimisation, 25th International Conference on Field Programmable Logic and Applications, Publisher: IEEE, ISSN: 1946-1488
This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.