540 results found
Shao S, Guo L, Guo C, et al., Recursive pipelined genetic propagation for bilevel optimisation, FPL
Inggs G, Thomas DB, Luk W, 2017, A Domain Specific Approach to High Performance Heterogeneous Computing, IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, Vol: 28, Pages: 2-15, ISSN: 1045-9219
Arram J, Pflanzer M, Kaplan T, et al., 2016, FPGA acceleration of reference-based compression for genomic data, Pages: 9-16
© 2015 IEEE.One of the key challenges facing genomics today is efficiently storing the massive amounts of data generated by next-generation sequencing platforms. Reference-based compression is a popular strategy for reducing the size of genomic data, whereby sequence information is encoded as a mapping to a known reference sequence. Determining the mapping is a computationally intensive problem, and is the bottleneck of most reference-based compression tools currently available. This paper presents the first FPGA acceleration of reference-based compression for genomic data. We develop a new mapping algorithm based on the FM-index search operation which includes optimisations targeting the compression ratio and speed. Our hardware design is implemented on a Maxeler MPC-X2000 node comprising 8 Altera Stratix V FPGAs. When evaluated against compression tools currently available, our tool achieves a superior compression ratio, compression time, and energy consumption for both FASTA and FASTQ formats. For example, our tool achieves a 30% higher compression ratio and is 71.9 times faster than the fastqz tool.
Cardoso JMP, Coutinho JGF, Carvalho T, et al., 2016, Performance-driven instrumentation and mapping strategies using the LARA aspect-oriented programming approach, Software - Practice and Experience, Vol: 46, Pages: 251-287, ISSN: 0038-0644
Copyright © 2014 John Wiley & Sons, Ltd.Summary The development of applications for high-performance embedded systems is a long and error-prone process because in addition to the required functionality, developers must consider various and often conflicting nonfunctional requirements such as performance and/or energy efficiency. The complexity of this process is further exacerbated by the multitude of target architectures and mapping tools. This article describes LARA, an aspect-oriented programming language that allows programmers to convey domain-specific knowledge and nonfunctional requirements to a toolchain composed of source-to-source transformers, compiler optimizers, and mapping/synthesis tools. LARA is sufficiently flexible to target different tools and host languages while also allowing the specification of compilation strategies to enable efficient generation of software code and hardware cores (using hardware description languages) for hybrid target architectures - a unique feature to the best of our knowledge not found in any other aspect-oriented programming language. A key feature of LARA is its ability to deal with different models of join points, actions, and attributes. In this article, we describe the LARA approach and evaluate its impact on code instrumentation and analysis and on selecting critical code sections to be migrated to hardware accelerators for two embedded applications from industry.
Grigoras P, Burovskiy P, Luk W, 2016, CASK - Open-source custom architectures for sparse kernels, Pages: 179-184
© 2016 ACM.Sparse matrix vector multiplication (SpMV) is an impor- tant kernel in many scientific applications. To improve the performance and applicability of FPGA based SpMV, we propose an approach for exploiting properties of the input matrix to generate optimised custom architectures. The ar- chitectures generated by our approach are between 3.8 to 48 times faster than the worst case architectures for each matrix, showing the benefits of instance specific design for SpMV.
Grigoras P, Burovskiy P, Luk W, et al., 2016, Optimising Sparse Matrix Vector multiplication for large scale FEM problems on FPGA
© 2016 EPFL.Sparse Matrix Vector multiplication (SpMV) is an important kernel in many scientific applications. In this work we propose an architecture and an automated customisation method to detect and optimise the architecture for block diagonal sparse matrices. We evaluate the proposed approach in the context of the spectral/hp Finite Element Method, using the local matrix assembly approach. This problem leads to a large sparse system of linear equations with block diagonal matrix which is typically solved using an iterative method such as the Preconditioned Conjugate Gradient. The efficiency of the proposed architecture combined with the effectiveness of the proposed customisation method reduces BRAM resource utilisation by as much as 10 times, while achieving identical throughput with existing state of the art designs and requiring minimal development effort from the end user. In the context of the Finite Element Method, our approach enables the solution of larger problems than previously possible, enabling the applicability of FPGAs to more interesting HPC problems.
Hmid SN, Coutinho JGF, Luk W, 2016, A Transfer-Aware Runtime System for Heterogeneous Asynchronous Parallel Execution, ACM SIGARCH Computer Architecture News, Vol: 43, Pages: 40-45, ISSN: 0163-5964
Hung E, Todman T, Luk W, 2016, Transparent In-Circuit Assertions for FPGAs, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Pages: 1-1, ISSN: 0278-0070
Kurek M, Deisenroth MP, Luk W, et al., 2016, Knowledge Transfer in Automatic Optimisation of Reconfigurable Designs, Pages: 84-87
© 2016 IEEE.This paper presents a novel approach for automatic optimisation of reconfigurable design parameters based on knowledge transfer. The key idea is to make use of insights derived from optimising related designs to benefit future optimisations. We show how to use designs targeting one device to speed up optimisation of another device. The proposed approach is evaluated based on various applications including computational finance and seismic imaging. It is capable of achieving up to 35% reduction in optimisation time in producing designs with similar performance, compared to alternative optimisation methods.
Li T, Heinis T, Luk W, 2016, Hashing-based approximate DBSCAN, Pages: 31-45, ISSN: 0302-9743
© Springer International Publishing Switzerland 2016.Analyzing massive amounts of data and extracting value from it has become key across different disciplines. As the amounts of data grow rapidly, however, current approaches for data analysis struggle. This is particularly true for clustering algorithms where distance calculations between pairs of points dominate overall time. Crucial to the data analysis and clustering process, however, is that it is rarely straightforward. Instead, parameters need to be determined through several iterations. Entirely accurate results are thus rarely needed and instead we can sacrifice precision of the final result to accelerate the computation. In this paper we develop ADvaNCE, a new approach to approximating DBSCAN. ADvaNCE uses two measures to reduce distance calculation overhead: (1) locality sensitive hashing to approximate and speed up distance calculations and (2) representative point selection to reduce the number of distance calculations. Our experiments show that our approach is in general one order of magnitude faster (at most 30x in our experiments) than the state of the art.
© 2006 IEEE.This article consists of a collection of slides from the author's conference presentation on optimial custom instruction processors. Some of the specific topics discussed include: the special features and system specifications of extensible processors; design flow capabilities; instruction set selection and bandwidth considerations; applications specific processor synthesis; and both current and future areas of processor technology development.
Ma Y, Zhang C, Luk W, 2016, Hybrid two-stage HW/SW partitioning algorithm for dynamic partial reconfigurable FPGAs, Qinghua Daxue Xuebao/Journal of Tsinghua University, Vol: 56, ISSN: 1000-0054
© 2016, Press of Tsinghua University. All right reserved.More and more hardware platforms are providing dynamic partial reconfiguration; thus, traditional hardware/software partitioning algorithms are no longer applicable. Some studies have analyzed the dynamic partial reconfiguration as mixed-integer linear programming (MILP) models to get solutions. However, the MILP models are slow and can only handle small problems. This paper uses heuristic algorithms to determine the status of some critical tasks to reduce the scale of the MILP problem for large problems. Tests show that this method is about 200 times faster with the same solution quality as the traditional mathematical programming method.
Niu X, Ng N, Yuki T, et al., 2016, EURECA Compilation: Automatic Optimisation of Cycle-Reconfigurable Circuits, 26th International Conference on Field-Programmable Logic and Applications (FPL), Publisher: IEEE, ISSN: 1946-1488
Stroobandt D, Varbanescu AL, Ciobanu CB, et al., 2016, EXTRA: Towards the Exploitation of eXascale Technology for Reconfigurable Architectures, 11th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), Publisher: IEEE
Targett JS, Niu X, Russell F, et al., 2016, Lower precision for higher accuracy: Precision and resolution exploration for shallow water equations, Pages: 208-211
© 2015 IEEE.Accurate forecasts of future climate with numerical models of atmosphere and ocean are of vital importance. However, forecast quality is often limited by the available computational power. This paper investigates the acceleration of a C-grid shallow water model through the use of reduced precision targeting FPGA technology. Using a double-gyre scenario, we show that the mantissa length of variables can be reduced to 14 bits without affecting the accuracy beyond the error inherent in the model. Our reduced precision FPGA implementation runs 5.4 times faster than a double precision FPGA implementation, and 12 times faster than a multi-Threaded CPU implementation. Moreover, our reduced precision FPGA implementation uses 39 times less energy than the CPU implementation and can compute a 100×100 grid for the same energy that the CPU implementation would take for a 29×29 grid.
Wang S, Niu X, Ma N, et al., 2016, A scalable dataflow accelerator for real time onboard hyperspectral image classification, Pages: 105-116, ISSN: 0302-9743
© Springer International Publishing Switzerland 2016.Real-time hyperspectral image classification is a necessary primitive in many remotely sensed image analysis applications. Previous work has shown that Support Vector Machines (SVMs) can achieve high classification accuracy, but unfortunately it is very computationally expensive. This paper presents a scalable dataflow accelerator on FPGA for real-time SVM classification of hyperspectral images.To address data dependencies, we adapt multi-class classifier based on Hamming distance. The architecture is scalable to high problem dimensionality and available hardware resources. Implementation results show that the FPGA design achieves speedups of 26x, 1335x, 66x and 14x compared with implementations on ZYNQ, ARM, DSP and Xeon processors. Moreover, one to two orders of magnitude reduction in power consumption is achieved for the AVRIS hyperspectral image datasets.
Zhou H, Niuy X, Yuan J, et al., 2016, Connect on the fly: Enhancing and prototyping of cycle-reconfigurable modules
© 2016 EPFL.This paper introduces cycle-reconfigurable modules that enhance FPGA architectures with efficient support for dynamic data accesses: data accesses with accessed data size and location known only at runtime. The proposed module adopts new reconfiguration strategies based on dynamic FIFOs, dynamic caches, and dynamic shared memories to significantly reduce configuration generation and routing complexity. We develop a prototype FPGA chip with the proposed cycle-reconfigurable module in the SMIC 130-nm technology. The integrated module takes less than the chip area of 39 CLBs, and reconfigures thousands of runtime connections in 1.2 ns. Applications for largescale sorting, sparse matrix-vector multiplication, and Memcached are developed. The proposed modules enable 1.4 and 11 times reduction in area-delay product compared with those applications mapped to previous architectures and conventional FPGAs.
Arram J, Luk W, Jiang P, 2015, Ramethy: Reconfigurable acceleration of bisulfite sequence alignment, Pages: 250-259
This paper proposes a novel reconfigurable architecture for accelerating DNA sequence alignment. This architecture is applied to bisulfite sequence alignment, a stage in recently developed bioinformatics pipelines for cancer and non-invasive prenatal diagnosis. Alignment is currently the bottleneck in such pipelines, accounting for over 50% of the total analysis time. Our design, Ramethy (Reconfigurable Acceleration of METHYlation data analysis), performs alignment of short reads with up to two mismatches. Ramethy is based on the FM-index, which we optimise to reduce the number of search steps and improve approximate matching performance. We implement Ramethy on a 1U Maxeler MPC-X1000 dataow node consisting of 8 Altera Stratix-V FPGAs. Measured results show a 14.9 times speedup compared to soap2 running with 16 threads on dual Intel Xeon E5-2650 CPUs, and 3.8 times speedup compared to soap3-dp running on an NVIDIA GTX 580 GPU. Upper-bound performance estimates for the MPC-X1000 indicate a maximum speedup of 88.4 times and 22.6 times compared to soap2 and soap3-dp respectively. In addition to runtime, Ramethy consumes over an order of magnitude lower energy while having accuracy identical to soap2 and soap3-dp, making it a strong candidate for integration into bioinformatics pipelines.
Bsoul AAM, Wilton SJE, Tsoi KH, et al., 2015, An FPGA Architecture and CAD Flow Supporting Dynamically Controlled Power Gating, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol: 24, Pages: 178-191, ISSN: 1063-8210
Leakage power is an important component of the total power consumption in field-programmable gate arrays (FPGAs) built using 90-nm and smaller technology nodes. Power gating was shown to be effective at reducing the leakage power. Previous techniques focus on turning OFF unused FPGA resources at configuration time; the benefit of this approach depends on resource utilization. In this paper, we present an FPGA architecture that enables dynamically controlled power gating, in which FPGA resources can be selectively powered down at run-time. This could lead to significant overall energy savings for applications having modules with long idle times. We also present a CAD flow that can be used to map applications to the proposed architecture. We study the area and power tradeoffs by varying the different FPGA architecture parameters and power gating granularity. The proposed CAD flow is used to map a set of benchmark circuits that have multiple power-gated modules to the proposed architecture. Power savings of up to 83% are achievable for these circuits. Finally, we study a control system of a robot that is used in endoscopy. Using the proposed architecture combined with clock gating results in up to 19% energy savings in this application.
Burovskiy P, Grigoras P, Sherwin S, et al., 2015, Efficient assembly for high order unstructured FEM meshes
© 2015 Imperial College.The Finite Element Method (FEM) is a common numerical technique used for solving Partial Differential Equations (PDEs) on complex domain geometries. Large and unstructured FEM meshes are used to represent the computation domains which makes an efficient mapping of the Finite Element Method onto FPGAS particularly challenging. The focus of this paper is on assembly mapping, a key kernel of FEM, which induces the sparse and unstructured nature of the problem. We translate FEM vector assembly mapping into data access scheduling to perform vector assembly directly on the FPGA, as part of the hardware pipeline. We show how to efficiently partition the problem into dense and sparse sub-problems which map well onto FPGAS. The proposed approach, implemented on a single FPGA could outperform highly optimised FEM software running on two Xeon E5-2640 processors.
Cheung K, Schultz SR, Luk W, 2015, NeuroFlow: A General Purpose Spiking Neural Network Simulation Platform using Customizable Processors., Front Neurosci, Vol: 9, ISSN: 1662-4548
NeuroFlow is a scalable spiking neural network simulation platform for off-the-shelf high performance computing systems using customizable hardware processors such as Field-Programmable Gate Arrays (FPGAs). Unlike multi-core processors and application-specific integrated circuits, the processor architecture of NeuroFlow can be redesigned and reconfigured to suit a particular simulation to deliver optimized performance, such as the degree of parallelism to employ. The compilation process supports using PyNN, a simulator-independent neural network description language, to configure the processor. NeuroFlow supports a number of commonly used current or conductance based neuronal models such as integrate-and-fire and Izhikevich models, and the spike-timing-dependent plasticity (STDP) rule for learning. A 6-FPGA system can simulate a network of up to ~600,000 neurons and can achieve a real-time performance of 400,000 neurons. Using one FPGA, NeuroFlow delivers a speedup of up to 33.6 times the speed of an 8-core processor, or 2.83 times the speed of GPU-based platforms. With high flexibility and throughput, NeuroFlow provides a viable environment for large-scale neural network simulation.
Cheung P, Luk W, Silvano C, 2015, Preface
Ciobanu CB, Varbanescu AL, Pnevmatikatos D, et al., 2015, EXTRA: Towards an Efficient Open Platform for Reconfigurable High Performance Computing, IEEE 18th International Conference on Computational Science and Engineering (CSE), Publisher: IEEE, Pages: 339-342
Denholm S, Inoue H, Takenaka T, et al., 2015, Network-level FPGA acceleration of low latency market data feed arbitration, IEICE Transactions on Information and Systems, Vol: E98D, Pages: 288-297, ISSN: 0916-8532
Financial exchanges provide market data feeds to update their members about changes in the market. Feed messages are often used in time-critical automated trading applications, and two identical feeds (A and B feeds) are provided in order to reduce message loss. A key challenge is to support A/B line arbitration efficiently to compensate for missing packets, while offering flexibility for various operational modes such as prioritising for low latency or for high data reliability. This paper presents a reconfigurable acceleration approach for A/B arbitration operating at the network level, capable of supporting any messaging protocol. Two modes of operation are provided simultaneously: one prioritising low latency, and one prioritising high reliability with three dynamically configurable windowing methods. We also present a model for message feed processing latencies that is useful for evaluating scalability in future applications. We outline a new low latency, high throughput architecture and demonstrate a cycle-accurate testing framework to measure the actual latency of packets within the FPGA. We implement and compare the performance of the NASDAQ TotalView-ITCH, OPRA and ARCA market data feed protocols using a Xilinx Virtex-6 FPGA. For high reliability messages we achieve latencies of 42ns for TotalView-ITCH and 36.75ns for OPRA and ARCA. 6ns and 5.25ns are obtained for low latency messages. The most resource intensive protocol, TotalView-ITCH, is also implemented in a Xilinx Virtex- 5 FPGA within a network interface card; it is used to validate our approach with real market data. We offer latencies 10 times lower than an FPGA-based commercial design and 4.1 times lower than the hardware-accelerated IBM PowerEN processor, with throughputs more than double the required 10Gbps line rate.
Düben PD, Russell FP, Niu X, et al., 2015, On the use of programmable hardware and reduced numerical precision in earth-system modeling, Journal of Advances in Modeling Earth Systems, Vol: 7, Pages: 1393-1408
© 2015. The Authors.Programmable hardware, in particular Field Programmable Gate Arrays (FPGAs), promises a significant increase in computational performance for simulations in geophysical fluid dynamics compared with CPUs of similar power consumption. FPGAs allow adjusting the representation of floating-point numbers to specific application needs. We analyze the performance-precision trade-off on FPGA hardware for the two-scale Lorenz '95 model. We scale the size of this toy model to that of a high-performance computing application in order to make meaningful performance tests. We identify the minimal level of precision at which changes in model results are not significant compared with a maximal precision version of the model and find that this level is very similar for cases where the model is integrated for very short or long intervals. It is therefore a useful approach to investigate model errors due to rounding errors for very short simulations (e.g., 50 time steps) to obtain a range for the level of precision that can be used in expensive long-term simulations. We also show that an approach to reduce precision with increasing forecast time, when model errors are already accumulated, is very promising. We show that a speed-up of 1.9 times is possible in comparison to FPGA simulations in single precision if precision is reduced with no strong change in model error. The single-precision FPGA setup shows a speed-up of 2.8 times in comparison to our model implementation on two 6-core CPUs for large model setups.
Funie AI, Grigoras P, Burovskiy P, et al., 2015, Reconfigurable Acceleration of Fitness Evaluation in Trading Strategies, 26th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), Publisher: IEEE, Pages: 210-217, ISSN: 1063-6862
Gan L, Fu H, Luk W, et al., 2015, Solving the global atmospheric equations through heterogeneous reconfigurable platforms, ACM Transactions on Reconfigurable Technology and Systems, Vol: 8, ISSN: 1936-7414
One of the most essential and challenging components in climate modeling is the atmospheric model. To solve multiphysical atmospheric equations, developers have to face extremely complex stencil kernels that are costly in terms of both computing and memory resources. This article aims to accelerate the solution of global shallow water equations (SWEs), which is one of the most essential equation sets describing atmospheric dynamics. We first design a hybrid methodology that employs both the host CPU cores and the field-programmable gate array (FPGA) accelerators to work in parallel. Through a careful adjustment of the computational domains, we achieve a balanced resource utilization and a further improvement of the overall performance. By decomposing the resource-demanding SWE kernel, we manage to map the double-precision algorithm into three FPGAs. Moreover, by using fixed-point and reduced-precision floating point arithmetic, we manage to build a fully pipelined mixed-precision design on a single FPGA, which can perform 428 floating-point and 235 fixed-point operations per cycle. The mixed-precision design with four FPGAs running together can achieve a speedup of 20 over a fully optimized design on a CPU rack with two eight-core processorsand is 8 times faster than the fully optimized Kepler GPU design. As for power efficiency, the mixed-precision design with four FPGAs is 10 times more power efficient than a Tianhe-1A supercomputer node.
Grigoras P, Burovskiy P, Hung E, et al., 2015, Accelerating SpMV on FPGAS by compressing nonzero values, Pages: 64-67
© 2015 IEEE.Sparse matrix vector multiplication (SpMV) is an important kernel in many areas of scientific computing, especially as a building block for iterative linear system solvers. We study how loss less nonzero compression can be used to overcome memory bandwidth limitations in FPGA-based SpMV implementations. We introduce a dictionary-based compression algorithm which reduces redundant nonzero values to improve memory bandwidth without reducing computation efficiency by making use of spare FPGA resources. We show how a sparse matrix in the CSR format can be converted to the proposed storage format on the CPU and that average compression ratios of 1.14 - 1.40 and up to 2.65 times can be achieved, over CSR, for relevant matrices in our benchmarks.
This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.