Cardamone S, Kimmitt JRR, Burton HGA, et al., 2019, Field-programmable gate arrays and quantum Monte Carlo: Power efficient coprocessing for scalable high-performance computing, INTERNATIONAL JOURNAL OF QUANTUM CHEMISTRY, Vol: 119, ISSN: 0020-7608
Wang E, Davis J, Zhao R, et al., 2019, Deep Neural Network Approximation for Custom Hardware: Where We've Been, Where We're Going, ACM Computing Surveys, Vol: 52, Pages: 40:1-40:39, ISSN: 0360-0300
Deep neural networks have proven to be particularly effective in visual and audio recognition tasks. Existing models tend to be computationally expensive and memory intensive, however, and so methods for hardware-oriented approximation have become a hot topic. Research has shown that custom hardware-based neural network accelerators can surpass their general-purpose processor equivalents in terms of both throughput and energy efficiency. Application-tailored accelerators, when co-designed with approximation-based network training methods, transform large, dense and computationally expensive networks into small, sparse and hardware-efficient alternatives, increasing the feasibility of network deployment. In this article, we provide a comprehensive evaluation of approximation methods for high-performance network inference along with in-depth discussion of their effectiveness for custom hardware implementation. We also include proposals for future research based on a thorough analysis of current trends. This article represents the first survey providing detailed comparisons of custom hardware accelerators featuring approximation for both convolutional and recurrent neural networks, through which we hope to inspire exciting new developments in the field.
Li W, He C, Fu H, et al., 2019, A real-time tree crown detection approach for large-scale remote sensing images on FPGAs, Remote Sensing, Vol: 11, ISSN: 2072-4292
The on-board real-time tree crown detection from high-resolution remote sensing images is beneficial for avoiding the delay between data acquisition and processing, reducing the quantity of data transmission from the satellite to the ground, monitoring the growing condition of individual trees, and discovering the damage of trees as early as possible, etc. Existing high performance platform based tree crown detection studies either focus on processing images in a small size or suffer from high power consumption or slow processing speed. In this paper, we propose the first FPGA-based real-time tree crown detection approach for large-scale satellite images. A pipelined-friendly and resource-economic tree crown detection algorithm (PF-TCD) is designed through reconstructing and modifying the workflow of the original algorithm into three computational kernels on FPGAs. Compared with the well-optimized software implementation of the original algorithm on an Intel 12-core CPU, our proposed PF-TCD obtains the speedup of 18.75 times for a satellite image with a size of 12,188 × 12,576 pixels without reducing the detection accuracy. The image processing time for the large-scale remote sensing image is only 0.33 s, which satisfies the requirements of the on-board real-time data processing on satellites.
Noronha DH, Zhao R, Goeders J, et al., 2019, On-chip FPGA debug instrumentation for machine learning applications, Pages: 110-115
FPGAs provide a promising implementation option for many machine learning applications. Although simulations or software models can be used to explore the design space of these applications, often the final behaviour cannot be evaluated until the design is mapped to the FPGA and integrated into the target system. This may be because long run-times are required, or because the environment cannot be adequately described using a software model. Once unexpected behaviour is observed, on-chip debug is notoriously difficult; typically a design is instrumented with on-chip trace buffers that record the run-time behaviour for later interrogation. In this paper, we describe instrumentation that can accelerate the process of debugging machine learning applications implemented on an FPGA. Unlike previous work, our instrumentation is optimized to take advantage of characteristics of this application domain. Our instruments gather useful domain-specific information about the observed variables instead of recording the raw values of those elements. Results show that the proposed instruments provide at least 17.8x longer visibility in the most conservative of our experiments at a low area and latency cost.
Xu J, Fu H, Shi W, et al., 2019, Performance Tuning and Analysis for Stencil-Based Applications on POWER8 Processor, ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, Vol: 15, ISSN: 1544-3566
Liu S, Chu RSW, Wang X, et al., 2019, Optimizing CNN-Based Hyperspectral Image Classification on FPGAs, Pages: 17-31, ISSN: 0302-9743
Hyperspectral image (HSI) classification has been widely adopted in remote sensing imagery analysis applications which require high classification accuracy and real-time processing speed. Convolutional neural networks (CNNs)-based methods have been proven to achieve state-of-the-art accuracy in classifying HSIs. However, CNN models are often too computationally intensive to achieve real-time response due to the high dimensional nature of HSI, compared to traditional methods such as Support Vector Machines (SVMs). Besides, previous CNN models used in HSI are not specially designed for efficient implementation on embedded devices such as FPGAs. This paper proposes a novel CNN-based algorithm for HSI classification which takes into account hardware efficiency and thus is more hardware friendly compared to prior CNN models. An optimized and customized architecture which maps the proposed algorithm on FPGA is then proposed to support real-time on-board classification with low power consumption. Implementation results show that our proposed accelerator on a Xilinx Zynq 706 FPGA board is more than 70× faster than an Intel 8-core Xeon CPU and 3× faster than an NVIDIA GeForce 1080 GPU. Compared to previous SVM-based FPGA accelerators, we achieve comparable processing speed but provide a much higher classification accuracy.
Russell FP, Targett JS, Luk W, 2018, From Tensor Algebra to Hardware Accelerators: Generating Streaming Architectures for Solving Partial Differential Equations, 2018 IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP), Publisher: IEEE, ISSN: 1063-6862
Hardware accelerators are attractive targets for running scientific simulations due to their power efficiency. Since large software simulations can take person-years to develop, it is often impractical to use hardware acceleration, which requires significantly more development effort and expertise than software development. We present the design and implementation of a proof-of-concept compiler toolchain which enables rapid prototyping of hardware finite difference solvers for partial differential equations, generated from a high-level domain specific language. Multiple fields, grid staggering and non-linear terms are supported. We demonstrate that our approach is practical by generating and evaluating hardware designs derived from the heat and simplified shallow water equations.
Shao S, Tsai J, Mysior M, et al., 2018, Towards Hardware Accelerated Reinforcement Learning for Application-Specific Robotic Control, 2018 IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP), Publisher: IEEE, ISSN: 1063-6862
Reinforcement Learning (RL) is an area of machine learning in which an agent interacts with the environment by making sequential decisions. The agent receives reward from the environment based on how good the decisions are and tries to find an optimal decision-making policy that maximises its long-term cumulative reward. This paper presents a novel approach which has shown promise in applying accelerated simulation of RL policy training to automating the control of a real robot arm for specific applications. The approach has two steps. First, design space exploration techniques are developed to enhance performance of an FPGA accelerator for RL policy training based on Trust Region Policy Optimisation (TRPO), which results in a 43% speed improvement over a previous FPGA implementation, while achieving 4.65 times speed up against deep learning libraries running on GPU and 19.29 times speed up against CPU. Second, the trained RL policy is transferred to a real robot arm. Our experiments show that the trained arm can successfully reach to and pick up predefined objects, demonstrating the feasibility of our approach.
Zhao R, Liu S, Ng H, et al., 2018, Hardware Compilation of Deep Neural Networks: An Overview (invited), IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP) 2018, Publisher: IEEE, Pages: 1-8
Deploying a deep neural network model on a reconfigurable platform, such as an FPGA, is challenging due to the enormous design spaces of both network models and hardware design. A neural network model has various layer types, connection patterns and data representations, and the corresponding implementation can be customised with different architectural and modular parameters. Rather than manually exploring this design space, it is more effective to automate optimisation throughout an end-to-end compilation process. This paper provides an overview of recent literature proposing novel approaches to achieve this aim. We organise materials to mirror a typical compilation flow: front end, platform-independent optimisation and back end. Design templates for neural network accelerators are studied with a specific focus on their derivation methodologies. We also review previous work on network compilation and optimisation for other hardware platforms to gain inspiration regarding FPGA implementation. Finally, we propose some future directions for related research.
Liu S, Fan H, Niu X, et al., Optimizing CNN-based Segmentation with Deeply Customized Convolutional and Deconvolutional Architectures on FPGA, ACM Transactions on Reconfigurable Technology and Systems, ISSN: 1936-7406
Convolutional Neural Networks (CNNs) based algorithms have been successful in solving image recognition problems, showing very large accuracy improvement. In recent years, deconvolution layers are widely used as key components in the state-of-the-art CNNs for end-to-end training and models to support tasks such as image segmentation and super resolution. However, the deconvolution algorithms are computationally intensive which limits their applicability to real time applications. Particularly, there has been little research on the efficient implementations of deconvolution algorithms on FPGA platforms which have been widely used to accelerate CNN algorithms by practitioners and researchers due to their high performance and power efficiency. In this work, we propose and develop deconvolution architecture for efficient FPGA implementation. FPGA-based accelerators are proposed for both deconvolution and CNN algorithms. Besides, memory sharing between the computation modules is proposed for the FPGA-based CNN accelerator as well as for other optimization techniques. A non-linear optimization model based on the performance model is introduced to efficiently explore the design space in order to achieve optimal processing speed of the system and improve power efficiency. Furthermore, a hardware mapping framework is developed to automatically generate the low-latency hardware design for any given CNN model on the target device. Finally, we implement our designs on Xilinx Zynq ZC706 board and the deconvolution accelerator achieves a performance of 90.1 GOPS under 200MHz working frequency and a performance density of 0.10 GOPS/DSP using 32-bit quantization, which significantly outperforms previous designs on FPGAs. A real-time application of scene segmentation on Cityscapes Dataset is used to evaluate our CNN accelerator on Zynq ZC706 board, and the system achieves a performance of 107 GOPS and 0.12 GOPS/DSP using 16-bit quantization, and supports up to 17 frames per second for 512x512 image in
Fan X, Wu D, Cao W, et al., 2018, Stream processing dual-track CGRA for object inference, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol: 26, Pages: 1098-1111, ISSN: 1063-8210
With the development of machine learning technology, the exploration of energy-efficient and flexible architectures for object inference algorithms is of growing interest in recent years. However, not many publications concentrate on a coarse-grained reconfigurable architecture (CGRA) for object inference algorithms. This paper provides a stream processing, dual-track programming CGRA-based approach to address the inherent computing characteristics of algorithms in object inference. Based on the proposed approach, an architecture called stream dual-track CGRA (SDT-CGRA) is presented as an implementation prototype. To evaluate the performance, the SDT-CGRA is realized in Verilog HDL and implemented in Semiconductor Manufacturing International Corporation 55-nm process, with the footprint of 5.19 mm² at 450 MHz. Seven object inference algorithms, including convolutional neural network (CNN), k-means, principal component analysis (PCA), spatial pyramid matching (SPM), linear support vector machine (SVM), Softmax, and Joint Bayesian, are selected as benchmarks. The experimental results show that the SDT-CGRA can gain on average 343.8 times and 17.7 times higher energy efficiency for Softmax, PCA, and CNN, 621.0 times and 1261.8 times higher energy efficiency for k-means, SPM, linear-SVM, and Joint-Bayesian algorithms when compared with the Intel Xeon E5-2637 CPU and the Nvidia TitanX graphics processing unit. When compared with the state-of-the-art solutions of AlexNet on field-programmable gate array and CGRA, the proposed SDT-CGRA can achieve a 1.78 times increase in energy efficiency and a 13 times speedup, respectively.
Papaphilippou P, Luk W, Accelerating database systems using FPGAs: A survey, The International Conference on Field-Programmable Logic and Applications (FPL), 2018, Publisher: IEEE
Database systems are key to a variety of applications, and FPGA-based accelerators have shown promise in supporting high-performance database systems. This survey presents a systematic review of research relating to accelerating analytical database systems using FPGAs. The review includes studies of database acceleration frameworks and accelerator implementations for various database operators. Finally, the survey includes some promising future technologies and discussion on the challenges to be addressed by the future research in this area.
Zhao R, Ng H-C, Luk W, et al., Towards efficient convolutional neural network for domain-specific applications on FPGA, 28th International Conference on Field-Programmable Logic and Applications (FPL), Publisher: IEEE
FPGA becomes a popular technology for implementing Convolutional Neural Network (CNN) in recent years. Most CNN applications on FPGA are domain-specific, e.g., detecting objects from specific categories, in which commonly-used CNN models pre-trained on general datasets may not be efficient enough. This paper presents TuRF, an end-to-end CNN acceleration framework to efficiently deploy domain-specific applications on FPGA by transfer learning that adapts pre-trained models to specific domains, replacing standard convolution layers with efficient convolution blocks, and applying layer fusion to enhance hardware design performance. We evaluate TuRF by deploying a pre-trained VGG-16 model for a domain-specific image recognition task onto a Stratix V FPGA. Results show that designs generated by TuRF achieve better performance than prior methods for the original VGG-16 and ResNet-50 models, while for the optimised VGG-16 model TuRF designs are more accurate and easier to process.
Shao S, Mencer O, Luk W, 2018, Dataflow Design for Optimal Incremental SVM Training, 15th International Conference on Field-Programmable Technology (FPT), Publisher: IEEE, Pages: 197-200
This paper proposes a new parallel architecture for incremental training of a Support Vector Machine (SVM), which produces an optimal solution based on manipulating the Karush-Kuhn-Tucker (KKT) conditions. Compared to batch training methods, our approach avoids re-training from scratch when training dataset changes. The proposed architecture is the first to adopt an efficient dataflow organisation. The main novelty is a parametric description of the parallel dataflow architecture, which deploys customisable arithmetic units for dense linear algebraic operations involved in updating the KKT conditions. The proposed architecture targets on-line SVM training applications. Experimental evaluation with real world financial data shows that our architecture implemented on Stratix-V FPGA achieved significant speedup against LIBSVM on Core i7-4770 CPU.
Lee K-H, Leong MCW, Chow MCK, et al., 2018, FEM-based Soft Robotic Control Framework for Intracavitary Navigation, IEEE International Conference on Real-time Computing and Robotics (RCAR), Publisher: IEEE, Pages: 11-16
Bio-inspired robotic structure composed of soft actuation units has attracted increasing research interest in its potential and capacity of complying with unstructured and dynamic environment, as well as providing safe interaction with human; however, this inevitably poses technical challenges in achieving steady, reliable control due to the remarkable non-linearity of its kinematics and dynamics. To resolve this challenge, we propose a novel control framework that can characterize the kinematics of a soft continuum robot through the hyper-elastic Finite-element modeling (FEM). This enables frequent updates of the Jacobian mapping from the user motion input to the end-effector's point of view. Experimental validation has been conducted to show the feasibility of controlling the soft robot for intracavitary path following. This could be the first success to demonstrate the perspectives of achieving stable, accurate and effective manipulation under large change of robot morphology without having to deduce its analytical model. It is anticipated to draw further extensive attention on resolving the bottleneck against the application of FEM, namely its intensive computation.
Liu S, Niu X, Luk W, 2018, A Low-Power Deconvolutional Accelerator for Convolutional Neural Network Based Segmentation on FPGA: Abstract Only., 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Publisher: ACM, Pages: 293-293
Zhao R, Niu X, Luk W, 2018, Automatic Optimising CNN with Depthwise Separable Convolution on FPGA: (Abstract Only)., 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Publisher: ACM, Pages: 285-285
Ng H-C, Liu S, Luk W, 2018, ADAM: Automated Design Analysis and Merging for Speeding up FPGA Development., FPGA International Symposium on Field Programmable Gate Arrays, Publisher: ACM, Pages: 189-198
This paper introduces ADAM, an approach for merging multiple FPGA designs into a single hardware design, so that multiple place-and-route tasks can be replaced by a single task to speed up functional evaluation of designs, especially during the development process. ADAM has three key elements. First, a novel approximate maximum common subgraph detection algorithm with linear time complexity to maximize sharing of resources in the merged design. Second, a prototype tool implementing this common subgraph detection algorithm for dataflow graphs derived from Verilog designs; this tool would also generate the appropriate control circuits to enable selection of the original designs at runtime. Third, a comprehensive analysis of compilation time versus degree of similarity to identify the optimized user parameters for the proposed approach. Experimental results show that ADAM can reduce compilation time by around 5 times when each design is 95% similar to the others, and the compilation time is reduced from 1 hour to 10 minutes in the case of binomial filters.
Lee KH, Fu KCD, Guo Z, et al., 2018, MR Safe Robotic Manipulator for MRI-Guided Intracardiac Catheterization, IEEE/ASME Transactions on Mechatronics, Vol: 23, Pages: 586-595, ISSN: 1083-4435
This paper introduces a robotic manipulator to realize robot-assisted intracardiac catheterization in magnetic resonance imaging (MRI) environment. MRI can offer high-resolution images to visualize soft tissue features such as scars or edema. We hypothesize that robotic catheterization, combined with the enhanced monitoring of lesions creation using MRI intraoperatively, will significantly improve the procedural safety, accuracy, and effectiveness. This is designed particularly for cardiac electrophysiological (EP) intervention, which is an effective treatment of arrhythmia. We present the first MR Safe robot for intracardiac EP intervention. The robot actuation features small hysteresis, effective force transmission, and quick response, which has been experimentally verified for its capability to precisely telemanipulate a standard clinically used EP catheter. We also present timely techniques for real-time positional tracking in MRI and intraoperative image registration, which can be integrated with the presented manipulator to improve the performance of teleoperated robotic catheterization.
Li W, He C, Fu H, et al., 2018, An FPGA-based tree crown detection approach for remote sensing images, 16th IEEE International Conference on Field-Programmable Technology (ICFPT), Publisher: IEEE, Pages: 231-234
Cross A-I, Guo L, Luk W, et al., 2018, CRRS: Custom Regression and Regularisation Solver for Large-scale Linear Systems, 28th International Conference on Field Programmable Logic and Applications (FPL), Publisher: IEEE, Pages: 389-393, ISSN: 1946-1488
Voss N, Bacis M, Mencer O, et al., 2017, Convolutional Neural Networks on Dataflow Engines, 35th IEEE International Conference on Computer Design (ICCD), Publisher: IEEE, Pages: 435-438, ISSN: 1063-6404
In this paper we discuss a high performance implementation for Convolutional Neural Networks (CNNs) inference on the latest generation of Dataflow Engines (DFEs). We discuss the architectural choices made during the design phase taking into account the DFE chip properties. We then perform design space exploration, considering the memory bandwidth and resources utilisation constraints derived from the used DFE and the chosen architecture. Finally, we discuss the high performance implementation and compare the obtained performance against other implementations, showing that our proposed design reaches 2,450 GOPS when running VGG16 as a test case.
Cooper B, Girdlestone S, Burovskiy P, et al., 2017, Quantum Chemistry in Dataflow: Density-Fitting MP2., Journal of Chemical Theory and Computation, Vol: 13, Pages: 5265-5272, ISSN: 1549-9618
We demonstrate the use of dataflow technology in the computation of the correlation energy in molecules at the Møller-Plesset perturbation theory (MP2) level. Specifically, we benchmark density fitting (DF)-MP2 for as many as 168 atoms (in valinomycin) and show that speed-ups between 3 and 3.8 times can be achieved when compared to the MOLPRO package run on a single CPU. Acceleration is achieved by offloading the matrix multiplications steps in DF-MP2 to Dataflow Engines (DFEs). We project that the acceleration factor could be as much as 24 with the next generation of DFEs.
Deep neural networks (DNNs) have attracted significant attention for their excellent accuracy especially in areas such as computer vision and artificial intelligence. To enhance their performance, technologies for their hardware acceleration are being studied. FPGA technology is a promising choice for hardware acceleration, given its low power consumption and high flexibility which makes it suitable particularly for embedded systems. However, complex DNN models may need more computing and memory resources than those available in many current FPGAs. This paper presents FP-BNN, a binarized neural network (BNN) for FPGAs, which drastically cuts down the hardware consumption while maintaining acceptable accuracy. We introduce a Resource-Aware Model Analysis (RAMA) method, and remove the bottleneck involving multipliers by bit-level XNOR and shifting operations, and the bottleneck of parameter access by data quantization and optimized on-chip storage. We evaluate the FP-BNN accelerator designs for MNIST multi-layer perceptrons (MLP), Cifar-10 ConvNet, and AlexNet on a Stratix-V FPGA system. An inference performance of Tera operations per second with acceptable accuracy loss is obtained, which shows improvement in speed and energy efficiency over other computing platforms.
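As an illustration of how bit-level XNOR can stand in for multipliers, the following is a minimal software sketch of a binarized dot product. It is not the FP-BNN hardware design: the bit-packing convention and the names `pack_bits` and `binarized_dot` are assumptions made for this example, and the popcount step is the standard companion to XNOR in binarized networks rather than something stated in the abstract.

```python
def pack_bits(values):
    """Pack a sequence of +1/-1 values into an int (assumed convention:
    bit i is 1 when values[i] == +1, and 0 when values[i] == -1)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binarized_dot(a_bits, b_bits, n):
    """Dot product of two length-n {+1, -1} vectors stored as packed ints."""
    mask = (1 << n) - 1
    # XNOR marks the positions where the two operands agree (product is +1).
    agree = ~(a_bits ^ b_bits) & mask
    # Popcount gives the number of agreeing positions.
    matches = bin(agree).count("1")
    # Each agreement contributes +1 and each disagreement -1.
    return 2 * matches - n


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
reference = sum(x * y for x, y in zip(a, b))
result = binarized_dot(pack_bits(a), pack_bits(b), len(a))
print(result, reference)
```

The packed form replaces n multiplications with one XOR, one inversion and a bit count, which is the kind of saving that makes binarized layers attractive on logic-rich, multiplier-poor devices.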
He C, Fu H, Luk W, et al., 2017, Exploring the potential of reconfigurable platforms for order book update, Field Programmable Logic and Applications (FPL), 2017, Publisher: IEEE, ISSN: 1946-1488
The order book update (OBU) algorithm is widely used in financial exchanges for rebuilding order books. The number of messages produced has drastically increased over time. The software solutions become more and more difficult to scale with the growing message rate and meet the requirement of low latency. This paper explores the potential of reconfigurable platforms in revolutionizing the order book architecture, and proposes a novel order book update algorithm optimized for maximal throughput and minimal latency. Our approach has three main contributions. First, we derive a fixed tick data structure for the order book that is easier to be mapped to the hardware. Second, we design a customized cache storing the top five levels of the order book to further reduce the latency. Third, we propose a hardware-friendly order book update algorithm based on the data structures we proposed. In the experiment, our FPGA-based solution can process 1.2-1.5 million messages per second with the throughput of 10Gb/s and the latency of 132-288 nanoseconds, which is 90-157 times faster than a CPU-based solution, and 5.2-6.6 times faster than an existing FPGA-based solution.
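The fixed tick idea can be sketched in software as an array indexed by price offset, so that an update is a single indexed write. This is only an illustration under stated assumptions, not the paper's hardware design: the class `FixedTickBook`, its methods, the use of integer prices, and the bid-only view are all hypothetical, and the top-five levels are found here by a plain scan rather than held in a dedicated cache.

```python
class FixedTickBook:
    """One side (bids) of an order book with one array slot per price tick."""

    def __init__(self, min_price, max_price, tick):
        # Prices are integers (for example cents) so the index is exact.
        self.min_price = min_price
        self.tick = tick
        self.levels = [0] * ((max_price - min_price) // tick + 1)

    def _index(self, price):
        return (price - self.min_price) // self.tick

    def update(self, price, delta):
        """Add (or, with a negative delta, remove) quantity at a price level."""
        self.levels[self._index(price)] += delta

    def top_bids(self, depth=5):
        """Return up to `depth` non-empty levels, highest price first."""
        out = []
        for i in range(len(self.levels) - 1, -1, -1):
            if self.levels[i] > 0:
                out.append((self.min_price + i * self.tick, self.levels[i]))
                if len(out) == depth:
                    break
        return out


book = FixedTickBook(min_price=10000, max_price=10010, tick=1)
book.update(10003, 50)
book.update(10005, 20)
book.update(10005, -20)  # level emptied again
book.update(10001, 10)
print(book.top_bids())
```

Because every price maps to a fixed slot, no search or rebalancing is needed on update, which is what makes this layout easy to map to on-chip memory.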
Fan H, Niu X, Liu Q, et al., 2017, F-C3D: FPGA-based 3-Dimensional Convolutional Neural Network, 27th International Conference on Field Programmable Logic and Applications (FPL), Publisher: IEEE, ISSN: 1946-1488
Targett JS, Düben P, Luk W, 2017, Validating optimisations for chaotic simulations, Field Programmable Logic and Applications (FPL), 2017, Publisher: IEEE, ISSN: 1946-1488
It is non-trivial to optimise computations of chaotic systems since slightly perturbed simulations diverge exponentially over time due to the well-known butterfly effect if bit-reproducible results are not achieved. Therefore, two model setups that show the same quality in the representation of a chaotic system will show uncorrelated behaviour if integrated long enough, hence it is challenging to check whether a given optimisation degrades model quality. Most models in computational fluid dynamics show chaotic behaviour. In this paper we focus on models of atmosphere and ocean that are vital for predictions of future weather and climate. Since forecast quality is usually limited by the available computational power, optimisation is highly desirable. We describe a new method for accepting or rejecting an optimised implementation of a reconfigurable design to simulate dynamics of a chaotic system. We apply this method to optimise numerical precision to a minimal level of stencil computations that can be used in an idealised ocean model, and show the performance improvements gained on an FPGA. The proposed method enables precision reduction for the FPGA so that it computes up to 9 times faster with 6 times lower energy consumption than an implementation on the same device with double precision arithmetic, while ensuring the optimised design to have acceptable numerical behaviour.
Ng HC, Liu S, Luk W, 2017, Reconfigurable acceleration of genetic sequence alignment: A survey of two decades of efforts, 2017 27th International Conference on Field Programmable Logic and Applications (FPL), Publisher: IEEE, ISSN: 1946-1488
Genetic sequence alignment has always been a computational challenge in bioinformatics. Depending on the problem size, software-based aligners can take multiple CPU-days to process the sequence data, creating a bottleneck point in bioinformatic analysis flow. Reconfigurable accelerator can achieve high performance for such computation by providing massive parallelism, but at the expense of programming flexibility and thus has not been commensurately used by practitioners. Therefore, this paper aims to provide a thorough survey of the proposed accelerators by giving a qualitative categorization based on their algorithms and speedup. A comprehensive comparison between work is also presented so as to guide selection for biologist, and to provide insight on future research direction for FPGA scientists.
Shao S, Luk W, 2017, Customised pearlmutter propagation: A hardware architecture for trust region policy optimisation, Field Programmable Logic and Applications (FPL), 2017, Publisher: IEEE, ISSN: 1946-1488
Reinforcement Learning (RL) is an area of machine learning in which an agent interacts with the environment by making sequential decisions. The agent receives reward from the environment to find an optimal policy that maximises the reward. Trust Region Policy Optimisation (TRPO) is a recent policy optimisation algorithm that achieves superior results in various RL benchmarks, but is computationally expensive. This paper proposes Customised Pearlmutter Propagation (CPP), a novel hardware architecture that accelerates TRPO on FPGA. We use the Pearlmutter Algorithm to address the key computational bottleneck of TRPO in a hardware efficient manner, avoiding symbolic differentiation with change of variables. Experimental evaluation using robotic locomotion benchmarks demonstrates that the proposed CPP architecture implemented on Stratix-V FPGA can achieve up to 20 times speed-up against 6-threaded Keras deep learning library with Theano backend running on a Core i7-5930K CPU.
Russell FP, Düben PD, Niu X, et al., 2017, Exploiting the chaotic behaviour of atmospheric models with reconfigurable architectures, Computer Physics Communications, Vol: 221, Pages: 160-173, ISSN: 0010-4655
Reconfigurable architectures are becoming mainstream: Amazon, Microsoft and IBM are supporting such architectures in their data centres. The computationally intensive nature of atmospheric modelling is an attractive target for hardware acceleration using reconfigurable computing. Performance of hardware designs can be improved through the use of reduced-precision arithmetic, but maintaining appropriate accuracy is essential. We explore reduced-precision optimisation for simulating chaotic systems, targeting atmospheric modelling, in which even minor changes in arithmetic behaviour will cause simulations to diverge quickly. The possibility of equally valid simulations having differing outcomes means that standard techniques for comparing numerical accuracy are inappropriate. We use the Hellinger distance to compare statistical behaviour between reduced-precision CPU implementations to guide reconfigurable designs of a chaotic system, then analyse accuracy, performance and power efficiency of the resulting implementations. Our results show that with only a limited loss in accuracy corresponding to less than 10% uncertainty in input parameters, the throughput and energy efficiency of a single-precision chaotic system implemented on a Xilinx Virtex-6 SX475T Field Programmable Gate Array (FPGA) can be more than doubled.
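For readers unfamiliar with the Hellinger distance, a minimal sketch of the standard discrete form follows, H(P, Q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2). How the paper bins model output into distributions is not stated in the abstract, so the histograms and the function name `hellinger` here are assumptions for illustration only.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two histograms over the same bins.

    Inputs are non-negative counts or weights; each is normalised to sum to 1.
    The result lies in [0, 1]: 0 for identical distributions, 1 for
    distributions with no overlapping bins.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    squared = sum(
        (math.sqrt(a / total_p) - math.sqrt(b / total_q)) ** 2
        for a, b in zip(p, q)
    )
    return math.sqrt(squared / 2)


# Hypothetical histograms of one model variable from two runs.
double_precision_run = [5, 20, 50, 20, 5]
reduced_precision_run = [6, 19, 49, 21, 5]
print(hellinger(double_precision_run, reduced_precision_run))
```

Comparing distributions rather than trajectories is what allows two runs that diverge point by point, as chaotic simulations do, to still be judged statistically equivalent.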