Vink DA, Rajagopal A, Venieris SI, et al., 2020, Caffe barista: brewing caffe with FPGAs in the training loop, Publisher: arXiv
As the complexity of deep learning (DL) models increases, their compute requirements increase accordingly. Deploying a Convolutional Neural Network (CNN) involves two phases: training and inference. With the inference task typically taking place on resource-constrained devices, a lot of research has explored the field of low-power inference on custom hardware accelerators. On the other hand, training is both more compute- and memory-intensive and is primarily performed on power-hungry GPUs in large-scale data centres. CNN training on FPGAs is a nascent field of research. This is primarily due to the lack of tools to easily prototype and deploy various hardware and/or algorithmic techniques for power-efficient CNN training. This work presents Barista, an automated toolflow that provides seamless integration of FPGAs into the training of CNNs within the popular deep learning framework Caffe. To the best of our knowledge, this is the only tool that allows for such versatile and rapid deployment of hardware and algorithms for the FPGA-based training of CNNs, providing the necessary infrastructure for further research and development.
Rajagopal A, Vink DA, Venieris SI, et al., 2020, Multi-Precision Policy Enforced Training (MuPPET): A precision-switching strategy for quantised fixed-point training of CNNs, Publisher: arXiv
Large-scale convolutional neural networks (CNNs) suffer from very long training times, spanning from hours to weeks, limiting the productivity and experimentation of deep learning practitioners. As networks grow in size and complexity, training time can be reduced through low-precision data representations and computations. However, in doing so the final accuracy suffers due to the problem of vanishing gradients. Existing state-of-the-art methods combat this issue by means of a mixed-precision approach utilising two different precision levels, FP32 (32-bit floating-point) and FP16/FP8 (16-/8-bit floating-point), leveraging the hardware support of recent GPU architectures for FP16 operations to obtain performance gains. This work pushes the boundary of quantised training by employing a multilevel optimisation approach that utilises multiple precisions including low-precision fixed-point representations. The novel training strategy, MuPPET, combines the use of multiple number representation regimes together with a precision-switching mechanism that decides at run time the transition point between precision regimes. Overall, the proposed strategy tailors the training process to the hardware-level capabilities of the target hardware architecture and yields improvements in training time and energy efficiency compared to state-of-the-art approaches. Applying MuPPET on the training of AlexNet, ResNet18 and GoogLeNet on ImageNet (ILSVRC12) and targeting an NVIDIA Turing GPU, MuPPET achieves the same accuracy as standard full-precision training with training-time speedup of up to 1.84× and an average speedup of 1.58× across the networks.
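As an illustration of the low-precision fixed-point representations mentioned in the abstract above, the following is a minimal sketch of signed fixed-point quantisation with saturation. It is not MuPPET's implementation: the function name, the choice of word lengths, and the rule of reserving two integer bits are all assumptions made for this example, and the precision-switching policy itself is not reproduced because the abstract does not specify it.

```python
def quantise_fixed_point(x, word_length, frac_bits):
    """Round x to a signed fixed-point value (hypothetical helper).

    word_length: total number of bits, including the sign bit.
    frac_bits:   number of bits after the binary point.
    Values outside the representable range are saturated.
    """
    scale = 2 ** frac_bits
    q_min = -(2 ** (word_length - 1))
    q_max = 2 ** (word_length - 1) - 1
    q = round(x * scale)              # nearest integer code
    q = max(q_min, min(q_max, q))     # saturate instead of wrapping
    return q / scale


# Quantisation error shrinks as the word length grows
# (assumed layout: two integer bits, the rest fractional).
value = 0.3
errors = []
for word_length in (8, 12, 16):
    approx = quantise_fixed_point(value, word_length, word_length - 2)
    errors.append(abs(value - approx))
```

A training scheme that moves between such regimes would start at a short word length and move to longer ones as training progresses; when to switch is the part decided at run time by the strategy described in the abstract.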
Rajagopal A, Bouganis C-S, 2020, Now that I can see, I can improve: Enabling data-driven finetuning of CNNs on the edge, Publisher: arXiv
In today's world, a vast amount of data is being generated by edge devices that can be used as valuable training data to improve the performance of machine learning algorithms in terms of the achieved accuracy or to reduce the compute requirements of the model. However, due to user data privacy concerns as well as storage and communication bandwidth limitations, this data cannot be moved from the device to the data centre for further improvement of the model and subsequent deployment. As such there is a need for increased edge intelligence, where the deployed models can be fine-tuned on the edge, leading to improved accuracy and/or reducing the model's workload as well as its memory and power footprint. In the case of Convolutional Neural Networks (CNNs), both the weights of the network as well as its topology can be tuned to adapt to the data that it processes. This paper provides a first step towards enabling CNN finetuning on an edge device based on structured pruning. It explores the performance gains and costs of doing so and presents an extensible open-source framework that allows the deployment of such approaches on a wide range of network architectures and devices. The results show that on average, data-aware pruning with retraining can provide 10.2pp increased accuracy over a wide range of subsets, networks and pruning levels with a maximum improvement of 42.0pp over pruning and retraining in a manner agnostic to the data being processed by the network.
Olaizola J, Bouganis C-S, Argandoña ESD, et al., 2020, Real-time servo press force estimation based on dual particle filter, IEEE Transactions on Industrial Electronics, Vol: 67, Pages: 4088-4097, ISSN: 0278-0046
The ability to monitor the quality of the metal forming process as well as the machine's condition is of significant importance in modern industrial processes. In the case where a physical device (i.e., sensor) cannot be deployed due to the characteristics of the system, models that rely on the estimation of both the applied force and the dynamic behavior of the machine (i.e., system) are adopted. The development of such models and the corresponding algorithms used to estimate the above-mentioned quantities has attracted the interest of the community. The main contribution of this paper is the estimation of a servo press force by employing a novel dual particle filter based algorithm, achieving a maximum relative error in the force estimation of 3.6%. Moreover, to address real-time performance requirements, this paper proposes a field programmable gate array based accelerator that improves the sampling rate by a factor of 200 compared to a processor-based solution, thus enabling the deployment of the system in many realistic scenarios.
Yu Z, Bouganis C-S, 2020, A Parameterisable FPGA-Tailored Architecture for YOLOv3-Tiny, Publisher: Springer, Pages: 330-344
Kouris A, Venieris S, Bouganis C-S, 2020, A throughput-latency co-optimised cascade of convolutional neural network classifiers, Design, Automation and Test in Europe Conference (DATE 2020), Publisher: IEEE
Convolutional Neural Networks constitute a prominent AI model for classification tasks, serving a broad span of diverse application domains. To enable their efficient deployment in real-world tasks, the inherent redundancy of CNNs is frequently exploited to eliminate unnecessary computational costs. Driven by the fact that not all inputs require the same amount of computation to drive a confident prediction, multi-precision cascade classifiers have been recently introduced. FPGAs comprise a promising platform for the deployment of such input-dependent computation models, due to their enhanced customisation capabilities. Current literature, however, is limited to throughput-optimised cascade implementations, employing large batching at the expense of a substantial latency aggravation prohibiting their deployment on real-time scenarios. In this work, we introduce a novel methodology for throughput-latency co-optimised cascaded CNN classification, deployed on a custom FPGA architecture tailored to the target application and deployment platform, with respect to a set of user-specified requirements on accuracy and performance. Our experiments indicate that the proposed approach achieves comparable throughput gains with related state-of-the-art works, under substantially reduced overhead in latency, enabling its deployment on latency-sensitive applications.
Kouris A, Venieris S, Bouganis C-S, 2019, Towards efficient on-board deployment of DNNs on intelligent autonomous systems, 18th IEEE-Computer-Society Annual Symposium on VLSI (ISVLSI), Publisher: IEEE COMPUTER SOC, Pages: 570-575, ISSN: 2159-3469
With their unprecedented performance in major AI tasks, deep neural networks (DNNs) have emerged as a primary building block in modern autonomous systems. Intelligent systems such as drones, mobile robots and driverless cars largely base their perception, planning and application-specific tasks on DNN models. Nevertheless, due to the nature of these applications, such systems require on-board local processing in order to retain their autonomy and meet latency and throughput constraints. In this respect, the large computational and memory demands of DNN workloads pose a significant barrier on their deployment on the resource- and power-constrained compute platforms that are available on-board. This paper presents an overview of recent methods and hardware architectures that address the system-level challenges of modern DNN-enabled autonomous systems at both the algorithmic and hardware design level. Spanning from latency-driven approximate computing techniques to high-throughput mixed-precision cascaded classifiers, the presented set of works paves the way for the on-board deployment of sophisticated DNN models on robots and autonomous systems.
Ahmadi N, Bouganis C, Constandinou T, 2019, End-to-end hand kinematics decoding from local field potentials using temporal convolutional network, IEEE Biomedical Circuits and Systems (BioCAS) Conference, Publisher: IEEE
In recent years, local field potentials (LFPs) have emerged as a promising alternative input signal for brain-machine interfaces (BMIs). Several studies have demonstrated that LFP-based BMIs could provide long-term recording stability and comparable decoding performance to their spike counterparts. Despite the compelling results, however, most LFP-based BMIs still make use of hand-crafted features which can be time-consuming and suboptimal. In this paper, we propose an end-to-end system approach based on temporal convolutional network (TCN) to automatically extract features and decode kinematics of hand movements directly from raw LFP signals. We benchmark its decoding performance against traditional approach incorporating long short-term memory (LSTM) decoders driven by hand-crafted LFP features. Experimental results demonstrate significant performance improvement of the proposed approach compared to the traditional approach. This suggests the suitability of TCN-based end-to-end system and its potential for providing stable and high decoding performance LFP-based BMIs.
Vasileiadis M, Bouganis C-S, Tzovaras D, 2019, Multi-person 3D pose estimation from 3D cloud data using 3D convolutional neural networks, Computer Vision and Image Understanding, Vol: 185, Pages: 12-23, ISSN: 1077-3142
Human pose estimation is considered one of the major challenges in the field of Computer Vision, playing an integral role in a large variety of technology domains. While, in the last few years, there has been an increased number of research approaches towards CNN-based 2D human pose estimation from RGB images, respective work on CNN-based 3D human pose estimation from depth/3D data has been rather limited, with current approaches failing to outperform earlier methods, partially due to the utilization of depth maps as simple 2D single-channel images, instead of an actual 3D world representation. In order to overcome this limitation, and taking into consideration recent advances in 3D detection tasks of similar nature, we propose a novel fully-convolutional, detection-based 3D-CNN architecture for 3D human pose estimation from 3D data. The architecture follows the sequential network architecture paradigm, generating per-voxel likelihood maps for each human joint, from a 3D voxel-grid input, and is extended, through a bottom-up approach, towards multi-person 3D pose estimation, allowing the algorithm to simultaneously estimate multiple human poses, without its runtime complexity being affected by the number of people within the scene. The proposed multi-person architecture, which is the first within the scope of 3D human pose estimation, is comparatively evaluated on three single person public datasets, achieving state-of-the-art performance, as well as on a public multi-person dataset achieving high recognition accuracy.
Liu J, Bouganis C, Cheung PYK, 2019, Context-based image acquisition from memory in digital systems, Journal of Real-Time Image Processing, Vol: 16, Pages: 1057-1076, ISSN: 1861-8200
A key consideration in the design of image and video processing systems is the ever increasing spatial resolution of the captured images, which has a major impact on the performance requirements of the memory subsystem. This is further amplified by the facts that the memory bandwidth requirements and energy consumption of accessing the captured images have started to become the bottlenecks in the design of high-performance image processing systems. Inspired by the successful application of progressive image sampling techniques in various image processing tasks, this work proposes the concept of Context-based Image Acquisition for hardware systems that efficiently trades image quality for reduced cost of the image acquisition process. Based on the proposed framework, a hardware architecture is developed which alters the conventional memory access pattern, to progressively and adaptively access pixels from a memory subsystem. The sampled pixels are used to reconstruct an approximation to the ground truth, which is stored in a high-performance image buffer for further processing. An instance of the architecture is prototyped on an FPGA and its performance evaluation shows that a saving of up to 85 % of memory accessing time and 33 %/45 % of image acquisition time/energy are achieved on a set of benchmarks while maintaining a high PSNR.
Kostavelis I, Vasileiadis M, Skartados E, et al., 2019, Understanding of human behavior with a robotic agent through daily activity analysis, International Journal of Social Robotics, Vol: 11, Pages: 437-462, ISSN: 1875-4791
Personal assistive robots to be realized in the near future should have the ability to seamlessly coexist with humans in unconstrained environments, with the robot’s capability to understand and interpret the human behavior during human–robot cohabitation significantly contributing towards this end. Still, the understanding of human behavior through a robot is a challenging task as it necessitates a comprehensive representation of the high-level structure of the human’s behavior from the robot’s low-level sensory input. The paper at hand tackles this problem by demonstrating a robotic agent capable of apprehending human daily activities through a method, the Interaction Unit analysis, that enables activities’ decomposition into a sequence of units, each one associated with a behavioral factor. The modelling of human behavior is addressed with a Dynamic Bayesian Network that operates on top of the Interaction Unit, offering quantification of the behavioral factors and the formulation of the human’s behavioral model. In addition, light-weight human action and object manipulation monitoring strategies have been developed, based on RGB-D and laser sensors, tailored for onboard robot operation. As a proof of concept, we used our robot to evaluate the ability of the method to differentiate among the examined human activities, as well as to assess the capability of behavior modeling of people with Mild Cognitive Impairment. Moreover, we deployed our robot in 12 real house environments with real users, showcasing the behavior understanding ability of our method in unconstrained realistic environments. The evaluation process revealed promising performance and demonstrated that human behavior can be automatically modeled through Interaction Unit analysis, directly from robotic agents.
Ahmadi N, Cavuto ML, Feng P, et al., 2019, Towards a distributed, chronically-implantable neural interface, 9th IEEE/EMBS International Conference on Neural Engineering (NER), Publisher: IEEE, Pages: 719-724, ISSN: 1948-3546
We present a platform technology encompassing a family of innovations that together aim to tackle key challenges with existing implantable brain machine interfaces. The ENGINI (Empowering Next Generation Implantable Neural Interfaces) platform utilizes a 3-tier network (external processor, cranial transponder, intracortical probes) to inductively couple power to, and communicate data from, a distributed array of freely-floating mm-scale probes. Novel features integrated into each probe include: (1) an array of niobium microwires for observing local field potentials (LFPs) along the cortical column; (2) ultra-low power instrumentation for signal acquisition and data reduction; (3) an autonomous, self-calibrating wireless transceiver for receiving power and transmitting data; and (4) a hermetically-sealed micropackage suitable for chronic use. We are additionally engineering a surgical tool, to facilitate manual and robot-assisted insertion, within a streamlined neurosurgical workflow. Ongoing work is focused on system integration and preclinical testing.
Kouris A, Venieris SI, Rizakis M, et al., 2019, Approximate LSTMs for time-constrained inference: Enabling fast reaction in self-driving cars, Publisher: arXiv
The need to recognise long-term dependencies in sequential data such as video streams has made LSTMs a prominent AI model for many emerging applications. However, the high computational and memory demands of LSTMs introduce challenges in their deployment on latency-critical systems such as self-driving cars which are equipped with limited computational resources on-board. In this paper, we introduce an approximate computing scheme combining model pruning and computation restructuring to obtain a high-accuracy approximation of the result in early stages of the computation. Our experiments demonstrate that using the proposed methodology, mission-critical systems responsible for autonomous navigation and collision avoidance are able to make informed decisions based on approximate calculations within the available time budget, meeting their specifications on safety and robustness.
De Souza Rosa L, Bouganis C, Bonato V, 2019, Scaling up modulo scheduling for high-level synthesis, Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol: 38, Pages: 912-925, ISSN: 0278-0070
High-Level Synthesis tools have been increasingly used within the hardware design community to bridge the gap between productivity and the need to design large and complex systems. When targeting heterogeneous systems, where the CPU and the FPGA fabric are both available to perform computations, a design space exploration is usually carried out for deciding which parts of the initial code should be mapped to the FPGA fabric such that the overall system’s performance is enhanced by accelerating its computation via dedicated processors. As the targeted systems become more complex and larger, leading to a large design space exploration, the fast estimation of the possible acceleration that can be obtained by mapping certain functionality into the FPGA fabric is of paramount importance. Loop pipelining, which is responsible for the majority of HLS compilation time, is a key optimization towards achieving high-performance acceleration kernels. A new modulo scheduling algorithm is proposed, which reformulates the classical modulo scheduling problem and leads to a reduced number of integer linear problems solved, resulting in large computational savings. Moreover, the proposed approach has a controlled trade-off between solution quality and computation time. Results show the scalability is improved efficiently from quadratic, for the state-of-the-art method, to linear, for the proposed approach, while the optimized loop suffers a 1% (geomean) increment in the total number of cycles.
Boikos K, Bouganis C-S, 2019, A scalable FPGA-based architecture for depth estimation in SLAM, ARC 2019, Publisher: Springer, Pages: 181-196
The current state of the art of Simultaneous Localisation and Mapping, or SLAM, on low power embedded systems is about sparse localisation and mapping with low resolution results in the name of efficiency. Meanwhile, research in this field has provided many advances for information rich processing and semantic understanding, combined with high computational requirements for real-time processing. This work provides a solution to bridging this gap, in the form of a scalable SLAM-specific architecture for depth estimation for direct semi-dense SLAM. Targeting an off-the-shelf FPGA-SoC this accelerator architecture achieves a rate of more than 60 mapped frames/sec at a resolution of 640×480 achieving performance on par to a highly-optimised parallel implementation on a high-end desktop CPU with an order of magnitude improved power consumption. Furthermore, the developed architecture is combined with our previous work for the task of tracking, to form the first complete accelerator for semi-dense SLAM on FPGAs, establishing the state of the art in the area of embedded low-power systems.
Ahmadi N, Constandinou TG, Bouganis C-S, 2019, Decoding Hand Kinematics from Local Field Potentials Using Long Short-Term Memory (LSTM) Network, 2019 9th International IEEE/EMBS Conference on Neural Engineering (NER 2019), Pages: 1-5
Local field potential (LFP) has gained increasing interest as an alternative input signal for brain-machine interfaces (BMIs) due to its informative features, long-term stability, and low frequency content. However, despite these interesting properties, LFP-based BMIs have been reported to yield low decoding performances compared to spike-based BMIs. In this paper, we propose a new decoder based on long short-term memory (LSTM) network which aims to improve the decoding performance of LFP-based BMIs. We compare offline decoding performance of the proposed LSTM decoder to a commonly used Kalman filter (KF) decoder on hand kinematics prediction tasks from multichannel LFPs. We also benchmark the performance of LFP-driven LSTM decoder against KF decoder driven by two types of spike signals: single-unit activity (SUA) and multi-unit activity (MUA). Our results show that LFP-driven LSTM decoder achieves significantly better decoding performance than LFP-, SUA-, and MUA-driven KF decoders. This suggests that LFPs coupled with LSTM decoder could provide high decoding performance, robust, and low power BMIs.
Venieris S, Bouganis C, 2019, fpgaConvNet: Mapping Regular and Irregular Convolutional Neural Networks on FPGAs, IEEE Transactions on Neural Networks and Learning Systems, Vol: 30, Pages: 326-342, ISSN: 2162-2388
Since neural networks renaissance, convolutional neural networks (ConvNets) have demonstrated a state-of-the-art performance in several emerging artificial intelligence tasks. The deployment of ConvNets in real-life applications requires power-efficient designs that meet the application-level performance needs. In this context, field-programmable gate arrays (FPGAs) can provide a potential platform that can be tailored to application-specific requirements. However, with the complexity of ConvNet models increasing rapidly, the ConvNet-to-FPGA design space becomes prohibitively large. This paper presents fpgaConvNet, an end-to-end framework for the optimized mapping of ConvNets on FPGAs. The proposed framework comprises an automated design methodology based on the synchronous dataflow (SDF) paradigm and defines a set of SDF transformations in order to efficiently navigate the architectural design space. By proposing a systematic multiobjective optimization formulation, the presented framework is able to generate hardware designs that are cooptimized for the ConvNet workload, the target device, and the application's performance metric of interest. Quantitative evaluation shows that the proposed methodology yields hardware designs that improve the performance by up to 6.65x over highly optimized graphics processing unit designs for the same power constraints and achieve up to 2.94x higher performance density compared with the state-of-the-art FPGA-based ConvNet architectures.
Kouris A, Bouganis C, 2019, Learning to fly by myself: a self-supervised CNN-based approach for autonomous navigation, Intelligent Robots and Systems (IROS 2018), 2018 IEEE/RSJ International Conference on, Publisher: IEEE
Nowadays, Unmanned Aerial Vehicles (UAVs) are becoming increasingly popular facilitated by their extensive availability. Autonomous navigation methods can act as an enabler for the safe deployment of drones on a wide range of real-world civilian applications. In this work, we introduce a self-supervised CNN-based approach for indoor robot navigation. Our method addresses the problem of real-time obstacle avoidance, by employing a regression CNN that predicts the agent's distance-to-collision in view of the raw visual input of its on-board monocular camera. The proposed CNN is trained on our custom indoor-flight dataset which is collected and annotated with real-distance labels, in a self-supervised manner using external sensors mounted on a UAV. By simultaneously processing the current and previous input frame, the proposed CNN extracts spatio-temporal features that encapsulate both static appearance and motion information to estimate the robot's distance to its closest obstacle towards multiple directions. These predictions are used to modulate the yaw and linear velocity of the UAV, in order to navigate autonomously and avoid collisions. Experimental evaluation demonstrates that the proposed approach learns a navigation policy that achieves high accuracy on real-world indoor flights, outperforming previously proposed methods from the literature.
Kouris A, Kyrkou C, Bouganis C-S, 2019, Informed Region Selection for Efficient UAV-based Object Detectors: Altitude-aware Vehicle Detection with CyCAR Dataset, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Publisher: IEEE, Pages: 51-58, ISSN: 2153-0858
Ahmadi N, Cavuto ML, Feng P, et al., 2019, Towards a Distributed, Chronically-Implantable Neural Interface, Publisher: IEEE, Pages: 719-724
Montgomerie-Corcoran A, Venieris SI, Bouganis C-S, 2019, Power-Aware FPGA Mapping of Convolutional Neural Networks, Publisher: IEEE, Pages: 327-330
Vasileiadis M, Bouganis C-S, Stavropoulos G, et al., 2019, Optimising 3D-CNN Design towards Human Pose Estimation on Low Power Devices, Publisher: BMVA Press, Pages: 42-42
Kouris A, Venieris SI, Bouganis C-S, 2018, CascadeC(NN): pushing the performance limits of quantisation in convolutional neural networks, 28th International Conference on Field Programmable Logic and Applications (FPL), Publisher: IEEE, Pages: 155-162, ISSN: 1946-1488
This work presents CascadeCNN, an automated toolflow that pushes the quantisation limits of any given CNN model, aiming to perform high-throughput inference. A two-stage architecture tailored for any given CNN-FPGA pair is generated, consisting of a low- and high-precision unit in a cascade. A confidence evaluation unit is employed to identify misclassified cases from the excessively low-precision unit and forward them to the high-precision unit for re-processing. Experiments demonstrate that the proposed toolflow can achieve a performance boost up to 55% for VGG-16 and 48% for AlexNet over the baseline design for the same resource budget and accuracy, without the need of retraining the model or accessing the training data.
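The two-stage cascade described in the abstract above can be sketched in software as follows. This is a toy illustration under stated assumptions, not the CascadeCNN toolflow: the "network" is a tiny linear classifier, the low-precision unit is modelled by uniformly quantising its weights, and the confidence test is the margin between the two largest class probabilities. The abstract does not state which confidence metric the paper uses, so the margin rule, the threshold, and all names (`quantise`, `cascade_predict`, `W_HIGH`, `W_LOW`) are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def quantise(w, bits):
    """Uniform symmetric quantisation of a weight in [-1, 1] to `bits` bits."""
    levels = 2 ** (bits - 1) - 1
    return round(max(-1.0, min(1.0, w)) * levels) / levels


def classify(weights, x):
    """Linear classifier: one weight row per class, followed by softmax."""
    logits = [sum(wi * xi for wi, xi in zip(row, x)) for row in weights]
    return softmax(logits)


def margin(probs):
    """Assumed confidence measure: gap between the top two probabilities."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def cascade_predict(x, w_low, w_high, threshold):
    """Try the low-precision unit first; re-process uncertain inputs."""
    probs = classify(w_low, x)
    if margin(probs) >= threshold:
        return probs.index(max(probs)), "low"
    probs = classify(w_high, x)
    return probs.index(max(probs)), "high"


W_HIGH = [[0.9, -0.2], [-0.3, 0.8]]                           # full precision
W_LOW = [[quantise(w, 3) for w in row] for row in W_HIGH]     # 3-bit copy

easy = cascade_predict([3.0, 0.0], W_LOW, W_HIGH, threshold=0.5)
hard = cascade_predict([1.0, 1.3], W_LOW, W_HIGH, threshold=0.5)
```

In this toy setup the first input is resolved by the low-precision unit alone, while the second falls below the confidence threshold and is forwarded to the high-precision unit, which assigns it a different class than the quantised weights would have.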
Kyrkou C, Theocharides T, Bouganis C-S, et al., 2018, Boosting the hardware-efficiency of cascade support vector machines for embedded classification applications, International Journal of Parallel Programming, Vol: 46, Pages: 1220-1246, ISSN: 0885-7458
Support Vector Machines (SVMs) are considered as a state-of-the-art classification algorithm capable of high accuracy rates for a different range of applications. When arranged in a cascade structure, SVMs can efficiently handle problems where the majority of data belongs to one of the two classes, such as image object classification, and hence can provide speedups over monolithic (single) SVM classifiers. However, the SVM classification process is still computationally demanding due to the number of support vectors. Consequently, in this paper we propose a hardware architecture optimized for cascaded SVM processing to boost performance and hardware efficiency, along with a hardware reduction method in order to reduce the overheads from the implementation of additional stages in the cascade, leading to significant resource and power savings. The architecture was evaluated for the application of object detection on 800×600 resolution images on a Spartan 6 Industrial Video Processing FPGA platform achieving over 30 frames-per-second. Moreover, by utilizing the proposed hardware reduction method we were able to reduce the utilization of FPGA custom-logic resources by ∼30%, and simultaneously observed ∼20% peak power reduction compared to a baseline implementation.
Ahmadi N, Constandinou T, Bouganis C, 2018, Estimation of neuronal firing rate using Bayesian Adaptive Kernel Smoother (BAKS), PLoS ONE, Vol: 13, ISSN: 1932-6203
Neurons use sequences of action potentials (spikes) to convey information across neuronal networks. In neurophysiology experiments, information about external stimuli or behavioral tasks has been frequently characterized in term of neuronal firing rate. The firing rate is conventionally estimated by averaging spiking responses across multiple similar experiments (or trials). However, there exist a number of applications in neuroscience research that require firing rate to be estimated on a single trial basis. Estimating firing rate from a single trial is a challenging problem and current state-of-the-art methods do not perform well. To address this issue, we develop a new method for estimating firing rate based on a kernel smoothing technique that considers the bandwidth as a random variable with prior distribution that is adaptively updated under an empirical Bayesian framework. By carefully selecting the prior distribution together with Gaussian kernel function, an analytical expression can be achieved for the kernel bandwidth. We refer to the proposed method as Bayesian Adaptive Kernel Smoother (BAKS). We evaluate the performance of BAKS using synthetic spike train data generated by biologically plausible models: inhomogeneous Gamma (IG) and inhomogeneous inverse Gaussian (IIG). We also apply BAKS to real spike train data from non-human primate (NHP) motor and visual cortex. We benchmark the proposed method against established and previously reported methods. These include: optimized kernel smoother (OKS), variable kernel smoother (VKS), local polynomial fit (Locfit), and Bayesian adaptive regression splines (BARS). Results using both synthetic and real data demonstrate that the proposed method achieves better performance compared to competing methods. This suggests that the proposed method could be useful for understanding the encoding mechanism of neurons in cognitive-related tasks. The proposed method could also potentially improve the performance of brain-mac
Ahmadi N, Constandinou TG, Bouganis C, 2018, Spike rate estimation using Bayesian Adaptive Kernel Smoother (BAKS) and its application to brain machine interfaces, 40th International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Publisher: IEEE
Brain Machine Interfaces (BMIs) mostly utilise spike rate as an input feature for decoding a desired motor output as it conveys a useful measure to the underlying neuronal activity. The spike rate is typically estimated by using a non-overlap binning method that yields a coarse estimate. There exist several methods that can produce a smooth estimate which could potentially improve the decoding performance. However, these methods are relatively computationally heavy for real-time BMIs. To address this issue, we propose a new method for estimating spike rate that is able to yield a smooth estimate and is also amenable to real-time BMIs. The proposed method, referred to as Bayesian adaptive kernel smoother (BAKS), employs kernel smoothing technique that considers the bandwidth as a random variable with prior distribution which is adaptively updated through a Bayesian framework. With appropriate selection of prior distribution and kernel function, an analytical expression can be achieved for the kernel bandwidth. We apply BAKS and evaluate its impact on offline BMI decoding performance using Kalman filter. The results show that overlap BAKS improved the decoding performance up to 3.33% and 12.93% compared to overlap and non-overlap binning methods, respectively, depending on the window size. This suggests the feasibility and the potential use of BAKS method for real-time BMIs.
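To make the contrast between binning and kernel smoothing in the abstract above concrete, here is a minimal sketch of both estimators. It deliberately uses a fixed-bandwidth Gaussian kernel: the adaptive, Bayesian bandwidth update that defines BAKS is not reproduced here, since the abstract does not give its closed form. The function names, spike times, and parameter values are assumptions made for illustration.

```python
import math


def binned_rate(spike_times, t_start, t_end, bin_width):
    """Non-overlap binning: spike count per bin divided by bin width (Hz)."""
    n_bins = int(round((t_end - t_start) / bin_width))
    counts = [0] * n_bins
    for s in spike_times:
        idx = int((s - t_start) / bin_width)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return [c / bin_width for c in counts]


def gaussian_kernel_rate(spike_times, eval_times, bandwidth):
    """Fixed-bandwidth Gaussian kernel smoother (not the BAKS bandwidth).

    Each spike contributes a unit-area Gaussian bump, so the estimate
    integrates to the number of spikes.
    """
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    return [
        sum(norm * math.exp(-0.5 * ((t - s) / bandwidth) ** 2)
            for s in spike_times)
        for t in eval_times
    ]


spikes = [0.10, 0.12, 0.15, 0.40, 0.80]          # seconds, made-up data

coarse = binned_rate(spikes, 0.0, 1.0, bin_width=0.25)

dt = 0.001
grid = [-1.0 + dt * i for i in range(3001)]      # covers -1 s to 2 s
smooth = gaussian_kernel_rate(spikes, grid, bandwidth=0.05)
total_spikes = sum(smooth) * dt                  # approximately len(spikes)
```

The binned estimate is piecewise constant and changes abruptly at bin edges, whereas the kernel estimate varies smoothly with time. Choosing the bandwidth is the hard part, which is what the adaptive scheme in the abstract addresses.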
Venieris SI, Bouganis C-S, 2018, f-CNNx: a toolflow for mapping multiple convolutional neural networks on FPGAs, 28th International Conference on Field Programmable Logic and Applications
The predictive power of Convolutional Neural Networks (CNNs) has been an integral factor for emerging latency-sensitive applications, such as autonomous drones and vehicles. Such systems employ multiple CNNs, each one trained for a particular task. The efficient mapping of multiple CNNs on a single FPGA device is a challenging task as the allocation of compute resources and external memory bandwidth needs to be optimised at design time. This paper proposes f-CNNx, an automated toolflow for the optimised mapping of multiple CNNs on FPGAs, comprising a novel multi-CNN hardware architecture together with an automated design space exploration method that considers the user-specified performance requirements for each model to allocate compute resources and generate a synthesisable accelerator. Moreover, f-CNNx employs a novel scheduling algorithm that alleviates the limitations of the memory bandwidth contention between CNNs and sustains the high utilisation of the architecture. Experimental evaluation shows that f-CNNx's designs outperform contention-unaware FPGA mappings by up to 50% and deliver up to 6.8x higher performance-per-Watt over highly optimised GPU designs for multi-CNN systems.
Rizakis M, Venieris SI, Kouris A, et al., 2018, Approximate FPGA-based LSTMs under computation time constraints, ARC 2018: 14th International Symposium on Applied Reconfigurable Computing, Publisher: Springer, Pages: 3-15, ISSN: 0302-9743
Recurrent Neural Networks, with the prominence of Long Short-Term Memory (LSTM) networks, have demonstrated state-of-the-art accuracy in several emerging Artificial Intelligence tasks. Nevertheless, the highest performing LSTM models are becoming increasingly demanding in terms of computational and memory load. At the same time, emerging latency-sensitive applications including mobile robots and autonomous vehicles often operate under stringent computation time constraints. In this paper, we address the challenge of deploying computationally demanding LSTMs at a constrained time budget by introducing an approximate computing scheme that combines iterative low-rank compression and pruning, along with a novel FPGA-based LSTM architecture. Combined in an end-to-end framework, the approximation method parameters are optimised and the architecture is configured to address the problem of high-performance LSTM execution in time-constrained applications. Quantitative evaluation on a real-life image captioning application indicates that the proposed system required up to 6.5× less time to achieve the same application-level accuracy compared to a baseline method, while achieving an average of 25× higher accuracy under the same computation time constraints.
Venieris SI, Kouris A, Bouganis C-S, 2018, Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions, ACM Comput. Surv., Vol: 51, Pages: 56:1-56:1, ISSN: 0360-0300
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be integrated in the existing deep learning ecosystem to provide a tunable balance between performance, power consumption and programmability. In this paper, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics which include the supported applications, architectural choices, design space exploration methods and achieved performance. Moreover, major challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented. Finally, a uniform evaluation methodology is proposed, aiming at the comprehensive, complete and in-depth evaluation of CNN-to-FPGA toolflows.
Shafique M, Theocharides T, Bouganis C-S, et al., 2018, An overview of next-generation architectures for machine learning: roadmap, opportunities and challenges in the IoT era, Design, Automation & Test in Europe Conference & Exhibition (DATE), Publisher: IEEE, Pages: 827-832, ISSN: 1558-1101
The number of connected Internet of Things (IoT) devices is expected to reach over 20 billion by 2020. These range from basic sensor nodes that log and report data to ones that are capable of processing the incoming information and taking an action accordingly. Machine learning, and in particular deep learning, is the de facto processing paradigm for intelligently processing these immense volumes of data. However, the resource-inhibited environment of IoT devices, owing to their limited energy budget and low compute capabilities, renders them a challenging platform for deployment of the desired data analytics. This paper provides an overview of the current and emerging trends in designing highly efficient, reliable, secure and scalable machine learning architectures for such devices. The paper highlights the focal challenges and obstacles being faced by the community in achieving its desired goals. The paper further presents a roadmap that can help in addressing the highlighted challenges and thereby designing scalable, high-performance, and energy-efficient architectures for performing machine learning on the edge.
This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.