Ahmadi N, Cavuto ML, Feng P, et al., 2019, Towards a distributed, chronically-implantable neural interface, 9th IEEE/EMBS International Conference on Neural Engineering (NER), Publisher: IEEE, Pages: 719-724, ISSN: 1948-3546
We present a platform technology encompassing a family of innovations that together aim to tackle key challenges with existing implantable brain machine interfaces. The ENGINI (Empowering Next Generation Implantable Neural Interfaces) platform utilizes a 3-tier network (external processor, cranial transponder, intracortical probes) to inductively couple power to, and communicate data from, a distributed array of freely-floating mm-scale probes. Novel features integrated into each probe include: (1) an array of niobium microwires for observing local field potentials (LFPs) along the cortical column; (2) ultra-low power instrumentation for signal acquisition and data reduction; (3) an autonomous, self-calibrating wireless transceiver for receiving power and transmitting data; and (4) a hermetically-sealed micropackage suitable for chronic use. We are additionally engineering a surgical tool, to facilitate manual and robot-assisted insertion, within a streamlined neurosurgical workflow. Ongoing work is focused on system integration and preclinical testing.
Vasileiadis M, Bouganis C-S, Tzovaras D, 2019, Multi-person 3D pose estimation from 3D cloud data using 3D convolutional neural networks, Computer Vision and Image Understanding, ISSN: 1077-3142
Human pose estimation is considered one of the major challenges in the field of Computer Vision, playing an integral role in a large variety of technology domains. While, in the last few years, there has been an increased number of research approaches towards CNN-based 2D human pose estimation from RGB images, respective work on CNN-based 3D human pose estimation from depth/3D data has been rather limited, with current approaches failing to outperform earlier methods, partially due to the utilization of depth maps as simple 2D single-channel images, instead of an actual 3D world representation. In order to overcome this limitation, and taking into consideration recent advances in 3D detection tasks of similar nature, we propose a novel fully-convolutional, detection-based 3D-CNN architecture for 3D human pose estimation from 3D data. The architecture follows the sequential network architecture paradigm, generating per-voxel likelihood maps for each human joint, from a 3D voxel-grid input, and is extended, through a bottom-up approach, towards multi-person 3D pose estimation, allowing the algorithm to simultaneously estimate multiple human poses, without its runtime complexity being affected by the number of people within the scene. The proposed multi-person architecture, which is the first within the scope of 3D human pose estimation, is comparatively evaluated on three single person public datasets, achieving state-of-the-art performance, as well as on a public multi-person dataset achieving high recognition accuracy.
De Souza Rosa L, Bouganis C, Bonato V, 2019, Scaling up modulo scheduling for high-level synthesis, Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol: 38, Pages: 912-925, ISSN: 0278-0070
High-Level Synthesis tools have been increasingly used within the hardware design community to bridge the gap between productivity and the need to design large and complex systems. When targeting heterogeneous systems, where the CPU and the FPGA fabric are both available to perform computations, a design space exploration is usually carried out for deciding which parts of the initial code should be mapped to the FPGA fabric such that the overall system’s performance is enhanced by accelerating its computation via dedicated processors. As the targeted systems become more complex and larger, leading to a large design space exploration, a fast estimate of the possible acceleration that can be obtained by mapping certain functionality into the FPGA fabric is of paramount importance. Loop pipelining, which is responsible for the majority of HLS compilation time, is a key optimization towards achieving high-performance acceleration kernels. A new modulo scheduling algorithm is proposed, which reformulates the classical modulo scheduling problem and leads to a reduced number of integer linear problems solved, resulting in large computational savings. Moreover, the proposed approach has a controlled trade-off between solution quality and computation time. Results show the scalability is improved efficiently from quadratic, for the state-of-the-art method, to linear, for the proposed approach, while the optimized loop suffers a 1% (geomean) increment in the total number of cycles.
Boikos K, Bouganis C-S, A Scalable FPGA-based Architecture for Depth Estimation in SLAM, ARC 2019
The current state of the art of Simultaneous Localisation and Mapping, or SLAM, on low power embedded systems is about sparse localisation and mapping with low resolution results in the name of efficiency. Meanwhile, research in this field has provided many advances for information rich processing and semantic understanding, combined with high computational requirements for real-time processing. This work provides a solution to bridging this gap, in the form of a scalable SLAM-specific architecture for depth estimation for direct semi-dense SLAM. Targeting an off-the-shelf FPGA-SoC this accelerator architecture achieves a rate of more than 60 mapped frames/sec at a resolution of 640x480 achieving performance on par to a highly-optimised parallel implementation on a high-end desktop CPU with an order of magnitude improved power consumption. Furthermore, the developed architecture is combined with our previous work for the task of tracking, to form the first complete accelerator for semi-dense SLAM on FPGAs, establishing the state of the art in the area of embedded low-power systems.
Ahmadi N, Constandinou T, Bouganis C, 2019, Decoding Hand Kinematics from Local Field Potentials Using Long Short-Term Memory (LSTM) Network, 2019 9th International IEEE/EMBS Conference on Neural Engineering (NER 2019), Pages: 1-5
Local field potential (LFP) has gained increasing interest as an alternative input signal for brain-machine interfaces (BMIs) due to its informative features, long-term stability, and low frequency content. However, despite these interesting properties, LFP-based BMIs have been reported to yield low decoding performances compared to spike-based BMIs. In this paper, we propose a new decoder based on long short-term memory (LSTM) network which aims to improve the decoding performance of LFP-based BMIs. We compare offline decoding performance of the proposed LSTM decoder to a commonly used Kalman filter (KF) decoder on hand kinematics prediction tasks from multichannel LFPs. We also benchmark the performance of LFP-driven LSTM decoder against KF decoder driven by two types of spike signals: single-unit activity (SUA) and multi-unit activity (MUA). Our results show that LFP-driven LSTM decoder achieves significantly better decoding performance than LFP-, SUA-, and MUA-driven KF decoders. This suggests that LFPs coupled with LSTM decoder could provide high decoding performance, robust, and low power BMIs.
Venieris S, Bouganis C, 2019, fpgaConvNet: Mapping Regular and Irregular Convolutional Neural Networks on FPGAs, IEEE Transactions on Neural Networks and Learning Systems, Vol: 30, Pages: 326-342, ISSN: 2162-2388
Since neural networks renaissance, convolutional neural networks (ConvNets) have demonstrated a state-of-the-art performance in several emerging artificial intelligence tasks. The deployment of ConvNets in real-life applications requires power-efficient designs that meet the application-level performance needs. In this context, field-programmable gate arrays (FPGAs) can provide a potential platform that can be tailored to application-specific requirements. However, with the complexity of ConvNet models increasing rapidly, the ConvNet-to-FPGA design space becomes prohibitively large. This paper presents fpgaConvNet, an end-to-end framework for the optimized mapping of ConvNets on FPGAs. The proposed framework comprises an automated design methodology based on the synchronous dataflow (SDF) paradigm and defines a set of SDF transformations in order to efficiently navigate the architectural design space. By proposing a systematic multiobjective optimization formulation, the presented framework is able to generate hardware designs that are cooptimized for the ConvNet workload, the target device, and the application's performance metric of interest. Quantitative evaluation shows that the proposed methodology yields hardware designs that improve the performance by up to 6.65x over highly optimized graphics processing unit designs for the same power constraints and achieve up to 2.94x higher performance density compared with the state-of-the-art FPGA-based ConvNet architectures.
Kouris A, Bouganis C, 2019, Learning to fly by myself: a self-supervised CNN-based approach for autonomous navigation, Intelligent Robots and Systems (IROS 2018), 2018 IEEE/RSJ International Conference on, Publisher: IEEE
Nowadays, Unmanned Aerial Vehicles (UAVs) are becoming increasingly popular, facilitated by their extensive availability. Autonomous navigation methods can act as an enabler for the safe deployment of drones on a wide range of real-world civilian applications. In this work, we introduce a self-supervised CNN-based approach for indoor robot navigation. Our method addresses the problem of real-time obstacle avoidance, by employing a regression CNN that predicts the agent's distance-to-collision in view of the raw visual input of its on-board monocular camera. The proposed CNN is trained on our custom indoor-flight dataset which is collected and annotated with real-distance labels, in a self-supervised manner using external sensors mounted on a UAV. By simultaneously processing the current and previous input frame, the proposed CNN extracts spatio-temporal features that encapsulate both static appearance and motion information to estimate the robot's distance to its closest obstacle towards multiple directions. These predictions are used to modulate the yaw and linear velocity of the UAV, in order to navigate autonomously and avoid collisions. Experimental evaluation demonstrates that the proposed approach learns a navigation policy that achieves high accuracy on real-world indoor flights, outperforming previously proposed methods from the literature.
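As a rough illustration of how distance-to-collision predictions could modulate yaw and linear velocity, the following Python sketch maps three predicted distances to a velocity command. The abstract does not give the control law, so the function name steer_from_distances, the three directions (left, centre, right), the linear scaling and all default parameters are assumptions made for this example.

```python
def steer_from_distances(d_left, d_centre, d_right,
                         max_speed=1.0, safe_distance=2.0, max_yaw_rate=1.0):
    """Map three distance-to-collision predictions (metres) to a command.

    Hypothetical control law for illustration only; returns
    (linear_velocity, yaw_rate), where a positive yaw_rate means turn left.
    """
    # Slow down linearly as the obstacle ahead gets closer; full speed
    # is allowed once the centre distance reaches safe_distance.
    linear_velocity = max_speed * min(max(d_centre / safe_distance, 0.0), 1.0)

    # Turn towards the side with more free space, normalised to [-1, 1].
    spread = d_left + d_right
    if spread <= 0.0:
        yaw_rate = 0.0
    else:
        yaw_rate = max_yaw_rate * (d_left - d_right) / spread
    return linear_velocity, yaw_rate
```

With equal free space on both sides and a distant obstacle ahead, the sketch flies straight at full speed; with more space on the left and a close obstacle ahead, it slows down and yaws left.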
Ahmadi N, Constandinou TG, Bouganis C-S, Decoding Hand Kinematics from Local Field Potentials Using Long Short-Term Memory (LSTM) Network, Arxiv preprint
Local field potential (LFP) has gained increasing interest as an alternative input signal for brain-machine interfaces (BMIs) due to its informative features, long-term stability, and low frequency content. However, despite these interesting properties, LFP-based BMIs have been reported to yield low decoding performances compared to spike-based BMIs. In this paper, we propose a new decoder based on long short-term memory (LSTM) network which aims to improve the decoding performance of LFP-based BMIs. We compare offline decoding performance of the proposed LSTM decoder to a commonly used Kalman filter (KF) decoder on hand kinematics prediction tasks from multichannel LFPs. We also benchmark the performance of LFP-driven LSTM decoder against KF decoder driven by two types of spike signals: single-unit activity (SUA) and multi-unit activity (MUA). Our results show that LFP-driven LSTM decoder achieves significantly better decoding performance than LFP-, SUA-, and MUA-driven KF decoders. This suggests that LFPs coupled with LSTM decoder could provide high decoding performance, robust, and low power BMIs.
Kouris A, Venieris SI, Bouganis C-S, 2018, CascadeC(NN): pushing the performance limits of quantisation in convolutional neural networks, 28th International Conference on Field Programmable Logic and Applications (FPL), Publisher: IEEE, Pages: 155-162, ISSN: 1946-1488
This work presents CascadeCNN, an automated toolflow that pushes the quantisation limits of any given CNN model, aiming to perform high-throughput inference. A two-stage architecture tailored for any given CNN-FPGA pair is generated, consisting of a low- and high-precision unit in a cascade. A confidence evaluation unit is employed to identify misclassified cases from the excessively low-precision unit and forward them to the high-precision unit for re-processing. Experiments demonstrate that the proposed toolflow can achieve a performance boost up to 55% for VGG-16 and 48% for AlexNet over the baseline design for the same resource budget and accuracy, without the need of retraining the model or accessing the training data.
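The two-stage cascade described above can be sketched in software as follows. This is a minimal Python illustration, not the toolflow's implementation: the abstract does not specify the confidence metric, so the top-two probability margin, the threshold value and the two model callables are all assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_margin(probs):
    """Assumed confidence score: gap between the two largest probabilities."""
    first, second = sorted(probs, reverse=True)[:2]
    return first - second


def cascade_predict(sample, low_precision_model, high_precision_model,
                    threshold=0.3):
    """Classify with the cheap model; re-process with the accurate one if unsure.

    Both models are hypothetical callables mapping a sample to a list of
    logits. Returns (predicted_class, used_high_precision).
    """
    probs = softmax(low_precision_model(sample))
    if top2_margin(probs) >= threshold:
        return probs.index(max(probs)), False
    # Low confidence: forward the sample to the high-precision stage.
    probs = softmax(high_precision_model(sample))
    return probs.index(max(probs)), True
```

The throughput benefit of such a scheme depends on how rarely the second stage is invoked, which is why the threshold trades accuracy against speed.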
Kyrkou C, Theocharides T, Bouganis C-S, et al., 2018, Boosting the hardware-efficiency of cascade support vector machines for embedded classification applications, International Journal of Parallel Programming, Vol: 46, Pages: 1220-1246, ISSN: 0885-7458
Support Vector Machines (SVMs) are considered as a state-of-the-art classification algorithm capable of high accuracy rates for a different range of applications. When arranged in a cascade structure, SVMs can efficiently handle problems where the majority of data belongs to one of the two classes, such as image object classification, and hence can provide speedups over monolithic (single) SVM classifiers. However, the SVM classification process is still computationally demanding due to the number of support vectors. Consequently, in this paper we propose a hardware architecture optimized for cascaded SVM processing to boost performance and hardware efficiency, along with a hardware reduction method in order to reduce the overheads from the implementation of additional stages in the cascade, leading to significant resource and power savings. The architecture was evaluated for the application of object detection on 800×600 resolution images on a Spartan 6 Industrial Video Processing FPGA platform achieving over 30 frames-per-second. Moreover, by utilizing the proposed hardware reduction method we were able to reduce the utilization of FPGA custom-logic resources by ∼30%, and simultaneously observed ∼20% peak power reduction compared to a baseline implementation.
Ahmadi N, Constandinou T, Bouganis C, 2018, Estimation of neuronal firing rate using Bayesian Adaptive Kernel Smoother (BAKS), PLoS ONE, Vol: 13, ISSN: 1932-6203
Neurons use sequences of action potentials (spikes) to convey information across neuronal networks. In neurophysiology experiments, information about external stimuli or behavioral tasks has been frequently characterized in terms of neuronal firing rate. The firing rate is conventionally estimated by averaging spiking responses across multiple similar experiments (or trials). However, there exist a number of applications in neuroscience research that require firing rate to be estimated on a single trial basis. Estimating firing rate from a single trial is a challenging problem and current state-of-the-art methods do not perform well. To address this issue, we develop a new method for estimating firing rate based on a kernel smoothing technique that considers the bandwidth as a random variable with prior distribution that is adaptively updated under an empirical Bayesian framework. By carefully selecting the prior distribution together with Gaussian kernel function, an analytical expression can be achieved for the kernel bandwidth. We refer to the proposed method as Bayesian Adaptive Kernel Smoother (BAKS). We evaluate the performance of BAKS using synthetic spike train data generated by biologically plausible models: inhomogeneous Gamma (IG) and inhomogeneous inverse Gaussian (IIG). We also apply BAKS to real spike train data from non-human primate (NHP) motor and visual cortex. We benchmark the proposed method against established and previously reported methods. These include: optimized kernel smoother (OKS), variable kernel smoother (VKS), local polynomial fit (Locfit), and Bayesian adaptive regression splines (BARS). Results using both synthetic and real data demonstrate that the proposed method achieves better performance compared to competing methods. This suggests that the proposed method could be useful for understanding the encoding mechanism of neurons in cognitive-related tasks. The proposed method could also potentially improve the performance of brain-mac
Venieris SI, Bouganis C-S, f-CNNx: a toolflow for mapping multiple convolutional neural networks on FPGAs, 28th International Conference on Field Programmable Logic and Applications
The predictive power of Convolutional Neural Networks (CNNs) has been an integral factor for emerging latency-sensitive applications, such as autonomous drones and vehicles. Such systems employ multiple CNNs, each one trained for a particular task. The efficient mapping of multiple CNNs on a single FPGA device is a challenging task as the allocation of compute resources and external memory bandwidth needs to be optimised at design time. This paper proposes f-CNNx, an automated toolflow for the optimised mapping of multiple CNNs on FPGAs, comprising a novel multi-CNN hardware architecture together with an automated design space exploration method that considers the user-specified performance requirements for each model to allocate compute resources and generate a synthesisable accelerator. Moreover, f-CNNx employs a novel scheduling algorithm that alleviates the limitations of the memory bandwidth contention between CNNs and sustains the high utilisation of the architecture. Experimental evaluation shows that f-CNNx's designs outperform contention-unaware FPGA mappings by up to 50% and deliver up to 6.8x higher performance-per-Watt over highly optimised GPU designs for multi-CNN systems.
Rizakis M, Venieris SI, Kouris A, et al., 2018, Approximate FPGA-based LSTMs under computation time constraints, ARC 2018: 14th International Symposium on Applied Reconfigurable Computing, Publisher: Springer, Pages: 3-15, ISSN: 0302-9743
Recurrent Neural Networks, with the prominence of Long Short-Term Memory (LSTM) networks, have demonstrated state-of-the-art accuracy in several emerging Artificial Intelligence tasks. Nevertheless, the highest performing LSTM models are becoming increasingly demanding in terms of computational and memory load. At the same time, emerging latency-sensitive applications including mobile robots and autonomous vehicles often operate under stringent computation time constraints. In this paper, we address the challenge of deploying computationally demanding LSTMs at a constrained time budget by introducing an approximate computing scheme that combines iterative low-rank compression and pruning, along with a novel FPGA-based LSTM architecture. Combined in an end-to-end framework, the approximation method parameters are optimised and the architecture is configured to address the problem of high-performance LSTM execution in time-constrained applications. Quantitative evaluation on a real-life image captioning application indicates that the proposed system required up to 6.5× less time to achieve the same application-level accuracy compared to a baseline method, while achieving an average of 25× higher accuracy under the same computation time constraints.
Venieris SI, Kouris A, Bouganis C-S, 2018, Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions., ACM Comput. Surv., Vol: 51, Pages: 56:1-56:1, ISSN: 0360-0300
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be integrated in the existing deep learning ecosystem to provide a tunable balance between performance, power consumption and programmability. In this paper, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics which include the supported applications, architectural choices, design space exploration methods and achieved performance. Moreover, major challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented. Finally, a uniform evaluation methodology is proposed, aiming at the comprehensive, complete and in-depth evaluation of CNN-to-FPGA toolflows.
Shafique M, Theocharides T, Bouganis C-S, et al., 2018, An overview of next-generation architectures for machine learning: roadmap, opportunities and challenges in the IoT era, Design, Automation & Test in Europe Conference & Exhibition (DATE), Publisher: IEEE, Pages: 827-832, ISSN: 1558-1101
The number of connected Internet of Things (IoT) devices is expected to reach over 20 billion by 2020. These range from basic sensor nodes that log and report the data to the ones that are capable of processing the incoming information and taking an action accordingly. Machine learning, and in particular deep learning, is the de facto processing paradigm for intelligently processing these immense volumes of data. However, the resource inhibited environment of IoT devices, owing to their limited energy budget and low compute capabilities, renders them a challenging platform for deployment of desired data analytics. This paper provides an overview of the current and emerging trends in designing highly efficient, reliable, secure and scalable machine learning architectures for such devices. The paper highlights the focal challenges and obstacles being faced by the community in achieving its desired goals. The paper further presents a roadmap that can help in addressing the highlighted challenges and thereby designing scalable, high-performance, and energy efficient architectures for performing machine learning on the edge.
Ahmadi N, Constandinou TG, Bouganis C, 2018, Spike rate estimation using Bayesian Adaptive Kernel Smoother (BAKS) and its application to brain machine interfaces, 40th International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Publisher: IEEE
Brain Machine Interfaces (BMIs) mostly utilise spike rate as an input feature for decoding a desired motor output as it conveys a useful measure to the underlying neuronal activity. The spike rate is typically estimated by using a non-overlap binning method that yields a coarse estimate. There exist several methods that can produce a smooth estimate which could potentially improve the decoding performance. However, these methods are relatively computationally heavy for real-time BMIs. To address this issue, we propose a new method for estimating spike rate that is able to yield a smooth estimate and is also amenable to real-time BMIs. The proposed method, referred to as Bayesian adaptive kernel smoother (BAKS), employs a kernel smoothing technique that considers the bandwidth as a random variable with prior distribution which is adaptively updated through a Bayesian framework. With appropriate selection of prior distribution and kernel function, an analytical expression can be achieved for the kernel bandwidth. We apply BAKS and evaluate its impact on offline BMI decoding performance using Kalman filter. The results show that overlap BAKS improved the decoding performance up to 3.33% and 12.93% compared to overlap and non-overlap binning methods, respectively, depending on the window size. This suggests the feasibility and the potential use of BAKS method for real-time BMIs.
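To make the contrast between binning and kernel smoothing concrete, the Python sketch below computes both estimates from a list of spike times. It is a simplified illustration rather than BAKS itself: the bandwidth here is a fixed, user-chosen value, whereas BAKS adapts the bandwidth through its Bayesian update, which is not reproduced. The function names and parameters are assumptions for this example.

```python
import math


def binned_rate(spike_times, t_start, t_stop, bin_width):
    """Non-overlap binning: spike count per bin divided by bin width (spikes/s)."""
    n_bins = int(round((t_stop - t_start) / bin_width))
    counts = [0] * n_bins
    for t in spike_times:
        idx = math.floor((t - t_start) / bin_width)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return [c / bin_width for c in counts]


def gaussian_kernel_rate(spike_times, eval_times, bandwidth):
    """Fixed-bandwidth kernel smoothing: one unit-area Gaussian per spike (spikes/s)."""
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    return [
        sum(norm * math.exp(-0.5 * ((t - s) / bandwidth) ** 2)
            for s in spike_times)
        for t in eval_times
    ]
```

The binned estimate is piecewise constant and changes only at bin edges, while the kernel estimate varies smoothly with time; because each Gaussian has unit area, the kernel estimate integrates to approximately the number of spikes.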
Vasileiadis M, Malassiotis S, Giakoumis D, et al., 2018, Robust Human Pose Tracking For Realistic Service Robot Applications, 16th IEEE International Conference on Computer Vision (ICCV), Publisher: IEEE, Pages: 1363-1372, ISSN: 2473-9936
Robust human pose estimation and tracking plays an integral role in assistive service robot applications, as it provides information regarding the body pose and motion of the user in a scene. Even though current solutions provide high-accuracy results in controlled environments, they fail to successfully deal with problems encountered under real-life situations such as tracking initialization and failure, body part intersection, large object handling and partial-view body-part tracking. This paper presents a framework tailored for deployment under real-life situations addressing the above limitations. The framework is based on the articulated 3D-SDF data representation model, and has been extended with complementary mechanisms for addressing the above challenges. Extensive evaluation on public datasets demonstrates the framework's state-of-the-art performance, while experimental results on a challenging realistic human motion dataset exhibit its robustness in real life scenarios.
Kyrkou C, Plastiras G, Theocharides T, et al., 2018, DroNet: efficient convolutional neural network detector for real-time UAV applications, Design, Automation and Test in Europe Conference and Exhibition (DATE), Publisher: IEEE, Pages: 967-972, ISSN: 1530-1591
Unmanned Aerial Vehicles (drones) are emerging as a promising technology for both environmental and infrastructure monitoring, with broad use in a plethora of applications. Many such applications require the use of computer vision algorithms in order to analyse the information captured from an on-board camera. Such applications include detecting vehicles for emergency response and traffic monitoring. This paper, therefore, explores the trade-offs involved in the development of a single-shot object detector based on deep convolutional neural networks (CNNs) that can enable UAVs to perform vehicle detection under a resource constrained environment such as in a UAV. The paper presents a holistic approach for designing such systems: the data collection and training stages, the CNN architecture, and the optimizations necessary to efficiently map such a CNN on a lightweight embedded processing platform suitable for deployment on UAVs. Through the analysis we propose a CNN architecture that is capable of detecting vehicles from aerial UAV images and can operate between 5-18 frames-per-second for a variety of platforms with an overall accuracy of ~95%. Overall, the proposed architecture is suitable for UAV applications, utilizing low-power embedded processors that can be deployed on commercial UAVs.
Venieris SI, Bouganis C-S, fpgaConvNet: A Toolflow for Mapping Diverse Convolutional Neural Networks on Embedded FPGAs, Conference on Neural Information Processing Systems
In recent years, Convolutional Neural Networks (ConvNets) have become an enabling technology for a wide range of novel embedded Artificial Intelligence systems. Across the range of applications, the performance needs vary significantly, from high-throughput video surveillance to the very low-latency requirements of autonomous cars. In this context, FPGAs can provide a potential platform that can be optimally configured based on the different performance needs. However, the complexity of ConvNet models keeps increasing making their mapping to an FPGA device a challenging task. This work presents fpgaConvNet, an end-to-end framework for mapping ConvNets on FPGAs. The proposed framework employs an automated design methodology based on the Synchronous Dataflow (SDF) paradigm and defines a set of SDF transformations in order to efficiently explore the architectural design space. By selectively optimising for throughput, latency or multiobjective criteria, the presented tool is able to efficiently explore the design space and generate hardware designs from high-level ConvNet specifications, explicitly optimised for the performance metric of interest. Overall, our framework yields designs that improve the performance by up to 6.65x over highly optimised embedded GPU designs for the same power constraints in embedded environments.
Bouganis C, Venieris, 2017, Latency-Driven Design for FPGA-based Convolutional Neural Networks, International Conference on Field-Programmable Logic and Applications, Publisher: IEEE, ISSN: 1946-1488
In recent years, Convolutional Neural Networks (ConvNets) have become the quintessential component of several state-of-the-art Artificial Intelligence tasks. Across the spectrum of applications, the performance needs vary significantly, from high-throughput image recognition to the very low-latency requirements of autonomous cars. In this context, FPGAs can provide a potential platform that can be optimally configured based on different performance requirements. However, with the increasing complexity of ConvNet models, the architectural design space becomes overwhelmingly large, asking for principled design flows that address the application-level needs. This paper presents a latency-driven design methodology for mapping ConvNets on FPGAs. The proposed design flow employs novel transformations over a Synchronous Dataflow-based modelling framework together with a latency-centric optimisation procedure in order to efficiently explore the design space targeting low-latency designs. Quantitative evaluation shows large improvements in latency when latency-driven optimisation is in place yielding designs that improve the latency of AlexNet by 73.54× and VGG16 by 5.61× over throughput-optimised designs.
Bouganis C, Boikos K, 2017, A High-Performance System-on-Chip Architecture for Direct Tracking for SLAM, International Conference on Field-Programmable Logic and Applications, Publisher: IEEE, ISSN: 1946-1488
Simultaneous Localization and Mapping, or SLAM, is a family of algorithms that solve the problem of estimating an observer's position in an unknown environment while generating a map of that environment. SLAM algorithms that produce high quality dense maps require powerful hardware platforms. In the simultaneous solution of these two problems, Localization, also known as Tracking, is the one that is latency sensitive and needs a sustained high framerate. This work focuses on providing an efficient, high-performance solution for Direct Tracking using a high bandwidth streaming architecture, optimized for maximum memory throughput. At its centre is a Tracking Core that performs non-linear least-squares optimization for direct whole-image alignment. The architecture is designed to scale with the available hardware resources in order to enable its use for different performance/cost levels and platforms. An initial implementation tested with a Zynq System-on-Chip can process and track more than 22 frames/second with an embedded power budget and achieves a 5× improvement over previous work on FPGA SoCs.
Bouganis C-S, Gorgon M, Bonato V, 2017, Special issue on applied reconfigurable computing, MICROPROCESSORS AND MICROSYSTEMS, Vol: 52, Pages: 1-1, ISSN: 0141-9331
Liu S, Bouganis CS, 2017, Communication-aware MCMC method for big data applications on FPGAs, Field-Programmable Custom Computing Machines (FCCM), 2017 IEEE 25th Annual International Symposium on, Pages: 9-16
Markov Chain Monte Carlo (MCMC) based methods have been the main tool for Bayesian Inference for some years now, and recently they find increasing applications in modern statistics and machine learning. Nevertheless, with the availability of large datasets and increasing complexity of Bayesian models, MCMC methods are becoming prohibitively expensive for real-world problems. At the heart of these methods lies the computation of likelihood functions that requires access to all input data points in each iteration of the method. Current approaches, based on data subsampling, aim to accelerate these algorithms by reducing the number of the data points for likelihood evaluations at each MCMC iteration. However, the existing work doesn't consider the properties of modern memory hierarchies, but treats the memory as one monolithic storage space. This paper proposes a communication-aware MCMC framework that takes into account the underlying performance of the memory subsystem. The framework is based on a novel subsampling algorithm that utilises an unbiased likelihood estimator based on Probability Proportional-to-Size (PPS) sampling, allowing information on the performance of the memory system to be taken into account during the sampling stage. The proposed MCMC sampler is mapped to an FPGA device and its performance is evaluated using the Bayesian logistic regression model on MNIST dataset. The proposed system achieves a 3.37x speed up over a highly optimised traditional FPGA design, therefore the risk in the estimates based on the generated samples is largely decreased.
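The idea of an unbiased estimate from PPS sampling can be illustrated with the textbook Hansen-Hurwitz estimator of a sum, sketched below in Python. This is not the paper's estimator or sampler: the function name pps_sum_estimate and the choice of sizes are assumptions, and the per-point terms would in practice be log-likelihood contributions. The sizes could, for example, be chosen to favour data points that are cheaper to fetch from memory.

```python
import random


def pps_sum_estimate(values, sizes, num_draws, rng):
    """Hansen-Hurwitz estimate of sum(values) under PPS sampling with replacement.

    Index i is drawn with probability p_i = sizes[i] / sum(sizes). Each draw
    contributes values[i] / p_i, whose expectation is exactly sum(values),
    so the average over draws is unbiased for any strictly positive sizes.
    """
    total_size = sum(sizes)
    probs = [s / total_size for s in sizes]
    picks = rng.choices(range(len(values)), weights=probs, k=num_draws)
    return sum(values[i] / probs[i] for i in picks) / num_draws
```

When the sizes are exactly proportional to the magnitudes of the terms, every draw returns the true sum and the estimator has zero variance; the closer the sizes track the terms, the fewer draws are needed for a given accuracy.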
Vavouras M, Bouganis C-S, 2017, Area-driven partial reconfiguration for SEU mitigation on SRAM-based FPGAs, International Conference on Reconfigurable Computing and FPGAs (ReConFig), Publisher: IEEE, ISSN: 2325-6532
This paper presents an area-driven Field-Programmable Gate Array (FPGA) scrubbing technique based on partial reconfiguration for Single Event Upset (SEU) mitigation. The proposed method is compared with existing techniques such as blind and on-demand scrubbing on a novel SEU mitigation framework implemented on the ZYNQ platform, supporting various SEU and scrubbing rates. A design space exploration of availability versus data transfers from a Double Data Rate Type 3 (DDR3) memory shows that our approach outperforms blind scrubbing for a range of availability values when a second-order polynomial IP is targeted. A comparison to an existing on-demand scrubbing technique based on Dual Modular Redundancy (DMR) shows that our approach saves up to 46% area for the same case study.
Vavouras M, Duarte RP, Armato A, et al., 2017, A Hybrid ASIC/FPGA Fault-Tolerant Artificial Pancreas, International Conference on Embedded Computer Systems - Architectures, Modeling and Simulation (SAMOS), Publisher: IEEE, Pages: 261-267
Venieris SI, Bouganis C-S, 2017, fpgaConvNet: Automated Mapping of Convolutional Neural Networks on FPGAs (Abstract Only), International Symposium on Field-Programmable Gate Arrays, Publisher: ACM, Pages: 291-292
Liu S, Mingas G, Bouganis C, 2016, An Unbiased MCMC FPGA-based Accelerator in the Land of Custom Precision Arithmetic, IEEE Transactions on Computers, Vol: 66, Pages: 745-758, ISSN: 0018-9340
Markov Chain Monte Carlo (MCMC) based methods have been the main tool used for Bayesian Inference by practitioners and researchers due to their flexibility and theoretical properties that guarantee unbiased sampling-based estimates. Nevertheless, with the availability of large data sets and the constant need to develop more complex models that better capture the targeted problem, significant computational challenges have been presented. Current approaches, based on multi-core CPUs, GPUs, and FPGAs, aim to accelerate the execution time of the MCMC methods using subsampling techniques or custom precision arithmetic, resulting in biased estimates. In this work, a novel FPGA-based construction is proposed that utilises the custom precision support of FPGA devices in order to accelerate the computations, guaranteeing at the same time asymptotically unbiased estimates. Key to this approach is the extension of the parameter space by an extra parameter that indicates the required precision in the computation of the likelihood of a data point. The work proposes an FPGA architecture for the above algorithm, and discusses its tuning for maximising the performance of the system. The performance of the FPGA-mapped sampler is evaluated using two Bayesian logistic regression case studies of varying complexity, which show significant speedups compared to existing FPGA- and CPU-based works that utilise double floating point arithmetic, without any bias on the sampling-based estimates.
Bouganis C, Mingas G, Bottolo L, 2016, Particle MCMC algorithms and architectures for accelerating inference in state-space models, International Journal of Approximate Reasoning, Vol: 83, Pages: 413-433, ISSN: 1873-4731
Particle Markov Chain Monte Carlo (pMCMC) is a stochastic algorithm designed to generate samples from a probability distribution, when the density of the distribution does not admit a closed form expression. pMCMC is most commonly used to sample from the Bayesian posterior distribution in State-Space Models (SSMs), a class of probabilistic models used in numerous scientific applications. Nevertheless, this task is prohibitive when dealing with complex SSMs with massive data, due to the high computational cost of pMCMC and its poor performance when the posterior exhibits multi-modality. This paper aims to address both issues by: 1) Proposing a novel pMCMC algorithm (denoted ppMCMC), which uses multiple Markov chains (instead of the one used by pMCMC) to improve sampling
Boikos K, Bouganis C-S, 2016, Semi-dense SLAM on an FPGA SoC, 26th International Conference on Field-Programmable Logic and Applications (FPL), Publisher: IEEE, ISSN: 1946-1488
Deploying advanced Simultaneous Localisation and Mapping, or SLAM, algorithms in autonomous low-power robotics will enable emerging applications that require an accurate and information-rich reconstruction of the environment. This has not been achieved so far because accuracy and dense 3D reconstruction come with a high computational complexity. This paper discusses custom hardware design on a novel platform for embedded SLAM, an FPGA-SoC, combining an embedded CPU and programmable logic on the same chip. The use of programmable logic, tightly integrated with an efficient multicore embedded CPU, stands to provide an effective solution to this problem. In this work, an average framerate of more than 4 frames/second for a resolution of 320×240 has been achieved with an estimated power of less than 1 Watt for the custom hardware. In comparison to the software-only version, running on a dual-core ARM processor, an acceleration of 2× has been achieved for LSD-SLAM, without any compromise in the quality of the result.
Rabieah MB, Bouganis C-S, 2016, FPGASVM: A Framework for Accelerating Kernelized Support Vector Machine, BigMine-2016, Publisher: JMLR.org, Pages: 68-84