Publications

Ahmadi N, Constandinou TG, Bouganis C-S, 2019, Decoding Hand Kinematics from Local Field Potentials Using Long Short-Term Memory (LSTM) Network, 2019 9th International IEEE/EMBS Conference on Neural Engineering (NER 2019), Publisher: IEEE, Pages: 415-419

Local field potential (LFP) has gained increasing interest as an alternative input signal for brain-machine interfaces (BMIs) due to its informative features, long-term stability, and low frequency content. However, despite these interesting properties, LFP-based BMIs have been reported to yield low decoding performances compared to spike-based BMIs. In this paper, we propose a new decoder based on long short-term memory (LSTM) network which aims to improve the decoding performance of LFP-based BMIs. We compare offline decoding performance of the proposed LSTM decoder to a commonly used Kalman filter (KF) decoder on hand kinematics prediction tasks from multichannel LFPs. We also benchmark the performance of LFP-driven LSTM decoder against KF decoder driven by two types of spike signals: single-unit activity (SUA) and multi-unit activity (MUA). Our results show that LFP-driven LSTM decoder achieves significantly better decoding performance than LFP-, SUA-, and MUA-driven KF decoders. This suggests that LFPs coupled with LSTM decoder could provide high decoding performance, robust, and low power BMIs.

Conference paper

Venieris S, Bouganis C, 2019, fpgaConvNet: mapping regular and irregular convolutional neural networks on FPGAs, IEEE Transactions on Neural Networks and Learning Systems, Vol: 30, Pages: 326-342, ISSN: 2162-2388

Since neural networks renaissance, convolutional neural networks (ConvNets) have demonstrated a state-of-the-art performance in several emerging artificial intelligence tasks. The deployment of ConvNets in real-life applications requires power-efficient designs that meet the application-level performance needs. In this context, field-programmable gate arrays (FPGAs) can provide a potential platform that can be tailored to application-specific requirements. However, with the complexity of ConvNet models increasing rapidly, the ConvNet-to-FPGA design space becomes prohibitively large. This paper presents fpgaConvNet, an end-to-end framework for the optimized mapping of ConvNets on FPGAs. The proposed framework comprises an automated design methodology based on the synchronous dataflow (SDF) paradigm and defines a set of SDF transformations in order to efficiently navigate the architectural design space. By proposing a systematic multiobjective optimization formulation, the presented framework is able to generate hardware designs that are cooptimized for the ConvNet workload, the target device, and the application's performance metric of interest. Quantitative evaluation shows that the proposed methodology yields hardware designs that improve the performance by up to 6.65x over highly optimized graphics processing unit designs for the same power constraints and achieve up to 2.94x higher performance density compared with the state-of-the-art FPGA-based ConvNet architectures.

Journal article

Kouris A, Bouganis C, 2019, Learning to fly by myself: a self-supervised CNN-based approach for autonomous navigation, Intelligent Robots and Systems (IROS 2018), 2018 IEEE/RSJ International Conference on, Publisher: IEEE

Nowadays, Unmanned Aerial Vehicles (UAVs) are becoming increasingly popular facilitated by their extensive availability. Autonomous navigation methods can act as an enabler for the safe deployment of drones on a wide range of real-world civilian applications. In this work, we introduce a self-supervised CNN-based approach for indoor robot navigation. Our method addresses the problem of real-time obstacle avoidance, by employing a regression CNN that predicts the agent's distance-to-collision in view of the raw visual input of its on-board monocular camera. The proposed CNN is trained on our custom indoor-flight dataset which is collected and annotated with real-distance labels, in a self-supervised manner using external sensors mounted on a UAV. By simultaneously processing the current and previous input frame, the proposed CNN extracts spatio-temporal features that encapsulate both static appearance and motion information to estimate the robot's distance to its closest obstacle towards multiple directions. These predictions are used to modulate the yaw and linear velocity of the UAV, in order to navigate autonomously and avoid collisions. Experimental evaluation demonstrates that the proposed approach learns a navigation policy that achieves high accuracy on real-world indoor flights, outperforming previously proposed methods from the literature.

Conference paper

Ahmadi N, Cavuto ML, Feng P, Leene LB, Maslik M, Mazza F, Savolainen O, Szostak KM, Bouganis C-S, Ekanayake J, Jackson A, Constandinou TG et al., 2019, Towards a Distributed, Chronically-Implantable Neural Interface, Publisher: IEEE, Pages: 719-724

Conference paper

Montgomerie-Corcoran A, Venieris S, Bouganis C-S, 2019, Power-Aware FPGA Mapping of Convolutional Neural Networks, International Conference on Field-Programmable Technology (ICFPT), Publisher: IEEE COMPUTER SOC, Pages: 327-330

Conference paper

Kouris A, Venieris SI, Bouganis C-S, 2018, CascadeCNN: pushing the performance limits of quantisation in convolutional neural networks, 28th International Conference on Field Programmable Logic and Applications (FPL), Publisher: IEEE, Pages: 155-162, ISSN: 1946-1488

This work presents CascadeCNN, an automated toolflow that pushes the quantisation limits of any given CNN model, aiming to perform high-throughput inference. A two-stage architecture tailored for any given CNN-FPGA pair is generated, consisting of a low- and high-precision unit in a cascade. A confidence evaluation unit is employed to identify misclassified cases from the excessively low-precision unit and forward them to the high-precision unit for re-processing. Experiments demonstrate that the proposed toolflow can achieve a performance boost up to 55% for VGG-16 and 48% for AlexNet over the baseline design for the same resource budget and accuracy, without the need of retraining the model or accessing the training data.

Conference paper
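
The cascade described in the CascadeCNN abstract above can be illustrated with a minimal software sketch. This is not the authors' toolflow or hardware architecture: the function names, the use of the maximum softmax probability as the confidence measure, and the threshold value are all assumptions made for illustration.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cascade_predict(x, low_precision_model, high_precision_model, threshold=0.8):
    """Classify a batch with a two-stage cascade (illustrative sketch only).

    low_precision_model / high_precision_model are hypothetical callables that
    map a batch of inputs to class logits. The confidence test used here
    (maximum softmax probability >= threshold) is an assumption, not the
    metric used in the paper.
    """
    probs = softmax(low_precision_model(x))
    labels = probs.argmax(axis=1)
    # Samples whose low-precision prediction is not confident enough
    # are forwarded to the high-precision unit for re-processing.
    redo = probs.max(axis=1) < threshold
    if redo.any():
        labels[redo] = softmax(high_precision_model(x[redo])).argmax(axis=1)
    return labels, redo
```

The returned mask shows which inputs were re-processed; the fraction of `True` entries is what determines how much of the low-precision speed-up survives in such a scheme.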

Kyrkou C, Theocharides T, Bouganis C-S, Polycarpou M et al., 2018, Boosting the hardware-efficiency of cascade support vector machines for embedded classification applications, International Journal of Parallel Programming, Vol: 46, Pages: 1220-1246, ISSN: 0885-7458

Support Vector Machines (SVMs) are considered as a state-of-the-art classification algorithm capable of high accuracy rates for a different range of applications. When arranged in a cascade structure, SVMs can efficiently handle problems where the majority of data belongs to one of the two classes, such as image object classification, and hence can provide speedups over monolithic (single) SVM classifiers. However, the SVM classification process is still computationally demanding due to the number of support vectors. Consequently, in this paper we propose a hardware architecture optimized for cascaded SVM processing to boost performance and hardware efficiency, along with a hardware reduction method in order to reduce the overheads from the implementation of additional stages in the cascade, leading to significant resource and power savings. The architecture was evaluated for the application of object detection on 800×600 resolution images on a Spartan 6 Industrial Video Processing FPGA platform achieving over 30 frames-per-second. Moreover, by utilizing the proposed hardware reduction method we were able to reduce the utilization of FPGA custom-logic resources by ∼30%, and simultaneously observed ∼20% peak power reduction compared to a baseline implementation.

Journal article

Ahmadi N, Constandinou T, Bouganis C, 2018, Estimation of neuronal firing rate using Bayesian Adaptive Kernel Smoother (BAKS), PLoS ONE, Vol: 13, ISSN: 1932-6203

Neurons use sequences of action potentials (spikes) to convey information across neuronal networks. In neurophysiology experiments, information about external stimuli or behavioral tasks has been frequently characterized in terms of neuronal firing rate. The firing rate is conventionally estimated by averaging spiking responses across multiple similar experiments (or trials). However, there exist a number of applications in neuroscience research that require firing rate to be estimated on a single trial basis. Estimating firing rate from a single trial is a challenging problem and current state-of-the-art methods do not perform well. To address this issue, we develop a new method for estimating firing rate based on a kernel smoothing technique that considers the bandwidth as a random variable with prior distribution that is adaptively updated under an empirical Bayesian framework. By carefully selecting the prior distribution together with Gaussian kernel function, an analytical expression can be achieved for the kernel bandwidth. We refer to the proposed method as Bayesian Adaptive Kernel Smoother (BAKS). We evaluate the performance of BAKS using synthetic spike train data generated by biologically plausible models: inhomogeneous Gamma (IG) and inhomogeneous inverse Gaussian (IIG). We also apply BAKS to real spike train data from non-human primate (NHP) motor and visual cortex. We benchmark the proposed method against established and previously reported methods. These include: optimized kernel smoother (OKS), variable kernel smoother (VKS), local polynomial fit (Locfit), and Bayesian adaptive regression splines (BARS). Results using both synthetic and real data demonstrate that the proposed method achieves better performance compared to competing methods. This suggests that the proposed method could be useful for understanding the encoding mechanism of neurons in cognitive-related tasks. The proposed method could also potentially improve the performance of brain-mac

Journal article

Ahmadi N, Constandinou TG, Bouganis C, 2018, Spike rate estimation using Bayesian Adaptive Kernel Smoother (BAKS) and its application to brain machine interfaces, 40th International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Publisher: IEEE

Brain Machine Interfaces (BMIs) mostly utilise spike rate as an input feature for decoding a desired motor output as it conveys a useful measure to the underlying neuronal activity. The spike rate is typically estimated by using a non-overlap binning method that yields a coarse estimate. There exist several methods that can produce a smooth estimate which could potentially improve the decoding performance. However, these methods are relatively computationally heavy for real-time BMIs. To address this issue, we propose a new method for estimating spike rate that is able to yield a smooth estimate and is also amenable to real-time BMIs. The proposed method, referred to as Bayesian adaptive kernel smoother (BAKS), employs a kernel smoothing technique that considers the bandwidth as a random variable with prior distribution which is adaptively updated through a Bayesian framework. With appropriate selection of prior distribution and kernel function, an analytical expression can be achieved for the kernel bandwidth. We apply BAKS and evaluate its impact on offline BMI decoding performance using Kalman filter. The results show that overlap BAKS improved the decoding performance up to 3.33% and 12.93% compared to overlap and non-overlap binning methods, respectively, depending on the window size. This suggests the feasibility and the potential use of BAKS method for real-time BMIs.

Conference paper
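
To make the contrast drawn in the abstract above concrete, the following sketch compares non-overlap binning with a plain Gaussian kernel smoother. It is not an implementation of BAKS: the bandwidth here is a fixed, user-chosen constant, whereas BAKS treats it as a random variable updated under a Bayesian framework. The function names and parameters are hypothetical.

```python
import numpy as np

def binned_rate(spike_times, t_start, t_stop, bin_width):
    """Coarse rate estimate: spike count per non-overlapping bin / bin width."""
    n_bins = int(round((t_stop - t_start) / bin_width))
    edges = t_start + bin_width * np.arange(n_bins + 1)
    counts, _ = np.histogram(spike_times, bins=edges)
    return counts / bin_width

def gaussian_kernel_rate(spike_times, eval_times, bandwidth):
    """Smooth rate estimate: sum of unit-area Gaussian kernels centred on spikes.

    Assumption: a single fixed bandwidth (in the same time units as the spike
    times). An adaptive method such as BAKS would choose it from the data.
    """
    spike_times = np.asarray(spike_times, dtype=float)
    eval_times = np.asarray(eval_times, dtype=float)
    # Pairwise differences between evaluation points and spike times.
    diffs = eval_times[:, None] - spike_times[None, :]
    kernels = np.exp(-0.5 * (diffs / bandwidth) ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.sum(axis=1)
```

Because each kernel has unit area, the smoothed rate integrates to approximately the number of spikes, which is a convenient sanity check when choosing the evaluation grid.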

Venieris SI, Kouris A, Bouganis C-S, 2018, Deploying Deep Neural Networks in the Embedded Space

Recently, Deep Neural Networks (DNNs) have emerged as the dominant model across various AI applications. In the era of IoT and mobile systems, the efficient deployment of DNNs on embedded platforms is vital to enable the development of intelligent applications. This paper summarises our recent work on the optimised mapping of DNNs on embedded settings. By covering such diverse topics as DNN-to-accelerator toolflows, high-throughput cascaded classifiers and domain-specific model design, the presented set of works aim to enable the deployment of sophisticated deep learning models on cutting-edge mobile and embedded systems.

Working paper

Venieris SI, Bouganis C-S, 2018, f-CNNx: a toolflow for mapping multiple convolutional neural networks on FPGAs, 28th International Conference on Field Programmable Logic and Applications

The predictive power of Convolutional Neural Networks (CNNs) has been an integral factor for emerging latency-sensitive applications, such as autonomous drones and vehicles. Such systems employ multiple CNNs, each one trained for a particular task. The efficient mapping of multiple CNNs on a single FPGA device is a challenging task as the allocation of compute resources and external memory bandwidth needs to be optimised at design time. This paper proposes f-CNNx, an automated toolflow for the optimised mapping of multiple CNNs on FPGAs, comprising a novel multi-CNN hardware architecture together with an automated design space exploration method that considers the user-specified performance requirements for each model to allocate compute resources and generate a synthesisable accelerator. Moreover, f-CNNx employs a novel scheduling algorithm that alleviates the limitations of the memory bandwidth contention between CNNs and sustains the high utilisation of the architecture. Experimental evaluation shows that f-CNNx's designs outperform contention-unaware FPGA mappings by up to 50% and deliver up to 6.8x higher performance-per-Watt over highly optimised GPU designs for multi-CNN systems.

Conference paper

Kouris A, Venieris SI, Bouganis C-S, 2018, CascadeCNN: Pushing the performance limits of quantisation

This work presents CascadeCNN, an automated toolflow that pushes the quantisation limits of any given CNN model, to perform high-throughput inference by exploiting the computation time-accuracy trade-off. Without the need for retraining, a two-stage architecture tailored for any given FPGA device is generated, consisting of a low- and a high-precision unit. A confidence evaluation unit is employed between them to identify misclassified cases at run time and forward them to the high-precision unit or terminate computation. Experiments demonstrate that CascadeCNN achieves a performance boost of up to 55% for VGG-16 and 48% for AlexNet over the baseline design for the same resource budget and accuracy.

Conference paper

Rizakis M, Venieris SI, Kouris A, Bouganis C-S et al., 2018, Approximate FPGA-based LSTMs under computation time constraints, ARC 2018: 14th International Symposium on Applied Reconfigurable Computing, Publisher: Springer, Pages: 3-15, ISSN: 0302-9743

Recurrent Neural Networks, with the prominence of Long Short-Term Memory (LSTM) networks, have demonstrated state-of-the-art accuracy in several emerging Artificial Intelligence tasks. Nevertheless, the highest performing LSTM models are becoming increasingly demanding in terms of computational and memory load. At the same time, emerging latency-sensitive applications including mobile robots and autonomous vehicles often operate under stringent computation time constraints. In this paper, we address the challenge of deploying computationally demanding LSTMs at a constrained time budget by introducing an approximate computing scheme that combines iterative low-rank compression and pruning, along with a novel FPGA-based LSTM architecture. Combined in an end-to-end framework, the approximation method parameters are optimised and the architecture is configured to address the problem of high-performance LSTM execution in time-constrained applications. Quantitative evaluation on a real-life image captioning application indicates that the proposed system required up to 6.5× less time to achieve the same application-level accuracy compared to a baseline method, while achieving an average of 25× higher accuracy under the same computation time constraints.

Conference paper

Venieris SI, Kouris A, Bouganis C-S, 2018, Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions, ACM Comput. Surv., Vol: 51, Pages: 56:1-56:1, ISSN: 0360-0300

In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be integrated in the existing deep learning ecosystem to provide a tunable balance between performance, power consumption and programmability. In this paper, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics which include the supported applications, architectural choices, design space exploration methods and achieved performance. Moreover, major challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented. Finally, a uniform evaluation methodology is proposed, aiming at the comprehensive, complete and in-depth evaluation of CNN-to-FPGA toolflows.

Journal article

Shafique M, Theocharides T, Bouganis C-S, Hanif MA, Khalid F, Hafiz R, Rehman S et al., 2018, An overview of next-generation architectures for machine learning: roadmap, opportunities and challenges in the IoT era, Design, Automation & Test in Europe Conference & Exhibition (DATE), Publisher: IEEE, Pages: 827-832, ISSN: 1558-1101

The number of connected Internet of Things (IoT) devices is expected to reach over 20 billion by 2020. These range from basic sensor nodes that log and report the data to the ones that are capable of processing the incoming information and taking an action accordingly. Machine learning, and in particular deep learning, is the de facto processing paradigm for intelligently processing these immense volumes of data. However, the resource inhibited environment of IoT devices, owing to their limited energy budget and low compute capabilities, renders them a challenging platform for deployment of desired data analytics. This paper provides an overview of the current and emerging trends in designing highly efficient, reliable, secure and scalable machine learning architectures for such devices. The paper highlights the focal challenges and obstacles being faced by the community in achieving its desired goals. The paper further presents a roadmap that can help in addressing the highlighted challenges and thereby designing scalable, high-performance, and energy efficient architectures for performing machine learning on the edge.

Conference paper

Vasileiadis M, Malassiotis S, Giakoumis D, Bouganis C-S, Tzovaras D et al., 2018, Robust Human Pose Tracking For Realistic Service Robot Applications, 16th IEEE International Conference on Computer Vision (ICCV), Publisher: IEEE, Pages: 1363-1372, ISSN: 2473-9936

Robust human pose estimation and tracking plays an integral role in assistive service robot applications, as it provides information regarding the body pose and motion of the user in a scene. Even though current solutions provide high-accuracy results in controlled environments, they fail to successfully deal with problems encountered under real-life situations such as tracking initialization and failure, body part intersection, large object handling and partial-view body-part tracking. This paper presents a framework tailored for deployment under real-life situations addressing the above limitations. The framework is based on the articulated 3D-SDF data representation model, and has been extended with complementary mechanisms for addressing the above challenges. Extensive evaluation on public datasets demonstrates the framework's state-of-the-art performance, while experimental results on a challenging realistic human motion dataset exhibit its robustness in real life scenarios.

Conference paper

Kyrkou C, Plastiras G, Theocharides T, Venieris SI, Bouganis C-S et al., 2018, DroNet: efficient convolutional neural network detector for real-time UAV applications, Design, Automation and Test in Europe Conference and Exhibition (DATE), Publisher: IEEE, Pages: 967-972, ISSN: 1530-1591

Unmanned Aerial Vehicles (drones) are emerging as a promising technology for both environmental and infrastructure monitoring, with broad use in a plethora of applications. Many such applications require the use of computer vision algorithms in order to analyse the information captured from an on-board camera. Such applications include detecting vehicles for emergency response and traffic monitoring. This paper, therefore, explores the trade-offs involved in the development of a single-shot object detector based on deep convolutional neural networks (CNNs) that can enable UAVs to perform vehicle detection under a resource constrained environment such as in a UAV. The paper presents a holistic approach for designing such systems: the data collection and training stages, the CNN architecture, and the optimizations necessary to efficiently map such a CNN on a lightweight embedded processing platform suitable for deployment on UAVs. Through the analysis we propose a CNN architecture that is capable of detecting vehicles from aerial UAV images and can operate between 5-18 frames-per-second for a variety of platforms with an overall accuracy of ~95%. Overall, the proposed architecture is suitable for UAV applications, utilizing low-power embedded processors that can be deployed on commercial UAVs.

Conference paper

Rosa LDS, Bonato V, Bouganis C-S, 2018, Scaling Up Loop Pipelining for High-Level Synthesis: A Non-iterative Approach, Publisher: IEEE, Pages: 62-69

Conference paper

Venieris SI, Bouganis C-S, 2017, fpgaConvNet: A Toolflow for Mapping Diverse Convolutional Neural Networks on Embedded FPGAs, Conference on Neural Information Processing Systems

In recent years, Convolutional Neural Networks (ConvNets) have become an enabling technology for a wide range of novel embedded Artificial Intelligence systems. Across the range of applications, the performance needs vary significantly, from high-throughput video surveillance to the very low-latency requirements of autonomous cars. In this context, FPGAs can provide a potential platform that can be optimally configured based on the different performance needs. However, the complexity of ConvNet models keeps increasing making their mapping to an FPGA device a challenging task. This work presents fpgaConvNet, an end-to-end framework for mapping ConvNets on FPGAs. The proposed framework employs an automated design methodology based on the Synchronous Dataflow (SDF) paradigm and defines a set of SDF transformations in order to efficiently explore the architectural design space. By selectively optimising for throughput, latency or multiobjective criteria, the presented tool is able to efficiently explore the design space and generate hardware designs from high-level ConvNet specifications, explicitly optimised for the performance metric of interest. Overall, our framework yields designs that improve the performance by up to 6.65x over highly optimised embedded GPU designs for the same power constraints in embedded environments.

Journal article

Bouganis C, Boikos K, 2017, A High-Performance System-on-Chip Architecture for Direct Tracking for SLAM, International Conference on Field-Programmable Logic and Applications, Publisher: IEEE, ISSN: 1946-1488

Simultaneous Localization and Mapping or SLAM, is a family of algorithms that solve the problem of estimating an observer's position in an unknown environment while generating a map of that environment. SLAM algorithms that produce high quality dense maps require powerful hardware platforms. In the simultaneous solution of these two problems, Localization, also known as Tracking, is the one that is latency sensitive and needs a sustained high framerate. This work focuses on providing an efficient, high-performance solution for Direct Tracking using a high bandwidth streaming architecture, optimized for maximum memory throughput. At its centre is a Tracking Core that performs non-linear least-squares optimization for direct whole-image alignment. The architecture is designed to scale with the available hardware resources in order to enable its use for different performance/cost levels and platforms. An initial implementation tested with a Zynq System-on-Chip can process and track more than 22 frames/second with an embedded power budget and achieves a 5× improvement over previous work on FPGA SoCs.

Conference paper

Bouganis C, Venieris, 2017, Latency-Driven Design for FPGA-based Convolutional Neural Networks, International Conference on Field-Programmable Logic and Applications, Publisher: IEEE, ISSN: 1946-1488

In recent years, Convolutional Neural Networks (ConvNets) have become the quintessential component of several state-of-the-art Artificial Intelligence tasks. Across the spectrum of applications, the performance needs vary significantly, from high-throughput image recognition to the very low-latency requirements of autonomous cars. In this context, FPGAs can provide a potential platform that can be optimally configured based on different performance requirements. However, with the increasing complexity of ConvNet models, the architectural design space becomes overwhelmingly large, asking for principled design flows that address the application-level needs. This paper presents a latency-driven design methodology for mapping ConvNets on FPGAs. The proposed design flow employs novel transformations over a Synchronous Dataflow-based modelling framework together with a latency-centric optimisation procedure in order to efficiently explore the design space targeting low-latency designs. Quantitative evaluation shows large improvements in latency when latency-driven optimisation is in place yielding designs that improve the latency of AlexNet by 73.54× and VGG16 by 5.61× over throughput-optimised designs.

Conference paper

Liu S, Bouganis CS, 2017, Communication-aware MCMC method for big data applications on FPGAs, Field-Programmable Custom Computing Machines (FCCM), 2017 IEEE 25th Annual International Symposium on, Pages: 9-16

Markov Chain Monte Carlo (MCMC) based methods have been the main tool for Bayesian Inference for some years now, and recently they find increasing applications in modern statistics and machine learning. Nevertheless, with the availability of large datasets and increasing complexity of Bayesian models, MCMC methods are becoming prohibitively expensive for real-world problems. At the heart of these methods lies the computation of likelihood functions that requires access to all input data points in each iteration of the method. Current approaches, based on data subsampling, aim to accelerate these algorithms by reducing the number of the data points for likelihood evaluations at each MCMC iteration. However, the existing work doesn't consider the properties of modern memory hierarchies, but treats the memory as one monolithic storage space. This paper proposes a communication-aware MCMC framework that takes into account the underlying performance of the memory subsystem. The framework is based on a novel subsampling algorithm that utilises an unbiased likelihood estimator based on Probability Proportional-to-Size (PPS) sampling, allowing information on the performance of the memory system to be taken into account during the sampling stage. The proposed MCMC sampler is mapped to an FPGA device and its performance is evaluated using the Bayesian logistic regression model on MNIST dataset. The proposed system achieves a 3.37x speed up over a highly optimised traditional FPGA design, therefore the risk in the estimates based on the generated samples is largely decreased.

Conference paper

Liu S, Mingas G, Bouganis C, 2017, An unbiased MCMC FPGA-based accelerator in the land of custom precision arithmetic, IEEE Transactions on Computers, Vol: 66, Pages: 745-758, ISSN: 0018-9340

Markov Chain Monte Carlo (MCMC) based methods have been the main tool used for Bayesian Inference by practitioners and researchers due to their flexibility and theoretical properties that guarantee unbiased sampling-based estimates. Nevertheless, with the availability of large data sets and the constant need to develop more complex models that better capture the targeted problem, significant computational challenges have been presented. Current approaches, based on multi-core CPUs, GPUs, and FPGAs, aim to accelerate the execution time of the MCMC methods using subsampling techniques or custom precision arithmetic, resulting in biased estimates. In this work, a novel FPGA-based construction is proposed that utilises the custom precision support of FPGA devices in order to accelerate the computations, guaranteeing at the same time asymptotically unbiased estimates. Key to this approach is the extension of the parameter space by an extra parameter that indicates the required precision in the computation of the likelihood of a data point. The work proposes an FPGA architecture for the above algorithm, and discusses its tuning for maximising the performance of the system. The performance of the FPGA-mapped sampler is evaluated using two Bayesian logistic regression case studies of varying complexity, which show significant speedups compared to existing FPGA- and CPU-based works that utilise double floating point arithmetic, without any bias on the sampling-based estimates.

Journal article

Vavouras M, Bouganis C-S, 2017, Area-driven partial reconfiguration for SEU mitigation on SRAM-based FPGAs, International Conference on Reconfigurable Computing and FPGAs (ReConFig), Publisher: IEEE, ISSN: 2325-6532

This paper presents an area-driven Field-Programmable Gate Array (FPGA) scrubbing technique based on partial reconfiguration for Single Event Upset (SEU) mitigation. The proposed method is compared with existing techniques such as blind and on-demand scrubbing on a novel SEU mitigation framework implemented on the ZYNQ platform, supporting various SEU and scrubbing rates. A design space exploration on the availability versus data transfers from a Double Data Rate Type 3 (DDR3) memory, shows that our approach outperforms blind scrubbing for a range of availability values when a second order polynomial IP is targeted. A comparison to an existing on-demand scrubbing technique based on Dual Modular Redundancy (DMR) shows that our approach saves up to 46% area for the same case study.

Conference paper

Vavouras M, Duarte RP, Armato A, Bouganis C-S et al., 2017, A Hybrid ASIC/FPGA Fault-Tolerant Artificial Pancreas, International Conference on Embedded Computer Systems - Architectures, Modeling and Simulation (SAMOS), Publisher: IEEE, Pages: 261-267

Conference paper

Venieris SI, Bouganis C-S, 2017, fpgaConvNet: Automated Mapping of Convolutional Neural Networks on FPGAs (Abstract Only), International Symposium on Field-Programmable Gate Arrays, Publisher: ACM, Pages: 291-292

Conference paper

Bouganis C, Mingas G, Bottolo L, 2016, Particle MCMC algorithms and architectures for accelerating inference in state-space models, International Journal of Approximate Reasoning, Vol: 83, Pages: 413-433, ISSN: 1873-4731

Particle Markov Chain Monte Carlo (pMCMC) is a stochastic algorithm designed to generate samples from a probability distribution, when the density of the distribution does not admit a closed form expression. pMCMC is most commonly used to sample from the Bayesian posterior distribution in State-Space Models (SSMs), a class of probabilistic models used in numerous scientific applications. Nevertheless, this task is prohibitive when dealing with complex SSMs with massive data, due to the high computational cost of pMCMC and its poor performance when the posterior exhibits multi-modality. This paper aims to address both issues by: 1) Proposing a novel pMCMC algorithm (denoted ppMCMC), which uses multiple Markov chains (instead of the one used by pMCMC) to improve sampling

Journal article

Boikos K, Bouganis C-S, 2016, Semi-dense SLAM on an FPGA SoC, 26th International Conference on Field-Programmable Logic and Applications (FPL), Publisher: IEEE, ISSN: 1946-1488

Deploying advanced Simultaneous Localisation and Mapping, or SLAM, algorithms in autonomous low-power robotics will enable emerging new applications which require an accurate and information rich reconstruction of the environment. This has not been achieved so far because accuracy and dense 3D reconstruction come with a high computational complexity. This paper discusses custom hardware design on a novel platform for embedded SLAM, an FPGA-SoC, combining an embedded CPU and programmable logic on the same chip. The use of programmable logic, tightly integrated with an efficient multicore embedded CPU stands to provide an effective solution to this problem. In this work an average framerate of more than 4 frames/second for a resolution of 320×240 has been achieved with an estimated power of less than 1 Watt for the custom hardware. In comparison to the software-only version, running on a dual-core ARM processor, an acceleration of 2× has been achieved for LSD-SLAM, without any compromise in the quality of the result.

Conference paper

Rabieah MB, Bouganis C-S, 2016, FPGASVM: A Framework for Accelerating Kernelized Support Vector Machine, BigMine-2016, Publisher: JMLR.org, Pages: 68-84

Conference paper

Professor Christos-Savvas Bouganis