Imperial College London

Professor Christos-Savvas Bouganis

Faculty of Engineering, Department of Electrical and Electronic Engineering

Professor of Intelligent Digital Systems
 
 
 

Contact

 

+44 (0)20 7594 6144 | christos-savvas.bouganis | Website

 
 

Location

 

904, Electrical Engineering, South Kensington Campus



 

Publications


190 results found

Bouganis C-S, Toupas P, Yu Z, Tzovaras D et al., 2024, SMOF: Streaming Modern CNNs on FPGAs with Smart Off-Chip Eviction, IEEE International Symposium on Field-Programmable Custom Computing Machines

Conference paper

Yu Z, Bouganis C-S, 2023, Mixed-TD: efficient neural network accelerator with layer-specific tensor decomposition, 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL), Publisher: IEEE, ISSN: 1946-1488

Neural Network designs are quite diverse, from VGG-style to ResNet-style, and from Convolutional Neural Networks to Transformers. Towards the design of efficient accelerators, many works have adopted a dataflow-based, inter-layer pipelined architecture, with customized hardware for each layer, achieving ultra-high throughput and low latency. The deployment of neural networks to such dataflow architecture accelerators is usually hindered by the available on-chip memory, as it is desirable to preload the weights of neural networks on-chip to maximise the system performance. To address this, networks are usually compressed before deployment through methods such as pruning, quantization and tensor decomposition. In this paper, a framework for mapping CNNs onto FPGAs based on a novel tensor decomposition method called Mixed-TD is proposed. The proposed method applies layer-specific Singular Value Decomposition (SVD) and Canonical Polyadic Decomposition (CPD) in a mixed manner, achieving 1.73× to 10.29× throughput per DSP compared to state-of-the-art CNNs. Our work is open-sourced: https://github.com/Yu-Zhewen/Mixed-TD.

Conference paper
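As a rough illustration of the low-rank idea behind methods such as Mixed-TD (not the paper's actual algorithm), the sketch below factorises a single weight matrix with a truncated SVD; the layer shape and the rank are hypothetical.

import numpy as np

def truncated_svd(weight, rank):
    # Factor an (out x in) weight matrix into two thin matrices so that a
    # forward pass costs rank*(out+in) MACs instead of out*in.
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # (out x rank), singular values folded in
    b = vt[:rank, :]             # (rank x in)
    return a, b

# Hypothetical 256x512 layer approximated with rank 32.
w = np.random.randn(256, 512)
a, b = truncated_svd(w, rank=32)
x = np.random.randn(512)
y_lowrank = a @ (b @ x)          # approximate, cheaper output
y_full = w @ x                   # reference output

Choosing a different decomposition per layer, as the paper does with SVD and CPD, then becomes a search over such rank and decomposition choices.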

Toupas P, Bouganis C-S, Tzovaras D, 2023, fpgaHART: a toolflow for throughput-oriented acceleration of 3D CNNs for HAR onto FPGAs, 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL), Publisher: IEEE, ISSN: 1946-1488

Surveillance systems, autonomous vehicles, human monitoring systems, and video retrieval are just a few of the many applications in which 3D Convolutional Neural Networks are exploited. However, their extensive use is restricted by their high computational and memory requirements, especially when integrated into systems with limited resources. This study proposes a toolflow that optimises the mapping of 3D CNN models for Human Action Recognition onto FPGA devices, taking into account FPGA resources and off-chip memory characteristics. The proposed system employs Synchronous Dataflow (SDF) graphs to model the designs and introduces transformations to expand and explore the design space, resulting in high-throughput designs. A variety of 3D CNN models were evaluated using the proposed toolflow on multiple FPGA devices, demonstrating its potential to deliver competitive performance compared to earlier hand-tuned and model-specific designs.

Conference paper

Toupas P, Bouganis C-S, Tzovaras D, 2023, FMM-X3D: FPGA-based modeling and mapping of X3D for human action recognition, 2023 IEEE 34th International Conference on Application-specific Systems, Architectures and Processors (ASAP), Publisher: IEEE

3D Convolutional Neural Networks are gaining increasing attention from researchers and practitioners and have found applications in many domains, such as surveillance systems, autonomous vehicles, human monitoring systems, and video retrieval. However, their widespread adoption is hindered by their high computational and memory requirements, especially when resource-constrained systems are targeted. This paper addresses the problem of mapping X3D, a state-of-the-art model in Human Action Recognition that achieves an accuracy of 95.5% on the UCF101 benchmark, onto any FPGA device. The proposed toolflow generates an optimised stream-based hardware system, taking into account the available resources and off-chip memory characteristics of the FPGA device. The generated designs push the current performance-accuracy Pareto front further and enable, for the first time, the targeting of such complex model architectures for the Human Action Recognition task.

Conference paper

Montgomerie-Corcoran A, Yu Z, Cheng J, Bouganis C-S et al., 2023, PASS: exploiting post-activation sparsity in streaming architectures for CNN acceleration, 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL), Publisher: IEEE, Pages: 288-293

With the ever-growing popularity of Artificial Intelligence, there is an increasing demand for more performant and efficient underlying hardware. Convolutional Neural Networks (CNNs) are a workload of particular importance, achieving high accuracy in computer vision applications. Inside CNNs, a significant number of the post-activation values are zero, resulting in many redundant computations. Recent works have explored this post-activation sparsity on instruction-based CNN accelerators but not on streaming CNN accelerators, despite the fact that streaming architectures are considered the leading design methodology in terms of performance. In this paper, we highlight the challenges associated with exploiting post-activation sparsity for performance gains in streaming CNN accelerators, and demonstrate our approach to address them. Using a set of modern CNN benchmarks, our streaming sparse accelerators achieve 1.41× to 1.93× efficiency (GOP/s/DSP) compared to state-of-the-art instruction-based sparse accelerators.

Conference paper
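The performance gain described above comes from never performing the multiply-accumulates whose activation operand is zero. A purely functional sketch of this zero-skipping (illustrative only; the paper realises it in streaming hardware, not software) is:

import numpy as np

def sparse_dot(activations, weights):
    # Skip multiply-accumulates whose activation operand is zero,
    # e.g. values zeroed out by a preceding ReLU.
    acc = 0.0
    for a, w in zip(activations, weights):
        if a != 0.0:
            acc += a * w
    return acc

x = np.maximum(np.random.randn(1024), 0.0)   # post-ReLU activations, roughly half zero
w = np.random.randn(1024)
print(sparse_dot(x, w), float(x @ w))        # identical results, fewer MACs performed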

Toupas P, Montgomerie-Corcoran A, Bouganis C-S, Tzovaras D et al., 2023, HARFLOW3D: a latency-oriented 3D-CNN accelerator toolflow for HAR on FPGA devices, 31st IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Publisher: IEEE Computer Society, Pages: 144-154, ISSN: 2576-2613

For Human Action Recognition tasks (HAR), 3D Convolutional Neural Networks have proven to be highly effective, achieving state-of-the-art results. This study introduces a novel streaming architecture-based toolflow for mapping such models onto FPGAs considering the model's inherent characteristics and the features of the targeted FPGA device. The HARFLOW3D toolflow takes as input a 3D CNN in ONNX format and a description of the FPGA characteristics, generating a design that minimises the latency of the computation. The toolflow comprises a number of parts, including (i) a 3D CNN parser, (ii) a performance and resource model, (iii) a scheduling algorithm for executing 3D models on the generated hardware, (iv) a resource-aware optimisation engine tailored for 3D models, (v) an automated mapping to synthesizable code for FPGAs. The ability of the toolflow to support a broad range of models and devices is shown through a number of experiments on various 3D CNN and FPGA system pairs. Furthermore, the toolflow has produced high-performing results for 3D CNN models that have not been mapped to FPGAs before, demonstrating the potential of FPGA-based systems in this space. Overall, HARFLOW3D has demonstrated its ability to deliver competitive latency compared to a range of state-of-the-art hand-tuned approaches, being able to achieve up to 5× better performance compared to some of the existing works. The tool is available at https://github.com/ptoupas/harflow3d.

Conference paper

Biggs B, Bouganis C-S, Constantinides G, 2023, ATHEENA: a toolflow for hardware early-exit network automation, International Symposium On Field-Programmable Custom Computing Machines, Publisher: IEEE, Pages: 121-132, ISSN: 2576-2621

The continued need for improvements in accuracy, throughput, and efficiency of Deep Neural Networks has resulted in a multitude of methods that make the most of custom architectures on FPGAs. These include the creation of hand-crafted networks and the use of quantization and pruning to reduce extraneous network parameters. However, with the potential of static solutions already well exploited, we propose to shift the focus to using the varying difficulty of individual data samples to further improve efficiency and reduce average compute for classification. Input-dependent computation allows for the network to make runtime decisions to finish a task early if the result meets a confidence threshold. Early-Exit network architectures have become an increasingly popular way to implement such behaviour in software. We create A Toolflow for Hardware Early-Exit Network Automation (ATHEENA), an automated FPGA toolflow that leverages the probability of samples exiting early from such networks to scale the resources allocated to different sections of the network. The toolflow uses the data-flow model of fpgaConvNet, extended to support Early-Exit networks as well as Design Space Exploration to optimize the generated streaming architecture hardware with the goal of increasing throughput/reducing area while maintaining accuracy. Experimental results on three different networks demonstrate a throughput increase of 2.00× to 2.78× compared to an optimized baseline network implementation with no early exits. Additionally, the toolflow can achieve a throughput matching the same baseline with as low as 46% of the resources the baseline requires.

Conference paper
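A minimal sketch of the early-exit behaviour that ATHEENA maps to hardware is shown below; the stage and classifier callables and the 0.9 confidence threshold are placeholders rather than anything taken from the toolflow.

import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def early_exit_predict(x, early_stage, exit_head, late_stage, final_head,
                       threshold=0.9):
    # Run the cheap front portion of the network; if the intermediate
    # classifier is confident enough, return immediately, otherwise
    # continue through the remaining (more expensive) layers.
    features = early_stage(x)
    probs = softmax(exit_head(features))
    if probs.max() >= threshold:
        return int(probs.argmax()), True      # exited early
    probs = softmax(final_head(late_stage(features)))
    return int(probs.argmax()), False         # reached the final exit

Because only a fraction of samples proceed past the first exit, the later stages can be allocated proportionally fewer hardware resources, which is the scaling that ATHEENA automates.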

Venieris SI, Bouganis C-S, Lane ND, 2023, Multiple-deep neural network accelerators for next-generation artificial intelligence systems, Computer, Vol: 56, Pages: 70-79, ISSN: 0018-9162

The next generation of artificial intelligence (AI) systems will have multi-deep neural network (multi-DNN) workloads as their core. Large-scale deployment of AI services and integration across mobile devices require additional breakthroughs on the computer architecture front, with processors that can maintain high performance as the number of DNNs increases, giving rise to the topic of multi-DNN accelerator design.

Journal article

Xia G, Bouganis C-S, 2023, Augmenting Softmax information for selective classification with out-of-distribution data, 16th Asian Conference on Computer Vision, Publisher: Springer Nature Switzerland, Pages: 664-680, ISSN: 0302-9743

Detecting out-of-distribution (OOD) data is a task that is receiving an increasing amount of research attention in the domain of deep learning for computer vision. However, the performance of detection methods is generally evaluated on the task in isolation, rather than also considering potential downstream tasks in tandem. In this work, we examine selective classification in the presence of OOD data (SCOD). That is to say, the motivation for detecting OOD samples is to reject them so their impact on the quality of predictions is reduced. We show that, under this task specification, existing post-hoc methods perform quite differently compared to when evaluated only on OOD detection. This is because it is no longer an issue to conflate in-distribution (ID) data with OOD data if the ID data is going to be misclassified. However, the conflation within ID data of correct and incorrect predictions becomes undesirable. We also propose a novel method for SCOD, Softmax Information Retaining Combination (SIRC), that augments softmax-based confidence scores with feature-agnostic information such that their ability to identify OOD samples is improved without sacrificing separation between correct and incorrect ID predictions. Experiments on a wide variety of ImageNet-scale datasets and convolutional neural network architectures show that SIRC is able to consistently match or outperform the baseline for SCOD, whilst existing OOD detection methods fail to do so. Code is available at https://github.com/Guoxoug/SIRC.

Conference paper

Yu Z, Bouganis C-S, 2023, SVD-NAS: coupling low-rank approximation and neural architecture search, 23rd IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Publisher: IEEE, Pages: 1503-1512, ISSN: 2472-6737

The task of compressing pre-trained Deep Neural Networks has attracted wide interest from the research community due to its great benefits in freeing practitioners from data access requirements. In this domain, low-rank approximation is a promising method, but existing solutions considered a restricted number of design choices and failed to efficiently explore the design space, which leads to severe accuracy degradation and limits the achievable compression ratio. To address the above limitations, this work proposes the SVD-NAS framework that couples the domains of low-rank approximation and neural architecture search. SVD-NAS generalises and expands the design choices of previous works by introducing the Low-Rank architecture space, LR-space, which is a more fine-grained design space of low-rank approximation. Afterwards, this work proposes a gradient-descent-based search for efficiently traversing the LR-space. This finer and more thorough exploration of the possible design choices results in improved accuracy as well as reductions in parameters, FLOPs, and latency of a CNN model. Results demonstrate that SVD-NAS achieves 2.06-12.85pp higher accuracy on ImageNet than state-of-the-art methods under the data-limited problem setting. SVD-NAS is open-sourced at https://github.com/Yu-Zhewen/SVD-NAS.

Conference paper

Montgomerie-Corcoran A, Toupas P, Yu Z, Bouganis CS et al., 2023, SATAY: A Streaming Architecture Toolflow for Accelerating YOLO Models on FPGA Devices, Pages: 179-187, ISSN: 2837-0430

AI has led to significant advancements in computer vision and image processing tasks, enabling a wide range of applications in real-life scenarios, from autonomous vehicles to medical imaging. Many of those applications require efficient object detection algorithms and complementary real-time, low-latency hardware to perform inference of these algorithms. The YOLO family of models is considered the most efficient for object detection, having only a single model pass. Despite this, the complexity and size of YOLO models can be too computationally demanding for current edge-based platforms. To address this, we present SATAY: a Streaming Architecture Toolflow for Accelerating YOLO. This work tackles the challenges of deploying state-of-the-art object detection models onto FPGA devices for ultra-low latency applications, enabling real-time, edge-based object detection. We employ a streaming architecture design for our YOLO accelerators, implementing the complete model on-chip in a deeply pipelined fashion. These accelerators are generated using an automated toolflow, and can target a range of suitable FPGA devices. We introduce novel hardware components to support the operations of YOLO models in a dataflow manner, and off-chip memory buffering to address the limited on-chip memory resources. Our toolflow is able to generate accelerator designs which demonstrate performance and energy characteristics competitive with GPU devices, and which outperform current state-of-the-art FPGA accelerators. The code is available at https://github.com/ICIdsl/satay.

Conference paper

Xia G, Bouganis CS, 2023, Window-Based Early-Exit Cascades for Uncertainty Estimation: When Deep Ensembles are More Efficient than Single Models, Pages: 17322-17334, ISSN: 1550-5499

Deep Ensembles are a simple, reliable, and effective method of improving both the predictive performance and uncertainty estimates of deep learning approaches. However, they are widely criticised as being computationally expensive, due to the need to deploy multiple independent models. Recent work has challenged this view, showing that for predictive accuracy, ensembles can be more computationally efficient (at inference) than scaling single models within an architecture family. This is achieved by cascading ensemble members via an early-exit approach. In this work, we investigate extending these efficiency gains to tasks related to uncertainty estimation. As many such tasks, e.g. selective classification, are binary classification, our key novel insight is to only pass samples within a window close to the binary decision boundary to later cascade stages. Experiments on ImageNet-scale data across a number of network architectures and uncertainty tasks show that the proposed window-based early-exit approach is able to achieve a superior uncertainty-computation trade-off compared to scaling single models. For example, a cascaded EfficientNet-B2 ensemble is able to achieve similar coverage at 5% risk as a single EfficientNet-B4 with <30% the number of MACs. We also find that cascades/ensembles give more reliable improvements on OOD data vs scaling models up. Code for this work is available at: https://github.com/Guoxoug/window-early-exit.

Conference paper
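The key idea above, forwarding only samples whose confidence lands near the accept/reject boundary, can be sketched as follows; the models, threshold and window width are hypothetical placeholders, and averaging the two stages is just one simple way to ensemble them.

def cascaded_selective_predict(x, small_model, large_model,
                               threshold=0.8, window=0.15):
    # small_model / large_model are placeholder callables returning a NumPy
    # vector of class probabilities. Samples whose confidence is clearly
    # above or below the selective-classification threshold stop at the
    # first stage; only ambiguous ones pay for the second stage.
    probs = small_model(x)
    confidence = probs.max()
    if abs(confidence - threshold) <= window:
        probs = (probs + large_model(x)) / 2.0   # ensemble the two stages
        confidence = probs.max()
    return int(probs.argmax()), bool(confidence >= threshold)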

Cheng J, Zhang C, Yu Z, Montgomerie-Corcoran A, Xiao C, Bouganis C-S, Zhao Y et al., 2023, Fast Prototyping Next-Generation Accelerators for New ML Models using MASE: ML Accelerator System Exploration, CoRR, Vol: abs/2307.15517

Journal article

Boroumand S, Bouganis C-S, Constantinides GA, 2022, MIDAS: Mutual Information Driven Approximate Synthesis, IEEE Computer Society Annual Symposium on VLSI (ISVLSI), Publisher: IEEE, Pages: 50-55, ISSN: 2159-3477

Applications ranging from the Internet of Things (IoT) to high-performance computing demand energy-efficient hardware for processing and storage. Reducing computation accuracy has shown the potential to achieve high energy efficiency in hardware implementations. In recent years, several automatic approximate logic synthesis techniques have been proposed to build an approximate circuit systematically, trading off accuracy for hardware cost. In this paper, we propose a novel approximate logic synthesis technique that simplifies circuits using mutual information while taking the input distribution into account. Our experimental results show that the proposed methodology demonstrates improvements in terms of area, delay, and error compared to the state-of-the-art.

Conference paper

Zampokas G, Skartados E, Alexiou D, Tsiakas K, Tzanakis I, Roussos N, Giakoumis D, Kostavelis I, Bouganis C-S, Tzovaras D et al., 2022, WTA/TLA: A UAV-captured dataset for semantic segmentation of energy infrastructure, International Conference on Unmanned Aircraft Systems (ICUAS), Publisher: IEEE, Pages: 552-561, ISSN: 2373-6720

Automated inspection of energy infrastructure with Unmanned Aerial Vehicles (UAVs) is becoming increasingly important, exhibiting significant advantages over manual inspection, including improved scalability, cost/time effectiveness, and risk reduction. Although recent technological advancements enabled the collection of an abundance of vision data from UAVs' sensors, significant efforts are still required from experts to manually interpret the collected data and assess the condition of energy infrastructure. Thus, semantic understanding of vision data collected from UAVs during inspection is a critical prerequisite for performing autonomous robotic tasks. However, the lack of labeled data introduces challenges and limitations in evaluating the performance of semantic prediction algorithms. To this end, we release two novel semantic datasets (WTA and TLA) of aerial images captured from power transmission networks and wind turbine farms, collected during real inspection scenarios with UAVs. We also propose modifications to existing state-of-the-art semantic segmentation CNNs to achieve an improved trade-off between accuracy and computational complexity. Qualitative and quantitative experiments demonstrate both the challenging properties of the provided dataset and the effectiveness of the proposed networks in this domain. The dataset is available at: https://github.com/gzamps/wta_tla_dataset.

Conference paper

Ahmadi N, Adiono T, Purwarianti A, Constandinou T, Bouganis C et al., 2022, Improved spike-based brain-machine interface using Bayesian adaptive kernel smoother and deep learning, IEEE Access, Vol: 10, Pages: 29341-29356, ISSN: 2169-3536

Multiunit activity (MUA) has been proposed to mitigate the robustness issue faced by single-unit activity (SUA)-based brain-machine interfaces (BMIs). Most MUA-based BMIs still employ a binning method for estimating firing rates and linear decoder for decoding behavioural parameters. The limitations of binning and linear decoder lead to suboptimal performance of MUA-based BMIs. To address this issue, we propose a method which consists of Bayesian adaptive kernel smoother (BAKS) as the firing rate estimation algorithm and deep learning, particularly quasi-recurrent neural network (QRNN), as the decoding algorithm. We evaluated the proposed method for reconstructing (offline) hand kinematics from intracortical neural data chronically recorded from the primary motor cortex of two non-human primates. Extensive empirical results across recording sessions and subjects showed that the proposed method consistently outperforms other combinations of firing rate estimation algorithm and decoding algorithm. Overall results suggest the effectiveness of the proposed method for improving the decoding performance of MUA-based BMIs.

Journal article

Montgomerie-Corcoran A, Yu Z, Bouganis C-S, 2022, SAMO: optimised mapping of convolutional neural networks to streaming architectures, 32nd International Conference on Field-Programmable Logic and Applications (FPL), Publisher: IEEE, Pages: 418-424, ISSN: 1946-1488

Significant effort has been placed on the development of toolflows that map Convolutional Neural Network (CNN) models to Field Programmable Gate Arrays (FPGAs) with the aim of automating the production of high performance designs for a diverse set of applications. However, within these toolflows, the problem of finding an optimal mapping is often overlooked, with the expectation that the end user will tune their generated hardware for their desired platform. This is particularly prominent within Streaming Architecture toolflows, where there is a large design space to be explored. In this work, we establish the framework SAMO: a Streaming Architecture Mapping Optimiser. SAMO exploits the structure of CNN models and the common features that exist in Streaming Architectures, and casts the mapping optimisation problem under a unified methodology. Furthermore, SAMO explicitly explores the re-configurability property of FPGAs, allowing the methodology to overcome mapping limitations imposed by certain toolflows under resource-constrained scenarios, as well as improve on the achievable throughput. Three optimisation methods - Brute-Force, Simulated Annealing and Rule-Based - have been developed in order to generate valid, high performance designs for a range of target platforms and CNN models. Results show that SAMO-optimised designs can achieve 4x-20x better performance compared to existing hand-tuned designs. The SAMO framework is open-source: https://github.com/AlexMontgomerie/samo.

Conference paper
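Of the three optimisation methods mentioned, simulated annealing is the easiest to illustrate; the sketch below is a generic annealing loop over mapping configurations, with the neighbour and cost functions left as problem-specific placeholders (for example, cost could return negative throughput, or infinity when a mapping exceeds the FPGA's resources).

import math
import random

def simulated_annealing(initial_design, neighbour, cost,
                        t_start=1.0, t_end=1e-3, cooling=0.98):
    # Generic simulated-annealing search: repeatedly mutate the current
    # mapping and accept worse candidates with a probability that shrinks
    # as the temperature decays, keeping track of the best design seen.
    current, current_cost = initial_design, cost(initial_design)
    best, best_cost = current, current_cost
    t = t_start
    while t > t_end:
        candidate = neighbour(current)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        if delta < 0 or random.random() < math.exp(-delta / t):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        t *= cooling
    return best, best_cost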

Zampokas G, Bouganis C-S, Tzovaras D, 2022, Pushing the efficiency of StereoNet: exploiting spatial sparsity, 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP) / 17th International Conference on Computer Vision Theory and Applications (VISAPP), Publisher: SCITEPRESS, Pages: 757-766, ISSN: 2184-4321

Current CNN-based stereo matching methods have demonstrated superior performance compared to traditional stereo matching methods. However, mapping these algorithms onto embedded devices, which exhibit limited compute resources, while achieving high performance is a challenging task due to the high computational complexity of the CNN-based methods. The recently proposed StereoNet network achieves disparity estimation with reduced complexity, while performance does not greatly deteriorate. Towards pushing this performance-complexity trade-off further, we propose an optimization applied to StereoNet that adapts the computations to the input data, steering the computations to the regions of the input that would benefit from the application of the CNN-based stereo matching algorithm, while the rest of the input is processed by a traditional, less computationally demanding method. Key to the proposed methodology is the introduction of a lightweight CNN that predicts the importance of refining a region of the input to the quality of the final disparity map, allowing the system to trade off computational complexity for disparity error on demand, enabling the application of these methods to embedded systems with real-time requirements.

Conference paper

Rajagopal A, Bouganis C-S, 2022, Low-Cost On-device Partial Domain Adaptation (LoCO-PDA): Enabling efficient CNN retraining on edge devices, CoRR, Vol: abs/2203.00772

Journal article

Rajagopal A, Bouganis C-S, 2021, perf4sight: a toolflow to model CNN training performance on Edge GPUs, 18th IEEE/CVF International Conference on Computer Vision (ICCV), Publisher: IEEE COMPUTER SOC, Pages: 963-971, ISSN: 2473-9936

The increased memory and processing capabilities of today’s edge devices create opportunities for greater edge intelligence. In the domain of vision, the ability to adapt a Convolutional Neural Network’s (CNN) structure and parameters to the input data distribution leads to systems with lower memory footprint, latency and power consumption. However, due to the limited compute resources and memory budget on edge devices, it is necessary for the system to be able to predict the latency and memory footprint of the training process in order to identify favourable training configurations of the network topology and device combination for efficient network adaptation. This work proposes perf4sight, an automated methodology for developing accurate models that predict CNN training memory footprint and latency given a target device and network. This enables rapid identification of network topologies that can be retrained on the edge device with low resource consumption. With PyTorch as the framework and NVIDIA Jetson TX2 as the target device, the developed models predict training memory footprint and latency with 95% and 91% accuracy respectively for a wide range of networks, opening the path towards efficient network adaptation on edge GPUs.

Conference paper

Yu Z, Bouganis C-S, 2021, StreamSVD: Low-rank approximation and streaming accelerator co-design, 20th International Conference on Field-Programmable Technology (ICFPT), Publisher: IEEE, Pages: 69-77

The post-training compression of a Convolutional Neural Network (CNN) aims to produce Pareto-optimal designs on the accuracy-performance frontier when the access to training data is not possible. Low-rank approximation is one of the methods that is often utilised in such cases. However, existing work considers the low-rank approximation of the network and the optimisation of the hardware accelerator separately, leading to systems with sub-optimal performance. This work focuses on the efficient mapping of a CNN into an FPGA device, and presents StreamSVD, a model-accelerator co-design framework. The framework considers simultaneously the compression of a CNN model through a hardware-aware low-rank approximation scheme, and the optimisation of the hardware accelerator's architecture by taking into account the approximation scheme's compute structure. Our results show that the co-designed StreamSVD outperforms existing work that utilises similar low-rank approximation schemes by providing a better accuracy-throughput trade-off. The proposed framework also achieves competitive performance compared with other post-training compression methods, even outperforming them under certain cases.

Conference paper

Rosa LDS, Bouganis C-S, Bonato V, 2021, Non-iterative SDC modulo scheduling for high-level synthesis, Microprocessors and Microsystems, Vol: 86, Pages: 1-13, ISSN: 0141-9331

High-level synthesis is a powerful tool for increasing productivity in digital hardware design. However, as digital systems become larger and more complex, designers have to consider an increased number of optimizations and directives offered by high-level synthesis tools to control the hardware generation process. One of the most explored optimizations is loop pipelining, due to its impact on hardware throughput and resources. Nevertheless, the modulo scheduling algorithms used for resource-constrained loop pipelining are computationally expensive, and their application throughout the whole design space is often non-viable. Current state-of-the-art approaches rely on solving multiple optimization problems in polynomial time, or on solving one optimization problem in exponential time. This work proposes a novel data-flow-based approach, where exactly two optimization problems of polynomial time complexity are solved, leading to significant reductions in computation time for generating a single loop pipeline. Results indicate that, even for complex loops, the proposed method generates high-quality designs, comparable to the ones produced by existing state-of-the-art methods, while reducing the design-space exploration time.

Journal article

Ahmadi N, Constandinou T, Bouganis C, 2021, Inferring entire spiking activity from local field potentials, Scientific Reports, Vol: 11, Pages: 1-13, ISSN: 2045-2322

Extracellular recordings are typically analysed by separating them into two distinct signals: local field potentials (LFPs) and spikes. Previous studies have shown that spikes, in the form of single-unit activity (SUA) or multiunit activity (MUA), can be inferred solely from LFPs with moderately good accuracy. SUA and MUA are typically extracted via a threshold-based technique which may not be reliable when the recordings exhibit a low signal-to-noise ratio (SNR). Another type of spiking activity, referred to as entire spiking activity (ESA), can be extracted by a threshold-less, fast, and automated technique and has led to better performance in several tasks. However, its relationship with the LFPs has not been investigated. In this study, we aim to address this issue by inferring ESA from LFPs intracortically recorded from the motor cortex area of three monkeys performing different tasks. Results from long-term recording sessions and across subjects revealed that ESA can be inferred from LFPs with good accuracy. On average, the inference performance of ESA was consistently and significantly higher than those of SUA and MUA. In addition, local motor potential (LMP) was found to be the most predictive feature. The overall results indicate that LFPs contain substantial information about spiking activity, particularly ESA. This could be useful for understanding the LFP-spike relationship and for the development of LFP-based BMIs.

Journal article

Martorell X, Alvarez C, Bouganis C-S, Sourdis I et al., 2021, Introduction to the Special Section on FPL 2019, ACM Transactions on Reconfigurable Technology and Systems, Vol: 14, ISSN: 1936-7406

Journal article

Bonato V, Bouganis C-S, 2021, Class-specific early exit design methodology for convolutional neural networks, Applied Soft Computing, Vol: 107, Pages: 1-12, ISSN: 1568-4946

Convolutional Neural Network-based (CNN) inference is a demanding computational task where a long sequence of operations is applied to an input as dictated by the network topology. Optimisations by data quantisation, data reuse, network pruning, and dedicated hardware architectures have a strong impact on reducing both energy consumption and hardware resource requirements, and on improving inference latency. Implementing new applications from established models available from both academic and industrial worlds is common nowadays. Further optimisations by preserving model architecture have been proposed via early exiting approaches, where additional exit points are included in order to evaluate classifications of samples that produce feature maps with sufficient evidence to be classified before reaching the final model exit. This paper proposes a methodology for designing early-exit networks from a given baseline model aiming to improve the average latency for a targeted subset class constrained by the original accuracy for all classes. Results demonstrate average time saving in the order of 2.09× to 8.79× for dataset CIFAR10 and 15.00× to 20.71× for CIFAR100 for baseline models ResNet-21, ResNet-110, Inceptionv3-159, and DenseNet-121.

Journal article

Miliadis P, Bouganis C-S, Pnevmatikatos D, 2021, Performance landscape of resource-constrained platforms targeting DNNs

Over the recent years, a significant number of complex, deep neural networks have been developed for a variety of applications including speech and face recognition, computer vision in the areas of health-care, automatic translation, image classification, etc. Moreover, there is an increasing demand in deploying these networks in resource-constrained edge devices. As the computational demands of these models keep increasing, pushing to their limits the targeted devices, the constant development of new hardware systems tailored to those workloads has been observed. Since programmability of these diverse and complex platforms -- compounded by the rapid development of new DNN models -- is a major challenge, platform vendors have developed Machine Learning tailored SDKs to maximize the platform's performance. This work investigates the performance achieved on a number of modern commodity embedded platforms coupled with the vendors' provided software support when state-of-the-art DNN models from image classification, object detection and image segmentation are targeted. The work quantifies the relative latency gains of the particular embedded platforms and provides insights on the relationship between the required minimum batch size for achieving maximum throughput, concluding that modern embedded systems reach their maximum performance even for modest batch sizes when a modern state-of-the-art DNN model is targeted. Overall, the presented results provide a guide for the expected performance for a number of state-of-the-art DNNs on popular embedded platforms across the image classification, detection and segmentation domains.

Journal article

Ahmadi N, Constandinou TG, Bouganis C-S, 2021, Robust and accurate decoding of hand kinematics from entire spiking activity using deep learning, Journal of Neural Engineering, Vol: 18, Pages: 1-23, ISSN: 1741-2552

Objective. Brain–machine interfaces (BMIs) seek to restore lost motor functions in individuals with neurological disorders by enabling them to control external devices directly with their thoughts. This work aims to improve robustness and decoding accuracy that currently become major challenges in the clinical translation of intracortical BMIs. Approach. We propose entire spiking activity (ESA)—an envelope of spiking activity that can be extracted by a simple, threshold-less, and automated technique—as the input signal. We couple ESA with deep learning-based decoding algorithm that uses quasi-recurrent neural network (QRNN) architecture. We evaluate comprehensively the performance of ESA-driven QRNN decoder for decoding hand kinematics from neural signals chronically recorded from the primary motor cortex area of three non-human primates performing different tasks. Main results. Our proposed method yields consistently higher decoding performance than any other combinations of the input signal and decoding algorithm previously reported across long-term recording sessions. It can sustain high decoding performance even when removing spikes from the raw signals, when using the different number of channels, and when using a smaller amount of training data. Significance. Overall results demonstrate exceptionally high decoding accuracy and chronic robustness, which is highly desirable given it is an unresolved challenge in BMIs.

Journal article

Ahmadi N, Constandinou T, Bouganis C-S, 2021, Impact of referencing scheme on decoding performance of LFP-based brain-machine interface, Journal of Neural Engineering, Vol: 18, ISSN: 1741-2552

OBJECTIVE: There has recently been an increasing interest in local field potential (LFP) for brain-machine interface (BMI) applications due to its desirable properties (signal stability and low bandwidth). LFP is typically recorded with respect to a single unipolar reference which is susceptible to common noise. Several referencing schemes have been proposed to eliminate the common noise, such as bipolar reference, current source density (CSD), and common average reference (CAR). However, to date, there have not been any studies to investigate the impact of these referencing schemes on decoding performance of LFP-based BMIs. APPROACH: To address this issue, we comprehensively examined the impact of different referencing schemes and LFP features on the performance of hand kinematic decoding using a deep learning method. We used LFPs chronically recorded from the motor cortex area of a monkey while performing reaching tasks. MAIN RESULTS: Experimental results revealed that local motor potential (LMP) emerged as the most informative feature regardless of the referencing schemes. Using LMP as the feature, CAR was found to yield consistently better decoding performance than other referencing schemes over long-term recording sessions. SIGNIFICANCE: Overall, our results suggest the potential use of LMP coupled with CAR for enhancing the decoding performance of LFP-based BMIs.

Journal article
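Of the referencing schemes compared above, common average referencing (CAR) is simple to state: the across-channel mean is subtracted from every channel at each time sample, suppressing noise common to all electrodes. A minimal sketch (the array shapes are hypothetical) is:

import numpy as np

def common_average_reference(lfp):
    # lfp: (channels, samples) array. Subtract the mean over channels at
    # each time sample, removing noise shared by all electrodes.
    return lfp - lfp.mean(axis=0, keepdims=True)

# Hypothetical recording: 96 channels, 10 s at 1 kHz.
lfp = np.random.randn(96, 10_000)
lfp_car = common_average_reference(lfp)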

Boroumand S, Bouganis C, Constantinides G, 2021, Learning Boolean circuits from examples for approximate logic synthesis, 26th Asia and South Pacific Design Automation Conference - ASP-DAC 2021, Publisher: ACM, Pages: 524-529

Many computing applications are inherently error resilient. Thus, it is possible to decrease computing accuracy to achieve greater efficiency in area, performance, and/or energy consumption. In recent years, a slew of automatic techniques for approximate computing has been proposed; however, most of these techniques require full knowledge of an exact, or ‘golden’ circuit description. In contrast, there has been significant recent interest in synthesizing computation from examples, a form of supervised learning. In this paper, we explore the relationship between supervised learning of Boolean circuits and existing work on synthesizing incompletely-specified functions. We show that when considered through a machine learning lens, the latter work provides a good training accuracy but poor test accuracy. We contrast this with prior work from the 1990s which uses mutual information to steer the search process, aiming for good generalization. By combining this early work with a recent approach to learning logic functions, we are able to achieve a scalable and efficient machine learning approach for Boolean circuits in terms of area/delay/test-error trade-off.

Conference paper
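The mutual-information signal referred to above can be estimated directly from training examples; the sketch below computes the empirical mutual information between one circuit input and the target output, the kind of score that can steer the search towards functions that generalise (it is an illustration, not the paper's synthesis flow).

import math
from collections import Counter

def mutual_information(xs, ys):
    # Empirical mutual information I(X;Y) in bits between two equal-length
    # sequences of binary values, e.g. one input bit and the target output
    # over a set of training examples.
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Hypothetical examples: the output mostly follows the input bit.
xs = [0, 0, 1, 1, 0, 1, 0, 1]
ys = [0, 0, 1, 1, 0, 1, 1, 1]
print(mutual_information(xs, ys))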

