Publications
180 results found
Biggs B, Bouganis C-S, Constantinides G, 2023, ATHEENA: a toolflow for hardware early-exit network automation, International Symposium On Field-Programmable Custom Computing Machines, Publisher: IEEE, ISSN: 2576-2621
Venieris SI, Bouganis C-S, Lane ND, 2023, Multiple-Deep Neural Network Accelerators for Next-Generation Artificial Intelligence Systems, COMPUTER, Vol: 56, Pages: 70-79, ISSN: 0018-9162
Yu Z, Bouganis CS, 2023, SVD-NAS: Coupling Low-Rank Approximation and Neural Architecture Search, Proceedings - 2023 IEEE Winter Conference on Applications of Computer Vision, WACV 2023, Pages: 1503-1512
The task of compressing pre-trained Deep Neural Networks has attracted wide interest of the research community due to its great benefits in freeing practitioners from data access requirements. In this domain, low-rank approximation is a promising method, but existing solutions considered a restricted number of design choices and failed to efficiently explore the design space, which lead to severe accuracy degradation and limited compression ratio achieved. To address the above limitations, this work proposes the SVD-NAS framework that couples the domains of low-rank approximation and neural architecture search. SVD-NAS generalises and expands the design choices of previous works by introducing the Low-Rank architecture space, LR-space, which is a more fine-grained design space of low-rank approximation. Afterwards, this work proposes a gradient-descent-based search for efficiently traversing the LR-space. This finer and more thorough exploration of the possible design choices results in improved accuracy as well as reduction in parameters, FLOPS, and latency of a CNN model. Results demonstrate that the SVD-NAS achieves 2.06-12.85pp higher accuracy on ImageNet than state-of-the-art methods under the data-limited problem setting. SVD-NAS is open-sourced at https://github.com/Yu-Zhewen/SVD-NAS.
Xia G, Bouganis CS, 2023, Augmenting Softmax Information for Selective Classification with Out-of-Distribution Data, Pages: 664-680, ISSN: 0302-9743
Detecting out-of-distribution (OOD) data is a task that is receiving an increasing amount of research attention in the domain of deep learning for computer vision. However, the performance of detection methods is generally evaluated on the task in isolation, rather than also considering potential downstream tasks in tandem. In this work, we examine selective classification in the presence of OOD data (SCOD). That is to say, the motivation for detecting OOD samples is to reject them so their impact on the quality of predictions is reduced. We show under this task specification, that existing post-hoc methods perform quite differently compared to when evaluated only on OOD detection. This is because it is no longer an issue to conflate in-distribution (ID) data with OOD data if the ID data is going to be misclassified. However, the conflation within ID data of correct and incorrect predictions becomes undesirable. We also propose a novel method for SCOD, Softmax Information Retaining Combination (SIRC), that augments softmax-based confidence scores with feature-agnostic information such that their ability to identify OOD samples is improved without sacrificing separation between correct and incorrect ID predictions. Experiments on a wide variety of ImageNet-scale datasets and convolutional neural network architectures show that SIRC is able to consistently match or outperform the baseline for SCOD, whilst existing OOD detection methods fail to do so. Code is available at https://github.com/Guoxoug/SIRC.
Venieris SI, Bouganis C-S, Lane ND, 2022, Multi-DNN Accelerators for Next-Generation AI Systems, Publisher: Arxiv
As the use of AI-powered applications widens across multiple domains, so doincrease the computational demands. Primary driver of AI technology are thedeep neural networks (DNNs). When focusing either on cloud-based systems thatserve multiple AI queries from different users each with their own DNN model,or on mobile robots and smartphones employing pipelines of various models orparallel DNNs for the concurrent processing of multi-modal data, the nextgeneration of AI systems will have multi-DNN workloads at their core.Large-scale deployment of AI services and integration across mobile andembedded systems require additional breakthroughs in the computer architecturefront, with processors that can maintain high performance as the number of DNNsincreases while meeting the quality-of-service requirements, giving rise to thetopic of multi-DNN accelerator design.
Xia G, Bouganis C-S, 2022, On the Usefulness of Deep Ensemble Diversity for Out-of-Distribution Detection
The ability to detect Out-of-Distribution (OOD) data is important insafety-critical applications of deep learning. The aim is to separateIn-Distribution (ID) data drawn from the training distribution from OOD datausing a measure of uncertainty extracted from a deep neural network. DeepEnsembles are a well-established method of improving the quality of uncertaintyestimates produced by deep neural networks, and have been shown to havesuperior OOD detection performance compared to single models. An existingintuition in the literature is that the diversity of Deep Ensemble predictionsindicates distributional shift, and so measures of diversity such as MutualInformation (MI) should be used for OOD detection. We show experimentally thatthis intuition is not valid on ImageNet-scale OOD detection -- using MI leadsto 30-40% worse %FPR@95 compared to single-model entropy on some OOD datasets.We suggest an alternative explanation for Deep Ensembles' better OOD detectionperformance -- OOD detection is binary classification and we are ensemblingdiverse classifiers. As such we show that practically, even better OODdetection performance can be achieved for Deep Ensembles by averagingtask-specific detection scores such as Energy over the ensemble.
Ahmadi N, Adiono T, Purwarianti A, et al., 2022, Improved spike-based brain-machine interface using bayesian adaptive kernel smoother and deep learning, IEEE Access, Vol: 10, Pages: 29341-29356, ISSN: 2169-3536
Multiunit activity (MUA) has been proposed to mitigate the robustness issue faced by single-unit activity (SUA)-based brain-machine interfaces (BMIs). Most MUA-based BMIs still employ a binning method for estimating firing rates and linear decoder for decoding behavioural parameters. The limitations of binning and linear decoder lead to suboptimal performance of MUA-based BMIs. To address this issue, we propose a method which consists of Bayesian adaptive kernel smoother (BAKS) as the firing rate estimation algorithm and deep learning, particularly quasi-recurrent neural network (QRNN), as the decoding algorithm. We evaluated the proposed method for reconstructing (offline) hand kinematics from intracortical neural data chronically recorded from the primary motor cortex of two non-human primates. Extensive empirical results across recording sessions and subjects showed that the proposed method consistently outperforms other combinations of firing rate estimation algorithm and decoding algorithm. Overall results suggest the effectiveness of the proposed method for improving the decoding performance of MUA-based BMIs.
Rajagopal A, Bouganis C-S, 2022, Low-Cost On-device Partial Domain Adaptation (LoCO-PDA): Enabling efficient CNN retraining on edge devices
With the increased deployment of Convolutional Neural Networks (CNNs) on edgedevices, the uncertainty of the observed data distribution upon deployment hasled researchers to to utilise large and extensive datasets such as ILSVRC'12 totrain CNNs. Consequently, it is likely that the observed data distribution upondeployment is a subset of the training data distribution. In such cases, notadapting a network to the observed data distribution can cause performancedegradation due to negative transfer and alleviating this is the focus ofPartial Domain Adaptation (PDA). Current works targeting PDA do not focus onperforming the domain adaptation on an edge device, adapting to a changingtarget distribution or reducing the cost of deploying the adapted network. Thiswork proposes a novel PDA methodology that targets all of these directions andopens avenues for on-device PDA. LoCO-PDA adapts a deployed network to theobserved data distribution by enabling it to be retrained on an edge device.Across subsets of the ILSVRC12 dataset, LoCO-PDA improves classificationaccuracy by 3.04pp on average while achieving up to 15.1x reduction inretraining memory consumption and 2.07x improvement in inference latency on theNVIDIA Jetson TX2. The work is open-sourced at \emph{link removed foranonymity}.
Zampokas G, Skartados E, Alexiou D, et al., 2022, WTA/TLA: A UAV-captured Dataset for Semantic Segmentation of Energy Infrastructure, International Conference on Unmanned Aircraft Systems (ICUAS), Publisher: IEEE, Pages: 552-561, ISSN: 2373-6720
Boroumand S, Bouganis C-S, Constantinides GA, 2022, MIDAS: Mutual Information Driven Approximate Synthesis, IEEE-Computer-Society Annual Symposium on VLSI (ISVLSI), Publisher: IEEE, Pages: 50-55
Montgomerie-Corcoran A, Yu Z, Bouganis CS, 2022, SAMO: Optimised Mapping of Convolutional Neural Networks to Streaming Architectures, Proceedings - 2022 32nd International Conference on Field-Programmable Logic and Applications, FPL 2022, Pages: 418-424
Significant effort has been placed on the development of toolflows that map Convolutional Neural Network (CNN) models to Field Programmable Gate Arrays (FPGAs) with the aim of automating the production of high performance designs for a diverse set of applications. However, within these toolflows, the problem of finding an optimal mapping is often overlooked, with the expectation that the end user will tune their generated hardware for their desired platform. This is particularly prominent within Streaming Architecture toolflows, where there is a large design space to be explored. In this work, we establish the framework SAMO: a Streaming Architecture Mapping Optimiser. SAMO exploits the structure of CNN models and the common features that exist in Streaming Architectures, and casts the mapping optimisation problem under a unified methodology. Furthermore, SAMO explicitly explores the re-configurability property of FPGAs, allowing the methodology to overcome mapping limitations imposed by certain toolflows under resource-constrained scenarios, as well as improve on the achievable throughput. Three optimisation methods - Brute-Force, Simulated Annealing and Rule-Based - have been developed in order to generate valid, high performance designs for a range of target platforms and CNN models. Results show that SAMO-optimised designs can achieve 4x-20x better performance compared to existing hand-tuned designs. The SAMO framework is open-source: https://github.com/AlexMontgomerie/samo.
Zampokas G, Bouganis C-S, Tzovaras D, 2022, Pushing the efficiency of StereoNet: exploiting spatial sparsity, 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP) / 17th International Conference on Computer Vision Theory and Applications (VISAPP), Publisher: SCITEPRESS, Pages: 757-766, ISSN: 2184-4321
Current CNN-based stereo matching methods have demonstrated superior performance compared to traditional stereo matching methods. However, mapping these algorithms into embedded devices, which exhibit limited compute resources, and achieving high performance is a challenging task due to the high computational complexity of the CNN-based methods. The recently proposed StereoNet network, achieves disparity estimation with reduced complexity, whereas performance does not greatly deteriorate. Towards pushing this performance to complexity trade-off further, we propose an optimization applied to StereoNet that adapts the computations to the input data, steering the computations to the regions of the input that would benefit from the application of the CNN-based stereo matching algorithm, where the rest of the input is processed by a traditional, less computationally demanding method. Key to the proposed methodology is the introduction of a lightweight CNN that predicts the importance of r efining a region of the input to the quality of the final disparity map, allowing the system to trade-off computational complexity for disparity error on-demand, enabling the application of these methods to embedded systems with real-time requirements.
Rosa LDS, Bouganis C-S, Bonato V, 2021, Non-iterative SDC modulo scheduling for high-level synthesis, Microprocessors and Microsystems, Vol: 86, Pages: 1-13, ISSN: 0141-9331
High-level synthesis is a powerful tool for increasing productivity in digital hardware design. However, as digital systems become larger and more complex, designers have to consider an increased number of optimizations and directives offered by high-level synthesis tools to control the hardware generation process. One of the most explored optimizations is loop pipelining due to its impact on hardware throughput and resources. Nevertheless, the modulo scheduling algorithms used at resource-constrained loop pipelining are computationally expensive, and their application through the whole design space is often non-viable. Current state-of-the-art approaches rely on solving multiple optimization problems in polynomial time, or on solving one optimization problem in exponential time. This work proposes a novel data-flow-based approach, where exactly two optimization problems of polynomial time complexity are solved, leading to significant reductions on computation time for generating a single loop pipeline. Results indicate that, even for complex loops, the proposed method generates high-quality designs, comparable to the ones produced by existing state-of-the-art methods, achieving a reduction on the design-space exploration time by
Ahmadi N, Constandinou T, Bouganis C, 2021, Inferring entire spiking activity from local field potentials, Scientific Reports, Vol: 11, Pages: 1-13, ISSN: 2045-2322
Extracellular recordings are typically analysed by separating them into two distinct signals: local field potentials (LFPs) andspikes. Previous studies have shown that spikes, in the form of single-unit activity (SUA) or multiunit activity (MUA), can beinferred solely from LFPs with moderately good accuracy. SUA and MUA are typically extracted via threshold-based techniquewhich may not be reliable when the recordings exhibit a low signal-to-noise ratio (SNR). Another type of spiking activity, referredto as entire spiking activity (ESA), can be extracted by a threshold-less, fast, and automated technique and has led to betterperformance in several tasks. However, its relationship with the LFPs has not been investigated. In this study, we aim toaddress this issue by inferring ESA from LFPs intracortically recorded from the motor cortex area of three monkeys performingdifferent tasks. Results from long-term recording sessions and across subjects revealed that ESA can be inferred from LFPswith good accuracy. On average, the inference performance of ESA was consistently and significantly higher than those of SUAand MUA. In addition, local motor potential (LMP) was found to be the most predictive feature. The overall results indicate thatLFPs contain substantial information about spiking activity, particularly ESA. This could be useful for understanding LFP-spikerelationship and for the development of LFP-based BMIs.
Martorell X, Alvarez C, Bouganis C-S, et al., 2021, Introduction to the Special Section on FPL 2019, ACM TRANSACTIONS ON RECONFIGURABLE TECHNOLOGY AND SYSTEMS, Vol: 14, ISSN: 1936-7406
Bonato V, Bouganis C-S, 2021, Class-specific early exit design methodology for convolutional neural networks., Applied Soft Computing, Vol: 107, Pages: 1-12, ISSN: 1568-4946
Convolutional Neural Network-based (CNN) inference is a demanding computational task where a longsequence of operations is applied to an input as dictated by the network topology. Optimisationsby data quantisation, data reuse, network pruning, and dedicated hardware architectures have astrong impact on reducing both energy consumption and hardware resource requirements, and onimproving inference latency. Implementing new applications from established models available fromboth academic and industrial worlds is common nowadays. Further optimisations by preserving modelarchitecture have been proposed via early exiting approaches, where additional exit points are includedin order to evaluate classifications of samples that produce feature maps with sufficient evidence tobe classified before reaching the final model exit. This paper proposes a methodology for designingearly-exit networks from a given baseline model aiming to improve the average latency for a targetedsubset class constrained by the original accuracy for all classes. Results demonstrate average timesaving in the order of 2.09× to 8.79× for dataset CIFAR10 and 15.00× to 20.71× for CIFAR100 forbaseline models ResNet-21, ResNet-110, Inceptionv3-159, and DenseNet-121.
Miliadis P, Bouganis C-S, Pnevmatikatos D, 2021, Performance landscape of resource-constrained platforms targeting DNNs
Over the recent years, a significant number of complex, deep neural networkshave been developed for a variety of applications including speech and facerecognition, computer vision in the areas of health-care, automatictranslation, image classification, etc. Moreover, there is an increasing demandin deploying these networks in resource-constrained edge devices. As thecomputational demands of these models keep increasing, pushing to their limitsthe targeted devices, the constant development of new hardware systems tailoredto those workloads has been observed. Since programmability of these diverseand complex platforms -- compounded by the rapid development of new DNN models-- is a major challenge, platform vendors have developed Machine Learningtailored SDKs to maximize the platform's performance. This work investigates the performance achieved on a number of moderncommodity embedded platforms coupled with the vendors' provided softwaresupport when state-of-the-art DNN models from image classification, objectdetection and image segmentation are targeted. The work quantifies the relativelatency gains of the particular embedded platforms and provides insights on therelationship between the required minimum batch size for achieving maximumthroughput, concluding that modern embedded systems reach their maximumperformance even for modest batch sizes when a modern state of the art DNNmodel is targeted. Overall, the presented results provide a guide for theexpected performance for a number of state-of-the-art DNNs on popular embeddedplatforms across the image classification, detection and segmentation domains.
Ahmadi N, Constandinou TG, Bouganis C-S, 2021, Robust and accurate decoding of hand kinematics from entire spiking activity using deep learning, Journal of Neural Engineering, Vol: 18, Pages: 1-23, ISSN: 1741-2552
Objective. Brain–machine interfaces (BMIs) seek to restore lost motor functions in individuals with neurological disorders by enabling them to control external devices directly with their thoughts. This work aims to improve robustness and decoding accuracy that currently become major challenges in the clinical translation of intracortical BMIs. Approach. We propose entire spiking activity (ESA)—an envelope of spiking activity that can be extracted by a simple, threshold-less, and automated technique—as the input signal. We couple ESA with deep learning-based decoding algorithm that uses quasi-recurrent neural network (QRNN) architecture. We evaluate comprehensively the performance of ESA-driven QRNN decoder for decoding hand kinematics from neural signals chronically recorded from the primary motor cortex area of three non-human primates performing different tasks. Main results. Our proposed method yields consistently higher decoding performance than any other combinations of the input signal and decoding algorithm previously reported across long-term recording sessions. It can sustain high decoding performance even when removing spikes from the raw signals, when using the different number of channels, and when using a smaller amount of training data. Significance. Overall results demonstrate exceptionally high decoding accuracy and chronic robustness, which is highly desirable given it is an unresolved challenge in BMIs.
Ahmadi N, Constandinou T, Bouganis C-S, 2021, Impact of referencing scheme on decoding performance of LFP-based brain-machine interface, Journal of Neural Engineering, Vol: 18, ISSN: 1741-2552
OBJECTIVE: There has recently been an increasing interest in local field potential (LFP) for brain-machine interface (BMI) applications due to its desirable properties (signal stability and low bandwidth). LFP is typically recorded with respect to a single unipolar reference which is susceptible to common noise. Several referencing schemes have been proposed to eliminate the common noise, such as bipolar reference, current source density (CSD), and common average reference (CAR). However, to date, there have not been any studies to investigate the impact of these referencing schemes on decoding performance of LFP-based BMIs. APPROACH: To address this issue, we comprehensively examined the impact of different referencing schemes and LFP features on the performance of hand kinematic decoding using a deep learning method. We used LFPs chronically recorded from the motor cortex area of a monkey while performing reaching tasks. MAIN RESULTS: Experimental results revealed that local motor potential (LMP) emerged as the most informative feature regardless of the referencing schemes. Using LMP as the feature, CAR was found to yield consistently better decoding performance than other referencing schemes over long-term recording sessions. Significance Overall, our results suggest the potential use of LMP coupled with CAR for enhancing the decoding performance of LFP-based BMIs.
Boroumand S, Bouganis C, Constantinides G, 2021, Learning Boolean circuits from examples for approximate logic synthesis, 26th Asia and South Pacific Design Automation Conference - ASP-DAC 2021, Publisher: ACM, Pages: 524-529
Many computing applications are inherently error resilient. Thus,it is possible to decrease computing accuracy to achieve greater effi-ciency in area, performance, and/or energy consumption. In recentyears, a slew of automatic techniques for approximate computinghas been proposed; however, most of these techniques require fullknowledge of an exact, or ‘golden’ circuit description. In contrast,there has been significant recent interest in synthesizing computa-tion from examples, a form of supervised learning. In this paper, weexplore the relationship between supervised learning of Booleancircuits and existing work on synthesizing incompletely-specifiedfunctions. We show that when considered through a machine learn-ing lens, the latter work provides a good training accuracy butpoor test accuracy. We contrast this with prior work from the 1990swhich uses mutual information to steer the search process, aimingfor good generalization. By combining this early work with a recentapproach to learning logic functions, we are able to achieve a scal-able and efficient machine learning approach for Boolean circuitsin terms of area/delay/test-error trade-off.
Yu Z, Bouganis C-S, 2021, StreamSVD: Low-rank Approximation and Streaming Accelerator Co-design, 20th International Conference on Field-Programmable Technology (ICFPT), Publisher: IEEE, Pages: 69-77
Rajagopal A, Bouganis C-S, 2021, perf4sight: A toolflow to model CNN training performance on Edge GPUs, 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), Pages: 963-971, ISSN: 2473-9936
- Author Web Link
- Cite
- Citations: 2
Montgomerie-Corcoran A, Bouganis C-S, 2021, POMMEL: Exploring Off-Chip Memory Energy & Power Consumption in Convolutional Neural Network Accelerators, 24th Euromicro Conference on Digital System Design (DSD), Publisher: IEEE COMPUTER SOC, Pages: 442-448
Montgomerie-Corcoran A, Savvas-Bouganis C, 2021, DEF: Differential Encoding of Featuremaps for Low Power Convolutional Neural Network Accelerators, 26th Asia and South Pacific Design Automation Conference (ASP-DAC), Publisher: IEEE, Pages: 703-708, ISSN: 2153-6961
- Author Web Link
- Cite
- Citations: 1
Vink DA, Rajagopal A, Venieris SI, et al., 2020, Caffe barista: brewing caffe with FPGAs in the training loop, Publisher: arXiv
As the complexity of deep learning (DL) models increases, their computerequirements increase accordingly. Deploying a Convolutional Neural Network(CNN) involves two phases: training and inference. With the inference tasktypically taking place on resource-constrained devices, a lot of research hasexplored the field of low-power inference on custom hardware accelerators. Onthe other hand, training is both more compute- and memory-intensive and isprimarily performed on power-hungry GPUs in large-scale data centres. CNNtraining on FPGAs is a nascent field of research. This is primarily due to thelack of tools to easily prototype and deploy various hardware and/oralgorithmic techniques for power-efficient CNN training. This work presentsBarista, an automated toolflow that provides seamless integration of FPGAs intothe training of CNNs within the popular deep learning framework Caffe. To thebest of our knowledge, this is the only tool that allows for such versatile andrapid deployment of hardware and algorithms for the FPGA-based training ofCNNs, providing the necessary infrastructure for further research anddevelopment.
Rajagopal A, Vink DA, Venieris SI, et al., 2020, Multi-Precision Policy Enforced Training (MuPPET): A precision-switching strategy for quantised fixed-point training of CNNs, Publisher: arXiv
Large-scale convolutional neural networks (CNNs) suffer from very longtraining times, spanning from hours to weeks, limiting the productivity andexperimentation of deep learning practitioners. As networks grow in size andcomplexity, training time can be reduced through low-precision datarepresentations and computations. However, in doing so the final accuracysuffers due to the problem of vanishing gradients. Existing state-of-the-artmethods combat this issue by means of a mixed-precision approach utilising twodifferent precision levels, FP32 (32-bit floating-point) and FP16/FP8(16-/8-bit floating-point), leveraging the hardware support of recent GPUarchitectures for FP16 operations to obtain performance gains. This work pushesthe boundary of quantised training by employing a multilevel optimisationapproach that utilises multiple precisions including low-precision fixed-pointrepresentations. The novel training strategy, MuPPET, combines the use ofmultiple number representation regimes together with a precision-switchingmechanism that decides at run time the transition point between precisionregimes. Overall, the proposed strategy tailors the training process to thehardware-level capabilities of the target hardware architecture and yieldsimprovements in training time and energy efficiency compared tostate-of-the-art approaches. Applying MuPPET on the training of AlexNet,ResNet18 and GoogLeNet on ImageNet (ILSVRC12) and targeting an NVIDIA TuringGPU, MuPPET achieves the same accuracy as standard full-precision training withtraining-time speedup of up to 1.84$\times$ and an average speedup of1.58$\times$ across the networks.
Kouris A, Venieris S, Bouganis C-S, 2020, A throughput-latency co-optimised cascade of convolutional neural network classifiers, Design, Automation and Test in Europe Conference (DATE 2020), Publisher: IEEE, Pages: 1656-1661
Convolutional Neural Networks constitute a promi-nent AI model for classification tasks, serving a broad span ofdiverse application domains. To enable their efficient deploymentin real-world tasks, the inherent redundancy of CNNs is fre-quently exploited to eliminate unnecessary computational costs.Driven by the fact that not all inputs require the same amount ofcomputation to drive a confident prediction, multi-precision cas-cade classifiers have been recently introduced. FPGAs comprise apromising platform for the deployment of such input-dependentcomputation models, due to their enhanced customisation ca-pabilities. Current literature, however, is limited to throughput-optimised cascade implementations, employing large batching atthe expense of a substantial latency aggravation prohibiting theirdeployment on real-time scenarios. In this work, we introduce anovel methodology for throughput-latency co-optimised cascadedCNN classification, deployed on a custom FPGA architecturetailored to the target application and deployment platform,with respect to a set of user-specified requirements on accuracyand performance. Our experiments indicate that the proposedapproach achieves comparable throughput gains with relatedstate-of-the-art works, under substantially reduced overhead inlatency, enabling its deployment on latency-sensitive applications.
Rajagopal A, Bouganis C-S, 2020, Now that I can see, I can improve: Enabling data-driven finetuning of CNNs on the edge, Publisher: arXiv
In today's world, a vast amount of data is being generated by edge devicesthat can be used as valuable training data to improve the performance ofmachine learning algorithms in terms of the achieved accuracy or to reduce thecompute requirements of the model. However, due to user data privacy concernsas well as storage and communication bandwidth limitations, this data cannot bemoved from the device to the data centre for further improvement of the modeland subsequent deployment. As such there is a need for increased edgeintelligence, where the deployed models can be fine-tuned on the edge, leadingto improved accuracy and/or reducing the model's workload as well as its memoryand power footprint. In the case of Convolutional Neural Networks (CNNs), boththe weights of the network as well as its topology can be tuned to adapt to thedata that it processes. This paper provides a first step towards enabling CNNfinetuning on an edge device based on structured pruning. It explores theperformance gains and costs of doing so and presents an extensible open-sourceframework that allows the deployment of such approaches on a wide range ofnetwork architectures and devices. The results show that on average, data-awarepruning with retraining can provide 10.2pp increased accuracy over a wide rangeof subsets, networks and pruning levels with a maximum improvement of 42.0ppover pruning and retraining in a manner agnostic to the data being processed bythe network.
Kouris A, Venieris SI, Rizakis M, et al., 2020, Approximate LSTMs for time-constrained inference: enabling fast reaction in self-driving cars., IEEE Consumer Electronics Magazine, Vol: 9, Pages: 11-26, ISSN: 2162-2248
The need to recognize long-term dependencies in sequential data, such as video streams, has made long short-term memory (LSTM) networks a prominent artificial intelligence model for many emerging applications. However, the high computational and memory demands of LSTMs introduce challenges in their deployment on latency-critical systems such as self-driving cars, which are equipped with limited computational resources on-board. In this article, we introduce a progressive inference computing scheme that combines model pruning and computation restructuring leading to the best possible approximation of the result given the available latency budget of the target application. The proposed methodology enables mission-critical systems to make informed decisions even in early stages of the computation, based on approximate LSTM inference, meeting their specifications on safety and robustness. Our experiments on a state-of-the-art driving model for autonomous vehicle navigation demonstrate that the proposed approach can yield outputs with similar quality of result compared to a faithful LSTM baseline, up to 415× faster (198× on average, 76× geo. mean).
Yu Z, Bouganis C-S, 2020, A parameterisable FPGA-tailored architecture for YOLOv3-Tiny, 16th International Symposium, ARC 2020, Publisher: Springer International Publishing, Pages: 330-344, ISSN: 0302-9743
Object detection is the task of detecting the position of objects in an image or video as well as their corresponding class. The current state of the art approach that achieves the highest performance (i.e. fps) without significant penalty in accuracy of detection is the YOLO framework, and more specifically its latest version YOLOv3. When embedded systems are targeted for deployment, YOLOv3-tiny, a lightweight version of YOLOv3, is usually adopted. The presented work is the first to implement a parameterised FPGA-tailored architecture specifically for YOLOv3-tiny. The architecture is optimised for latency-sensitive applications, and is able to be deployed in low-end devices with stringent resource constraints. Experiments demonstrate that when a low-end FPGA device is targeted, the proposed architecture achieves a 290x improvement in latency, compared to the hard core processor of the device, achieving at the same time a reduction in mAP of 2.5 pp (30.9% vs 33.4%) compared to the original model. The presented work opens the way for low-latency object detection on low-end FPGA devices.
This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.