Computational Machine Learning
Our vision is to bring Machine Learning algorithms closer to every–day life, enabling as such humans and robots to perform their tasks in a most efficient way. Towards this vision, our aim is to design systems that are able to apply knowledge that have been programmed with on the field (inference stage), but also to design systems that are able to acquire new knowledge as it becomes available (training stage) being able as such to adapt to environment and handle previously unseen inputs.
As such, our group performs research on mapping Machine Learning algorithms into digital hardware aiming at high performance (embedded) systems with reduced power consumption. Our work takes a holistic approach to the problems, by coupling the design of the machine learning algorithm with the design of the hardware systems. By exposing the hardware characteristics to the design of the Machine Algorithm, and vice versa, significant gains in performance and power consumption are obtained, making Machine Learning and AI deployable to real-life scenarios.
CascadeCNN: Pushing the performance limits of quantisation in CNNs
CascadeCNN is an automated toolflow that pushes the quantisation limits of any given CNN model, aiming to perform high-throughput inference by exploiting the computation time-accuracy trade-off. A two-stage architecture tailored for any given CNN-FPGA pair is generated, consisting of a low- and high-precision unit in a cascade. A confidence evaluation unit between them is employed to identify misclassified cases from the excessively low-precision unit and forward them to the high-precision unit for re-processing.
Experiments demonstrate that the proposed toolflow can achieve a performance boost up to 55% for VGG-16 and 48% for AlexNet over the baseline design for the same resource budget and accuracy, without the need of retraining the model or accessing the training data, enabling extreme quantisation in CNNs under privacy-constrained settings.
f-CNNx : Enabling emerging multi-CNN applications
In the construction of complex AI systems, CNN models are used as building blocks of a larger system. In this respect, multi-CNN systems have emerged, employing several models, each one trained for a different subtask. Nevertheless, deploying multiple models on a target platform poses a number of challenges. From a resource allocation perspective, with each model targeting a different task, the performance constraints, such as required throughput and latency, vary accordingly. Instead of being model-agnostic, this property requires the design of an architecture that captures and reflects the performance requirements of each model. Moreover, in resource-constrained setups, multiple CNNs compete for the same pool of resources and hence resource allocation between models becomes a critical factor. f-CNNx is a toolflow which addresses the mapping of multiple CNNs on a target FPGA platform while meeting the required performance for each model. Starting from a set of pretrained models, f-CNNx explores a wide range of resource and bandwidth allocations and incorporates the application-level importance of each model by means of multiobjective cost functions to guide the design space exploration to the optimum hardware design.
Landscape of CNN-to-FPGA toolflows and future directions
Recently, FPGAs have emerged as a potential platform that can reduce significantly power consumption cost while meeting the compute requirements of modern deep learning systems. In this project, we survey the landscape of toolflows which automate the mapping of the CNN inference stage to FPGA-based platforms. From a deep learning scientist perspective, we conduct a study over the supported deep learning models, including DNNs, CNNs and LSTMs, the achieved processing speed and the applicability of each toolflow on specific deep learning applications, from latency-critical mobile systems to high-throughput cloud services. From a computer engineering perspective, we present a detailed analysis of the architectural choices, design space exploration methods and implementation optimisations of the existing tools, together with insightful discussions. The existing FPGA tooflows have employed ad-hoc evaluation methodologies, by targeting different CNN models and reporting the achieved performance in a non-uniform manner. A uniform evaluation methodology is proposed in order to enable the thorough and comparative evaluation of CNN-to-FPGA toolflows (see more).
With an eye to the future of FPGA-based deep learning, we present a set of promising areas of research that can bridge the gap between FPGAs and deep learning practitioners and enable FPGAs to provide the necessary compute infrastructure that can drive future deep learning algorithmic innovations.
Approximate FPGA-based LSTMs
Long Short-Term Memory (LSTM) networks have demonstrated state-of-the-art accuracy in several pattern recognition tasks, with the prominence of Natural Language Processing (NLP) and speech recognition. Nevertheless, as LSTM models increase in complexity and sophistication, so do their computational and memory requirements. Emerging latency-sensitive applications including mobile robots and autonomous vehicles often operate under stringent computation time constraints, where a decision has to be made in real-time. To address this problem, we developed an approximate computing scheme which enables LSTMs to increase their accuracy as a function of the computation time budget, together with an FPGA-based architecture for the high-performance deployment of the approximate LSTMs. By targeting the real-life Neural Image Caption (NIC) model developed by Google, the proposed framework requires 6.65x less time to achieve the same application-level accuracy compared to a baseline implementation, while achieving an average of 25x higher accuracy under the same computation time constraints. This is the first work in the literature to address the deployment of LSTMs under computation time constraints.
Latency-Driven Design for FPGA-based CNNs
The majority of existing CNN implementations, targeting CPUs, GPUs and FPGAs, are optimised with high throughput as the primary objective. Emerging new AI systems, from self-driving cars and UAVs to low response-time, cloud-based analytics services, require the very low-latency execution of several CNN-based tasks without the processing of inputs in batches. To meet these requirements, we place latency at the centre of optimisation and generate latency-optimised hardware designs for the target CNN-FPGA pairs. This is achieved by introducing a latency-driven methodology, which enables the high-performance execution of CNNs without the need for batch processing. The developed approach enables the expansion of the architectural design space, to meet the performance needs of modern latency-sensitive applications and delivers up to 73.54x and 5.61x latency improvements over throughput-optimised designs on AlexNet and VGG-16 respectively.
fpgaConvNet: a framework for mapping Convolutional Neural Networks on FPGAs
Convolutional Neural Networks (ConvNets/CNNs) are a powerful Deep Learning model which has demonstrated state-of-the-art accuracy in numerous AI tasks, from object detections to neural image captioning. In this context, FPGAs constitute a promising platform for the deployment of ConvNets which can satisfy the demanding performance needs and power constraints posed by emerging applications. Nevertheless, the effective mapping of ConvNets on FPGAs requires Deep Learning practitioners to have expertise in hardware design and familiarity with the esoteric FPGA development toolchains, and therefore poses a significant barrier. fpgaConvNet is a framework that automates the mapping of ConvNets onto reconfigurable FPGA-based platforms. Starting from a high-level description of a ConvNet model, fpgaConvNet considers both the input model's workload and the application-level performance needs, including throughput, latency and multiobjective criteria, in order to generate optimised streaming accelerators tailored for the target FPGA. Read more
Support Vector Machines
Support Vector Machines (SVMs) is a popular supervised learning method, providing state-of-the-art accuracy in various classification tasks. However, SVM training is a time-consuming task for large-scale problems and prohibits its deployment to situations where the model needs to be constantly refined. In our work, we are addressing the problem of accelerating the training stage of the SVM in order to be able to address large databases, or to address problems with close to real-time constraints (adaptive systems). Towards this direction, we have developed an FPGA-based system based on ensemble learning capable of achieving significant acceleration factors compared to state of the art CPU and GPU based solutions. We have also done work on the classification stage where we have introduced a cascade classifier based on the word-length of the data.
A Random Forest (RF) is an ensemble of decision tree classifiers and it is one of the most widely used supervised learning tools available. In various empirical testings, it demonstrates significant improvement in generalization error over single decision and regression trees by alleviating the overfitting problem and features a fast and robust performance of learning especially for data with noise and missing information. Our group has focused on the acceleration of the training stage of a RF using FPGAs. We have developed a system that can handle large-scale problems by working on them in a batch mode, taking advantage of the large internal memory bandwidth of modern FPGA devices. The performance of the resulting architecture implemented on a modest FPGA is able to achieve a training speed that is comparable to a modest 448-cores GPU implementation and compares favorably with a Python-based implementation running on a 6-cores 2.3GHz CPU. In all cases the proposed architecture features superior performance/power-consumption ratio due to inherent low power consumption in FPGA.
Bayesian inference is a powerful modeling framework which treats all the unknowns in a system as random variables and makes use of probability theory to infer the distributions of the parameters given the observed data. Inferring the unknown distributions of the parameters (i.e. the posterior distribution) requires combining information about the model which comes from the observed data, expressed by the likelihood function, with information coming from some form of expert or prior knowledge about the parameters (expressed by the prior distribution). In the era of Big data and energy-aware systems, my group focuses on the acceleration and performance per energy unit optimization of the main tools in the field such as MCMC and SMC using FPGAs and GPUs. Examples of application can be found in localization of mobile robots and genome analysis.