Image Denoising for Retrieval and Localisation

With the rise of deep learning, end-to-end pose regression models are driving the community towards a new paradigm. Following absolute pose regression, relative pose regression combined with image retrieval achieved higher accuracy, and was recently improved further by a novel method that cooperates with view synthesis, leading to better performance. In this project, we aim to enhance that approach by improving the synthetic images used in model training. An image inpainting network and a refinement network are proposed to fill hole regions and remove artefacts in synthetic views. The two networks form a module that is integrated into the relative pose estimation pipeline for performance evaluation.
The work indicates that improving relative pose regression via refined synthetic images is challenging. In the experiments, different strategies are adopted to identify these challenges and to explore the relationship between the pose regression model and its input images. The analysis also helps to understand the limitations of relative pose regression and to support future work.
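As a rough illustration of the hole-filling step, the minimal sketch below uses classical OpenCV inpainting to fill masked hole regions in a synthetic view. It stands in for the learned inpainting network proposed above and is not the project's method; the file names are hypothetical.

```python
import cv2

# Load a synthetic view and a binary mask marking its hole regions
# (hypothetical files; in the project the holes come from view synthesis).
view = cv2.imread("synthetic_view.png")
mask = cv2.imread("hole_mask.png", cv2.IMREAD_GRAYSCALE)

# Classical inpainting fills the masked holes from surrounding pixels;
# the project replaces this step with a learned inpainting network.
filled = cv2.inpaint(view, mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)
cv2.imwrite("synthetic_view_filled.png", filled)
```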

Extracting Object Properties from Online Resources

The rapid development of robotic research has benefited from domain-specific benchmarks. In this paper, we present a deep learning model based on a pre-trained language representation model. Commonly used text extraction methods, e.g., regular expressions, lack semantic comprehension of text and therefore risk low accuracy when faced with large chunks of text. Our model, in contrast, accurately extracts the physical properties of objects from text. It can simultaneously extract multiple properties from unlabelled product descriptions and further generate textual databases intended for use in training robotic manipulation tasks. For numerical properties in particular, which tend to produce high percentage errors, we apply a dedicated pre-processing method so that the network can treat all properties as classification tasks, avoiding the awkward situation where the network must handle classification and regression tasks simultaneously. Moreover, the model's output is flexible, allowing the network to be extended when more properties are required in the future. Meanwhile, we also provide a highly extendable, cleansed database collected from online resources. It includes textual and visual product descriptions of daily objects and frequently used properties that provide the information needed for robotic applications, and it serves as a template that can be quickly extended through our model. We believe our model can be combined with 3D printing to provide a large-scale benchmark for the robotics community, making physical objects available to researchers around the world in the near future.
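The kind of pre-processing that lets a numerical property be treated as a classification task is discretisation into bins. The sketch below shows the general idea, assuming NumPy; the property (mass) and the bin edges are hypothetical, not the paper's actual scheme.

```python
import numpy as np

def discretise(values, bin_edges):
    """Map continuous property values (e.g. mass in grams) to class indices,
    so the network can treat a numerical property as a classification task."""
    return np.digitize(values, bin_edges)

# Hypothetical mass bins (grams): <50, 50-200, 200-500, >=500 -> classes 0..3
mass_bins = [50, 200, 500]
masses = np.array([30.0, 120.0, 450.0, 900.0])
print(discretise(masses, mass_bins))  # -> [0 1 2 3]
```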

Modelling Smartphone Properties for Robotic Grasping

With rapid technological development, the replacement frequency of smartphones has increased considerably, and an efficient method to recycle the growing number of discarded smartphones has consequently become a hot topic. Currently, many robots help disassemble smartphones efficiently during recycling, but prior knowledge such as smartphone properties is needed for robotic grasping. Assigning properties manually is tedious and cannot meet the huge recycling demand. Thus, automatic recognition of smartphone properties has become a key factor in speeding up the recycling process.
In view of these issues, this project constructs a database of smartphones serving as a benchmark for robotic grasping and implements an approach to extract smartphone properties from visual data based on convolutional neural networks (CNNs). The database contains images of 20 smartphone classes along with their model name, dimensions, cable port location and button locations. Data augmentation techniques are used to increase the size and diversity of the data, while slight resolution variation and partial occlusion are applied to generate new data, allowing the benchmark to be extended for other analyses. Moreover, a custom model and VGG16 are employed with different training strategies. Finally, with these techniques and models, a set of experiments is performed on the presented benchmark to determine the most effective approach to smartphone property recognition. For model name recognition, the results show that a pretrained VGG16 fine-tuned on higher-resolution back-view images augmented by a fusion of techniques achieves 94.50% test accuracy, while the same approach on a dataset including both back-view and front-view images reaches 81.63%. For recognising front versus back and broken versus intact, the best results are both above 98%. In addition, these results expose some latent knowledge about smartphone property recognition.
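A minimal sketch of the fine-tuning setup is shown below, assuming PyTorch/torchvision (the report does not specify the framework). Freezing the convolutional features is only one of several possible training strategies; the learning rate and class count are illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 20  # smartphone model classes in the presented benchmark

# Start from an ImageNet-pretrained VGG16 and replace the final classifier layer.
model = models.vgg16(pretrained=True)
model.classifier[6] = nn.Linear(model.classifier[6].in_features, NUM_CLASSES)

# Optionally freeze the convolutional features and fine-tune only the classifier,
# one of the training strategies compared in the project.
for p in model.features.parameters():
    p.requires_grad = False

optimiser = torch.optim.Adam(
    filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4)
criterion = nn.CrossEntropyLoss()
```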

Modeling properties of articulated objects for robotic manipulations

Robotic grasping has long been a topic of interest for researchers, but only in recent years have machine learning methods begun to enter this field. The goal of this project is to establish a knowledge system for the robot by using text descriptions of objects to model object properties in advance, in order to improve the decision-making and control strategy of robotic grasping.
However, because different experiments use different datasets, the resulting models lack generalisation capability, making them unusable in practical applications; this fragmentation of models and feature representations hampers progress in modelling. To address this problem, this experiment trains a text classifier with generalisation ability that can accept large collections of object descriptions in natural language.
The model trained in this experiment achieves a test accuracy of over 50% on data outside the training set. The model using BERT feature representations performs best, with a test accuracy of around 70%.
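A minimal sketch of BERT-based text classification of object descriptions is shown below, assuming the Hugging Face transformers library; the label set is hypothetical and the classification head here is untrained.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Hypothetical label set for object-description classification.
labels = ["rigid", "articulated", "deformable"]

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels))

text = "A wooden cabinet with two hinged doors and a sliding drawer."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    logits = model(**inputs).logits
print(labels[logits.argmax(dim=-1).item()])
```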

Person Re-identification from Drones 

There has been growing interest in drone applications, and many computer vision tasks have been specifically adapted to drone scenarios, such as SLAM, object detection and depth estimation. Person re-identification is one of the tasks that can be effectively performed from drones, and new datasets specifically geared towards aerial person imagery are emerging. In addition to the problems found in almost every person re-ID dataset, the most significant differences from static CCTV re-ID are the very different human pose seen from above and the similar appearance of different people, along with motion blur, lighting conditions, low resolution and occlusions. To address these problems, this project combines a Part-based Convolutional Baseline (PCB), which exploits local features, with an adaptive weight distribution strategy, which assigns different weights to similar and dissimilar samples. The results show that our method outperforms the state of the art by a large margin. In addition, we propose a re-ranking method which aggregates Expanded Cross Neighborhood (ECN) distance and Jaccard distance to compute the final ranking. Compared to existing methods, our re-ranking achieves a 3% improvement in mAP and rank-1 accuracy.

Deep Segmentation and Registration in X-Ray Angiography Video

Quantitative Coronary Angiography (QCA) and Percutaneous Coronary Intervention (PCI) are standard procedures to diagnose and restore blood circulation in the cardiovascular system. X-ray angiography is the most popular imaging modality for visualising blood vessels, whether for interventional purposes such as stenting of stenosed vessels or for diagnostic purposes such as assessment of myocardial perfusion or stenosis grading. In diagnosis, the main object of interest is the vascular tree, its branchings and its variations in thickness. It is therefore necessary to accurately highlight the vessel tree in consecutive frames, combating noise and low contrast, and to discriminate between instruments, vessels and other anatomical structures of similar appearance. We present a real-time pipeline based on Convolutional Neural Networks for segmenting instruments and the vessel tree, and for classifying instruments and vessels into different classes. Using CNNs for segmentation requires a large number of ground truth segmentations, which are not available since they must be generated by experts on a large number of frames; we must also combat the noise and low contrast that may lead to inaccurate feature detection. We employ a Digital Subtraction Angiography-like process to automatically generate a large number of initial segmentations that are used to train a CNN. Although close to ground truth, this automatic process may introduce errors, so we fine-tune the network using a relatively small number of hand-generated labels, producing a network capable of generating accurate segmentations. To discriminate between vessels and catheters, we track the masks generated in the first few frames of a sequence, which contain only the catheter. Tracking is based on optical flow computation rather than on estimating non-rigid transformations or feature tracking. Optical flow over the whole image is shown to be quite effective for neighbouring frames while degrading for long-range prediction; however, for catheters and wires it is very robust, since their pixels have better contrast and are less influenced by noise. To improve tracking, the segmentations generated by the CNN are also used.
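To illustrate the tracking stage, here is a minimal sketch of propagating a catheter mask between neighbouring frames with dense optical flow, assuming OpenCV and single-channel grayscale frames; the exact flow method used in the project is not specified here.

```python
import cv2
import numpy as np

def propagate_mask(prev_frame, next_frame, mask):
    """Warp a binary catheter mask from one grayscale frame to the next
    using dense optical flow, as in the tracking stage described above."""
    # Flow from the next frame back to the previous one (backward flow),
    # so each next-frame pixel can sample the previous mask directly.
    flow = cv2.calcOpticalFlowFarneback(
        next_frame, prev_frame, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    h, w = mask.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(mask, map_x, map_y, cv2.INTER_NEAREST)
```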

Cycle Power Prediction Using Machine Learning

With advances in modern technology, the appetite for data analytics has rapidly propagated throughout the sport and fitness market. What used to be desirable only to scientists and professional athletes is now in demand by amateur and hobby athletes wanting to track and improve performance. This renaissance has already occurred in sport and fitness activities such as running and golf, but the present market for cycle sensors is geared towards professional cyclists, with complex tools and high price-point products. This project explores the implementation of deep learning techniques and regression analysis to estimate power from cycling trips. The experiments will look at 1) finding the most valuable data for power prediction, 2) predicting a rider's cycle power from low-cost sensor data, using existing APIs to supplement missing features, and 3) real-time power prediction. The performance of the model will be compared to power meter data as well as comparable power-calculation tools available on the market. The findings from these experiments could be incorporated into a product that makes data analytics previously available only to professional cyclists accessible and affordable to the wider cycling market.
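As a sketch of the regression-analysis side, the snippet below fits a regressor to sensor features against power-meter ground truth, assuming scikit-learn. The feature set, file names and model choice are assumptions for illustration, not the project's chosen pipeline.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Hypothetical feature columns from low-cost sensors / ride APIs:
# speed (m/s), cadence (rpm), gradient (%), heart rate (bpm), rider mass (kg).
X = np.load("ride_features.npy")   # shape (n_samples, 5); hypothetical file
y = np.load("ride_power.npy")      # power-meter watts used as ground truth

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = GradientBoostingRegressor().fit(X_train, y_train)
print("MAE (watts):", mean_absolute_error(y_test, model.predict(X_test)))
```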

Tools for mining electronic component information

One of the most important steps in electronics design is choosing components, as they essentially define the function of the circuit. This process is extremely time consuming, as an engineer has to optimise their component choices based on heuristics and unstructured data deeply embedded in component datasheets. It is also the part of the electronics design process where Electronics Design Automation (EDA) tools are of little help: at most they provide component symbols and footprints. Hence there has been a long-standing interest in creating representations of components which allow computers to automatically choose, or in more advanced cases place and connect, electronic components, significantly reducing design time. This project aims to create a framework and a set of tools that mine large collections of schematics for useful connectivity information. This will help engineers significantly reduce the time and effort they spend on choosing components, and paves the way towards automatically picking and connecting components from a large database. Information extraction from schematics is largely enabled by modern developments in the field of graph neural networks, on which the project heavily relies. The project will also produce a tool that converts schematic diagrams into digital schematic representations, unlocking tremendous potential for creating a circuit template database.
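The project's actual GNN architecture is not described here, so the following is only a generic single message-passing layer in plain PyTorch, with hypothetical toy connectivity, to show the kind of neighbourhood aggregation such schematic-mining tools build on.

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """One round of neighbourhood aggregation over a schematic graph, where
    nodes are components/pins and adj is a normalised adjacency matrix."""
    def __init__(self, dim):
        super().__init__()
        self.self_lin = nn.Linear(dim, dim)
        self.neigh_lin = nn.Linear(dim, dim)

    def forward(self, h, adj):
        # Combine each node's own features with the mean of its neighbours'.
        return torch.relu(self.self_lin(h) + self.neigh_lin(adj @ h))

# Toy graph: 4 nodes, 8-dimensional features, hypothetical connectivity.
h = torch.randn(4, 8)
adj = torch.tensor([[0, 1, 0, 0], [1, 0, 1, 1],
                    [0, 1, 0, 0], [0, 1, 0, 0]], dtype=torch.float)
adj = adj / adj.sum(dim=1, keepdim=True)  # row-normalise for mean aggregation
layer = MessagePassingLayer(8)
print(layer(h, adj).shape)  # torch.Size([4, 8])
```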

Domain Adversarial Training for Infrared-colour Person Re-Identification

Person re-identification (re-ID) is a very active area of research in computer vision, due to the role it plays in video surveillance. Currently, most methods only address the task of matching between colour images. However, in poorly-lit environments CCTV cameras switch to infrared imaging, hence developing a system which can correctly perform matching between infrared and colour images is a necessity. In this paper, we propose a part-feature extraction network to better focus on subtle, unique signatures on the person which are visible across both infrared and colour modalities. To train the model we propose a novel variant of the domain adversarial feature-learning framework. Through extensive experimentation, we show that our approach outperforms state-of-the-art methods.
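One common way to implement the adversarial component of domain adversarial feature learning is a gradient reversal layer, sketched below in PyTorch. This is a generic illustration, not the paper's novel variant; the lambda value and feature dimensions are illustrative.

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; negates (and scales) gradients in the
    backward pass, so the feature extractor learns to fool the modality
    discriminator, the core idea of domain adversarial training."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage: features -> grad_reverse -> domain (infrared vs colour) classifier.
feats = torch.randn(8, 256, requires_grad=True)
domain_logits = torch.nn.Linear(256, 2)(grad_reverse(feats, lam=0.5))
```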

Attribute-Image Person Re-Identification

Although person re-identification (person re-ID) has become a popular topic and received many inspiring solutions, attribute-image person re-ID is still an open research domain demanding further study. Attribute-image person re-ID aims to take text attributes as queries to retrieve person images from a gallery database. Among existing methods, text attribute aggregation and visual feature decomposition (TAVD) realises attribute-image person matching in both a global-level and a local-level bi-modal embedding space, and hence yields improved performance compared to many other methods. However, it uses a batch-hard bi-triplet loss as the global-level criterion, which can lead to high intra-class variation and bad local minima. We therefore take TAVD as our baseline framework and aim to improve its performance through its loss functions. We review the related literature and adopt three novel methods: compact triplet loss, fast approximated triplet loss and hybrid similarity measure. Since these three approaches only consider single-modality settings, we modify them to fit attribute-image cross-modal cases and experimentally verify their effectiveness in TAVD systems. Based on the experimental results and a study of their strengths and weaknesses, we propose a fused method named compact bi-triplet loss, which exploits the strength of the batch-hard bi-triplet loss and introduces a point-to-centroid clustering term to reduce the intra-class distance among samples of the same person class. Moreover, two different protocols for computing person class centroids are employed. We finally validate empirically that our method enhances the performance of the TAVD framework on the Market-1501 benchmark.
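The exact compact bi-triplet loss is cross-modal and defined in the report; the sketch below shows its two ingredients in a single-modality PyTorch form, a batch-hard triplet term plus a point-to-centroid clustering term, with illustrative margin and weighting.

```python
import torch
import torch.nn.functional as F

def compact_triplet_style_loss(feats, labels, margin=0.3, beta=0.1):
    """Batch-hard triplet loss plus a point-to-centroid term that pulls each
    sample towards the centroid of its own identity, reducing intra-class
    variation. A single-modality sketch of the idea described above."""
    dist = torch.cdist(feats, feats)                      # pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)

    hardest_pos = (dist * same.float()).max(dim=1).values
    hardest_neg = dist.masked_fill(same, float("inf")).min(dim=1).values
    triplet = F.relu(hardest_pos - hardest_neg + margin).mean()

    # Point-to-centroid clustering term: mean distance to the class centroid.
    centroids = torch.stack([feats[labels == l].mean(dim=0) for l in labels])
    cluster = (feats - centroids).norm(dim=1).mean()
    return triplet + beta * cluster
```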

CNN based vehicle re-identification

Recently, re-identification (Re-ID) has become one of the burgeoning topics in the computer vision research community, attracting ever-growing attention due to its applications and research significance. Even though there have been significant advances in person re-identification techniques, vehicle Re-ID has still not been fully investigated owing to its inherent characteristics. Specifically, the inter-class similarity and intra-class variance present in vehicle images make image-based vehicle re-identification a particularly challenging task. In this project, an approach using only visual information is developed for vehicle re-identification. In particular, our model is based on a convolutional neural network (CNN) for deep feature learning, given its outstanding performance in a variety of computer vision tasks. The key component of our method is a structure we call the multistage loss merged (MSLM) model, which combines information fusion and multistage supervision. It is evaluated on three popular benchmark datasets of different sizes and with multiple constrained conditions: VeRi776, VehicleID and VRIC, and the results are compared with existing vehicle Re-ID methods. Comprehensive experiments over these benchmarks show that our MSLM model outperforms existing image-based vehicle re-identification approaches, achieving the highest mAP (67.71%) on VeRi776, the best CMC scores on VehicleID (88.88%, 83.50% and 82.90% rank-1 for the small, medium and large test sets respectively), and a significant improvement in top-1 accuracy on VRIC from 46.67% to 63.00%. Performance on real-world deployments is also investigated by presenting results on the CityFlow dataset from the 2019 AI City Challenge: our purely image-based model achieves competitive performance among competitors using only visual information and ranks mid-table among all teams. A thorough analysis of the leaderboard's top teams sheds light on potential performance improvements for our model through the incorporation of other approaches and techniques.

Tackling Mathematical Reasoning with Transformers

This project aims to find a neural model capable of exploiting the compositionality of mathematical reasoning problems by learning their structure and rules, with the specific aim of achieving high performance on questions that involve previously seen concepts but are longer than those seen in training. We test the transformer architecture on maths reasoning problems and highlight its strengths and weaknesses. We use this analysis to design and test new models that jointly learn the correct rules and their appropriate ordering and application for performing the four basic arithmetic operations: addition, subtraction, multiplication and division. We evaluate our models on the Maths Reasoning Dataset, which provides testing data for longer sequences.
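Below is a minimal character-level encoder-decoder transformer for arithmetic questions, assuming PyTorch's nn.Transformer; the vocabulary, sizes and example are illustrative, and positional encodings are omitted for brevity (a real model needs them).

```python
import torch
import torch.nn as nn

# Character-level vocabulary for arithmetic questions such as "12+7".
VOCAB = list("0123456789+-*/= ") + ["<s>", "</s>"]
stoi = {c: i for i, c in enumerate(VOCAB)}

class MathTransformer(nn.Module):
    """A small encoder-decoder transformer mapping a question string to an
    answer string, character by character."""
    def __init__(self, d_model=128):
        super().__init__()
        self.embed = nn.Embedding(len(VOCAB), d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True)
        self.out = nn.Linear(d_model, len(VOCAB))

    def forward(self, src_ids, tgt_ids):
        # Causal mask so each answer position only attends to earlier ones.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        h = self.transformer(self.embed(src_ids), self.embed(tgt_ids),
                             tgt_mask=tgt_mask)
        return self.out(h)

src = torch.tensor([[stoi[c] for c in "12+7"]])
tgt = torch.tensor([[stoi["<s>"], stoi["1"], stoi["9"]]])
logits = MathTransformer()(src, tgt)  # (1, 3, vocab): next-character scores
```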

Spider recognition system

Of the 10 most common spiders in UK households, 4 deliver bites venomous enough to affect humans. With an average of roughly 30 spiders present in each household, there is a significant need to identify spiders correctly and efficiently upon encountering them in the house. This paper details an attempt to develop an automated spider species recognition system that requires little instruction and no prior knowledge of arachnology. The Spider Comparator and AnalyseR (SCAR) program is developed in MATLAB using open source GIST code and the VLFeat toolbox. SCAR employs both retrieval and recognition methods for spider image classification. The methods used are: GIST and SIFT descriptors matched with an L2 norm distance measure, SVM classification using GIST descriptors, and the Bag of Visual Words approach. The program consists of 6 core, interrelated functions for identifying and testing spider test images against a species database, 3 functions for test and result analysis, and 1 demonstration program. Tests were conducted to demonstrate scaling and rotational invariance with the SIFT descriptor, and more realistic testing scenarios were also employed to assess the suitability of each classification approach for this specific spider recognition application. While the tests confirm the scaling and rotational invariance of the SIFT descriptor, limitations of realistic species matching are revealed. While all approaches performed admirably, the Bag of Visual Words approach produced the best performance, with a classification rate of 53%. Future work includes optimising descriptors (e.g. improving orientation bin quantisation), including colour information in the descriptors used, and segmenting the spider body within the image.
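The Bag of Visual Words pipeline that performed best is sketched below in Python (the actual SCAR program is MATLAB/VLFeat, so this is a re-sketch of the approach, not its code); the image paths, labels and vocabulary size are hypothetical.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

sift = cv2.SIFT_create()

def sift_descriptors(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(img, None)
    return desc

# Hypothetical training lists; in SCAR these come from the species database.
train_paths = ["spider_01.jpg", "spider_02.jpg"]
train_labels = [0, 1]

# 1) Build a visual vocabulary by clustering SIFT descriptors.
all_desc = np.vstack([sift_descriptors(p) for p in train_paths])
vocab = KMeans(n_clusters=200).fit(all_desc)

# 2) Represent each image as a histogram of visual-word occurrences.
def bovw_histogram(path):
    words = vocab.predict(sift_descriptors(path))
    return np.bincount(words, minlength=200).astype(float)

# 3) Train an SVM on the histograms for species classification.
X = np.array([bovw_histogram(p) for p in train_paths])
clf = SVC(kernel="linear").fit(X, train_labels)
```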

Long-term surveillance from low frame-rate cameras

The aim of this project is to analyse and build on a computer vision algorithm that identifies the presence and location of one or more animals in footage captured by a motion-operated, low frame-rate camera in a surveillance camera trap. The target areas of analysis and improvement include robustness to environmental changes and extreme cases, and run-time and resource optimisation. Applying background subtraction provides a simple way to identify foreground objects in videos whose frames are one second apart, at the expense of performance. By means of camera-specific analysis and ground-truthing, the selection of optimal background subtraction parameters for satisfactory outputs is justified. By executing and optimising the algorithm on a Raspberry Pi, the operating environment is also simulated on a machine with low energy and storage demands.
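A minimal sketch of the background-subtraction idea is shown below using OpenCV's MOG2 subtractor; the parameter values, file name and area threshold are illustrative, not the optimal values the project derives.

```python
import cv2

# MOG2 background subtraction, a standard technique for detecting
# moving animals in low frame-rate camera-trap footage.
subtractor = cv2.createBackgroundSubtractorMOG2(
    history=50, varThreshold=16, detectShadows=False)  # illustrative parameters

cap = cv2.VideoCapture("camera_trap.mp4")  # hypothetical input file
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)
    # Keep only large connected regions; small blobs are usually noise.
    n, cc_labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    animals = [stats[i] for i in range(1, n)
               if stats[i, cv2.CC_STAT_AREA] > 500]
cap.release()
```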

Image classification using convolutional neural networks applied to brain tumor detection & segmentation

This project is concerned with the design of convolutional neural networks for the detection and segmentation of brain tumors. The aim is to explore possible improvements to the conventional architecture that allow for high levels of accuracy. The recently proposed multi-pathway structure, which has shown promising results in several computer vision tasks, serves as the basis for the new model design. Due to the relatively small size and unbalanced nature of the training dataset, model over-fitting arises as the main issue to be tackled. It was found that the key to success was not purely suitable network design; appropriate training sample selection and pre-processing procedures are just as important, if not more so.

Person identification by matching natural text description with images

This project concerns the research and design of a system to match natural language text descriptions with pedestrian images. The main focus is projecting image and description representations into a common subspace where their similarity can be measured and matches predicted. Descriptions were captured through a web interface where users followed a given script, so all data is labelled and the majority of learning is supervised. By experimenting with different neural network structures, vector projection methods and natural-language-to-vector conversion parameters, different degrees of successful matching have been achieved. Word2Vec, a program that learns a distributed representation of words, was used to convert the natural language descriptions to their vector representations, and a two-channel CNN binary match classifier network was used to measure the similarity of image and description representations. This project has security-related applications, using soft biometrics to identify a suspect from footage using witness statements, a process that is currently very time intensive and has a low success rate. Such person re-identification is of growing importance in an ever more watched and recorded world, where CCTV has been widely implemented in many countries.

Feature detection & tracking on mobile phone

Simultaneous Localization and Mapping (SLAM) has many applications in robotics and Augmented Reality. One of the main problems facing SLAM is the computational efficiency of SLAM algorithms. For these algorithms to work on smaller devices such as mobile phones, which have inferior computing power compared to a desktop, solutions are needed to make them more computationally efficient so they perform better on mobile devices. A SLAM pipeline has many parts: visual input, feature extraction, feature matching, visual odometry and mapping. This project explores how the feature extraction part of the SLAM pipeline affects the performance of the SLAM algorithm. The results obtained can be used to propose improvements to current SLAM solutions.
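As an illustration of the feature extraction and matching stages, the sketch below uses ORB, a common low-cost descriptor for mobile devices, with OpenCV; the frame file names and feature count are illustrative, and the project's actual detector choices may differ.

```python
import cv2

# Extract and match ORB features between two consecutive frames.
orb = cv2.ORB_create(nfeatures=500)

img1 = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)  # hypothetical frames
img2 = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Hamming distance suits ORB's binary descriptors; crossCheck filters outliers.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(f"{len(matches)} matches")
```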

Using machine learning to learn and generate new text

Traditional natural language generation systems are often both limited in their expressive ability, and highly application specific. This means they must be built from the ground up for every slight change in application domain. A domain-independent system is therefore of great interest, and the flexibility of such a system would give it practical advantages over the traditional approaches. The purpose of this project was to investigate methods for semantically constraining LSTM RNNs such that they can be of practical use. The generation of natural language weather forecasts from raw data adequately illustrates the practical uses of this work, and was therefore selected as the application domain for the evaluation of the systems created.

Speed camera from monocular view

Modern vehicle speed detection products in wide commercial use are mainly based on radar technology or infrared light. Rather than retail markets, these products are mainly aimed at government departments, e.g., traffic police. Meanwhile, traffic safety remains a severe problem in every nation in the world, and a portable, reliable and effective method for detecting other vehicles' velocities on the street in real time is still not mature enough for commercial use. This project proposes a new concept: combining machine learning techniques with image processing to estimate the speed of a vehicle approaching from behind the driver using only a single-view camera. The operating procedure of this project includes 1) video acquisition from a single-view camera, 2) video frame processing, 3) vehicle detection and tracking, 4) pixel-size-to-distance conversion, and 5) velocity calculation based on the frame sampling rate. Within these procedures, several deep learning techniques are involved, including modifying a deep neural network architecture using TensorFlow and training the non-linear model for pixel-to-distance matching.
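Step 5 reduces to simple arithmetic once steps 3 and 4 provide a tracked displacement and a pixel-to-metre scale; the sketch below shows that conversion with hypothetical numbers.

```python
def estimate_speed(disp_px, metres_per_px, fps, frames_between=1):
    """Convert a tracked vehicle's pixel displacement between sampled frames
    into a speed estimate. metres_per_px comes from the learned
    pixel-to-distance model described above."""
    distance_m = disp_px * metres_per_px
    time_s = frames_between / fps
    return distance_m / time_s  # metres per second

# Example: 12 px displacement, 0.05 m/px at that image depth, 30 fps sampling.
speed = estimate_speed(disp_px=12, metres_per_px=0.05, fps=30)
print(f"{speed:.1f} m/s ({speed * 3.6:.1f} km/h)")  # 18.0 m/s (64.8 km/h)
```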

Global to local object matching

Content-Based Image Retrieval (CBIR) makes great use of visual content representations to identify relevant images. In the early years, CBIR was mainly studied with global features such as colour histograms. From the 21st century onwards, local descriptors (SIFT-based) and the Bag of Words (BoW) model became very popular for CBIR, since local features are more invariant to certain changes in image conditions, which is advantageous in producing image representations. Recently, deep learning has attracted huge attention in CBIR, having demonstrated state-of-the-art performance in many other computer vision tasks such as image classification and object detection. This project investigates both the initial filtering stage and the re-ranking stage of a CBIR system: various image representations derived from different layers (fully-connected and convolutional) of deep learning models (CNNs) are applied at the initial filtering stage, and several refinement schemes based on localisation and query expansion are adopted at the re-ranking stage. A thorough series of experiments is conducted to find the method that performs best in terms of accuracy and the ability to retrieve local objects as the query object becomes smaller and more localised. Several mainstream public image retrieval datasets and mean average precision (mAP) were used to compare and evaluate the retrieval performance of the different methods.
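A minimal sketch of the initial filtering stage is shown below: a pretrained CNN's penultimate activations serve as a global descriptor and the gallery is ranked by cosine similarity. This assumes PyTorch/torchvision with hypothetical file names; the project compares descriptors from several layers, not only this one.

```python
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

# Use a pretrained CNN's penultimate activations as a global image descriptor.
backbone = models.resnet50(pretrained=True)
backbone.fc = torch.nn.Identity()   # drop the classifier, keep 2048-d features
backbone.eval()

prep = transforms.Compose([
    transforms.Resize((224, 224)), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

def describe(path):
    with torch.no_grad():
        x = prep(Image.open(path).convert("RGB")).unsqueeze(0)
        return F.normalize(backbone(x))  # L2-normalised descriptor

# Initial filtering: rank gallery images by cosine similarity to the query.
query = describe("query.jpg")                      # hypothetical files
gallery = torch.cat([describe(p) for p in ["g1.jpg", "g2.jpg"]])
scores = (gallery @ query.T).squeeze(1)
ranking = scores.argsort(descending=True)
```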

Deep segmentation and registration in x-ray angiography video

In interventional radiology, short video sequences of vein structure in motion are captured to help medical personnel identify vascular issues or plan interventions. Semantic segmentation can greatly improve the usefulness of these videos by indicating the exact position of vessels and instruments, reducing ambiguity. We propose a real-time segmentation method for these tasks, based on a U-Net network trained in a Siamese architecture from automatically generated annotations. We make use of noisy low-level binary segmentation and optical flow to generate multi-class annotations that are successively improved in a multistage segmentation approach. We significantly improve the performance of a state-of-the-art U-Net at processing speeds of 90 fps. Stemming from the need to estimate optical flow between two frames accurately but also rapidly, we propose a method that utilises the U-Net network to increase the accuracy of optical flow estimations, which are characterised by fast but inaccurate results, by 48% while adding only 4 ms of extra computation. Finally, to address the difficulty of acquiring medical image datasets, we provide proof of concept that X-ray fluoroscopy sequences can be synthesised using a DC-GAN network.

Real time object tracking on FPGA for UAV applications

This project implements the Kernel Correlation Filter (KCF) object tracking algorithm on a Field Programmable Gate Array (FPGA) using existing Xilinx IP cores. The KCF algorithm operates in the Fourier domain to train and to detect movement. As the main bottleneck of the algorithm is performing the Discrete Fourier Transform, this was moved into the FPGA for speed-ups. The achieved speed-up was 25% for the DFT operation itself, though this does not include the extra time the CPU needs to access the uncached memory area. Further enhancements were implemented to speed up larger sections of the code: the element-wise operations executed between the forward and inverse transforms were moved to the FPGA to reduce memory transfers. Although the FPGA operates faster, there is a trade-off between speed and the accuracy of the results, as the FPGA implementation introduces noise into the values.

Improving illumination conditions by Generative Adversarial Networks

Enhancing extreme low-light images is a challenge that has been addressed in the literature in multiple ways. Traditional image processing pipelines, such as histogram equalisation, white balancing or denoising, have proven successful for improving illumination conditions; however, they introduce image artefacts and noise and struggle to preserve colours. Furthermore, many of those pipelines are designed to operate on raw images, which are not always available. This project introduces a new approach to enhancing low-light compressed images. In contrast to raw images, the degraded information content and noise present in compressed images require the latest deep learning techniques, in addition to traditional methods, to obtain high-quality results. A novel adaptation of the traditional cGAN methodology, together with a descriptor loss function in the generator network, is used to obtain properly illuminated images. Evaluation is performed using standard metrics of image perception, such as PSNR and SSIM, together with matching score, to faithfully assess how well the transformed images preserve detail. Results show that using cGANs for illumination enhancement of low-light images is possible, yielding realistic results that improve on traditional image processing methods.
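The PSNR/SSIM part of the evaluation can be sketched with scikit-image (assuming skimage >= 0.19 for the channel_axis argument); the images here are synthetic placeholders just to make the snippet self-contained.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(enhanced, reference):
    """Standard perception metrics used to assess the enhanced images
    against well-exposed references."""
    psnr = peak_signal_noise_ratio(reference, enhanced, data_range=255)
    ssim = structural_similarity(reference, enhanced, channel_axis=-1,
                                 data_range=255)
    return psnr, ssim

# Synthetic uint8 RGB images standing in for reference/enhanced pairs.
ref = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
out = np.clip(ref.astype(int) + np.random.randint(-10, 10, ref.shape),
              0, 255).astype(np.uint8)
print(evaluate(out, ref))
```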

Cross-modal person re-identification

Person re-identification is a fairly popular sub-topic in computer vision; however, most of the proposed methods only consider matching between RGB images. In some cases, though, RGB images will not be available: for example, in most surveillance systems the CCTV cameras switch to infrared imaging in poorly lit environments. Matching between RGB and infrared images is then required, which is essentially a cross-modal person re-identification problem. This project makes several contributions to the field of cross-modal person re-ID. Firstly, we address the importance of local features in a cross-modal re-ID setting, which has not been done before. Our motivation is that certain local signatures on a person's clothing are both unique to the person and invariant across the domain gap. We utilise a model which replaces the Global Average Pooling (GAP) layer in ResNet50, to better equip the model to attend to such local regions. We observe that this modification provides an 18.8% improvement in rank-1 accuracy compared to using the GAP layer. In addition, we incorporate several tricks into the modified ResNet50 model to further improve performance. The combination of all of these provides a very strong baseline model which performs consistently well across all notable cross-modal datasets (SYSU-MM01, RegDB and Sketch Re-ID), beating the previous state of the art in rank-1 accuracy by margins of 2.61%, 6.18% and 22.7% respectively.
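A minimal PyTorch sketch of the general idea, replacing GAP with horizontal part-level pooling so local regions keep separate descriptors, is shown below; the number of parts and input size are illustrative, not the project's exact model.

```python
import torch
import torch.nn as nn
from torchvision import models

class PartPooledResNet(nn.Module):
    """ResNet50 with global average pooling replaced by horizontal part
    pooling: the final feature map is split into p horizontal stripes,
    each pooled separately, so local signatures are preserved."""
    def __init__(self, num_parts=6):
        super().__init__()
        resnet = models.resnet50(pretrained=True)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # conv maps
        self.part_pool = nn.AdaptiveAvgPool2d((num_parts, 1))         # p stripes

    def forward(self, x):
        fmap = self.backbone(x)                    # (B, 2048, H, W)
        parts = self.part_pool(fmap)               # (B, 2048, p, 1)
        return parts.flatten(2).permute(0, 2, 1)   # (B, p, 2048) part features

feats = PartPooledResNet()(torch.randn(2, 3, 256, 128))
print(feats.shape)  # torch.Size([2, 6, 2048])
```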

Generation of paintings using Generative Adversarial Networks

An exploration into the research, design and implementation of bespoke Generative Adversarial Network (GAN) architectures is presented, with the aim of synthesizing novel paintings that have 'aesthetic arousal' and can be perceived as 'Art' by observers. Synthetic paintings are first generated from noise and then by using auxiliary information in the form of artist, style, genre and title labels as well as text descriptions. Modifications to the architecture, hyper-parameter configuration and training procedure of each GAN implementation are considered in order to improve training stability and generate higher-resolution paintings. Qualitative analysis based on survey responses and quantitative analysis using the Inception Score and Fréchet Inception Distance were conducted. These analyses revealed that paintings generated using auxiliary information, with the exception of text descriptions, induced the most aesthetic arousal, with style-conditioned generated paintings being the most indistinguishable from real paintings. Sentence-level methods were more successful than character-level methods for embedding text, but both failed to generate coherent paintings, suggesting that more sophisticated, tailored embedding methods are required for noisy and verbose text descriptions.

Domain transfer between images with GANs

Person re-identification (re-ID) is the problem of identifying the same person across multiple cameras. This is a non-trivial problem, confounded by non-overlapping fields of view, lighting differences, occlusion, variation in poses and different camera viewpoints. Current person re-ID systems perform well on specific datasets but experience large performance drops when trained and tested on different datasets. This report describes a new method to improve the robustness of person re-ID models. The proposed method generates new backgrounds using a generative adversarial network, which allows person re-ID models to be trained on larger and more varied datasets, thereby improving robustness. Individual identities from the original dataset are recreated in new scenarios with corresponding labels, allowing person re-ID networks to use supervised learning on the generated data. Variations of the proposed method provide significant control over the generated images, from maintaining high similarity between a generated identity and its original (same pose) to generating the identity in any new pose while still maintaining significant similarity.

Audio event mining in large data with Deep Neural Network representation

Passive acoustic monitoring has become a popular way to estimate the activity and population of species; however, the large amount of recorded data demands significant time and effort from experts. Taking Geoffroy's spider monkey, Ateles geoffroyi, as a case study, we aim to develop an automated species detector based on a CNN to predict call positions in audio files. The audio signal is represented by a mel-spectrogram. To improve model performance, we propose several data processing approaches as strategies for compiling the training dataset. Noise reduction methods, including spectral subtraction and MMSE-LSA estimators, are applied to the positive data to enhance the region of interest. Due to the relatively small dataset, we use augmentation to increase the variety and diversity of the data, reducing the generalisation error. Moreover, we also test performance on new data clips, measuring the number of wrong predictions as a comparison, and record the prediction results in files for future review. All models trained with these strategies achieve improvements to varying degrees. Since the baseline model is a shallow network with limited performance, a deeper model based on VGGNet, named the VGG-based model, is proposed. By learning high-level features, most of the hard positive data can be accurately classified. As a result, the VGG-based model with the augmented dataset achieves the best performance, with 85.05% accuracy and an 83.32% F1 score. Additionally, both the baseline and VGG-based models are trained a second time with hard negative mining, which increases accuracy and precision by around 5% in general.
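The mel-spectrogram input representation can be computed as sketched below, assuming librosa; the file name and parameter values are illustrative, not necessarily those used in the project.

```python
import librosa
import numpy as np

# Load a recording and compute the log-scaled mel-spectrogram used as the
# CNN input representation.
audio, sr = librosa.load("recording.wav", sr=22050)  # hypothetical file
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=2048,
                                     hop_length=512, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)  # (n_mels, frames)
```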