Spider recognition system

Of the 10 most common spiders in UK households, 4 of them deliver bites poisonous enough to affect humans. With an average of roughly 30 spiders present in each household, there is a significant need to be able to correctly and efficiently identify spiders upon encountering them in the house. This paper details an attempt to develop an automated spider species recognition system that requires little instruction and no prior knowledge in arachnology.

The Spider Comparator and AnalyseR (SCAR) program is developed in MATLAB using open-source GIST code and the VLFeat toolbox. SCAR employs both retrieval and recognition methods for spider image classification. The methods used are: GIST and SIFT descriptors matched with the L2-norm distance measure, SVM classification using GIST descriptors, and the Bag of Visual Words approach. The program consists of 6 core, interrelated functions for the identification and testing of spider test images against a species database, 3 functions for test and result analysis, and 1 demonstration program. Tests were conducted to demonstrate the scale and rotation invariance of the SIFT descriptor. More realistic testing scenarios were also employed to assess the suitability of each classification approach for this specific spider recognition application. While the tests confirm the scale and rotation invariance of the SIFT descriptor, they also reveal the limitations of realistic species matching. Of the approaches tested, the Bag of Visual Words approach performed best, with a classification performance of 53%. Future work includes optimising descriptors (e.g. improving orientation bin quantisation), including colour information in the descriptors used, and segmentation of the spider body within the image.
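
As an illustration of the retrieval step described above (descriptors matched with an L2-norm distance measure), the following is a minimal Python sketch. It is not the SCAR code, which is written in MATLAB; the function names, species labels and three-element descriptors are hypothetical stand-ins for real GIST or SIFT descriptors.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2-norm) distance between two equal-length descriptors."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_species(query, database):
    """Return (label, distance) of the database entry closest to the query.

    `database` is a list of (label, descriptor) pairs.
    """
    best_label, best_dist = None, float("inf")
    for label, descriptor in database:
        d = l2_distance(query, descriptor)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label, best_dist


# Toy database: labels and values are made up for illustration only.
toy_database = [
    ("species_a", [0.1, 0.9, 0.3]),
    ("species_b", [0.8, 0.2, 0.5]),
    ("species_c", [0.4, 0.4, 0.4]),
]
label, dist = nearest_species([0.75, 0.25, 0.5], toy_database)
print(label, round(dist, 3))
```

In this toy case the query is matched to `species_b`, the entry with the smallest distance.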

Long-term surveillance from low frame-rate cameras

The aim of this project is to analyse and develop a computer vision algorithm that identifies the presence and location of one or more animals in footage captured by a motion-triggered, low frame-rate camera in a surveillance camera trap. The target areas of analysis and improvement include robustness to environmental changes and extreme cases, and run-time and resource optimisation. Applying the background subtraction technique provides a simple way of identifying the foreground objects in videos whose frames are one second apart, at the expense of performance. The selection of optimal background subtraction parameters for satisfactory outputs is justified by means of camera-specific analysis and ground-truthing. Executing and optimising the algorithm on a Raspberry Pi also makes it possible to simulate the operating environment on a machine with low energy and storage demands.
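
The background subtraction idea above can be sketched in a few lines of Python. This is only one possible variant, assuming a per-pixel median background and a fixed threshold; the abstract does not state which background model or parameters the project uses, and all names and pixel values below are hypothetical.

```python
from statistics import median


def median_background(frames):
    """Per-pixel median over a list of equally sized grayscale frames."""
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[median(f[r][c] for f in frames) for c in range(cols)]
            for r in range(rows)]


def foreground_mask(frame, background, threshold):
    """Mark pixels whose absolute difference from the background exceeds threshold."""
    return [[1 if abs(p - b) > threshold else 0 for p, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]


def bounding_box(mask):
    """Return (top, left, bottom, right) of the foreground pixels, or None if empty."""
    coords = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return min(rows), min(cols), max(rows), max(cols)


# Three toy 3x4 frames; only the last one contains a bright "animal" blob.
frames = [
    [[10, 10, 10, 10], [10, 10, 10, 10], [10, 10, 10, 10]],
    [[10, 12, 10, 10], [10, 10, 11, 10], [10, 10, 10, 10]],
    [[10, 10, 10, 10], [10, 200, 200, 10], [10, 10, 10, 10]],
]
background = median_background(frames)
mask = foreground_mask(frames[2], background, threshold=30)
print(bounding_box(mask))
```

The threshold plays the role of one of the background subtraction parameters that would be tuned per camera.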

Image classification using convolutional neural networks applied to brain tumor detection & segmentation

This project is concerned with the design of convolutional neural networks that can be used for the detection and segmentation of brain tumors. The aim is to explore possible improvements to the conventional architecture that allow for high levels of accuracy. The recently proposed multi-pathway structure, which has shown promising results in several computer vision tasks, serves as a basis for the new model design. Due to the relatively small size and unbalanced nature of the training dataset, model over-fitting arises as the main issue that needs to be tackled. It has been found that the key to success is not purely suitable network design; appropriate training sample selection and pre-processing procedures are just as important, if not more so.

Person identification by matching natural text description with images

This project concerns the research and design of a system to match natural language text descriptions with pedestrian images. The main focus of this project is projecting image and description representations to a common subspace where their similarity can be measured and matches predicted. Descriptions were captured through a web interface where users followed a given script; as such, all data is labelled and the majority of learning is supervised. By experimenting with different neural network structures, vector projection methods and natural-language-to-vector conversion parameters, different degrees of successful matching have been achieved. Word2Vec, a program that learns a distributed representation of words, was used to convert the natural language descriptions to their vector representations. A two-channel CNN binary match classifier network has been used to measure the similarity of image and description representations. This project has security-related applications, using soft biometrics to identify a suspect from footage using witness statements, a process that is currently very time-intensive and has a low success rate, as shown in figure 1. Such person re-identification is of growing importance in an increasingly watched and recorded world, where CCTV has been widely implemented in many countries.

Feature detection & tracking on mobile phone

Simultaneous Localization and Mapping (SLAM) has many applications in robotics and Augmented Reality. One of the main problems facing SLAM is the computational efficiency of SLAM algorithms. For these algorithms to work on smaller devices such as mobile phones, which have inferior computing power compared to a desktop, they need to be optimized to be more computationally efficient. There are many parts to a SLAM pipeline: Visual Input, Feature Extraction, Feature Matching, Visual Odometry and Mapping. This project aims to explore how the Feature Extraction part of the SLAM pipeline affects the performance of the SLAM algorithm. The results obtained can be used to propose improvements to current SLAM solutions.

Using machine learning to learn and generate new text

Traditional natural language generation systems are often both limited in their expressive ability, and highly application specific. This means they must be built from the ground up for every slight change in application domain. A domain-independent system is therefore of great interest, and the flexibility of such a system would give it practical advantages over the traditional approaches. The purpose of this project was to investigate methods for semantically constraining LSTM RNNs such that they can be of practical use. The generation of natural language weather forecasts from raw data adequately illustrates the practical uses of this work, and was therefore selected as the application domain for the evaluation of the systems created.

Speed camera from monocular view

Modern vehicle speed detection products now widely used in commercial markets are mainly based on radar or infra-red technology. These products are aimed mainly at government departments, such as traffic police, rather than at the retail market. Meanwhile, traffic safety remains a severe problem in every nation in the world. A portable, reliable and effective method that can detect the velocities of other vehicles on the street in real time is still not mature enough for commercial use. This project proposes a new concept: combining machine learning techniques with image processing to estimate the speed of a vehicle approaching from behind the driver using only a single-view camera. The operating procedure of this project includes: 1) video acquisition from a single-view camera, 2) video frame processing, 3) vehicle detection and tracking, 4) pixel-size-to-distance conversion, and 5) velocity calculation based on the frame sampling rate. Several deep learning concepts are involved in these procedures, including modification of a deep neural network architecture using TensorFlow, and training the non-linear model for the pixel-to-distance matching.
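
Steps 4) and 5) above can be illustrated with a short Python sketch. Note the substitution: the project trains a non-linear model for the pixel-to-distance matching, whereas this sketch assumes a simple pinhole-camera relation instead. The vehicle width, focal length, frame rate and pixel widths are all hypothetical values chosen for illustration.

```python
def distance_from_pixel_width(pixel_width, real_width_m=1.8, focal_length_px=1000.0):
    """Pinhole-model distance estimate in metres (an assumed stand-in for the
    trained pixel-to-distance model): distance = focal * real_width / pixel_width."""
    if pixel_width <= 0:
        raise ValueError("pixel_width must be positive")
    return focal_length_px * real_width_m / pixel_width


def closing_speed_kmh(pixel_widths, fps, frame_step=1):
    """Mean closing speed in km/h from vehicle widths (in pixels) sampled
    every `frame_step` frames of a video recorded at `fps` frames per second."""
    if len(pixel_widths) < 2:
        raise ValueError("need at least two samples")
    distances = [distance_from_pixel_width(w) for w in pixel_widths]
    elapsed_s = (len(distances) - 1) * frame_step / fps
    closing_m = distances[0] - distances[-1]  # positive when the vehicle approaches
    return closing_m / elapsed_s * 3.6  # m/s to km/h


# Hypothetical tracked widths: the vehicle appears larger as it approaches.
speed = closing_speed_kmh([60, 72, 90], fps=30, frame_step=15)
print(round(speed, 1), "km/h")
```

The result is the speed relative to the camera; recovering absolute speed would also require the speed of the camera-carrying vehicle.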

Global to local object matching

Content-Based Image Retrieval (CBIR) makes use of visual content representations to identify relevant images. In the early years, CBIR was mainly studied with global features such as colour histograms. From the 21st century onwards, local descriptors (SIFT-based) and the Bag of Words (BoW) model became very popular for CBIR, since local features are more invariant to certain changes in image conditions, which is advantageous in producing image representations. Recently, deep learning has attracted a great deal of attention in CBIR since it has demonstrated state-of-the-art performance in many other computer vision tasks such as image classification and object detection. This project investigates both the initial filtering stage and the re-ranking stage of a CBIR system: various image representations derived from different layers (fully-connected and convolutional) of deep learning models (CNNs) are applied to the initial filtering stage, and several refinement schemes based on localization and query expansion are adopted in the re-ranking stage. A thorough series of experiments is conducted to find the method that performs best in terms of accuracy and the ability to achieve local object retrieval as the query object becomes smaller and more local. Several mainstream public image retrieval datasets and mean average precision (mAP) were used to compare and evaluate the retrieval performance of the different methods.
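
For readers unfamiliar with the mAP metric mentioned above, here is a minimal Python sketch of how it can be computed from ranked retrieval results. It assumes that every relevant image for a query appears somewhere in the ranked list, so the average is taken over the retrieved relevant items; benchmark-specific evaluation protocols may differ. The rankings are made-up examples.

```python
def average_precision(ranked_relevance):
    """Average precision for one query.

    `ranked_relevance` is a list of 0/1 flags in ranked order, where 1 marks
    a relevant image. Precision is sampled at each relevant position.
    """
    hits = 0
    precisions = []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(all_rankings):
    """Mean of the per-query average precisions."""
    return sum(average_precision(r) for r in all_rankings) / len(all_rankings)


# Two hypothetical queries with their ranked relevance flags.
rankings = [
    [1, 0, 1, 0],  # relevant images at ranks 1 and 3
    [0, 1],        # relevant image at rank 2
]
print(round(mean_average_precision(rankings), 4))
```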

Deep segmentation and registration in x-ray angiography video

In interventional radiology, short video sequences of vein structure in motion are captured in order to help medical personnel identify vascular issues or plan intervention. Semantic segmentation can greatly improve the usefulness of these videos by indicating the exact position of vessels and instruments, thus reducing ambiguity. We propose a real-time segmentation method for these tasks, based on a U-Net network trained in a Siamese architecture from automatically generated annotations. We make use of noisy low-level binary segmentation and optical flow to generate multi-class annotations that are successively improved in a multistage segmentation approach. We significantly improve the performance of a state-of-the-art U-Net at a processing speed of 90 fps. Stemming from the need to estimate the optical flow between two frames accurately but also rapidly, we propose a method that uses the U-Net network to increase the accuracy of optical flow estimations, which are characterised by fast but inaccurate results, by 48% while adding only 4 ms of extra computation. Finally, to address the difficulties of acquiring medical image datasets, we provide a proof of concept that X-ray fluoroscopy sequences can be synthesised using a DC-GAN network.

Real time object tracking on FPGA for UAV applications

This project implements the Kernelized Correlation Filter (KCF) object tracking algorithm on a Field Programmable Gate Array (FPGA) using existing Xilinx IP cores. The KCF algorithm operates in the Fourier domain to train and to detect the movement. As the main bottleneck for the algorithm is performing the Discrete Fourier Transform, this has been moved onto the FPGA for speed-up. The achieved speed-up was 25% for the DFT operation itself, but this does not include the extra time needed for the CPU to access the uncached memory area. Further enhancements were implemented to speed up larger sections of the code. The element-wise operations that are executed between the forward transform and the inverse transform are moved to the FPGA to reduce memory transfers. Although the FPGA operates faster, there is a trade-off between speed and the accuracy of the results, as the FPGA implementation introduces noise to the values.
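
The pattern described above (forward transform, element-wise operations, inverse transform) can be illustrated with a one-dimensional Python sketch of correlation in the Fourier domain. This is not the full KCF algorithm, which also involves a kernel and a trained filter, and it uses a naive O(n^2) DFT purely for readability; the function names and signals are hypothetical.

```python
import cmath


def dft(signal, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a 1-D sequence."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(x * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))
        out.append(total / n if inverse else total)
    return out


def circular_correlation(template, patch):
    """Circular cross-correlation computed as IDFT(conj(DFT(template)) * DFT(patch))."""
    t_hat = dft(template)
    p_hat = dft(patch)
    # Element-wise step between the forward and inverse transforms.
    product = [p * t.conjugate() for p, t in zip(p_hat, t_hat)]
    return [value.real for value in dft(product, inverse=True)]


def detect_shift(template, patch):
    """Index of the correlation peak, i.e. the estimated circular shift."""
    response = circular_correlation(template, patch)
    return max(range(len(response)), key=response.__getitem__)


template = [0, 1, 3, 1, 0, 0, 0, 0]
patch = [0, 0, 0, 0, 1, 3, 1, 0]  # the template moved three samples to the right
print(detect_shift(template, patch))
```

Only the element-wise product sits between the two transforms, which is why moving both the DFT and that step onto the same device avoids intermediate memory transfers.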

Improving illumination conditions by Generative Adversarial Networks

Enhancing extreme low-light images is a challenge that has been addressed throughout the literature in multiple ways. Traditional image processing pipelines, such as histogram equalisation, white balancing or denoising, have proven successful for improving illumination conditions; however, they introduce image artefacts and noise, and struggle to preserve colours. Furthermore, many of those pipelines are designed to operate on raw images, which are not always available. This project introduces a new approach to enhance low-light compressed images. In contrast to raw images, the degraded information content and noise present in compressed images require the latest deep learning techniques in addition to traditional methods to obtain high-quality results. A novel adaptation of the traditional cGAN methodology, together with a descriptor loss function in the generator network, is used to obtain properly illuminated images. Evaluation is performed by means of standard metrics of image perception, such as PSNR or SSIM, together with a matching score, to faithfully assess how well detail is preserved in the transformed images. Results show that using cGANs for illumination enhancement of low-light images is possible, yielding realistic results that improve on traditional image processing methods.
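
Of the evaluation metrics named above, PSNR is simple enough to sketch directly. The following Python example assumes 8-bit grayscale images stored as nested lists; the function name and pixel values are hypothetical, and the project's actual evaluation code is not shown in the abstract.

```python
import math


def psnr(reference, enhanced, max_value=255):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    flat_ref = [p for row in reference for p in row]
    flat_enh = [p for row in enhanced for p in row]
    if len(flat_ref) != len(flat_enh):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(flat_ref, flat_enh)) / len(flat_ref)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


# Toy 2x2 images: every pixel differs by 5 grey levels, so the MSE is 25.
reference = [[100, 120], [140, 160]]
enhanced = [[105, 115], [145, 155]]
score = psnr(reference, enhanced)
print(round(score, 2), "dB")
```

Higher values indicate that the enhanced image is closer to the reference.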

Cross-modal person re-identification

Person re-identification is a fairly popular sub-topic in computer vision; however, most of the proposed methods only consider matching between RGB images. In some cases, though, RGB images will not be available. For example, in most surveillance systems the CCTV cameras switch to infrared imaging in poorly lit environments. Thus, matching between RGB and infrared images would be required, which is essentially a cross-modal person re-identification problem. This project makes several contributions in the field of cross-modal person re-id. Firstly, we attempt to address the importance of local features in a cross-modal re-id setting, which has never before been done. Our motivation is that certain local signatures on a person's clothing are both unique to the person and invariant across the domain gap. We utilize a model which replaces the Global-Average Pooling (GAP) layer in the ResNet50, so as to better equip the model to attend to such local regions. We observe that this modification provides an 18.8% improvement in rank-1 accuracy compared to using the GAP layer. In addition, we incorporate several tricks into the modified ResNet50 model to further improve the performance. The combination of all of these provides a very strong baseline model which performs consistently well across all notable cross-modal datasets (SYSU-MM01, RegDB and Sketch Re-ID), beating the previous state of the art in rank-1 accuracy by respective margins of 2.61%, 6.18% and 22.7%.

Generation of paintings using Generative Adversarial Networks

An exploration into the research, design and implementation of bespoke Generative Adversarial Network (GAN) architectures is presented, with the aim of synthesizing novel paintings that have ‘aesthetic arousal’ and can be perceived as ‘Art’ by observers. Synthetic paintings are first generated from noise and then by using auxiliary information in the form of artist, style, genre and title labels as well as text descriptions. Modifications to the architecture, hyper-parameter configuration and training procedure of each GAN implementation are considered in order to improve training stability and generate higher resolution paintings. Qualitative analysis based on responses from a survey and quantitative analysis using the Inception Score and Fréchet Inception Distance were conducted. These analyses revealed that paintings generated using auxiliary information, with the exception of text descriptions, induced the most aesthetic arousal, and that style-conditioned generated paintings were the hardest to distinguish from real paintings. Sentence-level methods were more successful than character-level methods for embedding text; however, both failed to generate coherent paintings, suggesting that more sophisticated and tailored embedding methods are required for noisy and verbose text descriptions.

Domain transfer between images with GANs

Person re-identification (re-ID) is the problem of identifying the same person in multiple cameras. This is a non-trivial problem that is confounded by non-overlapping fields of view, lighting differences, occlusion, variation in poses and different camera viewpoints. Current person re-ID systems perform well on specific datasets but experience large performance drops when trained and tested on different datasets. This report describes a new method to improve the robustness of person re-ID models. The proposed method generates new backgrounds using a generative adversarial network, which allows person re-ID models to be trained on larger and more varied datasets, thereby improving robustness. Individual identities from the original dataset are recreated in new scenarios with corresponding labels, which allows person re-ID networks to utilise supervised learning on the generated data. Variations of the proposed method provide significant control over the generated images, from maintaining high similarity between the generated identities and their respective originals (same pose) to generating the identity in any new pose while still maintaining significant similarities.

Audio event mining in large data with Deep Neural Network representation

Passive acoustic monitoring has become a popular way to estimate the activity and population of species. However, reviewing large amounts of recorded data is a significantly time-consuming effort for experts. Taking Geoffroy’s spider monkey, Ateles geoffroyi, as a case study, we aim to develop an automated CNN-based species detector that predicts call positions in audio files. The audio signal is represented by a mel-spectrogram. To improve model performance, we propose several data processing approaches as the strategy for compiling the training dataset. Noise reduction methods, including spectral subtraction and MMSE-LSA estimators, are applied to positive data to enhance the region of interest. Because the dataset is relatively small, we use augmentation to increase the variety and diversity of the data, reducing the generalisation error. Moreover, we also test performance on new data clips, measuring the number of wrong predictions as a comparison. The prediction results are recorded in files for future review. All models using these strategies improve to varying degrees. Since the baseline model is a shallow network with limited performance, a deep model based on VGGNet, named the VGG-based model, is proposed. By learning high-level features, it can accurately classify most of the hard positive data. As a result, the VGG-based model with the augmented dataset achieves the best performance, with 85.05% accuracy and an 83.32% F1 score. Additionally, both the baseline and VGG-based models are trained a second time with hard negative mining, which increased accuracy and precision by 5% in general.
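
To make the spectral subtraction step above more concrete, here is a minimal Python sketch operating on a magnitude spectrogram stored as a list of frames. It assumes the noise spectrum is estimated as the mean of the first few (call-free) frames and that negative results are clipped to a floor; the abstract does not specify how the project estimates noise, and all names and values are hypothetical.

```python
def estimate_noise(spectrogram, noise_frames):
    """Mean magnitude per frequency bin over the first `noise_frames` frames."""
    n_bins = len(spectrogram[0])
    return [sum(frame[b] for frame in spectrogram[:noise_frames]) / noise_frames
            for b in range(n_bins)]


def spectral_subtraction(spectrogram, noise, floor=0.0):
    """Subtract the noise estimate from every frame, clipping at `floor`."""
    return [[max(magnitude - n, floor) for magnitude, n in zip(frame, noise)]
            for frame in spectrogram]


# Toy spectrogram: 3 time frames x 3 frequency bins.
# The first two frames contain only background noise; the third has a call
# concentrated in the middle bin.
spectrogram = [
    [1.0, 2.0, 1.0],
    [1.0, 2.0, 1.0],
    [1.0, 8.0, 1.0],
]
noise = estimate_noise(spectrogram, noise_frames=2)
cleaned = spectral_subtraction(spectrogram, noise)
print(cleaned)
```

After subtraction only the energy attributable to the call remains, which is the sense in which the region of interest is enhanced.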

CNN based vehicle re-identification

Recently, re-identification (Re-ID) has become one of the burgeoning topics in the computer vision research community, attracting ever-growing attention due to its applications and research significance. Even though there have been significant advances in person re-identification techniques, the problem of vehicle Re-ID has still not been fully investigated owing to its inherent characteristics. Specifically, the inter-class similarity and intra-class variance present in vehicle images make image-based vehicle re-identification a more challenging task. In this project, an approach using only visual information is developed for vehicle re-identification. In particular, our model is based on a convolutional neural network (CNN). The structure, defined as the multistage loss merged (MSLM) model, is the key component of our proposed method and combines information fusion with multistage supervision. It is evaluated on three popular benchmark datasets with different sizes and multiple constrained conditions: VeRi776, VehicleID and VRIC. Finally, the results are presented and compared with existing vehicle Re-ID methods. Comprehensive experiments on the benchmark datasets show that our proposed MSLM model outperforms existing image-based vehicle re-identification approaches, with the following achievements: the highest mAP value (67.71%) on VeRi776; the best CMC scores on VehicleID (88.88%, 83.50% and 82.90% at rank 1 for the small, medium and large test datasets respectively); and a significant improvement in top-1 accuracy on the VRIC dataset from 46.67% to 63.00%. In addition, the performance guarantee for real-world task deployments is investigated by presenting results on the CityFlow dataset from the 2019 AI City Challenge.
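
The CMC rank-k scores quoted above can be illustrated with a short Python sketch. It assumes each query comes with a gallery list already sorted by decreasing similarity; dataset-specific protocols (for example, excluding same-camera matches) are not modelled. The vehicle IDs and rankings are made up for illustration.

```python
def cmc_rank_k(query_ids, ranked_gallery_ids, k=1):
    """Fraction of queries whose true ID appears among the top-k gallery matches."""
    if len(query_ids) != len(ranked_gallery_ids):
        raise ValueError("one ranked gallery list is needed per query")
    hits = sum(1 for qid, ranked in zip(query_ids, ranked_gallery_ids)
               if qid in ranked[:k])
    return hits / len(query_ids)


# Hypothetical vehicle IDs and their ranked gallery matches (best match first).
query_ids = ["v1", "v2", "v3", "v4"]
ranked_gallery_ids = [
    ["v1", "v7", "v3"],
    ["v5", "v2", "v9"],
    ["v3", "v1", "v2"],
    ["v8", "v6", "v4"],
]
for k in (1, 2, 3):
    print(k, cmc_rank_k(query_ids, ranked_gallery_ids, k))
```

Rank-1 accuracy is the strictest point on the CMC curve, since only the single best match counts.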