Balntas V, Lenc K, Vedaldi A, et al., 2020, HPatches: A benchmark and evaluation of handcrafted and learned local descriptors, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol: 42, Pages: 2825-2841, ISSN: 0162-8828
In this paper, a novel benchmark is introduced for evaluating local image descriptors. We demonstrate limitations of the commonly used datasets and evaluation protocols, that lead to ambiguities and contradictory results in the literature. Furthermore, these benchmarks are nearly saturated due to the recent improvements in local descriptors obtained by learning from large annotated datasets. To address these issues, we introduce a new large dataset suitable for training and testing modern descriptors, together with strictly defined evaluation protocols in several tasks such as matching, retrieval and verification. This allows for more realistic, thus more reliable comparisons in different application scenarios. We evaluate the performance of several state-of-the-art descriptors and analyse their properties. We show that a simple normalisation of traditional hand-crafted descriptors is able to boost their performance to the level of deep learning based descriptors once realistic benchmarks are considered. Additionally we specify a protocol for learning and evaluating using cross validation. We show that when training state-of-the-art descriptors on this dataset, the traditional verification task is almost entirely saturated.
Ramisa A, Yan F, Moreno-Noguer F, et al., 2018, BreakingNews: article annotation by image and text processing, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol: 40, Pages: 1072-1085, ISSN: 0162-8828
Building upon recent Deep Neural Network architectures, current approaches lying in the intersection of Computer Vision and Natural Language Processing have achieved unprecedented breakthroughs in tasks like automatic captioning or image retrieval. Most of these learning methods, though, rely on large training sets of images associated with human annotations that specifically describe the visual content. In this paper we propose to go a step further and explore the more complex cases where textual descriptions are loosely related to the images. We focus on the particular domain of news articles in which the textual content often expresses connotative and ambiguous relations that are only suggested but not directly inferred from images. We introduce an adaptive CNN architecture that shares most of the structure for multiple tasks including source detection, article illustration and geolocation of articles. Deep Canonical Correlation Analysis is deployed for article illustration, and a new loss function based on Great Circle Distance is proposed for geolocation. Furthermore, we present BreakingNews, a novel dataset with approximately 100K news articles including images, text and captions, and enriched with heterogeneous meta-data (such as GPS coordinates and user comments). We show this dataset to be appropriate to explore all aforementioned problems, for which we provide a baseline performance using various Deep Learning architectures, and different representations of the textual and visual features. We report very promising results and bring to light several limitations of current state-of-the-art in this kind of domain, which we hope will help spur progress in the field.
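The abstract above mentions a geolocation loss based on Great Circle Distance but does not give its form. As a rough illustration only, the underlying distance can be computed with the standard haversine formula; the function name, the Earth radius constant and the degree-based inputs below are assumptions for this sketch, not the authors' loss function.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, in kilometres


def great_circle_distance(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Haversine great-circle distance between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * radius * math.asin(min(1.0, math.sqrt(a)))
```

A regression loss could, for example, penalise this distance between predicted and ground-truth coordinates; how the paper actually does so is not stated in the abstract.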
Balntas V, Tang L, Mikolajczyk K, 2018, Binary Online Learned Descriptors, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol: 40, Pages: 555-567, ISSN: 0162-8828
Koniusz P, Yan F, Gosselin P-H, et al., 2017, Higher-order occurrence pooling for bags-of-words: visual concept detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol: 39, Pages: 313-326, ISSN: 0162-8828
In object recognition, the Bag-of-Words model assumes: i) extraction of local descriptors from images, ii) embedding the descriptors by a coder to a given visual vocabulary space which results in mid-level features, iii) extracting statistics from mid-level features with a pooling operator that aggregates occurrences of visual words in images into signatures, which we refer to as First-order Occurrence Pooling. This paper investigates higher-order pooling that aggregates over co-occurrences of visual words. We derive Bag-of-Words with Higher-order Occurrence Pooling based on linearisation of Minor Polynomial Kernel, and extend this model to work with various pooling operators. This approach is then effectively used for fusion of various descriptor types. Moreover, we introduce Higher-order Occurrence Pooling performed directly on local image descriptors as well as a novel pooling operator that reduces the correlation in the image signatures. Finally, First-, Second-, and Third-order Occurrence Pooling are evaluated given various coders and pooling operators on several widely used benchmarks. The proposed methods are compared to other approaches such as Fisher Vector Encoding and demonstrate improved results.
Akin O, Erdem E, Erdem A, et al., 2016, Deformable part-based tracking by coupled global and local correlation filters, Journal of Visual Communication and Image Representation, Vol: 38, Pages: 763-774, ISSN: 1095-9076
Correlation filters have recently attracted attention in visual tracking due to their efficiency and high performance. However, their application to long-term tracking is somewhat limited since these trackers are not equipped with mechanisms to cope with challenging cases like partial occlusion, deformation or scale changes. In this paper, we propose a deformable part-based correlation filter tracking approach which depends on coupled interactions between a global filter and several part filters. Specifically, local filters provide an initial estimate, which is then used by the global filter as a reference to determine the final result. Then, the global filter provides a feedback to the part filters regarding their updates and the related deformation parameters. In this way, our proposed collaborative model handles not only partial occlusion but also scale changes. Experiments on two large public benchmark datasets demonstrate that our approach gives significantly better results compared with the state-of-the-art trackers.
Chan CH, Yan F, Kittler J, et al., 2015, Full ranking as local descriptor for visual recognition: A comparison of distance metrics on Sn, Pattern Recognition, Vol: 48, Pages: 1324-1332, ISSN: 0031-3203
© 2014 Elsevier Ltd. All rights reserved. In this paper we propose to use the full ranking of a set of pixels as a local descriptor. In contrast to existing methods which use only partial ranking information, the full ranking encodes the complete comparative information among the pixels, while retaining invariance to monotonic photometric transformations. The descriptor is used within the bag-of-visual-words paradigm for visual recognition. We demonstrate that the choice of distance metric for assigning the descriptors to visual words is crucial to the performance, and provide an extensive evaluation of eight distance metrics for the permutation group Sn on four widely used face verification and texture classification benchmarks. The results demonstrate that (1) full ranking of pixels encodes more information than partial ranking, consistently leading to better performance; (2) full ranking descriptor can be trivially made rotation invariant; (3) the proposed descriptor applies to both image intensities and filter responses, and is capable of producing state-of-the-art performance.
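To make the invariance claim in the abstract above concrete, here is a minimal sketch of a full-ranking descriptor for a small set of pixels. The function name and the tie-breaking rule (ties resolved by pixel index) are assumptions for illustration; the paper's sampling pattern and distance metrics are not reproduced here.

```python
def full_ranking(pixels):
    """Return the rank of each pixel within the set (0 = smallest value).

    Ties are broken by pixel index, which is an assumption of this sketch.
    """
    order = sorted(range(len(pixels)), key=lambda i: pixels[i])
    ranks = [0] * len(pixels)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks


patch = [12, 200, 37, 90]
# A monotonically increasing photometric change (gain and offset).
brighter = [2 * p + 10 for p in patch]

# Both calls yield the same permutation, so the descriptor is unchanged.
same = full_ranking(patch) == full_ranking(brighter)
```

Because only the ordering of the values is kept, any strictly increasing intensity transformation leaves the ranking untouched.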
Yan F, Kittler J, Windridge D, et al., 2014, Automatic annotation of tennis games: An integration of audio, vision, and learning, Image and Vision Computing, Vol: 32, Pages: 896-903, ISSN: 0262-8856
Fully automatic annotation of tennis games using broadcast video is a task with great potential but enormous challenges. In this paper we describe our approach to this task, which integrates computer vision, machine listening, and machine learning. For low-level processing, we improve upon our previously proposed state-of-the-art tennis ball tracking algorithm and employ audio signal processing techniques to detect key events and construct features for classifying the events. For high-level analysis, we model event classification as a sequence labelling problem, and investigate four machine learning techniques using simulated event sequences. Finally, we evaluate our proposed approach on three real-world tennis games, and discuss the interplay between audio, vision and learning. To the best of our knowledge, our system is the only one that can annotate tennis games at such a detailed level. © 2014 Elsevier B.V.
Akin O, Mikolajczyk K, 2014, Online learning and detection with part-based, circulant structure, Pages: 4229-4233, ISSN: 1051-4651
© 2014 IEEE. Circulant Structure Kernel (CSK) has recently been introduced as a simple and extremely efficient tracking method. In this paper, we propose an extension of CSK that explicitly addresses the partial occlusion problems from which the original CSK suffers. Our extension is based on a part-based scheme, which improves the robustness and localisation accuracy. Furthermore, we improve the robustness of CSK for long-term tracking by incorporating it into an online learning and detection framework. We provide an extensive comparison to eight recently introduced tracking methods. Our experimental results show that the proposed approach significantly improves the original CSK and provides state-of-the-art results when combined with an online learning approach.
Schubert F, Mikolajczyk K, 2014, Robust registration and filtering for moving object detection in aerial videos, Pages: 2808-2813, ISSN: 1051-4651
© 2014 IEEE. In this paper we present a multi-frame motion detection approach for aerial platforms with a twofold contribution. First, we propose a novel image registration method, which can robustly cope with a large variety of aerial imagery. We show that it can benefit from a hardware-accelerated implementation using graphics cards, allowing processing at a high frame rate. Second, to handle the inaccuracy of the registration and sensor noise that result in false alarms, we present an efficient filtering step to reduce incorrect motion hypotheses that arise from background subtraction. We show that the proposed filtering significantly improves the precision of the motion detection while maintaining high recall. We introduce a new dataset for evaluating aerial surveillance systems, which will be made available for comparison. We evaluate the registration performance in terms of accuracy and speed as well as the filtering in terms of motion detection performance.
Gaur A, Mikolajczyk K, 2014, Ranking images based on aesthetic qualities, Pages: 3410-3415, ISSN: 1051-4651
© 2014 IEEE. We propose a novel approach for learning image representation based on qualitative assessments of visual aesthetics. It relies on a multi-node multi-state model that represents image attributes and their relations. The model is learnt from pairwise image preferences provided by annotators. To demonstrate its effectiveness we apply our approach to fashion image rating, i.e., comparative assessment of aesthetic qualities. Bag-of-features object recognition is used for the classification of visual attributes such as clothing and body shape in an image. The attributes and their relations are then assigned learnt potentials which are used to rate the images. Evaluation of the representation model has demonstrated a high performance rate in ranking fashion images.
Balntas V, Tang L, Mikolajczyk K, 2014, Improving object tracking with voting from false positive detections, Pages: 1928-1933, ISSN: 1051-4651
© 2014 IEEE. Context provides additional information in detection and tracking, and several works have proposed online trained trackers that make use of the context. However, the context is usually considered during tracking as items with motion patterns significantly correlated with the target. We propose a new approach that exploits context in tracking-by-detection and makes use of persistent false positive detections. True detections as well as repeated false positives act as pointers to the location of the target. This is implemented with a generalised Hough voting and incorporated into a state-of-the-art online learning framework. The proposed method presents good performance in both speed and accuracy and it improves the current state-of-the-art results in a challenging benchmark.
Bowden R, Collomosse J, Mikolajczyk K, 2014, Guest Editorial: Tracking, detection and segmentation, International Journal of Computer Vision, Vol: 110, Pages: 1-1, ISSN: 0920-5691
Yan F, Mikolajczyk K, 2014, Leveraging High Level Visual Information for Matching Images and Captions, Asian Conference on Computer Vision
Schubert F, Mikolajczyk K, 2013, Performance evaluation of image filtering for classification and retrieval, Pages: 485-491
Much research effort in the literature is focused on improving feature extraction methods to boost the performance in various computer vision applications. This is mostly achieved by tailoring feature extraction methods to specific tasks. For instance, for the task of object detection often new features are designed that are even more robust to natural variations of a certain object class and yet discriminative enough to achieve high precision. This focus led to a vast amount of different feature extraction methods with more or less consistent performance across different applications. Instead of fine-tuning or re-designing new features to further increase performance we want to motivate the use of image filters for pre-processing. We therefore present a performance evaluation of numerous existing image enhancement techniques which help to increase performance of already well-known feature extraction methods. We investigate the impact of such image enhancement or filtering techniques on two state-of-the-art image classification and retrieval approaches. For classification we evaluate using a standard Pascal VOC dataset. For retrieval we provide a new challenging dataset. We find that gradient-based interest-point detectors and descriptors such as SIFT or HOG can benefit from enhancement methods and lead to improved performance.
Koniusz P, Yan F, Mikolajczyk K, 2013, Comparison of mid-level feature coding approaches and pooling strategies in visual concept detection, Computer Vision and Image Understanding, Vol: 117, Pages: 479-492, ISSN: 1077-3142
Bag-of-Words lies at the heart of modern object category recognition systems. After descriptors are extracted from images, they are expressed as vectors representing visual word content, referred to as mid-level features. In this paper, we review a number of techniques for generating mid-level features, including two variants of Soft Assignment, Locality-constrained Linear Coding, and Sparse Coding. We also isolate the underlying properties that affect their performance. Moreover, we investigate various pooling methods that aggregate mid-level features into vectors representing images. Average pooling, Max-pooling, and a family of likelihood inspired pooling strategies are scrutinised. We demonstrate how both coding schemes and pooling methods interact with each other. We generalise the investigated pooling methods to account for the descriptor interdependence and introduce an intuitive concept of improved pooling. We also propose a coding-related improvement to increase its speed. Lastly, state-of-the-art performance in classification is demonstrated on Caltech101, Flower17, and ImageCLEF11 datasets. © 2012 Elsevier Inc. All rights reserved.
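As a small illustration of the two baseline pooling operators named in the abstract above, the sketch below aggregates per-descriptor mid-level feature vectors into one image signature. The function names and the toy coefficient values are hypothetical; the likelihood-inspired pooling variants studied in the paper are not shown.

```python
def average_pooling(codes):
    """Average each visual-word coefficient over all descriptors of an image."""
    n = len(codes)
    return [sum(column) / n for column in zip(*codes)]


def max_pooling(codes):
    """Keep the largest coefficient per visual word over all descriptors."""
    return [max(column) for column in zip(*codes)]


# Three local descriptors coded over a toy vocabulary of three visual words.
codes = [
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [0.0, 0.25, 0.75],
]
avg_signature = average_pooling(codes)
max_signature = max_pooling(codes)
```

Average pooling reflects how often a visual word is activated across the image, whereas max-pooling records only its strongest activation.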
Schubert F, Mikolajczyk K, 2013, Benchmarking GPU-based phase correlation for homography-based registration of aerial imagery, Pages: 83-90, ISSN: 0302-9743
Many multi-image fusion applications require fast registration methods in order to allow real-time processing. Although the most popular approaches, local-feature-based methods, have proven efficient enough for registering image pairs in real time, some applications like multi-frame background subtraction, super-resolution or high-dynamic-range imaging benefit from even faster algorithms. A common trend to speed up registration is to implement the algorithms on graphics cards (GPUs). However, not all algorithms are especially suited for massive parallelization via GPUs. In this paper we evaluate the speed of a well-known global registration method, i.e. phase correlation, for computing 8-DOF homographies. We propose a benchmark to compare a CPU- and GPU-based implementation using different systems and a dataset of aerial imagery. We demonstrate that phase correlation benefits from GPU-based implementations much more than local methods, significantly increasing the processing speed. © 2013 Springer-Verlag.
In this paper, we present a novel approach to saliency detection. We define a visually salient region with the following two properties: global saliency, i.e. the spatial redundancy, and local saliency, i.e. the region complexity. The former is its probability of occurrence within the image, whereas the latter defines how much information is contained within the region, and it is quantified by the entropy. By combining the global spatial redundancy measure and local entropy, we can achieve a simple, yet robust saliency detector. We evaluate it quantitatively and compare to Itti et al. as well as to the spectral residual approach on publicly available data where it shows a significant improvement. © 2013 Springer-Verlag.
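The local saliency term in the abstract above is quantified by entropy. A minimal sketch of that quantity, assuming the region is given as a flat list of quantised pixel values and that Shannon entropy of their histogram is meant, could look as follows; the function name is hypothetical and the global redundancy term is not modelled.

```python
import math
from collections import Counter


def region_entropy(pixels):
    """Shannon entropy (in bits) of the value histogram of a region."""
    counts = Counter(pixels)
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


flat_region = [5, 5, 5, 5]       # uniform region: carries no information
textured_region = [0, 1, 2, 3]   # four equally likely values: 2 bits
```

Under these assumptions a flat region scores zero and a region with varied content scores higher, which matches the abstract's notion of region complexity.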
Tahir M, Yan F, Koniusz P, et al., 2012, A Robust and Scalable Visual Category and Action Recognition System using Kernel Discriminant Analysis with Spectral Regression, IEEE Transactions on Multimedia, ISSN: 1520-9210
, 2012, British Machine Vision Conference, BMVC 2012, Surrey, UK, September 3-7, 2012, Publisher: BMVA Press
Miksik O, Mikolajczyk K, 2012, Evaluation of local detectors and descriptors for fast feature matching, 21st International Conference on Pattern Recognition, Publisher: IEEE, Pages: 2681-2684, ISSN: 1051-4651
Local feature detectors and descriptors are widely used in many computer vision applications and various methods have been proposed during the past decade. There have been a number of evaluations focused on various aspects of local features, matching accuracy in particular; however, there have been no comparisons considering the accuracy and speed trade-offs of recent extractors such as BRIEF, BRISK, ORB, MRRID, MROGH and LIOP. This paper provides a performance evaluation of recent feature detectors and compares their matching precision and speed in a randomized kd-trees setup, as well as an evaluation of binary descriptors with efficient computation of Hamming distance. © 2012 ICPR Org Committee.
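Binary descriptors of the kind mentioned in the abstract above are typically compared by Hamming distance, i.e. the number of differing bits. The following sketch assumes each descriptor is stored as a `bytes` object of equal length; the function name is hypothetical, and real systems usually rely on hardware popcount instructions rather than this pure-Python loop.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bits that differ between two equal-length binary descriptors."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    # XOR leaves a 1 wherever the bits differ; count those 1s byte by byte.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


d1 = bytes([0b11110000, 0b00000001])
d2 = bytes([0b11111111, 0b00000000])
distance = hamming_distance(d1, d2)
```

The XOR-and-count structure is what makes matching binary descriptors cheap compared with Euclidean distance on floating-point descriptors.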
Kalal Z, Matas J, Mikolajczyk K, 2012, Tracking-Learning-Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol: 34, Pages: 1409-1422, ISSN: 0162-8828
This paper investigates long-term tracking of unknown objects in a video stream. The object is defined by its location and extent in a single frame. In every frame that follows, the task is to determine the object’s location and extent or indicate that the object is not present. We propose a novel tracking framework (TLD) that explicitly decomposes the long-term tracking task into tracking, learning and detection. The tracker follows the object from frame to frame. The detector localizes all appearances that have been observed so far and corrects the tracker if necessary. The learning estimates the detector’s errors and updates it to avoid these errors in the future. We study how to identify the detector’s errors and learn from them. We develop a novel learning method (P-N learning) which estimates the errors by a pair of "experts": (i) P-expert estimates missed detections, and (ii) N-expert estimates false alarms. The learning process is modeled as a discrete dynamical system and the conditions under which the learning guarantees improvement are found. We describe our real-time implementation of the TLD framework and the P-N learning. We carry out an extensive quantitative evaluation which shows a significant improvement over state-of-the-art approaches.
Yan F, Kittler J, Mikolajczyk K, et al., 2011, Non-Sparse Multiple Kernel Fisher Discriminant Analysis, Journal of Machine Learning Research, Vol: 13, Pages: 607-642, ISSN: 1532-4435
Sparsity-inducing multiple kernel Fisher discriminant analysis (MK-FDA) has been studied in the literature. Building on recent advances in non-sparse multiple kernel learning (MKL), we propose a non-sparse version of MK-FDA, which imposes a general ℓp norm regularisation on the kernel weights. We formulate the associated optimisation problem as a semi-infinite program (SIP), and adapt an iterative wrapper algorithm to solve it. We then discuss, in light of latest advances in MKL optimisation techniques, several reformulations and optimisation strategies that can potentially lead to significant improvements in the efficiency and scalability of MK-FDA. We carry out extensive experiments on six datasets from various application areas, and compare closely the performance of ℓp MK-FDA, fixed norm MK-FDA, and several variants of SVM-based MKL (MK-SVM). Our results demonstrate that ℓp MK-FDA improves upon sparse MK-FDA in many practical situations. The results also show that on image categorisation problems, ℓp MK-FDA tends to outperform its SVM counterpart. Finally, we also discuss the connection between (MK-)FDA and (MK-)SVM, under the unified framework of regularised kernel machines.
De Campos T, Barnard M, Mikolajczyk K, et al., 2011, An evaluation of bags-of-words and spatio-temporal shapes for action recognition, Pages: 344-351
Bags-of-visual-Words (BoW) and Spatio-Temporal Shapes (STS) are two very popular approaches for action recognition from video. The former (BoW) is an un-structured global representation of videos which is built using a large set of local features. The latter (STS) uses a single feature located on a region of interest (where the actor is) in the video. Despite the popularity of these methods, no comparison between them has been done. Also, given that BoW and STS differ intrinsically in terms of context inclusion and globality/locality of operation, an appropriate evaluation framework has to be designed carefully. This paper compares these two approaches using four different datasets with varied degree of space-time specificity of the actions and varied relevance of the contextual background. We use the same local feature extraction method and the same classifier for both approaches. Further to BoW and STS, we also evaluated novel variations of BoW constrained in time or space. We observe that the STS approach leads to better results in all datasets whose background is of little relevance to action classification. © 2010 IEEE.
Awais M, Yan F, Mikolajczyk K, et al., 2011, Augmented Kernel Matrix vs Classifier Fusion for Object Recognition, 22nd British Machine Vision Conference, Publisher: BMVA Press, Pages: 60.1-60.11
Augmented Kernel Matrix (AKM) has recently been proposed to account for the fact that a single training example may have different importance in different feature spaces, in contrast to Multiple Kernel Learning (MKL) that assigns the same weight to all examples in one feature space. However, the AKM approach is limited to small datasets due to its memory requirements. An alternative way to fuse information from different feature channels is classifier fusion (ensemble methods). There is a significant amount of work on linear programming formulations of classifier fusion (CF) in the case of binary classification. In this paper we derive the primal and dual of AKM to draw its correspondence with CF. We propose a multiclass extension of binary ν-LPBoost, which learns the contribution of each class in each feature channel. Existing approaches to CF promote sparse feature combinations, due to regularization based on the ℓ1-norm, and lead to a selection of a subset of feature channels, which is undesirable when the channels are all informative. We also generalize existing CF formulations to arbitrary ℓp-norm for binary and multiclass problems, which results in more effective use of complementary information. We carry out an extensive comparison and show that the proposed nonlinear CF schemes outperform their sparse counterparts as well as state-of-the-art MKL approaches. © 2011. The copyright of this document resides with its authors.
Yan F, Mikolajczyk K, Kittler J, 2011, Multiple Kernel Learning via Distance Metric Learning for Interactive Image Retrieval, 10th International Workshop on Multiple Classifier Systems, Publisher: SPRINGER-VERLAG BERLIN, Pages: 147-156, ISSN: 0302-9743
Cai H, Mikolajczyk K, Matas J, 2011, Learning linear discriminant projections for dimensionality reduction of image descriptors, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol: 33, Pages: 338-352, ISSN: 0162-8828
In this paper, we present Linear Discriminant Projections (LDP) for reducing dimensionality and improving discriminability of local image descriptors. We place LDP into the context of state-of-the-art discriminant projections and analyze its properties. LDP requires a large set of training data with point-to-point correspondence ground truth. We demonstrate that training data produced by a simulation of image transformations leads to nearly the same results as the real data with correspondence ground truth. This makes it possible to apply LDP as well as other discriminant projection approaches to the problems where the correspondence ground truth is not available, such as image categorization. We perform an extensive experimental evaluation on standard data sets in the context of image matching and categorization. We demonstrate that LDP enables significant dimensionality reduction of local descriptors and performance increases in different applications. The results improve upon the state-of-the-art recognition performance with simultaneous dimensionality reduction from 128 to 30.
Mikolajczyk K, Uemura H, 2011, Action recognition with appearance-motion features and fast search trees, Computer Vision and Image Understanding, Vol: 115, Pages: 426-438, ISSN: 1090-235X
Awais M, Yan F, Mikolajczyk K, et al., 2011, Two-stage augmented kernel matrix for object recognition, MCS 2011: 10th International Workshop on Multiple Classifier Systems, Publisher: Springer, Pages: 137-146, ISSN: 0302-9743
Multiple Kernel Learning (MKL) has become a preferred choice for information fusion in image recognition problems. The aim of MKL is to learn an optimal combination of kernels formed from different features, and thus to learn the importance of different feature spaces for classification. Augmented Kernel Matrix (AKM) has recently been proposed to account for the fact that a single training example may have different importance in different feature spaces, in contrast to MKL, which assigns the same weight to all examples in one feature space. However, the AKM approach is limited to small datasets due to its memory requirements. We propose a novel two-stage technique to make AKM applicable to large data problems. In the first stage, various kernels are combined into different groups automatically using kernel alignment. Next, the most influential training examples are identified within each group and used to construct an AKM of significantly reduced size. This reduced-size AKM leads to the same results as the original AKM. We demonstrate that the proposed two-stage approach is memory efficient, leads to better performance than the original AKM, and is robust to noise. Results are compared with other state-of-the-art MKL techniques, and show improvement on challenging object recognition benchmarks.
Awais M, Yan F, Mikolajczyk K, et al., 2011, Novel fusion methods for pattern recognition, ECML PKDD 2011: Machine Learning and Knowledge Discovery in Databases, Publisher: Springer, Pages: 140-155, ISSN: 0302-9743
Over the last few years, several approaches have been proposed for information fusion including different variants of classifier level fusion (ensemble methods), stacking and multiple kernel learning (MKL). MKL has become a preferred choice for information fusion in object recognition. However, in the case of highly discriminative and complementary feature channels, it does not significantly improve upon its trivial baseline which averages the kernels. Alternative ways are stacking and classifier level fusion (CLF), which rely on a two-phase approach. There is a significant amount of work on linear programming formulations of ensemble methods, particularly in the case of binary classification. In this paper we propose a multiclass extension of binary ν-LPBoost, which learns the contribution of each class in each feature channel. The existing approaches to classifier fusion promote sparse feature combinations, due to regularization based on the ℓ1-norm, and lead to a selection of a subset of feature channels, which is undesirable when the channels are all informative. Therefore, we generalize existing classifier fusion formulations to arbitrary ℓp-norm for binary and multiclass problems, which results in more effective use of complementary information. We also extend stacking for both binary and multiclass datasets. We present an extensive evaluation of the fusion methods on four datasets involving kernels that are all informative and achieve state-of-the-art results on all of them.