We present large scale facial model (LSFM)—a 3D Morphable Model (3DMM) automatically constructed from 9663 distinct facial identities. To the best of our knowledge LSFM is the largest-scale Morphable Model ever constructed, containing statistical information from a huge variety of the human population. To build such a large model we introduce a novel fully automated and robust Morphable Model construction pipeline, informed by an evaluation of state-of-the-art dense correspondence techniques. The dataset that LSFM is trained on includes rich demographic information about each subject, allowing for the construction of not only a global 3DMM model but also models tailored for specific age, gender or ethnicity groups. We utilize the proposed model to perform age classification from 3D shape alone and to reconstruct noisy out-of-sample data in the low-dimensional model space. Furthermore, we perform a systematic analysis of the constructed 3DMM models that showcases their quality and descriptive power. The presented extensive qualitative and quantitative evaluations reveal that the proposed 3DMM achieves state-of-the-art results, outperforming existing models by a large margin. Finally, for the benefit of the research community, we make publicly available the source code of the proposed automatic 3DMM construction pipeline, as well as the constructed global 3DMM and a variety of bespoke models tailored by age, gender and ethnicity.
Fabris A, Nicolaou MA, Kotsia I, et al., 2017, Dynamic probabilistic linear discriminant analysis for video classification, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Publisher: IEEE, Pages: 2781-2785, ISSN: 1520-6149
Guler R, Trigeorgis G, Antonakos E, et al., 2017, DenseReg: fully convolutional dense shape regression in-the-wild, 2017 IEEE International Conference on Computer Vision and Pattern Recognition, Publisher: IEEE
In this paper we propose to learn a mapping from image pixels into a dense template grid through a fully convolutional network. We formulate this task as a regression problem and train our network by leveraging upon manually annotated facial landmarks “in-the-wild”. We use such landmarks to establish a dense correspondence field between a three-dimensional object template and the input image, which then serves as the ground-truth for training our regression system. We show that we can combine ideas from semantic segmentation with regression networks, yielding a highly-accurate ‘quantized regression’ architecture. Our system, called DenseReg, allows us to estimate dense image-to-template correspondences in a fully convolutional manner. As such our network can provide useful correspondence information as a stand-alone system, while when used as an initialization for Statistical Deformable Models we obtain landmark localization results that largely outperform the current state-of-the-art on the challenging 300W benchmark. We thoroughly evaluate our method on a host of facial analysis tasks, and demonstrate its use for other correspondence estimation tasks, such as the human body and the human ear. DenseReg code is made available at http://alpguler.com/DenseReg.html along with supplementary materials.
Trigeorgis G, Snape P, Kokkinos I, et al., 2017, Face Normals “in-the-wild” using Fully Convolutional Networks, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Publisher: IEEE, ISSN: 2160-7508
Chrysos GG, Antonakos E, Snape P, et al., 2017, A comprehensive performance evaluation of deformable face tracking “In-the-Wild”, International Journal of Computer Vision, Vol: 126, Pages: 198-232, ISSN: 0920-5691
Recently, technologies such as face detection, facial landmark localisation and face recognition and verification have matured enough to provide effective and efficient solutions for imagery captured under arbitrary conditions (referred to as “in-the-wild”). This is partially attributed to the fact that comprehensive “in-the-wild” benchmarks have been developed for face detection, landmark localisation and recognition/verification. A very important technology that has not been thoroughly evaluated yet is deformable face tracking “in-the-wild”. Until now, the performance has mainly been assessed qualitatively by visually assessing the result of a deformable face tracking technology on short videos. In this paper, we perform the first, to the best of our knowledge, thorough evaluation of state-of-the-art deformable face tracking pipelines using the recently introduced 300 VW benchmark. We evaluate many different architectures focusing mainly on the task of on-line deformable face tracking. In particular, we compare the following general strategies: (a) generic face detection plus generic facial landmark localisation, (b) generic model free tracking plus generic facial landmark localisation, as well as (c) hybrid approaches using state-of-the-art face detection, model free tracking and facial landmark localisation technologies. Our evaluation reveals future avenues for further research on the topic.
Valstar M, Zafeiriou S, Pantic M, 2017, Facial actions as social signals, Social Signal Processing, Pages: 123-154, ISBN: 9781107161269
© Judee K. Burgoon, Nadia Magnenat-Thalmann, Maja Pantic and Alessandro Vinciarelli 2017. According to a recent survey on social signal processing (Vinciarelli, Pantic, & Bourlard, 2009), next-generation computing needs to implement the essence of social intelligence including the ability to recognize human social signals and social behaviors, such as turn taking, politeness, and disagreement, in order to become more effective and more efficient. Social signals and social behaviors are the expression of one’s attitude towards social situation and interplay, and they are manifested through a multiplicity of nonverbal behavioral cues, including facial expressions, body postures and gestures, and vocal outbursts like laughter. Of the many social signals, only face, eye, and posture cues are capable of informing us about all identified social behaviors. During social interaction, it is a social norm that one looks their dyadic partner in the eyes, clearly focusing one’s vision on the face. Facial expressions thus make for very powerful social signals. As one of the most comprehensive and objective ways to describe facial expressions, the facial action coding system (FACS) has recently received significant attention. Automating FACS coding would greatly benefit social signal processing, opening up new avenues to understanding how we communicate through facial expressions. In this chapter we provide a comprehensive overview of research into machine analysis of facial actions. We systematically review all components of such systems: pre-processing, feature extraction, and machine coding of facial actions. In addition, the existing FACS-coded facial expression databases are summarized. Finally, challenges that have to be addressed to make automatic facial action analysis applicable in real-life situations are extensively discussed. Introduction Scientific work on facial expressions can be traced back to at least 1872 when Charles Darwin published The Ex
Georgakis C, Panagakis Y, Zafeiriou S, et al., 2016, The Conflict Escalation Resolution (CONFER) Database, Image and Vision Computing, Vol: 65, Pages: 37-48, ISSN: 0262-8856
Conflict is usually defined as a high level of disagreement taking place when individuals act on incompatible goals, interests, or intentions. Research in human sciences has recognized conflict as one of the main dimensions along which an interaction is perceived and assessed. Hence, automatic estimation of conflict intensity in naturalistic conversations would be a valuable tool for the advancement of human-centered computing and the deployment of novel applications for social skills enhancement including conflict management and negotiation. However, machine analysis of conflict is still limited to just a few works, partially due to an overall lack of suitable annotated data, while it has been mostly approached as a conflict or (dis)agreement detection problem based on audio features only. In this work, we aim to overcome the aforementioned limitations by a) presenting the Conflict Escalation Resolution (CONFER) Database, a set of excerpts from audio-visual recordings of televised political debates where conflicts naturally arise, and b) reporting baseline experiments on audio-visual conflict intensity estimation. The database contains approximately 142 min of recordings in the Greek language, split over 120 non-overlapping episodes of naturalistic conversations that involve two or three interactants. Subject- and session-independent experiments are conducted on continuous-time (frame-by-frame) estimation of real-valued conflict intensity, as opposed to binary conflict/non-conflict classification. For the problem at hand, the efficiency of various audio and visual features and their fusion, as well as of various regression frameworks, is examined. Experimental results suggest that there is much room for improvement in the design and development of automated multi-modal approaches to continuous conflict analysis. The CONFER Database is publicly available for non-commercial use at http://ibug.doc.ic.ac.uk/resources/confer/.
Zafeiriou S, Papaioannou A, Kotsia I, et al., 2016, Facial Affect "in-the-wild": A survey and a new database, Computer Vision and Pattern Recognition Workshops (CVPRW), Publisher: IEEE, Pages: 1487-1498, ISSN: 2160-7508
Well-established databases and benchmarks have been developed in the past 20 years for automatic facial behaviour analysis. Nevertheless, for some important problems regarding analysis of facial behaviour, such as (a) estimation of affect in a continuous dimensional space (e.g., valence and arousal) in videos displaying spontaneous facial behaviour and (b) detection of the activated facial muscles (i.e., facial action unit detection), to the best of our knowledge, well-established in-the-wild databases and benchmarks do not exist. That is, the majority of the publicly available corpora for the above tasks contain samples that have been captured in controlled recording conditions and/or captured under a very specific milieu. Arguably, in order to make further progress in automatic understanding of facial behaviour, datasets that have been captured in-the-wild and in various milieus have to be developed. In this paper, we survey the progress that has been recently made on understanding facial behaviour in-the-wild, the datasets that have been developed so far and the methodologies that have been developed, paying particular attention to deep learning techniques for the task. Finally, we make a significant step further and propose a new comprehensive benchmark for training methodologies, as well as assessing the performance of facial affect/behaviour analysis/understanding in-the-wild. To the best of our knowledge, this is the first time that such a benchmark for valence and arousal "in-the-wild" is presented.
Machine learning algorithms for the analysis of time-series often depend on the assumption that the utilised data are temporally aligned. Any temporal discrepancies arising in the data are certain to lead to ill-generalisable models, which in turn fail to correctly capture the properties of the task at hand. The temporal alignment of time-series is thus a crucial challenge manifesting in a multitude of applications. Nevertheless, the vast majority of algorithms oriented towards the temporal alignment of time-series are applied directly on the observation space, or utilise simple linear projections. Thus, they fail to capture complex, hierarchical non-linear representations which may prove to be beneficial towards the task of temporal alignment, particularly when dealing with multi-modal data (e.g., aligning visual and acoustic information). To this end, we present the Deep Canonical Time Warping (DCTW), a method which automatically learns complex non-linear representations of multiple time-series, generated such that (i) they are highly correlated, and (ii) temporally in alignment. By means of experiments on four real datasets, we show that the representations learnt via the proposed DCTW significantly outperform state-of-the-art methods in temporal alignment, elegantly handling scenarios with highly heterogeneous features, such as the temporal alignment of acoustic and visual features.
Cheng S, Marras I, Zafeiriou S, et al., 2016, Statistical non-rigid ICP algorithm and its application to 3D face alignment, Image and Vision Computing, Vol: 58, Pages: 3-12, ISSN: 0262-8856
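As background for the abstract above, classic dynamic time warping (DTW) is one well-known example of temporal alignment applied directly on the observation space. The following is a minimal sketch of plain DTW for two scalar sequences. It is not the DCTW method, and the function name `dtw` and the absolute-difference cost are illustrative assumptions.

```python
def dtw(a, b):
    """Classic dynamic time warping between two scalar sequences.

    Returns the accumulated alignment cost and the warping path as a
    list of (index_in_a, index_in_b) pairs.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local cost (assumed: absolute difference)
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end to recover one optimal warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # advance both sequences
            (cost[i - 1][j], i - 1, j),          # advance only a
            (cost[i][j - 1], i, j - 1),          # advance only b
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


# Usage: b is a time-stretched copy of a, so the alignment cost is zero.
distance, path = dtw([0, 1, 2, 1, 0], [0, 0, 1, 2, 2, 1, 0])
```

Because the local cost is computed on raw values, this sketch cannot relate sequences from different modalities, which is the limitation the abstract says learned non-linear representations are meant to address.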
Hovhannisyan V, Parpas P, Zafeiriou S, 2016, MAGMA: Multi-level accelerated gradient mirror descent algorithm for large-scale convex composite minimization, SIAM Journal on Imaging Sciences, Vol: 9, Pages: 1829-1857, ISSN: 1936-4954
Composite convex optimization models arise in several applications, and are especially prevalent in inverse problems with a sparsity inducing norm and in general convex optimization with simple constraints. The most widely used algorithms for convex composite models are accelerated first order methods, however they can take a large number of iterations to compute an acceptable solution for large-scale problems. In this paper we propose to speed up first order methods by taking advantage of the structure present in many applications and in image processing in particular. Our method is based on multi-level optimization methods and exploits the fact that many applications that give rise to large scale models can be modelled using varying degrees of fidelity. We use Nesterov’s acceleration techniques together with the multi-level approach to achieve an O(1/√ε) convergence rate, where ε denotes the desired accuracy. The proposed method has a better convergence rate than any other existing multi-level method for convex problems, and in addition has the same rate as accelerated methods, which is known to be optimal for first-order methods. Moreover, as our numerical experiments show, on large-scale face recognition problems our algorithm is several times faster than the state of the art.
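To make the quoted rate concrete, the short derivation below shows how the standard error bound for accelerated first-order methods (for example FISTA) translates into an iteration count. The symbols F (composite objective), L (Lipschitz constant of the smooth part's gradient), x_0 (starting point) and x* (a minimiser) are assumptions introduced here for illustration. The constants are those of the generic accelerated bound, not of MAGMA itself.

```latex
% Generic accelerated (Nesterov-type) bound after k iterations:
F(x_k) - F(x^\ast) \;\le\; \frac{2L\,\lVert x_0 - x^\ast \rVert^2}{(k+1)^2}
% Requiring the right-hand side to be at most \epsilon gives
k + 1 \;\ge\; \lVert x_0 - x^\ast \rVert \sqrt{\frac{2L}{\epsilon}}
\quad\Longrightarrow\quad k = O\!\left(1/\sqrt{\epsilon}\right).
% For comparison, the unaccelerated proximal gradient bound
F(x_k) - F(x^\ast) \;\le\; \frac{L\,\lVert x_0 - x^\ast \rVert^2}{2k}
% requires k = O(1/\epsilon) iterations for the same accuracy.
```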
Tsinalis O, Matthews PM, Guo Y, et al., 2016, Automatic sleep stage scoring with single-channel EEG using convolutional neural networks, Publisher: Arxiv
We used convolutional neural networks (CNNs) for automatic sleep stage scoring based on single-channel electroencephalography (EEG) to learn task-specific filters for classification without using prior domain knowledge. We used an openly available dataset from 20 healthy young adults for evaluation and applied 20-fold cross-validation. We used class balanced random sampling within the stochastic gradient descent (SGD) optimization of the CNN to avoid skewed performance in favor of the most represented sleep stages. We achieved high mean F1-score (81%, range 79-83%), mean accuracy across individual sleep stages (82%, range 80-84%) and overall accuracy (74%, range 71-76%) over all subjects. By analyzing and visualizing the filters that our CNN learns, we found that rules learned by the filters correspond to sleep scoring criteria in the American Academy of Sleep Medicine (AASM) manual that human experts follow. Our method's performance is balanced across classes and our results are comparable to state-of-the-art methods with hand-engineered features. We show that, without using prior domain knowledge, a CNN can automatically learn to distinguish among different normal sleep stages.
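The class-balanced random sampling mentioned above can be sketched as follows. This is one plausible reading of the idea (pick a class uniformly, then a sample within that class), not the authors' implementation. The function name `balanced_batch` and the stage labels in the usage lines are hypothetical.

```python
import random
from collections import Counter, defaultdict


def balanced_batch(labels, batch_size, rng):
    """Return sample indices drawn so that each class is equally likely.

    A class is picked uniformly at random first, then one sample is
    picked uniformly from that class (with replacement), so frequent
    classes do not dominate the mini-batch.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    classes = sorted(by_class)
    batch = []
    for _ in range(batch_size):
        chosen_class = rng.choice(classes)
        batch.append(rng.choice(by_class[chosen_class]))
    return batch


# Usage with a deliberately imbalanced (hypothetical) label distribution.
labels = ["W"] * 10 + ["N1"] * 2 + ["N2"] * 50 + ["N3"] * 8 + ["REM"] * 30
batch = balanced_batch(labels, 5000, random.Random(0))
counts = Counter(labels[i] for i in batch)
```

Sampling with replacement means rare classes are repeated more often than common ones, which is the trade-off this kind of balancing accepts.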
Kampouris C, Zafeiriou S, Ghosh A, et al., 2016, Fine-grained Material Classification using Micro-geometry and Reflectance, European Conference on Computer Vision 2016, Publisher: Springer, Pages: 778-792, ISSN: 0302-9743
In this paper we focus on an understudied computer vision problem, particularly how the micro-geometry and the reflectance of a surface can be used to infer its material. To this end, we introduce a new, publicly available database for fine-grained material classification, consisting of over 2000 surfaces of fabrics (http://ibug.doc.ic.ac.uk/resources/fabrics.). The database has been collected using a custom-made portable but cheap and easy to assemble photometric stereo sensor. We use the normal map and the albedo of each surface to recognize its material via the use of handcrafted and learned features and various feature encodings. We also perform garment classification using the same approach. We show that the fusion of normals and albedo information outperforms standard methods which rely only on the use of texture information. Our methodologies, both for data collection and for material classification, can be applied easily to many real-world scenarios including design of new robots able to sense materials and industrial inspection.
The two predominant families of deformable models for the task of face alignment are: (i) discriminative cascaded regression models, and (ii) generative models optimised with Gauss-Newton. Although these approaches have been found to work well in practice, they each suffer from convergence issues. Cascaded regression has no theoretical guarantee of convergence to a local minimum and thus may fail to recover the fine details of the object. Gauss-Newton optimisation is not robust to initialisations that are far from the optimal solution. In this paper, we propose the first, to the best of our knowledge, attempt to combine the best of these two worlds under a unified model and report state-of-the-art performance on the most recent facial benchmark challenge.
Sagonas C, Panagakis Y, Zafeiriou S, et al., 2016, Robust statistical frontalization of human and animal faces, International Journal of Computer Vision, Vol: 122, Pages: 270-291, ISSN: 0920-5691
The unconstrained acquisition of facial data in real-world conditions may result in face images with significant pose variations, illumination changes, and occlusions, affecting the performance of facial landmark localization and recognition methods. In this paper, a novel method, robust to pose, illumination variations, and occlusions is proposed for joint face frontalization and landmark localization. Unlike the state-of-the-art methods for landmark localization and pose correction, where large amounts of manually annotated images or 3D facial models are required, the proposed method relies on a small set of frontal images only. By observing that the frontal facial image of both humans and animals is the one having the minimum rank of all different poses, a model which is able to jointly recover the frontalized version of the face as well as the facial landmarks is devised. To this end, a suitable optimization problem is solved, concerning minimization of the nuclear norm (convex surrogate of the rank function) and the matrix ℓ1 norm accounting for occlusions. The proposed method is assessed in frontal view reconstruction of human and animal faces, landmark localization, pose-invariant face recognition, face verification in unconstrained conditions, and video inpainting by conducting experiments on 9 databases. The experimental results demonstrate the effectiveness of the proposed method in comparison to the state-of-the-art methods for the target problems.
Alabort-i-Medina J, Zafeiriou S, 2016, A unified framework for compositional fitting of active appearance models, International Journal of Computer Vision, Vol: 121, Pages: 26-64, ISSN: 1573-1405
Active Appearance Models (AAMs) are one of the most popular and well-established techniques for modeling deformable objects in computer vision. In this paper, we study the problem of fitting AAMs using Compositional Gradient Descent (CGD) algorithms. We present a unified and complete view of these algorithms and classify them with respect to three main characteristics: i) cost function; ii) type of composition; and iii) optimization method. Furthermore, we extend the previous view by: a) proposing a novel Bayesian cost function that can be interpreted as a general probabilistic formulation of the well-known project-out loss; b) introducing two new types of composition, asymmetric and bidirectional, that combine the gradients of both image and appearance model to derive better convergent and more robust CGD algorithms; and c) providing new valuable insights into existent CGD algorithms by reinterpreting them as direct applications of the Schur complement and the Wiberg method. Finally, in order to encourage open research and facilitate future comparisons with our work, we make the implementation of the algorithms studied in this paper publicly available as part of the Menpo Project.
Snape P, Pszczolkowski S, Zafeiriou S, et al., 2016, A robust similarity measure for volumetric image registration with outliers, Image and Vision Computing, Vol: 52, Pages: 97-113, ISSN: 0262-8856
Image registration under challenging realistic conditions is a very important area of research. In this paper, we focus on algorithms that seek to densely align two volumetric images according to a global similarity measure. Despite intensive research in this area, there is still a need for similarity measures that are robust to outliers common to many different types of images. For example, medical image data is often corrupted by intensity inhomogeneities and may contain outliers in the form of pathologies. In this paper we propose a global similarity measure that is robust to both intensity inhomogeneities and outliers without requiring prior knowledge of the type of outliers. We combine the normalised gradients of images with the cosine function and show that it is theoretically robust against a very general class of outliers. Experimentally, we verify the robustness of our measures within two distinct algorithms. Firstly, we embed our similarity measures within a proof-of-concept extension of the Lucas-Kanade algorithm for volumetric data. Finally, we embed our measures within a popular non-rigid alignment framework based on free-form deformations and show it to be robust against both simulated tumours and intensity inhomogeneities.
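The idea of combining normalised gradients with the cosine function can be illustrated with a deliberately simplified 2D sketch. The paper addresses volumetric images and specific registration frameworks, so the code below is only an assumption-laden toy. The function names, the central-difference gradients and the averaging over pixels are all illustrative choices, not the authors' formulation.

```python
import math


def normalised_gradients(image):
    """Unit-length gradient vectors at interior pixels of a 2D image.

    `image` is a list of rows. Pixels with (near-)zero gradient are
    skipped because their orientation is undefined.
    """
    height, width = len(image), len(image[0])
    grads = {}
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = (image[y][x + 1] - image[y][x - 1]) / 2.0  # central difference in x
            gy = (image[y + 1][x] - image[y - 1][x]) / 2.0  # central difference in y
            magnitude = math.hypot(gx, gy)
            if magnitude > 1e-12:
                grads[(y, x)] = (gx / magnitude, gy / magnitude)
    return grads


def gradient_cosine_similarity(image_a, image_b):
    """Mean cosine of the angle between gradient orientations.

    The dot product of two unit vectors equals the cosine of the angle
    between them; the result lies in [-1, 1].
    """
    ga = normalised_gradients(image_a)
    gb = normalised_gradients(image_b)
    common = ga.keys() & gb.keys()
    if not common:
        return 0.0
    total = sum(ga[p][0] * gb[p][0] + ga[p][1] * gb[p][1] for p in common)
    return total / len(common)


# Usage: a positive gain and an offset leave the score unchanged.
ramp = [[x + 2 * y for x in range(5)] for y in range(5)]
rescaled = [[3 * v + 5 for v in row] for row in ramp]
score = gradient_cosine_similarity(ramp, rescaled)
```

Since only gradient directions enter the score, a global positive gain or offset in intensity has no effect on it, which gives some intuition for why orientation-based measures tolerate intensity changes.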
Trigeorgis G, Ringeval F, Brückner R, et al., 2016, Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network, 41st IEEE International Conference on Acoustics, Speech, and Signal Processing, Publisher: IEEE
The automatic recognition of spontaneous emotions from speech is a challenging task. On the one hand, acoustic features need to be robust enough to capture the emotional content for various styles of speaking, while on the other, machine learning algorithms need to be insensitive to outliers while being able to model the context. Whereas the latter has been tackled by the use of Long Short-Term Memory (LSTM) networks, the former is still under very active investigation, even though more than a decade of research has provided a large set of acoustic descriptors. In this paper, we propose a solution to the problem of `context-aware' emotionally relevant feature extraction, by combining Convolutional Neural Networks (CNNs) with LSTM networks, in order to automatically learn the best representation of the speech signal directly from the raw time representation. In this novel work on the so-called end-to-end speech emotion recognition, we show that the use of the proposed topology significantly outperforms the traditional approaches based on signal processing techniques for the prediction of spontaneous and natural emotions on the RECOLA database.
Zafeiriou S, Zhao G, Pietikainen M, et al., 2016, Editorial of special issue on spontaneous facial behaviour analysis, Computer Vision and Image Understanding, Vol: 147, Pages: 50-51, ISSN: 1090-235X
Trigeorgis G, Bousmalis K, Zafeiriou S, et al., 2016, A deep matrix factorization method for learning attribute representations, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol: 39, Pages: 417-429, ISSN: 0162-8828
Semi-Non-negative Matrix Factorization is a technique that learns a low-dimensional representation of a dataset that lends itself to a clustering interpretation. It is possible that the mapping between this new representation and our original data matrix contains rather complex hierarchical information with implicit lower-level hidden attributes that classical one-level clustering methodologies cannot interpret. In this work we propose a novel model, Deep Semi-NMF, that is able to learn such hidden representations, which lend themselves to a clustering interpretation according to different, unknown attributes of a given dataset. We also present a semi-supervised version of the algorithm, named Deep WSF, that allows the use of (partial) prior information for each of the known attributes of a dataset, so that the model can be used on datasets with mixed attribute knowledge. Finally, we show that our models are able to learn low-dimensional representations that are better suited not only for clustering but also for classification, outperforming Semi-Non-negative Matrix Factorization as well as other state-of-the-art methodologies and their variants.
Zafeiriou L, Antonakos E, Zafeiriou S, et al., 2016, Joint Unsupervised Deformable Spatio-Temporal Alignment of Sequences, International Conference on Computer Vision and Pattern Recognition, Publisher: Computer Vision Foundation (CVF)
Typically, the problems of spatial and temporal alignment of sequences are considered disjoint. That is, in order to align two sequences, a methodology that (non)-rigidly aligns the images is first applied, followed by temporal alignment of the obtained aligned images. In this paper, we propose the first, to the best of our knowledge, methodology that can jointly spatio-temporally align two sequences, which display highly deformable texture-varying objects. We show that by treating the problems of deformable spatial and temporal alignment jointly, we achieve better results than considering the problems independently. Furthermore, we show that deformable spatio-temporal alignment of faces can be performed in an unsupervised manner (i.e., without employing face trackers or building person-specific deformable models).
Trigeorgis G, Snape P, Nicolaou MA, et al., 2016, Mnemonic Descent Method: A recurrent process applied for end-to-end face alignment, International Conference on Computer Vision and Pattern Recognition, Publisher: Computer Vision Foundation (CVF)
Cascaded regression has recently become the method of choice for solving non-linear least squares problems such as deformable image alignment. Given a sizeable training set, cascaded regression learns a set of generic rules that are sequentially applied to minimise the least squares problem. Despite the success of cascaded regression for problems such as face alignment and head pose estimation, there are several shortcomings arising in the strategies proposed thus far. Specifically, (a) the regressors are learnt independently, (b) the descent directions may cancel one another out and (c) handcrafted features (e.g., HoGs, SIFT etc.) are mainly used to drive the cascade, which may be sub-optimal for the task at hand. In this paper, we propose a combined and jointly trained convolutional recurrent neural network architecture that allows the training of an end-to-end system that attempts to alleviate the aforementioned drawbacks. The recurrent module facilitates the joint optimisation of the regressors by assuming the cascades form a nonlinear dynamical system, in effect fully utilising the information between all cascade levels by introducing a memory unit that shares information across all levels. The convolutional module allows the network to extract features that are specialised for the task at hand and are experimentally shown to outperform hand-crafted features. We show that the application of the proposed architecture for the problem of face alignment results in a strong improvement over the current state-of-the-art.
Booth J, Roussos A, Zafeiriou S, et al., 2016, A 3D Morphable Model learnt from 10,000 faces, International Conference on Computer Vision and Pattern Recognition, Publisher: Computer Vision Foundation (CVF)
We present Large Scale Facial Model (LSFM), a 3D Morphable Model (3DMM) automatically constructed from 9,663 distinct facial identities. To the best of our knowledge LSFM is the largest-scale Morphable Model ever constructed, containing statistical information from a huge variety of the human population. To build such a large model we introduce a novel fully automated and robust Morphable Model construction pipeline. The dataset that LSFM is trained on includes rich demographic information about each subject, allowing for the construction of not only a global 3DMM but also models tailored for specific age, gender or ethnicity groups. As an application example, we utilise the proposed model to perform age classification from 3D shape alone. Furthermore, we perform a systematic analysis of the constructed 3DMMs that showcases their quality and descriptive power. The presented extensive qualitative and quantitative evaluations reveal that the proposed 3DMM achieves state-of-the-art results, outperforming existing models by a large margin. Finally, for the benefit of the research community, we make publicly available the source code of the proposed automatic 3DMM construction pipeline. In addition, the constructed global 3DMM and a variety of bespoke models tailored by age, gender and ethnicity are available on application to researchers involved in medically oriented research.
Zhou Y, Antonakos E, Alabort-i-Medina J, et al., 2016, Estimating Correspondences of Deformable Objects “In-the-wild”, International Conference on Computer Vision and Pattern Recognition, Publisher: Computer Vision Foundation (CVF)
During the past few years we have witnessed the development of many methodologies for building and fitting Statistical Deformable Models (SDMs). The construction of accurate SDMs requires careful annotation of images with regards to a consistent set of landmarks. However, the manual annotation of a large amount of images is a tedious, laborious and expensive procedure. Furthermore, for several deformable objects, e.g. human body, it is difficult to define a consistent set of landmarks, and, thus, it becomes impossible to train humans in order to accurately annotate a collection of images. Nevertheless, for the majority of objects, it is possible to extract the shape by object segmentation or even by shape drawing. In this paper, we show for the first time, to the best of our knowledge, that it is possible to construct SDMs by putting object shapes in dense correspondence. Such SDMs can be built with much less effort for a large battery of objects. Additionally, we show that, by sampling the dense model, a part-based SDM can be learned with its parts being in correspondence. We employ our framework to develop SDMs of human arms and legs, which can be used for the segmentation of the outline of the human body, as well as to provide better and more consistent annotations for body joints.
Zafeiriou S, Tzimiropoulos G, Pantic M, 2016, 300 W: Special issue on facial landmark localisation "in-the-wild" Preface, Image and Vision Computing, Vol: 47, Pages: 1-2, ISSN: 0262-8856
Sagonas C, Antonakos E, Tzimiropoulos G, et al., 2016, 300 faces In-the-wild challenge: database and results, Image and Vision Computing, Vol: 47, Pages: 3-18, ISSN: 0262-8856
Computer Vision has recently witnessed great research advances towards automatic facial points detection. Numerous methodologies have been proposed during the last few years that achieve accurate and efficient performance. However, fair comparison between these methodologies is infeasible mainly due to two issues. (a) Most existing databases, captured under both constrained and unconstrained (in-the-wild) conditions, have been annotated using different mark-ups and, in most cases, the accuracy of the annotations is low. (b) Most published works report experimental results using different training/testing sets, different error metrics and, of course, landmark points with semantically different locations. In this paper, we aim to overcome the aforementioned problems by (a) proposing a semi-automatic annotation technique that was employed to re-annotate most existing facial databases under a unified protocol, and (b) presenting the 300 Faces In-The-Wild Challenge (300-W), the first facial landmark localization challenge that was organized twice, in 2013 and 2015. To the best of our knowledge, this is the first effort towards a unified annotation scheme of massive databases and a fair experimental comparison of existing facial landmark localization systems. The images and annotations of the new testing database that was used in the 300-W challenge are available from http://ibug.doc.ic.ac.uk/resources/300-W_IMAVIS/.
Hong X, Zhao G, Zafeiriou S, et al., 2015, Capturing correlations of local features for image representation, Neurocomputing, Vol: 184, Pages: 99-106, ISSN: 1872-8286
Local descriptors are popular ways to characterize the local properties of images in various computer vision based tasks. To form the global descriptors for image regions, first-order feature pooling is widely used. However, as the first-order pooling technique treats each dimension of local features separately, the pairwise correlations of local features are usually ignored. Encouraged by the success of recently developed second-order pooling techniques, in this paper we formulate a general second-order pooling framework and explore several analogues of the second-order average and max operations. We comprehensively investigate a variety of moments which are central to the second-order pooling technique. As a result, the superiority of the second-order standardized moment average pooling (2Standmap) is suggested. We successfully apply 2Standmap to four challenging tasks, namely texture classification, medical image analysis, pain expression recognition, and micro-expression recognition. The results illustrate the effectiveness of 2Standmap in capturing multiple cues and its generalization to both static images and spatial-temporal sequences.
Chrysos G, Antonakos E, Zafeiriou S, et al., 2015, Offline Deformable Face Tracking in Arbitrary Videos, 2015 IEEE International Conference on Computer Vision Workshops, Publisher: IEEE, Pages: 954-962
Generic face detection and facial landmark localization in static imagery are among the most mature and well-studied problems in machine learning and computer vision. Currently, the top performing face detectors achieve a true positive rate of around 75-80% whilst maintaining low false positive rates. Furthermore, the top performing facial landmark localization algorithms obtain low point-to-point errors for more than 70% of commonly benchmarked images captured under unconstrained conditions. The task of facial landmark tracking in videos, however, has attracted much less attention. Generally, a tracking-by-detection framework is applied, where face detection and landmark localization are employed in every frame in order to avoid drifting. Thus, this solution is equivalent to landmark detection in static imagery. Empirically, a straightforward application of such a framework cannot achieve higher performance, on average, than the one reported for static imagery. In this paper, we show for the first time, to the best of our knowledge, that the results of generic face detection and landmark localization can be used to recursively train powerful and accurate person-specific face detectors and landmark localization methods for offline deformable tracking. The proposed pipeline can track landmarks in very challenging long-term sequences captured under arbitrary conditions. The pipeline was used as a semi-automatic tool to annotate the majority of the videos of the 300-VW Challenge.
Sagonas C, Panagakis Y, Zafeiriou S, et al., 2015, Robust statistical face frontalization, 2015 IEEE International Conference on Computer Vision (ICCV), Publisher: IEEE, Pages: 3871-3879, ISSN: 1550-5499
Recently, it has been shown that excellent results can be achieved in both facial landmark localization and pose-invariant face recognition. These breakthroughs are attributed to the efforts of the community to manually annotate facial images in many different poses and to collect 3D facial data. In this paper, we propose a novel method for joint frontal view reconstruction and landmark localization using a small set of frontal images only. By observing that the frontal facial image is the one having the minimum rank of all different poses, an appropriate model which is able to jointly recover the frontalized version of the face as well as the facial landmarks is devised. To this end, a suitable optimization problem, involving the minimization of the nuclear norm and the matrix l1 norm is solved. The proposed method is assessed in frontal face reconstruction, face landmark localization, pose-invariant face recognition, and face verification in unconstrained conditions. The relevant experiments have been conducted on 8 databases. The experimental results demonstrate the effectiveness of the proposed method in comparison to the state-of-the-art methods for the target problems.
Shen J, Zafeiriou S, Chrysos GG, et al., 2015, The first facial landmark tracking in-the-wild challenge: benchmark and results, 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), Publisher: IEEE, Pages: 1003-1011, ISSN: 1550-5499
Detection and tracking of faces in image sequences is among the most well studied problems in the intersection of statistical machine learning and computer vision. Often, tracking and detection methodologies use a rigid representation to describe the facial region, hence they can neither capture nor exploit the non-rigid facial deformations, which are crucial for countless applications (e.g., facial expression analysis, facial motion capture, high-performance face recognition etc.). Usually, the non-rigid deformations are captured by locating and tracking the position of a set of fiducial facial landmarks (e.g., eyes, nose, mouth etc.). Recently, we witnessed a burst of research in automatic facial landmark localisation in static imagery. This is partly attributed to the availability of a large amount of annotated data, much of which has been provided by the first facial landmark localisation challenge (also known as the 300-W challenge). Even though well established benchmarks now exist for facial landmark localisation in static imagery, to the best of our knowledge, there is no established benchmark for assessing the performance of facial landmark tracking methodologies, containing an adequate number of annotated face videos. In conjunction with ICCV'2015 we ran the first competition/challenge on facial landmark tracking in long-term videos. In this paper, we present the first benchmark for long-term facial landmark tracking, containing currently over 110 annotated videos, and we summarise the results of the competition.