Imperial College London

STEFANOS ZAFEIRIOU, PhD

Faculty of Engineering, Department of Computing

Reader in Machine Learning and Computer Vision
 
 
 

Contact

 

+44 (0)20 7594 8461, s.zafeiriou

 
 

Location

 

375 Huxley Building, South Kensington Campus



Publications


220 results found

Sagonas C, Panagakis Y, Leidinger A, Zafeiriou S, et al., Robust joint and individual variance explained, 2017 IEEE International Conference on Computer Vision and Pattern Recognition, Publisher: IEEE

Discovering the common (joint) and individual sub-spaces is crucial for the analysis of multiple data sets, including multi-view and multi-modal data. Several statistical machine learning methods have been developed for discovering the common features across multiple data sets. The most well studied family of methods is that of Canonical Correlation Analysis (CCA) and its variants. Even though CCA is a powerful tool, it has several drawbacks that render its application challenging for computer vision applications. That is, it discovers only common features and not individual ones, and it is sensitive to gross errors present in visual data. Recently, efforts have been made to develop methods that discover individual and common components. Nevertheless, these methods are mainly applicable to two sets of data. In this paper, we investigate the use of a recently proposed statistical method, the so-called Joint and Individual Variance Explained (JIVE) method, for the recovery of joint and individual components in an arbitrary number of data sets. Since JIVE is not robust to gross errors, we propose alternatives which are both robust to non-Gaussian noise of large magnitude and able to automatically find the rank of the individual components. We demonstrate the effectiveness of the proposed approach on two computer vision applications, namely facial expression synthesis and face age progression in-the-wild.

Conference paper

Wang M, Panagakis Y, Snape P, Zafeiriou S, et al., Learning the multilinear structure of visual data, 2017 IEEE International Conference on Computer Vision and Pattern Recognition, Publisher: IEEE

Statistical decomposition methods are of paramount importance in discovering the modes of variation of visual data. Probably the most prominent linear decomposition method is Principal Component Analysis (PCA), which discovers a single mode of variation in the data. However, in practice, visual data exhibit several modes of variation. For instance, the appearance of faces varies in identity, expression, pose, etc. To extract these modes of variation from visual data, several supervised methods, such as TensorFaces, that rely on multilinear (tensor) decompositions (e.g., Higher Order SVD) have been developed. The main drawback of such methods is that they require both labels regarding the modes of variation and the same number of samples under all modes of variation (e.g., the same face under different expressions, poses, etc.). Therefore, their applicability is limited to well-organised data, usually captured in well-controlled conditions. In this paper, we propose the first general multilinear method, to the best of our knowledge, that discovers the multilinear structure of visual data in an unsupervised setting, that is, without the presence of labels. We demonstrate the applicability of the proposed method in two applications, namely Shape from Shading (SfS) and expression transfer.

Conference paper
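For intuition, the supervised baseline named in the abstract, the Higher Order SVD (HOSVD), reduces a data tensor along each mode via one SVD per mode unfolding. Below is a minimal numpy sketch of a truncated HOSVD; the function names and the toy tensor shape are illustrative, not taken from the paper, and the paper's unsupervised method is not shown here.

    import numpy as np

    def unfold(tensor, mode):
        # Mode-n unfolding: bring axis `mode` to the front, flatten the rest.
        return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

    def hosvd(tensor, ranks):
        # Truncated HOSVD: one SVD per mode gives the factor matrices, then
        # the core is the tensor multiplied by each factor's transpose.
        factors = []
        for mode, rank in enumerate(ranks):
            U, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
            factors.append(U[:, :rank])
        core = tensor
        for mode, U in enumerate(factors):
            core = np.moveaxis(np.tensordot(core, U.T, axes=(mode, 1)), -1, mode)
        return core, factors

    # Toy example: 30 identities x 5 expressions x 1024 pixels.
    X = np.random.randn(30, 5, 1024)
    core, (U_id, U_expr, U_pix) = hosvd(X, ranks=(10, 5, 64))

Note how HOSVD needs the data organised as a full tensor (every identity under every expression), which is exactly the limitation the abstract points out.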

Trigeorgis G, Snape P, Kokkinos I, Zafeiriou S, et al., Face Normals “in-the-wild” using Fully Convolutional Networks, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Publisher: IEEE, ISSN: 2160-7508

Conference paper

Chrysos GG, Antonakos E, Snape P, Asthana A, Zafeiriou S, et al., 2017, A comprehensive performance evaluation of deformable face tracking “In-the-Wild”, International Journal of Computer Vision, Vol: 126, Pages: 198-232, ISSN: 0920-5691

Recently, technologies such as face detection, facial landmark localisation and face recognition and verification have matured enough to provide effective and efficient solutions for imagery captured under arbitrary conditions (referred to as “in-the-wild”). This is partially attributed to the fact that comprehensive “in-the-wild” benchmarks have been developed for face detection, landmark localisation and recognition/verification. A very important technology that has not been thoroughly evaluated yet is deformable face tracking “in-the-wild”. Until now, the performance has mainly been assessed qualitatively, by visually inspecting the result of a deformable face tracking technology on short videos. In this paper, we perform the first, to the best of our knowledge, thorough evaluation of state-of-the-art deformable face tracking pipelines using the recently introduced 300-VW benchmark. We evaluate many different architectures, focusing mainly on the task of on-line deformable face tracking. In particular, we compare the following general strategies: (a) generic face detection plus generic facial landmark localisation, (b) generic model-free tracking plus generic facial landmark localisation, as well as (c) hybrid approaches using state-of-the-art face detection, model-free tracking and facial landmark localisation technologies. Our evaluation reveals future avenues for further research on the topic.

Journal article

Valstar M, Zafeiriou S, Pantic M, 2017, Facial actions as social signals, Social Signal Processing, Pages: 123-154, ISBN: 9781107161269

According to a recent survey on social signal processing (Vinciarelli, Pantic, & Bourlard, 2009), next-generation computing needs to implement the essence of social intelligence, including the ability to recognize human social signals and social behaviors, such as turn taking, politeness, and disagreement, in order to become more effective and more efficient. Social signals and social behaviors are the expression of one’s attitude towards a social situation and interplay, and they are manifested through a multiplicity of nonverbal behavioral cues, including facial expressions, body postures and gestures, and vocal outbursts like laughter. Of the many social signals, only face, eye, and posture cues are capable of informing us about all identified social behaviors. During social interaction, it is a social norm that one looks one’s dyadic partner in the eyes, clearly focusing one’s vision on the face. Facial expressions thus make for very powerful social signals. As one of the most comprehensive and objective ways to describe facial expressions, the facial action coding system (FACS) has recently received significant attention. Automating FACS coding would greatly benefit social signal processing, opening up new avenues to understanding how we communicate through facial expressions. In this chapter we provide a comprehensive overview of research into machine analysis of facial actions. We systematically review all components of such systems: pre-processing, feature extraction, and machine coding of facial actions. In addition, the existing FACS-coded facial expression databases are summarized. Finally, challenges that have to be addressed to make automatic facial action analysis applicable in real-life situations are extensively discussed. Scientific work on facial expressions can be traced back to at least 1872, when Charles Darwin published The Expression of the Emotions in Man and Animals.

Book chapter

Georgakis C, Panagakis Y, Zafeiriou S, Pantic M, et al., 2016, The Conflict Escalation Resolution (CONFER) Database, Image and Vision Computing, Vol: 65, Pages: 37-48, ISSN: 0262-8856

Conflict is usually defined as a high level of disagreement taking place when individuals act on incompatible goals, interests, or intentions. Research in human sciences has recognized conflict as one of the main dimensions along which an interaction is perceived and assessed. Hence, automatic estimation of conflict intensity in naturalistic conversations would be a valuable tool for the advancement of human-centered computing and the deployment of novel applications for social skills enhancement, including conflict management and negotiation. However, machine analysis of conflict is still limited to just a few works, partially due to an overall lack of suitable annotated data, while it has been mostly approached as a conflict or (dis)agreement detection problem based on audio features only. In this work, we aim to overcome the aforementioned limitations by a) presenting the Conflict Escalation Resolution (CONFER) Database, a set of excerpts from audio-visual recordings of televised political debates where conflicts naturally arise, and b) reporting baseline experiments on audio-visual conflict intensity estimation. The database contains approximately 142 min of recordings in the Greek language, split over 120 non-overlapping episodes of naturalistic conversations that involve two or three interactants. Subject- and session-independent experiments are conducted on continuous-time (frame-by-frame) estimation of real-valued conflict intensity, as opposed to binary conflict/non-conflict classification. For the problem at hand, the efficiency of various audio and visual features, their fusion, and various regression frameworks is examined. Experimental results suggest that there is much room for improvement in the design and development of automated multi-modal approaches to continuous conflict analysis. The CONFER Database is publicly available for non-commercial use at http://ibug.doc.ic.ac.uk/resources/confer/.

Journal article

Zafeiriou S, Papaioannou A, Kotsia I, Nicolaou M, Zhao G, et al., 2016, Facial Affect "in-the-wild": A survey and a new database, Computer Vision and Pattern Recognition Workshops (CVPRW), Publisher: IEEE, Pages: 1487-1498, ISSN: 2160-7508

Well-established databases and benchmarks have been developed in the past 20 years for automatic facial behaviour analysis. Nevertheless, for some important problems regarding analysis of facial behaviour, such as (a) estimation of affect in a continuous dimensional space (e.g., valence and arousal) in videos displaying spontaneous facial behaviour and (b) detection of the activated facial muscles (i.e., facial action unit detection), to the best of our knowledge, well-established in-the-wild databases and benchmarks do not exist. That is, the majority of the publicly available corpora for the above tasks contain samples that have been captured in controlled recording conditions and/or captured under a very specific milieu. Arguably, in order to make further progress in automatic understanding of facial behaviour, datasets that have been captured in-the-wild and in various milieus have to be developed. In this paper, we survey the progress that has been recently made on understanding facial behaviour in-the-wild, the datasets that have been developed so far and the methodologies that have been developed, paying particular attention to deep learning techniques for the task. Finally, we make a significant step further and propose a new comprehensive benchmark for training methodologies, as well as for assessing the performance of facial affect/behaviour analysis/understanding in-the-wild. To the best of our knowledge, this is the first time that such a benchmark for valence and arousal "in-the-wild" is presented.

Conference paper

Cheng S, Marras I, Zafeiriou S, Pantic M, et al., 2016, Statistical non-rigid ICP algorithm and its application to 3D face alignment, Image and Vision Computing, Vol: 58, Pages: 3-12, ISSN: 0262-8856

Journal article

Hovhannisyan V, Parpas P, Zafeiriou S, 2016, MAGMA: Multi-level accelerated gradient mirror descent algorithm for large-scale convex composite minimization, SIAM Journal on Imaging Sciences, Vol: 9, Pages: 1829-1857, ISSN: 1936-4954

Composite convex optimization models arise in several applications, and are especially prevalent in inverse problems with a sparsity inducing norm and in general convex optimization with simple constraints. The most widely used algorithms for convex composite models are accelerated first order methods; however, they can take a large number of iterations to compute an acceptable solution for large-scale problems. In this paper we propose to speed up first order methods by taking advantage of the structure present in many applications, and in image processing in particular. Our method is based on multi-level optimization methods and exploits the fact that many applications that give rise to large scale models can be modelled using varying degrees of fidelity. We use Nesterov's acceleration techniques together with the multi-level approach to achieve an O(1/√ε) convergence rate, where ε denotes the desired accuracy. The proposed method has a better convergence rate than any other existing multi-level method for convex problems, and in addition has the same rate as accelerated methods, which is known to be optimal for first-order methods. Moreover, as our numerical experiments show, on large-scale face recognition problems our algorithm is several times faster than the state of the art.

Journal article
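For intuition on the accelerated first-order baseline that MAGMA builds on, here is a minimal numpy sketch of Nesterov-accelerated proximal gradient (FISTA) for the canonical sparsity-inducing composite model min 0.5*||Ax - b||^2 + lam*||x||_1. This is the single-level method only, not MAGMA's multi-level scheme, and the function name and defaults are illustrative.

    import numpy as np

    def fista(A, b, lam, n_iter=200):
        # Accelerated proximal gradient for min 0.5*||Ax-b||^2 + lam*||x||_1;
        # achieves the O(1/sqrt(eps)) iteration complexity that is optimal
        # for first-order methods.
        L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
        x = z = np.zeros(A.shape[1])
        t = 1.0
        for _ in range(n_iter):
            grad = A.T @ (A @ z - b)           # gradient of the smooth part at z
            w = z - grad / L
            x_new = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)  # prox of l1
            t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
            z = x_new + ((t - 1.0) / t_new) * (x_new - x)  # Nesterov momentum
            x, t = x_new, t_new
        return x

MAGMA's idea is to run cheap iterations of this kind on coarse, low-fidelity versions of the model and use them to accelerate the fine-level solve.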

Tsinalis O, Matthews PM, Guo Y, Zafeiriou S, et al., 2016, Automatic sleep stage scoring with single-channel EEG using convolutional neural networks, Publisher: arXiv

We used convolutional neural networks (CNNs) for automatic sleep stage scoring based on single-channel electroencephalography (EEG) to learn task-specific filters for classification without using prior domain knowledge. We used an openly available dataset from 20 healthy young adults for evaluation and applied 20-fold cross-validation. We used class balanced random sampling within the stochastic gradient descent (SGD) optimization of the CNN to avoid skewed performance in favor of the most represented sleep stages. We achieved high mean F1-score (81%, range 79-83%), mean accuracy across individual sleep stages (82%, range 80-84%) and overall accuracy (74%, range 71-76%) over all subjects. By analyzing and visualizing the filters that our CNN learns, we found that rules learned by the filters correspond to sleep scoring criteria in the American Academy of Sleep Medicine (AASM) manual that human experts follow. Our method's performance is balanced across classes and our results are comparable to state-of-the-art methods with hand-engineered features. We show that, without using prior domain knowledge, a CNN can automatically learn to distinguish among different normal sleep stages.

Working paper
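The class-balanced random sampling described above can be sketched in a few lines: draw the sleep stage uniformly first, then draw an example of that stage, so minibatches do not mirror the skewed stage distribution. A minimal numpy sketch; the function name and signature are illustrative, not from the paper.

    import numpy as np

    def balanced_batch_indices(labels, batch_size, rng=None):
        # Pick each minibatch element by first choosing a class uniformly at
        # random, then a random example of that class, so under-represented
        # sleep stages are seen as often as common ones during SGD.
        if rng is None:
            rng = np.random.default_rng()
        labels = np.asarray(labels)
        classes = np.unique(labels)
        by_class = {c: np.flatnonzero(labels == c) for c in classes}
        chosen = rng.choice(classes, size=batch_size)
        return np.array([rng.choice(by_class[c]) for c in chosen])

    # e.g. call once per SGD step: idx = balanced_batch_indices(stage_labels, 128)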

Kampouris C, Zafeiriou S, Ghosh A, Malassiotis S, et al., 2016, Fine-grained Material Classification using Micro-geometry and Reflectance, European Conference on Computer Vision 2016, Publisher: Springer, Pages: 778-792, ISSN: 0302-9743

In this paper we focus on an understudied computer vision problem, particularly how the micro-geometry and the reflectance of a surface can be used to infer its material. To this end, we introduce a new, publicly available database for fine-grained material classification, consisting of over 2000 surfaces of fabrics (http://ibug.doc.ic.ac.uk/resources/fabrics). The database has been collected using a custom-made portable, cheap and easy to assemble photometric stereo sensor. We use the normal map and the albedo of each surface to recognize its material via the use of handcrafted and learned features and various feature encodings. We also perform garment classification using the same approach. We show that the fusion of normals and albedo information outperforms standard methods which rely only on the use of texture information. Our methodologies, both for data collection and for material classification, can be applied easily to many real-world scenarios, including the design of new robots able to sense materials and industrial inspection.

Conference paper

Antonakos E, Snape P, Trigeorgis G, Zafeiriou S, et al., 2016, Adaptive Cascaded Regression, 2016 IEEE International Conference on Image Processing (ICIP), Publisher: IEEE, Pages: 1649-1653, ISSN: 1522-4880

The two predominant families of deformable models for the task of face alignment are: (i) discriminative cascaded regression models, and (ii) generative models optimised with Gauss-Newton. Although these approaches have been found to work well in practice, they each suffer from convergence issues. Cascaded regression has no theoretical guarantee of convergence to a local minimum and thus may fail to recover the fine details of the object. Gauss-Newton optimisation is not robust to initialisations that are far from the optimal solution. In this paper, we propose the first, to the best of our knowledge, attempt to combine the best of these two worlds under a unified model, and report state-of-the-art performance on the most recent facial benchmark challenge.

Conference paper

Sagonas C, Panagakis Y, Zafeiriou S, Pantic M, et al., 2016, Robust statistical frontalization of human and animal faces, International Journal of Computer Vision, Vol: 122, Pages: 270-291, ISSN: 0920-5691

The unconstrained acquisition of facial data in real-world conditions may result in face images with significant pose variations, illumination changes, and occlusions, affecting the performance of facial landmark localization and recognition methods. In this paper, a novel method, robust to pose, illumination variations, and occlusions, is proposed for joint face frontalization and landmark localization. Unlike the state-of-the-art methods for landmark localization and pose correction, where a large amount of manually annotated images or 3D facial models is required, the proposed method relies on a small set of frontal images only. By observing that the frontal facial image of both humans and animals is the one having the minimum rank of all different poses, a model which is able to jointly recover the frontalized version of the face as well as the facial landmarks is devised. To this end, a suitable optimization problem is solved, concerning minimization of the nuclear norm (convex surrogate of the rank function) and the matrix ℓ1 norm accounting for occlusions. The proposed method is assessed in frontal view reconstruction of human and animal faces, landmark localization, pose-invariant face recognition, face verification in unconstrained conditions, and video inpainting, by conducting experiments on 9 databases. The experimental results demonstrate the effectiveness of the proposed method in comparison to the state-of-the-art methods for the target problems.

Journal article
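The two convex penalties named in the abstract have simple closed-form proximal operators, which is what makes such models tractable: singular value thresholding for the nuclear norm and elementwise soft thresholding for the ℓ1 term. A minimal numpy sketch of these two building blocks only, not the paper's full frontalization solver:

    import numpy as np

    def svt(M, tau):
        # Singular value thresholding: the proximal operator of tau*||.||_*,
        # shrinks singular values and so pushes M towards low rank.
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        return (U * np.maximum(s - tau, 0.0)) @ Vt

    def soft_threshold(M, tau):
        # Elementwise proximal operator of tau*||.||_1, used for the sparse
        # term that absorbs occlusions and gross errors.
        return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

Solvers of the ADMM/augmented Lagrangian family alternate these two steps to split an observation into a low-rank part and a sparse error part.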

Alabort-i-Medina J, Zafeiriou S, 2016, A unified framework for compositional fitting of active appearance models, International Journal of Computer Vision, Vol: 121, Pages: 26-64, ISSN: 1573-1405

Active Appearance Models (AAMs) are one of the most popular and well-established techniques for modeling deformable objects in computer vision. In this paper, we study the problem of fitting AAMs using Compositional Gradient Descent (CGD) algorithms. We present a unified and complete view of these algorithms and classify them with respect to three main characteristics: i) cost function; ii) type of composition; and iii) optimization method. Furthermore, we extend the previous view by: a) proposing a novel Bayesian cost function that can be interpreted as a general probabilistic formulation of the well-known project-out loss; b) introducing two new types of composition, asymmetric and bidirectional, that combine the gradients of both image and appearance model to derive better convergent and more robust CGD algorithms; and c) providing new valuable insights into existing CGD algorithms by reinterpreting them as direct applications of the Schur complement and the Wiberg method. Finally, in order to encourage open research and facilitate future comparisons with our work, we make the implementation of the algorithms studied in this paper publicly available as part of the Menpo Project.

Journal article

Snape P, Pszczolkowski S, Zafeiriou S, Tzimiropoulos G, Ledig C, Rueckert D, et al., 2016, A robust similarity measure for volumetric image registration with outliers, Image and Vision Computing, Vol: 52, Pages: 97-113, ISSN: 0262-8856

Image registration under challenging realistic conditions is a very important area of research. In this paper, we focus on algorithms that seek to densely align two volumetric images according to a global similarity measure. Despite intensive research in this area, there is still a need for similarity measures that are robust to outliers common to many different types of images. For example, medical image data is often corrupted by intensity inhomogeneities and may contain outliers in the form of pathologies. In this paper we propose a global similarity measure that is robust to both intensity inhomogeneities and outliers without requiring prior knowledge of the type of outliers. We combine the normalised gradients of images with the cosine function and show that it is theoretically robust against a very general class of outliers. Experimentally, we verify the robustness of our measures within two distinct algorithms. Firstly, we embed our similarity measures within a proof-of-concept extension of the Lucas-Kanade algorithm for volumetric data. Finally, we embed our measures within a popular non-rigid alignment framework based on free-form deformations and show it to be robust against both simulated tumours and intensity inhomogeneities.

Journal article

Zafeiriou S, Zhao G, Pietikainen M, Chellappa R, Kotsia I, Cohn J, et al., 2016, Editorial of special issue on spontaneous facial behaviour analysis, Computer Vision and Image Understanding, Vol: 147, Pages: 50-51, ISSN: 1090-235X

Journal article

Trigeorgis G, Bousmalis K, Zafeiriou S, Schuller B, et al., 2016, A deep matrix factorization method for learning attribute representations, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol: 39, Pages: 417-429, ISSN: 0162-8828

Semi-Non-negative Matrix Factorization is a technique that learns a low-dimensional representation of a dataset that lends itself to a clustering interpretation. It is possible that the mapping between this new representation and our original data matrix contains rather complex hierarchical information with implicit lower-level hidden attributes, which classical one-level clustering methodologies cannot interpret. In this work we propose a novel model, Deep Semi-NMF, that is able to learn such hidden representations, which lend themselves to a clustering interpretation according to different, unknown attributes of a given dataset. We also present a semi-supervised version of the algorithm, named Deep WSF, that allows the use of (partial) prior information for each of the known attributes of a dataset, allowing the model to be used on datasets with mixed attribute knowledge. Finally, we show that our models are able to learn low-dimensional representations that are better suited not only for clustering but also for classification, outperforming Semi-Non-negative Matrix Factorization as well as other state-of-the-art methodologies.

Journal article
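For reference, a single Semi-NMF layer, the factorization that Deep Semi-NMF stacks, can be computed with the classic alternating updates of Ding et al.: a closed-form least-squares step for the free-signed factor F and a multiplicative step for the non-negative factor G. A minimal numpy sketch under those assumptions (variable names and iteration count are illustrative; the deep, multi-layer variant is not shown):

    import numpy as np

    def semi_nmf(X, k, n_iter=200, eps=1e-9, seed=0):
        # One Semi-NMF layer: X (d x n) ~ F @ G.T with G >= 0, F unconstrained.
        rng = np.random.default_rng(seed)
        G = np.abs(rng.standard_normal((X.shape[1], k)))
        for _ in range(n_iter):
            F = X @ G @ np.linalg.pinv(G.T @ G)           # closed-form F update
            A, B = X.T @ F, F.T @ F
            Ap, An = np.maximum(A, 0), np.maximum(-A, 0)  # split +/- parts
            Bp, Bn = np.maximum(B, 0), np.maximum(-B, 0)
            G *= np.sqrt((Ap + G @ Bn) / (An + G @ Bp + eps))  # keeps G >= 0
        return F, G

Deep Semi-NMF then factorizes the representation again, G.T ~ F2 @ G2.T, so each level exposes a clustering at a different attribute granularity.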

Trigeorgis G, Snape P, Nicolaou MA, Antonakos E, Zafeiriou S, et al., Mnemonic Descent Method: A recurrent process applied for end-to-end face alignment, International Conference on Computer Vision and Pattern Recognition, Publisher: Computer Vision Foundation (CVF)

Cascaded regression has recently become the method of choice for solving non-linear least squares problems such as deformable image alignment. Given a sizeable training set, cascaded regression learns a set of generic rules that are sequentially applied to minimise the least squares problem. Despite the success of cascaded regression for problems such as face alignment and head pose estimation, there are several shortcomings arising in the strategies proposed thus far. Specifically, (a) the regressors are learnt independently, (b) the descent directions may cancel one another out and (c) handcrafted features (e.g., HoGs, SIFT etc.) are mainly used to drive the cascade, which may be sub-optimal for the task at hand. In this paper, we propose a combined and jointly trained convolutional recurrent neural network architecture that allows the training of an end-to-end system that attempts to alleviate the aforementioned drawbacks. The recurrent module facilitates the joint optimisation of the regressors by assuming the cascades form a nonlinear dynamical system, in effect fully utilising the information between all cascade levels by introducing a memory unit that shares information across all levels. The convolutional module allows the network to extract features that are specialised for the task at hand and are experimentally shown to outperform hand-crafted features. We show that the application of the proposed architecture for the problem of face alignment results in a strong improvement over the current state-of-the-art.

Conference paper
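The classic cascade that this paper improves on is easy to state concretely: each stage fits a ridge regressor from features at the current shape estimate to the remaining shape residual, stages are learned one after another, and at test time the same rules are replayed. A minimal numpy sketch of that baseline; the feature extractor Phi, the names, and the toy usage are illustrative, and MDM itself replaces these independent stages with a jointly trained recurrent network.

    import numpy as np

    def train_cascade(Phi, X0, X_true, n_stages=4, lam=1e-3):
        # Each stage learns a linear rule R mapping features of the current
        # estimate to a shape update; stages are fit sequentially and
        # independently, which is drawback (a) discussed in the abstract.
        X, stages = X0.copy(), []
        for _ in range(n_stages):
            F = Phi(X)                               # n_samples x n_features
            R = np.linalg.solve(F.T @ F + lam * np.eye(F.shape[1]),
                                F.T @ (X_true - X))  # ridge regression
            stages.append(R)
            X = X + F @ R                            # apply the rule, move closer
        return stages

    def apply_cascade(Phi, X0, stages):
        X = X0.copy()
        for R in stages:
            X = X + Phi(X) @ R
        return X

    # Toy usage with trivial features: perturbed "shapes" pulled back to targets.
    Phi = lambda X: np.hstack([X, np.ones((X.shape[0], 1))])
    rng = np.random.default_rng(0)
    X_true = rng.standard_normal((200, 6))
    stages = train_cascade(Phi, X_true + 0.5 * rng.standard_normal((200, 6)), X_true)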

Zhou Y, Antonakos E, Alabort-i-Medina J, Roussos A, Zafeiriou S, et al., Estimating Correspondences of Deformable Objects “In-the-wild”, International Conference on Computer Vision and Pattern Recognition, Publisher: Computer Vision Foundation (CVF)

During the past few years we have witnessed the development of many methodologies for building and fitting Statistical Deformable Models (SDMs). The construction of accurate SDMs requires careful annotation of images with regards to a consistent set of landmarks. However, the manual annotation of a large amount of images is a tedious, laborious and expensive procedure. Furthermore, for several deformable objects, e.g. the human body, it is difficult to define a consistent set of landmarks, and, thus, it becomes impossible to train humans in order to accurately annotate a collection of images. Nevertheless, for the majority of objects, it is possible to extract the shape by object segmentation or even by shape drawing. In this paper, we show for the first time, to the best of our knowledge, that it is possible to construct SDMs by putting object shapes in dense correspondence. Such SDMs can be built with much less effort for a large battery of objects. Additionally, we show that, by sampling the dense model, a part-based SDM can be learned with its parts being in correspondence. We employ our framework to develop SDMs of human arms and legs, which can be used for the segmentation of the outline of the human body, as well as to provide better and more consistent annotations for body joints.

Conference paper

Booth J, Roussos A, Zafeiriou S, Ponniah A, Dunaway D, et al., A 3D Morphable Model learnt from 10,000 faces, International Conference on Computer Vision and Pattern Recognition, Publisher: Computer Vision Foundation (CVF)

We present the Large Scale Facial Model (LSFM), a 3D Morphable Model (3DMM) automatically constructed from 9,663 distinct facial identities. To the best of our knowledge LSFM is the largest-scale Morphable Model ever constructed, containing statistical information from a huge variety of the human population. To build such a large model we introduce a novel fully automated and robust Morphable Model construction pipeline. The dataset that LSFM is trained on includes rich demographic information about each subject, allowing for the construction of not only a global 3DMM but also models tailored for specific age, gender or ethnicity groups. As an application example, we utilise the proposed model to perform age classification from 3D shape alone. Furthermore, we perform a systematic analysis of the constructed 3DMMs that showcases their quality and descriptive power. The presented extensive qualitative and quantitative evaluations reveal that the proposed 3DMM achieves state-of-the-art results, outperforming existing models by a large margin. Finally, for the benefit of the research community, we make publicly available the source code of the proposed automatic 3DMM construction pipeline. In addition, the constructed global 3DMM and a variety of bespoke models tailored by age, gender and ethnicity are available on application to researchers involved in medically oriented research.

Conference paper
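Once meshes are in dense correspondence (the hard part, which the paper's automated pipeline handles), the statistical core of a 3DMM is a PCA over the registered shape vectors. A minimal numpy sketch under that assumption; the function names and data layout are illustrative, not the paper's code.

    import numpy as np

    def build_shape_model(shapes, n_components):
        # shapes: n_subjects x (3 * n_vertices), meshes already in dense
        # correspondence so that each column is the same anatomical point.
        mean = shapes.mean(axis=0)
        U, s, Vt = np.linalg.svd(shapes - mean, full_matrices=False)
        basis = Vt[:n_components].T                    # principal shape modes
        stddev = s[:n_components] / np.sqrt(len(shapes) - 1)
        return mean, basis, stddev

    def synthesize(mean, basis, stddev, coeffs):
        # A new face is the mean plus a weighted combination of shape modes;
        # coeffs are in units of standard deviations along each mode.
        return mean + basis @ (coeffs * stddev)

Demographic tailoring as described in the abstract amounts to fitting such a model on a subset of subjects (e.g., one age group) instead of the whole cohort.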

Zafeiriou L, Antonakos E, Zafeiriou S, Pantic M, et al., Joint Unsupervised Deformable Spatio-Temporal Alignment of Sequences, International Conference on Computer Vision and Pattern Recognition, Publisher: Computer Vision Foundation (CVF)

Typically, the problems of spatial and temporal alignment of sequences are considered disjoint. That is, in order to align two sequences, a methodology that (non-)rigidly aligns the images is first applied, followed by temporal alignment of the obtained aligned images. In this paper, we propose the first, to the best of our knowledge, methodology that can jointly spatio-temporally align two sequences which display highly deformable texture-varying objects. We show that by treating the problems of deformable spatial and temporal alignment jointly, we achieve better results than considering the problems independently. Furthermore, we show that deformable spatio-temporal alignment of faces can be performed in an unsupervised manner (i.e., without employing face trackers or building person-specific deformable models).

Conference paper

Trigeorgis G, Nicolaou MA, Zafeiriou S, Schuller B, et al., Deep canonical time warping, IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Publisher: IEEE

Conference paper

Zafeiriou S, Tzimiropoulos G, Pantic M, 2016, 300-W: Special issue on facial landmark localisation "in-the-wild" (Preface), Image and Vision Computing, Vol: 47, Pages: 1-2, ISSN: 0262-8856

Journal article

Sagonas C, Antonakos E, Tzimiropoulos G, Zafeiriou S, Pantic M, et al., 2016, 300 faces In-the-wild challenge: database and results, Image and Vision Computing, Vol: 47, Pages: 3-18, ISSN: 0262-8856

Computer Vision has recently witnessed great research advances towards automatic facial points detection. Numerous methodologies have been proposed during the last few years that achieve accurate and efficient performance. However, fair comparison between these methodologies is infeasible mainly due to two issues. (a) Most existing databases, captured under both constrained and unconstrained (in-the-wild) conditions, have been annotated using different mark-ups and, in most cases, the accuracy of the annotations is low. (b) Most published works report experimental results using different training/testing sets, different error metrics and, of course, landmark points with semantically different locations. In this paper, we aim to overcome the aforementioned problems by (a) proposing a semi-automatic annotation technique that was employed to re-annotate most existing facial databases under a unified protocol, and (b) presenting the 300 Faces In-The-Wild Challenge (300-W), the first facial landmark localization challenge, which was organized twice, in 2013 and 2015. To the best of our knowledge, this is the first effort towards a unified annotation scheme of massive databases and a fair experimental comparison of existing facial landmark localization systems. The images and annotations of the new testing database that was used in the 300-W challenge are available from http://ibug.doc.ic.ac.uk/resources/300-W_IMAVIS/.

Journal article

Hong X, Zhao G, Zafeiriou S, Pantic M, Pietikainen M, et al., 2015, Capturing correlations of local features for image representation, Neurocomputing, Vol: 184, Pages: 99-106, ISSN: 1872-8286

Local descriptors are popular ways to characterize the local properties of images in various computer vision based tasks. To form the global descriptors for image regions, first-order feature pooling is widely used. However, as the first-order pooling technique treats each dimension of local features separately, the pairwise correlations of local features are usually ignored. Encouraged by the success of recently developed second-order pooling techniques, in this paper we formulate a general second-order pooling framework and explore several analogues of the second-order average and max operations. We comprehensively investigate a variety of moments which are central to the second-order pooling technique. As a result, the superiority of second-order standardized moment average pooling (2Standmap) is suggested. We successfully apply 2Standmap to four challenging tasks, namely texture classification, medical image analysis, pain expression recognition, and micro-expression recognition. This illustrates the effectiveness of 2Standmap in capturing multiple cues and its generalization to both static images and spatio-temporal sequences.

Journal article
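The difference between first- and second-order pooling is easiest to see in code: first-order pooling averages each descriptor dimension on its own, while second-order pooling averages outer products, so the pooled representation keeps pairwise correlations between dimensions. A minimal numpy sketch of the plain second-order average; the paper's 2Standmap additionally standardizes the moments, and that variant is not shown.

    import numpy as np

    def first_order_avg(D):
        # D: n_descriptors x d. Per-dimension mean; correlations are discarded.
        return D.mean(axis=0)

    def second_order_avg(D):
        # Mean of outer products d_i d_i^T: a d x d second-moment matrix whose
        # off-diagonal entries carry exactly the pairwise interactions the
        # first-order average throws away. Keep one triangle (symmetric).
        C = (D.T @ D) / D.shape[0]
        return C[np.triu_indices(C.shape[0])]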

Trigeorgis G, Ringeval F, Brückner R, Marchi E, Nicolaou M, Schuller B, Zafeiriou S, et al., Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network, 41st IEEE International Conference on Acoustics, Speech, and Signal Processing, Publisher: IEEE

Conference paper

Chrysos G, Antonakos E, Zafeiriou S, Snape P, et al., 2015, Offline Deformable Face Tracking in Arbitrary Videos, 2015 IEEE International Conference on Computer Vision Workshops, Publisher: IEEE, Pages: 954-962

Generic face detection and facial landmark localization in static imagery are among the most mature and well-studied problems in machine learning and computer vision. Currently, the top performing face detectors achieve a true positive rate of around 75-80% whilst maintaining low false positive rates. Furthermore, the top performing facial landmark localization algorithms obtain low point-to-point errors for more than 70% of commonly benchmarked images captured under unconstrained conditions. The task of facial landmark tracking in videos, however, has attracted much less attention. Generally, a tracking-by-detection framework is applied, where face detection and landmark localization are employed in every frame in order to avoid drifting. Thus, this solution is equivalent to landmark detection in static imagery. Empirically, a straightforward application of such a framework cannot achieve higher performance, on average, than the one reported for static imagery. In this paper, we show for the first time, to the best of our knowledge, that the results of generic face detection and landmark localization can be used to recursively train powerful and accurate person-specific face detectors and landmark localization methods for offline deformable tracking. The proposed pipeline can track landmarks in very challenging long-term sequences captured under arbitrary conditions. The pipeline was used as a semi-automatic tool to annotate the majority of the videos of the 300-VW Challenge.

Conference paper

Sagonas C, Panagakis Y, Zafeiriou S, Pantic M, et al., 2015, Robust statistical face frontalization, 2015 IEEE International Conference on Computer Vision (ICCV), Publisher: IEEE, Pages: 3871-3879, ISSN: 1550-5499

Recently, it has been shown that excellent results can be achieved in both facial landmark localization and pose-invariant face recognition. These breakthroughs are attributed to the efforts of the community to manually annotate facial images in many different poses and to collect 3D facial data. In this paper, we propose a novel method for joint frontal view reconstruction and landmark localization using a small set of frontal images only. By observing that the frontal facial image is the one having the minimum rank of all different poses, an appropriate model which is able to jointly recover the frontalized version of the face as well as the facial landmarks is devised. To this end, a suitable optimization problem, involving the minimization of the nuclear norm and the matrix l1 norm is solved. The proposed method is assessed in frontal face reconstruction, face landmark localization, pose-invariant face recognition, and face verification in unconstrained conditions. The relevant experiments have been conducted on 8 databases. The experimental results demonstrate the effectiveness of the proposed method in comparison to the state-of-the-art methods for the target problems.

Conference paper

Shen J, Zafeiriou S, Chrysos GG, Kossaifi J, Tzimiropoulos G, Pantic M, et al., 2015, The first facial landmark tracking in-the-wild challenge: benchmark and results, 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), Publisher: IEEE, Pages: 1003-1011, ISSN: 1550-5499

Detection and tracking of faces in image sequences is among the most well studied problems in the intersection of statistical machine learning and computer vision. Often, tracking and detection methodologies use a rigid representation to describe the facial region, hence they can neither capture nor exploit the non-rigid facial deformations, which are crucial for countless applications (e.g., facial expression analysis, facial motion capture, high-performance face recognition etc.). Usually, the non-rigid deformations are captured by locating and tracking the position of a set of fiducial facial landmarks (e.g., eyes, nose, mouth etc.). Recently, we witnessed a burst of research in automatic facial landmark localisation in static imagery. This is partly attributed to the availability of large amounts of annotated data, much of which has been provided by the first facial landmark localisation challenge (also known as the 300-W challenge). Even though well established benchmarks now exist for facial landmark localisation in static imagery, to the best of our knowledge, there is no established benchmark for assessing the performance of facial landmark tracking methodologies that contains an adequate number of annotated face videos. In conjunction with ICCV 2015 we ran the first competition/challenge on facial landmark tracking in long-term videos. In this paper, we present the first benchmark for long-term facial landmark tracking, containing currently over 110 annotated videos, and we summarise the results of the competition.

Conference paper

Snape P, Roussos A, Panagakis Y, Zafeiriou S, et al., 2015, Face flow, 2015 IEEE International Conference on Computer Vision (ICCV), Publisher: IEEE, Pages: 2993-3001, ISSN: 1550-5499

In this paper, we propose a method for the robust and efficient computation of multi-frame optical flow in an expressive sequence of facial images. We formulate a novel energy minimisation problem for establishing dense correspondences between a neutral template and every frame of a sequence. We exploit the highly correlated nature of human expressions by representing dense facial motion using a deformation basis. Furthermore, we exploit the even higher correlation between deformations in a given input sequence by imposing a low-rank prior on the coefficients of the deformation basis, yielding temporally consistent optical flow. Our proposed model-based formulation, in conjunction with the inverse compositional strategy and low-rank matrix optimisation that we adopt, leads to a highly efficient algorithm for calculating facial flow. As experimental evaluation, we show quantitative experiments on a challenging novel benchmark of face sequences, with dense ground truth optical flow provided by motion capture data. We also provide qualitative results on a real sequence displaying fast motion and occlusions. Extensive quantitative and qualitative comparisons demonstrate that the proposed method outperforms state-of-the-art optical flow and dense non-rigid registration techniques, whilst running an order of magnitude faster.

Conference paper

