Imperial College London

STEFANOS ZAFEIRIOU, PhD

Faculty of Engineering, Department of Computing

Professor in Machine Learning & Computer Vision

Contact

 

+44 (0)20 7594 8461 · s.zafeiriou

Location

 

375 Huxley Building, South Kensington Campus


Publications


218 results found

Deng J, Guo J, Xue N, Zafeiriou S, et al., 2020, ArcFace: additive angular margin loss for deep face recognition, CVPR 2019, Publisher: IEEE, Pages: 4685-4694, ISSN: 2575-7075

One of the main challenges in feature learning using Deep Convolutional Neural Networks (DCNNs) for large-scale face recognition is the design of appropriate loss functions that enhance discriminative power. Centre loss penalises the distance between the deep features and their corresponding class centres in the Euclidean space to achieve intra-class compactness. SphereFace assumes that the linear transformation matrix in the last fully connected layer can be used as a representation of the class centres in an angular space and penalises the angles between the deep features and their corresponding weights in a multiplicative way. Recently, a popular line of research is to incorporate margins in well-established loss functions in order to maximise face class separability. In this paper, we propose an Additive Angular Margin Loss (ArcFace) to obtain highly discriminative features for face recognition. The proposed ArcFace has a clear geometric interpretation due to the exact correspondence to the geodesic distance on the hypersphere. We present arguably the most extensive experimental evaluation of all the recent state-of-the-art face recognition methods on over 10 face recognition benchmarks, including a new large-scale image database with trillion-level pairs and a large-scale video dataset. We show that ArcFace consistently outperforms the state-of-the-art and can be easily implemented with negligible computational overhead. We release all refined training data, training codes, pre-trained models and training logs, which will help reproduce the results in this paper.
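
For readers who want the loss in concrete form, here is a minimal PyTorch sketch of an additive angular margin head as described in the abstract. This is not the authors' released implementation; the feature dimension, scale s = 64 and margin m = 0.5 are common choices rather than values taken from this record.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcMarginHead(nn.Module):
    """Additive angular margin: the target-class logit becomes s * cos(theta + m)."""
    def __init__(self, in_features, num_classes, s=64.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, in_features))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m

    def forward(self, embeddings, labels):
        # cosine of the angle between L2-normalised features and class weights
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # add the angular margin m only to the target class; other logits stay unchanged
        with_margin = torch.cos(theta + self.m)
        one_hot = F.one_hot(labels, num_classes=cosine.size(1)).bool()
        logits = torch.where(one_hot, with_margin, cosine)
        return F.cross_entropy(self.s * logits, labels)
```

The margin is added to the angle of the target class only, which is what gives the geodesic-distance interpretation on the hypersphere mentioned in the abstract.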

Conference paper

Deng J, Xue N, Cheng S, Panagakis I, Zafeiriou S, et al., 2019, Side information for face completion: a robust PCA approach, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol: 41, Pages: 2349-2364, ISSN: 0162-8828

Robust principal component analysis (RPCA) is a powerful method for learning low-rank feature representations of various visual data. However, for certain types as well as significant amounts of error corruption, it fails to yield satisfactory results; a drawback that can be alleviated by exploiting domain-dependent prior knowledge or information. In this paper, we propose two models for the RPCA that take into account such side information, even in the presence of missing values. We apply this framework to the task of UV completion, which is widely used in pose-invariant face recognition. Moreover, we construct a generative adversarial network (GAN) to extract side information as well as subspaces. These subspaces not only assist in the recovery but also speed up the process in the case of large-scale data. We quantitatively and qualitatively evaluate the proposed approaches through both synthetic data and five real-world datasets to verify their effectiveness.
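
The two proposed models build on the standard RPCA/principal component pursuit decomposition. As background, here is a minimal NumPy sketch of plain RPCA via the inexact augmented Lagrange multiplier method; it does not include the side-information or missing-value terms the paper adds, and the default λ = 1/√max(m, n) is the usual choice from the RPCA literature.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft_threshold(M, tau):
    """Elementwise soft thresholding: proximal operator of tau * l1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def rpca(X, lam=None, mu=None, n_iter=500, tol=1e-7):
    """Decompose X into low-rank L plus sparse S via principal component pursuit (inexact ALM)."""
    m, n = X.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))                 # standard weight from the RPCA literature
    if mu is None:
        mu = 0.25 * m * n / (np.abs(X).sum() + 1e-12)  # common heuristic for the penalty parameter
    L, S, Y = np.zeros_like(X), np.zeros_like(X), np.zeros_like(X)
    for _ in range(n_iter):
        L = svt(X - S + Y / mu, 1.0 / mu)              # low-rank update
        S = soft_threshold(X - L + Y / mu, lam / mu)   # sparse update
        residual = X - L - S
        Y += mu * residual                             # dual (Lagrange multiplier) update
        if np.linalg.norm(residual) <= tol * np.linalg.norm(X):
            break
    return L, S
```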

Journal article

Bahri M, Panagakis Y, Zafeiriou SP, 2019, Robust Kronecker component analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol: 41, Pages: 2365-2379, ISSN: 0162-8828

Dictionary learning and component analysis models are fundamental for learning compact representations relevant to a given task. The model complexity is encoded by means of structure, such as sparsity, low-rankness, or nonnegativity. Unfortunately, approaches like K-SVD that learn dictionaries for sparse coding via Singular Value Decomposition (SVD) are hard to scale, and fragile in the presence of outliers. Conversely, robust component analysis methods such as the Robust Principal Component Analysis (RPCA) are able to recover low-complexity representations from data corrupted with noise of unknown magnitude and support, but do not provide a dictionary that respects the structure of the data, and also involve expensive computations. In this paper, we propose a novel Kronecker-decomposable component analysis model, coined as Robust Kronecker Component Analysis (RKCA), that combines ideas from sparse dictionary learning and robust component analysis. RKCA has several appealing properties, including robustness to gross corruption; it can be used for low-rank modeling, and leverages separability to solve significantly smaller problems. We design an efficient learning algorithm by drawing links with tensor factorizations, and analyze its optimality and low-rankness properties. The effectiveness of the proposed approach is demonstrated on real-world applications, namely background subtraction and image denoising and completion, by performing a thorough comparison with the current state of the art.

Journal article

Knoops PGM, Papaioannou A, Borghi A, Breakey RWF, Wilsons AT, Jeelani O, Zafeiriou S, Steinbacher D, Padwa BL, Dunaway DJ, Schievano S, et al., 2019, A machine learning framework for automated diagnosis and computer-assisted planning in plastic and reconstructive surgery, Scientific Reports, Vol: 9, ISSN: 2045-2322

Journal article

Chrysos GG, Favaro P, Zafeiriou S, 2019, Motion deblurring of faces, International Journal of Computer Vision, Vol: 127, Pages: 801-823, ISSN: 0920-5691

Face analysis lies at the heart of computer vision, with remarkable progress made in the past decades. Face recognition and tracking are tackled by building invariance to fundamental modes of variation such as illumination and 3D pose. A far less studied mode of variation is motion blur, which nonetheless presents substantial challenges in face analysis. Recent approaches either make oversimplifying assumptions, e.g. in cases of joint optimization with other tasks, or fail to preserve the highly structured shape/identity information. We introduce a two-step architecture tailored to the challenges of motion deblurring: the first step restores the low frequencies; the second restores the high frequencies, while ensuring that the outputs span the natural images manifold. Both steps are implemented with a supervised data-driven method; to train them we devise a method for creating realistic motion blur by averaging a variable number of frames. The averaged images originate from the 2MF2 dataset of 19 million facial frames, which we introduce for the task. Considering deblurring as an intermediate step, we conduct a thorough experimentation on high-level face analysis tasks, i.e. landmark localization and face verification, on blurred images. The experimental evaluation demonstrates the superiority of our method.
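
The blur-synthesis step described above (averaging a variable number of frames) is simple to reproduce. The sketch below is an illustrative fixed-window NumPy version, not the authors' pipeline; the choice of the central frame as the sharp target is an assumption.

```python
import numpy as np

def synthesize_motion_blur(frames, num_avg=7):
    """Create (blurred, sharp) training pairs by averaging consecutive frames.
    frames: array-like of shape (T, H, W, C)."""
    frames = np.asarray(frames, dtype=np.float64)
    blurred, sharp = [], []
    for start in range(0, len(frames) - num_avg + 1, num_avg):
        window = frames[start:start + num_avg]
        blurred.append(window.mean(axis=0))    # averaged frames approximate motion blur
        sharp.append(window[num_avg // 2])     # central frame used as the sharp target (assumption)
    return np.stack(blurred), np.stack(sharp)
```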

Journal article

Wang M, Shu Z, Cheng S, Panagakis Y, Samaras D, Zafeiriou S, et al., 2019, An adversarial neuro-tensorial approach for learning disentangled representations, International Journal of Computer Vision, Vol: 127, Pages: 743-762, ISSN: 0920-5691

Several factors contribute to the appearance of an object in a visual scene, including pose, illumination, and deformation, among others. Each factor accounts for a source of variability in the data, while the multiplicative interactions of these factors emulate the entangled variability, giving rise to the rich structure of visual object appearance. Disentangling such unobserved factors from visual data is a challenging task, especially when the data have been captured in uncontrolled recording conditions (also referred to as “in-the-wild”) and label information is not available. In this paper, we propose a pseudo-supervised deep learning method for disentangling multiple latent factors of variation in face images captured in-the-wild. To this end, we propose a deep latent variable model, where the multiplicative interactions of multiple latent factors of variation are explicitly modelled by means of multilinear (tensor) structure. We demonstrate that the proposed approach indeed learns disentangled representations of facial expressions and pose, which can be used in various applications, including face editing, as well as 3D face reconstruction and classification of facial expression, identity and pose.

Journal article

Deng J, Roussos A, Chrysos G, Ververas E, Kotsia I, Shen J, Zafeiriou S, et al., 2019, The Menpo benchmark for multi-pose 2D and 3D facial landmark localisation and tracking, International Journal of Computer Vision, Vol: 127, Pages: 599-624, ISSN: 0920-5691

In this article, we present the Menpo 2D and Menpo 3D benchmarks, two new datasets for multi-pose 2D and 3D facial landmark localisation and tracking. In contrast to previous benchmarks such as 300W and 300VW, the proposed benchmarks contain facial images in both semi-frontal and profile pose. We introduce an elaborate semi-automatic methodology for providing high-quality annotations for both the Menpo 2D and Menpo 3D benchmarks. In the Menpo 2D benchmark, different visible landmark configurations are designed for semi-frontal and profile faces, thus making 2D face alignment full-pose. In the Menpo 3D benchmark, a unified landmark configuration is designed for both semi-frontal and profile faces based on the correspondence with a 3D face model, thus making face alignment not only full-pose but also corresponding to the real-world 3D space. Based on the considerable number of annotated images, we organised the Menpo 2D and Menpo 3D Challenges for face alignment under large pose variations, in conjunction with CVPR 2017 and ICCV 2017, respectively. The results of these challenges demonstrate that recent deep learning architectures, when trained with abundant data, lead to excellent results. We also provide a very simple, yet effective, solution, named Cascade Multi-view Hourglass Model, for 2D and 3D face alignment. In our method, we take advantage of all 2D and 3D facial landmark annotations in a joint way. We not only capitalise on the correspondences between the semi-frontal and profile 2D facial landmarks but also employ joint supervision from both 2D and 3D facial landmarks. Finally, we discuss future directions on the topic of face alignment.

Journal article

Kollias D, Tzirakis P, Nicolaou MA, Papaioannou A, Zhao G, Schuller B, Kotsia I, Zafeiriou S, et al., 2019, Deep Affect Prediction in-the-Wild: Aff-Wild Database and Challenge, Deep Architectures, and Beyond, International Journal of Computer Vision, Vol: 127, Pages: 907-929, ISSN: 0920-5691

Journal article

Gligorijevic V, Panagakis Y, Zafeiriou S, 2019, Non-Negative Matrix Factorizations for Multiplex Network Analysis, Publisher: IEEE Computer Society

Working paper

Hovhannisyan V, Panagakis I, Parpas P, Zafeiriou S, et al., 2019, Fast multilevel algorithms for compressive principal component pursuit, SIAM Journal on Imaging Sciences, Vol: 12, Pages: 624-649, ISSN: 1936-4954

Recovering a low-rank matrix from highly corrupted measurements arises in compressed sensing of structured high-dimensional signals (e.g., videos and hyperspectral images, among others). Robust principal component analysis (RPCA), solved via principal component pursuit (PCP), recovers a low-rank matrix from sparse corruptions that are of unknown value and support by decomposing the observation matrix into two terms: a low-rank matrix and a sparse one, accounting for sparse noise and outliers. In the more general setting, where only a fraction of the data matrix has been observed, low-rank matrix recovery is achieved by solving the compressive principal component pursuit (CPCP). Both PCP and CPCP are well-studied convex programs, and numerous iterative algorithms have been proposed for their optimisation. Nevertheless, these algorithms involve singular value decomposition (SVD) at each iteration, which renders their applicability challenging in the case of massive data. In this paper, we propose a multilevel approach for the solution of PCP and CPCP problems. The core principle behind our algorithm is to apply SVD in models of lower dimensionality than the original one and then lift its solution to the original problem dimension. Hence, our methods rely on the assumption that the low-rank component can be represented in a lower dimensional space. We show that the proposed algorithms are easy to implement and converge at the same rate but with much lower iteration cost. Numerical experiments on numerous synthetic and real problems indicate that the proposed multilevel algorithms are several times faster than their original counterparts, namely PCP and CPCP.
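
For reference, the two convex programs the multilevel scheme accelerates are, in their standard form (M is the observation matrix, λ the usual 1/√max(m, n) weight, and P_Q the projection onto the observed measurement subspace Q):

```latex
\text{(PCP)}\quad  \min_{L,S}\; \|L\|_{*} + \lambda \|S\|_{1} \;\; \text{s.t.}\;\; L + S = M
\qquad
\text{(CPCP)}\quad \min_{L,S}\; \|L\|_{*} + \lambda \|S\|_{1} \;\; \text{s.t.}\;\; \mathcal{P}_{Q}(L + S) = \mathcal{P}_{Q}(M)
```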

Journal article

Ploumpis S, Wang H, Pears N, Smith W, Zafeiriou S, et al., 2019, Combining 3D morphable models: a large-scale face-and-head model, CVPR 2019, Publisher: IEEE

Three-dimensional Morphable Models (3DMMs) are powerful statistical tools for representing the 3D surfaces of an object class. In this context, we identify an interesting question that has previously not received research attention: is it possible to combine two or more 3DMMs that (a) are built using different templates that perhaps only partly overlap, (b) have different representation capabilities and (c) are built from different datasets that may not be publicly available? In answering this question, we make two contributions. First, we propose two methods for solving this problem: i. use a regressor to complete missing parts of one model using the other, ii. use the Gaussian Process framework to blend covariance matrices from multiple models. Second, as an example application of our approach, we build a new face-and-head shape model that combines the variability and facial detail of the LSFM with the full head modelling of the LYHM. The resulting combined shape model achieves state-of-the-art performance and outperforms existing head models by a large margin. Finally, as an application experiment, we reconstruct full head representations from single, unconstrained images by utilizing our proposed large-scale model in conjunction with the FaceWarehouse blendshapes for handling expressions.
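
To make the first of the two strategies concrete, here is a heavily simplified NumPy sketch of regressor-based completion: a linear least-squares map is fitted from the shared (facial) vertices to the full head using samples drawn from the head model, and then applied to samples from the face-only model. The function names, the purely linear regressor and the assumption that the shared region is known by vertex index are illustrative, not the paper's method in detail.

```python
import numpy as np

def fit_face_to_head_regressor(head_samples, face_idx):
    """Least-squares map from the shared facial vertices to the full head.
    head_samples: (n_samples, n_head_vertices, 3) meshes sampled from the head model;
    face_idx: indices of the head vertices shared with the face-only template."""
    n = head_samples.shape[0]
    X = head_samples[:, face_idx, :].reshape(n, -1)   # observed facial region
    Y = head_samples.reshape(n, -1)                   # full-head target
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)         # linear regressor (3*|face_idx| x 3*n_head_vertices)
    return W

def complete_head(face_vertices, W, n_head_vertices):
    """Predict a full head mesh from facial vertices drawn from the face-only model."""
    return (face_vertices.reshape(1, -1) @ W).reshape(n_head_vertices, 3)
```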

Conference paper

Deng J, Zhou Y, Kotsia I, Zafeiriou S, et al., 2019, Dense 3D face decoding over 2500FPS: joint texture and shape convolutional mesh decoders, CVPR 2019, Publisher: IEEE

3D Morphable Models (3DMMs) are statistical models that represent facial texture and shape variations using a set of linear bases and, in particular, Principal Component Analysis (PCA). 3DMMs were used as statistical priors for reconstructing 3D faces from images by solving non-linear least squares optimization problems. Recently, 3DMMs were used as generative models for training non-linear mappings (i.e., regressors) from image to the parameters of the models via Deep Convolutional Neural Networks (DCNNs). Nevertheless, all of the above methods use either fully connected layers or 2D convolutions on parametric unwrapped UV spaces, leading to large networks with many parameters. In this paper, we present the first, to the best of our knowledge, non-linear 3DMMs by learning joint texture and shape auto-encoders using direct mesh convolutions. We demonstrate how these auto-encoders can be used to train very light-weight models that perform Coloured Mesh Decoding (CMD) in-the-wild at a speed of over 2500 FPS.

Conference paper

Gecer B, Ploumpis S, Kotsia I, Zafeiriou S, et al., 2019, GANFIT: generative adversarial network fitting for high fidelity 3D face reconstruction, CVPR 2019, Publisher: IEEE

In the past few years a lot of work has been done towards reconstructing the 3D facial structure from single images by capitalizing on the power of Deep Convolutional Neural Networks (DCNNs). In the most recent works, differentiable renderers were employed in order to learn the relationship between the facial identity features and the parameters of a 3D morphable model for shape and texture. The texture features either correspond to components of a linear texture space or are learned by auto-encoders directly from in-the-wild images. In all cases, the quality of the facial texture reconstruction of the state-of-the-art methods is still not capable of modelling textures in high fidelity. In this paper, we take a radically different approach and harness the power of Generative Adversarial Networks (GANs) and DCNNs in order to reconstruct the facial texture and shape from single images. That is, we utilize GANs to train a very powerful generator of facial texture in UV space. Then, we revisit the original 3D Morphable Models (3DMMs) fitting approaches making use of non-linear optimization to find the optimal latent parameters that best reconstruct the test image, but under a new perspective. We optimize the parameters with the supervision of pretrained deep identity features through our end-to-end differentiable framework. We demonstrate excellent results in photorealistic and identity preserving 3D face reconstructions and achieve for the first time, to the best of our knowledge, facial texture reconstruction with high-frequency details.
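
A hedged sketch of the kind of fitting loop the abstract describes: latent texture, shape and camera parameters are optimised so that a differentiable rendering matches the target image under a pretrained identity network. `texture_gan`, `shape_model`, `render` and `id_net` are hypothetical stand-ins (no public API is implied), and the latent sizes, loss weights and optimiser settings are assumptions.

```python
import torch

def fit_latents(target_img, texture_gan, shape_model, render, id_net, steps=200, lr=0.01):
    """Optimise texture/shape/camera latents so the rendered face matches the target
    under a pretrained identity network (all model callables are hypothetical)."""
    z_tex = torch.zeros(1, 512, requires_grad=True)     # UV-texture GAN latent (size assumed)
    p_shape = torch.zeros(1, 158, requires_grad=True)   # 3DMM shape/expression coefficients (size assumed)
    p_cam = torch.zeros(1, 6, requires_grad=True)       # camera/pose parameters (size assumed)
    opt = torch.optim.Adam([z_tex, p_shape, p_cam], lr=lr)
    with torch.no_grad():
        target_id = id_net(target_img)                  # identity features of the target image
    for _ in range(steps):
        uv = texture_gan(z_tex)                         # generate a UV texture
        mesh = shape_model(p_shape)                     # decode 3D shape from coefficients
        rendered = render(mesh, uv, p_cam)              # differentiable rendering
        id_loss = 1 - torch.cosine_similarity(id_net(rendered), target_id).mean()
        pix_loss = (rendered - target_img).abs().mean()
        loss = id_loss + 0.1 * pix_loss                 # weight 0.1 is an assumption
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z_tex, p_shape, p_cam
```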

Conference paper

Nicolaou MA, Zafeiriou S, Kotsia I, Zhao G, Cohn J, et al., 2019, Editorial of special issue on human behaviour analysis "in-the-wild", IEEE Transactions on Affective Computing, Vol: 10, Pages: 4-6, ISSN: 1949-3045

The papers in this special section focus on human face and body image analysis, one of the most researched objects. One of the main reasons behind this popularity lies in the numerous applications of automatic face and body gesture analysis algorithms, that span several fields such as Human-Computer and Human-Robot Interaction (facial expression/body gesture recognition for automatic analysis of affect), medicine and healthcare (detection of emotional and cognitive disorders), as well as biometrics (face recognition, gait recognition). The papers in this section focus on recent efforts towards catalysing progress in automatic analysis of human behaviour in uncontrolled, “in-the-wild” conditions. We summarize research efforts towards the development of research methodologies, database collections and benchmarks, as well as algorithms and systems for machine analysis of human behaviour, focusing on facial expressions, body gestures, speech, as well as various other sensors. We are delighted that the special issue includes authors both from academia as well as the industry.

Journal article

Kollias D, Cheng S, Pantic M, Zafeiriou S, et al., 2019, Photorealistic facial synthesis in the dimensional affect space, European Conference on Computer Vision, Publisher: Springer, Pages: 475-491, ISSN: 0302-9743

This paper presents a novel approach for synthesizing facial affect, which is based on our annotating 600,000 frames of the 4DFAB database in terms of valence and arousal. The input of this approach is a pair of these emotional state descriptors and a neutral 2D image of a person to whom the corresponding affect will be synthesized. Given this target pair, a set of 3D facial meshes is selected, which is used to build a blendshape model and generate the new facial affect. To synthesize the affect on the 2D neutral image, 3DMM fitting is performed and the reconstructed face is deformed to generate the target facial expressions. Last, the new face is rendered into the original image. Both qualitative and quantitative experimental studies illustrate the generation of realistic images, when the neutral image is sampled from a variety of well known databases, such as the Aff-Wild, AFEW, Multi-PIE, AFEW-VA, BU-3DFE, Bosphorus.

Conference paper

Chrysos GG, Kossaifi J, Zafeiriou S, 2019, Robust conditional generative adversarial networks, 7th International Conference on Learning Representations, ICLR 2019

Conditional generative adversarial networks (cGAN) have led to large improvements in the task of conditional image generation, which lies at the heart of computer vision. The major focus so far has been on performance improvement, while there has been little effort in making cGAN more robust to noise. The regression (of the generator) might lead to arbitrarily large errors in the output, which makes cGAN unreliable for real-world applications. In this work, we introduce a novel conditional GAN model, called RoCGAN, which leverages structure in the target space of the model to address the issue. Our model augments the generator with an unsupervised pathway, which promotes the outputs of the generator to span the target manifold even in the presence of intense noise. We prove that RoCGAN shares similar theoretical properties with GAN and experimentally verify that our model outperforms existing state-of-the-art cGAN architectures by a large margin in a variety of domains including images from natural scenes and faces.
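
A minimal sketch of the shared-decoder idea as described in the abstract: the regression pathway (corrupted input to clean output) shares its decoder with an unsupervised autoencoder pathway on clean targets. The fully connected architecture, the latent-consistency term and the equal loss weights are illustrative assumptions; the adversarial term is omitted for brevity.

```python
import torch
import torch.nn as nn

class RoCGANGenerator(nn.Module):
    """Two-pathway generator: a regression encoder and an autoencoder encoder
    feeding one shared decoder (illustrative fully connected version)."""
    def __init__(self, dim=64 * 64, hidden=256, latent=64):
        super().__init__()
        self.enc_reg = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, latent))
        self.enc_ae = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, corrupted, clean):
        z_reg = self.enc_reg(corrupted)            # embedding of the corrupted input
        z_ae = self.enc_ae(clean)                  # embedding of the clean target
        return self.decoder(z_reg), self.decoder(z_ae), z_reg, z_ae

def generator_loss(out_reg, out_ae, z_reg, z_ae, clean):
    """Content + autoencoder reconstruction + latent-consistency terms
    (adversarial term from a discriminator omitted)."""
    content = (out_reg - clean).abs().mean()       # supervised regression pathway
    recon = (out_ae - clean).abs().mean()          # unsupervised pathway keeps outputs on the target manifold
    latent = (z_reg - z_ae).pow(2).mean()          # tie the two pathways' embeddings together
    return content + recon + latent
```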

Conference paper

Deng J, Cheng S, Xue N, Zhou Y, Zafeiriou S, et al., 2018, UV-GAN: Adversarial facial UV map completion for pose-invariant face recognition, 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Publisher: IEEE, Pages: 7093-7102, ISSN: 1063-6919

Recently proposed robust 3D face alignment methods establish either dense or sparse correspondence between a 3D face model and a 2D facial image. The use of these methods presents new challenges as well as opportunities for facial texture analysis. In particular, by sampling the image using the fitted model, a facial UV map can be created. Unfortunately, due to self-occlusion, such a UV map is always incomplete. In this paper, we propose a framework for training a Deep Convolutional Neural Network (DCNN) to complete the facial UV map extracted from in-the-wild images. To this end, we first gather complete UV maps by fitting a 3D Morphable Model (3DMM) to various multiview image and video datasets, as well as leveraging a new 3D dataset with over 3,000 identities. Second, we devise a meticulously designed architecture that combines local and global adversarial DCNNs to learn an identity-preserving facial UV completion model. We demonstrate that by attaching the completed UV to the fitted mesh and generating instances of arbitrary poses, we can increase pose variations for training deep face recognition/verification models and minimise pose discrepancy during testing, which leads to better performance. Experiments on both controlled and in-the-wild UV datasets prove the effectiveness of our adversarial UV completion model. We achieve state-of-the-art verification accuracy, 94.05%, under the CFP frontal-profile protocol simply by combining pose augmentation during training and pose discrepancy reduction during testing. We will release the first in-the-wild UV dataset (which we refer to as WildUV), comprising complete facial UV maps from 1,892 identities, for research purposes.

Conference paper

Cheng S, Kotsia I, Pantic M, Zafeiriou S, et al., 2018, 4DFAB: A large scale 4D database for facial expression analysis and biometric applications, 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Publisher: IEEE, Pages: 5117-5126, ISSN: 1063-6919

The progress we are currently witnessing in many computer vision applications, including automatic face analysis, would not be made possible without tremendous efforts in collecting and annotating large scale visual databases. To this end, we propose 4DFAB, a new large scale database of dynamic high-resolution 3D faces (over 1,800,000 3D meshes). 4DFAB contains recordings of 180 subjects captured in four different sessions spanning over a five-year period. It contains 4D videos of subjects displaying both spontaneous and posed facial behaviours. The database can be used for both face and facial expression recognition, as well as behavioural biometrics. It can also be used to learn very powerful blendshapes for parametrising facial behaviour. In this paper, we conduct several experiments and demonstrate the usefulness of the database for various applications. The database will be made publicly available for research purposes.

Conference paper

Moschoglou S, Ververas E, Panagakis Y, Nicolaou MA, Zafeiriou S, et al., 2018, Multi-attribute robust component analysis for facial UV maps, IEEE Journal of Selected Topics in Signal Processing, Vol: 12, Pages: 1324-1337, ISSN: 1932-4553

The collection of large-scale three-dimensional (3-D) face models has led to significant progress in the field of 3-D face alignment “in-the-wild,” with several methods being proposed toward establishing sparse or dense 3-D correspondences between a given 2-D facial image and a 3-D face model. Utilizing 3-D face alignment improves 2-D face alignment in many ways, such as alleviating issues with artifacts and warping effects in texture images. However, the utilization of 3-D face models introduces a new set of challenges for researchers. Since facial images are commonly captured in arbitrary recording conditions, a considerable amount of missing information and gross outliers is observed (e.g., due to self-occlusion, subjects wearing eye-glasses, and so on). To this end, in this paper we propose the Multi-Attribute Robust Component Analysis (MA-RCA), a novel technique that is suitable for facial UV maps containing a considerable amount of missing information and outliers, while additionally, elegantly incorporates knowledge from various available attributes, such as age and identity. We evaluate the proposed method on problems such as UV denoising, UV completion, facial expression synthesis, and age progression, where MA-RCA outperforms compared techniques.

Journal article

Sagonas C, Ververas E, Panagakis Y, Zafeiriou S, et al., 2018, Recovering joint and individual components in facial data, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol: 40, Pages: 2668-2681, ISSN: 0162-8828

A set of images depicting faces with different expressions or at various ages consists of components that are shared across all images (i.e., joint components), imparting to the depicted object the properties of human faces, as well as individual components that are related to different expressions or age groups. Discovering the common (joint) and individual components in facial images is crucial for applications such as facial expression transfer and age progression. The problem is rather challenging when dealing with images captured in unconstrained conditions, in the presence of sparse non-Gaussian errors of large magnitude (i.e., sparse gross errors or outliers) and containing missing data. In this paper, we investigate the use of a method recently introduced in statistics, the so-called Joint and Individual Variance Explained (JIVE) method, for the robust recovery of joint and individual components in visual facial data consisting of an arbitrary number of views. Since JIVE is not robust to sparse gross errors, we propose alternatives, which are (1) robust to sparse, gross, non-Gaussian noise, (2) able to automatically find the rank of the individual components, and (3) able to handle missing data. We demonstrate the effectiveness of the proposed methods in several computer vision applications, namely facial expression synthesis and 2D and 3D face age progression 'in-the-wild'.
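
For context, the (non-robust) JIVE model from the statistics literature that the paper starts from decomposes each view X_i into a joint part, an individual part and noise:

```latex
X_{i} \;=\; J_{i} + A_{i} + E_{i}, \qquad i = 1,\dots,K, \qquad
\operatorname{row}(J_{i}) \subseteq \mathcal{J}, \quad \operatorname{row}(A_{i}) \perp \mathcal{J}
```

where the joint parts share a common row space J across all views and the individual parts are orthogonal to it; the proposed variants robustify this decomposition to sparse gross errors and missing data.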

Journal article

Chrysos G, Zafeiriou SP, 2018, PD2T: Person-specific Detection, Deformable Tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol: 40, Pages: 2555-2568, ISSN: 0162-8828

Face detection/alignment has reached a satisfactory state in static images captured under arbitrary conditions. Such methods typically perform (joint) fitting independently for each frame and are used in commercial applications; however, in the majority of real-world scenarios dynamic scenes are of interest. Hence, we argue that generic fitting per frame is suboptimal (it discards the informative correlation of sequential frames) and propose to learn person-specific statistics from the video to improve the generic results. To that end, we introduce a meticulously studied pipeline, which we name PD²T, that performs person-specific detection and landmark localisation. We carry out extensive experimentation with a diverse set of i) generic fitting results, ii) different objects (human faces, animal faces) that illustrate the powerful properties of our proposed pipeline and experimentally verify that PD²T outperforms all the compared methods.

Journal article

Wang M, Panagakis Y, Snape P, Zafeiriou SP, et al., 2018, Disentangling the modes of variation in unlabelled data, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol: 40, Pages: 2682-2695, ISSN: 0162-8828

Statistical methods are of paramount importance in discovering the modes of variation in visual data. Principal Component Analysis (PCA) is probably the most prominent method for extracting a single mode of variation in the data. However, in practice, several factors contribute to the appearance of visual objects, including pose, illumination, and deformation, to mention a few. To extract these modes of variation from visual data, several supervised methods, such as TensorFaces, relying on multilinear (tensor) decomposition have been developed. The main drawback of such methods is that they require both labels regarding the modes of variation and the same number of samples under all modes of variation (e.g., the same face under different expressions, poses etc.). Therefore, their applicability is limited to well-organised data, usually captured in well-controlled conditions. In this paper, we propose a novel general multilinear matrix decomposition method that discovers the multilinear structure of possibly incomplete sets of visual data in an unsupervised setting (i.e., without the presence of labels). We also propose extensions of the method with sparsity and low-rank constraints in order to handle noisy data captured in unconstrained conditions. In addition, a graph-regularised variant of the method is developed in order to exploit available geometric or label information for some modes of variation. We demonstrate the applicability of the proposed method in several computer vision tasks, including Shape from Shading (SfS) (in the wild and with occlusion removal), expression transfer, and estimation of surface normals from images captured in the wild.

Journal article

Booth J, Roussos A, Ververas E, Antonakos E, Ploumpis S, Panagakis Y, Zafeiriou SP, et al., 2018, 3D reconstruction of "in-the-wild" faces in images and videos, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol: 40, Pages: 2638-2652, ISSN: 0162-8828

3D Morphable Models (3DMMs) are powerful statistical models of 3D facial shape and texture, and among the state-of-the-art methods for reconstructing facial shape from single images. With the advent of new 3D sensors, many 3D facial datasets have been collected containing both neutral as well as expressive faces. However, all datasets are captured under controlled conditions. Thus, even though powerful 3D facial shape models can be learnt from such data, it is difficult to build statistical texture models that are sufficient to reconstruct faces captured in unconstrained conditions ("in-the-wild"). In this paper, we propose the first "in-the-wild" 3DMM by combining a statistical model of facial identity and expression shape with an "in-the-wild" texture model. We show that such an approach allows for the development of a greatly simplified fitting procedure for images and videos, as there is no need to optimise with regards to the illumination parameters. We have collected three new databases that combine "in-the-wild" images and video with ground truth 3D facial geometry, the first of their kind, and report extensive quantitative evaluations using them that demonstrate our method is state-of-the-art.

Journal article

Kollias D, Zafeiriou S, 2018, Training deep neural networks with different datasets In-the-wild: The emotion recognition paradigm, 2018 International Joint Conference on Neural Networks (IJCNN), Publisher: IEEE, ISSN: 2161-4407

A novel procedure is presented in this paper, for training a deep convolutional and recurrent neural network, taking into account both the available training data set and some information extracted from similar networks trained with other relevant data sets. This information is included in an extended loss function used for the network training, so that the network can have an improved performance when applied to the other data sets, without forgetting the learned knowledge from the original data set. Facial expression and emotion recognition in-the-wild is the test bed application that is used to demonstrate the improved performance achieved using the proposed approach. In this framework, we provide an experimental study on categorical emotion recognition using datasets from a very recent related emotion recognition challenge.
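
The abstract does not give the exact form of the extended loss; a plausible minimal sketch, assuming a distillation-style soft-target term that keeps the network close to the predictions of a network trained on a related dataset, is shown below (the weight alpha and the temperature are assumptions).

```python
import torch.nn.functional as F

def extended_loss(logits, labels, prior_logits, alpha=0.5, temperature=2.0):
    """Cross-entropy on the current dataset plus a soft-target term that keeps the
    network close to a network trained on a related dataset (alpha, temperature assumed)."""
    ce = F.cross_entropy(logits, labels)
    soft_targets = F.softmax(prior_logits / temperature, dim=1)
    distill = F.kl_div(F.log_softmax(logits / temperature, dim=1),
                       soft_targets, reduction="batchmean") * temperature ** 2
    return ce + alpha * distill
```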

Conference paper

Chrysos GG, Antonakos E, Zafeiriou S, 2018, IPST: Incremental Pictorial Structures for Model-Free Tracking of Deformable Objects, IEEE Transactions on Image Processing, Vol: 27, Pages: 3529-3540, ISSN: 1057-7149

Journal article

Kampouris C, Zafeiriou S, Ghosh A, 2018, Diffuse-specular separation using binary spherical gradient illumination, Eurographics Symposium on Rendering (EGSR) 2018, Publisher: The Eurographics Association, ISSN: 1727-3463

We introduce a novel method for view-independent diffuse-specular separation of albedo and photometric normals without requiring polarization, using binary spherical gradient illumination. The key idea is that with binary gradient illumination, a dielectric surface oriented towards the dark hemisphere exhibits pure diffuse reflectance while a surface oriented towards the bright hemisphere exhibits both diffuse and specular reflectance. We exploit this observation to formulate diffuse-specular separation based on color-space analysis of a surface's response to binary spherical gradients and their complements. The method does not impose restrictions on viewpoints and requires fewer photographs for multiview acquisition than polarized spherical gradient illumination. We further demonstrate an efficient two-shot capture using spectral multiplexing of the illumination that enables diffuse-specular separation of albedo and heuristic separation of photometric normals.

Conference paper

Zhou Y, Deng J, Zafeiriou S, 2018, Improve accurate pose alignment and action localization by dense pose estimation, 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG), Publisher: IEEE, Pages: 480-484, ISSN: 2326-5396

In this work we explore the use of shape-based representations as an auxiliary source of supervision for pose estimation and action recognition. We show that shape-based representations can act as a source of 'privileged information' that complements and extends pure landmark-level annotations. We explore 2D shape-based supervision signals, such as Support Vector Shape. Our experiments show that shape-based supervision signals substantially improve pose alignment accuracy in the form of a cascade architecture. We outperform state-of-the-art methods on the MPII and LSP datasets, while using substantially shallower networks. For action localization in untrimmed videos, our method introduces additional classification signals based on structured segment networks (SSN) and further improves performance. To be specific, dense human pose and landmark localization signals are involved in the detection process. We apply our network to all frames of videos, alongside the output from SSN, to further improve detection accuracy, especially for pose-related and sparsely annotated videos. The method achieves state-of-the-art performance on the Activity Detection Task of the ActivityNet Challenge 2017 test set and shows a remarkable improvement on pose-related and sparsely annotated categories, e.g. sports.

Conference paper

Songsri-in K, Trigeorgis G, Zafeiriou S, 2018, Deep & Deformable: Convolutional Mixtures of Deformable Part-based Models, 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG), Publisher: IEEE, Pages: 218-225, ISSN: 2326-5396

Conference paper

Trigeorgis G, Nicolaou M, Schuller B, Zafeiriou S, et al., 2018, Deep canonical time warping for simultaneous alignment and representation learning of sequences, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol: 40, Pages: 1128-1138, ISSN: 2160-9292

Machine learning algorithms for the analysis of time-series often depend on the assumption that the utilised data are temporally aligned. Any temporal discrepancies arising in the data are certain to lead to ill-generalisable models, which in turn fail to correctly capture properties of the task at hand. The temporal alignment of time-series is thus a crucial challenge manifesting in a multitude of applications. Nevertheless, the vast majority of algorithms oriented towards temporal alignment are either applied directly on the observation space or simply utilise linear projections, thus failing to capture complex, hierarchical non-linear representations that may prove beneficial, especially when dealing with multi-modal data (e.g., visual and acoustic information). To this end, we present Deep Canonical Time Warping (DCTW), a method that automatically learns non-linear representations of multiple time-series that are (i) maximally correlated in a shared subspace, and (ii) temporally aligned. Furthermore, we extend DCTW to a supervised setting, where during training, available labels can be utilised towards enhancing the alignment process. By means of experiments on four datasets, we show that the representations learnt significantly outperform state-of-the-art methods in temporal alignment, elegantly handling scenarios with heterogeneous feature sets, such as the temporal alignment of acoustic and visual information.
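
As a rough illustration of the alternation that DCTW builds on, here is a linear sketch (closer to canonical time warping than to the deep, supervised variants in the paper): dynamic time warping aligns the currently projected sequences, and CCA on the aligned samples updates the projections. The use of scikit-learn's CCA, the identity initialisation and the small iteration count are assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def dtw_path(A, B):
    """Dynamic-time-warping alignment path between sequences A (Ta, d) and B (Tb, d)."""
    Ta, Tb = len(A), len(B)
    cost = np.full((Ta + 1, Tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            d = np.linalg.norm(A[i - 1] - B[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    path, i, j = [], Ta, Tb
    while i > 0 and j > 0:                        # backtrack the optimal warping path
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def canonical_time_warping(X, Y, dim=2, n_iter=10):
    """Alternate between DTW alignment of the projected sequences and CCA on the
    aligned samples to update the (here linear) projections."""
    Vx = np.eye(X.shape[1])[:, :dim]              # assumes dim <= both feature dimensions
    Vy = np.eye(Y.shape[1])[:, :dim]
    path = None
    for _ in range(n_iter):
        path = dtw_path(X @ Vx, Y @ Vy)
        ix, iy = map(list, zip(*path))
        cca = CCA(n_components=dim).fit(X[ix], Y[iy])
        Vx, Vy = cca.x_rotations_, cca.y_rotations_
    return Vx, Vy, path
```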

Journal article

