Publications
928 results found
Kossaifi J, Walecki R, Panagakis Y, et al., 2021, SEWA DB: A rich database for audio-visual emotion and sentiment research in the wild, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol: 43, Pages: 1022-1040, ISSN: 0162-8828
Natural human–computer interaction and audio-visual human behaviour sensing systems that achieve robust performance in the wild are needed more than ever, as digital devices are becoming an indispensable part of our lives. Accurately annotated real-world data are the crux of devising such systems. However, existing databases usually consider controlled settings, low demographic variability, and a single task. In this paper, we introduce the SEWA database of more than 2000 minutes of audio-visual data of 398 people from six cultures, 50% female, uniformly spanning the age range of 18 to 65 years. Subjects were recorded in two different contexts: while watching adverts and while discussing the adverts in a video chat. The database includes rich annotations of the recordings in terms of facial landmarks, facial action units (FAU), various vocalisations, mirroring, continuously valued valence, arousal, liking, and agreement, and prototypic examples of (dis)liking. The database aims to be an extremely valuable resource for researchers in affective computing and automatic human sensing, and is expected to push forward research in human behaviour analysis, including cultural studies. Along with the database, we provide extensive baseline experiments for automatic FAU detection and automatic valence, arousal, and (dis)liking intensity estimation.
Cheng J, Liang R, Liang Z, et al., 2021, A Deep Adaptation Network for Speech Enhancement: Combining a Relativistic Discriminator With Multi-Kernel Maximum Mean Discrepancy, IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, Vol: 29, Pages: 41-53, ISSN: 2329-9290
Han J, Zhang Z, Pantic M, et al., 2021, Internet of emotional people: Towards continual affective computing cross cultures via audiovisual signals, FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, Vol: 114, Pages: 294-306, ISSN: 0167-739X
Pandit V, Schmitt M, Cummins N, et al., 2020, I see it in your eyes: Training the shallowest-possible CNN to recognise emotions and pain from muted web-assisted in-the-wild video-chats in real-time, INFORMATION PROCESSING & MANAGEMENT, Vol: 57, ISSN: 0306-4573
Zhang Z, Metaxas DN, Lee H-Y, et al., 2020, Guest Editorial Special Issue on Adversarial Learning in Computational Intelligence, IEEE Transactions on Emerging Topics in Computational Intelligence, Vol: 4, Pages: 414-416
Dong F, Qian K, Ren Z, et al., 2020, Machine Listening for Heart Status Monitoring: Introducing and Benchmarking HSS-The Heart Sounds Shenzhen Corpus, IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, Vol: 24, Pages: 2082-2092, ISSN: 2168-2194
Han J, Zhang Z, Ren Z, et al., 2020, Exploring Perception Uncertainty for Emotion Recognition in Dyadic Conversation and Music Listening, COGNITIVE COMPUTATION, ISSN: 1866-9956
Amiriparian S, Cummins N, Gerczuk M, et al., 2020, "Are You Playing a Shooter Again?!" Deep Representation Learning for Audio-Based Video Game Genre Recognition, IEEE TRANSACTIONS ON GAMES, Vol: 12, Pages: 145-154, ISSN: 2475-1502
Parada-Cabaleiro E, Costantini G, Batliner A, et al., 2020, DEMoS: an Italian emotional speech corpus. Elicitation methods, machine learning, and perception, LANGUAGE RESOURCES AND EVALUATION, Vol: 54, Pages: 341-383, ISSN: 1574-020X
Schuller DM, Schuller BW, 2020, A Review on Five Recent and Near-Future Developments in Computational Processing of Emotion in the Human Voice, EMOTION REVIEW, Vol: 13, Pages: 44-50, ISSN: 1754-0739
Kaklauskas A, Zavadskas EK, Schuller B, et al., 2020, Customized ViNeRS Method for Video Neuro-Advertising of Green Housing, INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH, Vol: 17
Wu P, Sun X, Zhao Z, et al., 2020, Classification of Lung Nodules Based on Deep Residual Networks and Migration Learning, COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE, Vol: 2020, ISSN: 1687-5265
Pokorny FB, Bartl-Pokorny KD, Zhang D, et al., 2020, Efficient Collection and Representation of Preverbal Data in Typical and Atypical Development, JOURNAL OF NONVERBAL BEHAVIOR, Vol: 44, Pages: 419-436, ISSN: 0191-5886
Zhao Z, Bao Z, Zhang Z, et al., 2020, Automatic Assessment of Depression From Speech via a Hierarchical Attention Transfer Network and Attention Autoencoders, IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, Vol: 14, Pages: 423-434, ISSN: 1932-4553
Deng J, Schuller B, Eyben F, et al., 2020, Exploiting time-frequency patterns with LSTM-RNNs for low-bitrate audio restoration, NEURAL COMPUTING & APPLICATIONS, Vol: 32, Pages: 1095-1107, ISSN: 0941-0643
Parada-Cabaleiro E, Batliner A, Baird A, et al., 2020, The perception of emotional cues by children in artificial background noise, INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, Vol: 23, Pages: 169-182, ISSN: 1381-2416
Zhang Z, Han J, Qian K, et al., 2020, Snore-GANs: improving automatic snore sound classification with synthesized data, IEEE Journal of Biomedical and Health Informatics, Vol: 24, Pages: 300-310, ISSN: 2168-2194
One of the frontier issues that severely hampers the development of automatic snore sound classification (ASSC) is the lack of sufficient supervised training data. To cope with this problem, we propose a novel data augmentation approach based on semi-supervised conditional Generative Adversarial Networks (scGANs), which aims to automatically learn a mapping from a random noise space to the original data distribution. The proposed approach is able to synthesise ‘realistic’ high-dimensional data well, while requiring no additional annotation. To handle the mode collapse problem of GANs, we further introduce an ensemble strategy to enhance the diversity of the generated data. Systematic experiments conducted on the widely used Munich-Passau snore sound corpus demonstrate that the scGAN-based systems remarkably outperform other classic data augmentation systems, and are also competitive with other recently reported systems for ASSC.
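To make the label-conditioned augmentation idea above concrete, the following is a minimal sketch of a plain conditional generator/discriminator pair over fixed-length acoustic feature vectors; the semi-supervised and ensemble aspects of the paper are omitted, and the architecture, dimensions, and class count are assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a conditional GAN-style generator/discriminator for
# synthesising fixed-length acoustic feature vectors per snore class.
import torch
import torch.nn as nn

NOISE_DIM, FEAT_DIM, N_CLASSES = 100, 384, 4  # hypothetical sizes

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + N_CLASSES, 256), nn.ReLU(),
            nn.Linear(256, FEAT_DIM), nn.Tanh(),
        )
    def forward(self, z, y_onehot):
        # Condition the noise vector on the class label.
        return self.net(torch.cat([z, y_onehot], dim=1))

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM + N_CLASSES, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
        )
    def forward(self, x, y_onehot):
        return self.net(torch.cat([x, y_onehot], dim=1))

# Sample synthetic examples of class 2 from a (trained) generator as augmentation data.
G = Generator()
z = torch.randn(8, NOISE_DIM)
y = torch.nn.functional.one_hot(torch.full((8,), 2), N_CLASSES).float()
synthetic = G(z, y)  # (8, FEAT_DIM) augmentation candidates
```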
Haque KN, Rana R, Schuller BW, 2020, High-Fidelity Audio Generation and Representation Learning With Guided Adversarial Autoencoder, IEEE ACCESS, Vol: 8, Pages: 223509-223528, ISSN: 2169-3536
Amiriparian S, Schmitt M, Ottl S, et al., 2020, Deep unsupervised representation learning for audio-based medical applications, Intelligent Systems Reference Library, Pages: 137-164
Feature learning denotes a set of approaches for transforming raw input data into representations that can be effectively utilised in solving machine learning problems. Classifiers or regressors require training data that is computationally suitable to process. However, real-world data, e.g., an audio recording of a group of people talking in a park whilst, in the background, a dog is barking and a musician is playing the guitar, or health-related data such as coughing and sneezing recorded by consumer smartphones, is remarkably variable and complex in nature. For understanding such data, developing expert-designed, hand-crafted features often demands an exhaustive amount of time and resources. Another disadvantage of such features is their lack of generalisation, i.e., new features need to be re-engineered for new tasks. It is therefore essential to develop automatic representation learning methods. In this chapter, we first discuss the preliminaries of contemporary representation learning techniques for computer audition tasks, differentiating between approaches based on Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). We then introduce and evaluate three state-of-the-art deep learning systems for unsupervised representation learning from raw audio: (1) pre-trained image classification CNNs, (2) a deep convolutional generative adversarial network (DCGAN), and (3) a recurrent sequence-to-sequence autoencoder (S2SAE). For each of these algorithms, the representations are obtained from the spectrograms of the input audio data. Finally, for a range of audio-based machine learning tasks, including abnormal heart sound classification, snore sound classification, and bipolar disorder recognition, we evaluate the efficacy of the deep representations, which are: (i) the activations of the fully connected layers of the pre-trained CNNs, (ii) the activations of the discriminator of the DCGAN, and (iii) the representations learnt by the S2SAE.
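For the first of the three systems mentioned above, the following is a minimal sketch, assuming a Mel-spectrogram rendered as a three-channel image and the activations of the first fully connected layer of a pre-trained ImageNet CNN as the clip-level representation; the sampling rate, image size, and choice of AlexNet/fc6 are assumptions, not the chapter's exact setup.

```python
# Minimal sketch: unsupervised audio representation via a pre-trained image CNN.
import librosa
import numpy as np
import torch
from torchvision import models, transforms

def spectrogram_tensor(wav_path, sr=16000, n_mels=128):
    y, _ = librosa.load(wav_path, sr=sr)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    S_db = librosa.power_to_db(S, ref=np.max)
    S_norm = (S_db - S_db.min()) / (S_db.max() - S_db.min() + 1e-8)
    # Replicate to three channels; in practice a colour map may be applied instead.
    return torch.from_numpy(S_norm).float().unsqueeze(0).repeat(3, 1, 1)

cnn = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
resize = transforms.Resize((224, 224))

@torch.no_grad()
def deep_representation(wav_path):
    x = resize(spectrogram_tensor(wav_path)).unsqueeze(0)  # (1, 3, 224, 224)
    h = torch.flatten(cnn.avgpool(cnn.features(x)), 1)      # (1, 9216)
    return cnn.classifier[1](h).squeeze(0)                  # 4096-d fc6 activations

# embedding = deep_representation("heart_sound.wav")
```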
Schuller DM, Schuller BW, 2020, The Challenge of Automatic Eating Behaviour Analysis and Tracking, Intelligent Systems Reference Library, Pages: 187-204
Computer-based tracking of eating behaviour has recently attracted great interest, drawing on a broad range of modalities such as audio, video, and movement sensors, particularly in wearable everyday settings. Here, we provide an extensive insight into the current state of play in automatic tracking, with a broad view of the sensors and information used to date. The chapter is largely guided by, and includes results from, the Interspeech 2015 Computational Paralinguistics Challenge (ComParE) Eating Sub-Challenge and the audio/visual Eating Analysis and Tracking (EAT) 2018 Challenge, both co-organised by the last author. The relevance is given by use-cases in health care and wellbeing including, amongst others, assistive technologies for individuals with eating disorders, which potentially lead to under- or overeating, or special health conditions such as diabetes. The chapter touches upon different feature representations, including feature brute-forcing, bag-of-audio-words representations, and deep end-to-end learning from raw sensor signals. It further reports on machine learning approaches used in the field, including deep learning and conventional approaches. In the conclusion, the chapter also discusses usability aspects that foster optimal adherence, such as sensor placement, energy consumption, explainability, and privacy.
Yang Z, Qian K, Ren Z, et al., 2020, Learning multi-resolution representations for acoustic scene classification via neural networks, Pages: 133-143, ISBN: 9789811527555
This study investigates the performance of wavelet features as well as conventional temporal and spectral features for acoustic scene classification, testing the effectiveness of both feature sets when combined with neural networks. The TUT Acoustic Scenes 2017 database is used to evaluate the system. The model with wavelet energy features achieved 74.8% and 60.2% on the development and evaluation sets, respectively, which is better than the model using the temporal and spectral feature set (72.9% and 59.4%). Additionally, to optimise the generalisation and robustness of the models, a decision fusion method based on the posterior probability of each audio scene is used. Compared with the baseline system of the Detection and Classification of Acoustic Scenes and Events 2017 (DCASE 2017) challenge, the best decision fusion model achieves 79.2% and 63.8% on the development and evaluation sets, respectively; both results significantly exceed the baseline results of 74.8% and 61.0% (confirmed by a one-tailed z-test, p < 0.01 and p < 0.05, respectively).
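As a simple illustration of posterior-probability decision fusion of the kind described above, the sketch below averages per-clip class posteriors from several models and takes the arg-max; the plain averaging rule and the toy numbers are assumptions for illustration, not necessarily the exact fusion rule used in the paper.

```python
# Fuse per-clip class posteriors from multiple acoustic-scene models.
import numpy as np

def fuse_posteriors(posteriors_per_model):
    """posteriors_per_model: list of arrays of shape (n_clips, n_classes)."""
    stacked = np.stack(posteriors_per_model, axis=0)  # (n_models, n_clips, n_classes)
    fused = stacked.mean(axis=0)                      # average posterior per class
    return fused.argmax(axis=1)                       # fused scene label per clip

# Example: fuse a wavelet-feature model with a temporal/spectral-feature model.
p_wavelet  = np.array([[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]])
p_spectral = np.array([[0.5, 0.4, 0.1], [0.2, 0.6, 0.2]])
print(fuse_posteriors([p_wavelet, p_spectral]))       # -> [0 1]
```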
Keren G, Sabato S, Schuller B, 2020, Analysis of loss functions for fast single-class classification, KNOWLEDGE AND INFORMATION SYSTEMS, Vol: 62, Pages: 337-358, ISSN: 0219-1377
Latif S, Rana R, Khalifa S, et al., 2020, Multi-Task Semi-Supervised Adversarial Autoencoding for Speech Emotion Recognition, IEEE Transactions on Affective Computing
Despite the emerging importance of Speech Emotion Recognition (SER), the state-of-the-art accuracy is quite low and needs improvement to make commercial applications of SER viable. A key underlying reason for the low accuracy is the scarcity of emotion datasets, which is a challenge for developing any robust machine learning model in general. In this paper, we propose a solution to this problem: a multi-task learning framework that uses auxiliary tasks for which data are abundantly available. We show that utilising this additional data can improve the primary task of SER, for which only limited labelled data are available. In particular, we use gender identification and speaker recognition as auxiliary tasks, which allows the use of very large datasets, e.g., speaker classification datasets. To maximise the benefit of multi-task learning, we further use an adversarial autoencoder (AAE) within our framework, which has a strong capability to learn powerful and discriminative features. Furthermore, the unsupervised AAE in combination with the supervised classification networks enables semi-supervised learning, incorporating a discriminative component into the AAE's unsupervised training pipeline. The proposed model is rigorously evaluated for categorical and dimensional emotion recognition and cross-corpus scenarios. Experimental results demonstrate that the proposed model achieves state-of-the-art performance on two publicly available datasets.
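The multi-task idea above (auxiliary gender and speaker tasks supporting the primary emotion task) can be sketched with a shared encoder and per-task heads; the code below is such a minimal sketch and does not include the paper's adversarial autoencoder, and the feature dimension, class counts, and loss weights are assumptions.

```python
# Minimal multi-task SER sketch: shared encoder, separate task heads, weighted loss sum.
import torch
import torch.nn as nn

FEAT_DIM, N_EMOTIONS, N_SPEAKERS = 80, 4, 100  # hypothetical sizes

class MultiTaskSER(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(FEAT_DIM, 128), nn.ReLU())
        self.emotion_head = nn.Linear(128, N_EMOTIONS)   # primary task
        self.gender_head = nn.Linear(128, 2)             # auxiliary task
        self.speaker_head = nn.Linear(128, N_SPEAKERS)   # auxiliary task
    def forward(self, x):
        h = self.encoder(x)
        return self.emotion_head(h), self.gender_head(h), self.speaker_head(h)

model, ce = MultiTaskSER(), nn.CrossEntropyLoss()
x = torch.randn(16, FEAT_DIM)
y_emo = torch.randint(N_EMOTIONS, (16,))
y_gen = torch.randint(2, (16,))
y_spk = torch.randint(N_SPEAKERS, (16,))
e, g, s = model(x)
loss = ce(e, y_emo) + 0.3 * ce(g, y_gen) + 0.3 * ce(s, y_spk)  # assumed weights
loss.backward()
```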
Littmann M, Selig K, Cohen-Lavi L, et al., 2020, Validity of machine learning in biology and medicine increased through collaborations across fields of expertise, NATURE MACHINE INTELLIGENCE, Vol: 2, Pages: 18-24
Zhang Z, Qian K, Schuller BW, et al., 2020, An Online Robot Collision Detection and Identification Scheme by Supervised Learning and Bayesian Decision Theory, IEEE Transactions on Automation Science and Engineering, Pages: 1-13, ISSN: 1545-5955
Rizos G, Schuller BW, 2020, Average Jane, Where Art Thou? – Recent Avenues in Efficient Machine Learning Under Subjectivity Uncertainty, Information Processing and Management of Uncertainty in Knowledge-Based Systems, Publisher: Springer International Publishing, Pages: 42-55, ISBN: 9783030501457
Marchi E, Schuller B, Baird A, et al., 2019, The ASC-Inclusion Perceptual Serious Gaming Platform for Autistic Children, IEEE TRANSACTIONS ON GAMES, Vol: 11, Pages: 328-339, ISSN: 2475-1502
Amiriparian S, Han J, Schmitt M, et al., 2019, Synchronization in Interpersonal Speech, FRONTIERS IN ROBOTICS AND AI, Vol: 6, ISSN: 2296-9144
Amiriparian S, Ottl S, Gerczuk M, et al., 2019, Audio-based eating analysis and tracking utilising deep spectrum features
This paper proposes a deep learning system for audio-based eating analysis on the ICMI 2018 Eating Analysis and Tracking (EAT) challenge corpus. We utilise Deep Spectrum features, which are image classification convolutional neural network (CNN) descriptors. We extract the Deep Spectrum features by forwarding Mel-spectrograms of the input audio through deep, task-independent, pre-trained CNNs, including AlexNet and VGG16. We then use the activations of the first (fc6), second (fc7), and third (fc8) fully connected layers of these networks as feature vectors. We obtain the best classification result by using the first fully connected layer (fc6) of AlexNet to extract features from Mel-spectrograms computed with a window size of 160 ms, a hop size of 80 ms, and a viridis colour map. Finally, we build Bag-of-Deep-Features (BoDF) representations, i.e., the quantisation of the Deep Spectrum features. In comparison to the best baseline results on the test partitions of the Food Type and Likability sub-challenges, unweighted average recall is increased from 67.2% to 79.9% and from 54.2% to 56.1%, respectively. For the test partition of the Difficulty sub-challenge, the concordance correlation coefficient is increased from .506 to .509.
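The Bag-of-Deep-Features step described above can be illustrated as codebook quantisation of per-window Deep Spectrum vectors followed by a recording-level histogram; the sketch below assumes a k-means codebook and an assumed codebook size, and is not the authors' exact pipeline.

```python
# Sketch of a Bag-of-Deep-Features (BoDF) representation via k-means quantisation.
import numpy as np
from sklearn.cluster import KMeans

def fit_codebook(window_features, n_words=500):
    """window_features: (n_windows_total, feat_dim) Deep Spectrum vectors from training data."""
    return KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(window_features)

def bodf_histogram(codebook, recording_features):
    """recording_features: (n_windows, feat_dim) vectors of one recording."""
    words = codebook.predict(recording_features)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)  # fixed-length, normalised recording-level vector
```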