733 results found
Xie Y, Liang R, Liang Z, et al., 2019, Speech Emotion Classification Using Attention-Based LSTM, IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, Vol: 27, Pages: 1675-1685, ISSN: 2329-9290
Ringeval F, Cummins N, Schuller B, et al., 2019, AVEC'19: Audio/visual emotion challenge and workshop, Pages: 2718-2719
© 2019 Copyright held by the owner/author(s). ACM ISBN 978-1-4503-6889-6/19/10. The ninth Audio/Visual Emotion Challenge and workshop, AVEC 2019, was held in conjunction with ACM Multimedia'19. This year, the AVEC series introduced major novelties with three distinct tasks: the State-of-Mind Sub-challenge (SoMS), the Detecting Depression with Artificial Intelligence Sub-challenge (DDS), and the Cross-cultural Emotion Sub-challenge (CES). The SoMS was based on a novel dataset (USoM corpus) that includes self-reported mood (10-point Likert scale) after the narration of personal stories (two positive and two negative). The DDS was based on a large extension of the DAIC-WOZ corpus (cf. AVEC 2016) that includes new recordings of patients suffering from depression, with the virtual agent conducting the interview being, this time, wholly driven by AI, i.e., without any human intervention. The CES was based on the SEWA dataset (cf. AVEC 2018), which has been extended with new participants in order to investigate how emotion knowledge of Western European cultures (German, Hungarian) can be transferred to the Chinese culture. In this summary, we mainly describe the participation and conditions of the AVEC Challenge.
Ringeval F, Schuller B, Valstar M, et al., 2019, AVEC 2019 chairs' welcome
Ringeval F, Schuller B, Valstar M, et al., 2019, AVEC 2019 workshop and challenge: State-of-mind, detecting depression with ai, and cross-cultural affect recognition, AVEC 2019 - Proceedings of the 9th International Audio/Visual Emotion Challenge and Workshop, co-located with MM 2019, Pages: 3-12
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM. The Audio/Visual Emotion Challenge and Workshop (AVEC 2019) "State-of-Mind, Detecting Depression with AI, and Cross-cultural Affect Recognition" is the ninth competition event aimed at the comparison of multimedia processing and machine learning methods for automatic audiovisual health and emotion analysis, with all participants competing strictly under the same conditions. The goal of the Challenge is to provide a common benchmark test set for multimodal information processing and to bring together the health and emotion recognition communities, as well as the audiovisual processing communities, to compare the relative merits of various approaches to health and emotion recognition from real-life data. This paper presents the major novelties introduced this year, the challenge guidelines, the data used, and the performance of the baseline systems on the three proposed tasks: state-of-mind recognition, depression assessment with AI, and cross-cultural affect sensing, respectively.
Kossaifi J, Walecki R, Panagakis Y, et al., 2019, SEWA DB: A rich database for audio-visual emotion and sentiment research in the wild., IEEE Transactions on Pattern Analysis and Machine Intelligence, ISSN: 0162-8828
Natural human-computer interaction and audio-visual human behaviour sensing systems that achieve robust performance in the wild are needed more than ever, as digital devices are becoming an indispensable part of our lives. Accurately annotated real-world data are the crux of devising such systems. However, existing databases usually consider controlled settings, low demographic variability, and a single task. In this paper, we introduce the SEWA database of more than 2000 minutes of audio-visual data of 398 people coming from six cultures, 50% female, and uniformly spanning the age range of 18 to 65 years old. Subjects were recorded in two different contexts: while watching adverts and while discussing adverts in a video chat. The database includes rich annotations of the recordings in terms of facial landmarks, facial action units (FAU), various vocalisations, mirroring, and continuously valued valence, arousal, liking, agreement, and prototypic examples of (dis)liking. This database aims to be an extremely valuable resource for researchers in affective computing and automatic human sensing and is expected to push forward research in human behaviour analysis, including cultural studies. Along with the database, we provide extensive baseline experiments for automatic FAU detection and automatic valence, arousal, and (dis)liking intensity estimation.
Janott C, Schmitt M, Heiser C, et al., 2019, VOTE versus ACLTE: comparison of two snoring noise classifications using machine learning methods, HNO, Vol: 67, Pages: 670-678, ISSN: 0017-6192
Schuller B, 2019, Responding to uncertainty in emotion recognition, JOURNAL OF INFORMATION COMMUNICATION & ETHICS IN SOCIETY, Vol: 17, Pages: 299-303, ISSN: 1477-996X
Kollias D, Tzirakis P, Nicolaou MA, et al., 2019, Deep Affect Prediction in-the-Wild: Aff-Wild Database and Challenge, Deep Architectures, and Beyond, International Journal of Computer Vision, Vol: 127, Pages: 907-929, ISSN: 0920-5691
Han J, Zhang Z, Cummins N, et al., 2019, Adversarial Training in Affective Computing and Sentiment Analysis: Recent Advances and Perspectives, IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE, Vol: 14, Pages: 68-81, ISSN: 1556-603X
Zhang Z, Han J, Coutinho E, et al., 2019, Dynamic difficulty awareness training for continuous emotion prediction, IEEE Transactions on Multimedia, Vol: 21, Pages: 1289-1301, ISSN: 1941-0077
Time-continuous emotion prediction has become an increasingly compelling task in machine learning. Considerable efforts have been made to advance the performance of these systems. Nonetheless, the main focus has been the development of more sophisticated models and the incorporation of different expressive modalities (e.g., speech, face, and physiology). In this paper, motivated by the benefit of difficulty awareness in human learning, we propose a novel machine learning framework, namely Dynamic Difficulty Awareness Training (DDAT), which sheds fresh light on the research: directly exploiting the difficulties in learning to boost the machine learning process. The DDAT framework consists of two stages: information retrieval and information exploitation. In the first stage, we make use of the reconstruction error of input features or the annotation uncertainty to estimate the difficulty of learning specific information. The obtained difficulty level is then used in tandem with the original features to update the model input in a second learning stage, with the expectation that the model learns to focus on high-difficulty regions of the learning process. We perform extensive experiments on a benchmark database (RECOLA) to evaluate the effectiveness of the proposed framework. The experimental results show that our approach outperforms related baselines as well as other well-established time-continuous emotion prediction systems, which suggests that dynamically integrating difficulty information into neural networks can help enhance the learning process.
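The two-stage idea behind DDAT can be caricatured in a few lines: a first pass estimates per-frame difficulty (here simply the reconstruction error of a stand-in "autoencoder"), and that difficulty score is appended to the original features before the second learning stage. This is a hedged illustration of the general scheme only, not the authors' RECOLA pipeline; all function names are hypothetical.

```python
# Hypothetical sketch of Dynamic Difficulty Awareness Training (DDAT):
# stage 1 estimates how hard each frame is to model (reconstruction
# error), stage 2 feeds that difficulty back as an extra input
# dimension. Illustrative only, not the authors' implementation.

def reconstruction_difficulty(frames, reconstruct):
    """Stage 1: difficulty = squared reconstruction error per frame."""
    return [sum((x - y) ** 2 for x, y in zip(f, reconstruct(f)))
            for f in frames]

def augment_with_difficulty(frames, difficulties):
    """Stage 2: concatenate the difficulty score to each frame."""
    return [f + [d] for f, d in zip(frames, difficulties)]

# Toy example: a lossy "autoencoder" that halves every coefficient.
frames = [[1.0, 2.0], [0.5, 0.5]]
recon = lambda f: [0.5 * x for x in f]
diff = reconstruction_difficulty(frames, recon)
augmented = augment_with_difficulty(frames, diff)
print(augmented)  # each frame now carries its own difficulty estimate
```

In the real framework the reconstruction model and the emotion predictor are trained networks; the point illustrated is only the data flow from difficulty estimation into the second-stage input.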
Han J, Zhang Z, Ren Z, et al., 2019, Implicit Fusion by Joint Audiovisual Training for Emotion Recognition in Mono Modality, Pages: 5861-5865, ISSN: 1520-6149
© 2019 IEEE. Despite significant advances in emotion recognition from one individual modality, previous studies fail to take advantage of other modalities to train models in mono-modal scenarios. In this work, we propose a novel joint training model which implicitly fuses audio and visual information in the training procedure for either speech or facial emotion recognition. Specifically, the model consists of one modality-specific network per individual modality and one shared network to map both audio and visual cues into final predictions. In the training process, we additionally take the loss from one auxiliary modality into account besides the main modality. To evaluate the effectiveness of the implicit fusion model, we conduct extensive experiments for mono-modal emotion classification and regression, and find that the implicit fusion models outperform the standard mono-modal training process.
Kim JY, Liu C, Calvo RA, et al., 2019, A comparison of online automatic speech recognition systems and the nonverbal responses to unintelligible speech, Publisher: arXiv
Automatic Speech Recognition (ASR) systems have proliferated over recent years to the point that free platforms such as YouTube now provide speech recognition services. Given the wide selection of ASR systems, we contribute to the field of automatic speech recognition by comparing the relative performance of two sets of manual transcriptions and five sets of automatic transcriptions (Google Cloud, IBM Watson, Microsoft Azure, Trint, and YouTube) to help researchers select accurate transcription services. In addition, we identify nonverbal behaviors that are associated with unintelligible speech, as indicated by high word error rates. We show that manual transcriptions remain superior to current automatic transcriptions. Amongst the automatic transcription services, YouTube offers the most accurate transcription service. For non-verbal behavioral involvement, we provide evidence that the variability of smile intensities from the listener is high (low) when the speaker is clear (unintelligible). These findings are derived from videoconferencing interactions between student doctors and simulated patients; therefore, we contribute towards both the ASR literature and the healthcare communication skills teaching community.
Zhang Z, Han J, Qian K, et al., Snore-GANs: improving automatic snore sound classification with synthesized data, IEEE Journal of Biomedical and Health Informatics, ISSN: 2168-2194
One of the frontier issues that severely hampers the development of automatic snore sound classification (ASSC) is the lack of sufficient supervised training data. To cope with this problem, we propose a novel data augmentation approach based on semi-supervised conditional Generative Adversarial Networks (scGANs), which aims to automatically learn a mapping strategy from a random noise space to the original data distribution. The proposed approach is capable of synthesising 'realistic' high-dimensional data while requiring no additional annotation process. To handle the mode collapse problem of GANs, we further introduce an ensemble strategy to enhance the diversity of the generated data. The systematic experiments conducted on the widely used Munich-Passau snore sound corpus demonstrate that the scGANs-based systems can remarkably outperform other classic data augmentation systems, and are also competitive with other recently reported systems for ASSC.
Qian K, Schmitt M, Janott C, et al., 2019, A Bag of Wavelet Features for Snore Sound Classification., Ann Biomed Eng, Vol: 47, Pages: 1000-1011
Snore sound (SnS) classification can support a targeted surgical approach to sleep-related breathing disorders. Using machine listening methods, we aim to find the location of obstruction and vibration within a subject's upper airway. Wavelet features have been demonstrated to be efficient in the recognition of SnSs in previous studies. In this work, we use a bag-of-audio-words approach to enhance the low-level wavelet features extracted from SnS data. A Naïve Bayes model was selected as the classifier based on its superiority in initial experiments. We use SnS data collected from 219 independent subjects under drug-induced sleep endoscopy performed at three medical centres. The unweighted average recall achieved by our proposed method is 69.4%, which significantly ([Formula: see text] one-tailed z-test) outperforms the official baseline (58.5%), and beats the winner (64.2%) of the INTERSPEECH ComParE Challenge 2017 Snoring sub-challenge. In addition, conventionally used features such as formants, mel-scale frequency cepstral coefficients, subband energy ratios, spectral frequency features, and the features extracted by the openSMILE toolkit are compared with our proposed feature set. The experimental results demonstrate the effectiveness of the proposed method in SnS classification.
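The bag-of-audio-words representation used above can be sketched generically: quantise each low-level feature frame to its nearest codebook entry, then describe the whole recording as a normalised histogram of codeword counts. This is a minimal illustration of the general technique, not the paper's wavelet-specific setup; the tiny fixed codebook below is hypothetical (in practice it would be learnt, e.g. by k-means, from training data).

```python
# Generic bag-of-audio-words sketch: per-frame vector quantisation
# against a codebook, followed by a normalised codeword histogram.
# Codebook and frames are toy values for illustration only.

def nearest_codeword(frame, codebook):
    """Index of the codebook entry closest to this feature frame."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(frame, codebook[i])))

def bag_of_audio_words(frames, codebook):
    """Recording-level representation: normalised codeword counts."""
    counts = [0] * len(codebook)
    for f in frames:
        counts[nearest_codeword(f, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts]

codebook = [[0.0, 0.0], [1.0, 1.0]]  # two toy codewords
frames = [[0.1, 0.2], [0.9, 1.1], [1.0, 0.8], [0.0, 0.1]]
hist = bag_of_audio_words(frames, codebook)
print(hist)  # → [0.5, 0.5]
```

The resulting fixed-length histogram can then be fed to any standard classifier (a Naïve Bayes model in the paper's case), regardless of how many frames the recording contains.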
Schuller B, 2019, Microexpressions: A Chance for Computers to Beat Humans at Detecting Hidden Emotions?, COMPUTER, Vol: 52, Pages: 4-5, ISSN: 0018-9162
Zhang Y, Michi A, Wagner J, et al., A Generic Human-Machine Annotation Framework Based on Dynamic Cooperative Learning, IEEE Transactions on Cybernetics
Obtaining meaningful annotations is tedious work, incurring considerable costs and time consumption. Dynamic active learning and cooperative learning are recently proposed approaches to reduce the human effort of annotating data with subjective phenomena. In this paper, we introduce a novel generic annotation framework, with the aim of achieving the optimal trade-off between label reliability and cost reduction by making efficient use of the human and machine work force. To this end, we use dropout to assess model uncertainty and thereby decide which instances can be automatically labeled by the machine and which ones require human inspection. In addition, we propose an early stopping criterion based on inter-rater agreement in order to focus human resources on those ambiguous instances that are difficult to label. In contrast to existing algorithms, the new confidence measures are applicable not only to binary classification tasks but also to regression problems. The proposed method is evaluated on the benchmark datasets for non-native English prosody estimation provided in the INTERSPEECH Computational Paralinguistics Challenge. As a result, the novel dynamic cooperative learning algorithm yields a Spearman's correlation coefficient of 0.424, compared to 0.413 with passive learning, while reducing the amount of human annotations by 74%.
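The core routing rule described above can be sketched as follows: run several stochastic (dropout) forward passes, and let the spread of the predictions decide whether the machine labels an instance or a human does. The pure-Python "networks" and the threshold below are hypothetical stand-ins, not the paper's models.

```python
import random
import statistics

# Hypothetical sketch of dropout-based uncertainty routing: an
# instance is machine-labelled only when repeated stochastic forward
# passes agree closely; otherwise it is queued for human annotation.

def route(model, x, threshold=0.05, passes=20):
    """Return ("machine", label) or ("human", None) for one instance."""
    preds = [model(x) for _ in range(passes)]   # MC-dropout passes
    uncertainty = statistics.pstdev(preds)      # spread = uncertainty
    if uncertainty < threshold:
        return "machine", statistics.fmean(preds)
    return "human", None

# Toy "networks": one stable under dropout, one not.
random.seed(0)
confident = lambda x: 0.8 + random.uniform(-0.01, 0.01)
unsure = lambda x: random.uniform(0.0, 1.0)
print(route(confident, x=None))  # low spread: machine labels it
print(route(unsure, x=None))     # high spread: escalated to a human
```

In the paper this spread-based confidence applies to both classification and regression outputs, which is what distinguishes it from earlier binary-only confidence measures.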
Schuller BW, 2019, IEEE Transactions on Affective Computing-On Novelty and Valence, IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, Vol: 10, Pages: 1-2, ISSN: 1949-3045
Xu X, Deng J, Coutinho E, et al., 2019, Connecting Subspace Learning and Extreme Learning Machine in Speech Emotion Recognition, IEEE Transactions on Multimedia, ISSN: 1520-9210
Speech Emotion Recognition (SER) is a powerful tool for endowing computers with the capacity to process information about the affective states of users in human-machine interactions. Recent research has shown the effectiveness of graph-embedding-based subspace learning and extreme learning machines applied to SER, but there are still various drawbacks in these two techniques that limit their application. Regarding subspace learning, the change from linearity to nonlinearity is usually achieved through kernelisation, while extreme learning machines only take label information into consideration at the output layer. In order to overcome these drawbacks, this paper leverages the extreme learning machine for dimensionality reduction and proposes a novel framework to combine spectral-regression-based subspace learning and the extreme learning machine. The proposed framework contains three stages: data mapping, graph decomposition, and regression. At the data mapping stage, various mapping strategies provide different views of the samples. At the graph decomposition stage, specifically designed embedding graphs provide a possibility to better represent the structure of the data by generating virtual coordinates. Finally, at the regression stage, dimension-reduced mappings are achieved by connecting the virtual coordinates and the data mapping. Using this framework, we propose several novel dimensionality reduction algorithms, apply them to SER tasks, and compare their performance to relevant state-of-the-art methods. Our results on several paralinguistic corpora show that our proposed techniques lead to significant improvements.
Grabowski K, Rynkiewicz A, Lassalle A, et al., 2019, Emotional expression in psychiatric conditions: New technology for clinicians, PSYCHIATRY AND CLINICAL NEUROSCIENCES, Vol: 73, Pages: 50-62, ISSN: 1323-1316
Demir F, Sengur A, Lu H, et al., 2019, COMPACT BILINEAR DEEP FEATURES FOR ENVIRONMENTAL SOUND RECOGNITION, International Conference on Artificial Intelligence and Data Processing (IDAP), Publisher: IEEE
Rizos G, Schuller B, 2019, MODELLING SAMPLE INFORMATIVENESS FOR DEEP AFFECTIVE COMPUTING, 44th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Publisher: IEEE, Pages: 3482-3486, ISSN: 1520-6149
Zhao Z, Bao Z, Zhao Y, et al., 2019, Exploring Deep Spectrum Representations via Attention-Based Recurrent and Convolutional Neural Networks for Speech Emotion Recognition, IEEE ACCESS, Vol: 7, Pages: 97515-97525, ISSN: 2169-3536
Xu X, Deng J, Cummins N, et al., 2019, Autonomous emotion learning in speech: A view of zero-shot speech emotion recognition, Pages: 949-953, ISSN: 2308-457X
Copyright © 2019 ISCA. Conventionally, speech emotion recognition is achieved using passive learning approaches. Differing from such approaches, we herein propose and develop a dynamic method of autonomous emotion learning based on zero-shot learning. The proposed methodology employs emotional dimensions as the attributes in the zero-shot learning paradigm, resulting in two phases of learning, namely attribute learning and label learning. Attribute learning connects the paralinguistic features and attributes utilising speech with known emotional labels, while label learning aims at defining unseen emotions through the attributes. The experimental results achieved on the CINEMO corpus indicate that zero-shot learning is a useful technique for autonomous speech-based emotion learning, achieving accuracies considerably better than chance level and an attribute-based gold-standard setup. Furthermore, different emotion recognition tasks, emotional attributes, and employed approaches strongly influence system performance.
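The two phases described above can be sketched generically: attribute learning maps speech features to emotional dimensions (e.g. valence and arousal), and label learning then assigns an unseen emotion by finding the nearest attribute prototype. The prototype coordinates below are illustrative assumptions, not values from the CINEMO corpus or the authors' setup.

```python
# Hypothetical zero-shot sketch: emotions are defined by prototype
# positions in an attribute space of (valence, arousal); an attribute
# model's output is labelled by its nearest prototype. Prototype
# values are illustrative only.

PROTOTYPES = {
    "joy":     (0.8, 0.7),
    "sadness": (-0.7, -0.5),
    "anger":   (-0.6, 0.8),
}

def zero_shot_label(attribute_vector, prototypes=PROTOTYPES):
    """Label learning: the closest prototype in attribute space wins."""
    return min(prototypes,
               key=lambda e: sum((a - p) ** 2
                                 for a, p in zip(attribute_vector,
                                                 prototypes[e])))

# Attribute learning would produce vectors like these from speech:
print(zero_shot_label((0.7, 0.6)))   # → joy
print(zero_shot_label((-0.5, 0.9)))  # → anger
```

The appeal of this scheme is that a new emotion category only needs a prototype in attribute space, not any labelled speech of its own.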
Guo Y, Zhao Z, Ma Y, et al., 2019, Speech augmentation via speaker-specific noise in unseen environment, Pages: 1781-1785, ISSN: 2308-457X
Copyright © 2019 ISCA. Speech augmentation is a common and effective strategy to avoid overfitting and improve the robustness of an emotion recognition model. In this paper, we investigate for the first time the intrinsic attributes of a speech signal using multi-resolution analysis theory and the Hilbert-Huang Spectrum, with the goal of developing a robust speech augmentation approach from raw speech data. Specifically, speech decomposition in a dual-tree complex wavelet transform domain is realised to obtain sub-speech signals; then, the Hilbert Spectrum using the Hilbert-Huang Transform is calculated for each sub-band to capture the noise content in unseen environments, with the voice restricted to 100-4000 Hz; finally, the speaker-specific noise that varies with the individual speaker, scenario, environment, and voice recording equipment can be reconstructed from the top two high-frequency sub-bands to enhance the raw signal. Our proposed speech augmentation is demonstrated using five robust machine learning architectures based on the RAVDESS database, achieving up to 9.3% higher accuracy compared to the performance on raw data for an emotion recognition task.
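The augmentation recipe can be caricatured with a one-level Haar transform standing in for the paper's dual-tree complex wavelet transform: decompose the signal, treat the high-frequency detail band as the noise estimate, and mix a scaled copy of it back into the raw signal. This is purely illustrative; the scale factor and the Haar substitution are assumptions, not the authors' method.

```python
# Caricature of wavelet-based speech augmentation: the high-frequency
# detail band of a one-level Haar DWT (a stand-in for the paper's
# dual-tree complex wavelet transform) is reused as a "noise" estimate
# and scaled back onto the raw signal. Illustrative only.

def haar_decompose(signal):
    """One-level Haar DWT: (approximation, detail), even-length input."""
    approx = [(signal[i] + signal[i + 1]) / 2
              for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2
              for i in range(0, len(signal), 2)]
    return approx, detail

def augment(signal, scale=0.5):
    """Perturb the signal with its own high-frequency content."""
    _, detail = haar_decompose(signal)
    noise = [d for d in detail for _ in (0, 1)]  # upsample to full length
    return [s + scale * n for s, n in zip(signal, noise)]

raw = [1.0, 3.0, 2.0, 2.0]
print(augment(raw))  # → [0.5, 2.5, 2.0, 2.0]
```

Because the added perturbation is derived from the signal itself, the augmented copy stays plausible for the same speaker and recording conditions, which is the intuition the paper exploits.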
Tzirakis P, Nicolaou MA, Schuller B, et al., 2019, Time-series Clustering with Jointly Learning Deep Representations, Clusters and Temporal Boundaries, 14th IEEE International Conference on Automatic Face and Gesture Recognition (FG), Publisher: IEEE, Pages: 438-442, ISSN: 2326-5396
Pandit V, Schmitt M, Cummins N, et al., 2019, I know how you feel now, and here's why!: Demystifying Time-continuous High Resolution Text-based Affect Predictions In theWild, 32nd IEEE International Symposium on Computer-Based Medical Systems (IEEE CBMS), Publisher: IEEE, Pages: 465-470, ISSN: 2372-9198
Schuller B, Weninger F, Zhang Y, et al., 2019, Affective and behavioural computing: Lessons learnt from the First Computational Paralinguistics Challenge, COMPUTER SPEECH AND LANGUAGE, Vol: 53, Pages: 156-180, ISSN: 0885-2308
Han J, Zhang Z, Ren Z, et al., 2019, EmoBed: Strengthening Monomodal Emotion Recognition via Training with Crossmodal Emotion Embeddings, IEEE Transactions on Affective Computing, Pages: 1-1
Lingenfelser F, Wagner J, Deng J, et al., 2018, Asynchronous and Event-Based Fusion Systems for Affect Recognition on Naturalistic Data in Comparison to Conventional Approaches, IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, Vol: 9, Pages: 410-423, ISSN: 1949-3045
This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.