Zhang Z, Metaxas DN, Lee H-Y, et al., 2020, Guest Editorial Special Issue on Adversarial Learning in Computational Intelligence, IEEE Transactions on Emerging Topics in Computational Intelligence, Vol: 4, Pages: 414-416
Dong F, Qian K, Ren Z, et al., 2020, Machine Listening for Heart Status Monitoring: Introducing and Benchmarking HSS-The Heart Sounds Shenzhen Corpus, IEEE Journal of Biomedical and Health Informatics, Vol: 24, Pages: 2082-2092, ISSN: 2168-2194
Parada-Cabaleiro E, Costantini G, Batliner A, et al., 2020, DEMoS: an Italian emotional speech corpus. Elicitation methods, machine learning, and perception, Language Resources and Evaluation, Vol: 54, Pages: 341-383, ISSN: 1574-020X
Amiriparian S, Cummins N, Gerczuk M, et al., 2020, "Are You Playing a Shooter Again?!" Deep Representation Learning for Audio-Based Video Game Genre Recognition, IEEE Transactions on Games, Vol: 12, Pages: 145-154, ISSN: 2475-1502
Schuller DM, Schuller BW, 2020, A Review on Five Recent and Near-Future Developments in Computational Processing of Emotion in the Human Voice, Emotion Review, ISSN: 1754-0739
Kaklauskas A, Zavadskas EK, Schuller B, et al., 2020, Customized ViNeRS Method for Video Neuro-Advertising of Green Housing, International Journal of Environmental Research and Public Health, Vol: 17
Wu P, Sun X, Zhao Z, et al., 2020, Classification of Lung Nodules Based on Deep Residual Networks and Migration Learning, Computational Intelligence and Neuroscience, Vol: 2020, ISSN: 1687-5265
Pokorny FB, Bartl-Pokorny KD, Zhang D, et al., 2020, Efficient Collection and Representation of Preverbal Data in Typical and Atypical Development, Journal of Nonverbal Behavior, ISSN: 0191-5886
Deng J, Schuller B, Eyben F, et al., 2020, Exploiting time-frequency patterns with LSTM-RNNs for low-bitrate audio restoration, Neural Computing and Applications, Vol: 32, Pages: 1095-1107, ISSN: 0941-0643
Zhao Z, Bao Z, Zhang Z, et al., 2020, Automatic Assessment of Depression From Speech via a Hierarchical Attention Transfer Network and Attention Autoencoders, IEEE Journal of Selected Topics in Signal Processing, Vol: 14, Pages: 423-434, ISSN: 1932-4553
Parada-Cabaleiro E, Batliner A, Baird A, et al., 2020, The perception of emotional cues by children in artificial background noise, International Journal of Speech Technology, Vol: 23, Pages: 169-182, ISSN: 1381-2416
Keren G, Sabato S, Schuller B, 2020, Analysis of loss functions for fast single-class classification, Knowledge and Information Systems, Vol: 62, Pages: 337-358, ISSN: 0219-1377
Amiriparian S, Schmitt M, Ottl S, et al., 2020, Deep unsupervised representation learning for audio-based medical applications, Intelligent Systems Reference Library, Pages: 137-164
© Springer Nature Switzerland AG 2020. Feature learning denotes a set of approaches for transforming raw input data into representations that can be effectively utilised in solving machine learning problems. Classifiers or regressors require training data which is computationally suitable to process. However, real-world data, e.g., an audio recording of a group of people talking in a park whilst in the background a dog is barking and a musician is playing the guitar, or health-related data such as coughing and sneezing recorded by consumer smartphones, is of a remarkably variable and complex nature. For understanding such data, developing expert-designed, hand-crafted features often demands an exhaustive amount of time and resources. Another disadvantage of such features is the lack of generalisation, i.e., new features need to be re-engineered for new tasks. It is therefore essential to develop automatic representation learning methods. In this chapter, we first discuss the preliminaries of contemporary representation learning techniques for computer audition tasks, differentiating between approaches based on Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). We then introduce and evaluate three state-of-the-art deep learning systems for unsupervised representation learning from raw audio: (1) pre-trained image classification CNNs, (2) a deep convolutional generative adversarial network (DCGAN), and (3) a recurrent sequence-to-sequence autoencoder (S2SAE). For each of these algorithms, the representations are obtained from the spectrograms of the input audio data. Finally, for a range of audio-based machine learning tasks, including abnormal heart sound classification, snore sound classification, and bipolar disorder recognition, we evaluate the efficacy of the deep representations, which are: (i) the activations of the fully connected layers of the pre-trained CNNs, (ii) the activations of the discriminator of the trained DCGAN, and (iii) the learnt representations of the S2SAE.
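All three representation-learning systems described in this abstract take spectrograms of the input audio as their starting point. As a minimal, framework-free sketch of this common front-end (the parameter values below are illustrative, not those of the chapter, which uses Mel-spectrograms), a magnitude spectrogram can be computed with a Hann-windowed short-time Fourier transform:

```python
import numpy as np

def magnitude_spectrogram(signal, win_len=256, hop=128):
    """Compute a magnitude spectrogram via a Hann-windowed STFT."""
    window = np.hanning(win_len)
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        frame = signal[start:start + win_len] * window
        # Keep only the non-negative frequency bins of the real FFT.
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)  # shape: (n_frames, win_len // 2 + 1)

# Toy input: one second of a 1 kHz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
spec = magnitude_spectrogram(np.sin(2 * np.pi * 1000 * t))
```

With a 256-sample window at 16 kHz, each frequency bin spans 62.5 Hz, so the 1 kHz tone concentrates its energy around bin 16; such spectrograms (typically rendered as images) are what the pre-trained CNNs, the DCGAN, and the S2SAE consume.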
Zhang Z, Han J, Qian K, et al., 2020, Snore-GANs: improving automatic snore sound classification with synthesized data, IEEE Journal of Biomedical and Health Informatics, Vol: 24, Pages: 300-310, ISSN: 2168-2194
One of the frontier issues severely hampering the development of automatic snore sound classification (ASSC) is the lack of sufficient supervised training data. To cope with this problem, we propose a novel data augmentation approach based on semi-supervised conditional Generative Adversarial Networks (scGANs), which aims to automatically learn a mapping from a random noise space to the original data distribution. The proposed approach is capable of synthesising ‘realistic’ high-dimensional data, while requiring no additional annotation process. To handle the mode collapse problem of GANs, we further introduce an ensemble strategy to enhance the diversity of the generated data. Systematic experiments conducted on the widely used Munich-Passau snore sound corpus demonstrate that the scGANs-based systems can remarkably outperform classic data augmentation systems, and are also competitive with other recently reported systems for ASSC.
Latif S, Rana R, Khalifa S, et al., 2020, Multi-Task Semi-Supervised Adversarial Autoencoding for Speech Emotion Recognition, IEEE Transactions on Affective Computing
Despite the emerging importance of Speech Emotion Recognition (SER), the state-of-the-art accuracy is quite low and needs improvement to make commercial applications of SER viable. A key underlying reason for the low accuracy is the scarcity of emotion datasets, which is a challenge for developing any robust machine learning model in general. In this paper, we propose a solution to this problem: a multi-task learning framework that uses auxiliary tasks for which data is abundantly available. We show that utilisation of this additional data can improve the primary task of SER, for which only limited labelled data is available. In particular, we use gender identification and speaker recognition as auxiliary tasks, which allow the use of very large datasets, e.g., speaker classification datasets. To maximise the benefit of multi-task learning, we further use an adversarial autoencoder (AAE) within our framework, which has a strong capability to learn powerful and discriminative features. Furthermore, the unsupervised AAE in combination with the supervised classification networks enables semi-supervised learning, which incorporates a discriminative component in the AAE unsupervised training pipeline. The proposed model is rigorously evaluated for categorical and dimensional emotion, and cross-corpus scenarios. Experimental results demonstrate that the proposed model achieves state-of-the-art performance on two publicly available datasets.
Costin H, Schuller B, Florea AM, 2020, Preface
Schuller DM, Schuller BW, 2020, The Challenge of Automatic Eating Behaviour Analysis and Tracking, Intelligent Systems Reference Library, Pages: 187-204
© 2020, Springer Nature Switzerland AG. Computer-based tracking of eating behaviour has recently been attracting great interest across a broad range of modalities, such as audio, video, and movement sensors, in particular in wearable everyday settings. Here, we provide an extensive insight into the current state of play for automatic tracking, with a broad view on the sensors and information used up to this point. The chapter is largely guided by, and includes results from, the Interspeech 2015 Computational Paralinguistics Challenge (ComParE) Eating Sub-Challenge and the audio/visual Eating Analysis and Tracking (EAT) 2018 Challenge, both co-organised by the last author. The relevance is given by use-cases in health care and wellbeing including, amongst others, assistive technologies for individuals with eating disorders potentially leading either to under- or overeating, or special health conditions such as diabetes. The chapter touches upon different feature representations, including feature brute-forcing, bag-of-audio-word representations, and deep end-to-end learning from a raw sensor signal. It further reports on machine learning approaches used in the field, including deep learning and conventional approaches. In the conclusion, the chapter also discusses usability aspects to foster optimal adherence, such as sensor placement, energy consumption, explainability, and privacy.
Yang Z, Qian K, Ren Z, et al., 2020, Learning multi-resolution representations for acoustic scene classification via neural networks, Pages: 133-143, ISBN: 9789811527555
© Springer Nature Singapore Pte Ltd 2020. This study investigates the performance of wavelet features as well as conventional temporal and spectral features for acoustic scene classification, testing the effectiveness of both feature sets when combined with neural networks. The TUT Acoustic Scenes 2017 Database is used in the evaluation of the system. The model with the wavelet energy features achieved 74.8% and 60.2% on the development and evaluation sets, respectively, which is better than the model using the temporal and spectral feature set (72.9% and 59.4%). Additionally, to optimise the generalisation and robustness of the models, a decision fusion method based on the posterior probability of each audio scene is used. Compared with the baseline system of the Detection and Classification of Acoustic Scenes and Events 2017 (DCASE 2017) challenge, the best decision fusion model achieves 79.2% and 63.8% on the development and evaluation sets, respectively, where both results significantly exceed the baseline results of 74.8% and 61.0% (confirmed by a one-tailed z-test, p < 0.01 and p < 0.05, respectively).
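The decision-fusion step mentioned in this abstract can be illustrated as simple late fusion: average the per-class posterior probabilities of the individual models and decide by argmax. This is a hedged reconstruction of the general idea, not the exact scheme of the paper (the model outputs below are invented toy values):

```python
import numpy as np

def fuse_posteriors(posteriors):
    """Late decision fusion: average the per-class posterior probabilities
    of several models, then decide by argmax.
    `posteriors` has shape (n_models, n_classes)."""
    mean_post = np.mean(posteriors, axis=0)
    return int(np.argmax(mean_post)), mean_post

# Two hypothetical models scoring three acoustic scene classes.
p = np.array([[0.2, 0.5, 0.3],
              [0.1, 0.3, 0.6]])
label, fused = fuse_posteriors(p)  # fused = [0.15, 0.4, 0.45], label = 2
```

Averaging keeps the fused scores a valid probability distribution, and the joint decision can outperform either model when their errors are not strongly correlated.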
Zhang Z, Qian K, Schuller BW, et al., 2020, An Online Robot Collision Detection and Identification Scheme by Supervised Learning and Bayesian Decision Theory, IEEE Transactions on Automation Science and Engineering, Pages: 1-13, ISSN: 1545-5955
Rizos G, Schuller BW, 2020, Average Jane, Where Art Thou? – Recent Avenues in Efficient Machine Learning Under Subjectivity Uncertainty, Information Processing and Management of Uncertainty in Knowledge-Based Systems, Publisher: Springer International Publishing, Pages: 42-55, ISBN: 9783030501457
Marchi E, Schuller B, Baird A, et al., 2019, The ASC-Inclusion Perceptual Serious Gaming Platform for Autistic Children, IEEE Transactions on Games, Vol: 11, Pages: 328-339, ISSN: 2475-1502
Amiriparian S, Ottl S, Gerczuk M, et al., 2019, Audio-based eating analysis and tracking utilising deep spectrum features
© 2019 IEEE. This paper proposes a deep learning system for audio-based eating analysis on the ICMI 2018 Eating Analysis and Tracking (EAT) challenge corpus. We utilise Deep Spectrum features, which are image classification convolutional neural network (CNN) descriptors. We extract the Deep Spectrum features by forwarding Mel-spectrograms from the input audio through deep, task-independent, pre-trained CNNs, including AlexNet and VGG16. We then use the activations of the first (fc6), second (fc7), and third (fc8) fully connected layers of these networks as feature vectors. We obtain the best classification result by using the first fully connected layer (fc6) of AlexNet to extract the features from Mel-spectrograms with a window size of 160 ms, a hop size of 80 ms, and a viridis colour map. Finally, we build Bag-of-Deep-Features (BoDF), which is the quantisation of the Deep Spectrum features. In comparison to the best baseline results on the test partitions of the Food Type and the Likability sub-challenges, unweighted average recall is increased from 67.2 percent to 79.9 percent and from 54.2 percent to 56.1 percent, respectively. For the test partition of the Difficulty sub-challenge, the concordance correlation coefficient is increased from .506 to .509.
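The Bag-of-Deep-Features (BoDF) step, i.e., quantising per-frame Deep Spectrum feature vectors against a learnt codebook and counting assignments, can be sketched as follows. The codebook and feature vectors here are toy 2-D values rather than real CNN activations, and the codebook would normally come from k-means clustering over the training features:

```python
import numpy as np

def bag_of_features(features, codebook):
    """Assign each feature vector to its nearest codebook centre
    (Euclidean distance) and return a normalised histogram of counts."""
    # Pairwise distances, shape (n_vectors, n_centres).
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    assignments = np.argmin(dists, axis=1)
    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

# Toy codebook with two centres and four feature vectors.
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
feats = np.array([[0.1, 0.0], [0.9, 1.1], [1.2, 0.8], [0.0, 0.2]])
bodf = bag_of_features(feats, codebook)  # -> [0.5, 0.5]
```

The resulting fixed-length histogram summarises a variable-length sequence of frame-level features, which is what makes the representation convenient for standard classifiers.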
Schmid J, Schneider M, Hoß A, et al., 2019, A deep learning approach for location independent throughput prediction
© 2019 IEEE. Mobile communication has become a part of everyday life and is considered to support reliability and safety in traffic use cases such as conditionally automated driving. Nevertheless, prediction of Quality of Service parameters, particularly throughput, is still a challenging task while on the move. Whereas most approaches in this research field rely on historical data measurements mapped to the corresponding coordinates in the area of interest, this paper proposes a throughput prediction method that focuses on a location-independent approach. To compensate for the missing positioning information, mainly used for spatial clustering, our model uses low-level mobile network parameters, improved by additional feature engineering to retrieve abstracted location information, e.g., surrounding building size and street type. Thus, the major advantage of our method is its applicability to new regions without the prerequisite of conducting an extensive measurement campaign in advance. We embed analysis results for the underlying temporal relations in the design of different deep neural network types. Finally, the model performances are evaluated and compared to traditional models, such as support vector and random forest regression, which were harnessed in previous investigations.
Xie Y, Liang R, Liang Z, et al., 2019, Speech Emotion Classification Using Attention-Based LSTM, IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol: 27, Pages: 1675-1685, ISSN: 2329-9290
Ringeval F, Schuller B, Valstar M, et al., 2019, AVEC 2019 workshop and challenge: State-of-mind, detecting depression with ai, and cross-cultural affect recognition, AVEC 2019 - Proceedings of the 9th International Audio/Visual Emotion Challenge and Workshop, co-located with MM 2019, Pages: 3-12
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM. The Audio/Visual Emotion Challenge and Workshop (AVEC 2019) "State-of-Mind, Detecting Depression with AI, and Cross-cultural Affect Recognition" is the ninth competition event aimed at the comparison of multimedia processing and machine learning methods for automatic audiovisual health and emotion analysis, with all participants competing strictly under the same conditions. The goal of the Challenge is to provide a common benchmark test set for multimodal information processing and to bring together the health and emotion recognition communities, as well as the audiovisual processing communities, to compare the relative merits of various approaches to health and emotion recognition from real-life data. This paper presents the major novelties introduced this year, the challenge guidelines, the data used, and the performance of the baseline systems on the three proposed tasks: state-of-mind recognition, depression assessment with AI, and cross-cultural affect sensing, respectively.
Ringeval F, Schuller B, Valstar M, et al., 2019, AVEC 2019 chairs' welcome
Kossaifi J, Walecki R, Panagakis Y, et al., 2019, SEWA DB: A rich database for audio-visual emotion and sentiment research in the wild, IEEE Transactions on Pattern Analysis and Machine Intelligence, ISSN: 0162-8828
Natural human-computer interaction and audio-visual human behaviour sensing systems that achieve robust performance in the wild are needed more than ever, as digital devices become an increasingly indispensable part of our lives. Accurately annotated real-world data are the crux in devising such systems. However, existing databases usually consider controlled settings, low demographic variability, and a single task. In this paper, we introduce the SEWA database of more than 2000 minutes of audio-visual data of 398 people from six cultures, 50% female, and uniformly spanning the age range of 18 to 65 years. Subjects were recorded in two different contexts: while watching adverts and while discussing the adverts in a video chat. The database includes rich annotations of the recordings in terms of facial landmarks, facial action units (FAU), various vocalisations, mirroring, and continuously valued valence, arousal, liking, agreement, and prototypic examples of (dis)liking. This database aims to be an extremely valuable resource for researchers in affective computing and automatic human sensing, and is expected to push forward research in human behaviour analysis, including cultural studies. Along with the database, we provide extensive baseline experiments for automatic FAU detection and automatic valence, arousal, and (dis)liking intensity estimation.
Baird A, Amiriparian S, Schuller B, 2019, Can Deep Generative Audio be Emotional? Towards an Approach for Personalised Emotional Audio Generation
© 2019 IEEE. The ability of sound to evoke states of emotion is well known across fields of research, with clinical and holistic practitioners utilising audio to create listener experiences which target specific needs. Neural network-based generative models have in recent years shown promise for generating high-fidelity audio based on a raw audio input. With this in mind, this study utilises the WaveNet generative model to explore the ability of such networks to retain the emotionality of raw audio speech inputs. We train various models on two classes (happy and sad) of an emotional speech corpus containing 68 native Italian speakers. When classifying the combined original and generated audio, hand-crafted feature sets achieve at best 75.5% unweighted average recall, a 2-percentage-point improvement over the features of the original audio alone. Additionally, from a two-tailed test on the predictions, we find that the audio features from the original speech concatenated with the generated audio features provide significantly different test results compared to the baseline. Both findings indicate promise for emotion-based audio generation.
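Unweighted average recall (UAR), the measure reported in this and several of the abstracts above, is the mean of the per-class recalls, so each class contributes equally regardless of how many samples it has. A minimal sketch with invented toy labels:

```python
import numpy as np

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls (UAR): each class contributes equally,
    which makes the measure robust to class imbalance."""
    recalls = []
    for c in np.unique(y_true):
        mask = (y_true == c)
        # Recall for class c: fraction of its samples predicted correctly.
        recalls.append(np.mean(y_pred[mask] == c))
    return float(np.mean(recalls))

# Imbalanced toy example: class 0 has four samples, class 1 has two.
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0])
uar = unweighted_average_recall(y_true, y_pred)  # (3/4 + 1/2) / 2 = 0.625
```

Plain accuracy on the same toy data would be 4/6 ≈ 0.667, higher than the UAR of 0.625, which is why UAR is preferred for the imbalanced corpora common in paralinguistics challenges.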
This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.