Publications
12 results found
Hogg A, Liu H, Mads J, et al., 2023, Exploring the Impact of Transfer Learning on GAN-Based HRTF Upsampling, EAA Forum Acusticum, European Congress on Acoustics
Engel I, Daugintis R, Vicente T, et al., 2023, The SONICOM HRTF dataset, Journal of the Audio Engineering Society, Vol: 71, Pages: 241-253, ISSN: 0004-7554
Immersive audio technologies, ranging from rendering spatialized sounds accurately to efficient room simulations, are vital to the success of augmented and virtual realities. To produce realistic sounds through headphones, the human body and head must both be taken into account. However, the measurement of the influence of the external human morphology on the sounds incoming to the ears, which is often referred to as head-related transfer function (HRTF), is expensive and time-consuming. Several datasets have been created over the years to help researcherswork on immersive audio; nevertheless, the number of individuals involved and amount of data collected is often insufficient for modern machine-learning approaches. Here, the SONICOM HRTF dataset is introduced to facilitate reproducible research in immersive audio. This dataset contains the HRTF of 120 subjects, as well as headphone transfer functions; 3D scans of ears, heads, and torsos; and depth pictures at different angles around subjects' heads.
McKnight S, Hogg AOT, Neo VW, et al., 2022, Studying human-based speaker diarization and comparing to state-of-the-art systems, APSIPA 2022, Publisher: IEEE, Pages: 394-401
Human-based speaker diarization experiments were carried out on a five-minute extract of a typical AMI corpus meeting to see how much variance there is in human reviews based on hearing only and to compare with state-of-the-art diarization systems on the same extract. There are three distinct experiments: (a) one with no prior information; (b) one with the ground truth speech activity detection (GT-SAD); and (c) one with the blank ground truth labels (GT-labels). The results show that most human reviews tend to be quite similar, albeit with some outliers, but the choice of GT-labels can make a dramatic difference to scored performance. Using the GT-SAD provides a big advantage and improves human review scores substantially, though small differences in the GT-SAD used can have a dramatic effect on results. The use of forgiveness collars is shown to be unhelpful. The results show that state-of-the-art systems can outperform the best human reviews when no prior information is provided. However, the best human reviews still outperform state-of-the-art systems when starting from the GT-SAD.
Neo VW, Weiss S, McKnight S, et al., 2022, Polynomial eigenvalue decomposition-based target speaker voice activity detection in the presence of competing talkers, International Workshop on Acoustic Signal Enhancement (IWAENC), Publisher: IEEE, Pages: 1-5
Voice activity detection (VAD) algorithms are essential for many speech processing applications, such as speaker diarization, automatic speech recognition, speech enhancement, and speech coding. With a good VAD algorithm, non-speech segments can be excluded to improve the performance and computation of these applications. In this paper, we propose a polynomial eigenvalue decomposition-based target-speaker VAD algorithm to detect unseen target speakers in the presence of competing talkers. The proposed approach uses frame-based processing to compute the syndrome energy, used for testing the presence or absence of a target speaker. The proposed approach is consistently among the best in F1 and balanced accuracy scores over the investigated range of signal to interference ratio (SIR) from -10 dB to 20 dB.
McKnight S, Hogg A, Neo V, et al., 2022, A study of salient modulation domain features for speaker identification, Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Publisher: IEEE, Pages: 705-712
This paper studies the ranges of acoustic andmodulation frequencies of speech most relevant for identifyingspeakers and compares the speaker-specific information presentin the temporal envelope against that present in the temporalfine structure. This study uses correlation and feature importancemeasures, random forest and convolutional neural network mod-els, and reconstructed speech signals with specific acoustic and/ormodulation frequencies removed to identify the salient points. Itis shown that the range of modulation frequencies associated withthe fundamental frequency is more important than the 1-16 Hzrange most commonly used in automatic speech recognition, andthat the 0 Hz modulation frequency band contains significantspeaker information. It is also shown that the temporal envelopeis more discriminative among speakers than the temporal finestructure, but that the temporal fine structure still contains usefuladditional information for speaker identification. This researchaims to provide a timely addition to the literature by identifyingspecific aspects of speech relevant for speaker identification thatcould be used to enhance the discriminant capabilities of machinelearning models.
Hogg A, Neo V, Weiss S, et al., 2021, A polynomial eigenvalue decomposition MUSIC approach for broadband sound source localization, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Publisher: IEEE, Pages: 326-330
Direction of arrival (DoA) estimation for sound source localization is increasingly prevalent in modern devices. In this paper, we explore a polynomial extension to the multiple signal classification (MUSIC) algorithm, spatio-spectral polynomial (SSP)-MUSIC, and evaluate its performance when using speech sound sources. In addition, we also propose three essential enhancements for SSP-MUSIC to work with noisy reverberant audio data. This paper includes an analysis of SSP-MUSIC using speech signals in a simulated room for different noise and reverberation conditions and the first task of the LOCATA challenge. We show that SSP-MUSIC is more robust to noise and reverberation compared to independent frequency bin (IFB) approaches and improvements can be seen for single sound source localization at signal-to-noise ratios (SNRs) below 5 dB and reverberation times (T60s) larger than 0.7 s.
Hogg AOT, Evers C, Naylor PA, 2021, Multichannel Overlapping Speaker Segmentation Using Multiple Hypothesis Tracking Of Acoustic And Spatial Features, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Publisher: IEEE
Hogg A, Evers C, Moore A, et al., 2021, Overlapping speaker segmentation using multiple hypothesis tracking of fundamental frequency, IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol: 29, Pages: 1479-1490, ISSN: 2329-9290
This paper demonstrates how the harmonic structure of voiced speech can be exploited to segment multiple overlapping speakers in a speaker diarization task. We explore how a change in the speaker can be inferred from a change in pitch. We show that voiced harmonics can be useful in detecting when more than one speaker is talking, such as during overlapping speaker activity. A novel system is proposed to track multiple harmonics simultaneously, allowing for the determination of onsets and end-points of a speaker’s utterance in the presence of an additional active speaker. This system is bench-marked against a segmentation system from the literature that employs a bidirectional long short term memory network (BLSTM) approach and requires training. Experimental results highlight that the proposed approach outperforms the BLSTM baseline approach by 12.9% in terms of HIT rate for speaker segmentation. We also show that the estimated pitch tracks of our system can be used as features to the BLSTM to achieve further improvements of 1.21% in terms of coverage and 2.45% in terms of purity
McKnight SW, Hogg AOT, Naylor PA, 2021, Analysis of Phonetic Dependence of Segmentation Errors in Speaker Diarization, 2020 28th European Signal Processing Conference (EUSIPCO), Publisher: IEEE
Hogg AOT, Evers C, Naylor PA, 2019, Multiple Hypothesis Tracking for Overlapping Speaker Segmentation, 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Publisher: IEEE
Sharma D, Hogg AOT, Wang Y, et al., 2019, Non-Intrusive POLQA Estimation of Speech Quality using Recurrent Neural Networks, 2019 27th European Signal Processing Conference (EUSIPCO), Publisher: IEEE
Hogg AOT, Evers C, Naylor PA, 2019, Speaker Change Detection Using Fundamental Frequency with Application to Multi-talker Segmentation, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Publisher: IEEE
This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.