Imperial College London

Dr Aidan O. T. Hogg

Faculty of Engineering, Dyson School of Design Engineering

Honorary Research Associate

Location

 

RCS1 214, Royal College of Science, South Kensington Campus

Publications

13 results found

Hogg A, Jenkins M, Liu H, Squires I, Cooper S, Picinali L et al., 2024, HRTF upsampling with a generative adversarial network using a gnomonic equiangular projection, IEEE Transactions on Audio, Speech and Language Processing, ISSN: 1558-7916

An individualised head-related transfer function (HRTF) is very important for creating realistic virtual reality (VR) and augmented reality (AR) environments. However, acoustically measuring high-quality HRTFs requires expensive equipment and an acoustic lab setting. To overcome these limitations and to make this measurement more efficient, HRTF upsampling has been exploited in the past, where a high-resolution HRTF is created from a low-resolution one. This paper demonstrates how generative adversarial networks (GANs) can be applied to HRTF upsampling. We propose a novel approach that transforms the HRTF data for direct use with a convolutional super-resolution generative adversarial network (SRGAN). This new approach is benchmarked against three baselines: barycentric upsampling, spherical harmonic (SH) upsampling and an HRTF selection approach. Experimental results show that the proposed method outperforms all three baselines in terms of log-spectral distortion (LSD) and localisation performance using perceptual models when the input HRTF is sparse (fewer than 20 measured positions).
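The core of the gnomonic equiangular idea — mapping measurement directions on the sphere onto flat cube faces so that a convolutional network can operate on regular 2-D grids — can be sketched roughly as below. This is an illustrative sketch only, not the authors' exact pipeline; the function name and the cubemap layout are assumptions.

```python
import numpy as np

def gnomonic_cube_projection(dirs):
    """Project unit direction vectors through the sphere centre onto the
    tangent faces of a cube (a gnomonic projection), so scattered HRTF
    measurement directions become points on regular 2-D face grids.

    Returns, per direction: the index of the dominant axis (0=x, 1=y, 2=z),
    the sign selecting which of the two opposite faces, and the (u, v)
    coordinates on that face in [-1, 1].
    """
    dirs = np.asarray(dirs, dtype=float)
    dirs = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)
    face = np.argmax(np.abs(dirs), axis=-1)
    sign = np.sign(dirs[np.arange(len(dirs)), face])
    uv = np.empty((len(dirs), 2))
    for i, (d, f) in enumerate(zip(dirs, face)):
        rest = [a for a in range(3) if a != f]
        uv[i] = d[rest] / d[f]   # central projection onto the tangent face
    return face, sign, uv
```

Each face grid can then be treated as an image channel by the SRGAN-style network.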

Journal article

Hogg A, Liu H, Jenkins M, Picinali L et al., 2023, Exploring the Impact of Transfer Learning on GAN-Based HRTF Upsampling, EAA Forum Acusticum, European Congress on Acoustics

Conference paper

Engel I, Daugintis R, Vicente T, Hogg AOT, Pauwels J, Tournier AJ, Picinali L et al., 2023, The SONICOM HRTF dataset, Journal of the Audio Engineering Society, Vol: 71, Pages: 241-253, ISSN: 0004-7554

Immersive audio technologies, ranging from rendering spatialized sounds accurately to efficient room simulations, are vital to the success of augmented and virtual realities. To produce realistic sounds through headphones, the human body and head must both be taken into account. However, the measurement of the influence of the external human morphology on the sounds incoming to the ears, which is often referred to as the head-related transfer function (HRTF), is expensive and time-consuming. Several datasets have been created over the years to help researchers work on immersive audio; nevertheless, the number of individuals involved and the amount of data collected are often insufficient for modern machine-learning approaches. Here, the SONICOM HRTF dataset is introduced to facilitate reproducible research in immersive audio. This dataset contains the HRTFs of 120 subjects, as well as headphone transfer functions; 3D scans of ears, heads, and torsos; and depth pictures at different angles around subjects' heads.

Journal article

McKnight S, Hogg AOT, Neo VW, Naylor PA et al., 2022, Studying human-based speaker diarization and comparing to state-of-the-art systems, APSIPA 2022, Publisher: IEEE, Pages: 394-401

Human-based speaker diarization experiments were carried out on a five-minute extract of a typical AMI corpus meeting to see how much variance there is in human reviews based on hearing only and to compare with state-of-the-art diarization systems on the same extract. There are three distinct experiments: (a) one with no prior information; (b) one with the ground truth speech activity detection (GT-SAD); and (c) one with the blank ground truth labels (GT-labels). The results show that most human reviews tend to be quite similar, albeit with some outliers, but the choice of GT-labels can make a dramatic difference to scored performance. Using the GT-SAD provides a big advantage and improves human review scores substantially, though small differences in the GT-SAD used can have a dramatic effect on results. The use of forgiveness collars is shown to be unhelpful. The results show that state-of-the-art systems can outperform the best human reviews when no prior information is provided. However, the best human reviews still outperform state-of-the-art systems when starting from the GT-SAD.

Conference paper

Neo VW, Weiss S, McKnight S, Hogg A, Naylor PA et al., 2022, Polynomial eigenvalue decomposition-based target speaker voice activity detection in the presence of competing talkers, International Workshop on Acoustic Signal Enhancement (IWAENC), Publisher: IEEE, Pages: 1-5

Voice activity detection (VAD) algorithms are essential for many speech processing applications, such as speaker diarization, automatic speech recognition, speech enhancement, and speech coding. With a good VAD algorithm, non-speech segments can be excluded to improve the performance and reduce the computation of these applications. In this paper, we propose a polynomial eigenvalue decomposition-based target-speaker VAD algorithm to detect unseen target speakers in the presence of competing talkers. The proposed approach uses frame-based processing to compute the syndrome energy, which is used to test for the presence or absence of a target speaker. The proposed approach is consistently among the best in F1 and balanced accuracy scores over the investigated range of signal-to-interference ratios (SIRs) from -10 dB to 20 dB.
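The frame-based detect-then-exclude framing described in the abstract can be illustrated with a deliberately simple stand-in detector. The sketch below thresholds plain frame energy rather than the paper's PEVD-based syndrome-energy statistic; the function name and thresholds are assumptions.

```python
import numpy as np

def frame_energy_vad(x, frame_len=400, hop=160, thresh_db=-40.0):
    """Label each frame as speech (1) or non-speech (0) by thresholding
    its RMS level relative to the loudest frame. A simplified stand-in
    for a syndrome-energy test, showing only the per-frame decision
    structure that lets downstream tasks skip non-speech segments."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // hop
    rms = np.array([np.sqrt(np.mean(x[i * hop:i * hop + frame_len] ** 2))
                    for i in range(n_frames)])
    level_db = 20.0 * np.log10(rms / (rms.max() + 1e-12) + 1e-12)
    return (level_db > thresh_db).astype(int)
```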

Conference paper

McKnight S, Hogg A, Neo V, Naylor P et al., 2022, A study of salient modulation domain features for speaker identification, Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Publisher: IEEE, Pages: 705-712

This paper studies the ranges of acoustic and modulation frequencies of speech most relevant for identifying speakers and compares the speaker-specific information present in the temporal envelope against that present in the temporal fine structure. This study uses correlation and feature importance measures, random forest and convolutional neural network models, and reconstructed speech signals with specific acoustic and/or modulation frequencies removed to identify the salient points. It is shown that the range of modulation frequencies associated with the fundamental frequency is more important than the 1-16 Hz range most commonly used in automatic speech recognition, and that the 0 Hz modulation frequency band contains significant speaker information. It is also shown that the temporal envelope is more discriminative among speakers than the temporal fine structure, but that the temporal fine structure still contains useful additional information for speaker identification. This research aims to provide a timely addition to the literature by identifying specific aspects of speech relevant for speaker identification that could be used to enhance the discriminant capabilities of machine learning models.
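The envelope/fine-structure split that this abstract compares is conventionally obtained from the analytic signal. A minimal sketch, using an FFT-based Hilbert transform (function names assumed, not the paper's code):

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via the FFT: zero the negative frequencies and
    double the positive ones, so the imaginary part of the result is the
    Hilbert transform of x."""
    N = len(x)
    X = np.fft.fft(x)
    h = np.zeros(N)
    h[0] = 1.0
    if N % 2 == 0:
        h[N // 2] = 1.0
        h[1:N // 2] = 2.0
    else:
        h[1:(N + 1) // 2] = 2.0
    return np.fft.ifft(X * h)

def envelope_and_fine_structure(x):
    """Split a (sub-band) signal into its temporal envelope and its
    temporal fine structure, the two information carriers compared in
    the paper."""
    z = analytic_signal(x)
    envelope = np.abs(z)                   # slowly varying amplitude
    fine_structure = np.cos(np.angle(z))   # unit-amplitude carrier
    return envelope, fine_structure
```

For a pure tone the envelope is constant and the fine structure reproduces the tone itself.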

Conference paper

Hogg A, Neo V, Weiss S, Evers C, Naylor P et al., 2021, A polynomial eigenvalue decomposition MUSIC approach for broadband sound source localization, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Publisher: IEEE, Pages: 326-330

Direction of arrival (DoA) estimation for sound source localization is increasingly prevalent in modern devices. In this paper, we explore a polynomial extension to the multiple signal classification (MUSIC) algorithm, spatio-spectral polynomial (SSP)-MUSIC, and evaluate its performance when using speech sound sources. In addition, we also propose three essential enhancements for SSP-MUSIC to work with noisy reverberant audio data. This paper includes an analysis of SSP-MUSIC using speech signals in a simulated room for different noise and reverberation conditions and the first task of the LOCATA challenge. We show that SSP-MUSIC is more robust to noise and reverberation compared to independent frequency bin (IFB) approaches and improvements can be seen for single sound source localization at signal-to-noise ratios (SNRs) below 5 dB and reverberation times (T60s) larger than 0.7 s.
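As context for the independent frequency bin (IFB) baseline mentioned above, conventional narrowband MUSIC at a single frequency bin can be sketched as follows for a uniform linear array. This is a hedged illustration: the array geometry, far-field steering model, and names are assumptions, and the paper's SSP-MUSIC generalises this with polynomial matrices across frequency.

```python
import numpy as np

def music_spectrum_narrowband(R, mic_pos, freq, n_src=1, c=343.0,
                              angles_deg=np.linspace(-90.0, 90.0, 181)):
    """Narrowband (single frequency bin) MUSIC pseudospectrum for a
    linear array -- the kind of per-bin processing that polynomial
    SSP-MUSIC extends to the broadband case."""
    w, V = np.linalg.eigh(R)               # eigenvalues in ascending order
    En = V[:, : len(mic_pos) - n_src]      # noise subspace eigenvectors
    spectrum = []
    for theta in np.deg2rad(angles_deg):
        # far-field steering vector for direction theta (from broadside)
        a = np.exp(-2j * np.pi * freq * mic_pos * np.sin(theta) / c)
        denom = np.real(np.vdot(a, En @ (En.conj().T @ a)))
        spectrum.append(1.0 / max(denom, 1e-12))
    return angles_deg, np.asarray(spectrum)
```

The pseudospectrum peaks where the steering vector is orthogonal to the noise subspace, i.e. at the source direction.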

Conference paper

Hogg AOT, Evers C, Naylor PA, 2021, Multichannel Overlapping Speaker Segmentation Using Multiple Hypothesis Tracking Of Acoustic And Spatial Features, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Publisher: IEEE

Conference paper

Hogg A, Evers C, Moore A, Naylor P et al., 2021, Overlapping speaker segmentation using multiple hypothesis tracking of fundamental frequency, IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol: 29, Pages: 1479-1490, ISSN: 2329-9290

This paper demonstrates how the harmonic structure of voiced speech can be exploited to segment multiple overlapping speakers in a speaker diarization task. We explore how a change in the speaker can be inferred from a change in pitch. We show that voiced harmonics can be useful in detecting when more than one speaker is talking, such as during overlapping speaker activity. A novel system is proposed to track multiple harmonics simultaneously, allowing for the determination of onsets and end-points of a speaker's utterance in the presence of an additional active speaker. This system is benchmarked against a segmentation system from the literature that employs a bidirectional long short-term memory network (BLSTM) approach and requires training. Experimental results highlight that the proposed approach outperforms the BLSTM baseline by 12.9% in terms of HIT rate for speaker segmentation. We also show that the estimated pitch tracks of our system can be used as features for the BLSTM to achieve further improvements of 1.21% in terms of coverage and 2.45% in terms of purity.
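The intuition that a speaker change can be inferred from a pitch change can be illustrated with a single-track simplification. The sketch below flags frames where the median fundamental frequency jumps between adjacent windows; the paper's system is far richer, tracking multiple harmonics with a multiple-hypothesis tracker so that overlapping talkers can be separated. Names and thresholds here are assumptions.

```python
import numpy as np

def pitch_change_points(f0_track, win=20, jump_hz=25.0):
    """Flag candidate speaker-change frames where the median fundamental
    frequency differs sharply between the windows just before and just
    after each frame. A single-hypothesis, single-track stand-in for
    multiple-hypothesis harmonic tracking."""
    f0 = np.asarray(f0_track, dtype=float)
    changes = []
    for i in range(win, len(f0) - win):
        before = np.median(f0[i - win:i])
        after = np.median(f0[i:i + win])
        if abs(after - before) > jump_hz:   # pitch jump => possible new talker
            changes.append(i)
    return changes
```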

Journal article

McKnight SW, Hogg AOT, Naylor PA, 2021, Analysis of Phonetic Dependence of Segmentation Errors in Speaker Diarization, 2020 28th European Signal Processing Conference (EUSIPCO), Publisher: IEEE

Conference paper

Hogg AOT, Evers C, Naylor PA, 2019, Multiple Hypothesis Tracking for Overlapping Speaker Segmentation, 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Publisher: IEEE

Conference paper

Sharma D, Hogg AOT, Wang Y, Nour-Eldin A, Naylor PA et al., 2019, Non-Intrusive POLQA Estimation of Speech Quality using Recurrent Neural Networks, 2019 27th European Signal Processing Conference (EUSIPCO), Publisher: IEEE

Conference paper

Hogg AOT, Evers C, Naylor PA, 2019, Speaker Change Detection Using Fundamental Frequency with Application to Multi-talker Segmentation, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Publisher: IEEE

Conference paper

This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.
