Publications

Journal article

Hogg A, Jenkins M, Liu H, Squires I, Cooper S, Picinali L et al., 2024,

HRTF upsampling with a generative adversarial network using a gnomonic equiangular projection

, IEEE Transactions on Audio, Speech and Language Processing, ISSN: 1558-7916

An individualised head-related transfer function (HRTF) is very important for creating realistic virtual reality (VR) and augmented reality (AR) environments. However, acoustically measuring high-quality HRTFs requires expensive equipment and an acoustic lab setting. To overcome these limitations and to make this measurement more efficient, HRTF upsampling has been exploited in the past, where a high-resolution HRTF is created from a low-resolution one. This paper demonstrates how generative adversarial networks (GANs) can be applied to HRTF upsampling. We propose a novel approach that transforms the HRTF data for direct use with a convolutional super-resolution generative adversarial network (SRGAN). This new approach is benchmarked against three baselines: barycentric upsampling, spherical harmonic (SH) upsampling and an HRTF selection approach. Experimental results show that the proposed method outperforms all three baselines in terms of log-spectral distortion (LSD) and localisation performance using perceptual models when the input HRTF is sparse (less than 20 measured positions).
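For readers unfamiliar with the LSD metric named in the abstract above, the following is a minimal sketch of one common definition: the root-mean-square difference, in dB, between the magnitude spectra of a reference and an estimated response at a single measurement position. The function name and the per-position formulation are assumptions for illustration; the paper's exact averaging over positions and ears may differ.

```python
import math


def log_spectral_distortion(h_ref, h_est):
    """RMS difference in dB between two magnitude spectra (illustrative).

    h_ref, h_est: equal-length sequences of non-zero real or complex
    frequency-bin values for one measurement position.
    """
    if len(h_ref) != len(h_est) or len(h_ref) == 0:
        raise ValueError("spectra must be non-empty and of equal length")
    # Squared per-bin magnitude ratio expressed in dB.
    squared_db = [
        (20.0 * math.log10(abs(a) / abs(b))) ** 2
        for a, b in zip(h_ref, h_est)
    ]
    return math.sqrt(sum(squared_db) / len(squared_db))
```

A value of 0 dB means the two magnitude spectra are identical; larger values indicate a larger spectral mismatch.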

Journal article

Neo VW, Redif S, McWhirter JG, Pestana J, Proudler IK, Weiss S, Naylor PA et al., 2023,

Polynomial eigenvalue decomposition for multichannel broadband signal processing

, IEEE Signal Processing Magazine, Vol: 40, Pages: 18-37, ISSN: 1053-5888

This article is devoted to the polynomial eigenvalue decomposition (PEVD) and its applications in broadband multichannel signal processing, motivated by the optimum solutions provided by the eigenvalue decomposition (EVD) for the narrowband case [1], [2]. In general, the successful techniques from narrowband problems can also be applied to broadband ones, leading to improved solutions. Multichannel broadband signals arise at the core of many essential commercial applications such as telecommunications, speech processing, healthcare monitoring, astronomy and seismic surveillance, and military technologies like radar, sonar and communications [3]. The success of these applications often depends on the performance of signal processing tasks, including data compression [4], source localization [5], channel coding [6], signal enhancement [7], beamforming [8], and source separation [9]. In most cases and for narrowband signals, performing an EVD is the key to the signal processing algorithm. Therefore, this paper aims to introduce PEVD as a novel mathematical technique suitable for many broadband signal processing applications.

Conference paper

Hogg A, Liu H, Mads J, Picinali L et al., 2023,

Exploring the Impact of Transfer Learning on GAN-Based HRTF Upsampling

, EAA Forum Acusticum, European Congress on Acoustics

Conference paper

Sanguedolce G, Naylor PA, Geranmayeh F, 2023,

Uncovering the potential for a weakly supervised end-to-end model in recognising speech from patient with post-stroke aphasia

, 5th Clinical Natural Language Processing Workshop, Publisher: Association for Computational Linguistics, Pages: 182-190

Post-stroke speech and language deficits (aphasia) significantly impact patients' quality of life. Many with mild symptoms remain undiagnosed, and the majority do not receive the intensive doses of therapy recommended, due to healthcare costs and/or inadequate services. Automatic Speech Recognition (ASR) may help overcome these difficulties by improving diagnostic rates and providing feedback during tailored therapy. However, its performance is often unsatisfactory due to the high variability in speech errors and scarcity of training datasets. This study assessed the performance of Whisper, a recently released end-to-end model, in patients with post-stroke aphasia (PWA). We tuned its hyperparameters to achieve the lowest word error rate (WER) on aphasic speech. WER was significantly higher in PWA compared to age-matched controls (38.5% vs 10.3%, p < 0.001). We demonstrated that worse WER was related to more severe aphasia as measured by expressive (overt naming and spontaneous speech production) and receptive (written and spoken comprehension) language assessments. Stroke lesion size did not affect the performance of Whisper. Linear mixed models accounting for demographic factors, therapy duration, and time since stroke confirmed worse Whisper performance with left hemispheric frontal lesions. We discuss the implications of these findings for how future ASR can be improved in PWA.
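The word error rate (WER) reported in the abstract above is conventionally the word-level edit distance between a reference transcript and the recogniser output, divided by the number of reference words. The sketch below shows that standard calculation; the function name is hypothetical, and text normalisation (casing, punctuation, filler words), which the study may have applied, is not modelled here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.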

Conference paper

McKnight S, Hogg AOT, Neo VW, Naylor PA et al., 2022,

Studying human-based speaker diarization and comparing to state-of-the-art systems

, APSIPA 2022, Publisher: IEEE, Pages: 394-401

Human-based speaker diarization experiments were carried out on a five-minute extract of a typical AMI corpus meeting to see how much variance there is in human reviews based on hearing only and to compare with state-of-the-art diarization systems on the same extract. There are three distinct experiments: (a) one with no prior information; (b) one with the ground truth speech activity detection (GT-SAD); and (c) one with the blank ground truth labels (GT-labels). The results show that most human reviews tend to be quite similar, albeit with some outliers, but the choice of GT-labels can make a dramatic difference to scored performance. Using the GT-SAD provides a big advantage and improves human review scores substantially, though small differences in the GT-SAD used can have a dramatic effect on results. The use of forgiveness collars is shown to be unhelpful. The results show that state-of-the-art systems can outperform the best human reviews when no prior information is provided. However, the best human reviews still outperform state-of-the-art systems when starting from the GT-SAD.

Conference paper

D'Olne E, Neo VW, Naylor PA, 2022,

Speech enhancement in distributed microphone arrays using polynomial eigenvalue decomposition

, European Signal Processing Conference (EUSIPCO), Publisher: IEEE, Pages: 55-59, ISSN: 2219-5491

As the number of connected devices equipped with multiple microphones increases, scientific interest in distributed microphone array processing grows. Current beamforming methods heavily rely on estimating quantities related to array geometry, which is extremely challenging in real, non-stationary environments. Recent work on polynomial eigenvalue decomposition (PEVD) has shown promising results for speech enhancement in singular arrays without requiring the estimation of any array-related parameter [1]. This work extends these results to the realm of distributed microphone arrays, and further presents a novel framework for speech enhancement in distributed microphone arrays using PEVD. The proposed approach is shown to almost always outperform optimum beamformers located at arrays closest to the desired speaker. Moreover, the proposed approach exhibits very strong robustness to steering vector errors.

Conference paper

D'Olne E, Neo VW, Naylor PA, 2022,

Frame-based space-time covariance matrix estimation for polynomial eigenvalue decomposition-based speech enhancement

, International Workshop on Acoustic Signal Enhancement (IWAENC), Publisher: IEEE, Pages: 1-5

Recent work in speech enhancement has proposed a polynomial eigenvalue decomposition (PEVD) method, yielding significant intelligibility and noise-reduction improvements without introducing distortions in the enhanced signal [1]. The method relies on the estimation of a space-time covariance matrix, performed in batch mode such that a sufficiently long portion of the noisy signal is used to derive an accurate estimate. However, in applications where the scene is nonstationary, this approach is unable to adapt to changes in the acoustic scenario. This paper thus proposes a frame-based procedure for the estimation of space-time covariance matrices and investigates its impact on subsequent PEVD speech enhancement. The method is found to yield spatial filters and speech enhancement improvements comparable to the batch method in [1], showing potential for real-time processing.
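As background to the abstract above, the space-time covariance matrix of a multichannel signal x[n] is usually defined as R[tau] = E{x[n] x^H[n - tau]}, collecting auto- and cross-correlations between all channel pairs over a range of lags. The sketch below is a plain sample estimate under stated assumptions: real-valued signals, samples outside the record treated as zero, and normalisation by the record length. The function name is hypothetical and this is not the paper's frame-based procedure; it only illustrates the quantity being estimated.

```python
def space_time_covariance(x, max_lag):
    """Sample estimate of R[lag][i][j] = (1/N) * sum_n x_i[n] * x_j[n - lag].

    x: list of channels, each an equal-length list of real samples.
    Returns a dict mapping each lag in [-max_lag, max_lag] to an M x M
    nested list, where M is the number of channels.
    """
    n_ch = len(x)
    n = len(x[0])
    if any(len(channel) != n for channel in x):
        raise ValueError("all channels must have the same length")
    estimate = {}
    for lag in range(-max_lag, max_lag + 1):
        matrix = [[0.0] * n_ch for _ in range(n_ch)]
        for i in range(n_ch):
            for j in range(n_ch):
                acc = 0.0
                for t in range(n):
                    # Samples outside the record are treated as zero.
                    if 0 <= t - lag < n:
                        acc += x[i][t] * x[j][t - lag]
                matrix[i][j] = acc / n
        estimate[lag] = matrix
    return estimate
```

Applying the same function to a short window of samples rather than the whole recording gives a per-frame estimate, at the cost of higher estimation variance.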

Conference paper

Neo VW, D'Olne E, Moore AH, Naylor PA et al., 2022,

Fixed beamformer design using polynomial eigenvalue decomposition

, International Workshop on Acoustic Signal Enhancement (IWAENC), Publisher: IEEE, Pages: 1-5

Array processing is widely used in many speech applications involving multiple microphones. These applications include automatic speech recognition, robot audition, telecommunications, and hearing aids. A spatio-temporal filter for the array allows signals from different microphones to be combined desirably to improve the application performance. This paper will analyze and visually interpret the eigenvector beamformers designed by the polynomial eigenvalue decomposition (PEVD) algorithm, which are suited for arbitrary arrays. The proposed fixed PEVD beamformers are lightweight, with an average filter length of 114 and perform comparably to classical data-dependent minimum variance distortionless response (MVDR) and linearly constrained minimum variance (LCMV) beamformers for the separation of sources closely spaced by 5 degrees.

Conference paper

Neo VW, Weiss S, McKnight S, Hogg A, Naylor PA et al., 2022,

Polynomial eigenvalue decomposition-based target speaker voice activity detection in the presence of competing talkers

, International Workshop on Acoustic Signal Enhancement (IWAENC), Publisher: IEEE, Pages: 1-5

Voice activity detection (VAD) algorithms are essential for many speech processing applications, such as speaker diarization, automatic speech recognition, speech enhancement, and speech coding. With a good VAD algorithm, non-speech segments can be excluded to improve the performance and computation of these applications. In this paper, we propose a polynomial eigenvalue decomposition-based target-speaker VAD algorithm to detect unseen target speakers in the presence of competing talkers. The proposed approach uses frame-based processing to compute the syndrome energy, used for testing the presence or absence of a target speaker. The proposed approach is consistently among the best in F1 and balanced accuracy scores over the investigated range of signal to interference ratio (SIR) from -10 dB to 20 dB.
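The F1 and balanced accuracy scores mentioned in the abstract above are standard binary classification metrics. A minimal sketch of how they could be computed from per-frame speech/non-speech decisions follows; the function name and the convention of returning 0.0 when a ratio is undefined are assumptions for illustration, not details taken from the paper.

```python
def f1_and_balanced_accuracy(truth, predicted):
    """Return (F1, balanced accuracy) for two equal-length binary sequences.

    truth, predicted: per-frame labels, truthy meaning target speech present.
    """
    if len(truth) != len(predicted):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)          # speech detected
    fn = sum(1 for t, p in pairs if t and not p)      # speech missed
    fp = sum(1 for t, p in pairs if not t and p)      # false alarm
    tn = sum(1 for t, p in pairs if not t and not p)  # silence kept
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Balanced accuracy averages the per-class recall, so it is not
    # dominated by whichever class has more frames.
    return f1, (recall + specificity) / 2
```

Balanced accuracy is useful for VAD evaluation because speech and non-speech frames are rarely present in equal proportions.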

Conference paper

Neo VW, Weiss S, Naylor PA, 2022,

A polynomial subspace projection approach for the detection of weak voice activity

, Sensor Signal Processing for Defence conference (SSPD), Publisher: IEEE, Pages: 1-5

A voice activity detection (VAD) algorithm identifies whether or not time frames contain speech. It is essential for many military and commercial speech processing applications, including speech enhancement, speech coding, speaker identification, and automatic speech recognition. In this work, we adopt earlier work on detecting weak transient signals and propose a polynomial subspace projection pre-processor to improve an existing VAD algorithm. The proposed multi-channel pre-processor projects the microphone signals onto a lower dimensional subspace which attempts to remove the interferer components and thus eases the detection of the speech target. Compared to applying the same VAD to the microphone signal, the proposed approach almost always improves the F1 and balanced accuracy scores even in adverse environments, e.g. -30 dB SIR, which may be typical of operations involving noisy machinery and signal jamming scenarios.

Conference paper

McKnight S, Hogg A, Neo V, Naylor P et al., 2022,

A study of salient modulation domain features for speaker identification

, Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Publisher: IEEE, Pages: 705-712

This paper studies the ranges of acoustic and modulation frequencies of speech most relevant for identifying speakers and compares the speaker-specific information present in the temporal envelope against that present in the temporal fine structure. This study uses correlation and feature importance measures, random forest and convolutional neural network models, and reconstructed speech signals with specific acoustic and/or modulation frequencies removed to identify the salient points. It is shown that the range of modulation frequencies associated with the fundamental frequency is more important than the 1-16 Hz range most commonly used in automatic speech recognition, and that the 0 Hz modulation frequency band contains significant speaker information. It is also shown that the temporal envelope is more discriminative among speakers than the temporal fine structure, but that the temporal fine structure still contains useful additional information for speaker identification. This research aims to provide a timely addition to the literature by identifying specific aspects of speech relevant for speaker identification that could be used to enhance the discriminant capabilities of machine learning models.

Conference paper

Neo V, Evers C, Naylor P, 2021,

Polynomial matrix eigenvalue decomposition-based source separation using informed spherical microphone arrays

, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Publisher: IEEE, Pages: 201-205

Audio source separation is essential for many applications such as hearing aids, telecommunications, and robot audition. Subspace decomposition approaches using polynomial matrix eigenvalue decomposition (PEVD) algorithms applied to the microphone signals, or lower-dimension eigenbeams for spherical microphone arrays, are effective for speech enhancement. In this work, we extend the work from speech enhancement and propose a PEVD subspace algorithm that uses eigenbeams for source separation. The proposed PEVD-based source separation approach performs comparably with state-of-the-art algorithms, such as those based on independent component analysis (ICA) and multi-channel non-negative matrix factorization (MNMF). Informal listening examples also indicate that our method does not introduce any audible artifacts.

Conference paper

Hogg A, Neo V, Weiss S, Evers C, Naylor P et al., 2021,

A polynomial eigenvalue decomposition MUSIC approach for broadband sound source localization

, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Publisher: IEEE, Pages: 326-330

Direction of arrival (DoA) estimation for sound source localization is increasingly prevalent in modern devices. In this paper, we explore a polynomial extension to the multiple signal classification (MUSIC) algorithm, spatio-spectral polynomial (SSP)-MUSIC, and evaluate its performance when using speech sound sources. In addition, we also propose three essential enhancements for SSP-MUSIC to work with noisy reverberant audio data. This paper includes an analysis of SSP-MUSIC using speech signals in a simulated room for different noise and reverberation conditions and the first task of the LOCATA challenge. We show that SSP-MUSIC is more robust to noise and reverberation compared to independent frequency bin (IFB) approaches and improvements can be seen for single sound source localization at signal-to-noise ratios (SNRs) below 5 dB and reverberation times (T60s) larger than 0.7 s.

Journal article

Neo V, Evers C, Naylor P, 2021,

Enhancement of noisy reverberant speech using polynomial matrix eigenvalue decomposition

, IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol: 29, Pages: 3255-3266, ISSN: 2329-9290

Speech enhancement is important for applications such as telecommunications, hearing aids, automatic speech recognition and voice-controlled systems. Enhancement algorithms aim to reduce interfering noise and reverberation while minimizing any speech distortion. In this work for speech enhancement, we propose to use polynomial matrices to model the spatial, spectral and temporal correlations between the speech signals received by a microphone array and polynomial matrix eigenvalue decomposition (PEVD) to decorrelate in space, time and frequency simultaneously. We then propose a blind and unsupervised PEVD-based speech enhancement algorithm. Simulations and informal listening examples involving diverse reverberant and noisy environments have shown that our method can jointly suppress noise and reverberation, thereby achieving speech enhancement without introducing processing artefacts into the enhanced signal.

Conference paper

Hogg AOT, Evers C, Naylor PA, 2021,

Multichannel Overlapping Speaker Segmentation Using Multiple Hypothesis Tracking Of Acoustic And Spatial Features

, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Publisher: IEEE

Conference paper

Neo VW, Evers C, Naylor PA, 2021,

Polynomial matrix eigenvalue decomposition of spherical harmonics for speech enhancement

, IEEE International Conference on Acoustics, Speech and Signal Processing, Publisher: IEEE, Pages: 786-790

Speech enhancement algorithms using polynomial matrix eigenvalue decomposition (PEVD) have been shown to be effective for noisy and reverberant speech. However, these algorithms do not scale well in complexity with the number of channels used in the processing. For a spherical microphone array sampling an order-limited sound field, the spherical harmonics provide a compact representation of the microphone signals in the form of eigenbeams. We propose a PEVD algorithm that uses only the lower dimension eigenbeams for speech enhancement at a significantly lower computation cost. The proposed algorithm is shown to significantly reduce complexity while maintaining full performance. Informal listening examples have also indicated that the processing does not introduce any noticeable artefacts.

Journal article

Hogg A, Evers C, Moore A, Naylor P et al., 2021,

Overlapping speaker segmentation using multiple hypothesis tracking of fundamental frequency

, IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol: 29, Pages: 1479-1490, ISSN: 2329-9290

This paper demonstrates how the harmonic structure of voiced speech can be exploited to segment multiple overlapping speakers in a speaker diarization task. We explore how a change in the speaker can be inferred from a change in pitch. We show that voiced harmonics can be useful in detecting when more than one speaker is talking, such as during overlapping speaker activity. A novel system is proposed to track multiple harmonics simultaneously, allowing for the determination of onsets and end-points of a speaker’s utterance in the presence of an additional active speaker. This system is benchmarked against a segmentation system from the literature that employs a bidirectional long short-term memory network (BLSTM) approach and requires training. Experimental results highlight that the proposed approach outperforms the BLSTM baseline approach by 12.9% in terms of HIT rate for speaker segmentation. We also show that the estimated pitch tracks of our system can be used as features to the BLSTM to achieve further improvements of 1.21% in terms of coverage and 2.45% in terms of purity.

Conference paper

Neo VW, Evers C, Naylor PA, 2021,

Speech dereverberation performance of a polynomial-EVD subspace approach

, European Signal Processing Conference (EUSIPCO), Publisher: IEEE, ISSN: 2076-1465

The degradation of speech arising from additive background noise and reverberation affects the performance of important speech applications such as telecommunications, hearing aids, voice-controlled systems and robot audition. In this work, we focus on dereverberation. It is shown that the parameterized polynomial matrix eigenvalue decomposition (PEVD)-based speech enhancement algorithm exploits the lack of correlation between speech and the late reflections to enhance the speech component associated with the direct path and early reflections. The algorithm's performance is evaluated using simulations involving measured acoustic impulse responses and noise from the ACE corpus. The simulations and informal listening examples have indicated that the PEVD-based algorithm performs dereverberation over a range of SNRs without introducing any noticeable processing artefacts.

Conference paper

McKnight SW, Hogg A, Naylor P, 2020,

Analysis of phonetic dependence of segmentation errors in speaker diarization

, European Signal Processing Conference (EUSIPCO), Publisher: IEEE, ISSN: 2076-1465

Evaluation of speaker segmentation and diarization normally makes use of forgiveness collars around ground truth speaker segment boundaries such that estimated speaker segment boundaries within such collars are considered completely correct. This paper shows that the popular recent approach of removing forgiveness collars from speaker diarization evaluation tools can unfairly penalize speaker diarization systems that correctly estimate speaker segment boundaries. The uncertainty in identifying the start and/or end of a particular phoneme means that the ground truth segmentation is not perfectly accurate, and even trained human listeners are unable to identify phoneme boundaries with full consistency. This research analyses the phoneme dependence of this uncertainty, and shows that it depends on (i) whether the phoneme being detected is at the start or end of an utterance and (ii) what the phoneme is, so that the use of a uniform forgiveness collar is inadequate. This analysis is expected to point the way towards more indicative and repeatable assessment of the performance of speaker diarization systems.

Conference paper

Neo VW, Evers C, Naylor PA, 2020,

PEVD-based speech enhancement in reverberant environments

, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Publisher: IEEE, Pages: 186-190

The enhancement of noisy speech is important for applications involving human-to-human interactions, such as telecommunications and hearing aids, as well as human-to-machine interactions, such as voice-controlled systems and robot audition. In this work, we focus on reverberant environments. It is shown that, by exploiting the lack of correlation between speech and the late reflections, further noise reduction can be achieved. This is verified using simulations involving actual acoustic impulse responses and noise from the ACE corpus. The simulations show that even without using a noise estimator, our proposed method simultaneously achieves noise reduction, and enhancement of speech quality and intelligibility, in reverberant environments over a wide range of SNRs. Furthermore, informal listening examples highlight that our approach does not introduce any significant processing artefacts such as musical noise.

Imperial College London
