Imperial College London

Patrick A. Naylor

Faculty of Engineering, Department of Electrical and Electronic Engineering

Professor of Speech & Acoustic Signal Processing

Contact

+44 (0)20 7594 6235
p.naylor

Location

Room 803, Electrical Engineering Building, South Kensington Campus

Publications


443 results found

D'Olne E, Moore AH, Naylor PA, Donley J, Tourbabin V, Lunner T et al., 2024, Group Conversations in Noisy Environments (GiN) - Multimedia Recordings for Location-Aware Speech Enhancement, IEEE Open Journal of Signal Processing, Vol: 5, Pages: 374-382

Recent years have seen a growing interest in the use of smart glasses mounted with microphones to solve the cocktail party problem using beamforming techniques or machine learning. Many such approaches could bring substantial advances in hearing aid or Augmented Reality (AR) research. To validate these methods, the EasyCom [Donley et al., 2021] dataset introduced high-quality multi-modal recordings of conversations in noise, including egocentric multi-channel microphone array audio, speech source pose, and headset microphone audio. While providing comprehensive data, EasyCom lacks diversity in the acoustic environments considered and the degree of overlapping speech in conversations. This work therefore presents the Group Conversations in Noisy Environments (GiN) dataset of over 2 hours of group conversations in noisy environments recorded using binaural microphones and a pair of glasses mounted with 5 microphones. The recordings took place in 3 rooms and contain 6 seated participants as well as a standing facilitator. The data also include close-talking microphone audio and head-pose data for each speaker, an audio channel from a fixed reference microphone, and automatically annotated speaker activity information. A baseline method is used to demonstrate the use of the data for speech enhancement. The dataset is publicly available in d'Olne et al. [2023].

Journal article

Grinstein E, Hicks CM, Van Waterschoot T, Brookes M, Naylor PA et al., 2024, The Neural-SRP Method for Universal Robust Multi-Source Tracking, IEEE Open Journal of Signal Processing, Vol: 5, Pages: 19-28

Neural networks have achieved state-of-the-art performance on the task of acoustic Direction-of-Arrival (DOA) estimation using microphone arrays. Neural models can be classified as end-to-end or hybrid, each class showing advantages and disadvantages. This work introduces Neural-SRP, an end-to-end neural network architecture for DOA estimation inspired by the classical Steered Response Power (SRP) method, which overcomes limitations of current neural models. We evaluate the architecture on multiple scenarios, namely, multi-source DOA tracking and single-source DOA tracking under the presence of directional and diffuse noise. The experiments demonstrate that our proposed method compares favourably in terms of computational and localization performance with established neural methods on various recorded and simulated benchmark datasets.

Journal article
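
For context, the classical Steered Response Power method that inspires Neural-SRP scans a grid of candidate directions and scores each by the power of the phase-aligned sum of the microphone spectra. A minimal numpy sketch with the common PHAT weighting (the array geometry, candidate grid, and frame length here are illustrative assumptions, not details from the paper):

```python
import numpy as np

def srp_phat(frames, mic_pos, cand_doas, fs=16000, c=343.0):
    """Score candidate DOAs by steered response power with PHAT weighting.

    frames:    (n_mics, n_samples) time-domain snapshot
    mic_pos:   (n_mics, 3) microphone coordinates in metres
    cand_doas: (n_doas, 3) unit vectors pointing towards candidate sources
    """
    X = np.fft.rfft(frames, axis=1)
    X = X / (np.abs(X) + 1e-12)                       # PHAT: keep phase only
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / fs)
    powers = np.empty(len(cand_doas))
    for i, doa in enumerate(cand_doas):
        tdoa = mic_pos @ doa / c                      # far-field delay per mic
        steer = np.exp(2j * np.pi * freqs[None, :] * tdoa[:, None])
        powers[i] = np.sum(np.abs(np.sum(X * steer, axis=0)) ** 2)
    return cand_doas[np.argmax(powers)], powers
```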

Culling JF, D'Olne EFC, Davies BD, Powell N, Naylor PA et al., 2023, Practical utility of a head-mounted gaze-directed beamforming system, Journal of the Acoustical Society of America, Vol: 154, Pages: 3760-3768

Assistive auditory devices that enhance signal-to-noise ratio must follow the user's changing attention; errors could lead to the desired source being suppressed as noise. A method for measuring the practical benefit of attention-following speech enhancement is described and used to show a benefit for gaze-directed beamforming over natural binaural hearing. First, participants watched a recorded video conference call between two people with six additional interfering voices in different directions. The directions of the target voices corresponded to the spatial layout of their video streams. A simulated beamformer was yoked to the participant's gaze direction using an eye tracker. For the control condition, all eight voices were spatially distributed in a simulation of unaided binaural hearing. Participants completed questionnaires on the content of the conversation, scoring twice as high in the questionnaires for the beamforming condition. Sentence-by-sentence intelligibility was then measured using new participants who viewed the same audiovisual stimulus for each isolated sentence. Participants recognized twice as many words in the beamforming condition. The results demonstrate the potential practical benefit of gaze-directed beamforming for hearing aids and illustrate how detailed intelligibility data can be retrieved from an experiment that involves behavioral engagement in an ongoing listening task.

Journal article

Neo VW, Redif S, McWhirter JG, Pestana J, Proudler IK, Weiss S, Naylor PA et al., 2023, Polynomial eigenvalue decomposition for multichannel broadband signal processing, IEEE Signal Processing Magazine, Vol: 40, Pages: 18-37, ISSN: 1053-5888

This article is devoted to the polynomial eigenvalue decomposition (PEVD) and its applications in broadband multichannel signal processing, motivated by the optimum solutions provided by the eigenvalue decomposition (EVD) for the narrow-band case [1], [2]. In general, the successful techniques from narrowband problems can also be applied to broadband ones, leading to improved solutions. Multichannel broadband signals arise at the core of many essential commercial applications such as telecommunications, speech processing, healthcare monitoring, astronomy and seismic surveillance, and military technologies like radar, sonar and communications [3]. The success of these applications often depends on the performance of signal processing tasks, including data compression [4], source localization [5], channel coding [6], signal enhancement [7], beamforming [8], and source separation [9]. In most cases and for narrowband signals, performing an EVD is the key to the signal processing algorithm. Therefore, this paper aims to introduce PEVD as a novel mathematical technique suitable for many broadband signal processing applications.

Journal article
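
As a concrete illustration of the object a PEVD operates on (not code from the article): the polynomial matrix is typically the para-Hermitian space-time covariance matrix with lag-tau coefficient R[tau] = E{x[n] x[n-tau]^H}, which can be estimated from data as in this numpy sketch:

```python
import numpy as np

def space_time_covariance(x, max_lag):
    """Batch estimate of R[tau] = E{x[n] x[n - tau]^H} for |tau| <= max_lag.

    x: (n_ch, n_samples) multichannel signal.
    Returns a (2 * max_lag + 1, n_ch, n_ch) stack of lag slices; a PEVD
    algorithm (e.g. SBR2 or SMD) would then factorise this polynomial matrix.
    """
    n_ch, n = x.shape
    R = np.zeros((2 * max_lag + 1, n_ch, n_ch), dtype=complex)
    for k, tau in enumerate(range(-max_lag, max_lag + 1)):
        if tau >= 0:
            R[k] = x[:, tau:] @ x[:, : n - tau].conj().T / (n - tau)
        else:
            R[k] = x[:, : n + tau] @ x[:, -tau:].conj().T / (n + tau)
    return R
```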

Neo VW, Evers C, Weiss S, Naylor PA et al., 2023, Signal compaction using polynomial EVD for spherical array processing with applications, IEEE Transactions on Audio, Speech and Language Processing, Vol: 31, Pages: 3537-3549, ISSN: 1558-7916

Multi-channel signals captured by spatially separated sensors often contain a high level of data redundancy. A compact signal representation enables more efficient storage and processing, which has been exploited for data compression, noise reduction, and speech and image coding. This paper focuses on the compact representation of speech signals acquired by spherical microphone arrays. A polynomial matrix eigenvalue decomposition (PEVD) can spatially decorrelate signals over a range of time lags and is known to achieve optimum multi-channel data compaction. However, the complexity of PEVD algorithms scales at best cubically with the number of channel signals, e.g., the number of microphones comprised in a spherical array used for processing. In contrast, the spherical harmonic transform (SHT) provides a compact spatial representation of the 3-dimensional sound field measured by spherical microphone arrays, referred to as eigenbeam signals, at a cost that rises only quadratically with the number of microphones. Yet, the SHT's spatially orthogonal basis functions cannot completely decorrelate sound field components over a range of time lags. In this work, we propose to exploit the compact representation offered by the SHT to reduce the number of channels used for subsequent PEVD processing. In the proposed framework for signal representation, we show that the diagonality factor improves by up to 7 dB over the microphone signal representation with a significantly lower computation cost. Moreover, when applying this framework to speech enhancement and source separation, the proposed method improves metrics known as short-time objective intelligibility (STOI) and source-to-distortion ratio (SDR) by up to 0.2 and 20 dB, respectively.

Journal article
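
The first stage of the proposed framework can be pictured with a rough least-squares SHT sketch; it deliberately omits the radial mode-strength compensation a rigid-sphere array needs and assumes n_mics >= (N+1)^2, so it illustrates the projection only:

```python
import numpy as np
from scipy.special import sph_harm

def eigenbeam_signals(mic_sigs, azi, pol, order):
    """Least-squares spherical harmonic transform of spherical-array signals.

    mic_sigs: (n_mics, n_samples) microphone signals
    azi, pol: (n_mics,) azimuth and polar angles of the capsules in radians
    order:    maximum SH order N, requiring n_mics >= (N + 1)**2
    Returns ((N + 1)**2, n_samples) eigenbeam signals.
    """
    cols = [sph_harm(m, n, azi, pol)          # one basis column per (n, m)
            for n in range(order + 1) for m in range(-n, n + 1)]
    Y = np.stack(cols, axis=1)                # (n_mics, (N + 1)**2)
    return np.linalg.pinv(Y) @ mic_sigs       # pseudo-inverse projection
```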

Grinstein E, Neo VW, Naylor PA, 2023, Dual input neural networks for positional sound source localization, EURASIP Journal on Audio, Speech, and Music Processing, Vol: 2023, Pages: 1-12, ISSN: 1687-4714

In many signal processing applications, metadata may be advantageously used in conjunction with a high dimensional signal to produce a desired output. In the case of classical Sound Source Localization (SSL) algorithms, information from high-dimensional, multichannel audio signals received by many distributed microphones is combined with information describing acoustic properties of the scene, such as the microphones’ coordinates in space, to estimate the position of a sound source. We introduce Dual Input Neural Networks (DI-NNs) as a simple and effective way to model these two data types in a neural network. We train and evaluate our proposed DI-NN on scenarios of varying difficulty and realism and compare it against an alternative architecture, a classical Least-Squares (LS) method, as well as a classical Convolutional Recurrent Neural Network (CRNN). Our results show that the DI-NN significantly outperforms the baselines, achieving a five times lower localization error than the LS method and two times lower than the CRNN in a test dataset of real recordings.

Journal article
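
A deliberately simplified PyTorch sketch of the dual-branch idea; the paper's actual signal branch is a richer (convolutional-recurrent) encoder, and the layer sizes and 2-D position output used here are placeholder assumptions:

```python
import torch
import torch.nn as nn

class DualInputNet(nn.Module):
    """Two branches: one encodes signal features, one encodes metadata
    (e.g. microphone coordinates); their embeddings are concatenated and
    mapped to a 2-D source position estimate."""

    def __init__(self, sig_dim, meta_dim, hidden=128):
        super().__init__()
        self.sig_branch = nn.Sequential(
            nn.Linear(sig_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.meta_branch = nn.Sequential(
            nn.Linear(meta_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, sig_feat, meta):
        z = torch.cat([self.sig_branch(sig_feat),
                       self.meta_branch(meta)], dim=-1)
        return self.head(z)
```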

Sanguedolce G, Naylor PA, Geranmayeh F, 2023, Uncovering the potential for a weakly supervised end-to-end model in recognising speech from patient with post-stroke aphasia, 5th Clinical Natural Language Processing Workshop, Publisher: Association for Computational Linguistics, Pages: 182-190

Post-stroke speech and language deficits (aphasia) significantly impact patients' quality of life. Many with mild symptoms remain undiagnosed, and the majority do not receive the intensive doses of therapy recommended, due to healthcare costs and/or inadequate services. Automatic Speech Recognition (ASR) may help overcome these difficulties by improving diagnostic rates and providing feedback during tailored therapy. However, its performance is often unsatisfactory due to the high variability in speech errors and scarcity of training datasets. This study assessed the performance of Whisper, a recently released end-to-end model, in patients with post-stroke aphasia (PWA). We tuned its hyperparameters to achieve the lowest word error rate (WER) on aphasic speech. WER was significantly higher in PWA compared to age-matched controls (38.5% vs 10.3%, p < 0.001). We demonstrated that worse WER was related to more severe aphasia as measured by expressive (overt naming, and spontaneous speech production) and receptive (written and spoken comprehension) language assessments. Stroke lesion size did not affect the performance of Whisper. Linear mixed models accounting for demographic factors, therapy duration, and time since stroke, confirmed worse Whisper performance with left hemispheric frontal lesions. We discuss the implications of these findings for how future ASR can be improved in PWA.

Conference paper
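
Since WER is the central metric of this study, a minimal reference implementation may help: it is the word-level Levenshtein distance (substitutions plus deletions plus insertions) normalised by the reference length:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / len(reference),
    via standard Levenshtein dynamic programming over words."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                       # deleting all reference words
    for j in range(len(h) + 1):
        d[0][j] = j                       # inserting all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + sub)     # substitution or match
    return d[-1][-1] / len(r)

# word_error_rate("the cat sat", "the cat sat down")  ->  0.333...
```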

Richard G, Smaragdis P, Gannot S, Naylor PA, Makino S, Kellermann W, Sugiyama A et al., 2023, Audio Signal Processing in the 21st Century: The important outcomes of the past 25 years, IEEE Signal Processing Magazine, Vol: 40, Pages: 12-26, ISSN: 1053-5888

Journal article

Guiraud P, Moore AH, Vos RR, Naylor PA, Brookes M et al., 2023, Using a single-channel reference with the MBSTOI binaural intelligibility metric, Speech Communication, Vol: 149, Pages: 74-83, ISSN: 0167-6393

In order to assess the intelligibility of a target signal in a noisy environment, intrusive speech intelligibility metrics are typically used. They require a clean reference signal to be available, which can be difficult to obtain, especially for binaural metrics like the modified binaural short time objective intelligibility metric (MBSTOI). We here present a hybrid version of MBSTOI that incorporates a deep learning stage allowing the metric to be computed with only a single-channel clean reference signal. The models presented are trained on simulated data containing target speech, localised noise, diffuse noise, and reverberation. The hybrid output metrics are then compared directly to MBSTOI to assess how closely a single-channel reference can approximate the full binaural reference. The outcome of this work offers a fast and flexible way to generate audio data for machine learning (ML) and highlights the potential for low-level integration of ML into existing tools.

Journal article

Grinstein E, Brookes M, Naylor PA, 2023, Graph Neural Networks for Sound Source Localization on Distributed Microphone Networks, ISSN: 1520-6149

Distributed Microphone Arrays (DMAs) present many challenges compared with centralized microphone arrays. An important requirement of applications on these arrays is handling a variable number of input channels. We consider the use of Graph Neural Networks (GNNs) as a solution to this challenge. We present a localization method using the Relation Network GNN, which we show shares many similarities with classical signal processing algorithms for Sound Source Localization (SSL). We apply our method to the task of SSL and validate it experimentally using an unseen number of microphones. We test different feature extractors and show that our approach significantly outperforms classical baselines.

Conference paper
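
A toy PyTorch sketch of the Relation Network aggregation pattern referred to above: a shared MLP scores every pair of per-channel features, and summing the pair embeddings yields invariance to the number and order of microphones. Dimensions and the 2-D output are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class RelationNetSSL(nn.Module):
    """A shared MLP g scores every ordered pair of per-microphone feature
    vectors; summing the pair embeddings makes the model invariant to the
    number and ordering of channels, and a head f regresses the position."""

    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(2 * feat_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, hidden))
        self.f = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                               nn.Linear(hidden, 2))

    def forward(self, feats):                        # feats: (n_mics, feat_dim)
        n = feats.shape[0]
        i, j = torch.meshgrid(torch.arange(n), torch.arange(n), indexing="ij")
        pairs = torch.cat([feats[i.flatten()], feats[j.flatten()]], dim=-1)
        return self.f(self.g(pairs).sum(dim=0))      # permutation-invariant
```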

Sharma D, Nespoli F, Gong R, Naylor PA et al., 2023, Canonical Voice Conversion and Dual-Channel Processing For Improved Voice Privacy of Speech Recognition Data, Pages: 66-70, ISSN: 2219-5491

This paper addresses the need for enhancing the privacy of test data in a deployed automatic speech recognition (ASR) system so that what was said cannot be linked to who said it, a process we describe as acoustic de-identification. Existing techniques can be used to modify voice characteristics to make the speaker identity unrecognizable, but normally at the expense of ASR performance. We present a novel approach for improving ASR performance on acoustically de-identified voice data. Our method exploits a dual-channel input to a self-attention channel combinator front-end to an end-to-end ASR system, and data augmentation, where some amount of original speech data is used in model training. The voice data is de-identified by a zero-shot voice style transfer system to the voice of a registered, canonical speaker. We show that the proposed approach achieves a significant improvement in privacy as demonstrated by a 10x increase in the EER of an automatic speaker verification system, while also improving the ASR accuracy as demonstrated by an 18.3% reduction in WER relative to a single-channel baseline model when tested on acoustically de-identified speech.

Conference paper

Fosburgh J, Sharma D, Naylor PA, 2023, Room adaptation of training data for distant speech recognition, Pages: 71-75, ISSN: 2219-5491

We present a novel signal processing-based approach for estimating room impulse responses for augmentation of ASR training data that is best suited to the reverberation characteristics in a particular acoustic space. Our approach estimates an impulse response of a room by using a supervised adaptive system identification algorithm to extract the relative transfer function between a speech source played through a loudspeaker and recorded by a microphone. These impulse responses can then be applied to clean speech files to create an augmented training set for an ASR system. Given the availability of a small amount of this type of playback audio for a room, we show that an ASR model trained with our data augmentation approach can provide a 19% relative reduction in word error rate compared to a system using random augmentation.

Conference paper
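
The supervised adaptive system identification step can be illustrated with a standard NLMS filter; this is a generic sketch rather than the authors' exact algorithm, and the filter length and step size are assumed values:

```python
import numpy as np

def nlms_identify(x, d, filt_len=4096, mu=0.5, eps=1e-8):
    """Estimate a loudspeaker-to-microphone impulse response with NLMS.

    x: signal played through the loudspeaker; d: microphone recording.
    The adapted filter w approximates the room response and can be
    convolved with clean speech to build room-matched training data.
    """
    w = np.zeros(filt_len)
    for n in range(filt_len - 1, len(x)):
        u = x[n - filt_len + 1 : n + 1][::-1]   # newest sample first
        e = d[n] - w @ u                        # a-priori error
        w += mu * e * u / (u @ u + eps)         # normalised LMS update
    return w

# augmented = np.convolve(clean_speech, w)
```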

Hafezi S, Moore AH, Guiraud P, Naylor PA, Donley J, Tourbabin V, Lunner T et al., 2023, Subspace Hybrid Beamforming for Head-Worn Microphone Arrays, ISSN: 1520-6149

A two-stage multi-channel speech enhancement method is proposed which consists of a novel adaptive beamformer, Hybrid Minimum Variance Distortionless Response (MVDR), an Isotropic-MVDR (Iso) beamformer, and a novel multi-channel spectral Principal Components Analysis (PCA) denoising. In the first stage, the Hybrid-MVDR performs multiple MVDRs using a dictionary of pre-defined noise field models and picks the minimum-power outcome, which benefits from the robustness of signal-independent beamforming and the performance of adaptive beamforming. In the second stage, the outcomes of Hybrid and Iso are jointly used in a two-channel PCA-based denoising to remove the 'musical noise' produced by the Hybrid beamformer. On a dataset of real 'cocktail-party' recordings with a head-worn array, the proposed method outperforms the baseline superdirective beamformer in noise suppression (fwSegSNR, SDR, SIR, SAR) and speech intelligibility (STOI) with similar speech quality (PESQ) improvement.

Conference paper
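
A single-bin numpy sketch of the first-stage minimum-power selection, assuming a pre-supplied dictionary of noise covariance models; the PCA denoising stage and all array-specific details are omitted:

```python
import numpy as np

def hybrid_mvdr_bin(x, d, noise_models):
    """Minimum-power selection over a dictionary of noise-field models,
    applied at a single time-frequency bin.

    x: (n_mics,) microphone spectra at this bin
    d: (n_mics,) steering vector for the look direction
    noise_models: list of (n_mics, n_mics) noise covariance models
    """
    best_y, best_p = None, np.inf
    for R in noise_models:
        Rinv_d = np.linalg.solve(R, d)
        w = Rinv_d / (d.conj() @ Rinv_d)    # MVDR: R^-1 d / (d^H R^-1 d)
        y = w.conj() @ x                    # beamformer output
        if np.abs(y) ** 2 < best_p:
            best_y, best_p = y, np.abs(y) ** 2
    return best_y
```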

Sathyapriyan V, Pedersen MS, Brookes M, Østergaard J, Naylor PA, Jensen J et al., 2023, Speech enhancement using binary estimator selection applied to hearing aids with a remote microphone, Pages: 38-42

This paper introduces a speech enhancement algorithm for hearing assistive devices, e.g., hearing aids, connected to a remote microphone. Remote microphones are especially beneficial to hearing aid users when they are present in environments with low signal-to-noise ratios. The transmission of the acoustic data from the remote microphone to the hearing aid unit, however, happens through a wireless channel that is prone to network delays. Such delays, which occur in any real-world application, make the remote microphone signal less valuable, in contrast to when the transmission is assumed to be error-free and instantaneous, as is often done in the literature. To make use of the remote microphone signal despite the delay, we propose an estimator selection method that selects between the minimum mean-square error estimates of the desired signal made using the hearing aid signals and the delayed remote microphone signal, respectively. This binary selection is made by comparing the normalized mean-square errors of the two desired signal estimates. We show that the proposed method provides a benefit in estimated speech intelligibility, for delays in transmission up to 30 ms at a signal-to-noise ratio of 0 dB, in comparison to the minimum mean-square error estimate made using only the hearing aid microphone signals.

Conference paper
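
The selection rule itself is simple, as the numpy sketch below shows; the substance of the paper lies in estimating the two normalised mean-square errors, which are taken as given here:

```python
import numpy as np

def binary_estimator_selection(est_ha, est_rm, nmse_ha, nmse_rm):
    """Frame-wise selection between two desired-signal estimates.

    est_ha, est_rm: (n_frames, frame_len) estimates from the hearing-aid
    microphones and from the delayed remote microphone, respectively.
    nmse_ha, nmse_rm: (n_frames,) normalised mean-square error of each.
    """
    use_rm = nmse_rm < nmse_ha                  # frames where the remote wins
    return np.where(use_rm[:, None], est_rm, est_ha)
```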

Nespoli F, Barreda D, Bitzer J, Naylor PA et al., 2023, Two-Stage Voice Anonymization for Enhanced Privacy, Pages: 3854-3858, ISSN: 2308-457X

In recent years, the need for privacy preservation when manipulating or storing personal data, including speech, has become a major issue. In this paper, we present a system addressing the speaker-level anonymization problem. We propose and evaluate a two-stage anonymization pipeline exploiting a state-of-the-art anonymization model described in the Voice Privacy Challenge 2022 in combination with a zero-shot voice conversion architecture able to capture speaker characteristics from a few seconds of speech. We show that this architecture can provide strong privacy protection while preserving pitch information. Finally, we propose a new compressed metric to evaluate anonymization systems in privacy scenarios with different constraints on privacy and utility.

Conference paper

Guiraud P, Moore AH, Vos RR, Naylor PA, Brookes M et al., 2023, The MBSTOI Binaural Intelligibility Metric Using a Close-Talking Microphone Reference, ISSN: 1520-6149

Intelligibility metrics are a fast way to determine how comprehensible a target signal is in a noisy situation. Most metrics however rely on having a clean reference signal for computation and are not adapted to live recordings. In this paper the deep correlation modified binaural short time objective intelligibility metric (Dcor-MBSTOI) is evaluated with a single-channel close-talking microphone signal as the reference. This reference signal inevitably contains some background noise and crosstalk from non-target sources. It is found that intelligibility is overestimated when using the close-talking microphone signal directly but that this overestimation can be eliminated by applying speech enhancement to the reference signal.

Conference paper

Nespoli F, Pohlhausen J, Naylor PA, Bitzer J et al., 2023, Long-term Conversation Analysis: Exploring Utility and Privacy, Pages: 26-30

The analysis of conversations recorded in everyday life requires privacy protection. In this contribution, we explore a privacy-preserving feature extraction method based on input feature dimension reduction, spectral smoothing, and a low-cost speaker anonymization technique using the McAdams coefficient. We assess the utility of the feature extraction methods with a voice activity detection and a speaker diarization system, while privacy protection is determined with a speech recognition and a speaker verification model. We show that the combination of the McAdams coefficient and spectral smoothing maintains the utility while improving privacy.

Conference paper
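
The McAdams-coefficient technique mentioned above warps speaker characteristics by raising the LPC pole angles to a power alpha. A single-frame sketch using librosa and scipy (a complete anonymiser would process overlapping frames and overlap-add the outputs; alpha and the LPC order are typical values, not the paper's settings):

```python
import numpy as np
import librosa
from scipy.signal import lfilter

def mcadams_frame(frame, alpha=0.8, order=20):
    """Anonymise one frame by raising LPC pole angles to the power alpha,
    which shifts the formants while leaving the excitation untouched."""
    a = librosa.lpc(frame, order=order)        # LPC coefficients [1, a1, ...]
    residual = lfilter(a, [1.0], frame)        # inverse filter -> excitation
    poles = np.roots(a)
    warped = [p if np.isreal(p) else
              np.abs(p) * np.exp(1j * np.sign(np.angle(p))
                                 * np.abs(np.angle(p)) ** alpha)
              for p in poles]
    a_new = np.real(np.poly(warped))           # rebuild the warped filter
    return lfilter([1.0], a_new, residual)     # resynthesise the frame
```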

McKnight SW, Hogg AOT, Neo VW, Naylor PA et al., 2023, Uncertainty Quantification in Machine Learning for Joint Speaker Diarization and Identification, CoRR, Vol: abs/2312.16763

Journal article

McKnight S, Hogg AOT, Neo VW, Naylor PA et al., 2022, Studying human-based speaker diarization and comparing to state-of-the-art systems, APSIPA 2022, Publisher: IEEE, Pages: 394-401

Human-based speaker diarization experiments were carried out on a five-minute extract of a typical AMI corpus meeting to see how much variance there is in human reviews based on hearing only and to compare with state-of-the-art diarization systems on the same extract. There are three distinct experiments: (a) one with no prior information; (b) one with the ground truth speech activity detection (GT-SAD); and (c) one with the blank ground truth labels (GT-labels). The results show that most human reviews tend to be quite similar, albeit with some outliers, but the choice of GT-labels can make a dramatic difference to scored performance. Using the GT-SAD provides a big advantage and improves human review scores substantially, though small differences in the GT-SAD used can have a dramatic effect on results. The use of forgiveness collars is shown to be unhelpful. The results show that state-of-the-art systems can outperform the best human reviews when no prior information is provided. However, the best human reviews still outperform state-of-the-art systems when starting from the GT-SAD.

Conference paper

D'Olne E, Neo VW, Naylor PA, 2022, Speech enhancement in distributed microphone arrays using polynomial eigenvalue decomposition, European Signal Processing Conference (EUSIPCO), Publisher: IEEE, Pages: 55-59, ISSN: 2219-5491

As the number of connected devices equipped with multiple microphones increases, scientific interest in distributed microphone array processing grows. Current beamforming methods heavily rely on estimating quantities related to array geometry, which is extremely challenging in real, non-stationary environments. Recent work on polynomial eigenvalue decomposition (PEVD) has shown promising results for speech enhancement in singular arrays without requiring the estimation of any array-related parameter [1]. This work extends these results to the realm of distributed microphone arrays, and further presents a novel framework for speech enhancement in distributed microphone arrays using PEVD. The proposed approach is shown to almost always outperform optimum beamformers located at arrays closest to the desired speaker. Moreover, the proposed approach exhibits very strong robustness to steering vector errors.

Conference paper

Neo VW, Weiss S, McKnight S, Hogg A, Naylor PA et al., 2022, Polynomial eigenvalue decomposition-based target speaker voice activity detection in the presence of competing talkers, International Workshop on Acoustic Signal Enhancement (IWAENC), Publisher: IEEE, Pages: 1-5

Voice activity detection (VAD) algorithms are essential for many speech processing applications, such as speaker diarization, automatic speech recognition, speech enhancement, and speech coding. With a good VAD algorithm, non-speech segments can be excluded to improve the performance and computation of these applications. In this paper, we propose a polynomial eigenvalue decomposition-based target-speaker VAD algorithm to detect unseen target speakers in the presence of competing talkers. The proposed approach uses frame-based processing to compute the syndrome energy, used for testing the presence or absence of a target speaker. The proposed approach is consistently among the best in F1 and balanced accuracy scores over the investigated range of signal to interference ratio (SIR) from -10 dB to 20 dB.

Conference paper

D'Olne E, Neo VW, Naylor PA, 2022, Frame-based space-time covariance matrix estimation for polynomial eigenvalue decomposition-based speech enhancement, International Workshop on Acoustic Signal Enhancement (IWAENC), Publisher: IEEE, Pages: 1-5

Recent work in speech enhancement has proposed a polynomial eigenvalue decomposition (PEVD) method, yielding significant intelligibility and noise-reduction improvements without introducing distortions in the enhanced signal [1]. The method relies on the estimation of a space-time covariance matrix, performed in batch mode such that a sufficiently long portion of the noisy signal is used to derive an accurate estimate. However, in applications where the scene is nonstationary, this approach is unable to adapt to changes in the acoustic scenario. This paper thus proposes a frame-based procedure for the estimation of space-time covariance matrices and investigates its impact on subsequent PEVD speech enhancement. The method is found to yield spatial filters and speech enhancement improvements comparable to the batch method in [1], showing potential for real-time processing.

Conference paper
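
One plausible reading of a frame-based estimator, sketched in numpy with exponential forgetting; the forgetting scheme and windowing are assumptions rather than the paper's exact procedure:

```python
import numpy as np

def update_space_time_cov(R, frame, max_lag, forget=0.95):
    """Recursive frame-based update of the space-time covariance estimate.

    R:     (2 * max_lag + 1, n_ch, n_ch) running estimate (updated in place)
    frame: (n_ch, frame_len) newest frame, with frame_len > max_lag
    """
    n_ch, n = frame.shape
    for k, tau in enumerate(range(-max_lag, max_lag + 1)):
        if tau >= 0:
            inst = frame[:, tau:] @ frame[:, : n - tau].conj().T / (n - tau)
        else:
            inst = frame[:, : n + tau] @ frame[:, -tau:].conj().T / (n + tau)
        R[k] = forget * R[k] + (1.0 - forget) * inst   # exponential smoothing
    return R
```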

Neo VW, D'Olne E, Moore AH, Naylor PA et al., 2022, Fixed beamformer design using polynomial eigenvalue decomposition, International Workshop on Acoustic Signal Enhancement (IWAENC), Publisher: IEEE, Pages: 1-5

Array processing is widely used in many speech applications involving multiple microphones. These applications include automatic speech recognition, robot audition, telecommunications, and hearing aids. A spatio-temporal filter for the array allows signals from different microphones to be combined desirably to improve the application performance. This paper will analyze and visually interpret the eigenvector beamformers designed by the polynomial eigenvalue decomposition (PEVD) algorithm, which are suited for arbitrary arrays. The proposed fixed PEVD beamformers are lightweight, with an average filter length of 114, and perform comparably to classical data-dependent minimum variance distortionless response (MVDR) and linearly constrained minimum variance (LCMV) beamformers for the separation of sources closely spaced by 5 degrees.

Conference paper

Tokala V, Brookes M, Naylor P, 2022, Binaural speech enhancement using STOI-optimal masks, International Workshop on Acoustic Signal Enhancement (IWAENC) 2022, Publisher: IEEE, Pages: 1-5

STOI-optimal masking has been previously proposed and developed for single-channel speech enhancement. In this paper, we consider the extension to the task of binaural speech enhancement, in which the spatial information is known to be important to speech understanding and therefore should be preserved by the enhancement processing. Masks are estimated for each of the binaural channels individually and a 'better-ear listening' mask is computed by choosing the maximum of the two masks. The estimated mask is used to supply probability information about the speech presence in each time-frequency bin to an Optimally-modified Log Spectral Amplitude (OM-LSA) enhancer. We show that using the proposed method for binaural signals with a directional noise not only improves the SNR of the noisy signal but also preserves the binaural cues and intelligibility.

Conference paper
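
A numpy sketch of the better-ear masking step; for brevity the mask is applied as a direct gain, whereas the paper supplies it to an OM-LSA enhancer as speech-presence information. Applying the same gain to both channels is what preserves the interaural cues:

```python
import numpy as np

def better_ear_enhance(stft_l, stft_r, mask_l, mask_r):
    """Combine two monaural mask estimates by a per-bin maximum and apply
    the SAME gain to both channels, so interaural cues are preserved."""
    mask = np.maximum(mask_l, mask_r)     # 'better-ear listening' mask
    return mask * stft_l, mask * stft_r
```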

Neo VW, Weiss S, Naylor PA, 2022, A polynomial subspace projection approach for the detection of weak voice activity, Sensor Signal Processing for Defence conference (SSPD), Publisher: IEEE, Pages: 1-5

A voice activity detection (VAD) algorithm identifies whether or not time frames contain speech. It is essential for many military and commercial speech processing applications, including speech enhancement, speech coding, speaker identification, and automatic speech recognition. In this work, we adopt earlier work on detecting weak transient signals and propose a polynomial subspace projection pre-processor to improve an existing VAD algorithm. The proposed multi-channel pre-processor projects the microphone signals onto a lower dimensional subspace which attempts to remove the interferer components and thus eases the detection of the speech target. Compared to applying the same VAD to the microphone signal, the proposed approach almost always improves the F1 and balanced accuracy scores even in adverse environments, e.g. -30 dB SIR, which may be typical of operations involving noisy machinery and signal jamming scenarios.

Conference paper

Moore AH, Green T, Brookes DM, Naylor PA et al., 2022, Measuring audio-visual speech intelligibility under dynamic listening conditions using virtual reality, AES 2022 International Audio for Virtual and Augmented Reality Conference, Publisher: Audio Engineering Society (AES), Pages: 1-8

The ELOSPHERES project is a collaboration between researchers at Imperial College London and University College London which aims to improve the efficacy of hearing aids. The benefit obtained from hearing aids varies significantly between listeners and listening environments. The noisy, reverberant environments which most people find challenging bear little resemblance to the clinics in which consultations occur. In order to make progress in speech enhancement, algorithms need to be evaluated under realistic listening conditions. A key aim of ELOSPHERES is to create a virtual reality-based test environment in which alternative speech enhancement algorithms can be evaluated using a listener-in-the-loop paradigm. In this paper we present the sap-elospheres-audiovisual-test (SEAT) platform and report the results of an initial experiment in which it was used to measure the benefit of visual cues in a speech intelligibility in spatial noise task.

Conference paper

Moore AH, Hafezi S, Vos RR, Naylor PA, Brookes M et al., 2022, A compact noise covariance matrix model for MVDR beamforming, IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol: 30, Pages: 2049-2061, ISSN: 2329-9290

Acoustic beamforming is routinely used to improve the SNR of the received signal in applications such as hearing aids, robot audition, augmented reality, teleconferencing, source localisation and source tracking. The beamformer can be made adaptive by using an estimate of the time-varying noise covariance matrix in the spectral domain to determine an optimised beam pattern in each frequency bin that is specific to the acoustic environment and that can respond to temporal changes in it. However, robust estimation of the noise covariance matrix remains a challenging task especially in non-stationary acoustic environments. This paper presents a compact model of the signal covariance matrix that is defined by a small number of parameters whose values can be reliably estimated. The model leads to a robust estimate of the noise covariance matrix which can, in turn, be used to construct a beamformer. The performance of beamformers designed using this approach is evaluated for a spherical microphone array under a range of conditions using both simulated and measured room impulse responses. The proposed approach demonstrates consistent gains in intelligibility and perceptual quality metrics compared to the static and adaptive beamformers used as baselines.

Journal article
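
To illustrate the general idea of replacing a noisy full-rank covariance estimate with a few reliably estimated parameters, here is a two-term diffuse-plus-uncorrelated model fitted by least squares; the paper's actual parameterisation differs, so this is only a caricature of the approach:

```python
import numpy as np

def fit_compact_noise_model(R_hat, R_iso):
    """Fit R ~= a * R_iso + b * I to a raw covariance estimate by least
    squares over the matrix entries, returning the smoothed model."""
    n = R_hat.shape[0]
    basis = np.stack([R_iso.ravel(), np.eye(n).ravel()], axis=1)
    coef, *_ = np.linalg.lstsq(basis, R_hat.ravel(), rcond=None)
    a, b = coef.real                      # two scalar model parameters
    return a * R_iso + b * np.eye(n)
```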

McKnight S, Hogg A, Neo V, Naylor P et al., 2022, A study of salient modulation domain features for speaker identification, Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Publisher: IEEE, Pages: 705-712

This paper studies the ranges of acoustic and modulation frequencies of speech most relevant for identifying speakers and compares the speaker-specific information present in the temporal envelope against that present in the temporal fine structure. This study uses correlation and feature importance measures, random forest and convolutional neural network models, and reconstructed speech signals with specific acoustic and/or modulation frequencies removed to identify the salient points. It is shown that the range of modulation frequencies associated with the fundamental frequency is more important than the 1-16 Hz range most commonly used in automatic speech recognition, and that the 0 Hz modulation frequency band contains significant speaker information. It is also shown that the temporal envelope is more discriminative among speakers than the temporal fine structure, but that the temporal fine structure still contains useful additional information for speaker identification. This research aims to provide a timely addition to the literature by identifying specific aspects of speech relevant for speaker identification that could be used to enhance the discriminant capabilities of machine learning models.

Conference paper
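
Modulation-domain features of the kind studied here are commonly obtained by a second spectral analysis, across time, of each acoustic band's envelope; a scipy sketch (window sizes are arbitrary choices):

```python
import numpy as np
from scipy.signal import stft

def modulation_spectrum(x, fs=16000, nperseg=512, noverlap=384):
    """Acoustic STFT followed by an FFT across time of each band's
    magnitude envelope: energy vs (acoustic freq, modulation freq)."""
    f_ac, _, X = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    env = np.abs(X)                                  # per-band envelope
    frame_rate = fs / (nperseg - noverlap)           # envelope sample rate
    M = np.abs(np.fft.rfft(env, axis=1))             # modulation spectrum
    f_mod = np.fft.rfftfreq(env.shape[1], d=1.0 / frame_rate)
    return f_ac, f_mod, M
```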

Green T, Hilkhuysen G, Huckvale M, Rosen S, Brookes M, Moore A, Naylor P, Lightburn L, Xue W et al., 2022, Speech recognition with a hearing-aid processing scheme combining beamforming with mask-informed speech enhancement, Trends in Hearing, Vol: 26, Pages: 1-16, ISSN: 2331-2165

A signal processing approach combining beamforming with mask-informed speech enhancement was assessed by measuring sentence recognition in listeners with mild-to-moderate hearing impairment in adverse listening conditions that simulated the output of behind-the-ear hearing aids in a noisy classroom. Two types of beamforming were compared: binaural, with the two microphones of each aid treated as a single array, and bilateral, where independent left and right beamformers were derived. Binaural beamforming produces a narrower beam, maximising improvement in signal-to-noise ratio (SNR), but eliminates the spatial diversity that is preserved in bilateral beamforming. Each beamformer type was optimised for the true target position and implemented with and without additional speech enhancement in which spectral features extracted from the beamformer output were passed to a deep neural network trained to identify time-frequency regions dominated by target speech. Additional conditions comprising binaural beamforming combined with speech enhancement implemented using Wiener filtering or modulation-domain Kalman filtering were tested in normally-hearing (NH) listeners. Both beamformer types gave substantial improvements relative to no processing, with significantly greater benefit for binaural beamforming. Performance with additional mask-informed enhancement was poorer than with beamforming alone, for both beamformer types and both listener groups. In NH listeners the addition of mask-informed enhancement produced significantly poorer performance than both other forms of enhancement, neither of which differed from the beamformer alone. In summary, the additional improvement in SNR provided by binaural beamforming appeared to outweigh loss of spatial information, while speech understanding was not further improved by the mask-informed enhancement method implemented here.

Journal article

This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.
