Imperial College London

Patrick A. Naylor

Faculty of Engineering, Department of Electrical and Electronic Engineering

Professor of Speech & Acoustic Signal Processing
 
 
 

Contact

 

+44 (0)20 7594 6235
p.naylor
Website

 
 

Location

 

803, Electrical Engineering, South Kensington Campus



Publications

389 results found

Neo VW, D'Olne E, Moore AH, Naylor PA et al., 2022, Fixed beamformer design using polynomial eigenvalue decomposition, International Workshop on Acoustic Signal Enhancement (IWAENC)

Array processing is widely used in many speech applications involving multiple microphones. These applications include automatic speech recognition, robot audition, telecommunications, and hearing aids. A spatio-temporal filter for the array allows signals from different microphones to be combined desirably to improve the application performance. This paper will analyze and visually interpret the eigenvector beamformers designed by the polynomial eigenvalue decomposition (PEVD) algorithm, which are suited for arbitrary arrays. The proposed fixed PEVD beamformers are lightweight, with an average filter length of 114, and perform comparably to classical data-dependent minimum variance distortionless response (MVDR) and linearly constrained minimum variance (LCMV) beamformers for the separation of sources closely spaced by 5 degrees.

Conference paper

McKnight S, Hogg AOT, Neo VW, Naylor PA et al., 2022, Studying Human-Based Speaker Diarization and Comparing to State-of-the-Art Systems, APSIPA 2022

Conference paper

Neo VW, Weiss S, McKnight S, Hogg A, Naylor PA et al., 2022, Polynomial eigenvalue decomposition-based target speaker voice activity detection in the presence of competing talkers, 17th International Workshop on Acoustic Signal Enhancement

Voice activity detection (VAD) algorithms are essential for many speech processing applications, such as speaker diarization, automatic speech recognition, speech enhancement, and speech coding. With a good VAD algorithm, non-speech segments can be excluded to improve the performance and computation of these applications. In this paper, we propose a polynomial eigenvalue decomposition-based target-speaker VAD algorithm to detect unseen target speakers in the presence of competing talkers. The proposed approach uses frame-based processing to compute the syndrome energy, used for testing the presence or absence of a target speaker. The proposed approach is consistently among the best in F1 and balanced accuracy scores over the investigated range of signal to interference ratio (SIR) from -10 dB to 20 dB.

Conference paper

Moore AH, Green T, Brookes DM, Naylor PA et al., 2022, Measuring audio-visual speech intelligibility under dynamic listening conditions using virtual reality, AES 2022 International Audio for Virtual and Augmented Reality Conference, Publisher: Audio Engineering Society (AES), Pages: 1-8

The ELOSPHERES project is a collaboration between researchers at Imperial College London and University College London which aims to improve the efficacy of hearing aids. The benefit obtained from hearing aids varies significantly between listeners and listening environments. The noisy, reverberant environments which most people find challenging bear little resemblance to the clinics in which consultations occur. In order to make progress in speech enhancement, algorithms need to be evaluated under realistic listening conditions. A key aim of ELOSPHERES is to create a virtual reality-based test environment in which alternative speech enhancement algorithms can be evaluated using a listener-in-the-loop paradigm. In this paper we present the sap-elospheres-audiovisual-test (SEAT) platform and report the results of an initial experiment in which it was used to measure the benefit of visual cues in a speech intelligibility in spatial noise task.

Conference paper

D'Olne E, Neo VW, Naylor PA, 2022, Frame-based space-time covariance matrix estimation for polynomial eigenvalue decomposition-based speech enhancement, International Workshop on Acoustic Signal Enhancement (IWAENC), Publisher: IEEE

Recent work in speech enhancement has proposed a polynomial eigenvalue decomposition (PEVD) method, yielding significant intelligibility and noise-reduction improvements without introducing distortions in the enhanced signal [1]. The method relies on the estimation of a space-time covariance matrix, performed in batch mode such that a sufficiently long portion of the noisy signal is used to derive an accurate estimate. However, in applications where the scene is non-stationary, this approach is unable to adapt to changes in the acoustic scenario. This paper thus proposes a frame-based procedure for the estimation of space-time covariance matrices and investigates its impact on subsequent PEVD speech enhancement. The method is found to yield spatial filters and speech enhancement improvements comparable to the batch method in [1], showing potential for real-time processing.

Conference paper

Tokala V, Brookes M, Naylor P, 2022, Binaural speech enhancement using STOI-optimal masks, International Workshop on Acoustic Signal Enhancement (IWAENC) 2022, Publisher: IEEE

STOI-optimal masking has been previously proposed and developed for single-channel speech enhancement. In this paper, we consider the extension to the task of binaural speech enhancement, in which the spatial information is known to be important to speech understanding and therefore should be preserved by the enhancement processing. Masks are estimated for each of the binaural channels individually and a 'better-ear listening' mask is computed by choosing the maximum of the two masks. The estimated mask is used to supply probability information about the speech presence in each time-frequency bin to an Optimally-modified Log Spectral Amplitude (OM-LSA) enhancer. We show that using the proposed method for binaural signals with a directional noise not only improves the SNR of the noisy signal but also preserves the binaural cues and intelligibility.

Conference paper
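The 'better-ear listening' combination described in the abstract above reduces to an element-wise maximum over the two per-channel mask estimates. A minimal illustrative sketch (the mask values and function name are hypothetical, not the paper's implementation):

```python
import numpy as np

def better_ear_mask(mask_left, mask_right):
    """Combine per-channel speech-presence masks by 'better-ear listening':
    for each time-frequency bin, keep the larger of the two mask values."""
    return np.maximum(mask_left, mask_right)

# Toy 2x2 time-frequency grids of mask values in [0, 1].
mL = np.array([[0.2, 0.9], [0.6, 0.1]])
mR = np.array([[0.5, 0.4], [0.3, 0.8]])
m = better_ear_mask(mL, mR)
# m == [[0.5, 0.9], [0.6, 0.8]]
```

The combined mask is then passed, bin by bin, to the downstream enhancer as a speech-presence probability.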

Neo VW, Weiss S, Naylor PA, 2022, A polynomial subspace projection approach for the detection of weak voice activity, Sensor Signal Processing for Defence conference (SSPD), Publisher: IEEE

A voice activity detection (VAD) algorithm identifies whether or not time frames contain speech. It is essential for many military and commercial speech processing applications, including speech enhancement, speech coding, speaker identification, and automatic speech recognition. In this work, we adopt earlier work on detecting weak transient signals and propose a polynomial subspace projection pre-processor to improve an existing VAD algorithm. The proposed multi-channel pre-processor projects the microphone signals onto a lower dimensional subspace which attempts to remove the interferer components and thus eases the detection of the speech target. Compared to applying the same VAD to the microphone signal, the proposed approach almost always improves the F1 and balanced accuracy scores even in adverse environments, e.g. -30 dB SIR, which may be typical of operations involving noisy machinery and signal jamming scenarios.

Conference paper

Moore AH, Hafezi S, Vos RR, Naylor PA, Brookes M et al., 2022, A compact noise covariance matrix model for MVDR beamforming, IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol: 30, Pages: 2049-2061, ISSN: 2329-9290

Acoustic beamforming is routinely used to improve the SNR of the received signal in applications such as hearing aids, robot audition, augmented reality, teleconferencing, source localisation and source tracking. The beamformer can be made adaptive by using an estimate of the time-varying noise covariance matrix in the spectral domain to determine an optimised beam pattern in each frequency bin that is specific to the acoustic environment and that can respond to temporal changes in it. However, robust estimation of the noise covariance matrix remains a challenging task especially in non-stationary acoustic environments. This paper presents a compact model of the signal covariance matrix that is defined by a small number of parameters whose values can be reliably estimated. The model leads to a robust estimate of the noise covariance matrix which can, in turn, be used to construct a beamformer. The performance of beamformers designed using this approach is evaluated for a spherical microphone array under a range of conditions using both simulated and measured room impulse responses. The proposed approach demonstrates consistent gains in intelligibility and perceptual quality metrics compared to the static and adaptive beamformers used as baselines.

Journal article
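For context on the MVDR beamformers discussed above, the distortionless weights in each frequency bin follow the standard closed-form solution w = R⁻¹d / (dᴴR⁻¹d). A minimal numpy sketch (the paper's compact covariance model is not reproduced here; the steering vector below is a toy example):

```python
import numpy as np

def mvdr_weights(R_noise, d):
    """MVDR weights for one frequency bin: w = R^{-1} d / (d^H R^{-1} d),
    minimising output noise power subject to the constraint w^H d = 1."""
    Rinv_d = np.linalg.solve(R_noise, d)
    return Rinv_d / (d.conj() @ Rinv_d)

# Toy example: 4-microphone array; with an identity noise covariance,
# MVDR reduces to a normalised matched filter (delay-and-sum).
M = 4
d = np.exp(-2j * np.pi * 0.1 * np.arange(M))   # illustrative steering vector
w = mvdr_weights(np.eye(M), d)
assert np.isclose(w.conj() @ d, 1.0)           # distortionless constraint holds
```

The robustness question the paper addresses lies entirely in how R_noise is estimated; the weight computation itself is this one linear solve per bin.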

D'Olne E, Neo VW, Naylor PA, 2022, Speech enhancement in distributed microphone arrays using polynomial eigenvalue decomposition, European Signal Processing Conference (EUSIPCO), Publisher: IEEE, ISSN: 2219-5491

As the number of connected devices equipped with multiple microphones increases, scientific interest in distributed microphone array processing grows. Current beamforming methods heavily rely on estimating quantities related to array geometry, which is extremely challenging in real, non-stationary environments. Recent work on polynomial eigenvalue decomposition (PEVD) has shown promising results for speech enhancement in singular arrays without requiring the estimation of any array-related parameter [1]. This work extends these results to the realm of distributed microphone arrays, and further presents a novel framework for speech enhancement in distributed microphone arrays using PEVD. The proposed approach is shown to almost always outperform optimum beamformers located at arrays closest to the desired speaker. Moreover, the proposed approach exhibits very strong robustness to steering vector errors.

Conference paper

McKnight S, Hogg A, Neo V, Naylor P et al., 2022, A study of salient modulation domain features for speaker identification, Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Publisher: IEEE, Pages: 705-712

This paper studies the ranges of acoustic and modulation frequencies of speech most relevant for identifying speakers and compares the speaker-specific information present in the temporal envelope against that present in the temporal fine structure. This study uses correlation and feature importance measures, random forest and convolutional neural network models, and reconstructed speech signals with specific acoustic and/or modulation frequencies removed to identify the salient points. It is shown that the range of modulation frequencies associated with the fundamental frequency is more important than the 1-16 Hz range most commonly used in automatic speech recognition, and that the 0 Hz modulation frequency band contains significant speaker information. It is also shown that the temporal envelope is more discriminative among speakers than the temporal fine structure, but that the temporal fine structure still contains useful additional information for speaker identification. This research aims to provide a timely addition to the literature by identifying specific aspects of speech relevant for speaker identification that could be used to enhance the discriminant capabilities of machine learning models.

Conference paper

Sathyapriyan V, Pedersen MS, Ostergaard J, Brookes M, Naylor PA, Jensen J et al., 2022, A Linear MMSE Filter Using Delayed Remote Microphone Signals for Speech Enhancement in Hearing Aid Applications

Existing methods that use remote microphones with hearing aid (HA) noise reduction systems, assume the wireless transmission to be instantaneous. In practice, however, there exists a time difference of arrival (TDOA) between the wirelessly transmitted target signals and the acoustic signal arriving from the target source to the HA device, which degrades their noise reduction performance. As speech is correlated between consecutive time-frames in the short-time Fourier transform (STFT) domain, we propose a linear minimum mean-square error (MMSE) estimator to estimate the desired signal, by combining multiple HA microphone signals with multiple consecutive time-frames of the remote microphone signal. We derive closed form expressions for the resulting filter weights and interpret them in terms of existing multi-channel and multi-frame methods. The simulation results validate the interpretation and show that using a multi-frame method along with a multi-channel method is an advantage, in the presence of unknown, positive TDOA between the microphone signals.

Conference paper
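The linear MMSE structure described above takes the familiar closed form w = R_y⁻¹ r_yd once the observation vector stacks the HA microphone frames and the delayed remote-microphone frames. A toy scalar sketch, purely illustrative of that closed form:

```python
import numpy as np

def lmmse_weights(R_y, r_yd):
    """Linear MMSE estimator weights: w = R_y^{-1} r_yd, where R_y is the
    covariance of the stacked observations and r_yd their cross-correlation
    with the desired signal."""
    return np.linalg.solve(R_y, r_yd)

# Toy scalar case: y = d + n, with unit-power signal and unit-power noise.
R_y = np.array([[2.0]])   # E[|y|^2] = 1 (signal) + 1 (noise)
r_yd = np.array([1.0])    # E[y d*]  = 1
w = lmmse_weights(R_y, r_yd)
assert np.allclose(w, [0.5])  # the classic Wiener gain 1 / (1 + 1)
```

In the multi-channel, multi-frame setting of the paper, R_y and r_yd become block matrices over channels and consecutive STFT frames, but the solve has the same shape.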

Grinstein E, Naylor PA, 2022, Deep Complex-Valued Convolutional-Recurrent Networks for Single Source DOA Estimation

Despite having conceptual and practical advantages, Complex-Valued Neural Networks (CVNNs) have been much less explored for audio signal processing tasks than their real-valued counterparts. We investigate the use of a complex-valued Convolutional Recurrent Neural Network (CRNN) for Direction-of-Arrival (DOA) estimation of a single sound source in an enclosed room. By training and testing our model with recordings from the DCASE 2019 dataset, we show our architecture compares favourably to a real-valued CRNN counterpart both in terms of estimation error and speed of convergence. We also show visualizations of the complex-valued feature representations learned by our method and provide interpretations for them.

Conference paper


Guiraud P, Moore AH, Vos RR, Naylor PA, Brookes M et al., 2022, Machine Learning for Parameter Estimation in the MBSTOI Binaural Intelligibility Metric

Speech intelligibility metrics are widely used as a replacement for lengthy, in-person intelligibility tests. The short time objective intelligibility (STOI) metric and the more recent modified binaural STOI (MBSTOI) have proven to be reliable in many situations. At the same time, the recent wider accessibility of machine learning (ML) and deep learning (DL) models has led to the creation of many ML-based intelligibility metrics hoping to further improve such metrics. The deep equalisation cancellation MBSTOI model (DEC-MBSTOI) is here presented as a first step toward a hybrid format. The lengthy equalisation cancellation (EC) stage of MBSTOI is replaced by a DL model and its performance assessed in terms of a sensitivity analysis performed on the EC stage of MBSTOI. ML is here used to solve an arguably simple problem for accurate metric reproduction and good performance is observed.

Conference paper

Guiraud P, Hafezi S, Naylor PA, Moore AH, Donley J, Tourbabin V, Lunner T et al., 2022, An Introduction to the Speech Enhancement for Augmented Reality (Spear) Challenge

It is well known that microphone arrays can be used to enhance a target speaker in a noisy, reverberant environment, with both spatial (e.g. beamforming) and statistical (e.g. source separation) methods proving effective. Head-worn microphone arrays inherently sample a sound field from an egocentric perspective: when the head moves, the apparent direction of even static sound sources changes with respect to the array. Traditionally, enhancement algorithms have aimed at being robust to head motion, but hearable devices and augmented reality (AR) headsets/glasses contain additional sensors which offer the potential to adapt to, or even exploit, head motion. The recently released EasyCom database contains microphone array recordings of group conversations made in a realistic restaurant-like acoustic scene. In addition to egocentric recordings made with AR glasses, extensive metadata, including the position and orientation of speakers, is provided. This paper describes the use and adaptation of EasyCom for a new IEEE SPS Data Challenge.

Conference paper

Nespoli F, Barreda D, Naylor PA, 2022, Relative Acoustic Features for Distance Estimation in Smart-Homes, Pages: 724-728, ISSN: 2308-457X

Any audio recording encapsulates the unique fingerprint of the associated acoustic environment, namely the background noise and reverberation. Considering the scenario of a room equipped with a fixed smart speaker device with one or more microphones and a wearable smart device (watch, glasses or smartphone), we employed the improved proportionate normalized least mean square adaptive filter to estimate the relative room impulse response mapping the audio recordings of the two devices. We performed inter-device distance estimation by exploiting a new set of features obtained extending the definition of some acoustic attributes of the room impulse response to its relative version. In combination with the sparseness measure of the estimated relative room impulse response, the relative features allow precise inter-device distance estimation which can be exploited for tasks such as best microphone selection or acoustic scene analysis. Experimental results from simulated rooms of different dimensions and reverberation times demonstrate the effectiveness of this computationally lightweight approach for smart home acoustic ranging applications.

Conference paper

Li G, Sharma D, Naylor PA, 2022, Non-Intrusive Signal Analysis for Room Adaptation of ASR Models, Pages: 130-134, ISSN: 2219-5491

We present a new deep-learning-based non-intrusive signal assessment method (NISA+) that performs a joint estimation of a large set of speech signal parameters, including those related to reverberation (C50, DRR, reflection coefficient and room volume), background noise (SNR), perceptual speech quality (PESQ), speech intelligibility (ESTOI), voice activity detection, and speech coding (codec presence and bitrate). We show that neural-embedding-based combination of spectral features with an LSTM and modulation features with a convolutional neural network enables NISA+ to achieve state-of-the-art performance. In particular, for non-intrusive PESQ and C50 estimation, we show around 15% relative reduction in estimation error compared to our previous best results. We also show that NISA+ can be used to perform targeted data augmentation for generating ASR training data that matches the signal characteristics extracted from a small sample of data recorded in a target room acoustic environment. We show that a 9.6% word error rate reduction can be achieved relative to an ASR model trained with random augmentation.

Conference paper

Jones DT, Sharma D, Kruchinin SY, Naylor PA et al., 2022, Microphone Array Coding Preserving Spatial Information for Cloud-based Multichannel Speech Recognition, Pages: 324-328, ISSN: 2219-5491

An efficient method of coding multichannel signals from a microphone array is presented. This is advantageous for cloud-based audio processing, such as Direction-of-Arrival (DOA) and Automatic Speech Recognition (ASR). The method operates by encoding separately the signal information - using a reference signal - and the spatial information - using Relative Transfer Functions (RTFs). Results for ASR and DOA performance are presented for the proposed codec in comparison to a baseline multichannel implementation of the Opus codec. Both stationary and time-varying acoustic scenarios have been included in the tests. The proposed RTF-based codec is shown in our experiments to preserve spatial information in the array signals whereas the baseline codec does not. The proposed codec is also shown to outperform the baseline on the ASR task at low bit rates in the region of 6 kbits per second per channel.

Conference paper
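The encode/decode split described above — one reference signal plus relative transfer functions (RTFs) — can be illustrated in the STFT domain, where the RTF of channel m at each bin is X_m / X_ref. A minimal sketch under that idealisation (noise-free, nonzero reference bins; not the proposed codec itself, which also quantises both streams):

```python
import numpy as np

def encode(stft_frame, ref=0):
    """Encode a multichannel STFT frame as (reference channel, RTFs),
    where the RTF of channel m at each bin is X_m / X_ref."""
    X_ref = stft_frame[ref]
    return X_ref, stft_frame / X_ref

def decode(X_ref, rtfs):
    """Reconstruct all channels from the reference signal and the RTFs."""
    return rtfs * X_ref

# 2 channels x 2 frequency bins of (noise-free) complex STFT values.
X = np.array([[1 + 1j, 2 + 0j],
              [0.5 + 0.5j, 1 + 1j]])
X_ref, rtfs = encode(X)
X_hat = decode(X_ref, rtfs)
assert np.allclose(X_hat, X)   # lossless in this idealised setting
```

Because the RTFs vary slowly compared with the signal itself, they can be coded at a much lower rate than a second full channel, which is what makes the split attractive for cloud ASR.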

Sharma D, Gong R, Fosburgh J, Kruchinin SY, Naylor PA, Milanović L et al., 2022, Spatial processing front-end for distant ASR exploiting self-attention channel combinator, Pages: 7997-8001, ISSN: 1520-6149

We present a novel multi-channel front-end based on channel shortening with the Weighted Prediction Error (WPE) method followed by a fixed MVDR beamformer used in combination with a recently proposed self-attention-based channel combination (SACC) scheme, for tackling the distant ASR problem. We show that the proposed system used as part of a ContextNet based end-to-end (E2E) ASR system outperforms leading ASR systems as demonstrated by a 21.6% reduction in relative WER on a multi-channel LibriSpeech playback dataset. We also show how dereverberation prior to beamforming is beneficial and compare the WPE method with a modified neural channel shortening approach. An analysis of the non-intrusive estimate of the signal C50 confirms that the 8 channel WPE method provides significant dereverberation of the signals (13.6 dB improvement). We also show how the weights of the SACC system allow the extraction of accurate spatial information which can be beneficial for other speech processing applications like diarization.

Conference paper

Cakmak B, Dietzen T, Ali R, Naylor P, Waterschoot TV et al., 2022, A Distributed Steered Response Power Approach to Source Localization in Wireless Acoustic Sensor Networks

In wireless acoustic sensor networks (WASNs), the conventional steered response power (SRP) approach to source localization requires each node to transmit its microphone signal to a fusion center. As an alternative, this paper proposes two different fusion strategies for local, single-node SRP maps computed using only the microphone pairs within a node. In the first fusion strategy, we sum all single-node SRP maps in a fusion center, requiring less communication than the conventional SRP approach because the single-node SRP maps typically have fewer parameters than the raw microphone signals. In the second fusion strategy, the single-node SRP maps are distributively averaged without using a fusion center, requiring communication amongst connected nodes only. Simulations show that we achieve a good trade-off between communication load and localization performance.

Conference paper
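The two fusion strategies above can be sketched once each node has computed a local SRP map over a shared grid of candidate source locations. The following toy example is illustrative only (a ring topology with simple neighbour averaging stands in for the paper's distributed averaging scheme):

```python
import numpy as np

def fuse_centralised(srp_maps):
    """Strategy 1: sum all single-node SRP maps at a fusion centre."""
    return np.sum(srp_maps, axis=0)

def fuse_distributed(srp_maps, n_iters=50):
    """Strategy 2: distributed averaging without a fusion centre.
    Here each node repeatedly averages its map with its two ring
    neighbours; all nodes converge toward the network-wide mean map."""
    maps = np.array(srp_maps, dtype=float)
    for _ in range(n_iters):
        maps = (np.roll(maps, 1, axis=0) + maps + np.roll(maps, -1, axis=0)) / 3.0
    return maps

# Three nodes, two candidate source locations on the shared grid.
maps = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([2.0, 1.0])]
central = fuse_centralised(maps)   # summed map: [3.0, 2.0]
avg = fuse_distributed(maps)       # every node converges to [1.0, 0.667]
```

In both cases the source estimate is the grid point maximising the fused map; only SRP maps, never raw microphone signals, cross the network.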

Green T, Hilkhuysen G, Huckvale M, Rosen S, Brookes M, Moore A, Naylor P, Lightburn L, Xue W et al., 2022, Speech recognition with a hearing-aid processing scheme combining beamforming with mask-informed speech enhancement, Trends in Hearing, Vol: 26, Pages: 1-16, ISSN: 2331-2165

A signal processing approach combining beamforming with mask-informed speech enhancement was assessed by measuring sentence recognition in listeners with mild-to-moderate hearing impairment in adverse listening conditions that simulated the output of behind-the-ear hearing aids in a noisy classroom. Two types of beamforming were compared: binaural, with the two microphones of each aid treated as a single array, and bilateral, where independent left and right beamformers were derived. Binaural beamforming produces a narrower beam, maximising improvement in signal-to-noise ratio (SNR), but eliminates the spatial diversity that is preserved in bilateral beamforming. Each beamformer type was optimised for the true target position and implemented with and without additional speech enhancement in which spectral features extracted from the beamformer output were passed to a deep neural network trained to identify time-frequency regions dominated by target speech. Additional conditions comprising binaural beamforming combined with speech enhancement implemented using Wiener filtering or modulation-domain Kalman filtering were tested in normally-hearing (NH) listeners. Both beamformer types gave substantial improvements relative to no processing, with significantly greater benefit for binaural beamforming. Performance with additional mask-informed enhancement was poorer than with beamforming alone, for both beamformer types and both listener groups. In NH listeners the addition of mask-informed enhancement produced significantly poorer performance than both other forms of enhancement, neither of which differed from the beamformer alone. In summary, the additional improvement in SNR provided by binaural beamforming appeared to outweigh loss of spatial information, while speech understanding was not further improved by the mask-informed enhancement method implemented here.

Journal article

Neo V, Evers C, Naylor P, 2021, Polynomial matrix eigenvalue decomposition-based source separation using informed spherical microphone arrays, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Publisher: IEEE, Pages: 201-205

Audio source separation is essential for many applications such as hearing aids, telecommunications, and robot audition. Subspace decomposition approaches using polynomial matrix eigenvalue decomposition (PEVD) algorithms applied to the microphone signals, or lower-dimension eigenbeams for spherical microphone arrays, are effective for speech enhancement. In this work, we extend the work from speech enhancement and propose a PEVD subspace algorithm that uses eigenbeams for source separation. The proposed PEVD-based source separation approach performs comparably with state-of-the-art algorithms, such as those based on independent component analysis (ICA) and multi-channel non-negative matrix factorization (MNMF). Informal listening examples also indicate that our method does not introduce any audible artifacts.

Conference paper

Hogg A, Neo V, Weiss S, Evers C, Naylor P et al., 2021, A polynomial eigenvalue decomposition MUSIC approach for broadband sound source localization, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Publisher: IEEE, Pages: 326-330

Direction of arrival (DoA) estimation for sound source localization is increasingly prevalent in modern devices. In this paper, we explore a polynomial extension to the multiple signal classification (MUSIC) algorithm, spatio-spectral polynomial (SSP)-MUSIC, and evaluate its performance when using speech sound sources. In addition, we also propose three essential enhancements for SSP-MUSIC to work with noisy reverberant audio data. This paper includes an analysis of SSP-MUSIC using speech signals in a simulated room for different noise and reverberation conditions and the first task of the LOCATA challenge. We show that SSP-MUSIC is more robust to noise and reverberation compared to independent frequency bin (IFB) approaches and improvements can be seen for single sound source localization at signal-to-noise ratios (SNRs) below 5 dB and reverberation times (T60s) larger than 0.7 s.

Conference paper

D'Olne E, Moore A, Naylor P, 2021, Model-based beamforming for wearable microphone arrays, European Signal Processing Conference (EUSIPCO), Publisher: IEEE, Pages: 1105-1109

Beamforming techniques for hearing aid applications are often evaluated using behind-the-ear (BTE) devices. However, the growing number of wearable devices with microphones has made it possible to consider new geometries for microphone array beamforming. In this paper, we examine the effect of array location and geometry on the performance of binaural minimum power distortionless response (BMPDR) beamformers. In addition to the classical adaptive BMPDR, we evaluate the benefit of a recently-proposed method that estimates the sample covariance matrix using a compact model. Simulation results show that using a chest-mounted array reduces noise by an additional 1.3 dB compared to BTE hearing aids. The compact model method is found to yield higher predicted intelligibility than adaptive BMPDR beamforming, regardless of the array geometry.

Conference paper

Neo V, Evers C, Naylor P, 2021, Enhancement of noisy reverberant speech using polynomial matrix eigenvalue decomposition, IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol: 29, Pages: 3255-3266, ISSN: 2329-9290

Speech enhancement is important for applications such as telecommunications, hearing aids, automatic speech recognition and voice-controlled systems. Enhancement algorithms aim to reduce interfering noise and reverberation while minimizing any speech distortion. In this work for speech enhancement, we propose to use polynomial matrices to model the spatial, spectral and temporal correlations between the speech signals received by a microphone array and polynomial matrix eigenvalue decomposition (PEVD) to decorrelate in space, time and frequency simultaneously. We then propose a blind and unsupervised PEVD-based speech enhancement algorithm. Simulations and informal listening examples involving diverse reverberant and noisy environments have shown that our method can jointly suppress noise and reverberation, thereby achieving speech enhancement without introducing processing artefacts into the enhanced signal.

Journal article

Martinez-Colon A, Viciana-Abad R, Perez-Lorenzo JM, Evers C, Naylor PA et al., 2021, An audio enhancement system to improve intelligibility for social-awareness in HRI, Multimedia Tools and Applications, Vol: 81, Pages: 3327-3350, ISSN: 1380-7501

Journal article

Jones DT, Sharma D, Kruchinin SY, Naylor P et al., 2021, Spatial Coding for Microphone Arrays using IPNLMS-Based RTF Estimation, 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)

Conference paper

Moore A, Vos R, Naylor P, Brookes D et al., 2021, Processing pipelines for efficient, physically-accurate simulation of microphone array signals in dynamic sound scenes, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, Publisher: IEEE, ISSN: 0736-7791

Multichannel acoustic signal processing is predicated on the fact that the inter-channel relationships between the received signals can be exploited to infer information about the acoustic scene. Recently there has been increasing interest in algorithms which are applicable in dynamic scenes, where the source(s) and/or microphone array may be moving. Simulating such scenes has particular challenges which are exacerbated when real-time, listener-in-the-loop evaluation of algorithms is required. This paper considers candidate pipelines for simulating the array response to a set of point/image sources in terms of their accuracy, scalability and continuity. A new approach, in which the filter kernels are obtained using principal component analysis from time-aligned impulse responses, is proposed. When the number of filter kernels is ≤ 40, the new approach achieves more accurate simulation than competing methods.

Conference paper
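The kernel-reduction idea described above — principal component analysis over time-aligned impulse responses — can be sketched with a plain SVD. The data here is synthetic and the variable names illustrative; the paper's pipeline additionally handles time alignment and real-time interpolation:

```python
import numpy as np

# Synthetic stand-ins for time-aligned room impulse responses (one per row).
rng = np.random.default_rng(0)
n_irs, ir_len, n_kernels = 100, 256, 40
irs = rng.standard_normal((n_irs, ir_len))

# PCA via SVD of the mean-centred IR matrix.
mean_ir = irs.mean(axis=0)
U, s, Vt = np.linalg.svd(irs - mean_ir, full_matrices=False)

kernels = Vt[:n_kernels]                  # fixed filter kernels (principal axes)
coeffs = (irs - mean_ir) @ kernels.T      # per-IR mixing weights
approx = mean_ir + coeffs @ kernels       # best rank-40 reconstruction

# The truncated basis must beat the rank-0 (mean-only) approximation.
assert np.linalg.norm(irs - approx) < np.linalg.norm(irs - mean_ir)
```

At run time, only the fixed kernels are convolved with the source signal, and moving a source becomes a matter of interpolating the low-dimensional mixing weights rather than swapping full-length filters.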

Hogg A, Naylor P, Evers C, 2021, Multichannel overlapping speaker segmentation using multiple hypothesis tracking of acoustic and spatial features, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Publisher: IEEE, Pages: 26-30

An essential part of any diarization system is the task of speaker segmentation which is important for many applications including speaker indexing and automatic speech recognition (ASR) in multi-speaker environments. Segmentation of overlapping speech has recently been a key focus of this work. In this paper we explore the use of a new multimodal approach for overlapping speaker segmentation that tracks both the fundamental frequency (F0) of the speaker and the speaker’s direction of arrival (DOA) simultaneously. Our proposed multiple hypothesis tracking system, which simultaneously tracks both features, shows an improvement in segmentation performance when compared to tracking these features separately. An illustrative example of overlapping speech demonstrates the effectiveness of our proposed system. We also undertake a statistical analysis on 12 meetings from the AMI corpus and show an improvement in the HIT rate of 14.1% on average against a commonly used deep learning bidirectional long short term memory network (BLSTM) approach.

Conference paper

Neo VW, Evers C, Naylor PA, 2021, Polynomial matrix eigenvalue decomposition of spherical harmonics for speech enhancement, IEEE International Conference on Acoustics, Speech and Signal Processing, Publisher: IEEE, Pages: 786-790

Speech enhancement algorithms using polynomial matrix eigenvalue decomposition (PEVD) have been shown to be effective for noisy and reverberant speech. However, these algorithms do not scale well in complexity with the number of channels used in the processing. For a spherical microphone array sampling an order-limited sound field, the spherical harmonics provide a compact representation of the microphone signals in the form of eigenbeams. We propose a PEVD algorithm that uses only the lower-dimension eigenbeams for speech enhancement at a significantly lower computation cost. The proposed algorithm is shown to significantly reduce complexity while maintaining full performance. Informal listening examples have also indicated that the processing does not introduce any noticeable artefacts.

Conference paper

This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.
