366 results found
Jones DT, Sharma D, Kruchinin SY, et al., 2021, Spatial Coding for Microphone Arrays using IPNLMS-Based RTF Estimation, 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
Neo V, Evers C, Naylor P, 2021, Polynomial matrix eigenvalue decomposition-based source separation using informed spherical microphone arrays, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Publisher: IEEE
Audio source separation is essential for many applications such as hearing aids, telecommunications, and robot audition. Subspace decomposition approaches using polynomial matrix eigenvalue decomposition (PEVD) algorithms applied to the microphone signals, or lower-dimension eigenbeams for spherical microphone arrays, are effective for speech enhancement. In this work, we extend the work from speech enhancement and propose a PEVD subspace algorithm that uses eigenbeams for source separation. The proposed PEVD-based source separation approach performs comparably with state-of-the-art algorithms, such as those based on independent component analysis (ICA) and multi-channel non-negative matrix factorization (MNMF). Informal listening examples also indicate that our method does not introduce any audible artifacts.
Hogg A, Neo V, Weiss S, et al., 2021, A polynomial eigenvalue decomposition MUSIC approach for broadband sound source localization, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Publisher: IEEE
Direction of arrival (DoA) estimation for sound source localization is increasingly prevalent in modern devices. In this paper, we explore a polynomial extension to the multiple signal classification (MUSIC) algorithm, spatio-spectral polynomial (SSP)-MUSIC, and evaluate its performance when using speech sound sources. In addition, we also propose three essential enhancements for SSP-MUSIC to work with noisy reverberant audio data. This paper includes an analysis of SSP-MUSIC using speech signals in a simulated room for different noise and reverberation conditions and the first task of the LOCATA challenge. We show that SSP-MUSIC is more robust to noise and reverberation compared to independent frequency bin (IFB) approaches and improvements can be seen for single sound source localization at signal-to-noise ratios (SNRs) below 5 dB and reverberation times (T60s) larger than 0.7 s.
Moore A, Vos R, Naylor P, et al., 2021, Processing pipelines for efficient, physically-accurate simulation of microphone array signals in dynamic sound scenes, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, Publisher: IEEE, ISSN: 0736-7791
Multichannel acoustic signal processing is predicated on the fact that the inter channel relationships between the received signals can be exploited to infer information about the acoustic scene. Recently there has been increasing interest in algorithms which are applicable in dynamic scenes, where the source(s) and/or microphone array may be moving. Simulating such scenes has particular challenges which are exacerbated when real-time, listener-in-the-loop evaluation of algorithms is required. This paper considers candidate pipelines for simulating the array response to a set of point/image sources in terms of their accuracy, scalability and continuity. Anew approach, in which the filter kernels are obtained using principal component analysis from time-aligned impulse responses, is proposed. When the number of filter kernels is≤40the new approach achieves more accurate simulation than competing methods.
D'Olne E, Moore A, Naylor P, 2021, Model-based beamforming for wearable microphone arrays, European Signal Processing Conference (EUSIPCO), Publisher: IEEE
Beamforming techniques for hearing aid applications are often evaluated using behind-the-ear (BTE) devices. However, the growing number of wearable devices with microphones has made it possible to consider new geometries for microphone array beamforming. In this paper, we examine the effect of array location and geometry on the performance of binaural minimum power distortionless response (BMPDR) beamformers. In addition to the classical adaptive BMPDR, we evaluate the benefit of a recently-proposed method that estimates the sample covariance matrix using a compact model. Simulation results show that using a chest-mounted array reduces noise by an additional 1.3~dB compared to BTE hearing aids. The compact model method is found to yield higher predicted intelligibility than adaptive BMPDR beamforming, regardless of the array geometry.
Hogg A, Evers C, Moore A, et al., 2021, Overlapping speaker segmentation using multiple hypothesis tracking of fundamental frequency, IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol: 29, Pages: 1479-1490, ISSN: 2329-9290
This paper demonstrates how the harmonic structure of voiced speech can be exploited to segment multiple overlapping speakers in a speaker diarization task. We explore how a change in the speaker can be inferred from a change in pitch. We show that voiced harmonics can be useful in detecting when more than one speaker is talking, such as during overlapping speaker activity. A novel system is proposed to track multiple harmonics simultaneously, allowing for the determination of onsets and end-points of a speaker’s utterance in the presence of an additional active speaker. This system is bench-marked against a segmentation system from the literature that employs a bidirectional long short term memory network (BLSTM) approach and requires training. Experimental results highlight that the proposed approach outperforms the BLSTM baseline approach by 12.9% in terms of HIT rate for speaker segmentation. We also show that the estimated pitch tracks of our system can be used as features to the BLSTM to achieve further improvements of 1.21% in terms of coverage and 2.45% in terms of purity
Hafezi S, Moore A, Naylor P, 2021, Narrowband multi-source Direction-of-Arrival estimation in the spherical harmonic domain, Journal of the Acoustical Society of America, Vol: 149, ISSN: 0001-4966
A conventional approach to wideband multi-source (MS) direction-of-arrival (DOA) estimation is to perform single source (SS) DOA estimation in time-frequency (TF) bins for which a SS assumption is valid. Such methods use the W-disjoint orthogonality (WDO) assumption due to the speech sparseness. As the number of sources increases, the chance of violating the WDO assumption increases. As shown in the challenging scenarios with multiple simultaneously active sources over a short period of time masking each other, it is possible for a strongly masked source (due to inconsistency of activity or quietness) to be rarely dominant in a TF bin. SS-based DOA estimators fail in the detection or accurate localization of masked sources in such scenarios. Two analytical approaches are proposed for narrowband DOA estimation based on the MS assumption in a bin in the spherical harmonic domain. In the first approach, eigenvalue decomposition is used to decompose a MS scenario into multiple SS scenarios, and a SS-based analytical DOA estimation is performed on each. The second approach analytically estimates two DOAs per bin assuming the presence of two active sources per bin. The evaluation validates the improvement to double accuracy and robustness to sensor noise compared to the baseline methods.
Yiallourides C, Naylor PA, 2021, Time-frequency analysis and parameterisation of knee sounds fornon-invasive setection of osteoarthritis, IEEE Transactions on Biomedical Engineering, Vol: 68, Pages: 1250-1261, ISSN: 0018-9294
Objective: In this work the potential of non-invasive detection of kneeosteoarthritis is investigated using the sounds generated by the knee jointduring walking. Methods: The information contained in the time-frequency domainof these signals and its compressed representations is exploited and theirdiscriminant properties are studied. Their efficacy for the task of normal vsabnormal signal classification is evaluated using a comprehensive experimentalframework. Based on this, the impact of the feature extraction parameters onthe classification performance is investigated using Classification andRegression Trees (CART), Linear Discriminant Analysis (LDA) and Support VectorMachine (SVM) classifiers. Results: It is shown that classification issuccessful with an area under the Receiver Operating Characteristic (ROC) curveof 0.92. Conclusion: The analysis indicates improvements in classificationperformance when using non-uniform frequency scaling and identifies specificfrequency bands that contain discriminative features. Significance: Contrary toother studies that focus on sit-to-stand movements and knee flexion/extension,this study used knee sounds obtained during walking. The analysis of suchsignals leads to non-invasive detection of knee osteoarthritis with highaccuracy and could potentially extend the range of available tools for theassessment of the disease as a more practical and cost effective method withoutrequiring clinical setups.
Hogg A, Naylor P, Evers C, 2021, Multichannel overlapping speaker segmentation using multiple hypothesis tracking of acoustic and spatial features, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Publisher: IEEE
An essential part of any diarization system is the task of speaker segmentation which is important for many applications including speaker indexing and automatic speech recognition (ASR) in multi-speaker environments. Segmentation of overlapping speech has recently been a key focus of this work. In this paper we explore the use of a new multimodal approach for overlapping speaker segmentation that tracks both the fundamental frequency (F0) of the speaker and the speaker’s direction of arrival (DOA) simultaneously. Our proposed multiple hypothesis tracking system, which simultaneously tracks both features, shows an improvement in segmentation performance when compared to tracking these features separately. An illustrative example of overlapping speech demonstrates the effectiveness of our proposed system. We also undertake a statistical analysis on 12 meetings from the AMI corpus and show an improvement in the HIT rate of 14.1% on average against a commonly used deep learning bidirectional long short term memory network (BLSTM) approach.
Neo VW, Evers C, Naylor PA, 2021, Polynomial matrix eigenvalue decomposition of spherical harmonics for speech enhancement, IEEE International Conference on Acoustics, Speech and Signal Processing, Publisher: IEEE
Speech enhancement algorithms using polynomial matrix eigen value decomposition (PEVD) have been shown to be effective for noisy and reverberant speech. However, these algorithms do not scale well in complexity with the number of channels used in the processing. For a spherical microphone array sampling an order-limited sound field, the spherical harmonics provide a compact representation of the microphone signals in the form of eigen beams. We propose a PEVD algorithm that uses only the lower dimension eigen beams for speech enhancement at a significantly lower computation cost. The proposed algorithm is shown to significantly reduce complexity while maintaining full performance. Informal listening examples have also indicated that the processing does not introduce any noticeable artefacts.
Sharma D, Berger L, Quillen C, et al., 2021, Non-intrusive estimation of speech signal parameters using a frame-based machine learning approach, 2020 28th European Signal Processing Conference (EUSIPCO), Publisher: IEEE, Pages: 446-450
We present a novel, non-intrusive method that jointly estimates acoustic signal properties associated with the perceptual speech quality, level of reverberation and noise in a speech signal. We explore various machine learning frameworks, consisting of popular feature extraction front-ends and two types of regression models and show the trade-off in performance that must be considered with each combination. We show that a short-time framework consisting of an 80-dimension log-Mel filter bank feature front-end employing spectral augmentation, followed by a 3 layer LSTM recurrent neural network model achieves a mean absolute error of 3.3 dB for C50, 2.3 dB for segmental SNR and 0.3 for PESQ estimation on the Libri Augmented (LA) database. The internal VAD for this system achieves an F1 score of 0.93 on this data. The proposed system also achieves a 2.4 dB mean absolute error for C50 estimation on the ACE test set. Furthermore, we show how each type of acoustic parameter correlates with ASR performance in terms of ground truth labels and additionally show that the estimated C50, SNR and PESQ from our proposed method have a high correlation (greater than 0.92) with WER on the LA test set.
Felsheim RC, Brendel A, Naylor PA, et al., 2021, Head Orientation Estimation from Multiple Microphone Arrays, 28th European Signal Processing Conference (EUSIPCO), Publisher: IEEE, Pages: 491-495, ISSN: 2076-1465
McKnight SW, Hogg A, Naylor P, 2020, Analysis of phonetic dependence of segmentation errors in speaker diarization, European Signal Processing Conference (EUSIPCO), Publisher: IEEE, ISSN: 2076-1465
Evaluation of speaker segmentation and diarization normally makes use of forgiveness collars around ground truth speaker segment boundaries such that estimated speaker segment boundaries with such collars are considered completely correct. This paper shows that the popular recent approach of removing forgiveness collars from speaker diarization evaluation tools can unfairly penalize speaker diarization systems that correctly estimate speaker segment boundaries. The uncertainty in identifying the start and/or end of a particular phoneme means that the ground truth segmentation is not perfectly accurate, and even trained human listeners are unable to identify phoneme boundaries with full consistency. This research analyses the phoneme dependence of this uncertainty, and shows that it depends on (i) whether the phoneme being detected is at the start or end of an utterance and (ii) what the phoneme is, so that the use of a uniform forgiveness collar is inadequate. This analysis is expected to point the way towards more indicative and repeatable assessment of the performance of speaker diarization systems.
Neo VW, Evers C, Naylor PA, 2020, Speech dereverberation performance of a polynomial-EVD subspace approach, European Signal Processing Conference (EUSIPCO), Publisher: IEEE, ISSN: 2076-1465
The degradation of speech arising from additive background noise and reverberation affects the performance of important speech applications such as telecommunications, hearing aids, voice-controlled systems and robot audition. In this work, we focus on dereverberation. It is shown that the parameterized polynomial matrix eigenvalue decomposition (PEVD)-based speech enhancement algorithm exploits the lack of correlation between speech and the late reflections to enhance the speech component associated with the direct path and early reflections. The algorithm's performance is evaluated using simulations involving measured acoustic impulse responses and noise from the ACE corpus. The simulations and informal listening examples have indicated that the PEVD-based algorithm performs dereverberation over a range of SNRs without introducing any noticeable processing artefacts.
Xue W, Moore A, Brookes D, et al., 2020, Speech enhancement based on modulation-domain parametric multichannel Kalman filtering, IEEE Transactions on Audio, Speech and Language Processing, Vol: 29, Pages: 393-405, ISSN: 1558-7916
Recently we presented a modulation-domain multichannel Kalman filtering (MKF) algorithm for speech enhancement, which jointly exploits the inter-frame modulation-domain temporal evolution of speech and the inter-channel spatial correlation to estimate the clean speech signal. The goal of speech enhancement is to suppress noise while keeping the speech undistorted, and a key problem is to achieve the best trade-off between speech distortion and noise reduction. In this paper, we extend the MKF by presenting a modulation-domain parametric MKF (PMKF) which includes a parameter that enables flexible control of the speech enhancement behaviour in each time-frequency (TF) bin. Based on the decomposition of the MKF cost function, a new cost function for PMKF is proposed, which uses the controlling parameter to weight the noise reduction and speech distortion terms. An optimal PMKF gain is derived using a minimum mean squared error (MMSE) criterion. We analyse the performance of the proposed MKF, and show its relationship to the speech distortion weighted multichannel Wiener filter (SDW-MWF). To evaluate the impact of the controlling parameter on speech enhancement performance, we further propose PMKF speech enhancement systems in which the controlling parameter is adaptively chosen in each TF bin. Experiments on a publicly available head-related impulse response (HRIR) database in different noisy and reverberant conditions demonstrate the effectiveness of the proposed method.
Martínez-Colón A, Viciana-Abad R, Perez-Lorenzo JM, et al., 2020, Evaluation of a multi-speaker system for socially assistive HRI in real scenarios, Workshop of Physical Agents, Publisher: Springer International Publishing, Pages: 151-166, ISSN: 2194-5357
In the field of social human-robot interaction, and in particular for social assistive robotics, the capacity of recognizing the speaker’s discourse in very diverse conditions and where more than one interlocutor may be present, plays an essential role. The use of a mics. array that can be mounted in a robot supported by a voice enhancement module has been evaluated, with the goal of improving the performance of current automatic speech recognition (ASR) systems in multi-speaker conditions. An evaluation has been made of the improvement in terms of intelligibility scores that can be achieved in the operation of two off-the-shelf ASR solutions in situations that contemplate the typical scenarios where a robot with these characteristics can be found. The results have identified the conditions in which a low computational cost demand algorithm can be beneficial to improve intelligibility scores in real environments.
Papayiannis C, Evers C, Naylor P, 2020, End-to-end classification of reverberant rooms using DNNs, IEEE Transactions on Audio, Speech and Language Processing, Vol: 28, Pages: 3010-3017, ISSN: 1558-7916
Reverberation is present in our workplaces, ourhomes, concert halls and theatres. This paper investigates howdeep learning can use the effect of reverberation on speechto classify a recording in terms of the room in which it wasrecorded. Existing approaches in the literature rely on domainexpertise to manually select acoustic parameters as inputs toclassifiers. Estimation of these parameters from reverberantspeech is adversely affected by estimation errors, impacting theclassification accuracy. In order to overcome the limitations ofpreviously proposed methods, this paper shows how DNNs canperform the classification by operating directly on reverberantspeech spectra and a CRNN with an attention-mechanism isproposed for the task. The relationship is investigated betweenthe reverberant speech representations learned by the DNNs andacoustic parameters. For evaluation, AIRs are used from theACE-challenge dataset that were measured in 7 real rooms. Theclassification accuracy of the CRNN classifier in the experimentsis 78% when using 5 hours of training data and 90% when using10 hours.
Evers C, Lollmann HW, Mellmann H, et al., 2020, The LOCATA challenge: acoustic source localization and tracking, IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol: 28, Pages: 1620-1643, ISSN: 2329-9290
The ability to localize and track acoustic events is a fundamental prerequisite for equipping machines with the ability to be aware of and engage with humans in their surrounding environment. However, in realistic scenarios, audio signals are adversely affected by reverberation, noise, interference, and periods of speech inactivity. In dynamic scenarios, where the sources and microphone platforms may be moving, the signals are additionally affected by variations in the source-sensor geometries. In practice, approaches to sound source localization and tracking are often impeded by missing estimates of active sources, estimation errors, as well as false estimates. The aim of the LOCAlization and TrAcking (LOCATA) Challenge is an openaccess framework for the objective evaluation and benchmarking of broad classes of algorithms for sound source localization and tracking. This paper provides a review of relevant localization and tracking algorithms and, within the context of the existing literature, a detailed evaluation and dissemination of the LOCATA submissions. The evaluation highlights achievements in the field, open challenges, and identifies potential future directions.Index Terms—Acoustic signal processing, Source localization, Source tracking, Reverberation.
Neo VW, Evers C, Naylor PA, 2020, PEVD-based speech enhancement in reverberant environments, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Publisher: IEEE
The enhancement of noisy speech is important for applications involving human-to-human interactions, such as telecommunications and hearing aids, as well as human-to-machine interactions, such as voice-controlled systems and robot audition. In this work, we focus on reverberant environments. It is shown that, by exploiting the lack of correlation between speech and the late reflections, further noise reduction can be achieved. This is verified using simulations involving actual acoustic impulse responses and noise from the ACE corpus. The simulations show that even without using a noise estimator, our proposed method simultaneously achieves noise reduction, and enhancement of speech quality and intelligibility, in reverberant environments over a wide range of SNRs. Furthermore, informal listening examples highlight that our approach does not introduce any significant processing artefacts such as musical noise.
Hogg A, Evers C, Naylor P, 2019, Multiple hypothesis tracking for overlapping speaker segmentation, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Publisher: IEEE
Speaker segmentation is an essential part of any diarization system.Applications of diarization include tasks such as speaker indexing, improving automatic speech recognition (ASR) performance and making single speaker-based algorithms available for use in multi-speaker environments.This paper proposes a multiple hypothesis tracking (MHT) method that exploits the harmonic structure associated with the pitch in voiced speech in order to segment the onsets and end-points of speech from multiple, overlapping speakers. The proposed method is evaluated against a segmentation system from the literature that uses a spectral representation and is based on employing bidirectional long short term memory networks (BLSTM). The proposed method is shown to achieve comparable performance for segmenting overlapping speakers only using the pitch harmonic information in the MHT framework.
Antonello N, De Sena E, Moonen M, et al., 2019, Joint Acoustic Localization and Dereverberation Through Plane Wave Decomposition and Sparse Regularization, IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, Vol: 27, Pages: 1893-1905, ISSN: 2329-9290
Hafezi S, Moore AH, Naylor PA, 2019, Spatial consistency for multiple source direction-of-arrival estimation and source counting., Journal of the Acoustical Society of America, Vol: 146, Pages: 4592-4603, ISSN: 0001-4966
A conventional approach to wideband multi-source (MS) direction-of-arrival (DOA) estimation is to perform single source (SS) DOA estimation in time-frequency (TF) bins for which a SS assumption is valid. The typical SS-validity confidence metrics analyse the validity of the SS assumption over a fixed-size TF region local to the TF bin. The performance of such methods degrades as the number of simultaneously active sources increases due to the associated decrease in the size of the TF regions where the SS assumption is valid. A SS-validity confidence metric is proposed that exploits a dynamic MS assumption over relatively larger TF regions. The proposed metric first clusters the initial DOA estimates (one per TF bin) and then uses the members' spatial consistency as well as its cluster's spread to weight each TF bin. Distance-based and density-based clustering are employed as two alternative approaches for clustering DOAs. A noise-robust density-based clustering is also used in an evolutionary framework to propose a method for source counting and source direction estimation. The evaluation results based on simulations and also with real recordings show that the proposed weighting strategy significantly improves the accuracy of source counting and MS DOA estimation compared to the state-of-the-art.
Sharma D, Hogg A, Wang Y, et al., 2019, Non-Intrusive POLQA estimation of speech quality using recurrent neural networks, European Signal Processing Conference (EUSIPCO), Publisher: IEEE
Estimating the quality of speech without the use of a clean reference signal is a challenging problem, in part due to the time and expense required to collect sufficient training data for modern machine learning algorithms. We present a novel, non-intrusive estimator that exploits recurrent neural network architectures to predict the intrusive POLQA score of a speech signal in a short time context. The predictor is based on a novel compressed representation of modulation domain features, used in conjunction with static MFCC features. We show that the proposed method can reliably predict POLQA with a 300 ms context, achieving a mean absolute error of 0.21 on unseen data.The proposed method is trained using English speech and is shown to generalize well across unseen languages. The neural network also jointly estimates the mean voice activity detection(VAD) with an F1 accuracy score of 0.9, removing the need for an external VAD.
Neo V, Evers C, Naylor P, 2019, Speech enhancement using polynomial eigenvalue decomposition, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Publisher: IEEE
Speech enhancement is important for applications such as telecommunications, hearing aids, automatic speech recognition and voice-controlled system. The enhancement algorithms aim to reduce interfering noise while minimizing any speech distortion. In this work for speech enhancement, we propose to use polynomial matrices in order to exploit the spatial, spectral as well as temporal correlations between the speech signals received by the microphone array. Polynomial matrices provide the necessary mathematical framework in order to exploit constructively the spatial correlations within and between sensor pairs, as well as the spectral-temporal correlations of broadband signals, such as speech. Specifically, the polynomial eigenvalue decomposition (PEVD) decorrelates simultaneously in space, time and frequency. We then propose a PEVD-based speech enhancement algorithm. Simulations and informal listening examples have shown that our method achieves noise reduction without introducing artefacts into the enhanced signal for white, babble and factory noise conditions between -10 dB to 30 dB SNR.
Neo V, Naylor PA, 2019, Second order sequential best rotation algorithm with householder reduction for polynomial matrix eigenvalue decomposition, IEEE International Conference on Acoustics, Speech and Signal Processing, Publisher: IEEE, Pages: 8043-8047, ISSN: 0736-7791
The Second-order Sequential Best Rotation (SBR2) algorithm, usedfor Eigenvalue Decomposition (EVD) on para-Hermitian polynomialmatrices typically encountered in wideband signal processingapplications like multichannel Wiener filtering and channel coding,involves a series of delay and rotation operations to achieve diagonalisation.In this paper, we proposed the use of Householder transformationsto reduce polynomial matrices to tridiagonal form beforezeroing the dominant element with rotation. Similar to performingHouseholder reduction on conventional matrices, our methodenables SBR2 to converge in fewer iterations with smaller orderof polynomial matrix factors because more off-diagonal Frobeniusnorm(F-norm) could be transferred to the main diagonal at everyiteration. A reduction in the number of iterations by 12.35% and0.1% improvement in reconstruction error is achievable.
Moore AH, de Haan JM, Pedersen MS, et al., 2019, Personalized signal-independent beamforming for binaural hearing aids, Journal of the Acoustical Society of America, Vol: 145, Pages: 2971-2981, ISSN: 0001-4966
The effect of personalized microphone array calibration on the performance of hearing aid beamformers under noisy reverberant conditions is studied. The study makes use of a new, publicly available, database containing acoustic transfer function measurements from 29 loudspeakers arranged on a sphere to a pair of behind-the-ear hearing aids in a listening room when worn by 27 males, 14 females, and 4 mannequins. Bilateral and binaural beamformers are designed using each participant's hearing aid head-related impulse responses (HAHRIRs). The performance of these personalized beamformers is compared to that of mismatched beamformers, where the HAHRIR used for the design does not belong to the individual for whom performance is measured. The case where the mismatched HAHRIR is that of a mannequin is of particular interest since it represents current practice in commercially available hearing aids. The benefit of personalized beamforming is assessed using an intrusive binaural speech intelligibility metric and in a matrix speech intelligibility test. For binaural beamforming, both measures demonstrate a statistically signficant (p < 0.05) benefit of personalization. The benefit varies substantially between individuals with some predicted to benefit by as much as 1.5 dB.
Hogg A, Naylor P, Evers C, 2019, Speaker change detection using fundamental frequency with application to multi-talker segmentation, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Publisher: IEEE
This paper shows that time varying pitch properties can be used advantageously within the segmentation step of a multi-talker diarization system. First a study is conducted to verify that changes in pitch are strong indicators of changes in the speaker. It is then highlighted that an individual’s pitch is smoothly varying and, therefore, can be predicted by means of a Kalman filter. Subsequently it is shown that if the pitch is not predictable then this is most likely due to a change in the speaker. Finally, a novel system is proposed that uses this approach of pitch prediction for speaker change detection. This system is then evaluated against a commonly used MFCC segmentation system. The proposed system is shown to increase the speaker change detection rate from 43.3% to 70.5% on meetings in the AMI corpus. Therefore, there are two equally weighted contributions in this paper: 1. We address the question of whether a change in pitch is a reliable estimator of a speaker change in multi-talk meeting audio. 2. We develop a method to extract such speaker changes and test them on a widely available meeting corpus.
Moore A, Xue W, Naylor P, et al., 2019, Noise covariance matrix estimation for rotating microphone arrays, IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol: 27, Pages: 519-530, ISSN: 2329-9290
The noise covariance matrix computed between the signals from a microphone array is used in the design of spatial filters and beamformers with applications in noise suppression and dereverberation. This paper specifically addresses the problem of estimating the covariance matrix associated with a noise field when the array is rotating during desired source activity, as is common in head-mounted arrays. We propose a parametric model that leads to an analytical expression for the microphone signal covariance as a function of the array orientation and array manifold. An algorithm for estimating the model parameters during noise-only segments is proposed and the performance shown to be improved, rather than degraded, by array rotation. The stored model parameters can then be used to update the covariance matrix to account for the effects of any array rotation that occurs when the desired source is active. The proposed method is evaluated in terms of the Frobenius norm of the error in the estimated covariance matrix and of the noise reduction performance of a minimum variance distortionless response beamformer. In simulation experiments the proposed method achieves 18 dB lower error in the estimated noise covariance matrix than a conventional recursive averaging approach and results in noise reduction which is within 0.05 dB of an oracle beamformer using the ground truth noise covariance matrix.
Gannot S, Naylor PA, 2019, Highlights from the Audio and Acoustic Signal Processing Technical Committee [In the Spotlight], IEEE Signal Processing Magazine, Vol: 36, ISSN: 1053-5888
© 1991-2012 IEEE. The IEEE Audio and Acoustic Signal Processing Technical Committee (AASP TC) is one of 13 TCs in the IEEE Signal Processing Society. Its mission is to support, nourish, and lead scientific and technological development in all areas of AASP. These areas are currently seeing increased levels of interest and significant growth, providing a fertile ground for a broad range of specific and interdisciplinary research and development. Ranging from array processing for microphones and loudspeakers to music genre classification, from psychoacoustics to machine learning (ML), from consumer electronics devices to blue-sky research, this scope encompasses countless technical challenges and many hot topics. The TC has roughly 30 elected volunteer members drawn equally from leading academic and industrial organizations around the world, unified by the common aim of offering their expertise in the service of the scientific community.
Brookes D, Lightburn L, Moore A, et al., 2019, Mask-assisted speech enhancement for binaural hearing aids, ELOBES2019
This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.