Search or filter publications

Filter by type:

Filter by publication type

Filter by year:

to

Results

  • Showing results for:
  • Reset all filters

Search results

  • Journal article
    Hogg A, Evers C, Moore A, Naylor Pet al., 2021,

    Overlapping speaker segmentation using multiple hypothesis tracking of fundamental frequency

    , IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol: 29, Pages: 1479-1490, ISSN: 2329-9290

    This paper demonstrates how the harmonic structure of voiced speech can be exploited to segment multiple overlapping speakers in a speaker diarization task. We explore how a change in the speaker can be inferred from a change in pitch. We show that voiced harmonics can be useful in detecting when more than one speaker is talking, such as during overlapping speaker activity. A novel system is proposed to track multiple harmonics simultaneously, allowing for the determination of onsets and end-points of a speaker’s utterance in the presence of an additional active speaker. This system is bench-marked against a segmentation system from the literature that employs a bidirectional long short term memory network (BLSTM) approach and requires training. Experimental results highlight that the proposed approach outperforms the BLSTM baseline approach by 12.9% in terms of HIT rate for speaker segmentation. We also show that the estimated pitch tracks of our system can be used as features to the BLSTM to achieve further improvements of 1.21% in terms of coverage and 2.45% in terms of purity

  • Conference paper
    Neo VW, Evers C, Naylor PA, 2021,

    Polynomial matrix eigenvalue decomposition of spherical harmonics for speech enhancement

    , IEEE International Conference on Acoustics, Speech and Signal Processing, Publisher: IEEE

    Speech enhancement algorithms using polynomial matrix eigen value decomposition (PEVD) have been shown to be effective for noisy and reverberant speech. However, these algorithms do not scale well in complexity with the number of channels used in the processing. For a spherical microphone array sampling an order-limited sound field, the spherical harmonics provide a compact representation of the microphone signals in the form of eigen beams. We propose a PEVD algorithm that uses only the lower dimension eigen beams for speech enhancement at a significantly lower computation cost. The proposed algorithm is shown to significantly reduce complexity while maintaining full performance. Informal listening examples have also indicated that the processing does not introduce any noticeable artefacts.

  • Conference paper
    Hogg A, Naylor P, Evers C, 2021,

    Multichannel overlapping speaker segmentation using multiple hypothesis tracking of acoustic and spatial features

    , IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Publisher: IEEE

    An essential part of any diarization system is the task of speaker segmentation which is important for many applications including speaker indexing and automatic speech recognition (ASR) in multi-speaker environments. Segmentation of overlapping speech has recently been a key focus of this work. In this paper we explore the use of a new multimodal approach for overlapping speaker segmentation that tracks both the fundamental frequency (F0) of the speaker and the speaker’s direction of arrival (DOA) simultaneously. Our proposed multiple hypothesis tracking system, which simultaneously tracks both features, shows an improvement in segmentation performance when compared to tracking these features separately. An illustrative example of overlapping speech demonstrates the effectiveness of our proposed system. We also undertake a statistical analysis on 12 meetings from the AMI corpus and show an improvement in the HIT rate of 14.1% on average against a commonly used deep learning bidirectional long short term memory network (BLSTM) approach.

  • Conference paper
    McKnight SW, Hogg A, Naylor P, 2020,

    Analysis of phonetic dependence of segmentation errors in speaker diarization

    , European Signal Processing Conference (EUSIPCO), Publisher: IEEE

    Evaluation of speaker segmentation and diarization normally makes use of forgiveness collars around ground truth speaker segment boundaries such that estimated speaker segment boundaries with such collars are considered completely correct. This paper shows that the popular recent approach of removing forgiveness collars from speaker diarization evaluation tools can unfairly penalize speaker diarization systems that correctly estimate speaker segment boundaries. The uncertainty in identifying the start and/or end of a particular phoneme means that the ground truth segmentation is not perfectly accurate, and even trained human listeners are unable to identify phoneme boundaries with full consistency. This research analyses the phoneme dependence of this uncertainty, and shows that it depends on (i) whether the phoneme being detected is at the start or end of an utterance and (ii) what the phoneme is, so that the use of a uniform forgiveness collar is inadequate. This analysis is expected to point the way towards more indicative and repeatable assessment of the performance of speaker diarization systems.

  • Conference paper
    Neo VW, Evers C, Naylor PA, 2020,

    Speech dereverberation performance of a polynomial-EVD subspace approach

    , European Signal Processing Conference (EUSIPCO), Publisher: IEEE

    The degradation of speech arising from additive background noise and reverberation affects the performance of important speech applications such as telecommunications, hearing aids, voice-controlled systems and robot audition. In this work, we focus on dereverberation. It is shown that the parameterized polynomial matrix eigenvalue decomposition (PEVD)-based speech enhancement algorithm exploits the lack of correlation between speech and the late reflections to enhance the speech component associated with the direct path and early reflections. The algorithm's performance is evaluated using simulations involving measured acoustic impulse responses and noise from the ACE corpus. The simulations and informal listening examples have indicated that the PEVD-based algorithm performs dereverberation over a range of SNRs without introducing any noticeable processing artefacts.

  • Conference paper
    Neo VW, Evers C, Naylor PA, 2020,

    PEVD-based speech enhancement in reverberant environments

    , IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Publisher: IEEE

    The enhancement of noisy speech is important for applications involving human-to-human interactions, such as telecommunications and hearing aids, as well as human-to-machine interactions, such as voice-controlled systems and robot audition. In this work, we focus on reverberant environments. It is shown that, by exploiting the lack of correlation between speech and the late reflections, further noise reduction can be achieved. This is verified using simulations involving actual acoustic impulse responses and noise from the ACE corpus. The simulations show that even without using a noise estimator, our proposed method simultaneously achieves noise reduction, and enhancement of speech quality and intelligibility, in reverberant environments over a wide range of SNRs. Furthermore, informal listening examples highlight that our approach does not introduce any significant processing artefacts such as musical noise.

  • Journal article
    Steadman M, Kim C, Lestang J-H, Goodman D, Picinali Let al., 2019,

    Short-term effects of sound localization training in virtual reality

    , Scientific Reports, Vol: 9, ISSN: 2045-2322

    Head-related transfer functions (HRTFs) capture the direction-dependant way that sound interacts with the head and torso. In virtual audio systems, which aim to emulate these effects, non-individualized, generic HRTFs are typically used leading to an inaccurate perception of virtual sound location. Training has the potential to exploit the brain’s ability to adapt to these unfamiliar cues. In this study, three virtual sound localization training paradigms were evaluated; one provided simple visual positional confirmation of sound source location, a second introduced game design elements (“gamification”) and a final version additionally utilized head-tracking to provide listeners with experience of relative sound source motion (“active listening”). The results demonstrate a significant effect of training after a small number of short (12-minute) training sessions, which is retained across multiple days. Gamification alone had no significant effect on the efficacy of the training, but active listening resulted in a significantly greater improvements in localization accuracy. In general, improvements in virtual sound localization following training generalized to a second set of non-individualized HRTFs, although some HRTF-specific changes were observed in polar angle judgement for the active listening group. The implications of this on the putative mechanisms of the adaptation process are discussed.

  • Conference paper
    Neo V, Evers C, Naylor P, 2019,

    Speech enhancement using polynomial eigenvalue decomposition

    , IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Publisher: IEEE

    Speech enhancement is important for applications such as telecommunications, hearing aids, automatic speech recognition and voice-controlled system. The enhancement algorithms aim to reduce interfering noise while minimizing any speech distortion. In this work for speech enhancement, we propose to use polynomial matrices in order to exploit the spatial, spectral as well as temporal correlations between the speech signals received by the microphone array. Polynomial matrices provide the necessary mathematical framework in order to exploit constructively​ the spatial correlations within and between sensor pairs, as well as the spectral-temporal correlations of broadband signals, such as speech. Specifically, the polynomial eigenvalue decomposition (PEVD) decorrelates simultaneously in space, time and frequency. We then propose a PEVD-based speech enhancement algorithm. Simulations and informal listening examples have shown that our method achieves noise reduction without introducing artefacts into the enhanced signal for white, babble and factory noise conditions between -10 dB to 30 dB SNR.

  • Conference paper
    Hogg AOT, Evers C, Naylor PA, 2019,

    Multiple Hypothesis Tracking for Overlapping Speaker Segmentation

    , 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Publisher: IEEE
  • Conference paper
    Sharma D, Hogg AOT, Wang Y, Nour-Eldin A, Naylor PAet al., 2019,

    Non-Intrusive POLQA Estimation of Speech Quality using Recurrent Neural Networks

    , 2019 27th European Signal Processing Conference (EUSIPCO), Publisher: IEEE
  • Conference paper
    Neo V, Naylor PA, 2019,

    Second order sequential best rotation algorithm with householder reduction for polynomial matrix eigenvalue decomposition

    , IEEE International Conference on Acoustics, Speech and Signal Processing, Publisher: IEEE, Pages: 8043-8047, ISSN: 0736-7791

    The Second-order Sequential Best Rotation (SBR2) algorithm, usedfor Eigenvalue Decomposition (EVD) on para-Hermitian polynomialmatrices typically encountered in wideband signal processingapplications like multichannel Wiener filtering and channel coding,involves a series of delay and rotation operations to achieve diagonalisation.In this paper, we proposed the use of Householder transformationsto reduce polynomial matrices to tridiagonal form beforezeroing the dominant element with rotation. Similar to performingHouseholder reduction on conventional matrices, our methodenables SBR2 to converge in fewer iterations with smaller orderof polynomial matrix factors because more off-diagonal Frobeniusnorm(F-norm) could be transferred to the main diagonal at everyiteration. A reduction in the number of iterations by 12.35% and0.1% improvement in reconstruction error is achievable.

  • Conference paper
    Hogg AOT, Evers C, Naylor PA, 2019,

    Speaker Change Detection Using Fundamental Frequency with Application to Multi-talker Segmentation

    , ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Publisher: IEEE
  • Journal article
    Stitt P, Picinali L, Katz BFG, 2019,

    Auditory accommodation to poorly matched non-individual spectral localization cues through active learning

    , Scientific Reports, Vol: 9, Pages: 1-14, ISSN: 2045-2322

    This study examines the effect of adaptation to non-ideal auditory localization cues represented by the Head-Related Transfer Function (HRTF) and the retention of training for up to three months after the last session. Continuing from a previous study on rapid non-individual HRTF learning, subjects using non-individual HRTFs were tested alongside control subjects using their own measured HRTFs. Perceptually worst-rated non-individual HRTFs were chosen to represent the worst-case scenario in practice and to allow for maximum potential for improvement. The methodology consisted of a training game and a localization test to evaluate performance carried out over 10 sessions. Sessions 1–4 occurred at 1 week intervals, performed by all subjects. During initial sessions, subjects showed improvement in localization performance for polar error. Following this, half of the subjects stopped the training game element, continuing with only the localization task. The group that continued to train showed improvement, with 3 of 8 subjects achieving group mean polar errors comparable to the control group. The majority of the group that stopped the training game retained their performance attained at the end of session 4. In general, adaptation was found to be quite subject dependent, highlighting the limits of HRTF adaptation in the case of poor HRTF matches. No identifier to predict learning ability was observed.

  • Conference paper
    Moore A, de Haan JM, Pedersen MS, Naylor P, Brookes D, Jensen Jet al., 2019,

    Personalized {HRTF}s for hearing aids

    , ELOBES2019
  • Conference paper
    Cuevas-Rodriguez M, Gonzalez-Toledo D, La Rubia-Cuestas ED, Garre C, Molina-Tanco L, Reyes-Lecuona A, Poirier-Quinot D, Picinali Let al., 2018,

    The 3D Tune-In Toolkit - 3D audio spatialiser, hearing loss and hearing aid simulations

    The 3DTI Toolkit is a standard C++ library for audio spatialisation and simulation using loudspeakers or headphones developed within the 3D Tune-In (3DTI) project (http://www.3d-tune-in.eu), which aims at using 3D sound and simulating hearing loss and hearing aids within virtual environments and games. The Toolkit allows the design and rendering of highly realistic and immersive 3D audio, and the simulation of virtual hearing aid devices and of different typologies of hearing loss. The library includes a real-time 3D binaural audio renderer offering full 3D spatialization based on efficient Head Related Transfer Function (HRTF) convolution, including smooth interpolation among impulse responses, customization of listener head radius and specific simulation of far-distance and near-field effects. In addition, spatial reverberation is simulated in real time using a uniformly partitioned convolution with Binaural Room Impulse Responses (BRIRs) employing a virtual Ambisonic approach. The 3D Tune-In Toolkit includes also a loudspeaker-based spatialiser implemented using Ambisonic encoding/decoding. This poster presents a brief overview of the main features of the Toolkit, which is released open-source under GPL v3 license (the code is available in GitHub https://github.com/3DTune-In/3dti-AudioToolkit).

  • Journal article
    Braiman C, Fridman A, Conte MM, Vosse HU, Reichenbach CS, Reichenbach J, Schiff NDet al., 2018,

    Cortical response to the natural speech envelope correlates with neuroimaging evidence of cognition in severe brain injury

    , Current Biology, Vol: 28, Pages: 3833-3839.E3, ISSN: 0960-9822

    Recent studies identify severely brain-injured patients with limited or no behavioral responses who successfully perform functional magnetic resonance imaging (fMRI) or electroencephalogram (EEG) mental imagery tasks [1, 2, 3, 4, 5]. Such tasks are cognitively demanding [1]; accordingly, recent studies support that fMRI command following in brain-injured patients associates with preserved cerebral metabolism and preserved sleep-wake EEG [5, 6]. We investigated the use of an EEG response that tracks the natural speech envelope (NSE) of spoken language [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22] in healthy controls and brain-injured patients (vegetative state to emergence from minimally conscious state). As audition is typically preserved after brain injury, auditory paradigms may be preferred in searching for covert cognitive function [23, 24, 25]. NSE measures are obtained by cross-correlating EEG with the NSE. We compared NSE latencies and amplitudes with and without consideration of fMRI assessments. NSE latencies showed significant and progressive delay across diagnostic categories. Patients who could carry out fMRI-based mental imagery tasks showed no statistically significant difference in NSE latencies relative to healthy controls; this subgroup included patients without behavioral command following. The NSE may stratify patients with severe brain injuries and identify those patients demonstrating “cognitive motor dissociation” (CMD) [26] who show only covert evidence of command following utilizing neuroimaging or electrophysiological methods that demand high levels of cognitive function. Thus, the NSE is a passive measure that may provide a useful screening tool to improve detection of covert cognition with fMRI or other methods and improve stratification of patients with disorders of consciousness in research studies.

  • Journal article
    Sethi S, Ewers R, Jones N, Orme D, Picinali Let al., 2018,

    Robust, real-time and autonomous monitoring of ecosystems with an open, low-cost, networked device

    , Methods in Ecology and Evolution, Vol: 9, Pages: 2383-2387, ISSN: 2041-210X

    1. Automated methods of monitoring ecosystems provide a cost-effective way to track changes in natural system's dynamics across temporal and spatial scales. However, methods of recording and storing data captured from the field still require significant manual effort. 2. Here we introduce an open source, inexpensive, fully autonomous ecosystem monitoring unit for capturing and remotely transmitting continuous data streams from field sites over long time-periods. We provide a modular software framework for deploying various sensors, together with implementations to demonstrate proof of concept for continuous audio monitoring and time-lapse photography. 3. We show how our system can outperform comparable technologies for fractions of the cost, provided a local mobile network link is available. The system is robust to unreliable network signals and has been shown to function in extreme environmental conditions, such as in the tropical rainforests of Sabah, Borneo. 4. We provide full details on how to assemble the hardware, and the open-source software. Paired with appropriate automated analysis techniques, this system could provide spatially dense, near real-time, continuous insights into ecosystem and biodiversity dynamics at a low cost.

  • Conference paper
    Moore AH, Lightburn L, Xue W, Naylor P, Brookes Det al., 2018,

    Binaural mask-informed speech enhancement for hearing aids with head tracking

    , International Workshop on Acoustic Signal Enhancement (IWAENC 2018), Publisher: IEEE, Pages: 461-465

    An end-to-end speech enhancement system for hearing aids is pro-posed which seeks to improve the intelligibility of binaural speechin noise during head movement. The system uses a reference beam-former whose look direction is informed by knowledge of the headorientation and the a priori known direction of the desired source.From this a time-frequency mask is estimated using a deep neuralnetwork. The binaural signals are obtained using bilateral beam-formers followed by a classical minimum mean square error speechenhancer, modified to use the estimated mask as a speech presenceprobability prior. In simulated experiments, the improvement in abinaural intelligibility metric (DBSTOI) given by the proposed sys-tem relative to beamforming alone corresponds to an SNR improve-ment of 4 to 6 dB. Results also demonstrate the individual contribu-tions of incorporating the mask and the head orientation-aware beamsteering to the proposed system.

  • Conference paper
    Dawson PJ, De Sena E, Naylor PA, 2018,

    An acoustic image-source characterisation of surface profiles

    , 2018 26th European Signal Processing Conference (EUSIPCO), Publisher: IEEE, Pages: 2130-2134, ISSN: 2076-1465

    The image-source method models the specular reflection from a plane by means of a secondary source positioned at the source's reflected image. The method has been widely used in acoustics to model the reverberant field of rectangular rooms, but can also be used for general-shaped rooms and non-flat reflectors. This paper explores the relationship between the physical properties of a non-flat reflector and the statistical properties of the associated cloud of image-sources. It is shown here that the standard deviation of the image-sources is strongly correlated with the ratio between depth and width of the reflector's spatial features.

  • Conference paper
    Kim C, Steadman M, Lestang JH, Goodman DFM, Picinali Let al., 2018,

    A VR-based mobile platform for training to non-individualized binaural 3D audio

    , 144th Audio Engineering Society Convention 2018

    © 2018 Audio Engineering Society. All Rights Reserved. Delivery of immersive 3D audio with arbitrarily-positioned sound sources over headphones often requires processing of individual source signals through a set of Head-Related Transfer Functions (HRTFs). The individual morphological differences and the impracticality of HRTF measurement make it difficult to deliver completely individualized 3D audio, and instead lead to the use of previously-measured non-individual sets of HRTFs. In this study, a VR-based mobile sound localization training prototype system is introduced which uses HRTF sets for audio. It consists of a mobile phone as a head-mounted device, a hand-held Bluetooth controller, and a network-enabled laptop with a USB audio interface and a pair of headphones. The virtual environment was developed on the mobile phone such that the user can listen-to/navigate-in an acoustically neutral scene and locate invisible target sound sources presented at random directions using non-individualized HRTFs in repetitive sessions. Various training paradigms can be designed with this system, with performance-related feedback provided according to the user’s localization accuracy, including visual indication of the target location, and some aspects of a typical first-person shooting game, such as enemies, scoring, and level advancement. An experiment was conducted using this system, in which 11 subjects went through multiple training sessions, using non-individualized HRTF sets. The localization performance evaluations showed reduction of overall localization angle error over repeated training sessions, reflecting lower front-back confusion rates.

This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.

Request URL: http://wlsprd.imperial.ac.uk:80/respub/WEB-INF/jsp/search-t4-html.jsp Request URI: /respub/WEB-INF/jsp/search-t4-html.jsp Query String: id=1047&limit=20&page=1&respub-action=search.html Current Millis: 1624305136475 Current Time: Mon Jun 21 20:52:16 BST 2021