Search results

  • Conference paper
    Neo VW, Weiss S, McKnight S, Hogg A, Naylor PA et al., 2022,

    Polynomial eigenvalue decomposition-based target speaker voice activity detection in the presence of competing talkers

    , 17th International Workshop on Acoustic Signal Enhancement

    Voice activity detection (VAD) algorithms are essential for many speech processing applications, such as speaker diarization, automatic speech recognition, speech enhancement, and speech coding. With a good VAD algorithm, non-speech segments can be excluded to improve the performance and reduce the computational load of these applications. In this paper, we propose a polynomial eigenvalue decomposition-based target-speaker VAD algorithm to detect unseen target speakers in the presence of competing talkers. The proposed approach uses frame-based processing to compute the syndrome energy, which is used for testing the presence or absence of a target speaker. The proposed approach is consistently among the best in F1 and balanced accuracy scores over the investigated range of signal-to-interference ratio (SIR) from -10 dB to 20 dB.
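
    As an illustration of the frame-based decision rule described in this abstract, here is a minimal sketch in Python. It assumes the PEVD-based filtering stage has already produced a residual ("syndrome") signal; the function `syndrome_vad`, the frame sizes and the threshold `tau` are illustrative placeholders, not the authors' implementation.

```python
import numpy as np

def frame_signal(x, frame_len=512, hop=256):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def syndrome_vad(residual, frame_len=512, hop=256, tau=1e-3):
    """Frame-wise target-speaker VAD from a 'syndrome' (residual) signal.

    `residual` is assumed to be the output of the PEVD-based filtering
    stage: low energy when the target speaker is absent, high when
    present (a stand-in for the paper's actual statistic).
    Returns a boolean decision per frame.
    """
    frames = frame_signal(residual, frame_len, hop)
    energy = np.mean(frames ** 2, axis=1)   # per-frame syndrome energy
    return energy > tau                     # hypothesis test vs threshold

# Toy usage: a noise-only segment followed by a noise+target segment.
rng = np.random.default_rng(0)
x = np.concatenate([0.01 * rng.standard_normal(8000),
                    rng.standard_normal(8000)])
print(syndrome_vad(x).astype(int))
```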

  • Conference paper
    McKnight S, Hogg A, Neo V, Naylor P et al., 2022,

    A study of salient modulation domain features for speaker identification

    , Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Publisher: IEEE, Pages: 705-712

    This paper studies the ranges of acoustic and modulation frequencies of speech most relevant for identifying speakers and compares the speaker-specific information present in the temporal envelope against that present in the temporal fine structure. This study uses correlation and feature importance measures, random forest and convolutional neural network models, and reconstructed speech signals with specific acoustic and/or modulation frequencies removed to identify the salient points. It is shown that the range of modulation frequencies associated with the fundamental frequency is more important than the 1-16 Hz range most commonly used in automatic speech recognition, and that the 0 Hz modulation frequency band contains significant speaker information. It is also shown that the temporal envelope is more discriminative among speakers than the temporal fine structure, but that the temporal fine structure still contains useful additional information for speaker identification. This research aims to provide a timely addition to the literature by identifying specific aspects of speech relevant for speaker identification that could be used to enhance the discriminant capabilities of machine learning models.
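
    The envelope/fine-structure comparison in this abstract rests on a standard decomposition via the analytic signal; a minimal sketch follows. The Hilbert-transform split shown here is one common choice and is not necessarily the exact front end used in the paper.

```python
import numpy as np
from scipy.signal import hilbert

def envelope_and_tfs(x):
    """Split a (sub-band) signal into temporal envelope and fine structure.

    Uses the analytic signal: envelope = |a(t)|, TFS = cos(phase(a(t))).
    One common decomposition consistent with the comparison in the
    paper, not necessarily the authors' exact front end.
    """
    analytic = hilbert(x)
    env = np.abs(analytic)            # temporal envelope
    tfs = np.cos(np.angle(analytic))  # temporal fine structure
    return env, tfs

# Toy usage: a 200 Hz carrier amplitude-modulated at 4 Hz.
fs = 16000
t = np.arange(fs) / fs
x = (1 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 200 * t)
env, tfs = envelope_and_tfs(x)
print(env.max(), tfs.max())
```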

  • Conference paper
    Hogg AOT, Neo VW, Weiss S, Evers C, Naylor PA et al., 2021,

    A Polynomial Eigenvalue Decomposition MUSIC Approach for Broadband Sound Source Localization

    , 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Publisher: IEEE
  • Conference paper
    Hogg AOT, Evers C, Naylor PA, 2021,

    Multichannel Overlapping Speaker Segmentation Using Multiple Hypothesis Tracking Of Acoustic And Spatial Features

    , ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Publisher: IEEE
  • Conference paper
    Neo VW, Evers C, Naylor PA, 2021,

    Polynomial matrix eigenvalue decomposition of spherical harmonics for speech enhancement

    , IEEE International Conference on Acoustics, Speech and Signal Processing, Publisher: IEEE, Pages: 786-790

    Speech enhancement algorithms using polynomial matrix eigenvalue decomposition (PEVD) have been shown to be effective for noisy and reverberant speech. However, these algorithms do not scale well in complexity with the number of channels used in the processing. For a spherical microphone array sampling an order-limited sound field, the spherical harmonics provide a compact representation of the microphone signals in the form of eigenbeams. We propose a PEVD algorithm that uses only the lower-dimensional eigenbeams for speech enhancement at a significantly lower computational cost. The proposed algorithm is shown to significantly reduce complexity while maintaining full performance. Informal listening examples have also indicated that the processing does not introduce any noticeable artefacts.
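
    A minimal sketch of the eigenbeam idea: project the microphone signals onto a truncated spherical-harmonic basis, so subsequent processing operates on (order+1)^2 eigenbeams rather than the full channel count. The least-squares transform and the omission of radial (mode-strength) equalisation are simplifications for illustration, not the paper's algorithm.

```python
import numpy as np
from scipy.special import sph_harm

def eigenbeams(mic_signals, azimuth, colatitude, order=2):
    """Project microphone signals onto spherical-harmonic 'eigenbeams'.

    mic_signals: (n_mics, n_samples); azimuth/colatitude: (n_mics,) angles.
    Returns ((order+1)**2, n_samples) eigenbeam signals, a compact
    representation of an order-limited sound field. Illustrative only:
    a practical array also needs radial (mode-strength) equalisation.
    """
    Y = np.column_stack([
        sph_harm(m, n, azimuth, colatitude)
        for n in range(order + 1) for m in range(-n, n + 1)
    ])                                  # (n_mics, (order+1)**2)
    # Least-squares spherical harmonic transform of the mic signals.
    return np.linalg.pinv(Y) @ mic_signals

# Toy usage: 32 random sensor directions, white signals.
rng = np.random.default_rng(0)
az = rng.uniform(0, 2 * np.pi, 32)
col = np.arccos(rng.uniform(-1, 1, 32))
beams = eigenbeams(rng.standard_normal((32, 1000)), az, col, order=2)
print(beams.shape)   # (9, 1000)
```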

  • Journal article
    Hogg A, Evers C, Moore A, Naylor P et al., 2021,

    Overlapping speaker segmentation using multiple hypothesis tracking of fundamental frequency

    , IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol: 29, Pages: 1479-1490, ISSN: 2329-9290

    This paper demonstrates how the harmonic structure of voiced speech can be exploited to segment multiple overlapping speakers in a speaker diarization task. We explore how a change in the speaker can be inferred from a change in pitch. We show that voiced harmonics can be useful in detecting when more than one speaker is talking, such as during overlapping speaker activity. A novel system is proposed to track multiple harmonics simultaneously, allowing for the determination of onsets and end-points of a speaker’s utterance in the presence of an additional active speaker. This system is benchmarked against a segmentation system from the literature that employs a bidirectional long short-term memory (BLSTM) network approach and requires training. Experimental results highlight that the proposed approach outperforms the BLSTM baseline approach by 12.9% in terms of HIT rate for speaker segmentation. We also show that the estimated pitch tracks of our system can be used as features to the BLSTM to achieve further improvements of 1.21% in terms of coverage and 2.45% in terms of purity.
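
    To make the pitch-based segmentation cue concrete, here is a toy, single-track sketch: estimate a per-frame fundamental frequency by autocorrelation and flag large jumps as candidate speaker-change points. The paper's system instead tracks multiple harmonics with multiple hypothesis tracking; the frame sizes and the 40 Hz jump threshold below are illustrative, not from the paper.

```python
import numpy as np

def frame_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Crude autocorrelation pitch estimate for one frame (Hz)."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(ac[lo:hi])     # strongest periodicity in range
    return fs / lag

def pitch_change_points(x, fs, frame_len=1024, hop=512, jump_hz=40.0):
    """Flag frames whose pitch jumps by more than `jump_hz` versus the
    previous frame -- a toy, single-track stand-in for the paper's
    multi-hypothesis harmonic tracker."""
    f0 = np.array([frame_pitch(x[i:i + frame_len], fs)
                   for i in range(0, len(x) - frame_len, hop)])
    return np.where(np.abs(np.diff(f0)) > jump_hz)[0] + 1

# Toy usage: a 120 Hz "speaker" followed by a 220 Hz "speaker".
fs = 16000
t = np.arange(fs) / fs
x = np.concatenate([np.sin(2 * np.pi * 120 * t),
                    np.sin(2 * np.pi * 220 * t)])
print(pitch_change_points(x, fs))   # change point near the midpoint
```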

  • Conference paper
    McKnight SW, Hogg A, Naylor P, 2020,

    Analysis of phonetic dependence of segmentation errors in speaker diarization

    , European Signal Processing Conference (EUSIPCO), Publisher: IEEE, ISSN: 2076-1465

    Evaluation of speaker segmentation and diarization normally makes use of forgiveness collars around ground truth speaker segment boundaries such that estimated speaker segment boundaries within such collars are considered completely correct. This paper shows that the popular recent approach of removing forgiveness collars from speaker diarization evaluation tools can unfairly penalize speaker diarization systems that correctly estimate speaker segment boundaries. The uncertainty in identifying the start and/or end of a particular phoneme means that the ground truth segmentation is not perfectly accurate, and even trained human listeners are unable to identify phoneme boundaries with full consistency. This research analyses the phoneme dependence of this uncertainty and shows that it depends on (i) whether the phoneme being detected is at the start or end of an utterance and (ii) what the phoneme is, so that the use of a uniform forgiveness collar is inadequate. This analysis is expected to point the way towards more indicative and repeatable assessment of the performance of speaker diarization systems.
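
    For readers unfamiliar with forgiveness collars, a minimal sketch of the evaluation rule under discussion: an estimated boundary counts as a hit if it falls within +/- collar seconds of an as-yet-unmatched reference boundary. The 0.25 s default below is a commonly used value, not a recommendation of the paper, whose point is precisely that a uniform collar is inadequate.

```python
def hits_under_collar(est, ref, collar=0.25):
    """Count estimated boundaries (seconds) lying within +/- `collar`
    seconds of an unmatched reference boundary, using one-to-one
    greedy matching."""
    ref = sorted(ref)
    used = [False] * len(ref)
    hits = 0
    for b in sorted(est):
        for i, r in enumerate(ref):
            if not used[i] and abs(b - r) <= collar:
                used[i] = True      # each reference matches at most once
                hits += 1
                break
    return hits

# Toy usage: 2 of the 3 estimates fall inside the collar.
print(hits_under_collar(est=[1.1, 4.9, 9.0], ref=[1.0, 5.0, 12.0]))
```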

  • Conference paper
    Neo VW, Evers C, Naylor PA, 2021,

    Speech dereverberation performance of a polynomial-EVD subspace approach

    , European Signal Processing Conference (EUSIPCO), Publisher: IEEE, ISSN: 2076-1465

    The degradation of speech arising from additive background noise and reverberation affects the performance of important speech applications such as telecommunications, hearing aids, voice-controlled systems and robot audition. In this work, we focus on dereverberation. It is shown that the parameterized polynomial matrix eigenvalue decomposition (PEVD)-based speech enhancement algorithm exploits the lack of correlation between speech and the late reflections to enhance the speech component associated with the direct path and early reflections. The algorithm's performance is evaluated using simulations involving measured acoustic impulse responses and noise from the ACE corpus. The simulations and informal listening examples have indicated that the PEVD-based algorithm performs dereverberation over a range of SNRs without introducing any noticeable processing artefacts.

  • Conference paper
    Neo VW, Evers C, Naylor PA, 2020,

    PEVD-based speech enhancement in reverberant environments

    , IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Publisher: IEEE, Pages: 186-190

    The enhancement of noisy speech is important for applications involving human-to-human interactions, such as telecommunications and hearing aids, as well as human-to-machine interactions, such as voice-controlled systems and robot audition. In this work, we focus on reverberant environments. It is shown that, by exploiting the lack of correlation between speech and the late reflections, further noise reduction can be achieved. This is verified using simulations involving actual acoustic impulse responses and noise from the ACE corpus. The simulations show that even without using a noise estimator, our proposed method simultaneously achieves noise reduction, and enhancement of speech quality and intelligibility, in reverberant environments over a wide range of SNRs. Furthermore, informal listening examples highlight that our approach does not introduce any significant processing artefacts such as musical noise.
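
    As rough intuition for the PEVD-based enhancement in the preceding two abstracts: PEVD generalises the ordinary eigenvalue decomposition to polynomial (lag-dependent) covariance matrices. The sketch below shows only the degenerate zero-lag case, projecting a multichannel mixture onto the dominant eigenvector of its spatial covariance; it is a toy stand-in for the intuition, not the authors' algorithm.

```python
import numpy as np

def dominant_subspace_enhance(X):
    """Zero-lag toy version of subspace enhancement.

    PEVD operates on a polynomial (lag-dependent) covariance matrix;
    with all lags ignored it reduces to an ordinary EVD of the spatial
    covariance. Keep the dominant eigenvector and project the channels
    onto it. X: (n_channels, n_samples). Returns one enhanced channel.
    """
    R = (X @ X.conj().T) / X.shape[1]   # zero-lag spatial covariance
    w, V = np.linalg.eigh(R)            # eigenvalues in ascending order
    u = V[:, -1]                        # dominant subspace direction
    return u.conj() @ X                 # rank-1 projection / beamformer

# Toy usage: a common component plus per-channel noise.
rng = np.random.default_rng(0)
s = rng.standard_normal(4000)
X = np.outer([1.0, 0.8, 0.6], s) + 0.3 * rng.standard_normal((3, 4000))
print(dominant_subspace_enhance(X).shape)   # (4000,)
```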

  • Journal article
    Steadman M, Kim C, Lestang J-H, Goodman D, Picinali L et al., 2019,

    Short-term effects of sound localization training in virtual reality

    , Scientific Reports, Vol: 9, ISSN: 2045-2322

    Head-related transfer functions (HRTFs) capture the direction-dependent way that sound interacts with the head and torso. In virtual audio systems, which aim to emulate these effects, non-individualized, generic HRTFs are typically used, leading to an inaccurate perception of virtual sound location. Training has the potential to exploit the brain’s ability to adapt to these unfamiliar cues. In this study, three virtual sound localization training paradigms were evaluated: one provided simple visual positional confirmation of sound source location, a second introduced game design elements (“gamification”) and a final version additionally utilized head-tracking to provide listeners with experience of relative sound source motion (“active listening”). The results demonstrate a significant effect of training after a small number of short (12-minute) training sessions, which is retained across multiple days. Gamification alone had no significant effect on the efficacy of the training, but active listening resulted in significantly greater improvements in localization accuracy. In general, improvements in virtual sound localization following training generalized to a second set of non-individualized HRTFs, although some HRTF-specific changes were observed in polar angle judgement for the active listening group. The implications of this for the putative mechanisms of the adaptation process are discussed.

  • Conference paper
    Hogg AOT, Evers C, Naylor PA, 2019,

    Multiple Hypothesis Tracking for Overlapping Speaker Segmentation

    , 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Publisher: IEEE
  • Conference paper
    Neo VW, Evers C, Naylor PA, 2019,

    Speech Enhancement Using Polynomial Eigenvalue Decomposition

    , 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Publisher: IEEE
  • Conference paper
    Sharma D, Hogg AOT, Wang Y, Nour-Eldin A, Naylor PA et al., 2019,

    Non-Intrusive POLQA Estimation of Speech Quality using Recurrent Neural Networks

    , 2019 27th European Signal Processing Conference (EUSIPCO), Publisher: IEEE
  • Conference paper
    Neo VW, Naylor PA, 2019,

    Second Order Sequential Best Rotation Algorithm with Householder Reduction for Polynomial Matrix Eigenvalue Decomposition

    , ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Publisher: IEEE
  • Conference paper
    Hogg AOT, Evers C, Naylor PA, 2019,

    Speaker Change Detection Using Fundamental Frequency with Application to Multi-talker Segmentation

    , ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Publisher: IEEE
  • Journal article
    Stitt P, Picinali L, Katz BFG, 2019,

    Auditory accommodation to poorly matched non-individual spectral localization cues through active learning

    , Scientific Reports, Vol: 9, Pages: 1-14, ISSN: 2045-2322

    This study examines the effect of adaptation to non-ideal auditory localization cues represented by the Head-Related Transfer Function (HRTF) and the retention of training for up to three months after the last session. Continuing from a previous study on rapid non-individual HRTF learning, subjects using non-individual HRTFs were tested alongside control subjects using their own measured HRTFs. Perceptually worst-rated non-individual HRTFs were chosen to represent the worst-case scenario in practice and to allow for maximum potential for improvement. The methodology consisted of a training game and a localization test to evaluate performance, carried out over 10 sessions. Sessions 1–4 occurred at 1-week intervals and were performed by all subjects. During the initial sessions, subjects showed improvement in localization performance for polar error. Following this, half of the subjects stopped the training game element, continuing with only the localization task. The group that continued to train showed improvement, with 3 of 8 subjects achieving group mean polar errors comparable to the control group. The majority of the group that stopped the training game retained the performance attained at the end of session 4. In general, adaptation was found to be quite subject-dependent, highlighting the limits of HRTF adaptation in the case of poor HRTF matches. No identifier to predict learning ability was observed.

  • Conference paper
    Moore A, de Haan JM, Pedersen MS, Naylor P, Brookes D, Jensen J et al., 2019,

    Personalized HRTFs for hearing aids

    , ELOBES2019
  • Conference paper
    Cuevas-Rodriguez M, Gonzalez-Toledo D, La Rubia-Cuestas ED, Garre C, Molina-Tanco L, Reyes-Lecuona A, Poirier-Quinot D, Picinali L et al., 2018,

    The 3D Tune-In Toolkit - 3D audio spatialiser, hearing loss and hearing aid simulations

    The 3DTI Toolkit is a standard C++ library for audio spatialisation and simulation using loudspeakers or headphones, developed within the 3D Tune-In (3DTI) project (http://www.3d-tune-in.eu), which aims at using 3D sound and simulating hearing loss and hearing aids within virtual environments and games. The Toolkit allows the design and rendering of highly realistic and immersive 3D audio, and the simulation of virtual hearing aid devices and of different typologies of hearing loss. The library includes a real-time 3D binaural audio renderer offering full 3D spatialization based on efficient Head Related Transfer Function (HRTF) convolution, including smooth interpolation among impulse responses, customization of listener head radius and specific simulation of far-distance and near-field effects. In addition, spatial reverberation is simulated in real time using a uniformly partitioned convolution with Binaural Room Impulse Responses (BRIRs) employing a virtual Ambisonic approach. The 3D Tune-In Toolkit also includes a loudspeaker-based spatialiser implemented using Ambisonic encoding/decoding. This poster presents a brief overview of the main features of the Toolkit, which is released open source under the GPL v3 license (the code is available on GitHub: https://github.com/3DTune-In/3dti-AudioToolkit).
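
    A minimal sketch of the HRTF-convolution stage at the core of such a renderer: convolve the mono source with a left/right head-related impulse response (HRIR) pair. The synthetic placeholder HRIRs below are for illustration only; the Toolkit itself adds interpolation between measured responses, near-field corrections and partitioned-convolution reverberation.

```python
import numpy as np
from scipy.signal import fftconvolve

def binaural_render(mono, hrir_left, hrir_right):
    """Render a mono source to two ears by convolving with an HRIR pair.

    A sketch of the HRTF-convolution stage only; a full renderer also
    interpolates between measured responses as the source moves.
    """
    left = fftconvolve(mono, hrir_left)
    right = fftconvolve(mono, hrir_right)
    return np.stack([left, right])      # (2, n_samples + n_taps - 1)

# Toy usage with synthetic placeholder HRIRs (a real system loads
# measured responses, e.g. from a SOFA file).
rng = np.random.default_rng(0)
mono = rng.standard_normal(16000)
hl = np.exp(-np.arange(128) / 16.0) * rng.standard_normal(128)
hr = np.exp(-np.arange(128) / 12.0) * rng.standard_normal(128)
print(binaural_render(mono, hl, hr).shape)   # (2, 16127)
```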

  • Journal article
    Braiman C, Fridman A, Conte MM, Vosse HU, Reichenbach CS, Reichenbach J, Schiff ND et al., 2018,

    Cortical response to the natural speech envelope correlates with neuroimaging evidence of cognition in severe brain injury

    , Current Biology, Vol: 28, Pages: 3833-3839.E3, ISSN: 0960-9822

    Recent studies identify severely brain-injured patients with limited or no behavioral responses who successfully perform functional magnetic resonance imaging (fMRI) or electroencephalogram (EEG) mental imagery tasks [1, 2, 3, 4, 5]. Such tasks are cognitively demanding [1]; accordingly, recent studies support that fMRI command following in brain-injured patients associates with preserved cerebral metabolism and preserved sleep-wake EEG [5, 6]. We investigated the use of an EEG response that tracks the natural speech envelope (NSE) of spoken language [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22] in healthy controls and brain-injured patients (vegetative state to emergence from minimally conscious state). As audition is typically preserved after brain injury, auditory paradigms may be preferred in searching for covert cognitive function [23, 24, 25]. NSE measures are obtained by cross-correlating EEG with the NSE. We compared NSE latencies and amplitudes with and without consideration of fMRI assessments. NSE latencies showed significant and progressive delay across diagnostic categories. Patients who could carry out fMRI-based mental imagery tasks showed no statistically significant difference in NSE latencies relative to healthy controls; this subgroup included patients without behavioral command following. The NSE may stratify patients with severe brain injuries and identify those patients demonstrating “cognitive motor dissociation” (CMD) [26] who show only covert evidence of command following utilizing neuroimaging or electrophysiological methods that demand high levels of cognitive function. Thus, the NSE is a passive measure that may provide a useful screening tool to improve detection of covert cognition with fMRI or other methods and improve stratification of patients with disorders of consciousness in research studies.
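
    A schematic version of the NSE measure described above: cross-correlate an EEG channel with the natural speech envelope and read off the latency and amplitude of the dominant peak. Real pipelines band-pass filter, average across trials and test significance; the sampling rate and lag window below are arbitrary choices for illustration.

```python
import numpy as np

def nse_response(eeg, envelope, fs, max_lag_s=0.5):
    """Cross-correlate one EEG channel with the natural speech envelope
    (equal-length arrays) and return (latency_s, amplitude) of the
    largest-magnitude peak."""
    eeg = (eeg - eeg.mean()) / eeg.std()
    env = (envelope - envelope.mean()) / envelope.std()
    max_lag = int(max_lag_s * fs)
    lags = np.arange(-max_lag, max_lag + 1)
    # Normalised cross-correlation over the overlapping samples per lag.
    xc = np.array([np.mean(eeg[max(0, k):len(eeg) + min(0, k)]
                           * env[max(0, -k):len(env) - max(0, k)])
                   for k in lags])
    i = np.argmax(np.abs(xc))
    return lags[i] / fs, xc[i]

# Toy usage: EEG = envelope delayed by 0.1 s plus noise.
fs = 250
rng = np.random.default_rng(0)
env = rng.standard_normal(10 * fs)
eeg = np.roll(env, int(0.1 * fs)) + rng.standard_normal(env.size)
print(nse_response(eeg, env, fs))   # latency close to 0.1 s
```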

  • Journal article
    Sethi S, Ewers R, Jones N, Orme D, Picinali L et al., 2018,

    Robust, real-time and autonomous monitoring of ecosystems with an open, low-cost, networked device

    , Methods in Ecology and Evolution, Vol: 9, Pages: 2383-2387, ISSN: 2041-210X

    1. Automated methods of monitoring ecosystems provide a cost-effective way to track changes in natural systems' dynamics across temporal and spatial scales. However, methods of recording and storing data captured from the field still require significant manual effort.
    2. Here we introduce an open-source, inexpensive, fully autonomous ecosystem monitoring unit for capturing and remotely transmitting continuous data streams from field sites over long time periods. We provide a modular software framework for deploying various sensors, together with implementations to demonstrate proof of concept for continuous audio monitoring and time-lapse photography.
    3. We show how our system can outperform comparable technologies for a fraction of the cost, provided a local mobile network link is available. The system is robust to unreliable network signals and has been shown to function in extreme environmental conditions, such as in the tropical rainforests of Sabah, Borneo.
    4. We provide full details on how to assemble the hardware and the open-source software. Paired with appropriate automated analysis techniques, this system could provide spatially dense, near real-time, continuous insights into ecosystem and biodiversity dynamics at a low cost.

