
Publications

  • Conference paper
    McKnight SW, Hogg A, Naylor P, 2020,

    Analysis of phonetic dependence of segmentation errors in speaker diarization

    , European Signal Processing Conference (EUSIPCO), Publisher: IEEE

Evaluation of speaker segmentation and diarization normally makes use of forgiveness collars around ground truth speaker segment boundaries, such that estimated speaker segment boundaries within such collars are considered completely correct. This paper shows that the popular recent approach of removing forgiveness collars from speaker diarization evaluation tools can unfairly penalize speaker diarization systems that correctly estimate speaker segment boundaries. The uncertainty in identifying the start and/or end of a particular phoneme means that the ground truth segmentation is not perfectly accurate, and even trained human listeners are unable to identify phoneme boundaries with full consistency. This research analyses the phoneme dependence of this uncertainty, and shows that it depends on (i) whether the phoneme being detected is at the start or end of an utterance and (ii) what the phoneme is, so that the use of a uniform forgiveness collar is inadequate. This analysis is expected to point the way towards more indicative and repeatable assessment of the performance of speaker diarization systems.
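
The collar mechanism discussed above is easy to state concretely. Below is a minimal sketch of how a forgiveness collar decides whether an estimated boundary counts as correct; the helper name and the timing values are illustrative, not the paper's evaluation code:

```python
import numpy as np

def boundary_errors(ref_bounds, est_bounds, collar=0.25):
    """Count estimated boundaries lying within +/- collar seconds of
    the nearest reference boundary (hypothetical helper, not the
    paper's evaluation code)."""
    ref = np.asarray(ref_bounds, dtype=float)
    hits = sum(1 for b in est_bounds if np.min(np.abs(ref - b)) <= collar)
    return hits, len(est_bounds) - hits

# A 250 ms collar forgives small, perceptually ambiguous offsets;
# a zero collar counts them as errors even though the ground truth
# itself is uncertain at phoneme boundaries.
ref = [1.00, 3.50, 7.20]
est = [1.10, 3.45, 7.60]
print(boundary_errors(ref, est, collar=0.25))  # → (2, 1)
print(boundary_errors(ref, est, collar=0.0))   # → (0, 3)
```

The paper's point is that the collar should arguably vary with the phoneme at the boundary rather than being uniform, as assumed here.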

  • Journal article
    Steadman M, Kim C, Lestang J-H, Goodman D, Picinali L et al., 2019,

    Short-term effects of sound localization training in virtual reality

    , Scientific Reports, Vol: 9, ISSN: 2045-2322

    Head-related transfer functions (HRTFs) capture the direction-dependent way that sound interacts with the head and torso. In virtual audio systems, which aim to emulate these effects, non-individualized, generic HRTFs are typically used, leading to an inaccurate perception of virtual sound location. Training has the potential to exploit the brain’s ability to adapt to these unfamiliar cues. In this study, three virtual sound localization training paradigms were evaluated: one provided simple visual positional confirmation of sound source location, a second introduced game design elements (“gamification”) and a final version additionally utilized head-tracking to provide listeners with experience of relative sound source motion (“active listening”). The results demonstrate a significant effect of training after a small number of short (12-minute) training sessions, which is retained across multiple days. Gamification alone had no significant effect on the efficacy of the training, but active listening resulted in significantly greater improvements in localization accuracy. In general, improvements in virtual sound localization following training generalized to a second set of non-individualized HRTFs, although some HRTF-specific changes were observed in polar angle judgement for the active listening group. The implications of this for the putative mechanisms of the adaptation process are discussed.

  • Conference paper
    Hogg AOT, Evers C, Naylor PA, 2019,

    Multiple hypothesis tracking for overlapping speaker segmentation

    , Pages: 195-199, ISSN: 1931-1168

    © 2019 IEEE. Speaker segmentation is an essential part of any diarization system. Applications of diarization include tasks such as speaker indexing, improving automatic speech recognition (ASR) performance and making single speaker-based algorithms available for use in multi-speaker environments. This paper proposes a multiple hypothesis tracking (MHT) method that exploits the harmonic structure associated with the pitch in voiced speech in order to segment the onsets and end-points of speech from multiple, overlapping speakers. The proposed method is evaluated against a segmentation system from the literature that uses a spectral representation and is based on employing bidirectional long short-term memory (BLSTM) networks. The proposed method is shown to achieve comparable performance for segmenting overlapping speakers using only the pitch harmonic information in the MHT framework.

  • Conference paper
    Sharma D, Hogg AOT, Wang Y, Nour-Eldin A, Naylor PA et al., 2019,

    Non-intrusive POLQA estimation of speech quality using recurrent neural networks

    , ISSN: 2219-5491

    © 2019 IEEE. Estimating the quality of speech without the use of a clean reference signal is a challenging problem, in part due to the time and expense required to collect sufficient training data for modern machine learning algorithms. We present a novel, non-intrusive estimator that exploits recurrent neural network architectures to predict the intrusive POLQA score of a speech signal in a short time context. The predictor is based on a novel compressed representation of modulation domain features, used in conjunction with static MFCC features. We show that the proposed method can reliably predict POLQA with a 300 ms context, achieving a mean absolute error of 0.21 on unseen data. The proposed method is trained using English speech and is shown to generalize well across unseen languages. The neural network also jointly estimates the mean voice activity detection (VAD) with an F1 accuracy score of 0.9, removing the need for an external VAD.

  • Conference paper
    Hogg AOT, Evers C, Naylor PA, 2019,

    Speaker Change Detection Using Fundamental Frequency with Application to Multi-talker Segmentation

    , Pages: 5826-5830, ISSN: 1520-6149

    © 2019 IEEE. This paper shows that time-varying pitch properties can be used advantageously within the segmentation step of a multi-talker diarization system. First, a study is conducted to verify that changes in pitch are strong indicators of changes in the speaker. It is then highlighted that an individual's pitch is smoothly varying and, therefore, can be predicted by means of a Kalman filter. Subsequently, it is shown that if the pitch is not predictable then this is most likely due to a change in the speaker. Finally, a novel system is proposed that uses this approach of pitch prediction for speaker change detection. This system is then evaluated against a commonly used MFCC segmentation system. The proposed system is shown to increase the speaker change detection rate from 43.3% to 70.5% on meetings in the AMI corpus. Therefore, there are two equally weighted contributions in this paper: 1. We address the question of whether a change in pitch is a reliable estimator of a speaker change in multi-talker meeting audio. 2. We develop a method to extract such speaker changes and test them on a widely available meeting corpus.
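
The pitch-prediction idea above can be sketched with a scalar Kalman filter: model pitch as a slowly drifting random walk and flag a speaker change when the innovation is too large to be natural pitch variation. This is an illustrative toy, not the paper's system; the noise variances and threshold are made-up values:

```python
import numpy as np

def detect_changes(f0, q=4.0, r=25.0, thresh=3.0):
    """Scalar Kalman filter over a pitch track (Hz). Flags frames whose
    normalized innovation exceeds `thresh` standard deviations -- a
    sketch of the idea only; q (process noise), r (measurement noise)
    and thresh are illustrative values."""
    x, p = f0[0], r                        # state estimate and its variance
    changes = []
    for t, z in enumerate(f0[1:], start=1):
        p_pred = p + q                     # random-walk prediction
        innov = z - x                      # innovation
        s = p_pred + r                     # innovation variance
        if abs(innov) / np.sqrt(s) > thresh:
            changes.append(t)              # pitch not predictable -> likely new speaker
            x, p = z, r                    # re-initialise on the new talker
            continue
        k = p_pred / s                     # Kalman gain
        x = x + k * innov
        p = (1 - k) * p_pred
    return changes

# Two talkers: ~120 Hz then ~210 Hz, each with mild natural variation.
track = [120, 122, 119, 121, 210, 212, 209, 211]
print(detect_changes(np.array(track, dtype=float)))  # → [4]
```

The jump to the second talker at frame 4 is far outside the predicted pitch trajectory, so it is flagged; the small within-talker fluctuations are absorbed by the filter.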

  • Journal article
    Stitt P, Picinali L, Katz BFG, 2019,

    Auditory accommodation to poorly matched non-individual spectral localization cues through active learning

    , Scientific Reports, Vol: 9, ISSN: 2045-2322

    This study examines the effect of adaptation to non-ideal auditory localization cues represented by the Head-Related Transfer Function (HRTF) and the retention of training for up to three months after the last session. Continuing from a previous study on rapid non-individual HRTF learning, subjects using non-individual HRTFs were tested alongside control subjects using their own measured HRTFs. Perceptually worst-rated non-individual HRTFs were chosen to represent the worst-case scenario in practice and to allow for maximum potential for improvement. The methodology consisted of a training game and a localization test to evaluate performance carried out over 10 sessions. Sessions 1–4, performed by all subjects, occurred at 1-week intervals. During initial sessions, subjects showed improvement in localization performance for polar error. Following this, half of the subjects stopped the training game element, continuing with only the localization task. The group that continued to train showed improvement, with 3 of 8 subjects achieving group mean polar errors comparable to the control group. The majority of the group that stopped the training game retained their performance attained at the end of session 4. In general, adaptation was found to be quite subject-dependent, highlighting the limits of HRTF adaptation in the case of poor HRTF matches. No identifier to predict learning ability was observed.

  • Conference paper
    Moore A, de Haan JM, Pedersen MS, Naylor P, Brookes D, Jensen J et al., 2019,

    Personalized HRTFs for hearing aids

    , ELOBES2019
  • Conference paper
    Cuevas-Rodriguez M, Gonzalez-Toledo D, La Rubia-Cuestas ED, Garre C, Molina-Tanco L, Reyes-Lecuona A, Poirier-Quinot D, Picinali L et al., 2018,

    The 3D Tune-In Toolkit - 3D audio spatialiser, hearing loss and hearing aid simulations

    © 2018 IEEE. The 3DTI Toolkit is a standard C++ library for audio spatialisation and simulation using loudspeakers or headphones, developed within the 3D Tune-In (3DTI) project, which aims at using 3D sound and simulating hearing loss and hearing aids within virtual environments and games. The Toolkit allows the design and rendering of highly realistic and immersive 3D audio, and the simulation of virtual hearing aid devices and of different typologies of hearing loss. The library includes a real-time 3D binaural audio renderer offering full 3D spatialization based on efficient Head Related Transfer Function (HRTF) convolution, including smooth interpolation among impulse responses, customization of listener head radius and specific simulation of far-distance and near-field effects. In addition, spatial reverberation is simulated in real time using a uniformly partitioned convolution with Binaural Room Impulse Responses (BRIRs) employing a virtual Ambisonic approach. The 3D Tune-In Toolkit also includes a loudspeaker-based spatialiser implemented using Ambisonic encoding/decoding. This poster presents a brief overview of the main features of the Toolkit, which is released open-source under the GPL v3 license; the code is available on GitHub.
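
At its core, the binaural rendering described above is per-direction HRIR convolution; interpolation, near-field correction and reverberation build on top of it. A minimal sketch with made-up two-tap HRIRs (this is the generic operation, not the Toolkit's C++ API):

```python
import numpy as np

def binaural_render(mono, hrir_left, hrir_right):
    """Convolve a mono source with the left/right head-related impulse
    responses (HRIRs) for one direction. Illustrative only: real
    renderers also interpolate between measured HRIRs and model
    near/far-field effects. The HRIRs below are made-up toys."""
    return np.stack([np.convolve(mono, hrir_left),
                     np.convolve(mono, hrir_right)])

# Source to the listener's left: louder and earlier at the left ear.
hrir_l = np.array([1.0, 0.0])   # direct path, full level
hrir_r = np.array([0.0, 0.6])   # one-sample delay, attenuated
out = binaural_render(np.array([1.0, 0.5]), hrir_l, hrir_r)
print(out)  # right channel is delayed and quieter than the left
```

The interaural time and level differences baked into the two HRIRs are what the listener's brain decodes as direction.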

  • Journal article
    Braiman C, Fridman A, Conte MM, Vosse HU, Reichenbach CS, Reichenbach J, Schiff ND et al., 2018,

    Cortical response to the natural speech envelope correlates with neuroimaging evidence of cognition in severe brain injury

    , Current Biology, Vol: 28, Pages: 3833-3839, ISSN: 1879-0445

    Recent studies identify severely brain-injured patients with limited or no behavioral responses who successfully perform functional magnetic resonance imaging (fMRI) or electroencephalogram (EEG) mental imagery tasks [1, 2, 3, 4, 5]. Such tasks are cognitively demanding [1]; accordingly, recent studies support that fMRI command following in brain-injured patients associates with preserved cerebral metabolism and preserved sleep-wake EEG [5, 6]. We investigated the use of an EEG response that tracks the natural speech envelope (NSE) of spoken language [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22] in healthy controls and brain-injured patients (vegetative state to emergence from minimally conscious state). As audition is typically preserved after brain injury, auditory paradigms may be preferred in searching for covert cognitive function [23, 24, 25]. NSE measures are obtained by cross-correlating EEG with the NSE. We compared NSE latencies and amplitudes with and without consideration of fMRI assessments. NSE latencies showed significant and progressive delay across diagnostic categories. Patients who could carry out fMRI-based mental imagery tasks showed no statistically significant difference in NSE latencies relative to healthy controls; this subgroup included patients without behavioral command following. The NSE may stratify patients with severe brain injuries and identify those patients demonstrating “cognitive motor dissociation” (CMD) [26] who show only covert evidence of command following utilizing neuroimaging or electrophysiological methods that demand high levels of cognitive function. Thus, the NSE is a passive measure that may provide a useful screening tool to improve detection of covert cognition with fMRI or other methods and improve stratification of patients with disorders of consciousness in research studies.
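
As a rough illustration of the NSE measure (not the study's pipeline), one can rectify and smooth the speech signal to obtain an envelope, cross-correlate it with the EEG, and read off the latency of the correlation peak. All names, parameter values and the synthetic data below are assumptions for demonstration:

```python
import numpy as np

def nse_latency(eeg, speech, fs, max_lag_ms=300):
    """Cross-correlate an EEG channel with the natural speech envelope
    and return the lag (ms) of the peak -- a toy sketch of the measure.
    The envelope here is a rectified, 10 ms smoothed speech signal."""
    win = max(1, int(0.01 * fs))
    env = np.convolve(np.abs(speech), np.ones(win) / win, mode="same")
    env = env - env.mean()
    e = eeg - eeg.mean()
    max_lag = int(max_lag_ms / 1000 * fs)
    lags = np.arange(0, max_lag + 1)
    n = len(env)
    xc = np.array([np.dot(e[l:], env[:n - l]) for l in lags])
    return 1000 * lags[np.argmax(xc)] / fs

# Synthetic check: "EEG" that is the speech envelope delayed by 100 ms
# plus noise; the estimated latency should land near 100 ms.
fs = 1000
rng = np.random.default_rng(0)
speech = rng.standard_normal(2 * fs)
env = np.convolve(np.abs(speech), np.ones(10) / 10, mode="same")
eeg = np.roll(env, int(0.1 * fs)) + 0.1 * rng.standard_normal(env.size)
print(nse_latency(eeg, speech, fs))
```

In the study, it is the systematic delay of this peak across diagnostic categories that carries the clinical information.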

  • Journal article
    Sethi SS, Ewers R, Jones N, Orme CDL, Picinali L, 2018,

    Robust, real-time and autonomous monitoring of ecosystems with an open, low-cost, networked device

    , Methods in Ecology and Evolution, Vol: 9, Pages: 2383-2387, ISSN: 2041-210X

    1. Automated methods of monitoring ecosystems provide a cost-effective way to track changes in natural systems' dynamics across temporal and spatial scales. However, methods of recording and storing data captured from the field still require significant manual effort. 2. Here we introduce an open-source, inexpensive, fully autonomous ecosystem monitoring unit for capturing and remotely transmitting continuous data streams from field sites over long time periods. We provide a modular software framework for deploying various sensors, together with implementations to demonstrate proof of concept for continuous audio monitoring and time-lapse photography. 3. We show how our system can outperform comparable technologies for a fraction of the cost, provided a local mobile network link is available. The system is robust to unreliable network signals and has been shown to function in extreme environmental conditions, such as in the tropical rainforests of Sabah, Borneo. 4. We provide full details on how to assemble the hardware, and the open-source software. Paired with appropriate automated analysis techniques, this system could provide spatially dense, near real-time, continuous insights into ecosystem and biodiversity dynamics at a low cost.

  • Conference paper
    Dawson PJ, De Sena E, Naylor PA, 2018,

    An acoustic image-source characterisation of surface profiles

    , 2018 26th European Signal Processing Conference (EUSIPCO), Publisher: IEEE, Pages: 2130-2134, ISSN: 2076-1465

    The image-source method models the specular reflection from a plane by means of a secondary source positioned at the source's reflected image. The method has been widely used in acoustics to model the reverberant field of rectangular rooms, but can also be used for general-shaped rooms and non-flat reflectors. This paper explores the relationship between the physical properties of a non-flat reflector and the statistical properties of the associated cloud of image-sources. It is shown here that the standard deviation of the image-sources is strongly correlated with the ratio between depth and width of the reflector's spatial features.
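
The basic operation behind the image-source method is mirroring the source position across the reflecting plane; for a non-flat reflector, each surface facet contributes its own image, and the paper relates the spread of the resulting image cloud to the depth-to-width ratio of the surface features. A minimal sketch of the mirroring step (illustrative helper, NumPy assumed):

```python
import numpy as np

def image_source(src, plane_point, plane_normal):
    """Mirror a source position across a reflecting plane to obtain its
    image source -- the elementary step of the image-source method."""
    n = np.asarray(plane_normal, dtype=float)
    n = n / np.linalg.norm(n)
    # Signed distance from the source to the plane along the normal.
    d = np.dot(np.asarray(src, dtype=float) - plane_point, n)
    return src - 2.0 * d * n

# Source 1.5 m above the floor (the z = 0 plane): its specular
# reflection behaves like a secondary source 1.5 m below the floor.
img = image_source(np.array([2.0, 3.0, 1.5]),
                   np.array([0.0, 0.0, 0.0]),
                   np.array([0.0, 0.0, 1.0]))
print(img)  # image at z = -1.5
```

Repeating this for many local facet planes of a rough reflector yields the cloud of image sources whose standard deviation the paper analyses.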

  • Conference paper
    Moore AH, Lightburn L, Xue W, Naylor P, Brookes D et al., 2018,

    Binaural mask-informed speech enhancement for hearing aids with head tracking

    , International Workshop on Acoustic Signal Enhancement (IWAENC 2018), Publisher: IEEE

    An end-to-end speech enhancement system for hearing aids is proposed which seeks to improve the intelligibility of binaural speech in noise during head movement. The system uses a reference beamformer whose look direction is informed by knowledge of the head orientation and the a priori known direction of the desired source. From this a time-frequency mask is estimated using a deep neural network. The binaural signals are obtained using bilateral beamformers followed by a classical minimum mean square error speech enhancer, modified to use the estimated mask as a speech presence probability prior. In simulated experiments, the improvement in a binaural intelligibility metric (DBSTOI) given by the proposed system relative to beamforming alone corresponds to an SNR improvement of 4 to 6 dB. Results also demonstrate the individual contributions of incorporating the mask and the head orientation-aware beam steering to the proposed system.
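
The role of the mask as a speech presence probability prior can be illustrated with a simplified gain rule. This is a stand-in for the modified MMSE enhancer in the paper, not its actual formulation; the gain floor and the crude a priori SNR estimate are assumptions:

```python
import numpy as np

def masked_gain(noisy_power, noise_power, spp, g_min=0.1):
    """Wiener-style spectral gain softened by a speech presence
    probability (SPP) mask, per time-frequency bin. Illustrative
    stand-in only: the DNN-estimated mask acts as the prior that
    decides how much of the enhancer's gain to apply; g_min is an
    assumed floor applied when speech is judged absent."""
    snr = np.maximum(noisy_power / noise_power - 1.0, 1e-3)  # crude a priori SNR
    wiener = snr / (1.0 + snr)
    return spp * wiener + (1.0 - spp) * g_min

print(masked_gain(10.0, 1.0, 1.0))  # speech-dominated bin → 0.9
print(masked_gain(10.0, 1.0, 0.0))  # mask says no speech → 0.1
```

With the mask at 1 the bin passes through the Wiener gain; with the mask at 0 the bin is attenuated to the floor, which is how the mask steers the enhancer.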

  • Conference paper
    Kim C, Steadman M, Lestang JH, Goodman DFM, Picinali L et al., 2018,

    A VR-based mobile platform for training to non-individualized binaural 3D audio

    , 144th Audio Engineering Society Convention 2018

    © 2018 Audio Engineering Society. All Rights Reserved. Delivery of immersive 3D audio with arbitrarily-positioned sound sources over headphones often requires processing of individual source signals through a set of Head-Related Transfer Functions (HRTFs). The individual morphological differences and the impracticality of HRTF measurement make it difficult to deliver completely individualized 3D audio, and instead lead to the use of previously-measured non-individual sets of HRTFs. In this study, a VR-based mobile sound localization training prototype system is introduced which uses HRTF sets for audio. It consists of a mobile phone as a head-mounted device, a hand-held Bluetooth controller, and a network-enabled laptop with a USB audio interface and a pair of headphones. The virtual environment was developed on the mobile phone such that the user can listen to, and navigate in, an acoustically neutral scene and locate invisible target sound sources presented at random directions using non-individualized HRTFs in repetitive sessions. Various training paradigms can be designed with this system, with performance-related feedback provided according to the user’s localization accuracy, including visual indication of the target location, and some aspects of a typical first-person shooting game, such as enemies, scoring, and level advancement. An experiment was conducted using this system, in which 11 subjects went through multiple training sessions, using non-individualized HRTF sets. The localization performance evaluations showed reduction of overall localization angle error over repeated training sessions, reflecting lower front-back confusion rates.

  • Conference paper
    Forte AE, Etard OE, Reichenbach JDT, 2018,

    Selective Auditory Attention At The Brainstem Level

    , ARO 2018
  • Conference paper
    Saiz Alia M, Askari A, Forte AE, Reichenbach JDT et al., 2018,

    A model of the human auditory brainstem response to running speech

    , ARO 2018
  • Conference paper
    Kegler M, Etard OE, Forte AE, Reichenbach JDT et al., 2018,

    Complex Statistical Model for Detecting the Auditory Brainstem Response to Natural Speech and for Decoding Attention from High-Density EEG Recordings

    , ARO 2018
  • Conference paper
    Lim V, Frangakis N, Molina Tanco L, Picinali L et al., 2018,

    PLUGGY: A Pluggable Social Platform for Cultural Heritage Awareness and Participation

    , International Workshop on Analysis in Digital Cultural Heritage 2017
  • Journal article
    Etard OE, Kegler M, Braiman C, Forte AE, Reichenbach JDT et al., 2018,

    Real-time decoding of selective attention from the human auditory brainstem response to continuous speech

    , BioRxiv
  • Journal article
    Reichenbach JDT, Ciganovic N, Warren R, Keceli B, Jacon S, Fridberger A et al., 2018,

    Static length changes of cochlear outer hair cells can tune low-frequency hearing

    , PLoS Computational Biology, Vol: 14, ISSN: 1553-734X

    The cochlea not only transduces sound-induced vibration into neural spikes, it also amplifies weak sound to boost its detection. Actuators of this active process are sensory outer hair cells in the organ of Corti, whereas the inner hair cells transduce the resulting motion into electric signals that propagate via the auditory nerve to the brain. However, how the outer hair cells modulate the stimulus to the inner hair cells remains unclear. Here, we combine theoretical modeling and experimental measurements near the cochlear apex to study the way in which length changes of the outer hair cells deform the organ of Corti. We develop a geometry-based kinematic model of the apical organ of Corti that reproduces salient, yet counter-intuitive features of the organ’s motion. Our analysis further uncovers a mechanism by which a static length change of the outer hair cells can sensitively tune the signal transmitted to the sensory inner hair cells. When the outer hair cells are in an elongated state, stimulation of inner hair cells is largely inhibited, whereas outer hair cell contraction leads to a substantial enhancement of sound-evoked motion near the hair bundles. This novel mechanism for regulating the sensitivity of the hearing organ applies to the low frequencies that are most important for the perception of speech and music. We suggest that the proposed mechanism might underlie frequency discrimination at low auditory frequencies, as well as our ability to selectively attend auditory signals in noisy surroundings.

  • Journal article
    Dietz M, Lestang J-H, Majdak P, Stern RM, Marquardt T, Ewert SD, Hartmann WM, Goodman DFM et al., 2017,

    A framework for testing and comparing binaural models

    , Hearing Research, Vol: 360, Pages: 92-106, ISSN: 0378-5955

    Auditory research has a rich history of combining experimental evidence with computational simulations of auditory processing in order to deepen our theoretical understanding of how sound is processed in the ears and in the brain. Despite significant progress in the amount of detail and breadth covered by auditory models, for many components of the auditory pathway there are still different model approaches that are often not equivalent but rather in conflict with each other. Similarly, some experimental studies yield conflicting results which has led to controversies. This can be best resolved by a systematic comparison of multiple experimental data sets and model approaches. Binaural processing is a prominent example of how the development of quantitative theories can advance our understanding of the phenomena, but there remain several unresolved questions for which competing model approaches exist. This article discusses a number of current unresolved or disputed issues in binaural modelling, as well as some of the significant challenges in comparing binaural models with each other and with the experimental data. We introduce an auditory model framework, which we believe can become a useful infrastructure for resolving some of the current controversies. It operates models over the same paradigms that are used experimentally. The core of the proposed framework is an interface that connects three components irrespective of their underlying programming language: The experiment software, an auditory pathway model, and task-dependent decision stages called artificial observers that provide the same output format as the test subject.
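
The proposed interface can be sketched as three pieces exchanging a common format: the experiment presents stimuli, the model returns an internal representation, and an artificial observer maps that representation to the same response a human subject would give. Everything below (the class names, the toy ITD model, the lateralization task) is illustrative, not the framework's actual API:

```python
import numpy as np

class ToyBinauralModel:
    """Stands in for an auditory pathway model: estimates the
    interaural time difference (ITD) by cross-correlation."""
    def process(self, left, right, fs):
        xc = np.correlate(left, right, mode="full")
        lag = np.argmax(xc) - (len(right) - 1)
        return {"itd_s": lag / fs}

class LeftRightObserver:
    """Task-dependent decision stage: outputs the same response
    format as a test subject in a lateralization task."""
    def decide(self, representation):
        return "left" if representation["itd_s"] < 0 else "right"

def run_trial(model, observer, left, right, fs):
    # The experiment software's role: present a stimulus, collect a response.
    return observer.decide(model.process(left, right, fs))

fs = 44100
t = np.arange(0, 0.01, 1 / fs)
sig = np.sin(2 * np.pi * 500 * t)
left = np.concatenate([np.zeros(5), sig])    # left-ear signal lags: source on the right
right = np.concatenate([sig, np.zeros(5)])
print(run_trial(ToyBinauralModel(), LeftRightObserver(), left, right, fs))
```

Because model and observer only communicate through the representation dictionary and the response string, either component could be swapped for a competing implementation without touching the experiment code, which is the point of the proposed framework.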

This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.
