Analysis of Very Low Quality Speech for Mask-Based Intelligibility Enhancement

Researchers: Leo Lightburn, Sira Gonzalez and Mike Brookes

Anyone trying to follow a speech signal in a real environment must contend with many sources of interference. In addition, the channel and the electronic devices used to record and transmit the sound may introduce distortion. Automatic speech enhancement aims to improve the intelligibility and/or the quality of corrupted speech signals and has many applications in speech processing. Many approaches have been developed to date, but no current technique substantially improves speech intelligibility. In this project, our goal is to improve the intelligibility of very low quality speech by means of a time-frequency binary mask. We focus on identifying and preserving the speech energy, taking into account that its distribution differs for each kind of phoneme. We aim to extract features that identify the phoneme types and to apply temporal constraints on their length and sequence.
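To make the idea of a time-frequency binary mask concrete, the following is a minimal Python sketch and not the method developed in this project. It builds an "oracle" mask that keeps each time-frequency cell only when the local speech-to-noise ratio exceeds a threshold, which requires separate access to the clean speech and the noise. In practice the mask has to be estimated from the noisy signal alone. The function names, frame length, hop size and 0 dB threshold are all assumptions chosen for illustration.

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Short-time Fourier transform with a Hann window (frames x bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def oracle_binary_mask(clean, noise, threshold_db=0.0):
    """True where the local SNR exceeds threshold_db (illustrative only:
    it needs the clean speech and the noise as separate signals)."""
    eps = 1e-12  # avoids division by zero and log of zero
    speech_power = np.abs(stft(clean)) ** 2
    noise_power = np.abs(stft(noise)) ** 2
    local_snr_db = 10.0 * np.log10((speech_power + eps) / (noise_power + eps))
    return local_snr_db > threshold_db

def apply_mask(noisy, mask):
    """Zero every time-frequency cell of the noisy signal that the mask rejects."""
    return stft(noisy) * mask
```

The masked spectrogram would then be converted back to a waveform by overlap-add resynthesis, which is omitted here to keep the sketch short.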

We began feature extraction with PEFAC (Pitch Estimation Filter with Amplitude Compression), a frequency-domain fundamental frequency estimator that identifies the pitch of voiced frames reliably even at negative signal-to-noise ratios and is robust to high levels of additive noise. The pitch (fundamental frequency) of each frame is estimated by convolving its power spectral density in the log-frequency domain with a filter that sums the energy of the pitch harmonics while rejecting additive noise that has a smoothly varying power spectrum. Amplitude compression is applied before filtering to attenuate narrowband noise components. The current version is available in VOICEBOX.
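The log-frequency filtering step can be illustrated with a simplified single-frame Python sketch. This is not the VOICEBOX implementation. The filter shape, the constants (gamma, ten harmonics, the 20 Hz to 5 kHz log-frequency grid, the 60 to 400 Hz search range) and the fixed power-law compression are assumptions made for illustration and may differ from the published algorithm. The sketch also omits voicing decisions and temporal tracking.

```python
import numpy as np

def harmonic_filter(dq, n_harmonics=10, gamma=1.8):
    """Zero-mean filter on a log-frequency axis with peaks at log(k), k = 1..K.
    Shape and constants are illustrative assumptions."""
    j_lo = int(np.floor(np.log(0.5) / dq))
    j_hi = int(np.ceil(np.log(n_harmonics + 0.5) / dq))
    offsets = dq * np.arange(j_lo, j_hi + 1)   # log-frequency offsets from log(f0)
    h = 1.0 / (gamma - np.cos(2.0 * np.pi * np.exp(offsets)))
    return h - h.mean(), j_lo                  # zero mean rejects a flat noise floor

def estimate_pitch(frame, fs, f_min=60.0, f_max=400.0, dq=0.005, compression=0.5):
    """Return the pitch candidate (Hz) with the highest harmonic-sum score.
    Assumes fs >= 10 kHz so that the 5 kHz grid lies below Nyquist."""
    n_fft = 4 * len(frame)
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)

    # Resample the power spectrum onto a uniform log-frequency grid.
    log_f = np.arange(np.log(20.0), np.log(5000.0), dq)
    spectrum = np.interp(np.exp(log_f), freqs, power)

    # Crude stand-in for the amplitude compression step (an assumption).
    spectrum = spectrum ** compression

    # Correlating with the filter sums the energy at f0, 2*f0, ..., K*f0.
    h, j_lo = harmonic_filter(dq)
    scores = np.correlate(spectrum, h, mode="valid")
    candidates = np.exp(log_f[-j_lo : -j_lo + len(scores)])

    in_range = (candidates >= f_min) & (candidates <= f_max)
    return candidates[in_range][np.argmax(scores[in_range])]
```

On a log-frequency axis the harmonics of any pitch sit at fixed offsets log(k) from the fundamental, so a single filter serves every candidate pitch and the search reduces to one correlation.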

We have also developed an algorithm for locating sibilant phones in noisy speech. Our algorithm does not attempt to identify sibilant onsets and offsets directly; instead, it detects a sustained increase in power over the entire duration of a sibilant phone. The normalized estimate of the sibilant power forms the input to two Gaussian mixture models, trained on sibilant and non-sibilant frames respectively. The likelihood ratio of the two models is then used to classify each frame. The classification accuracy is over 80% at 0 dB signal-to-noise ratio for additive white noise.
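The likelihood-ratio classification step can be sketched in Python as follows. The mixture parameters below are made-up placeholders, not the trained models from this work, and the scalar feature is assumed to be a normalized sibilant power estimate per frame. The zero threshold on the log-likelihood ratio is likewise an assumption.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Per-frame log-likelihood of scalar features under a 1-D Gaussian mixture."""
    x = np.asarray(x, dtype=float)[:, None]            # frames x 1
    log_comp = (np.log(weights)
                - 0.5 * np.log(2.0 * np.pi * variances)
                - 0.5 * (x - means) ** 2 / variances)  # frames x components
    peak = log_comp.max(axis=1, keepdims=True)         # log-sum-exp for stability
    return (peak + np.log(np.exp(log_comp - peak).sum(axis=1, keepdims=True)))[:, 0]

# Hypothetical model parameters, for illustration only.
SIBILANT = dict(weights=np.array([0.6, 0.4]),
                means=np.array([0.7, 0.9]),
                variances=np.array([0.02, 0.01]))
NON_SIBILANT = dict(weights=np.array([0.7, 0.3]),
                    means=np.array([0.1, 0.3]),
                    variances=np.array([0.01, 0.03]))

def classify_frames(feature, threshold=0.0):
    """Label a frame as sibilant when the log-likelihood ratio exceeds threshold."""
    llr = (gmm_log_likelihood(feature, **SIBILANT)
           - gmm_log_likelihood(feature, **NON_SIBILANT))
    return llr > threshold
```

Raising or lowering the threshold trades missed sibilants against false alarms, so in practice it would be tuned on held-out data rather than fixed at zero.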

Relevant publications:

  1. L. Lightburn, M. Brookes: SOBM - A Binary Mask for Noisy Speech that Optimises an Objective Intelligibility Metric. In: Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, 2015.
  2. S. Gonzalez, M. Brookes: Speech active level estimation in noisy conditions. In: Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada, 2013.
  3. S. Gonzalez, M. Brookes: Sibilant Speech Detection in Noise. In: Proc. Interspeech Conf., Portland, USA, 2012.
  4. S. Gonzalez, M. Brookes: A Pitch Estimation Filter Robust to High Levels of Noise (PEFAC). In: Proc. European Signal Processing Conf. (EUSIPCO), pp. 451-455, Barcelona, Spain, 2011.