Blind Source Separation based on Multiple Decorrelations (*)
Lucas Parra, Clay Spence
Abstract
Acoustic signals recorded simultaneously in a reverberant environment can
be described as sums of differently convolved sources. The task of source
separation is to identify the multiple channels and possibly to invert
them in order to obtain estimates of the underlying sources. We tackle
the problem by explicitly exploiting the non-stationarity of the acoustic
sources. Changing cross-correlations at multiple times give a sufficient
set of constraints for the unknown channels. A least-squares optimization
allows us to estimate a forward model, thus identifying the multipath channel.
In the same manner we can find an FIR backward model, which generates
well-separated model sources.
The applications of this technique reach from advanced digital
hearing aids to improved front ends for speech recognition engines.
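The idea in the abstract can be illustrated with a minimal sketch. For readability this toy uses an instantaneous (non-convolutive) two-channel mixture; the paper applies the same least-squares multiple-decorrelation idea per frequency bin to handle convolution. The mixing matrix, block sizes, and learning rate below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two non-stationary sources: white noise whose power changes from block
# to block. This non-stationarity is what makes the channel identifiable
# from second-order statistics alone.
n_blocks, block_len = 8, 2000
s = np.vstack([
    np.concatenate([g * rng.standard_normal(block_len)
                    for g in rng.uniform(0.2, 2.0, n_blocks)])
    for _ in range(2)
])

A = np.array([[1.0, 0.6],            # hypothetical unknown mixing matrix
              [0.5, 1.0]])           # (instantaneous stand-in for the
x = A @ s                            # convolutive multipath channel)

# Cross-correlations of the mixtures at multiple times: one covariance
# matrix per block, trace-normalized for conditioning.
Rs = []
for k in range(n_blocks):
    R = np.cov(x[:, k * block_len:(k + 1) * block_len])
    Rs.append(R / np.trace(R))

def offdiag_cost(W):
    """Total off-diagonal energy of W R_k W^T over all blocks."""
    return sum(np.sum((C - np.diag(np.diag(C))) ** 2)
               for C in (W @ R @ W.T for R in Rs))

# Least-squares estimate of the backward (unmixing) model W: force the
# model sources to be decorrelated in every block simultaneously.
W = np.eye(2)
lr = 0.05
for _ in range(5000):
    grad = np.zeros((2, 2))
    for R in Rs:
        C = W @ R @ W.T
        E = C - np.diag(np.diag(C))          # off-diagonal residual
        grad += 4.0 * E @ W @ R              # d/dW of ||E||_F^2
    W -= lr * grad / len(Rs)
    W /= np.linalg.norm(W, axis=1, keepdims=True)  # fix scale ambiguity

# If separation succeeded, W @ A is close to a scaled permutation matrix.
P = W @ A
```

Decorrelating at a single time would leave a rotational ambiguity; it is the joint decorrelation across several blocks with different source powers that pins down the unmixing matrix up to scaling and permutation.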
Results
Two speakers were recorded in a low-noise environment (12 dB): 15 seconds
of alternating speech followed by 15 seconds of continuous speech. The
cross-talk, measured as the Signal-to-Signal Ratio (SSR) over the first
15 seconds, improves from 0 dB before separation (channel 1, channel 2)
to 14 dB after separation (channel 1, channel 2).
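The SSR numbers above can be sketched as follows. This assumes (my reading, not stated explicitly on this page) that the alternating-speech segments are used to compare a channel's power while the target speaker talks alone against its power while only the interferer talks. The signals below are synthetic stand-ins, not the recordings on this page.

```python
import numpy as np

def ssr_db(channel, target_active, interferer_active):
    """Signal-to-Signal Ratio of one output channel in dB: power during
    target-only segments over power during interferer-only segments."""
    p_sig = np.mean(channel[target_active] ** 2)
    p_int = np.mean(channel[interferer_active] ** 2)
    return 10.0 * np.log10(p_sig / p_int)

# Toy alternating-speech scenario: speaker A talks in even seconds,
# speaker B in odd seconds (hypothetical 8 kHz sampling rate).
fs = 8000
t = np.arange(5 * fs)
mask_a = (t // fs) % 2 == 0
mask_b = ~mask_a
rng = np.random.default_rng(1)
a = rng.standard_normal(t.size) * mask_a
b = rng.standard_normal(t.size) * mask_b

mixed = a + b            # equal-power cross-talk: SSR ~ 0 dB
separated = a + 0.2 * b  # residual interference: SSR ~ 14 dB
```

With a residual interference amplitude of 0.2, the power ratio is 1/0.04, i.e. 10·log10(25) ≈ 14 dB, matching the kind of improvement reported above.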
To compare with the current state of the art, listen to the separation
obtained from two speakers recorded in a real room (channel 1, channel 2).
This is our result (channel 1, channel 2), and this is the result provided
by Te-Won Lee (channel 1, channel 2).
In the case of a main source embedded in a background of multiple
sources (channel 1, channel 2), the assumption of more microphones than
sources is violated. In that case the algorithm separates the speaker
from the background, providing a good background estimate
(background - channel 1, speaker - channel 2).
This is an example of a strongly reverberating environment (channel 1,
channel 2). The interfering source (a TV set) has little direct signal
to the two microphones and instead reflects off a wall of the room. This
result was obtained with a filter of size 512 (background - channel 2,
speaker - channel 1).
(*) US patent 6,167,417; IEEE Trans. on Speech
and Audio Processing pp. 320-327, May 2000.