RedHenSpeakRecog: Week 10: Better results with GMM-UBM

After training an i-vector extractor, I tested it on the test dataset. To my surprise, the process was extremely slow: a single sample could take more than 5 seconds. In addition, its performance on a subset of the data was not as good as that of the GMM-based system. This result does not agree with the common belief that i-vectors should outperform GMM methods, so at first I suspected implementation errors in either the bob.kaldi package or my training script, and spent quite some time debugging the code. Although bob.kaldi could be made much more efficient, I could not find any functional errors, as the underlying computation is basically done by Kaldi.

I was puzzled for a long time until I found that many others have reported that for short audio clips (less than 5 seconds), GMM-UBM can perform better than i-vectors. For example, in [1] the authors made a systematic comparison of the two systems, including the effect of duration variability, and stated in the abstract:

"We also observe that if the speakers are enrolled with sufficient amount of training data, GMM-UBM system outperforms i-vector system for very short test utterances."

Later, in the introduction, they wrote:

"Our experimental results reveal that though TV(i-vector) system is performing better than GMM-UBM in many conditions, the classical approach is still better than the state-of-the-art technique for condition very similar to practical requirements i.e. when speakers are enrolled with sufficient amount of speech data and tested with short segments."

Moreover, earlier literature also provides evidence that GMM-UBM systems perform well for short test segments [2][3].

This explains the surprising results I saw, because the test data we are dealing with are all audio clips of single sentences extracted from CNN news, which typically last less than 5 seconds. For this reason, I decided to use the GMM-UBM system rather than the i-vector system.

To further improve the GMM-UBM system, I spent a considerable amount of time tuning the parameters and made the following changes:

The MFCC dimension is now extended to 19.
Delta and double-delta coefficients are appended to form higher-dimensional feature vectors.
More GMM components (64) are used to model each speaker.
The UBM is used to check the confidence of the classification results.
Instead of a simple sum of GMM posteriors, a weighted sum is used as the score.
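
To make the changes above more concrete, here is a minimal pure-Python sketch of two of the ideas: appending delta and double-delta coefficients (so 19 MFCCs would become 57-dimensional vectors, assuming no extra energy term), and using the UBM as a confidence check through a log-likelihood ratio. This is not the actual project code. The function names, the window width of 2, the `(weight, mean, variance)` model layout, and the threshold are all assumptions made for illustration, and because the weights of the weighted sum are not given here, the sketch uses a plain per-frame average instead.

```python
import math

def deltas(frames, width=2):
    """Regression-based deltas over a +/- `width` frame window (edges repeated)."""
    n, dim = len(frames), len(frames[0])
    denom = 2 * sum(k * k for k in range(1, width + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(dim):
            num = 0.0
            for k in range(1, width + 1):
                nxt = frames[min(t + k, n - 1)][d]
                prv = frames[max(t - k, 0)][d]
                num += k * (nxt - prv)
            row.append(num / denom)
        out.append(row)
    return out

def add_deltas(frames, width=2):
    """Append delta and double-delta coefficients to every frame."""
    d1 = deltas(frames, width)
    d2 = deltas(d1, width)
    return [f + a + b for f, a, b in zip(frames, d1, d2)]

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def gmm_loglik(x, gmm):
    """Log-likelihood of one frame under a GMM given as [(weight, mean, var), ...]."""
    logs = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    mx = max(logs)  # log-sum-exp for numerical stability
    return mx + math.log(sum(math.exp(l - mx) for l in logs))

def classify(frames, speaker_gmms, ubm, threshold=0.0):
    """Pick the best speaker, or None if no speaker beats the UBM by `threshold`.

    Scores are average per-frame log-likelihood ratios against the UBM
    (an assumed stand-in for the weighted score mentioned above).
    """
    n = len(frames)
    ubm_ll = sum(gmm_loglik(x, ubm) for x in frames) / n
    scores = {
        name: sum(gmm_loglik(x, g) for x in frames) / n - ubm_ll
        for name, g in speaker_gmms.items()
    }
    best = max(scores, key=scores.get)
    return (best if scores[best] > threshold else None), scores
```

In a real system the speaker models would have 64 components each and the frames would come from an MFCC front end; a clip whose best score does not exceed the threshold would be reported as low-confidence rather than assigned to a speaker.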

These improvements turned out to be effective. The test result given by the upgraded system is now:

"2324 out of 2656 clips are correctly recognized"

which is an accuracy of 87.5%.
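
As a quick sanity check of that figure, the arithmetic can be reproduced in a couple of lines (the variable names are just for illustration):

```python
correct, total = 2324, 2656      # counts from the test result quoted above
accuracy = correct / total
errors = total - correct         # clips that were misrecognized
print(f"accuracy = {accuracy:.1%}, errors = {errors}")
```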

Wednesday, August 16, 2017