"We also observe that if the speakers are enrolled with sufficient amount of training data, GMM-UBM system outperforms i-vector system for very short test utterances."
Later in the introduction, they wrote:
"Our experimental results reveal that though TV(i-vector) system is performing better than GMM-UBM in many conditions, the classical approach is still better than the state-of-the-art technique for condition very similar to practical requirements i.e. when speakers are enrolled with sufficient amount of speech data and tested with short segments."
Moreover, earlier literature also provides evidence that GMM-UBM systems perform well on short test segments [2][3].
These findings explain the surprising results I saw, because the test data we are dealing with consists entirely of single-sentence audio clips extracted from CNN news, which are typically shorter than 5 seconds. For this reason, I chose the GMM-UBM system over the i-vector system.
To further improve the GMM-UBM system, I spent a considerable amount of time tuning its parameters and made the following changes:
- The MFCC dimension is now extended to 19.
- Delta and double delta coefficients are appended to form higher dimensional feature vectors.
- More GMM components (64) are used to model a speaker.
- The UBM is used to check the confidence of the classification results.
- Instead of a simple sum of GMM posteriors, a weighted sum is used as the score.
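
To make two of these changes concrete, here is a minimal pure-Python sketch, not the actual code of the system. It shows how delta and double-delta coefficients can be appended to 19-dimensional MFCC frames (giving 57 dimensions), and how a UBM can act as a confidence check through a log-likelihood ratio. Everything specific in it is an assumption for illustration: the regression-based delta with a window of 2, the diagonal-covariance GMMs, the function names `delta`, `add_deltas`, `gmm_avg_loglik`, and `identify`, and the rejection threshold of 0. The weighted-sum scoring is not reproduced because the weights are not given here.

```python
import math

def delta(frames, width=2):
    """Regression-based delta over time; edge frames are repeated (assumed scheme)."""
    n_frames, dim = len(frames), len(frames[0])
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = []
    for t in range(n_frames):
        row = []
        for d in range(dim):
            acc = 0.0
            for n in range(1, width + 1):
                ahead = frames[min(t + n, n_frames - 1)][d]
                behind = frames[max(t - n, 0)][d]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out

def add_deltas(mfcc):
    """Concatenate static, delta and double-delta coefficients per frame."""
    d1 = delta(mfcc)
    d2 = delta(d1)
    return [c + a + b for c, a, b in zip(mfcc, d1, d2)]

def gmm_avg_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM."""
    total = 0.0
    for x in frames:
        comp = []
        for w, mu, var in zip(weights, means, variances):
            ll = math.log(w)
            for xi, mi, vi in zip(x, mu, var):
                ll += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
            comp.append(ll)
        m = max(comp)  # log-sum-exp for numerical stability
        total += m + math.log(sum(math.exp(c - m) for c in comp))
    return total / len(frames)

def identify(frames, speaker_models, ubm, threshold=0.0):
    """Pick the best speaker; return None if the UBM explains the clip better.

    Each model is a (weights, means, variances) tuple. The threshold is a
    hypothetical tuning parameter, not a value taken from the post.
    """
    ubm_ll = gmm_avg_loglik(frames, *ubm)
    scores = {name: gmm_avg_loglik(frames, *gmm) - ubm_ll
              for name, gmm in speaker_models.items()}
    best = max(scores, key=scores.get)
    return (best if scores[best] > threshold else None), scores[best]
```

In this sketch a clip is assigned to the speaker whose GMM gives the highest log-likelihood ratio against the UBM, and the result is rejected as low-confidence when even the best ratio does not exceed the threshold. A real system would use 64 components per speaker, as in the list above, rather than the tiny models one would use to try this out.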
These improvements turned out to be effective. The test result given by the upgraded system is now
"2324 out of 2656 clips are correctly recognized"
which is an accuracy of 87.5%.