Thursday, June 29, 2017

Week 4: Detecting Outliers in Training Data

In last week's post, I mentioned that the training dataset collected for each speaker can contain mislabeled audio clips. Such bad data degrades the quality of any model trained on it and ultimately hurts performance on speaker recognition tasks. I therefore designed an outlier detection method tailored to audio data that automatically filters out these mislabeled clips.

This method is based on Gaussian Mixture Models (GMMs). For every speaker, we first train a GMM on all of that speaker's audio clips in the training dataset using the expectation-maximization (EM) algorithm. Then, for each audio clip, we estimate its generative probability under the trained model. Clips that fit the distribution of the true speaker will have a high generative probability, while anomalies will score very low. We can therefore set a threshold and filter out every clip whose fit score falls below it. For details on outlier detection by probabilistic mixture modeling, please refer to Chapter 2.4 of the book Outlier Analysis by Charu Aggarwal, one of the most cited researchers in the field of outlier and anomaly detection.
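The fit-and-score loop can be sketched as follows. This is a minimal illustration, not the code of detect_outlier.py: it uses scikit-learn's GaussianMixture (which runs EM internally) on synthetic stand-in features, whereas the real pipeline would extract acoustic features (e.g. MFCCs) from each clip first. All variable names and the choice of 4 mixture components are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy stand-in for per-frame feature vectors (e.g. 13-dim MFCCs):
# 11 "genuine" clips drawn from one distribution, plus 1 mismatched clip.
genuine_clips = [rng.normal(0.0, 1.0, size=(200, 13)) for _ in range(11)]
outlier_clip = rng.normal(5.0, 1.0, size=(200, 13))
clips = genuine_clips + [outlier_clip]

# Train one GMM per speaker on the pooled frames of all their clips (EM).
gmm = GaussianMixture(n_components=4, covariance_type="diag", random_state=0)
gmm.fit(np.vstack(clips))

# Score each clip by its average per-frame log likelihood under the model.
llhd = [gmm.score(clip) for clip in clips]

# The mismatched clip scores far lower than the genuine ones.
assert np.argmin(llhd) == len(clips) - 1
```

Scoring by the average per-frame log likelihood (rather than the sum) keeps scores comparable across clips of different durations.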

I tested this method on the data collected last week from the sample video. The following list shows the results for Jim Clancy's audio clips; the generative probability of each clip is given in the attribute "llhd", which stands for log likelihood.
[
  {
    "duration": "0:00:01.650",
    "llhd": -20.387030349382481,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy0.wav",
    "start": "0:00:15.240"
  },
  {
    "duration": "0:00:08.000",
    "llhd": -18.196139725170504,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy1.wav",
    "start": "0:00:32.960"
  },
  {
    "duration": "0:00:00.910",
    "llhd": -17.888030707481747,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy2.wav",
    "start": "0:00:47.460"
  },
  {
    "duration": "0:00:05.940",
    "llhd": -18.082631203617577,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy3.wav",
    "start": "0:01:25.690"
  },
  {
    "duration": "0:00:03.960",
    "llhd": -18.352468451630649,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy4.wav",
    "start": "0:01:39.290"
  },
  {
    "duration": "0:00:06.260",
    "llhd": -17.712504094912944,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy5.wav",
    "start": "0:02:14.740"
  },
  {
    "duration": "0:00:05.140",
    "llhd": -19.767810192124848,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy6.wav",
    "start": "0:04:00.360"
  },
  {
    "duration": "0:00:01.520",
    "llhd": -17.829715306752892,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy7.wav",
    "start": "0:05:40.970"
  },
  {
    "duration": "0:00:06.320",
    "llhd": -18.639242622299303,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy8.wav",
    "start": "0:09:54.240"
  },
  {
    "duration": "0:00:07.320",
    "llhd": -17.714960218582345,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy9.wav",
    "start": "0:10:08.030"
  },
  {
    "duration": "0:00:04.290",
    "llhd": -17.906594198834085,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy10.wav",
    "start": "0:12:34.670"
  },
  {
    "duration": "0:00:03.420",
    "llhd": -17.690216443470053,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy11.wav",
    "start": "0:15:11.680"
  }
]

Note that this method successfully identified the mislabeled clip (number 6) mentioned in my last article, which received the second-lowest log likelihood (-19.767810192124848). Although the speech in the first clip really is from Jim Clancy, it is mixed with loud background music, so it is not surprising that the algorithm flagged it as an outlier as well. Running the method on the datasets of other speakers confirmed that it can filter out not only clips from the wrong speaker but also noisy, low-quality ones. One may argue that training with noisy data can increase the robustness of recognition models; this, however, can be compensated for later by introducing artificial noise or by extracting the speech signal with blind source separation techniques.
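Given a list of scored clips like the one above, the filtering step itself is a simple threshold check. The sketch below uses a hypothetical rule (median score minus a fixed margin) purely for illustration; the actual threshold used by detect_outlier.py may differ. The clip names and scores are abbreviated from the listing above.

```python
import statistics

# Abbreviated scores from the Jim Clancy listing above.
clips = [
    {"name": "clip0.wav", "llhd": -20.387},
    {"name": "clip1.wav", "llhd": -18.196},
    {"name": "clip6.wav", "llhd": -19.768},
    {"name": "clip11.wav", "llhd": -17.690},
]

# Illustrative threshold: median log likelihood minus a fixed margin.
threshold = statistics.median(c["llhd"] for c in clips) - 0.5

kept = [c for c in clips if c["llhd"] >= threshold]
removed = [c["name"] for c in clips if c["llhd"] < threshold]
# removed contains the two low-scoring clips: clip0.wav and clip6.wav
```

A relative threshold like this adapts to each speaker's score range, which matters because the absolute log likelihoods vary from speaker to speaker.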

The script for this process is detect_outlier.py; here is a usage example:


python detect_outlier.py -i ./output/audio/ -s Jim_Clancy

The argument after -i is the audio directory, where the training audio clips for every speaker are placed, and -s specifies the name of the speaker, which should also be the name of a subdirectory inside the audio directory.
