Thursday, June 29, 2017

Week 4: Detecting Outliers in Training Data

In last week's post, I mentioned that the training dataset collected for each speaker can contain mislabeled audio clips. Such wrong data will degrade the quality of the model trained on them and eventually lead to poor performance on speaker recognition tasks. Hence I designed an outlier detection method tailored to audio data that automatically filters out these mislabeled clips.

This method is based on Gaussian Mixture Models (GMMs). For every speaker, we first train a GMM on all audio clips collected in the training dataset using the expectation-maximization (EM) algorithm. Then, for each audio clip, we estimate its generative probability under the trained model. Audio clips that fit the distribution of the true speaker will have a high generative probability, while anomalies will have very low fit probabilities. Hence we can set a threshold and filter out every clip whose fit score falls below it. For details of outlier detection by probabilistic mixture modeling, please refer to Chapter 2.4 of the book Outlier Analysis by Charu Aggarwal, one of the most cited researchers in the field of outlier and anomaly detection.
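As an illustration, the core of this procedure can be sketched with scikit-learn's GaussianMixture (this is only a sketch, not the actual detect_outlier.py: the feature extraction step, the number of mixture components and the threshold value are assumptions made for the example):

import numpy as np
from sklearn.mixture import GaussianMixture

def score_clips(clip_features, n_components=16, threshold=-19.0):
    # clip_features: one (n_frames, n_dims) feature matrix (e.g. MFCCs) per clip.
    # Train one GMM per speaker on the frames of all of that speaker's clips (EM).
    gmm = GaussianMixture(n_components=n_components, covariance_type='diag')
    gmm.fit(np.vstack(clip_features))

    # Score each clip by its average per-frame log likelihood under the model.
    llhds = [gmm.score(feats) for feats in clip_features]

    # Keep only the clips whose fit score lies above the chosen threshold.
    keep = [i for i, llhd in enumerate(llhds) if llhd >= threshold]
    return llhds, keep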

I tested this method on the data collected last week from the sample video. The following list shows the result for the audio clips of Jim Clancy; the generative probability of each clip is given in the attribute "llhd", which stands for log likelihood.
[
  {
    "duration": "0:00:01.650",
    "llhd": -20.387030349382481,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy0.wav",
    "start": "0:00:15.240"
  },
  {
    "duration": "0:00:08.000",
    "llhd": -18.196139725170504,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy1.wav",
    "start": "0:00:32.960"
  },
  {
    "duration": "0:00:00.910",
    "llhd": -17.888030707481747,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy2.wav",
    "start": "0:00:47.460"
  },
  {
    "duration": "0:00:05.940",
    "llhd": -18.082631203617577,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy3.wav",
    "start": "0:01:25.690"
  },
  {
    "duration": "0:00:03.960",
    "llhd": -18.352468451630649,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy4.wav",
    "start": "0:01:39.290"
  },
  {
    "duration": "0:00:06.260",
    "llhd": -17.712504094912944,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy5.wav",
    "start": "0:02:14.740"
  },
  {
    "duration": "0:00:05.140",
    "llhd": -19.767810192124848,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy6.wav",
    "start": "0:04:00.360"
  },
  {
    "duration": "0:00:01.520",
    "llhd": -17.829715306752892,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy7.wav",
    "start": "0:05:40.970"
  },
  {
    "duration": "0:00:06.320",
    "llhd": -18.639242622299303,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy8.wav",
    "start": "0:09:54.240"
  },
  {
    "duration": "0:00:07.320",
    "llhd": -17.714960218582345,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy9.wav",
    "start": "0:10:08.030"
  },
  {
    "duration": "0:00:04.290",
    "llhd": -17.906594198834085,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy10.wav",
    "start": "0:12:34.670"
  },
  {
    "duration": "0:00:03.420",
    "llhd": -17.690216443470053,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy11.wav",
    "start": "0:15:11.680"
  }
]

Note that this method successfully identified the mislabeled clip (number 6) mentioned in my last article, which has the second lowest log likelihood (-19.767810192124848). Although the speech in the first clip really is from Jim Clancy, it is mixed with loud background music, so it is not surprising that the algorithm also flagged it as an outlier. Running this method on the datasets of other speakers confirmed that it can filter out not only clips from the wrong speaker, but also noisy, low-quality ones. One may argue that training with noisy data can increase the robustness of recognition models; this, however, can be compensated for later by introducing artificial noise or by extracting the speech signals with blind source separation techniques.

The script for this process is detect_outlier.py, and a usage example is given here:


python detect_outlier.py -i ./output/audio/ -s Jim_Clancy

The argument after -i is the audio directory, where the training audio clips for every speaker are placed, and -s specifies the name of the speaker, which should also be the name of a subdirectory in the audio directory.

Wednesday, June 21, 2017

Week 3: Collect Training Data for Speaker Recognition

Now that we have the speaker turns and sentence boundaries, the next step is to extract audio clips for every speaker and collect them as training datasets for the recognition models. To do this, I wrote a Python script, spk_extract.py, which loops through the list of speakers in the transcript; whenever it encounters a new speaker, a directory for that speaker is created and the speaker's audio clips are placed inside, along with a text file containing the list of clips and their timestamps.
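The extraction loop can be illustrated roughly as follows (a simplified sketch, not the actual spk_extract.py; the structure of the speaker-turn records and the output file names are assumptions):

import os
import subprocess

def extract_clips(video_path, turns, out_dir):
    # turns: list of dicts with "speaker", "start" and "duration" (in seconds).
    counters = {}
    for turn in turns:
        spk_dir = os.path.join(out_dir, turn['speaker'])
        os.makedirs(spk_dir, exist_ok=True)  # one directory per speaker

        idx = counters.get(turn['speaker'], 0)
        counters[turn['speaker']] = idx + 1
        clip = os.path.join(spk_dir, '%s%d.wav' % (turn['speaker'], idx))

        # Cut the segment out of the video's audio track with ffmpeg.
        subprocess.call(['ffmpeg', '-i', video_path,
                         '-ss', str(turn['start']), '-t', str(turn['duration']),
                         '-ac', '1', '-ar', '16000', clip])

        # Keep a per-speaker list of clips and their timestamps.
        with open(os.path.join(spk_dir, 'clips.txt'), 'a') as f:
            f.write('%s\t%s\t%s\n' % (clip, turn['start'], turn['duration']))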

At first, I clipped every sentence between two speaker turns, which resulted in a huge number of clips in the datasets. The problem is that a lot of irrelevant speech is included, because not all audio between two speaker turns is speech from the same speaker; for example, there can be commercials between two turns. Therefore, to make the data collection more reliable, I changed the code to extract only the first sentence after a speaker turn. This significantly reduced the number of clips collected (e.g., the number of clips for the most frequent speaker in the test video, Jim Clancy, dropped from 482 to 11) and increased the quality of the datasets. However, there are still wrong clips, simply because the speakers are mislabeled in the tpt file. For instance, by randomly sampling from the clips, I found that the sentence starting at time 00:04:00 of the test news video 2006-10-02_1600_US_CNN_Your_World_Today_Rep.mp4 was labeled as spoken by the anchor Jim Clancy (see the screenshot of the aligned transcript below),
but the frames of the video clearly indicate that the speaker is Martin Geissler (see the video screenshot).

Such mislabeled training data may mislead the recognition models and result in bad performance at test time. Potential solutions to this problem have been studied in machine learning, for example outlier detection and training with noisy labels. Later in this summer project, I will try these methods to clean up the training datasets.

The following command gives an example of running the script:


module load ffmpeg

python spk_extract.py -o ./output/ -i ~/data/ -f 2006-10-02_1600_US_CNN_Your_World_Today -s output/spk/


-o output directory
-i input directory
-s directory of speaker turn file
-f filename 

Sunday, June 18, 2017

Week 2: Speaker Turns and Sentence Boundaries

After a discussion over Skype, Peter and I decided to stay with the original Gentle output format, which consists of two components: the input transcript and a list of words. Each word in the list has the attributes "startOffset" and "endOffset", which refer to its position in the original transcript. We favor this format mainly for two reasons: 1. it already contains all the information we need in a concise and non-redundant form, which makes it a good compromise among many different applications: any specific application can further process it to fit its own needs. 2. No changes need to be made to the Gentle code, so merging issues are avoided even if Gentle upgrades its implementation in the future.
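For reference, a word entry in the Gentle output looks roughly like this (the values here are illustrative, and fields such as "phones" are omitted):

{
  "word": "welcome",
  "alignedWord": "welcome",
  "case": "success",
  "start": 15.24,
  "end": 15.69,
  "startOffset": 112,
  "endOffset": 119
}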

As a result of this decision, I wrote a Python script, align2spk.py, to convert the Gentle output to a data format suitable for the speaker recognition task, in which the speaker turns and sentence boundaries are marked. Later we will need the speaker turns to extract speech signals from the audio and collect them as speaker recognition training datasets. Sentence boundaries will be used for segmenting the audio at test time.

Speaker turns are already marked in the transcript by ">>", because the input transcript given to Gentle is the stripped tpt file, which denotes speaker turns by ">>", as introduced in the last post.

For sentence boundaries, I used the Python Natural Language Toolkit (nltk) to split the transcript component into sentences. Splitting a document into sentences is not an entirely trivial task, because in English, sentence-ending punctuation such as "." can also be used for other purposes, such as abbreviations (e.g. "U.S.", "Dr."), so the tokenizer from nltk is in fact a pattern recognizer trained on a large set of corpora (see the official documentation for details).
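A quick example of the tokenizer's behavior (assuming the punkt model has been downloaded):

import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

text = "Dr. Clancy reports from the U.S. today. The situation is developing."
print(sent_tokenize(text))
# Typically yields two sentences, despite the periods in "Dr." and "U.S.".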

Both speaker turns and sentence boundaries have the attributes "start" and "end" to indicate the time at which they occur in the audio file. The start time of a speaker turn or a sentence boundary is the end time of the successfully aligned word (with "case" = "success") before it, and its end time is the start time of the next successfully aligned word. If the immediately neighboring words are not successfully aligned (e.g. "case" = "not-found-in-audio"), then the timestamps of the closest aligned words are used instead; in this case the turn or boundary is marked with the attribute "case" = "cautious", indicating that its timestamps may not be very reliable, otherwise it gets "case" = "fine".
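This timestamp logic can be condensed into a small sketch (not the actual align2spk.py; "words" is Gentle's word list and "pos" is the character offset of the turn or boundary in the transcript):

def locate(words, pos):
    before = [w for w in words if w['endOffset'] <= pos]
    after = [w for w in words if w['startOffset'] >= pos]

    # The closest successfully aligned words on each side provide the timestamps.
    prev_ok = next((w for w in reversed(before) if w['case'] == 'success'), None)
    next_ok = next((w for w in after if w['case'] == 'success'), None)

    # Only if the immediate neighbours were themselves aligned are the timestamps reliable.
    immediate = (before and before[-1]['case'] == 'success'
                 and after and after[0]['case'] == 'success')

    return {'start': prev_ok['end'] if prev_ok else None,
            'end': next_ok['start'] if next_ok else None,
            'case': 'fine' if immediate else 'cautious'}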

An example use case for this script is given below:
python align2spk.py -o lucier.align.spk lucier.align.jsonl

Wednesday, June 7, 2017

Week 1: Combining Speaker Information and Forced Alignment

The tpt files on Cartago contain, among much other meta information, the transcript and the corresponding speakers for the news audio. The task of this week was to combine the speaker info from the tpt files with the time info given by the Gentle alignment.

Gentle expects only the spoken words in the transcript; any other information is noise to it and may degrade the quality of the alignment results. Therefore, in order to run Gentle, the first step is to strip the redundant information from the tpt file to get a clean transcript. The program for this, written by Prof. Steen, is called strip-tpt and is placed in my Cartago home directory.

To finally get "who speaks when", we then need to add the speaker info from the tpt file to the output returned by Gentle. The challenge is to find the right place in the aligned transcript to insert the corresponding speakers. The solution I came up with is to leave a special notation at the place where a speaker occurs in the tpt file during the stripping process, and to use this special notation later as a reference point for inserting the corresponding speaker info. In order NOT to let this extra content in the transcript worsen the performance of Gentle, the special notation has to be ignored during alignment but kept in the result. After a set of test runs, I found that the double chevrons ">>" used in tpt files to mark speakers would do the job.
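Schematically, the replacement looks like this (an illustration only: the real work is done by the chevron-tpt script described below, and the speaker-line pattern used here is a made-up placeholder):

import re

def chevron_line(line):
    # Replace a hypothetical speaker label such as "JIM CLANCY:" with ">>",
    # so Gentle ignores it during alignment but it survives as a reference point.
    return re.sub(r'^[A-Z .\-]+:\s*', '>> ', line)

print(chevron_line("JIM CLANCY: Welcome back."))  # ">> Welcome back."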

Therefore I modified strip-tpt so that instead of removing the speaker info from the tpt file, it replaces it with ">>". In addition, I added some commands to also strip the "voice over" info that was previously overlooked. The modified file is called chevron-tpt, also on Cartago. It has been run on all the tpt files, and the processed results (denoted by the extension .chevron.tpt) are saved in home/owen_he/netapp/chevron/.

To get the stripped results onto Case HPC, I wrote a script, get-chevron, which fetches the original tpt, chevron.tpt and video files from Cartago and places them in the folder ~/data/. Furthermore, a command was added to ~/GSoC2017/gentle.sh to extract the speaker info from the tpt file before it starts the Gentle alignment. Finally, I hacked the main script of the Gentle alignment to also write the speaker info into the alignment output.



List of files:

Cartago:~/chevron-tpt:

Replaces the speaker info with ">>" as reference points for inserting the speaker information back later. The "(voice over)" info was not stripped off in the previous version; it is now replaced with "||".


CaseHPC:~/data/get-chevron
Fetches the chevron.tpt files from Cartago.


CaseHPC:~/GSoC2017/gentle.sh
Bash script that runs the Gentle alignment to acquire speaker boundaries and extracts the speaker info to speaker.list; usage example:
./gentle.sh [filename] [txt_ext] [out_dir] [out_ext] [audio_dir] [txt_dir]

CaseHPC:~/Gentle/gentle/gen_spk.py 
Aligns the transcript with the audio signal and inserts the speaker info when writing the alignment output.