Sunday, August 27, 2017

Week 12: Final Report

We are now approaching the end of GSoC 2017. In this post, I will summarize my work over the summer. This final report is meant to be concise yet comprehensive; for more detailed explanations, check out my previous posts for this project.

The entire project consists of the following parts:
  1. Training Data Collection from CNN News Videos
  2. Outlier Detection within Training Set
  3. Building and Testing Speaker Recognition System
  4. New Speaker Recognition Module

All source code can be found in this RedHen GitHub repository.


Training Data Collection from CNN News Videos
The main script for this step is collect-train.sh; however, it only extracts audio clips from a single video file. To process a list of videos, use process-list.sh, and to submit that as a job on Case HPC, use data-prep.slurm. Collected audio samples are placed in output/audio/(speakername)/, together with a file called clip_list.txt that contains info (starting time, duration, title, etc.) about these samples. Detailed descriptions and usage cases for these scripts and other relevant files are provided below.

collect-train.sh
Description: the main script for collecting training data. Given one file name, it extracts all audio clips with speaker tags from the corresponding news video.
Usage
./collect-train.sh 2006-09-29_0100_US_CNN_Larry_King_Live.tpt



process-list.sh
Description: runs the main script collect-train.sh on a list of tpt files; an example list file is given in the usage case.
Usage:
./process-list.sh tmp-list



data-prep.slurm
Description: submits process-list.sh as a SLURM job on Case HPC.
Usage:
sbatch data-prep.slurm




Outlier Detection within Training Set
The training data collected in the previous step might contain mislabeled samples. To detect these outliers, we train a probabilistic model for each speaker and then score every audio clip with this model; samples with low log likelihood can be considered outliers. The computed likelihood is recorded in the clip_list.txt file; outliers are not deleted but are skipped in later processing. One can set a threshold to specify which range of likelihood is considered acceptable.
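To give an idea of how the recorded likelihoods can be used downstream, here is a minimal sketch that applies a threshold to a clip list. The threshold value and the assumption that clip_list.txt stores a JSON array with an "llhd" field (as in the Week 4 example output) are mine, not necessarily what the later scripts do:

import json

LLHD_THRESHOLD = -19.0   # hypothetical cutoff; tune it per dataset

# Assumes clip_list.txt holds a JSON array of entries with an "llhd" field.
with open("output/audio/Jim_Clancy/clip_list.txt") as f:
    clips = json.load(f)

kept = [c for c in clips if c.get("llhd", float("-inf")) >= LLHD_THRESHOLD]
print("%d of %d clips pass the threshold" % (len(kept), len(clips)))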

detect_outlier.py
Description: trains a speaker model from a set of training samples for one speaker, then scores those samples and saves the likelihood info to clip_list.txt. -i specifies the folder where the speaker folders are placed, and -s specifies the folder name for the speaker. It also writes statistics of each speaker's training set to a file called stats.json, which holds statistics for the entire training set.
Usage:
python detect_outlier.py -i ./output/audio/ -s Jim_Clancy



clean-train.sh
Description: loops through every speaker folder and runs outlier detection there. The DATADIR variable specifies where the speaker folders are placed.
Usage:
./clean-train.sh



Building and Testing Speaker Recognition System
With the data collected in the previous steps, one can move on to build a speaker recognition system and test its performance. When new methods are developed in the future, one might build a different system; here we show, using the current system, an example of how one could go about testing a recognition system. Before training and testing, it is important to select, based on the statistics, which speakers in the training set qualify for enrollment, since not all speakers have the same amount of data available.


select_speakers.py
Description: a python script to select qualified speakers based on the statistics file; the example here chooses speakers with more than one minute of training audio. -s gives the stats file name, and -o tells where to save the output. An example usage is given below.
Usage:
python select-speaker.py -s stats.json -o enrollment_list.json 



build_recognizer.py
Description: a python script to build a recognizer from a list of speakers; the trained models for different speakers are stored separately in the corresponding speaker folders. -d points to the training data directory and -s specifies a file that contains a list of speakers.
Usage:
python build_recognizer.py -d ./output/audio -s overlap.json


test_recognizer.py
Description: a python script to test the trained recognizer. It takes a list of speakers and reconstructs a recognizer by loading the individual models for these speakers. -d gives the testing data directory, -s specifies a file that contains a list of potential speakers, and -m points to the directory where the speaker models are stored.
Usage:

python test_recognizer.py -d $TEST -s $spk_test -m $TRAIN


train_test.sh
Description: a script that runs the whole process of training and testing.
Usage:


./train_test.sh




New Speaker Recognition Module
I also made many changes to the original speaker recognition python module so that it is suitable for large-scale news datasets and tailored for short audio clips. These changes include UBM training, more powerful features offered by librosa, a compositional construction method for recognizers, larger model capacity, and so on. The following module files were modified:


Description: a python module file for creating a set of GMM models using scikit-learn; the default number of GMM components is 64.


recognition.py
Description: a python module file for speaker recognition based on GMM.
Usage: Instantiate a recognizer, enroll a speaker and train the model:
import AudioPipe.speaker.recognition as SR
Recognizer = SR.GMMRec()
Recognizer.enroll_file(spk_name, audio_file, model=model_fn)
Recognizer.train()

Predict the speaker of an audio file:

id, llhd = Recognizer.predict(Recognizer.get_mfcc(audio_test))
where id is the identity of the speaker and llhd is the log likelihood

ubm_data.py
Description: a python script for selecting a dataset to train a Universal Background Model (UBM). The training can be done in the same way as training a GMM for one speaker, except that it is usually recommended to use many more components for the UBM. -s is the stats file, -o is the output (a list of audio clips for training), and -d specifies where these clips are saved.
Usage:
python ubm_data.py -s stats.json -o ubm_data.json -d ./output/audio


Auxiliary Files 
The following files are either used as helpers by the files above or offer additional functionality that might come in handy when manipulating data.


get.sh
Description: gets the video and tpt files from Cartago.
Usage:
./get.sh 2006-09-29_0100_US_CNN_Larry_King_Live.tpt


strip-tpt.sh
Description: strips redundant info from the tpt file and places the output in a folder specified by the second argument.
Usage:
./strip-tpt.sh $FIL.tpt $OUTDIR



gentle.sh
Description: runs Gentle alignment while preserving the speaker information. The first argument specifies the file name (without extension), whereas the second one specifies the extension of the stripped tpt file.
Usage:
./gentle.sh $FIL chevron.tpt


align2spk.py
Description: inserts the speaker info back into the alignment results. -o specifies the output file, -s specifies the list of speakers, and the last argument is the alignment result file.
Usage:
python align2spk.py -o $FIL.align.spk -s $FIL.speaker.list $FIL.align.json


spk_extract.py
Description: extracts the audio clips corresponding to the tagged speakers. -i specifies the input directory, where the video file is located; -o specifies the output directory, where the extracted audio clips are placed; -f is the file name, and -s is the output directory produced by align2spk.py.
Usage:
python spk_extract.py -o $OUTDIR/ -i $INDIR/ -f $FIL -s $OUTDIR/spk/


put.sh
Description: puts the resulting files on Cartago; the first argument is the file and the second is a folder on my Cartago account.
Usage:
./put.sh ./output/align/${FIL}.align.jsonl align

remove_duplicate.py
Description: removes duplicate entries in clip_list.txt files. -s specifies the name of the statistics file, and -d points to the data directory.
Usage:
python remove_duplicate.py -s stats.json -d ./output/audio

Monday, August 21, 2017

Week 11: A Compositional Speaker Recognition Module

It is usually observed that as the number of speakers enrolled in a system increases, the recognition accuracy decreases. Therefore, although we aim to build a large-scale speaker recognition system, we can further improve its performance if we know beforehand which speakers may appear in a video file and narrow down the range of potential speakers the system has to decide among. This is of course not possible in general. However, thanks to the tpt files archived at RedHen, we can acquire a list of speakers that appear in a CNN news video by looking at which speakers are tagged in the tpt file. Hence, this week I decided to take advantage of this additional information to optimize the current speaker recognition system. In order to do that, our system must be able to flexibly change the number of enrolled speakers. To this end, I made the following changes to the speaker recognition Python module:
  1. Instead of saving all enrolled speakers into one model, it now saves each speaker's GMM model individually.
  2. During testing time, a list of relevant speakers will be used to build a recognizer with only these speakers.
  3. New functions for adding and deleting speakers are introduced.
  4. One can now enroll a speaker from either features or a saved model.

Below are examples for how to use the updated API:

To instantiate a recognizer:
import AudioPipe.speaker.recognition as SR
Recognizer = SR.GMMRec()

To enroll a speaker:
Recognizer.enroll_file(spk_name, audio_file, model=model_fn)
where model_fn is the name of the file in which the trained model will be saved.

After enrolling enough speakers, run the following command to train and save all the models:
Recognizer.train()


Later during testing time, use the following code to load a recognizer with a list of speakers:

for spk_name in spk_ls:
    model_fn=get_model_file(spk_name)
    Recognizer.enroll_model(spk_name, model_fn)

Here get_model_file() is a pseudo-function that stands for the procedure of getting the path to the model file for the corresponding speaker.
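Under the directory layout used in this project (one folder per speaker under the data directory), a possible implementation of such a helper could look like the sketch below; the model file name is a hypothetical choice, and the actual module may store models differently:

import os

def get_model_file(spk_name, model_dir="./output/audio"):
    # Hypothetical layout: one folder per speaker, each holding that speaker's saved GMM.
    return os.path.join(model_dir, spk_name, spk_name + ".model")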


Wednesday, August 16, 2017

Week 10: Better results with GMM-UBM

After training an i-vector extractor, I tested it on the testing dataset. To my surprise, the process took extremely long: a single sample could take more than 5 seconds. In addition, its performance on a subset of the data was not as good as the GMM-based system. This result does not agree with the common belief that i-vectors should outperform GMM methods, so I suspected at first that there might be implementation errors in either the bob.kaldi package or my training script, and spent quite some time debugging this code. Although bob.kaldi could be improved to be much more efficient, no functional errors could be detected, as the underlying computation is essentially done by Kaldi. I was puzzled for a long time until I found that many others have reported that for short audio (less than 5 seconds), GMM-UBM can perform better than i-vectors. For example, in [1] the authors did a systematic comparison of these two systems, including the effect of duration variability, and mentioned in the abstract that

"We also observe that if the speakers are enrolled with sufficient amount of training data, GMM-UBM system outperforms i-vector system for very short test utterances."

Later in the introduction, they write:

"Our experimental results reveal that though TV(i-vector) system is performing better than GMM-UBM in many conditions, the classical approach is still better than the state-of-the-art technique for condition very similar to practical requirements i.e. when speakers are enrolled with sufficient amount of speech data and tested with short segments."

Moreover, in earlier literature there was also evidence showing that GMM-UBM systems perform well for short test segments [2] [3].

This explains the surprising results I saw, because the testing data we are dealing with are all audio clips of single sentences extracted from CNN news, which typically have durations of less than 5 seconds. For this reason, I decided to choose the GMM-UBM system over the i-vector system.

To further improve the GMM-UBM system, I spent a considerable amount of time tuning the parameters and made the following changes to it:


  1. The MFCC dimension is now extended to 19.
  2. Delta and double-delta coefficients are appended to form higher-dimensional feature vectors (see the feature extraction sketch after this list).
  3. More GMM components (64) are used to model a speaker.
  4. The UBM is used to check the confidence of the classification results.
  5. Instead of a simple sum of GMM posteriors, a weighted sum is used as the score.
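For reference, feature extraction along the lines of changes 1 and 2 can be sketched with librosa as follows; this is a minimal illustration rather than the module's exact code, and the 16 kHz sampling rate is an assumption:

import librosa
import numpy as np

def extract_features(wav_path, n_mfcc=19):
    # 19-dimensional MFCCs with delta and double-delta appended (frames x 57).
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta, delta2]).T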

These improvements turned out to be effective; the testing result given by this upgraded system is now
"2324 out of 2656 clips are correctly recognized"


which is 87.5%.

Monday, August 7, 2017

Week 9: Building an i-vector extractor with bob.kaldi

In the last decade, the Gaussian mixture model based on the universal background model (GMM-UBM) framework has demonstrated strong performance in speaker verification [1]. It is commonly believed that the mean vectors of the GMMs carry most of the speaker-specific characteristics. Extending the GMM-UBM framework, the factor analysis (FA) technique [2, 3] attempts to model the speaker components jointly: each speaker is represented by a mean supervector that is a linear combination of a set of eigenvoices. Based on the FA technique, joint factor analysis (JFA) [4, 5] decomposes the GMM supervector into a speaker component S and a channel component C. Inspired by the JFA approach, Dehak et al. [7] proposed a combination of the speaker space and the channel space: a new low-dimensional space named the total factor space is defined, and in this space each utterance is represented by a low-dimensional feature vector termed an i-vector. The idea of i-vectors opened a new era in the analysis of speaker and session variability.

In recent years, the i-vector approach has emerged as the state of the art for speaker recognition tasks. Many popular speech recognition libraries such as Kaldi offer APIs for training i-vector extractors; see here for a comprehensive overview of available open-source tools. To avoid reinventing the wheel, I decided to use one of the existing libraries. Although I admire the quality of Kaldi, I would prefer a pythonic API to work with, since most of the RedHen libraries are wrapped in python. Luckily, I found bob.kaldi, a bob package that seamlessly integrates Kaldi functionality with Python-based workflows.

To install bob.kaldi, one needs to first install Miniconda from here, then follow the bob installation instructions here. Finally, bob.kaldi can be installed with:
conda install bob.kaldi

To activate the virtual environment for this package on my Case HPC account, run the following command:

source activate bob_py3 
I have written a python script build_model.py to train an i-vector extractor with the functions provided in bob.kaldi; a usage case looks like the following:
python build_model.py -d ubm_data.json
where ubm_data.json is a json file that contains a list of training samples for the UBM; one can produce such a list with the python script mentioned in last week's article. We need this file because training an i-vector extractor requires a pre-trained UBM model.

Note that the function bob.kaldi.ivector_train() accepts features for multiple utterances as a 3D array. I found this odd, since two utterances may have different durations and therefore different dimensions, and padding them to the same length does not seem an elegant solution to me. So I simply changed if feats.ndim == 3: to a check of whether feats is a list (i.e. if isinstance(feats, list):) in the source code; now one can put the features in a list.

 

Saturday, July 29, 2017

Week 8: Upgrading Recognition System with Universal Background Models

A Universal Background Model (UBM) is a model used in a biometric verification system to represent general, person-independent feature characteristics, to be compared against a model of person-specific feature characteristics when making an accept-or-reject decision. For example, in a speaker verification system, the UBM is a speaker-independent Gaussian Mixture Model (GMM) trained with speech samples from a large set of speakers to represent general speech characteristics. Using a speaker-specific GMM trained with speech samples from a particular enrolled speaker, a likelihood-ratio test for an unknown speech sample can be formed between the match score of the speaker-specific model and the UBM. The UBM may also be used when training the speaker-specific model, by acting as the prior model in MAP parameter estimation. More about UBM can be found here.
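As a rough illustration of the likelihood-ratio idea, here is a simplified sketch with scikit-learn and random stand-in features; it is not the module's code, and a real GMM-UBM system would MAP-adapt the speaker model from the UBM and use far more components:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
ubm_feats = rng.normal(size=(2000, 19))   # stand-in for MFCC frames pooled over many speakers
spk_feats = ubm_feats[:400] + 0.5         # stand-in for frames of one enrolled speaker
test_feats = spk_feats[:100]              # an unknown sample to verify

# Small component counts keep the toy example fast; the real system uses many more.
ubm = GaussianMixture(n_components=32, covariance_type="diag", random_state=0).fit(ubm_feats)
spk = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(spk_feats)

# score() returns the average per-frame log likelihood; the difference is the verification score.
llr = spk.score(test_feats) - ubm.score(test_feats)
print("log likelihood ratio:", llr)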

State-of-the-art Gaussian mixture model (GMM)-based speaker recognition/verification systems utilize a UBM, and the currently very popular total variability (i-vector) approach needs a trained UBM as a prerequisite. Hence, to improve our current speaker recognition system, we will also equip it with a UBM.

As we already have a GMM-based speaker recognition system, all we have to do is collect a set of speech samples from distinct speakers that covers a large variety of speech characteristics. One could of course take the entire dataset to train the model, but since some speakers have many more speech samples than others, they could easily dominate the resulting UBM. Hence I limited the number of speech samples per speaker included in the UBM training data. For this, I wrote a python script called ubm_data.py, which has the following usage case:


python ubm_data.py -s stats.json -o ubm_data.json -d ./output/audio 
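A rough sketch of the balancing logic behind ubm_data.py is given below; the per-speaker cap of 20 clips and the output format are assumptions for illustration, not necessarily what the script does:

import glob
import json
import os

AUDIO_DIR = "./output/audio"
MAX_CLIPS_PER_SPEAKER = 20   # hypothetical cap so frequent speakers do not dominate the UBM

ubm_clips = []
for speaker in sorted(os.listdir(AUDIO_DIR)):
    wavs = sorted(glob.glob(os.path.join(AUDIO_DIR, speaker, "*.wav")))
    ubm_clips.extend(wavs[:MAX_CLIPS_PER_SPEAKER])

with open("ubm_data.json", "w") as f:
    json.dump(ubm_clips, f, indent=2)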

The output file ubm_data.json is a list of files in the ./output/audio directory that will be used to train the UBM. The training can be done with the existing speaker recognition module. However, since a UBM usually consists of many more components than the GMM of an individual speaker, we need an additional handle in the speaker recognition API to specify a larger number of Gaussian components. For this, I added a parameter gmm_order to the train() function, which can be called as follows:


Recognizer = SR.GMMRec()

Recognizer.enroll_file(spk_name, audio_file)

Recognizer.train(gmm_order=256)





Saturday, July 22, 2017

Week 6-7: Test Current Speaker Recognition System

After collecting labeled audio clips from 3 months (09-11.2006) of news videos, I had enough to test the existing speaker recognition system; I used the data from Sept. and Nov. 2006 for training and that from Oct. 2006 for testing.

Since these datasets might include mislabeled audio clips, I first ran the outlier detection algorithm to clean them up. To this end, I wrote a script clean-train.sh that goes through every speaker in the dataset and marks each clip with its log likelihood and z-score.

Furthermore, not all speakers appearing in the datasets should be enrolled in our speaker recognition system, for various reasons (not enough training data, speaker name unidentifiable, e.g. 'ANNOUNCER', 'ME'); hence we need information to help us decide which speakers to select. For this purpose, I computed speaker-related statistics for each dataset and stored them in a json file called stats.json, which is used later to decide which speakers to enroll into the recognition system. The information collected in stats.json includes, for each speaker, the number of clips, the total audio duration, the number of non-outlier clips, and the total duration of non-outlier audio clips.

Here is a list of the top 100 speakers in the testing dataset, sorted by the total duration (in seconds) of non-outlier clips each speaker has:
[('Bronwyn_Adcock', 59.4),
 ('Ross_Perot', 59.510000000000005),
 ('Rick_Sanchez', 59.98000000000001),
 ('Gary_Tuchman', 60.69),
 ('Jonathan_Freed', 61.19),
 ('Rep._Ray_Lahood', 62.84),
 ('Bay_Buchanan', 62.99999999999999),
 ('Mayor_Keith_Weatherly', 64.61),
 ('Bill_Tucker', 64.89),
 ('Rosemary_Church', 65.67),
 ('Cal_Perry', 66.53),
 ('Dennis_Hastert', 69.28),
 ('Paula_Newton', 70.46000000000001),
 ('KOCH', 70.64999999999999),
 ('Paul_Weyrich', 71.0),
 ('Dan_Simon', 71.96),
 ('J.C._Watts', 72.39),
 ('Donald_Rumsfeld', 72.88000000000001),
 ('David_Roth', 76.03999999999999),
 ('Aaron_Meyer', 76.07),
 ('Kitty_Pilgrim', 76.89999999999999),
 ('ONAR', 78.99),
 ('Ed_Henry', 79.67999999999999),
 ('Lewis_Black', 80.04),
 ('Doro_Bush_Koch', 80.05),
 ('Melanie_Sloan', 81.61999999999999),
 ('Andy_Serwer', 84.96),
 ('Amy_Walter', 85.35),
 ('Michael_Ware', 86.83),
 ('Howard_Kurtz', 91.64000000000001),
 ('Rusty_Dornin', 93.59),
 ('Bill_Maher', 94.15),
 ('Susan_Candiotti', 97.88000000000001),
 ('Gerri_Willis', 99.65),
 ('David_Albright', 107.69999999999999),
 ('Stuart_Rothenberg', 108.06),
 ('Sen._John_Warner', 110.46000000000001),
 ('NEWTON-JOHN', 115.9),
 ('David_Gergen', 116.35000000000001),
 ('John_Zarrella', 120.47),
 ('Jason_Carroll', 123.95),
 ('Drew_Griffin', 125.84),
 ('ANNOUNCER', 128.05),
 ('Delia_Gallagher', 131.59),
 ('Arwa_Damon', 138.01999999999998),
 ('Keith_Oppenheim', 143.53000000000003),
 ('John_Bolton', 145.67000000000002),
 ('Comm._Jeffrey_Miller', 155.91999999999996),
 ('ME', 156.79),
 ('Dr._Sanjay_Gupta', 157.49000000000004),
 ('Condoleezza_Rice', 163.28999999999996),
 ('Stephen_Jones', 163.49999999999994),
 ('Joe_Johns', 165.69),
 ('Michael_Holmes', 169.88000000000005),
 ('Richard_Roth', 170.37000000000003),
 ('Jeff_Koinange', 173.48000000000002),
 ('Dan_Rivers', 174.62),
 ('Rep._Dennis_Hastert', 174.91999999999996),
 ('Betty_Nguyen', 178.35000000000002),
 ('Carol_Lin', 181.16),
 ('Kelli_Arena', 184.89999999999998),
 ('Randi_Kaye', 191.82999999999998),
 ('BUSH', 201.27999999999997),
 ('Commissioner_Jeffrey_Miller', 209.14),
 ('Jack_Cafferty', 209.60999999999999),
 ('William_Schneider', 221.1000000000001),
 ('Kathleen_Koch', 226.2),
 ('Jeanne_Moos', 234.41999999999993),
 ('Candy_Crowley', 237.73000000000002),
 ('Bob_Woodward', 247.91999999999993),
 ('Mary_Snow', 248.77),
 ('PHILLIPS', 248.85),
 ('Suzanne_Malveaux', 256.79999999999995),
 ('Zain_Verjee', 267.5100000000001),
 ('AIKEN', 267.82),
 ('Tony_Snow', 275.78000000000003),
 ('Fredricka_Whitfield', 286.03999999999996),
 ('Ralitsa_Vassileva', 290.8500000000001),
 ('Jim_Clancy', 295.96000000000004),
 ('Allan_Chernoff', 329.81000000000006),
 ('UNIDENTIFIED_FEMALE', 331.53999999999996),
 ('QUESTION', 335.44000000000005),
 ('Barbara_Starr', 366.53000000000003),
 ('George_W._Bush', 438.09999999999997),
 ('Lou_Dobbs', 520.3500000000001),
 ('Brian_Todd', 583.4300000000001),
 ('Heidi_Collins', 595.0599999999998),
 ('Andrea_Koppel', 653.7400000000001),
 ('UNIDENTIFIED_MALE', 654.7500000000001),
 ('Paula_Zahn', 681.0300000000003),
 ('Jamie_Mcintyre', 700.4899999999999),
 ('John_King', 716.2900000000002),
 ('Dana_Bash', 803.4400000000003),
 ('Tony_Harris', 816.8199999999998),
 ('John_Roberts', 842.2500000000007),
 ('Larry_King', 999.8599999999997),
 ('Kyra_Phillips', 1109.6700000000005),
 ('Anderson_Cooper', 1260.3100000000013),
 ('Don_Lemon', 1355.3300000000004),
 ('Wolf_Blitzer', 1627.37)]

Based on the statistics computed in the last step, I could specify the criteria for selecting speakers to enroll. select-speaker.py implements this: it takes stats.json as input and produces a list of speakers that meet the specified criteria. For this round of testing, the criteria are the following:
the total duration of non-outlier clips should be at least 1 minute, and the number of non-outlier clips should be more than 10. Usage example of select-speaker.py:

python select-speaker.py -s stats.json -o enrollment_list.json 
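The filter itself amounts to a few lines; the following sketch shows the kind of check select-speaker.py performs, with the stats.json field names being assumptions rather than the script's actual keys:

import json

MIN_DURATION = 60.0   # seconds of non-outlier audio
MIN_CLIPS = 10        # number of non-outlier clips

with open("stats.json") as f:
    stats = json.load(f)   # assumed layout: {speaker: {"clean_duration": ..., "clean_clips": ...}}

enrolled = [spk for spk, s in stats.items()
            if s["clean_duration"] >= MIN_DURATION and s["clean_clips"] > MIN_CLIPS]

with open("enrollment_list.json", "w") as f:
    json.dump(enrolled, f, indent=2)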


I applied the above-mentioned criteria to select speakers from the training and testing datasets respectively, and the overlap between the resulting two lists of qualified speakers includes the following 57 speakers:
['Michael_Holmes',
 'Amy_Walter',
 'Andy_Serwer',
 'Jamie_Mcintyre',
 'John_Roberts',
 'Jeanne_Moos',
 'Dana_Bash',
 'Heidi_Collins',
 'Howard_Kurtz',
 'Mary_Snow',
 'Tony_Snow',
 'Arwa_Damon',
 'Donald_Rumsfeld',
 'Delia_Gallagher',
 'Richard_Roth',
 'Susan_Candiotti',
 'Allan_Chernoff',
 'Bay_Buchanan',
 'Jim_Clancy',
 'Kathleen_Koch',
 'William_Schneider',
 'Michael_Ware',
 'Rusty_Dornin',
 'Jason_Carroll',
 'Joe_Johns',
 'Gerri_Willis',
 'George_W._Bush',
 'Barbara_Starr',
 'Larry_King',
 'Drew_Griffin',
 'Randi_Kaye',
 'Kyra_Phillips',
 'Lou_Dobbs',
 'Gary_Tuchman',
 'Andrea_Koppel',
 'Dr._Sanjay_Gupta',
 'David_Gergen',
 'Zain_Verjee',
 'Anderson_Cooper',
 'Don_Lemon',
 'Jack_Cafferty',
 'Tony_Harris',
 'Ralitsa_Vassileva',
 'Suzanne_Malveaux',
 'Dan_Simon',
 'Keith_Oppenheim',
 'Betty_Nguyen',
 'Wolf_Blitzer',
 'Brian_Todd',
 'John_King',
 'Fredricka_Whitfield',
 'John_Zarrella',
 'John_Bolton',
 'Candy_Crowley',
 'Paula_Zahn',
 'Kelli_Arena',
 'Carol_Lin']

These speakers were used to train and test the speaker recognizer. For training, I wrote a python script build_recognizer.py. Given a list of speakers, this script loops through the list, trains a model for every speaker, and adds it to the system; outlier audio clips are excluded from the training.

Finally, the trained recognizer was tested on the audio clips from these speakers in the testing dataset. This process is implemented in test_recognizer.py, which writes the testing result as a list of (clip name, predicted name) pairs into test_results.json and prints the total accuracy:
"2139 out of 2656 clips are correctly recognized!"
which is approximately 80.53%.

Below are the usage examples of build_recognizer.py and test_recognizer.py:
python build_recognizer.py -d $TRAIN -s $spk_train -o $OUTDIR

python test_recognizer.py -d $TEST -s $spk_test -m $model

Saturday, July 8, 2017

Week 5: Main Script for Training Data Preparation

Now all the steps required for extracting training clips from a video file are finished. It is time to actually run these processes on a large number of video files from Cartago to prepare training data for our speaker recognition system. To this end, I assembled all the solutions so far into one main script.

Given a news file name, the main script implements the following process:

1. Get the news video and tpt files from cartago. (script: get.sh )

2. Strip off redundant information from the tpt file and mark speaker turns by ">>" to get the transcript. (script: strip-tpt.sh, output: .chevron.tpt)

3. Feed the video and transcript to Gentle to produce the alignment file. (script: gentle.sh, output: .align.jsonl)

4. Extract speaker occurrences from the tpt file and save them into a speaker list file. (script: command line, output: .speaker.list)

5. Convert alignment file to alignment file with speaker turns and sentence boundaries. (script: align2spk.py, output: .align.spk)

6. Based on the speaker turns and sentence boundaries, extract an audio clip from the video for every occurrence of a speaker. (script: spk_extract.py, output: .wav)

7. Remove the input files, including .mp4, .tpt, .chevron.tpt, .speaker.list

8. Send some results back to cartago. (script: put.sh, including .align.spk, .align.jsonl)

Then I applied this process to a list of news files (so far only those from Sept., Oct., and Nov. 2006 have been processed); for this I also wrote a SLURM batch file for submitting the job to Case HPC. Later I learned that it is possible to parallelize this job via SLURM job arrays; this method will be adopted to process other news files in the future.

Not surprisingly, the main script ran into many problems while processing these news files, since many assumptions made while developing the previous components were based on a few sample files chosen randomly from the news archive, and these assumptions do not hold for all news files. Thus, this data collection procedure also played an important role in debugging my code. Here I list all the bugs detected during the process; the list will be updated whenever a new one is encountered:

1. Non-UTF-8 characters appear in some tpt files, which made Gentle crash. Fix: I added a command to remove non-UTF-8 characters.

2. ") :" was used as a landmark for detecting speaker turns, and an exception occurred in one file. Fix: switched to "NER01|Person" as the landmark.

3. "#" was used to denote sentence boundaries, but in one tpt "#" also occurred in the transcript. Fix: replace "#" with "$" in the transcript.

4. "V/O" occurred as a speaker name, and a directory with this name could not be created because of the "/". Fix: replace any "/" in a speaker name with "_" (a possible sanitizer is sketched after this list).

5. Speaker names may contain quotes (" and '), which causes trouble later during outlier detection. Fix: TODO.
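For bug 4 (and possibly bug 5), the sanitization can be as simple as the following sketch; this is an illustration, not the exact code in the pipeline:

import re

def sanitize_speaker_name(name):
    # Replace "/" so the name can be used as a directory name (bug 4),
    # and strip quote characters as a possible fix for bug 5.
    name = name.replace("/", "_")
    return re.sub(r"[\"']", "", name)

print(sanitize_speaker_name('"V/O"'))   # -> V_O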


Thursday, June 29, 2017

Week4: Detecting Outliers in Training Data

In last week's post, I mentioned that there can be mislabeled audio clips in the training dataset collected for each speaker. Such mislabeled data will degrade the quality of the models trained on them and eventually lead to bad performance on speaker recognition tasks. Hence I designed an outlier detection method tailored to audio data to automatically filter out these mislabeled clips.

This method is based on Gaussian Mixture Models (GMMs). For every speaker, we first train a GMM on all audio clips collected in the training dataset using the expectation-maximization (EM) algorithm. Then, for each audio clip, we estimate its generative probability under the trained model. Audio clips that fit the distribution of the true speaker will have a high generative probability, while anomalies will have very low fit probabilities. Hence we can set a threshold and filter out every clip whose fit score falls below it. For details of outlier detection by probabilistic mixture modeling, please refer to Chapter 2.4 of the book Outlier Analysis by Charu Aggarwal, one of the most cited researchers in the field of outlier and anomaly detection.
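A minimal sketch of this procedure with librosa and scikit-learn is shown below; the feature choice, component count, and sampling rate are assumptions, and detect_outlier.py may differ in detail:

import glob
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def mfcc_frames(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T   # frames x 13

clips = sorted(glob.glob("./output/audio/Jim_Clancy/*.wav"))
feats = [mfcc_frames(c) for c in clips]

# Fit one GMM per speaker on all of that speaker's frames via EM.
gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(np.vstack(feats))

# Score each clip by its average per-frame log likelihood ("llhd").
for clip, f in zip(clips, feats):
    print(clip, gmm.score(f))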

I tested this method on the data collected last week from the sample video. The following list shows the result on the audio clips for Jim Clancy; the generative probability for each clip is given in the attribute "llhd", which stands for log likelihood.
[
  {
    "duration": "0:00:01.650",
    "llhd": -20.387030349382481,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy0.wav",
    "start": "0:00:15.240"
  },
  {
    "duration": "0:00:08.000",
    "llhd": -18.196139725170504,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy1.wav",
    "start": "0:00:32.960"
  },
  {
    "duration": "0:00:00.910",
    "llhd": -17.888030707481747,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy2.wav",
    "start": "0:00:47.460"
  },
  {
    "duration": "0:00:05.940",
    "llhd": -18.082631203617577,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy3.wav",
    "start": "0:01:25.690"
  },
  {
    "duration": "0:00:03.960",
    "llhd": -18.352468451630649,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy4.wav",
    "start": "0:01:39.290"
  },
  {
    "duration": "0:00:06.260",
    "llhd": -17.712504094912944,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy5.wav",
    "start": "0:02:14.740"
  },
  {
    "duration": "0:00:05.140",
    "llhd": -19.767810192124848,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy6.wav",
    "start": "0:04:00.360"
  },
  {
    "duration": "0:00:01.520",
    "llhd": -17.829715306752892,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy7.wav",
    "start": "0:05:40.970"
  },
  {
    "duration": "0:00:06.320",
    "llhd": -18.639242622299303,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy8.wav",
    "start": "0:09:54.240"
  },
  {
    "duration": "0:00:07.320",
    "llhd": -17.714960218582345,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy9.wav",
    "start": "0:10:08.030"
  },
  {
    "duration": "0:00:04.290",
    "llhd": -17.906594198834085,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy10.wav",
    "start": "0:12:34.670"
  },
  {
    "duration": "0:00:03.420",
    "llhd": -17.690216443470053,
    "name": "2006-10-02_1600_US_CNN_Your_World_Today_Jim_Clancy11.wav",
    "start": "0:15:11.680"
  }
]

Note that this method successfully identified the mislabeled clip (number 6) mentioned in my last article, which scores the second-lowest log likelihood (-19.767810192124848). Although the speech in the first clip really is from Jim Clancy, one can notice that it is mixed with loud background music, so it is not so surprising that the algorithm also flagged it as an outlier. Running this method on the datasets of other speakers also confirmed that it can filter out not only wrong speakers but also noisy, low-quality clips. One may argue that training with noisy data can increase the robustness of the recognition models; this, however, can be compensated for later by introducing artificial noise or by extracting the speech signals using blind source separation techniques.

The script for this process is detect_outlier.py, and a usage example is given here:


python detect_outlier.py -i ./output/audio/ -s Jim_Clancy

The argument after -i is the audio directory, where the training audio clips for every speaker are placed, and -s specifies the name of the speaker, which should also be the name of a subdirectory of the audio directory.

Wednesday, June 21, 2017

Week3: Collect Training Data for Speaker Recognition

After we have the speaker turns and sentence boundaries, the next step is to extract audio clips for every speaker and collect them as training datasets for the recognition models. To do this, I wrote a python script spk_extract.py, which loops through the list of speakers in the transcript; whenever it encounters a new speaker, a directory for this speaker is created and the audio clips of this speaker are placed inside, along with a text file containing the list of clips and their timestamps.
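At its core, each clip is cut out of the video with ffmpeg; a minimal sketch of that step is shown below, where the exact options, sampling rate, and file names are assumptions rather than what spk_extract.py actually uses (the start time and duration are taken from the Week 4 clip list):

import subprocess

def extract_clip(video, start, duration, out_wav):
    # Cut one speaker clip from the news video and convert it to 16 kHz mono WAV.
    subprocess.run([
        "ffmpeg", "-y", "-ss", start, "-t", duration, "-i", video,
        "-vn", "-ac", "1", "-ar", "16000", out_wav,
    ], check=True)

extract_clip("2006-10-02_1600_US_CNN_Your_World_Today.mp4",
             "0:00:32.960", "0:00:08.000", "Jim_Clancy1.wav")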

At first, I clipped every sentence between two speaker turns, which resulted in a huge number of clips in the datasets. The problem is that a lot of irrelevant speech gets included, because not all audio between two speaker turns comes from the same speaker; for example, there can be commercials between two speaker turns. Therefore, to make the data collection more reliable, I changed the code to extract only the first sentence after a speaker turn. This significantly reduced the number of clips collected (e.g. the number of clips for the most frequent speaker in the test news video, Jim Clancy, dropped from 482 to 11) and increased the quality of the datasets. However, there are still wrong clips, simply because the speakers are mislabeled in the tpt file. For instance, just by randomly sampling the clips, I found that the sentence starting at time 00:04:00 of the test news video 2006-10-02_1600_US_CNN_Your_World_Today_Rep.mp4 was labeled as spoken by the anchor Jim Clancy (see the screenshot of the aligned transcript below),
but the pictures in the video clearly indicate that the speaker is Martin Geissler (see the video screenshot).

Such mislabeled training data may mislead the recognition models and result in bad performance at testing time. Potential solutions to this problem have been studied in machine learning, for example outlier detection and training with noisy labels. Later in this summer project, I will try these methods to clean up the training dataset.

The following command gives an example of running the script:


module load ffmpeg

python spk_extract.py -o ./output/ -i ~/data/ -f 2006-10-02_1600_US_CNN_Your_World_Today -s output/spk/


-o output directory
-i input directory
-s directory of speaker turn file
-f filename 

Sunday, June 18, 2017

Week 2: Speaker Turns and Sentence Boundaries

After a discussion over Skype, Peter and I decided to stay with the original Gentle output format, which consists of two components: the input transcript and a list of words. Each word in the list has attributes "startOffset" and "endOffset" that refer to its position in the original transcript. We favor this format mainly for two reasons: 1. it already contains all the information we need in a concise and non-redundant form, which makes it a good compromise among many different applications: any specific application can further process it to fit its own needs; 2. no changes have to be made to the Gentle code, so merging issues can be avoided even when Gentle upgrades its implementation in the future.

As a result of this decision, I wrote a python script align2spk.py to convert the Gentle output into a data format suitable for the speaker recognition task, in which the speaker turns and sentence boundaries are marked. We need the speaker turns later to extract speech signals from the audio and collect them as speaker recognition training datasets; the sentence boundaries will be used later for segmenting audio at testing time.

Speaker turns are originally marked in the transcript by ">>"; this is because the input transcript to Gentle is the stripped tpt file, which contains speaker turns denoted by ">>", as introduced in the last post.

For sentence boundaries, I used the python Natural Language Toolkit (nltk) to split the transcript component into sentences. Splitting a document into sentences is not an entirely trivial task, because in English, sentence-ending punctuation such as "." can also be used for other purposes such as abbreviations (e.g. "U.S.", "Dr."), so the tokenizer from nltk is in fact a pattern recognizer trained on a large set of corpora (see the official documentation for details).
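For illustration, the sentence spans and their character offsets can be obtained as sketched below; the transcript snippet is made up, and the actual script works on the stripped tpt transcript:

import nltk

nltk.download("punkt", quiet=True)
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")

transcript = ">> Welcome back to Your World Today. I'm Jim Clancy. >> Thanks, Jim."

# span_tokenize yields (startOffset, endOffset) character spans into the transcript,
# which can be matched against Gentle's word offsets to place sentence boundaries.
for start, end in tokenizer.span_tokenize(transcript):
    print(start, end, transcript[start:end])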

Both the speaker turns and sentence boundaries have attributes "start" and "end" to indicate their occurrence time in the audio file. The start time of a speaker turn or sentence boundary is the end time of the last successfully aligned word (with "case" = "success") before it, and its end time is the start time of the next successfully aligned word. If the immediately neighboring words are not successfully aligned (e.g. "case" = "not-found-in-audio"), then the timestamps of the closest aligned words are used; in this case, the turn or boundary is marked with an attribute "case" = "cautious", indicating that its timestamps may not be very reliable; otherwise it has the attribute "case" = "fine".
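The timestamp assignment described above can be sketched as follows; this is an illustration of the rule, not the exact align2spk.py implementation, and it assumes Gentle's word entries carry "case", "start", "end", "startOffset", and "endOffset" fields:

def boundary_times(words, offset):
    # words: Gentle's word list in transcript order; offset: character position of the boundary.
    before = [w for w in words if w["endOffset"] <= offset]
    after = [w for w in words if w["startOffset"] >= offset]

    prev_ok = next((w for w in reversed(before) if w["case"] == "success"), None)
    next_ok = next((w for w in after if w["case"] == "success"), None)

    # "cautious" if either immediate neighbour was not successfully aligned.
    immediate = (bool(before) and before[-1]["case"] == "success" and
                 bool(after) and after[0]["case"] == "success")

    return {
        "start": prev_ok["end"] if prev_ok else None,
        "end": next_ok["start"] if next_ok else None,
        "case": "fine" if immediate else "cautious",
    }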

An example use case for this script is given below:
python align2spk.py -o lucier.align.spk lucier.align.jsonl

Wednesday, June 7, 2017

Week 1: Combining Speaker Information and Forced Alignment

The tpt files on Cartago contain, among much other meta information, the transcript and the corresponding speakers for the news audio. The task of this week is to combine the speaker info from the tpt with the time info given by the Gentle alignment.

Gentle expects the transcript to contain only the spoken words; any other information is noise to it and may degrade the quality of the alignment results. Therefore, in order to run Gentle, the first step is to strip the redundant information from the tpt file to get a clean transcript. The program for this, written by Prof. Steen, is called strip-tpt and is placed in my Cartago home directory.

To finally get "who speaks when", we then need to add the speaker info from the tpt to the output returned by Gentle. The challenge is to find the right place in the aligned transcript to insert the corresponding speakers. The solution I came up with is to leave a special notation at the place where a speaker occurs in the tpt file during the stripping process, and to use this special notation as a reference point later for inserting the corresponding speaker info. In order for this extra content in the transcript NOT to worsen the performance of Gentle, the special notation has to be ignored during alignment but kept in the result. After a set of test runs, I found that the double chevrons ">>" used in tpt files to mark speakers would do the job.

Therefore I modified strip-tpt so that instead of removing the speaker info from the tpt, it replaces it with ">>". In addition, I added some commands to also strip the "voice over" info that was previously overlooked. The modified file is called chevron-tpt, also on my Cartago. It has been run on all the tpt files, and the processed results (denoted by the extension .chevron.tpt) are saved in home/owen_he/netapp/chevron/.

To get the stripped results onto Case HPC, I wrote a script get-chevron, which fetches the original tpt, chevron.tpt, and video files from Cartago and places them in the folder ~/data/. Furthermore, a command was added to ~/GSoC2017/gentle.sh to extract the speaker info from the tpt before it starts the Gentle alignment. Finally, I hacked the main script of the Gentle alignment to also write the speaker info into the alignment output.



List of files:

Cartago:~/chevron-tpt:

replaces the speaker info with ">>" as reference points for inserting the speaker information back later; "(voice over)" was not stripped off in the previous version and is now replaced with "||"


CaseHPC:~/data/get-chevron
gets chevron.tpt files from Cartago


CaseHPC:~/GSoC2017/gentle.sh
bash script to run Gentle alignment for acquiring speaker boundaries and extract the speaker info to speaker.list; usage example:
./gentle.sh [filename] [txt_ext] [out_dir] [out_ext] [audio_dir] [txt_dir]

CaseHPC:~/Gentle/gentle/gen_spk.py 
Aligns the transcript with the audio signal, and inserts the speaker info when writing out the alignment result.

Friday, May 5, 2017