Saturday, July 29, 2017

Week 8: Upgrading Recognition System with Universal Background Models

A Universal Background Model (UBM) is a model used in a biometric verification system to represent general, person-independent feature characteristics, to be compared against a model of person-specific feature characteristics when making an accept/reject decision. For example, in a speaker verification system, the UBM is a speaker-independent Gaussian Mixture Model (GMM) trained with speech samples from a large set of speakers to represent general speech characteristics. Using a speaker-specific GMM trained with speech samples from a particular enrolled speaker, a likelihood-ratio test for an unknown speech sample can be formed between the match score of the speaker-specific model and that of the UBM. The UBM may also be used when training the speaker-specific model, by acting as the prior model in MAP parameter estimation. More about UBMs can be found here.
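To make the likelihood-ratio test concrete, here is a minimal sketch using scikit-learn's GaussianMixture (an assumption for illustration; our system has its own GMM code):

from sklearn.mixture import GaussianMixture

# Illustrative models: a large UBM and a smaller speaker-specific GMM,
# both fit on feature frames (e.g. MFCCs): ubm.fit(background_frames)
ubm = GaussianMixture(n_components=256, covariance_type='diag')
speaker_gmm = GaussianMixture(n_components=32, covariance_type='diag')

def llr_score(frames, speaker_gmm, ubm):
    """Average per-frame log-likelihood ratio between the speaker
    model and the UBM; score() returns the mean log-likelihood."""
    return speaker_gmm.score(frames) - ubm.score(frames)

# Accept the identity claim if the ratio exceeds a tuned threshold:
# accept = llr_score(frames, speaker_gmm, ubm) > threshold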

State-of-the-art Gaussian mixture model (GMM)-based speaker recognition/verification systems utilize a UBM, and the currently very popular total variability (i-vector) approach requires a trained UBM as a prerequisite. Hence, to improve our current speaker recognition system, we will equip it with a UBM as well.

As we already have a GMM-based speaker recognition system, all we have to do is collect a set of speech samples from distinct speakers that covers a large variety of speech characteristics. One could of course take the entire dataset to train the model, but since some speakers have many more speech samples than others, they could easily dominate the resulting UBM. Hence, I limited the number of speech samples per speaker included in the UBM training data. For this, I wrote a python script called ubm_data.py, which is used as follows:


python ubm_data.py -s stats.json -o ubm_data.json -d ./output/audio 
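At its core, ubm_data.py just caps the number of clips per speaker; a simplified sketch of the idea (the cap value and the stats.json field names are assumptions for illustration):

import json
import random

MAX_CLIPS_PER_SPEAKER = 20  # illustrative cap

with open('stats.json') as f:
    stats = json.load(f)

ubm_files = []
for speaker, info in stats.items():
    clips = info['clips']  # assumed field: this speaker's clip files
    random.shuffle(clips)  # avoid biasing toward any particular day/show
    ubm_files.extend(clips[:MAX_CLIPS_PER_SPEAKER])

with open('ubm_data.json', 'w') as f:
    json.dump(ubm_files, f)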

The output file ubm_data.json lists the files in the ./output/audio directory that will be used to train the UBM. The training itself can be done with the existing speaker recognition module. However, since a UBM usually consists of many more components than the GMM of an individual speaker, we need an extra handle in the speaker recognition API to specify a larger number of Gaussian components. For this, I added a parameter gmm_order to the train() function, which can be called as follows:


Recognizer = SR.GMMRec()
Recognizer.enroll_file(spk_name, audio_file)
Recognizer.train(gmm_order=256)





Saturday, July 22, 2017

Weeks 6-7: Testing the Current Speaker Recognition System

After collecting labeled audio clips from three months (Sept.-Nov. 2006) of news videos, I had enough data to test the existing speaker recognition system. I used the data from Sept. and Nov. 2006 for training and the data from Oct. 2006 for testing.

Since these datasets might include mislabeled audio clips, I first ran the outlier detection algorithm to clean them up. To this end, I wrote a script clean-train.sh that goes through every speaker in the dataset and marks each clip with its log-likelihood and z-score.
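Conceptually, the marking step z-scores each clip's log-likelihood under its own speaker's model; a sketch (the threshold is illustrative, and the actual logic lives in clean-train.sh and the recognition module):

import numpy as np

def mark_outliers(clip_loglik, z_threshold=2.0):
    """clip_loglik maps clip name -> log-likelihood under the
    speaker's own GMM; returns clip name -> outlier flag."""
    scores = np.array(list(clip_loglik.values()))
    z = (scores - scores.mean()) / scores.std()
    return {clip: abs(zi) > z_threshold
            for clip, zi in zip(clip_loglik, z)}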

Furthermore, not all speakers appearing in the datasets should be enrolled in our speaker recognition system, for various reasons (not enough training data, unidentifiable speaker names such as 'ANNOUNCER' or 'ME'), so we need information to help us decide which speakers to select. For this purpose, I computed speaker-related statistics for each dataset and stored them in a json file called stats.json, which is used later to decide which speakers to enroll into the recognition system. The information collected in stats.json includes, for each speaker: the number of clips, the total audio duration, the number of non-outlier clips, and the total duration of non-outlier audio clips.
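For concreteness, a sketch of how one such per-speaker record could be assembled (the field names are illustrative, not the exact stats.json schema):

def speaker_stats(clip_durations, outliers):
    """clip_durations maps clip name -> duration in seconds;
    outliers is the set of clip names flagged by clean-train.sh."""
    clean = {c: d for c, d in clip_durations.items() if c not in outliers}
    return {
        'n_clips': len(clip_durations),
        'total_duration': sum(clip_durations.values()),
        'n_clean_clips': len(clean),
        'clean_duration': sum(clean.values()),
    }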

Here is a list of the top 100 speakers in the testing dataset, sorted in ascending order by the total duration (in seconds, rounded here to two decimal places) of each speaker's non-outlier clips:
[('Bronwyn_Adcock', 59.4),
 ('Ross_Perot', 59.51),
 ('Rick_Sanchez', 59.98),
 ('Gary_Tuchman', 60.69),
 ('Jonathan_Freed', 61.19),
 ('Rep._Ray_Lahood', 62.84),
 ('Bay_Buchanan', 63.0),
 ('Mayor_Keith_Weatherly', 64.61),
 ('Bill_Tucker', 64.89),
 ('Rosemary_Church', 65.67),
 ('Cal_Perry', 66.53),
 ('Dennis_Hastert', 69.28),
 ('Paula_Newton', 70.46),
 ('KOCH', 70.65),
 ('Paul_Weyrich', 71.0),
 ('Dan_Simon', 71.96),
 ('J.C._Watts', 72.39),
 ('Donald_Rumsfeld', 72.88),
 ('David_Roth', 76.04),
 ('Aaron_Meyer', 76.07),
 ('Kitty_Pilgrim', 76.9),
 ('ONAR', 78.99),
 ('Ed_Henry', 79.68),
 ('Lewis_Black', 80.04),
 ('Doro_Bush_Koch', 80.05),
 ('Melanie_Sloan', 81.62),
 ('Andy_Serwer', 84.96),
 ('Amy_Walter', 85.35),
 ('Michael_Ware', 86.83),
 ('Howard_Kurtz', 91.64),
 ('Rusty_Dornin', 93.59),
 ('Bill_Maher', 94.15),
 ('Susan_Candiotti', 97.88),
 ('Gerri_Willis', 99.65),
 ('David_Albright', 107.7),
 ('Stuart_Rothenberg', 108.06),
 ('Sen._John_Warner', 110.46),
 ('NEWTON-JOHN', 115.9),
 ('David_Gergen', 116.35),
 ('John_Zarrella', 120.47),
 ('Jason_Carroll', 123.95),
 ('Drew_Griffin', 125.84),
 ('ANNOUNCER', 128.05),
 ('Delia_Gallagher', 131.59),
 ('Arwa_Damon', 138.02),
 ('Keith_Oppenheim', 143.53),
 ('John_Bolton', 145.67),
 ('Comm._Jeffrey_Miller', 155.92),
 ('ME', 156.79),
 ('Dr._Sanjay_Gupta', 157.49),
 ('Condoleezza_Rice', 163.29),
 ('Stephen_Jones', 163.5),
 ('Joe_Johns', 165.69),
 ('Michael_Holmes', 169.88),
 ('Richard_Roth', 170.37),
 ('Jeff_Koinange', 173.48),
 ('Dan_Rivers', 174.62),
 ('Rep._Dennis_Hastert', 174.92),
 ('Betty_Nguyen', 178.35),
 ('Carol_Lin', 181.16),
 ('Kelli_Arena', 184.9),
 ('Randi_Kaye', 191.83),
 ('BUSH', 201.28),
 ('Commissioner_Jeffrey_Miller', 209.14),
 ('Jack_Cafferty', 209.61),
 ('William_Schneider', 221.1),
 ('Kathleen_Koch', 226.2),
 ('Jeanne_Moos', 234.42),
 ('Candy_Crowley', 237.73),
 ('Bob_Woodward', 247.92),
 ('Mary_Snow', 248.77),
 ('PHILLIPS', 248.85),
 ('Suzanne_Malveaux', 256.8),
 ('Zain_Verjee', 267.51),
 ('AIKEN', 267.82),
 ('Tony_Snow', 275.78),
 ('Fredricka_Whitfield', 286.04),
 ('Ralitsa_Vassileva', 290.85),
 ('Jim_Clancy', 295.96),
 ('Allan_Chernoff', 329.81),
 ('UNIDENTIFIED_FEMALE', 331.54),
 ('QUESTION', 335.44),
 ('Barbara_Starr', 366.53),
 ('George_W._Bush', 438.1),
 ('Lou_Dobbs', 520.35),
 ('Brian_Todd', 583.43),
 ('Heidi_Collins', 595.06),
 ('Andrea_Koppel', 653.74),
 ('UNIDENTIFIED_MALE', 654.75),
 ('Paula_Zahn', 681.03),
 ('Jamie_Mcintyre', 700.49),
 ('John_King', 716.29),
 ('Dana_Bash', 803.44),
 ('Tony_Harris', 816.82),
 ('John_Roberts', 842.25),
 ('Larry_King', 999.86),
 ('Kyra_Phillips', 1109.67),
 ('Anderson_Cooper', 1260.31),
 ('Don_Lemon', 1355.33),
 ('Wolf_Blitzer', 1627.37)]

Based on the statistics computed in the last step, I could specify criteria for selecting which speakers to enroll. select-speaker.py implements this: it takes stats.json as input and produces a list of speakers that meet the specified criteria. For this round of testing, the criteria were that the total duration of non-outlier clips must be at least one minute and the number of non-outlier clips must be more than 10. Usage example of select-speaker.py:

python select-speaker.py -s stats.json -o enrollment_list.json 
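The filtering itself is a simple pass over the stats records; a sketch, reusing the illustrative field names from above:

import json

def qualified_speakers(stats, min_clean_sec=60.0, min_clean_clips=10):
    """Apply the enrollment criteria to the per-speaker records."""
    return [spk for spk, s in stats.items()
            if s['clean_duration'] >= min_clean_sec
            and s['n_clean_clips'] > min_clean_clips]

with open('stats.json') as f:
    stats = json.load(f)
with open('enrollment_list.json', 'w') as f:
    json.dump(qualified_speakers(stats), f)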


I applied the above criteria to the training and testing datasets respectively; the overlap between the two resulting lists of qualified speakers consists of the following 57 speakers:
['Michael_Holmes',
 'Amy_Walter',
 'Andy_Serwer',
 'Jamie_Mcintyre',
 'John_Roberts',
 'Jeanne_Moos',
 'Dana_Bash',
 'Heidi_Collins',
 'Howard_Kurtz',
 'Mary_Snow',
 'Tony_Snow',
 'Arwa_Damon',
 'Donald_Rumsfeld',
 'Delia_Gallagher',
 'Richard_Roth',
 'Susan_Candiotti',
 'Allan_Chernoff',
 'Bay_Buchanan',
 'Jim_Clancy',
 'Kathleen_Koch',
 'William_Schneider',
 'Michael_Ware',
 'Rusty_Dornin',
 'Jason_Carroll',
 'Joe_Johns',
 'Gerri_Willis',
 'George_W._Bush',
 'Barbara_Starr',
 'Larry_King',
 'Drew_Griffin',
 'Randi_Kaye',
 'Kyra_Phillips',
 'Lou_Dobbs',
 'Gary_Tuchman',
 'Andrea_Koppel',
 'Dr._Sanjay_Gupta',
 'David_Gergen',
 'Zain_Verjee',
 'Anderson_Cooper',
 'Don_Lemon',
 'Jack_Cafferty',
 'Tony_Harris',
 'Ralitsa_Vassileva',
 'Suzanne_Malveaux',
 'Dan_Simon',
 'Keith_Oppenheim',
 'Betty_Nguyen',
 'Wolf_Blitzer',
 'Brian_Todd',
 'John_King',
 'Fredricka_Whitfield',
 'John_Zarrella',
 'John_Bolton',
 'Candy_Crowley',
 'Paula_Zahn',
 'Kelli_Arena',
 'Carol_Lin']

These speakers were used to train and test the speaker recognizer. For training, I wrote a python script build_recognizer.py: given a list of speakers, it loops through the list, trains a model for each speaker, and adds it to the system; outlier audio clips are excluded from training, as sketched below.
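The enrollment loop looks roughly like this (clips_for() and is_outlier() are illustrative helpers, and the final persistence call is an assumption):

recognizer = SR.GMMRec()
for speaker in enrollment_list:
    for clip in clips_for(speaker):   # illustrative helper
        if not is_outlier(clip):      # skip clips flagged by clean-train.sh
            recognizer.enroll_file(speaker, clip)
recognizer.train()
recognizer.dump(model_path)           # assumed: save the trained models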

Finally, the trained recognizer was tested on the audio clips from these speakers in the testing dataset. This process is implemented in test_recognizer.py, which writes the test results as a list of (clip name, predicted name) pairs into test_results.json and prints the overall accuracy:
"2139 out of 2656 clips are correctly recognized!"
which is approximately 80.53%.

Below are the usage examples of build_recognizer.py and test_recognizer.py:
python build_recognizer.py -d $TRAIN -s $spk_train -o $OUTDIR

python test_recognizer.py -d $TEST -s $spk_test -m $model
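In sketch form, the test loop simply compares predictions against the labels (predict() is assumed to return the best-matching enrolled name for a clip):

import json

def evaluate(recognizer, test_clips):
    """test_clips: list of (clip path, true speaker) pairs."""
    results = [(clip, recognizer.predict(clip)) for clip, _ in test_clips]
    correct = sum(pred == truth for (_, pred), (_, truth)
                  in zip(results, test_clips))
    print("%d out of %d clips are correctly recognized!"
          % (correct, len(test_clips)))
    with open('test_results.json', 'w') as f:
        json.dump(results, f)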

Saturday, July 8, 2017

Week 5: Main Script for Training Data Preparation

Now all the steps required for extracting training clips from a video file are finished, and it is time to actually run these processes on a large number of video files from cartago to prepare training data for our speaker recognition system. To this end, I assembled all the solutions developed so far into one main script.

Given a news file name, the main script implements the following process (a Python sketch of the driver follows the list):

1. Get the news video and tpt files from cartago. (script: get.sh )

2. Strip off redundant information from the tpt file and mark speaker turns by ">>" to get the transcript. (script: strip-tpt.sh, output: .chevron.tpt)

3. Feed the video and transcript to Gentle to produce the alignment file. (script: gentle.sh, output: .align.jsonl)

4. Extract speaker occurrences from the tpt file and save them into a speaker list file. (script: command line, output: .speaker.list)

5. Convert alignment file to alignment file with speaker turns and sentence boundaries. (script: align2spk.py, output: .align.spk)

6. Based on the speaker turns and sentence boundaries, extract an audio clip from the video for every occurrence of a speaker. (script: spk_extract.py, output: .wav)

7. Remove the input files, including .mp4, .tpt, .chevron.tpt, .speaker.list

8. Send some results back to cartago. (script: put.sh, including .align.spk, .align.jsonl)
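A Python rendering of the driver might look like the following (the real main script is a shell script; the argument conventions here are assumptions):

import subprocess
import sys

def process(news_file):
    subprocess.check_call(['./get.sh', news_file])        # 1. fetch video + tpt
    subprocess.check_call(['./strip-tpt.sh', news_file])  # 2. -> .chevron.tpt
    subprocess.check_call(['./gentle.sh', news_file])     # 3. -> .align.jsonl
    # 4. the speaker list is extracted with a command line -> .speaker.list
    subprocess.check_call(['python', 'align2spk.py', news_file])    # 5. -> .align.spk
    subprocess.check_call(['python', 'spk_extract.py', news_file])  # 6. -> .wav clips
    # 7. remove the input files, then 8. send results back:
    subprocess.check_call(['./put.sh', news_file])

if __name__ == '__main__':
    process(sys.argv[1])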

I then applied this process to a list of news files (so far only those from Sept., Oct., and Nov. 2006 have been processed). For this, I also wrote a SLURM batch file to submit the job to Case HPC. I later learned that it is possible to parallelize such a job via SLURM job arrays; that method will be adopted when processing the remaining news files.

Not surprisingly, the main script ran into many problems while processing these news files: many assumptions I made while developing the earlier components were based on a few sample files chosen randomly from the news archive, and they do not hold for all news files. Thus, this data collection procedure also played an important role in debugging my code. Below is a list of all the bugs detected during the process, to be updated whenever a new one is encountered:

1. Non-UTF-8 characters appeared in some tpt files, which made Gentle crash. Fix: I added a command line step to strip non-UTF-8 characters.

2. ") :" was used as the landmark for detecting speaker turns, and this raised an exception in one file. Fix: switched to "NER01|Person" as the landmark.

3. "#" was used to denote sentence boundaries, but in one tpt file "#" also occurred inside the transcript text. Fix: replace "#" with "$" in the transcript.

4. "V/O" occurred as a speaker name, and a directory with this name could not be created because of the "/". Fix: replace any "/" in a speaker name with "_".

5. Speaker names may contain quotes (" and '), which causes trouble later during outlier detection. Fix: TODO.