Because these datasets might include mislabeled audio clips, I first ran the outlier detection algorithm to clean them up. To this end, I wrote a script, clean-train.sh, that goes through every speaker in the dataset and marks each clip with its log likelihood and z-score.
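To make the z-score marking concrete, here is a minimal Python sketch of the idea, not the actual contents of clean-train.sh. It assumes each clip already has a log likelihood under its speaker's model, and that a clip counts as an outlier when its z-score falls below a negative cutoff. The function name mark_outliers and the threshold value are hypothetical.

```python
from statistics import mean, pstdev


def mark_outliers(log_likelihoods, threshold=2.0):
    """Compute z-scores for one speaker's clips and flag likely outliers.

    log_likelihoods: dict mapping clip name -> log likelihood (assumed input).
    threshold: hypothetical cutoff; clips with z < -threshold are flagged.
    Returns (z_scores, outlier_flags), both dicts keyed by clip name.
    """
    if not log_likelihoods:
        return {}, {}
    values = list(log_likelihoods.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation over this speaker
    z_scores = {}
    for clip, ll in log_likelihoods.items():
        # If all clips score identically, no clip stands out.
        z_scores[clip] = 0.0 if sigma == 0 else (ll - mu) / sigma
    # Assumption: only unusually LOW likelihoods indicate a mislabeled clip.
    flags = {clip: z < -threshold for clip, z in z_scores.items()}
    return z_scores, flags
```

Note that with only a handful of clips per speaker, a single bad clip cannot reach a very large z-score, so a fixed cutoff works best for speakers with many clips.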
Furthermore, not all speakers that appear in the datasets should be enrolled in our speaker recognition system, for various reasons (not enough training data, or an unidentifiable speaker name such as 'ANNOUNCER' or 'ME'), so we need information to help us decide which speakers to select. For this purpose, I computed speaker-related statistics for each dataset and stored them in a JSON file called stats.json, which is used later to decide which speakers to enroll in the recognition system. The information collected in stats.json includes, for each speaker, the number of clips, the total audio duration, the number of non-outlier clips, and the total duration of non-outlier clips.
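As an illustration, the four per-speaker statistics could be accumulated roughly as follows. This is a sketch under assumptions: the real layout of stats.json is not shown in this post, so the key names (num_clips, total_duration, num_clean_clips, clean_duration) and the input record format are hypothetical.

```python
import json
from collections import defaultdict


def compute_stats(clips):
    """Aggregate per-speaker statistics.

    clips: iterable of dicts with hypothetical keys
           'speaker' (str), 'duration' (seconds), 'is_outlier' (bool).
    """
    stats = defaultdict(lambda: {
        "num_clips": 0,
        "total_duration": 0.0,
        "num_clean_clips": 0,    # non-outlier clips
        "clean_duration": 0.0,   # total duration of non-outlier clips
    })
    for clip in clips:
        entry = stats[clip["speaker"]]
        entry["num_clips"] += 1
        entry["total_duration"] += clip["duration"]
        if not clip["is_outlier"]:
            entry["num_clean_clips"] += 1
            entry["clean_duration"] += clip["duration"]
    return dict(stats)


def save_stats(stats, path):
    """Write the statistics to a JSON file such as stats.json."""
    with open(path, "w") as f:
        json.dump(stats, f, indent=2)
```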
Here is a list of the top 100 speakers in the testing dataset, sorted in ascending order by the total duration (in seconds) of each speaker's non-outlier clips:
[('Bronwyn_Adcock', 59.4), ('Ross_Perot', 59.510000000000005), ('Rick_Sanchez', 59.98000000000001), ('Gary_Tuchman', 60.69), ('Jonathan_Freed', 61.19), ('Rep._Ray_Lahood', 62.84), ('Bay_Buchanan', 62.99999999999999), ('Mayor_Keith_Weatherly', 64.61), ('Bill_Tucker', 64.89), ('Rosemary_Church', 65.67), ('Cal_Perry', 66.53), ('Dennis_Hastert', 69.28), ('Paula_Newton', 70.46000000000001), ('KOCH', 70.64999999999999), ('Paul_Weyrich', 71.0), ('Dan_Simon', 71.96), ('J.C._Watts', 72.39), ('Donald_Rumsfeld', 72.88000000000001), ('David_Roth', 76.03999999999999), ('Aaron_Meyer', 76.07), ('Kitty_Pilgrim', 76.89999999999999), ('ONAR', 78.99), ('Ed_Henry', 79.67999999999999), ('Lewis_Black', 80.04), ('Doro_Bush_Koch', 80.05), ('Melanie_Sloan', 81.61999999999999), ('Andy_Serwer', 84.96), ('Amy_Walter', 85.35), ('Michael_Ware', 86.83), ('Howard_Kurtz', 91.64000000000001), ('Rusty_Dornin', 93.59), ('Bill_Maher', 94.15), ('Susan_Candiotti', 97.88000000000001), ('Gerri_Willis', 99.65), ('David_Albright', 107.69999999999999), ('Stuart_Rothenberg', 108.06), ('Sen._John_Warner', 110.46000000000001), ('NEWTON-JOHN', 115.9), ('David_Gergen', 116.35000000000001), ('John_Zarrella', 120.47), ('Jason_Carroll', 123.95), ('Drew_Griffin', 125.84), ('ANNOUNCER', 128.05), ('Delia_Gallagher', 131.59), ('Arwa_Damon', 138.01999999999998), ('Keith_Oppenheim', 143.53000000000003), ('John_Bolton', 145.67000000000002), ('Comm._Jeffrey_Miller', 155.91999999999996), ('ME', 156.79), ('Dr._Sanjay_Gupta', 157.49000000000004), ('Condoleezza_Rice', 163.28999999999996), ('Stephen_Jones', 163.49999999999994), ('Joe_Johns', 165.69), ('Michael_Holmes', 169.88000000000005), ('Richard_Roth', 170.37000000000003), ('Jeff_Koinange', 173.48000000000002), ('Dan_Rivers', 174.62), ('Rep._Dennis_Hastert', 174.91999999999996), ('Betty_Nguyen', 178.35000000000002), ('Carol_Lin', 181.16), ('Kelli_Arena', 184.89999999999998), ('Randi_Kaye', 191.82999999999998), ('BUSH', 201.27999999999997), ('Commissioner_Jeffrey_Miller', 
209.14), ('Jack_Cafferty', 209.60999999999999), ('William_Schneider', 221.1000000000001), ('Kathleen_Koch', 226.2), ('Jeanne_Moos', 234.41999999999993), ('Candy_Crowley', 237.73000000000002), ('Bob_Woodward', 247.91999999999993), ('Mary_Snow', 248.77), ('PHILLIPS', 248.85), ('Suzanne_Malveaux', 256.79999999999995), ('Zain_Verjee', 267.5100000000001), ('AIKEN', 267.82), ('Tony_Snow', 275.78000000000003), ('Fredricka_Whitfield', 286.03999999999996), ('Ralitsa_Vassileva', 290.8500000000001), ('Jim_Clancy', 295.96000000000004), ('Allan_Chernoff', 329.81000000000006), ('UNIDENTIFIED_FEMALE', 331.53999999999996), ('QUESTION', 335.44000000000005), ('Barbara_Starr', 366.53000000000003), ('George_W._Bush', 438.09999999999997), ('Lou_Dobbs', 520.3500000000001), ('Brian_Todd', 583.4300000000001), ('Heidi_Collins', 595.0599999999998), ('Andrea_Koppel', 653.7400000000001), ('UNIDENTIFIED_MALE', 654.7500000000001), ('Paula_Zahn', 681.0300000000003), ('Jamie_Mcintyre', 700.4899999999999), ('John_King', 716.2900000000002), ('Dana_Bash', 803.4400000000003), ('Tony_Harris', 816.8199999999998), ('John_Roberts', 842.2500000000007), ('Larry_King', 999.8599999999997), ('Kyra_Phillips', 1109.6700000000005), ('Anderson_Cooper', 1260.3100000000013), ('Don_Lemon', 1355.3300000000004), ('Wolf_Blitzer', 1627.37)]
Based on the statistics computed in the last step, I could specify the criteria for selecting speakers to enroll. select-speaker.py implements this: it takes stats.json as input and produces a list of speakers that meet the specified criteria. For this round of testing, the criteria are that the total duration of non-outlier clips should be at least 1 minute and the number of non-outlier clips should be more than 10. Usage example of select-speaker.py:
python select-speaker.py -s stats.json -o enrollment_list.json
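The two criteria can be expressed as a simple filter. The sketch below is not the code of select-speaker.py; it assumes a stats dictionary with the hypothetical keys clean_duration (seconds of non-outlier audio) and num_clean_clips, and uses the thresholds stated above (at least 60 seconds, more than 10 clips).

```python
def select_speakers(stats, min_duration=60.0, min_clips=10):
    """Return the sorted names of speakers that satisfy both criteria.

    stats: dict mapping speaker name -> dict with hypothetical keys
           'clean_duration' and 'num_clean_clips'.
    """
    selected = []
    for name, entry in stats.items():
        long_enough = entry["clean_duration"] >= min_duration   # at least 1 minute
        enough_clips = entry["num_clean_clips"] > min_clips     # strictly more than 10
        if long_enough and enough_clips:
            selected.append(name)
    return sorted(selected)
```

Generic labels such as 'ANNOUNCER' or 'UNIDENTIFIED_MALE' can pass a purely numeric filter, so they would have to be excluded separately, for example with a blocklist of names.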
I applied the above-mentioned criteria to the training and testing datasets separately, and the overlap between the two resulting lists of qualified speakers consists of the following 57 speakers:
['Michael_Holmes', 'Amy_Walter', 'Andy_Serwer', 'Jamie_Mcintyre', 'John_Roberts', 'Jeanne_Moos', 'Dana_Bash', 'Heidi_Collins', 'Howard_Kurtz', 'Mary_Snow', 'Tony_Snow', 'Arwa_Damon', 'Donald_Rumsfeld', 'Delia_Gallagher', 'Richard_Roth', 'Susan_Candiotti', 'Allan_Chernoff', 'Bay_Buchanan', 'Jim_Clancy', 'Kathleen_Koch', 'William_Schneider', 'Michael_Ware', 'Rusty_Dornin', 'Jason_Carroll', 'Joe_Johns', 'Gerri_Willis', 'George_W._Bush', 'Barbara_Starr', 'Larry_King', 'Drew_Griffin', 'Randi_Kaye', 'Kyra_Phillips', 'Lou_Dobbs', 'Gary_Tuchman', 'Andrea_Koppel', 'Dr._Sanjay_Gupta', 'David_Gergen', 'Zain_Verjee', 'Anderson_Cooper', 'Don_Lemon', 'Jack_Cafferty', 'Tony_Harris', 'Ralitsa_Vassileva', 'Suzanne_Malveaux', 'Dan_Simon', 'Keith_Oppenheim', 'Betty_Nguyen', 'Wolf_Blitzer', 'Brian_Todd', 'John_King', 'Fredricka_Whitfield', 'John_Zarrella', 'John_Bolton', 'Candy_Crowley', 'Paula_Zahn', 'Kelli_Arena', 'Carol_Lin']
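The overlap itself is a set intersection of the two selection results. A minimal sketch, assuming each selection is available as a list of speaker names (the function name common_speakers is hypothetical):

```python
def common_speakers(train_selected, test_selected):
    """Return the speakers that qualify in both datasets, sorted by name."""
    return sorted(set(train_selected) & set(test_selected))
```

Only speakers present in both lists have a trained model and clips to test it on, which is why the intersection is used rather than either list alone.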
These speakers were used to train and test the speaker recognizer. For training, I wrote a Python script, build_recognizer.py. Given a list of speakers, it loops through the list, trains a model for every speaker, and adds it to the system. Outlier audio clips are excluded from training.
Finally, the trained recognizer was tested on the audio clips from these speakers in the testing dataset. This process is implemented in test_recognizer.py, which writes the test results as a list of (clip name, predicted name) pairs to test_results.json and prints the overall accuracy:
"2139 out of 2656 clips are correctly recognized!"
which is approximately 80.53%.
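For reference, the accuracy figure can be reproduced from a list of (clip name, predicted name) pairs as sketched below. This is not the code of test_recognizer.py; in particular, the true_speaker mapping from clip name to the labeled speaker is an assumption, since the post does not say how the true label is looked up.

```python
def score(results, true_speaker):
    """Count correct predictions.

    results: list of (clip name, predicted name) pairs.
    true_speaker: hypothetical dict mapping clip name -> labeled speaker.
    Returns (number correct, total number of clips).
    """
    correct = sum(1 for clip, predicted in results
                  if true_speaker[clip] == predicted)
    return correct, len(results)


def percent(correct, total):
    """Accuracy as a percentage, rounded to two decimal places."""
    return round(100.0 * correct / total, 2) if total else 0.0


# The figures reported above: 2139 correct out of 2656 clips.
reported = percent(2139, 2656)  # 80.53
```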
Below are usage examples of build_recognizer.py and test_recognizer.py:
python build_recognizer.py -d $TRAIN -s $spk_train -o $OUTDIR
python test_recognizer.py -d $TEST -s $spk_test -m $model