Wednesday, June 21, 2017

Week3: Collect Training Data for Speaker Recognition

After we have the speaker turns and sentence boundaries, the next step is to extract audio clips for each speaker and collect them as training data for the recognition models. To do this, I wrote a Python script, spk_extract.py, which loops through the list of speakers in the transcript; whenever it encounters a new speaker, it creates a directory for that speaker and places the speaker's audio clips inside, along with a text file containing the list of clips and their timestamps.
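The core of that loop can be sketched as follows. This is a simplified illustration, not the actual spk_extract.py: the function names, the (speaker, start, end) segment format, and the clips.txt layout are all assumptions, and the ffmpeg commands are only constructed, not executed.

```python
import os
from collections import defaultdict


def group_segments(segments):
    """Group (speaker, start, end) tuples by speaker name.

    `segments` stands in for the speaker-turn list parsed from the
    transcript; the tuple format is illustrative, not the real file layout.
    """
    by_speaker = defaultdict(list)
    for speaker, start, end in segments:
        by_speaker[speaker].append((start, end))
    return dict(by_speaker)


def write_clip_commands(by_speaker, video_path, out_dir):
    """Create one directory per speaker, write a clips.txt listing each
    clip with its timestamps, and return the ffmpeg commands that would
    cut the clips from the source video."""
    commands = []
    for speaker, clips in by_speaker.items():
        spk_dir = os.path.join(out_dir, speaker.replace(" ", "_"))
        os.makedirs(spk_dir, exist_ok=True)
        with open(os.path.join(spk_dir, "clips.txt"), "w") as f:
            for i, (start, end) in enumerate(clips):
                clip = os.path.join(spk_dir, "clip_%03d.wav" % i)
                # Record the clip path and its timestamps.
                f.write("%s\t%s\t%s\n" % (clip, start, end))
                # Audio-only cut of the [start, end] window.
                commands.append(["ffmpeg", "-i", video_path,
                                 "-ss", str(start), "-to", str(end),
                                 "-vn", clip])
    return commands
```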

At first, I clipped every sentence between two speaker turns, which produced a huge number of clips. The problem is that a lot of irrelevant speech is included, because not all audio between two speaker turns comes from the same speaker; for example, there can be commercials between two speaker turns. To make the data collection more reliable, I changed the code to extract only the first sentence after each speaker turn. This significantly reduced the number of clips collected (e.g., the count for the most frequent speaker in the test news video, Jim Clancy, dropped from 482 to 11) and increased the quality of the datasets. However, some clips are still wrong, simply because the speakers are mislabeled in the tpt file. For instance, by randomly sampling from the clips, I found that the sentence starting at time 00:04:00 of the test news video 2006-10-02_1600_US_CNN_Your_World_Today_Rep.mp4 was labeled as spoken by the anchor Jim Clancy (see the screenshot of the aligned transcript below),
but the video frames clearly indicate that the speaker is Martin Geissler (see the video screenshot).
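The first-sentence filter described above can be sketched in a few lines. This is a hedged illustration: the real script works on the parsed tpt transcript, whereas here `sentences` is just an assumed list of (speaker, start, end) tuples in transcript order.

```python
def first_sentences_after_turns(sentences):
    """Keep only the first sentence after each speaker turn.

    `sentences` is a list of (speaker, start, end) tuples in transcript
    order (an assumed format). A sentence is kept exactly when its
    speaker label differs from the previous sentence's label.
    """
    kept = []
    prev_speaker = None
    for sent in sentences:
        if sent[0] != prev_speaker:
            kept.append(sent)
        prev_speaker = sent[0]
    return kept
```

Everything after the first sentence of a turn is discarded, which is what shrinks the per-speaker clip counts so sharply.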

Such mislabeled training data may mislead the recognition models and hurt performance at test time. Potential solutions to this problem have been studied in machine learning, for example outlier detection and training with noisy labels. Later in this summer project, I will try these methods to clean up the training dataset.
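As a taste of what such cleanup might look like, here is a minimal outlier-detection sketch: clips whose (hypothetical) feature vectors sit far from the speaker's mean are dropped. This is not the method I will necessarily use; the features, the distance measure, and the z-score threshold are all placeholder assumptions.

```python
import numpy as np


def filter_outlier_clips(features, z_thresh=2.5):
    """Drop clips whose feature vector is far from the speaker's centroid.

    `features` is an (n_clips, dim) array of per-clip embeddings for one
    speaker (hypothetical representation). A clip is kept when the
    z-score of its distance to the centroid is below `z_thresh`.
    Returns the kept features and the boolean keep mask.
    """
    centroid = features.mean(axis=0)
    dists = np.linalg.norm(features - centroid, axis=1)
    # Standardize the distances; the epsilon guards against zero variance.
    z = (dists - dists.mean()) / (dists.std() + 1e-8)
    keep = z < z_thresh
    return features[keep], keep
```

A mislabeled clip (e.g., Martin Geissler's voice filed under Jim Clancy) would tend to land far from the centroid and be removed, at the cost of occasionally discarding an unusual but correct clip.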

The following command gives an example of running the script:


module load ffmpeg

python spk_extract.py -o ./output/ -i ~/data/ -f 2006-10-02_1600_US_CNN_Your_World_Today -s output/spk/


-o  output directory
-i  input directory
-s  directory of the speaker turn file
-f  filename
