At first, I clipped every sentence between two speaker turns, which produced a huge number of clips in the datasets. The problem is that many irrelevant segments were included, because not all of the audio between two speaker turns is speech from the same speaker; for example, there can be commercials between two speaker turns. To make the data collection more reliable, I therefore changed the code to extract only the first sentence after each speaker turn. This significantly reduced the number of clips collected (e.g., the number of clips for the most frequent speaker in the test news video, Jim Clancy, dropped from 482 to 11) and increased the quality of the datasets.

However, there are still wrong clips, simply because some speakers are mislabeled in the tpt file. For instance, by randomly sampling from the clips, I found that the sentence starting at 00:04:00 of the test news video 2006-10-02_1600_US_CNN_Your_World_Today_Rep.mp4 was labeled as spoken by the anchor Jim Clancy (see the screenshot of the aligned transcript below), even though the speech is actually not his.
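For reference, here is a minimal sketch of the first-sentence-per-turn idea in Python. It is not the actual code in spk_extract.py: the transcript is assumed to be already parsed into (start, end, speaker, text) tuples, and the helper names, paths, and ffmpeg options are placeholders.

import subprocess

def first_sentences_per_turn(segments):
    # Keep only the first sentence spoken after each speaker turn.
    clips = []
    prev_speaker = None
    for start, end, speaker, text in segments:
        if speaker != prev_speaker:  # a new speaker turn begins here
            clips.append((start, end, speaker, text))
        prev_speaker = speaker
    return clips

def cut_clip(video_path, start, end, out_wav):
    # Cut one audio-only clip with ffmpeg; the codec and paths are illustrative.
    subprocess.run([
        "ffmpeg", "-y",
        "-ss", str(start), "-t", str(end - start),
        "-i", video_path,
        "-vn", "-acodec", "pcm_s16le", out_wav,
    ], check=True)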
Such mislabeled training data may mislead the recognition models and hurt performance at test time. Potential solutions to this problem have been studied in machine learning, for example outlier detection and training with noisy labels. Later in this summer project, I will try these methods to clean up the training dataset.
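As one concrete illustration of the outlier-detection idea (not necessarily the method I will end up using), the sketch below flags suspicious clips for a single speaker with scikit-learn's IsolationForest. The per-clip feature vectors (e.g., averaged MFCCs) are an assumption and would come from a separate feature-extraction step.

import numpy as np
from sklearn.ensemble import IsolationForest

def flag_outlier_clips(features, contamination=0.05):
    # features: one row per clip attributed to the same speaker.
    # Returns True where a clip looks unlike the rest and may be mislabeled.
    detector = IsolationForest(contamination=contamination, random_state=0)
    return detector.fit_predict(features) == -1  # -1 means outlier

# Toy example: 11 clips, each summarized by 13 averaged MFCC coefficients.
feats = np.random.default_rng(0).normal(size=(11, 13))
print(flag_outlier_clips(feats))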
The following command gives an example of running the script:
module load ffmpeg
python spk_extract.py -o ./output/ -i ~/data/ -f 2006-10-02_1600_US_CNN_Your_World_Today -s output/spk/
-o  output directory
-i  input directory
-s  directory of the speaker turn file
-f  filename of the video to process
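For completeness, here is a sketch of how these four flags might be parsed inside spk_extract.py, assuming argparse is used; only the short flags above come from the actual usage, while the attribute names and help strings are guesses.

import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Extract per-speaker audio clips from news videos.")
    parser.add_argument("-o", dest="output_dir", required=True,
                        help="output directory")
    parser.add_argument("-i", dest="input_dir", required=True,
                        help="input directory with the video files")
    parser.add_argument("-s", dest="spk_dir", required=True,
                        help="directory of the speaker turn file")
    parser.add_argument("-f", dest="filename", required=True,
                        help="filename of the video to process")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(args.output_dir, args.input_dir, args.spk_dir, args.filename)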