Saturday, July 8, 2017

Week 5: Main Script for Training Data Preparation

Now all the steps required for extracting training clips from a video file are finished. It is time to actually run these processes on a large number of video files from cartago to prepare training data for our speaker recognition system. To this end, I assembled all the solutions so far into one main script.

Given a news file name, the main script implements the following process (a sketch of how the steps chain together appears after the list):

1. Get the news video and tpt files from cartago. (script: get.sh)

2. Strip redundant information from the tpt file and mark speaker turns with ">>" to get the transcript. (script: strip-tpt.sh, output: .chevron.tpt)

3. Feed the video and transcript to Gentle to produce the alignment file. (script: gentle.sh, output: .align.jsonl)

4. Extract speaker occurrences from the tpt file and save them to a speaker list file. (script: a one-line command, output: .speaker.list)

5. Convert the alignment file into one annotated with speaker turns and sentence boundaries. (script: align2spk.py, output: .align.spk)

6. Based on the speaker turns and sentence boundaries, extract an audio clip from the video for every occurrence of a speaker. (script: spk_extract.py, output: .wav)

7. Remove the input files: .mp4, .tpt, .chevron.tpt, .speaker.list.

8. Send some of the results back to cartago. (script: put.sh; files sent: .align.spk, .align.jsonl)
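To make the flow concrete, here is a minimal sketch of how the main script might chain these steps together. The helper script names and output extensions come from the list above, but the exact arguments, the step-4 command, and the file naming scheme are assumptions for illustration, not the actual code.

    #!/bin/bash
    # Sketch of the main script; arguments and naming scheme are assumed.
    set -e
    base=$1   # news file name without extension

    ./get.sh "$base"                               # 1. fetch $base.mp4 and $base.tpt
    ./strip-tpt.sh "$base.tpt"                     # 2. -> $base.chevron.tpt
    ./gentle.sh "$base.mp4" "$base.chevron.tpt"    # 3. -> $base.align.jsonl

    # 4. collect speaker occurrences; the grep pattern is a guess based on
    # the "NER01|Person" landmark mentioned in the bug list below
    grep 'NER01|Person' "$base.tpt" > "$base.speaker.list"

    python align2spk.py "$base.align.jsonl" "$base.speaker.list" \
        > "$base.align.spk"                        # 5. -> $base.align.spk
    python spk_extract.py "$base.align.spk" "$base.mp4"   # 6. -> per-speaker .wav clips

    rm -f "$base.mp4" "$base.tpt" "$base.chevron.tpt" "$base.speaker.list"  # 7. clean up inputs
    ./put.sh "$base.align.spk" "$base.align.jsonl" # 8. send results back to cartago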

I then applied this process to a list of news files (so far only those from September, October, and November 2006 have been processed). For this I also wrote a SLURM batch file to submit the job to Case HPC. Later I learned that it is possible to parallelize such a job via SLURM job arrays; I will adopt that method when processing the remaining news files. A sketch of what the job-array version could look like is below.
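This is only a placeholder, assuming a hypothetical file list news_files.txt (one news file name per line) and a main script named main.sh; the resource limits would need tuning.

    #!/bin/bash
    #SBATCH --job-name=spk-prep
    #SBATCH --array=0-99          # one array task per news file
    #SBATCH --time=02:00:00
    #SBATCH --mem=8G

    # pick the news file for this array task and run the pipeline on it
    readarray -t files < news_files.txt
    ./main.sh "${files[$SLURM_ARRAY_TASK_ID]}"

With job arrays, SLURM schedules each file as an independent task, so a failed file can be resubmitted alone instead of rerunning the whole list.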

Not surprisingly, the main script ran into many problems while processing these news files: many assumptions I made when developing the earlier components were based on a few sample files chosen randomly from the news archive, and they do not hold for all news files. This data collection run therefore also served as a debugging pass over my code. Below is a list of all the bugs detected so far; it will be updated whenever a new one is encountered:

1. Non-UTF-8 characters appear in some tpt files, which made Gentle crash. Fix: added a command line to remove non-UTF-8 characters (see the cleanup sketch after this list).

2. ") :" was used as a landmark for detecting speaker turns, an exception occurred in one file. Fix: Now switch to "NER01|Person" for landmarks

3. "#" was used to denote sentence boundary, but in one tpt "#" occurred also in transcript. Fix: replace "#" by "$" in transcript

4. "V/O" occurred as speaker name, could not create directory with this name due to "/". Fix: replace any "/" in speaker name to "_".

5. Speaker names may contain quotes (" and '), which causes trouble later during outlier detection. Fix: TODO (one possible cleanup is included in the sketch after this list).
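Several of these fixes come down to character cleanup. Here is a minimal sketch of fixes 1, 4, and 5, assuming $base holds the file stem and $speaker a raw speaker name; mapping quotes to "_" for fix 5 is just one possible choice, not a decision I have made yet.

    # Fix 1: iconv's -c flag silently drops characters that are not valid
    # UTF-8, so the output is guaranteed to be clean UTF-8.
    iconv -f utf-8 -t utf-8 -c "$base.tpt" > "$base.clean.tpt"

    # Fixes 4 and 5: sanitize a speaker name before using it as a directory
    # name, replacing "/" and both kinds of quotes with "_".
    safe_name=$(echo "$speaker" | tr "/\"'" "___")
    mkdir -p "clips/$safe_name"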

