The entire project consists of the following parts:
- Training Data Collection from CNN News Videos
- Outlier Detection within Training Set
- Building and Testing Speaker Recognition System
- New Speaker Recognition Module
All source code can be found in this RedHen GitHub repository.
Training Data Collection from CNN News Videos
The main script for this step is collect-train.sh; however, it only extracts audio clips from one video file. To process a list of videos, use process-list.sh, and to submit it as a job on Case HPC, use data-prep.slurm. Collected audio samples will be placed in output/audio/(speakername)/, together with a file called clip_list.txt that contains info (starting time, duration, title, etc.) about these samples. Detailed descriptions and usage examples for these scripts and other relevant files are provided below.
collect-train.sh
Description: the main script for collecting training data. Given one file name, it extracts all audio clips with speaker tags from the corresponding news video.
Usage:
./collect-train.sh 2006-09-29_0100_US_CNN_Larry_King_Live.tpt
process-list.sh
Description: run the main script collect-train.sh on a list of tpt files; an example list file is given in the usage case.
Usage:
./process-list.sh tmp-list
data-prep.slurm
Description: submit process-list.sh as a job to the Case HPC SLURM scheduler
Usage:
sbatch data-prep.slurm
Outlier Detection within Training Set
The training data collected through the previous step might contain mislabeled samples. To detect these outliers, we train a probabilistic model for each speaker, then score every audio clip with this model; samples with low log likelihood can be considered outliers. The computed likelihood is recorded in the clip_list.txt file. Outliers won't be deleted, but they will be skipped in later processing. One can set a threshold to specify which range of likelihoods is considered acceptable.
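For intuition, here is a minimal sketch of this scoring idea, assuming one feature array per clip; the model size and threshold are illustrative assumptions, not the actual settings of detect_outlier.py:

import numpy as np
from sklearn.mixture import GaussianMixture

def score_clips(clip_features, n_components=8, threshold=-45.0):
    # clip_features: one (n_frames, n_dims) array per clip of a single speaker.
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(np.vstack(clip_features))               # fit on all frames pooled
    scores = [gmm.score(f) for f in clip_features]  # mean per-frame log likelihood
    outliers = [i for i, s in enumerate(scores) if s < threshold]
    return scores, outliers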
detect_outlier.py
Description: train a speaker model from a set of training samples for a speaker, then score the samples and save the likelihood info to clip_list.txt. -i specifies the folder where speaker folders are placed, and -s specifies the folder name for the speaker. It also writes statistics of each speaker's training set to a stats.json file, which includes statistics of the entire training set.
Usage:
python detect_outlier.py -i ./output/audio/ -s Jim_Clancy
clean-train.sh
Description: loop through every speaker folder and run outlier detection in each. The DATADIR variable specifies where the speaker folders are placed.
Usage:
./clean-train.sh
Building and Testing Speaker Recognition System
With the data collected in the previous steps, one can move on to build a speaker recognition system and test its performance. When new methods are developed in the future, one might build a different system; here we use the current system as an example of how one could go about testing a recognition system. Before training and testing, it is important to select, based on the statistics, which speakers in the training set are qualified for enrollment, since not all speakers have the same amount of data available.
select_speakers.py
Description: a python script to select qualified speakers based on the statistics file; the example here chooses speakers with more than one minute of training audio. -s gives the stats file name, and -o tells where to save the output.
Usage:
python select_speakers.py -s stats.json -o enrollment_list.json
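For concreteness, the selection logic might look like the following sketch, assuming stats.json maps each speaker name to a record with a duration field in seconds (the real field names may differ):

import json

with open("stats.json") as f:
    stats = json.load(f)          # per-speaker statistics from detect_outlier.py

# Keep speakers with more than one minute of training audio.
qualified = [spk for spk, info in stats.items() if info["duration"] > 60.0]

with open("enrollment_list.json", "w") as f:
    json.dump(qualified, f, indent=2)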
build_recognizer.py
Description: a python script to build a recognizer from a list of speakers; trained models for different speakers are stored separately in the corresponding speaker folders. -d points to the training data directory and -s specifies a file that contains a list of speakers.
Usage:
python build_recognizer.py -d ./output/audio -s overlap.json
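Roughly, this step amounts to the following sketch, built on the module API documented below; the JSON layout, clip names, and model paths are illustrative assumptions:

import json
import AudioPipe.speaker.recognition as SR

with open("overlap.json") as f:
    speakers = json.load(f)       # assumed: a flat list of speaker names

rec = SR.GMMRec()
for spk in speakers:
    # Hypothetical clip and model paths inside each speaker's folder.
    rec.enroll_file(spk, f"./output/audio/{spk}/clip_000.wav",
                    model=f"./output/audio/{spk}/{spk}.model")
rec.train()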
test_recognizer.py
Description: a python script to test the trained recognizer; it takes a list of speakers and reconstructs a recognizer by loading the individual models for these speakers. -d gives the testing data directory, -s specifies a file that contains a list of potential speakers, and -m points to the directory where models for the speakers are stored.
Usage:
python test_recognizer.py -d $TEST -s $spk_test -m $TRAIN
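The test loop might be sketched as follows, reusing the GMMRec API from the module section below; the speaker names, file paths, and label/clip pairing are illustrative assumptions:

import AudioPipe.speaker.recognition as SR

# Rebuild a recognizer from per-speaker models, then score labeled test clips.
rec = SR.GMMRec()
for spk in ["Jim_Clancy", "Larry_King"]:                      # hypothetical speakers
    rec.enroll_file(spk, f"train/{spk}.wav", model=f"models/{spk}.model")
rec.train()

test_pairs = [("Jim_Clancy", "test/Jim_Clancy/clip_001.wav")] # hypothetical test data
correct = sum(rec.predict(rec.get_mfcc(wav))[0] == spk for spk, wav in test_pairs)
print("accuracy:", correct / len(test_pairs))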
train_test.sh
Description: a script that runs the whole training and testing process end to end.
Usage:
./train_test.sh
New Speaker Recognition Module
I also made many changes to the original speaker recognition python module so that it is suitable for large-scale news datasets and tailored for short audio clips. These changes include UBM training, more powerful features offered by librosa (see the sketch after the usage example below), a compositional construction method for recognizers, larger model capacity, and so on. The following module files were modified:
Description: a python module file for creating a set of GMM models using scikit-learn; the default number of GMM components is 64.
recognition.py
Description: a python module file for speaker recognition based on GMM.
Usage: Instantiate a recognizer, enroll a speaker and train the model:
import AudioPipe.speaker.recognition as SR
Recognizer = SR.GMMRec()
Recognizer.enroll_file(spk_name, audio_file, model=model_fn)  # placeholders: speaker name, training clip, model file
Recognizer.train()
id, llhd = Recognizer.predict(Recognizer.get_mfcc(audio_test))
where id is the predicted identity of the speaker and llhd is the log likelihood.
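As a sketch of the richer librosa features mentioned above, a feature extractor might stack MFCCs with their deltas; the parameters here are illustrative assumptions, not the module's actual settings:

import librosa
import numpy as np

def mfcc_with_deltas(path, sr=16000, n_mfcc=13):
    # Load audio at a fixed sample rate, then stack MFCCs with their deltas.
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)
    return np.vstack([mfcc, delta]).T     # shape: (n_frames, 2 * n_mfcc)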
ubm_data.py
Description: a python script for selecting a dataset to train a Universal Background Model (UBM). The training can be done in the same way as training a GMM for one speaker, except that it is usually recommended to use many more components for a UBM. -s is the stats file, -o is the output (a list of audio clips for training), and -d specifies where these clips are saved.
Usage:
python ubm_data.py -s stats.json -o ubm_data.json -d ./output/audio
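A minimal UBM-training sketch following the description above, assuming ubm_data.json contains a flat list of clip paths (the actual format may differ):

import json
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

with open("ubm_data.json") as f:
    clip_paths = json.load(f)     # assumed: a flat list of audio clip paths

def mfcc(path):
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T

features = np.vstack([mfcc(p) for p in clip_paths])
# Use many more components than a per-speaker model, as recommended above.
ubm = GaussianMixture(n_components=256, covariance_type="diag").fit(features)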
Auxiliary Files
The following files are either used as intermediate functions by the files above, or offer additional functionality that might come in handy when manipulating data.
get.sh
Description: get video and tpt files from Cartago
Usage:
./get.sh 2006-09-29_0100_US_CNN_Larry_King_Live.tpt
strip-tpt.sh
Description: strip off redundant info from the tpt file and place the output in a folder specified by the second argument.
Usage:
./strip-tpt.sh $FIL.tpt $OUTDIR
gentle.sh
Description: run Gentle alignment, preserving the speaker information during the process. The first argument specifies the file name (without extension), whereas the second one specifies the extension of the stripped tpt file.
Usage:
./gentle.sh $FIL chevron.tpt
align2spk.py
Description: insert speaker info back into the alignment results. -o specifies the output file, -s specifies the list of speakers, and the last argument is the alignment result file.
Usage:
python align2spk.py -o $FIL.align.spk -s $FIL.speaker.list $FIL.align.json
spk_extract.py
Description: extract audio clips corresponding to the tagged speakers. -i specifies the input directory, where the video file is located; -o specifies the output directory, where the extracted audio clips should be placed; -f is the file name; and -s points to the output of align2spk.py.
Usage:
python spk_extract.py -o $OUTDIR/ -i $INDIR/ -f $FIL -s $OUTDIR/spk/
put.sh
Description: put the resulting files on my Cartago; the first argument is the file and the second is a folder on my Cartago account.
Usage:
./put.sh ./output/align/${FIL}.align.jsonl align
remove_duplicate.py
Description: remove duplicated entries in clip_list.txt files. -s specifies the name of the statistics file, and -d points to the data directory.
Usage:
python remove_duplicate.py -s stats.json -d ./output/audio