Monday, August 7, 2017

Week 9: Building an i-vector extractor with bob.kaldi

Over the last decade, the Gaussian mixture model with universal background model (GMM-UBM) framework has demonstrated strong performance in speaker verification [1]. It is commonly believed that the mean vectors of the GMMs capture most of the speaker-specific characteristics. Extending the GMM-UBM framework, factor analysis (FA) techniques [2, 3] attempt to model the speaker components jointly: each speaker is represented by a mean supervector that is a linear combination of a set of eigenvoices. Building on FA, joint factor analysis (JFA) [4, 5] decomposes the GMM supervector into a speaker component S and a channel component C. Inspired by JFA, Dehak et al. [7] proposed merging the speaker and channel spaces into a single low-dimensional space, named the total factor space (also called the total variability space). In this space, each utterance is represented by a low-dimensional feature vector termed an i-vector. The i-vector idea opened a new era in the analysis of speaker and session variability.
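To make the total factor space a bit more concrete, here is a minimal NumPy sketch of the standard i-vector point estimate. The utterance supervector is modelled as M = m + Tw, with w ~ N(0, I), and the i-vector is the posterior mean of w given the utterance's Baum-Welch statistics collected against a diagonal-covariance UBM: w = (I + Tᵀ Σ⁻¹ N T)⁻¹ Tᵀ Σ⁻¹ F̃. This is not how Kaldi or bob.kaldi implement it internally. The function names are hypothetical, the total variability matrix T is assumed to be already trained, and the random parameters in the usage part are placeholders.

```python
import numpy as np


def gmm_posteriors(frames, weights, means, variances):
    """Frame-level component posteriors for a diagonal-covariance GMM (the UBM).

    frames: (n_frames, feat_dim); weights: (n_comp,);
    means, variances: (n_comp, feat_dim).
    """
    diff = frames[:, None, :] - means[None, :, :]                  # (n, C, F)
    log_det = np.sum(np.log(2.0 * np.pi * variances), axis=1)      # (C,)
    log_lik = -0.5 * (log_det[None, :]
                      + np.sum(diff ** 2 / variances[None, :, :], axis=2))
    log_post = np.log(weights)[None, :] + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)                # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)


def baum_welch_stats(frames, weights, means, variances):
    """Zeroth-order stats N_c and UBM-centered first-order stats F_c."""
    post = gmm_posteriors(frames, weights, means, variances)       # (n, C)
    n_stats = post.sum(axis=0)                                     # (C,)
    f_stats = post.T @ frames - n_stats[:, None] * means           # (C, F)
    return n_stats, f_stats


def extract_ivector(frames, weights, means, variances, T):
    """Posterior mean of w in M = m + T w.

    T: (C * F, R) total variability matrix, rows ordered component by component.
    """
    n_comp, feat_dim = means.shape
    rank = T.shape[1]
    n_stats, f_stats = baum_welch_stats(frames, weights, means, variances)
    T_blocks = T.reshape(n_comp, feat_dim, rank)                   # T_c: (F, R)
    precision = np.eye(rank)                                       # I + sum_c N_c T_c' S_c^-1 T_c
    linear = np.zeros(rank)                                        # sum_c T_c' S_c^-1 F_c
    for c in range(n_comp):
        Tc_scaled = T_blocks[c] / variances[c][:, None]            # S_c^-1 T_c (diagonal S_c)
        precision += n_stats[c] * T_blocks[c].T @ Tc_scaled
        linear += Tc_scaled.T @ f_stats[c]
    return np.linalg.solve(precision, linear)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    C, F, R = 4, 3, 2                       # tiny placeholder sizes
    weights = np.full(C, 1.0 / C)
    means = rng.normal(size=(C, F))
    variances = np.ones((C, F))
    T = rng.normal(scale=0.1, size=(C * F, R))
    frames = rng.normal(size=(50, F))       # one "utterance" of 50 frames
    w = extract_ivector(frames, weights, means, variances, T)
    print(w.shape)                          # (2,)
```

Real systems use hundreds of Gaussians and an i-vector dimension of a few hundred, but the computation per utterance is the same: collect statistics against the UBM, then solve one small R × R linear system.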

In recent years, the i-vector approach has emerged as the state of the art for speaker recognition. Many popular speech recognition libraries, such as Kaldi, offer APIs for training i-vector extractors; see here for a comprehensive overview of available open-source tools. To avoid reinventing the wheel, I decided to use one of the existing libraries. Although I admire the quality of Kaldi, I prefer working with a Pythonic API, since most of the RedHen libraries are wrapped in Python. Luckily, I found bob.kaldi, a bob package that seamlessly integrates Kaldi functionality with Python-based workflows.

To install bob.kaldi, first install Miniconda from here, then follow the bob installation instructions here. Finally, bob.kaldi can be installed with:
conda install bob.kaldi

To activate the virtual environment for this package on my Case HPC account, run the following command:

source activate bob_py3 
I have written a Python script, build_model.py, that trains an i-vector extractor with the functions provided in bob.kaldi. A typical invocation looks like this:
python build_model.py -d ubm_data.json
where ubm_data.json is a JSON file containing a list of training samples for the UBM; such a list can be produced with the Python script described in last week's post. We need this file because training an i-vector extractor requires a pre-trained UBM.
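The post does not show build_model.py itself, so the following is only a hypothetical sketch of its command-line front end, using the standard argparse and json modules. It accepts the -d option from the invocation above and loads ubm_data.json, which is assumed here to be a flat JSON list of file paths. The long option name --data and the helper names are my own inventions, and the actual bob.kaldi training calls are deliberately left out rather than guessed.

```python
import argparse
import json


def parse_args(argv=None):
    # Hypothetical CLI: only the -d flag comes from the post; --data is an assumed alias.
    parser = argparse.ArgumentParser(description="Train an i-vector extractor (sketch).")
    parser.add_argument("-d", "--data", required=True,
                        help="JSON file listing the UBM training samples")
    return parser.parse_args(argv)


def load_sample_list(path):
    """Load the training list; assumes a top-level JSON list of strings (file paths)."""
    with open(path, encoding="utf-8") as fh:
        samples = json.load(fh)
    if not isinstance(samples, list) or not all(isinstance(s, str) for s in samples):
        raise ValueError(f"{path} must contain a JSON list of strings")
    return samples


def main(argv=None):
    args = parse_args(argv)
    samples = load_sample_list(args.data)
    print(f"Loaded {len(samples)} training samples from {args.data}")
    # Feature extraction, UBM training and i-vector extractor training
    # (e.g. with bob.kaldi) would follow here; omitted in this sketch.
    return samples
```

In a real script, main() would be called at the bottom of the file under the usual if __name__ == "__main__": guard, so that python build_model.py -d ubm_data.json picks up the arguments from sys.argv.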

Note that the function bob.kaldi.ivector_train() accepts the features of multiple utterances as a 3D array. I found that odd, since two utterances may have different durations and therefore different numbers of frames. Padding them to the same length did not seem like an elegant solution to me, so I replaced the condition feats.ndim == 3 in the source code with isinstance(feats, list); now the features can be passed in as a list.
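Why a list is the natural container, and why the type test should use isinstance, can be shown with a small helper. This is a hypothetical sketch, not the bob.kaldi source: as_utterance_list is my own name, and it simply normalizes either a list of per-utterance (frames × dims) arrays or a 3D array into a list of 2D arrays. Note that a test like feats is list compares the object with the list type itself and is therefore always False for an actual list of features.

```python
import numpy as np


def as_utterance_list(feats):
    """Normalize features to a list of 2D (n_frames, n_dims) arrays.

    Accepts either a list of per-utterance arrays (frame counts may differ)
    or a single 3D array of equal-length utterances.
    """
    if isinstance(feats, list):          # `feats is list` would always be False here
        utts = [np.asarray(u, dtype=np.float64) for u in feats]
    elif isinstance(feats, np.ndarray) and feats.ndim == 3:
        utts = list(feats)               # split along the utterance axis
    else:
        raise TypeError("expected a list of 2D arrays or a 3D numpy array")
    dims = {u.shape[1] for u in utts if u.ndim == 2}
    if any(u.ndim != 2 for u in utts) or len(dims) > 1:
        raise ValueError("every utterance must be 2D with the same feature dimension")
    return utts


rng = np.random.default_rng(0)
long_utt = rng.normal(size=(120, 20))    # placeholder: 120 frames of 20-dim features
short_utt = rng.normal(size=(87, 20))    # placeholder: 87 frames of 20-dim features
try:
    np.stack([long_utt, short_utt])      # a 3D array needs equal frame counts
except ValueError as err:
    print("cannot stack:", err)
utts = as_utterance_list([long_utt, short_utt])
print([u.shape for u in utts])           # [(120, 20), (87, 20)]
```

Keeping utterances as separate arrays avoids both padding and the extra frames it would feed into the statistics, at the cost of a Python-level loop over utterances.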
