After a discussion over Skype, Peter and I decided to stay with the original Gentle output format, which consists of two components: the input transcript and a list of words. Each word in the list has the attributes "startOffset" and "endOffset", which refer to its position in the original transcript. We favor this format for two main reasons: 1. It already contains all the information we need in a concise and non-redundant form, which makes it a good compromise among many different applications: any specific application can process it further to fit its own needs. 2. No changes need to be made to the Gentle code, so merge issues can be avoided even when Gentle upgrades its implementation in the future.
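To make the offset idea concrete, here is a minimal sketch of how a word can be mapped back to the transcript. The sample is hand-written for illustration, not real Gentle output, and the key names "transcript", "words", "word", "start" and "end" are assumptions about the JSON layout; only "startOffset", "endOffset" and "case" are taken from the description above.

```python
import json

# Hypothetical, hand-written miniature of an alignment result (not real output).
sample = json.loads("""
{
  "transcript": ">> Hello there. >> Hi.",
  "words": [
    {"word": "Hello", "case": "success", "start": 0.5, "end": 0.9,
     "startOffset": 3, "endOffset": 8},
    {"word": "there", "case": "not-found-in-audio",
     "startOffset": 9, "endOffset": 14},
    {"word": "Hi", "case": "success", "start": 2.1, "end": 2.4,
     "startOffset": 19, "endOffset": 21}
  ]
}
""")


def word_text(result, word):
    """Recover a word's surface form from the transcript via its character offsets."""
    return result["transcript"][word["startOffset"]:word["endOffset"]]


# Every word points back into the transcript, so nothing is stored twice.
recovered = [word_text(sample, w) for w in sample["words"]]
```

Because the words only carry offsets into the transcript, punctuation and markers such as ">>" stay available to any later processing step without changing the alignment output itself.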
As a result of this decision, I wrote a Python script, align2spk.py, to convert Gentle output to a data format suitable for the speaker recognition task, in which speaker turns and sentence boundaries are marked. Later we will need the speaker turns to extract speech segments from the audio files and collect them as training datasets for speaker recognition. The sentence boundaries will be used for segmenting audio at test time.
Speaker turns are marked in the transcript by ">>". This is because the input transcript to Gentle is the stripped tpt file, in which speaker turns are denoted by ">>", as introduced in the last post.
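As an illustration, the character positions of the ">>" markers can be collected with a simple search. This is only a sketch of the idea, not the code in align2spk.py; the function name speaker_turn_offsets is hypothetical.

```python
import re


def speaker_turn_offsets(transcript, marker=">>"):
    """Return the character offset of every speaker-turn marker in the transcript."""
    return [m.start() for m in re.finditer(re.escape(marker), transcript)]


# Hand-written example transcript with two speaker turns.
offsets = speaker_turn_offsets(">> Hello there. >> Hi.")
```

These offsets live in the same coordinate system as the "startOffset" and "endOffset" attributes of the words, so a turn can be located relative to its neighboring words.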
For sentence boundaries, I used the Python Natural Language Toolkit (nltk) to split the transcript component into sentences. Splitting a document into sentences is not an entirely trivial task, because in English sentence-ending punctuation such as "." is also used for other purposes, such as abbreviations (e.g. "U.S.", "Dr."). The tokenizer from nltk is therefore a pattern recognizer trained on a large set of corpora (see the official documentation for details).
Both speaker turns and sentence boundaries have the attributes "start" and "end", which indicate when they occur in the audio file. The start time of a speaker turn or sentence boundary is the end time of the successfully aligned word (with "case" = "success") before it, and its end time is the start time of the next successfully aligned word. If the immediately neighboring words are not successfully aligned (e.g. "case" = "not-found-in-audio"), the timestamps of the closest aligned words are used instead. In that case the turn or boundary is marked with the attribute "case" = "cautious", indicating that its timestamps may not be very reliable; otherwise it has the attribute "case" = "fine".
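The timestamp rule can be sketched roughly as follows. This is not the actual code of align2spk.py: the function name boundary_times, the word fields "start" and "end", and the handling of a boundary with no word on one side (time left as None and marked "cautious") are assumptions made for this illustration. The words are assumed to be listed in transcript order.

```python
def boundary_times(words, offset):
    """Return start/end times and a reliability flag for a boundary at a character offset."""
    # Words that end before the boundary and words that start after it.
    before = [w for w in words if w["endOffset"] <= offset]
    after = [w for w in words if w["startOffset"] >= offset]

    # Closest successfully aligned word on each side, if there is one.
    prev_ok = next((w for w in reversed(before) if w["case"] == "success"), None)
    next_ok = next((w for w in after if w["case"] == "success"), None)

    # "fine" only if the immediate neighbors themselves were aligned.
    neighbors_aligned = (
        bool(before) and before[-1] is prev_ok
        and bool(after) and after[0] is next_ok
    )
    return {
        "start": prev_ok["end"] if prev_ok else None,
        "end": next_ok["start"] if next_ok else None,
        "case": "fine" if neighbors_aligned else "cautious",
    }


# Hand-written words for the transcript ">> Hello there. >> Hi." (not real output).
words = [
    {"word": "Hello", "case": "success", "start": 0.5, "end": 0.9,
     "startOffset": 3, "endOffset": 8},
    {"word": "there", "case": "not-found-in-audio",
     "startOffset": 9, "endOffset": 14},
    {"word": "Hi", "case": "success", "start": 2.1, "end": 2.4,
     "startOffset": 19, "endOffset": 21},
]

# The second ">>" sits at character offset 16, right after the unaligned "there".
turn = boundary_times(words, 16)
```

In this example the word just before the turn was not found in the audio, so the start time falls back to the end of "Hello" and the turn is flagged as "cautious".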
An example use case for this script is given below:
python align2spk.py -o lucier.align.spk lucier.align.jsonl