Unsupervised Speaker Diarization using Sparse Optimization



September 29, 2022 | Published by Md Iftekhar Tanveer, Diego Casabuena, Jussi Karlgren and Rosie Jones


What is Speaker Diarization?

Speaker diarization is the process of logging the timestamps at which various speakers take turns to talk within a piece of spoken-word audio. The figure below shows an audio timeline, annotated with the regions where different speakers were audible.

 Illustration of speaker diarization
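For instance, the output of a diarization system can be represented as a list of labeled time segments. The snippet below is purely illustrative; the timestamps and speaker labels are made up.

# Purely illustrative: a diarization result as (start_seconds, end_seconds, speaker) segments.
diarization = [
    (0.0, 12.4, "speaker_0"),
    (10.8, 25.1, "speaker_1"),   # overlaps the previous turn between 10.8s and 12.4s
    (25.1, 40.0, "speaker_0"),
]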

Challenges

Classic approaches to speaker diarization follow an unsupervised process. This approach, however, suffers from several issues. First, typical clustering methods, such as spectral clustering, assume the clusters do not overlap, which makes overlap detection an additional step and complicates the overall algorithm. Second, the total number of clusters must be pre-specified, which can be difficult. Third, the process of computing the embeddings often relies on language-specific cues, such as the transcription of the audio. Such language-dependent embeddings are difficult to scale across multiple languages.

Classic diarization approach

Supervised algorithms for diarization also exist and are trained end-to-end using labeled data. However, these algorithms have their own challenges. For example, the training dataset is often imbalanced, which can lead to bias over specific ages, genders, or races and can result in unfair predictions. To address this issue, the training dataset must be balanced over many attributes (age, gender, race, etc.) of the speakers. For diarization applications on real-world scenarios like podcasts, the training data also needs to be balanced across different categories and formats of content. Counterbalancing is difficult, because it grows the dataset exponentially. Finally, preparing a supervised training dataset can be expensive and time consuming due to the annotation of diarization data.

Proposed Solution

In this work, we propose a solution that avoids these challenges by employing a sparse optimization method. This solution is unsupervised, language-agnostic (and thus scalable), and overlap-aware. We use a voice embedding that relies on audio-specific cues only. It is also tuning-free, i.e., it does not require users to pre-specify the exact number of speakers in the audio.

In our algorithm, we preprocess the audio signal with a Voice Activity Detection (VAD) step. We use YAMNet from TensorFlow Hub [1] to detect the voiced regions in the audio. Then, we compute the VGGVox embeddings [2] for small, overlapping time segments of the audio. We call the sequence of these embeddings for a podcast the embedding signal. We formulate a sparse optimization problem to factorize the embedding signal into the corresponding embedding basis matrix and the activation matrix. The optimization problem attempts to reconstruct the embedding signal with a minimal number of distinct embeddings (basis). We also propose an iterative algorithm to solve the optimization problem.
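As a rough sketch of the VAD step, the snippet below loads the public YAMNet model from TensorFlow Hub [1] and keeps the frames where YAMNet's "Speech" score is high. The 0.5 threshold and the helper name are illustrative choices, not values from the paper.

import numpy as np
import tensorflow_hub as hub

yamnet = hub.load("https://tfhub.dev/google/yamnet/1")   # public YAMNet model [1]

def voiced_frames(waveform):
    """Return a boolean mask over YAMNet frames that likely contain speech.

    `waveform` is a mono, 16 kHz, float32 array with values in [-1, 1].
    """
    scores, embeddings, spectrogram = yamnet(waveform)
    scores = scores.numpy()          # shape: (num_frames, 521) AudioSet class scores
    speech = scores[:, 0]            # class 0 in YAMNet's class map is "Speech"
    return speech > 0.5              # illustrative threshold

# Example: one second of silence yields no voiced frames.
print(voiced_frames(np.zeros(16000, dtype=np.float32)).sum())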

Speaker diarization expressed as a matrix factorization problem
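Below is a minimal, illustrative sketch of this kind of factorization, not the solver from the paper: the embedding signal E (one row per time segment) is approximated as an activation matrix A times a basis matrix B, and an L1 penalty on A encourages using as few distinct basis embeddings as possible. The penalty weight, step size, and the simple alternating proximal-gradient updates are assumptions made for the sketch.

import numpy as np

def sparse_factorize(E, max_speakers, lam=0.1, lr=1e-2, n_iter=200, seed=0):
    """Toy sketch: factorize E (T x D) into A (T x K) @ B (K x D) with a sparse A."""
    rng = np.random.default_rng(seed)
    T, D = E.shape
    A = 0.1 * np.abs(rng.standard_normal((T, max_speakers)))
    B = 0.1 * rng.standard_normal((max_speakers, D))
    for _ in range(n_iter):
        B -= lr * (A.T @ (A @ B - E))        # gradient step on the basis
        A -= lr * ((A @ B - E) @ B.T)        # gradient step on the activations
        A = np.sign(A) * np.maximum(np.abs(A) - lr * lam, 0.0)   # L1 prox (soft-threshold)
    return A, B

# Toy usage: 100 time segments of 64-dimensional embeddings.
E = np.random.default_rng(1).standard_normal((100, 64))
A, B = sparse_factorize(E, max_speakers=10)
labels = A.argmax(axis=1)                    # dominant basis (speaker) per segment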

Advantages of the Solution

This approach is overlap-aware. The VGGVox embedding of an audio segment in which two different speakers are present is approximately equal to the weighted average of the two speakers' embeddings. This property, together with the sparsity constraint, penalizes constructing a new embedding for the overlapping regions, as shown in the following figure. As a result, the optimization algorithm models the overlapping regions as a linear combination of the two existing embeddings.

Illustration of overlap-awareness
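The toy numbers below illustrate this property under an assumed equal 50/50 mix: the overlapped segment's embedding lies in the span of the two existing speaker embeddings, so the existing basis already explains it and the sparsity penalty never pays for a third, overlap-only basis vector.

import numpy as np

rng = np.random.default_rng(0)
spk_1 = rng.standard_normal(64)               # embedding of speaker 1
spk_2 = rng.standard_normal(64)               # embedding of speaker 2
overlap = 0.5 * spk_1 + 0.5 * spk_2           # both speakers audible at once (assumed mix)

basis = np.stack([spk_1, spk_2], axis=1)      # existing basis, shape (64, 2)
weights, *_ = np.linalg.lstsq(basis, overlap, rcond=None)
print(weights)                                            # approximately [0.5, 0.5]
print(np.linalg.norm(basis @ weights - overlap))          # approximately 0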

Due to the sparsity constraints, this approach does not need the exact number of speakers to be specified. It only requires choosing a maximum number of speakers, which we do by leveraging a technique from linear algebra. We apply Singular Value Decomposition (SVD) over the embedding signal. The singular values are sorted in descending order. The knee of this sequence is then detected using a knee detection algorithm. The knee location is multiplied by a factor of 2.5 to estimate the maximum number of speakers. The factor of 2.5 gives us a useful margin of error for the upper bound on the number of speakers.
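A rough sketch of this estimate is shown below. The post does not name a particular knee detector, so the use of the third-party kneed package (and the fallback when no knee is found) is an assumption of the sketch.

import numpy as np
from kneed import KneeLocator   # one possible knee-detection implementation

def estimate_max_speakers(E, margin=2.5):
    """Estimate an upper bound on the number of speakers from the embedding signal E (T x D)."""
    singular_values = np.linalg.svd(E, compute_uv=False)   # returned in descending order
    ranks = np.arange(1, len(singular_values) + 1)
    knee = KneeLocator(ranks, singular_values,
                       curve="convex", direction="decreasing").knee
    if knee is None:
        knee = 1                                            # fallback if no clear knee is found
    return int(np.ceil(margin * knee))                      # factor of 2.5 as the margin of error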

Experimental Results

We evaluate our diarization algorithm against a commercial baseline, an in-house implementation of a spectral clustering algorithm, two open-source solutions, and two naive approaches. As an evaluation dataset, we used a well-known public-radio-turned-podcast series. It contains podcast audio that is typically an hour long and features 18 speakers on average. As evaluation metrics, we use diarization error rate, purity, and coverage. Purity and coverage are the equivalents of precision and recall for the diarization problem, and we also report the F-score, the geometric mean of purity and coverage; a small example of computing these metrics appears after the paper reference below. As a commercial baseline, we use the diarization solution provided with the Google Cloud Platform transcription service. The results show that our algorithm outperforms the commercial baseline on all of the metrics for the tested dataset. For a detailed discussion of this result, along with several other experiments, please check out our paper:

Unsupervised Speaker Diarization that is Agnostic to Language, Overlap-Aware, and Tuning Free
M. Iftekhar Tanveer, Diego Casabuena, Jussi Karlgren, and Rosie Jones
INTERSPEECH 2022
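
As a small illustration of the metrics above, the snippet below scores a toy reference/hypothesis pair with the open-source pyannote.metrics package; the choice of package and the segment boundaries are illustrative assumptions, not details from the paper.

from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import (DiarizationCoverage,
                                          DiarizationErrorRate,
                                          DiarizationPurity)

# Made-up ground truth (reference) and system output (hypothesis).
reference = Annotation()
reference[Segment(0, 10)] = "host"
reference[Segment(10, 25)] = "guest"

hypothesis = Annotation()
hypothesis[Segment(0, 12)] = "spk_0"
hypothesis[Segment(12, 25)] = "spk_1"

der = DiarizationErrorRate()(reference, hypothesis)
purity = DiarizationPurity()(reference, hypothesis)
coverage = DiarizationCoverage()(reference, hypothesis)
f_score = (purity * coverage) ** 0.5   # geometric mean of purity and coverage, as described above
print(f"DER={der:.3f} purity={purity:.3f} coverage={coverage:.3f} F={f_score:.3f}")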

References

[1] https://www.tensorflow.org/hub/tutorials/yamnet
[2] A. Nagrani, J. S. Chung, W. Xie, A. Zisserman, VoxCeleb: Large-scale speaker verification in the wild, Computer Speech and Language, 2019
