{"id":660,"date":"2022-10-18T12:29:11","date_gmt":"2022-10-18T12:29:11","guid":{"rendered":"https:\/\/showbizztoday.com\/index.php\/2022\/10\/18\/unsupervised-speaker-diarization-using-sparse-optimization\/"},"modified":"2022-10-18T12:29:11","modified_gmt":"2022-10-18T12:29:11","slug":"unsupervised-speaker-diarization-utilizing-sparse-optimization","status":"publish","type":"post","link":"https:\/\/showbizztoday.com\/index.php\/2022\/10\/18\/unsupervised-speaker-diarization-utilizing-sparse-optimization\/","title":{"rendered":"Unsupervised Speaker Diarization using Sparse Optimization"},"content":{"rendered":"<div>\n<div class=\"published-date\">\n<div class=\"icon-holder\">\n                                                <img decoding=\"async\" src=\"https:\/\/research.atspotify.com\/wp-content\/themes\/spotify\/images\/icon.png\" alt=\"\"\/>\n                                            <\/div>\n<p><span class=\"date\">September 29, 2022<\/span> Published by Md Iftekhar Tanveer, Diego Casabuena, Jussi Karlgren and Rosie Jones<\/p>\n<\/p><\/div>\n<div class=\"img-holder\">\n                                            <img decoding=\"async\" src=\"https:\/\/storage.googleapis.com\/research-production\/1\/2022\/09\/Spotify_Header_Blog_v4-NO-LOGO_01.png\" class=\"attachment-post-thumbnail size-post-thumbnail wp-post-image\" alt=\"Unsupervised Speaker Diarization using Sparse Optimization\" srcset=\"https:\/\/storage.googleapis.com\/research-production\/1\/2022\/09\/Spotify_Header_Blog_v4-NO-LOGO_01.png 1667w, https:\/\/storage.googleapis.com\/research-production\/1\/2022\/09\/Spotify_Header_Blog_v4-NO-LOGO_01-250x123.png 250w, https:\/\/storage.googleapis.com\/research-production\/1\/2022\/09\/Spotify_Header_Blog_v4-NO-LOGO_01-700x344.png 700w, https:\/\/storage.googleapis.com\/research-production\/1\/2022\/09\/Spotify_Header_Blog_v4-NO-LOGO_01-768x378.png 768w, 
https:\/\/storage.googleapis.com\/research-production\/1\/2022\/09\/Spotify_Header_Blog_v4-NO-LOGO_01-1536x756.png 1536w, https:\/\/storage.googleapis.com\/research-production\/1\/2022\/09\/Spotify_Header_Blog_v4-NO-LOGO_01-120x59.png 120w\" sizes=\"(max-width: 1667px) 100vw, 1667px\"\/><figcaption\/>\n                                        <\/div>\n<h2>What is Speaker Diarization?<\/h2>\n<p>Speaker diarization is the process of logging the timestamps of when various speakers take turns to talk within a piece of spoken word audio. The figure below shows an audio timeline, annotated with the regions where different speakers were audible.<\/p>\n<figure class=\"wp-block-image size-large is-resized\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/storage.googleapis.com\/research-production\/1\/2022\/09\/image4-700x153.png\" alt=\"\" class=\"wp-image-1671\" width=\"837\" height=\"183\" srcset=\"https:\/\/storage.googleapis.com\/research-production\/1\/2022\/09\/image4-700x153.png 700w, https:\/\/storage.googleapis.com\/research-production\/1\/2022\/09\/image4-250x55.png 250w, https:\/\/storage.googleapis.com\/research-production\/1\/2022\/09\/image4-768x167.png 768w, https:\/\/storage.googleapis.com\/research-production\/1\/2022\/09\/image4-120x26.png 120w, https:\/\/storage.googleapis.com\/research-production\/1\/2022\/09\/image4.png 853w\" sizes=\"auto, (max-width: 837px) 100vw, 837px\"\/><\/figure>\n<p class=\"has-text-align-center\"><em>Illustration of speaker diarization<\/em><\/p>\n<h2>Challenges<\/h2>\n<p>Classical approaches to speaker diarization follow an unsupervised process. This approach, however, suffers from a number of issues. First, typical clustering techniques, such as spectral clustering, assume that the clusters do not overlap, which makes overlap detection an additional step and complicates the overall algorithm. 
Second, the total number of clusters must be pre-specified, which can be challenging. Third, the process of computing the embeddings often uses language-specific cues, such as the transcription of the audio. Such language-dependent embeddings are difficult to scale across multiple languages.<\/p>\n<figure class=\"wp-block-image size-full is-resized\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/storage.googleapis.com\/research-production\/1\/2022\/09\/image2-1.png\" alt=\"\" class=\"wp-image-1669\" width=\"799\" height=\"235\" srcset=\"https:\/\/storage.googleapis.com\/research-production\/1\/2022\/09\/image2-1.png 476w, https:\/\/storage.googleapis.com\/research-production\/1\/2022\/09\/image2-1-250x74.png 250w, https:\/\/storage.googleapis.com\/research-production\/1\/2022\/09\/image2-1-120x35.png 120w\" sizes=\"auto, (max-width: 799px) 100vw, 799px\"\/><\/figure>\n<p class=\"has-text-align-center\"><em>Classical diarization approach<\/em><\/p>\n<p>Supervised algorithms for diarization also exist, and are trained end-to-end using labeled data. However, these algorithms have their own challenges. For example, the training dataset is often imbalanced, which can introduce bias with respect to age, gender, or race and lead to unfair predictions. To address this issue, the training dataset must be balanced over many attributes (age, gender, race, etc.) of the speakers. For diarization applications in real-world scenarios like podcasts, the training data also needs to be balanced across different categories and formats of content. Counterbalancing is difficult, because it grows the dataset exponentially. 
Finally, preparing a supervised training dataset can be expensive and time-consuming because the diarization data must be annotated.<\/p>\n<h2>Proposed Solution<\/h2>\n<p>In this work, we propose a solution that avoids these challenges by using a sparse optimization method. This solution is unsupervised, language-agnostic (thus scalable), and overlap-aware. We utilize a voice embedding that uses audio-specific cues only. It is also tuning-free \u2013 i.e., it doesn\u2019t require users to pre-specify the exact number of speakers in the audio.<\/p>\n<p>In our algorithm, we preprocess the audio signal using a Voice Activity Detection step. We use YAMNet from TensorFlow Hub [1] to detect the voiced regions in the audio. Then, we compute the VGGVox embeddings [2] for small, overlapping time segments of the audio. We call the sequence of these embeddings for a podcast the embedding signal. We formulate a sparse optimization problem to factorize the embedding signal into the corresponding embedding basis matrix and the activation matrix. The optimization problem attempts to reconstruct the embedding signal with a minimal number of distinct embeddings (basis). 
We also propose an iterative algorithm to solve the optimization problem.<\/p>\n<figure class=\"wp-block-image size-full is-resized\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/storage.googleapis.com\/research-production\/1\/2022\/09\/image3-2.png\" alt=\"\" class=\"wp-image-1670\" width=\"830\" height=\"247\" srcset=\"https:\/\/storage.googleapis.com\/research-production\/1\/2022\/09\/image3-2.png 622w, https:\/\/storage.googleapis.com\/research-production\/1\/2022\/09\/image3-2-250x74.png 250w, https:\/\/storage.googleapis.com\/research-production\/1\/2022\/09\/image3-2-120x36.png 120w\" sizes=\"auto, (max-width: 830px) 100vw, 830px\"\/><\/figure>\n<p class=\"has-text-align-center\"><em>Speaker diarization expressed as a matrix factorization problem<\/em><\/p>\n<h2>Advantages of the Solution<\/h2>\n<p>This approach is overlap-aware. The VGGVox embedding of an audio segment in which two different speakers are present is approximately equal to the weighted average of the two speakers\u2019 embeddings. This characteristic, along with the sparsity constraint, penalizes constructing a new embedding for the overlapping regions, as shown in the following figure. 
This makes the optimization algorithm model the overlapping regions as a linear combination of the two existing embeddings.<\/p>\n<figure class=\"wp-block-image size-full is-resized\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/storage.googleapis.com\/research-production\/1\/2022\/09\/image1-2.png\" alt=\"\" class=\"wp-image-1666\" width=\"838\" height=\"186\" srcset=\"https:\/\/storage.googleapis.com\/research-production\/1\/2022\/09\/image1-2.png 491w, https:\/\/storage.googleapis.com\/research-production\/1\/2022\/09\/image1-2-250x55.png 250w, https:\/\/storage.googleapis.com\/research-production\/1\/2022\/09\/image1-2-120x27.png 120w\" sizes=\"auto, (max-width: 838px) 100vw, 838px\"\/><\/figure>\n<p class=\"has-text-align-center\"><em>Illustration of overlap-awareness<\/em><\/p>\n<p>Because of the sparsity constraints, this approach doesn\u2019t need the exact number of speakers to be specified. It does require selecting a maximum number of speakers, which we do by leveraging a technique from linear algebra. We apply Singular Value Decomposition (SVD) to the embedding signal. The singular values are sorted in descending order. The knee of this sequence is then detected using a knee detection algorithm. The knee location is multiplied by a factor of 2.5 to estimate the maximum number of speakers. The factor of 2.5 gives us a useful margin of error for the upper bound on the number of speakers.<\/p>\n<h2>Experimental Results<\/h2>\n<p>We evaluate our diarization algorithm against a commercial baseline, an in-house implementation of a spectral clustering algorithm, two open source solutions, and two naive approaches. As an evaluation dataset, we used a well-known public radio-turned-podcast series. It contains podcast audio that is typically an hour long and contains 18 speakers on average. As evaluation metrics, we use diarization error rate, purity, and coverage. 
Purity and coverage are the equivalent of precision and recall for the diarization problem, and we also report the F-score, the geometric mean of purity and coverage. As a commercial baseline, we use the diarization solution provided with the Google Cloud Platform transcription service. The results show that our algorithm outperforms the commercial baseline on all of the metrics for the tested dataset. For a detailed discussion of this result, along with some other experiments, please check out our paper:<\/p>\n<p><a href=\"https:\/\/research.atspotify.com\/publications\/unsupervised-speaker-diarization-that-is-agnostic-to-language-overlap-aware-and-free-of-tuning\/\" target=\"_blank\" rel=\"noreferrer noopener\">Unsupervised Speaker Diarization that is Agnostic to Language, Overlap-Aware, and Tuning Free<\/a><br \/>M. Iftekhar Tanveer, Diego Casabuena, Jussi Karlgren, and Rosie Jones<br \/>INTERSPEECH 2022<\/p>\n<h2>References<\/h2>\n<p>[1] <a href=\"https:\/\/www.tensorflow.org\/hub\/tutorials\/yamnet\" target=\"_blank\" rel=\"noopener\">https:\/\/www.tensorflow.org\/hub\/tutorials\/yamnet<\/a><br \/>[2] A. Nagrani, J. S. Chung, W. Xie, A. Zisserman, VoxCeleb: Large-scale speaker verification in the wild, Computer Speech and Language, 2019<\/p>\n<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>September 29, 2022 Published by Md Iftekhar Tanveer, Diego Casabuena, Jussi Karlgren and Rosie Jones What is Speaker Diarization? Speaker diarization is the process of logging the timestamps of when various speakers take turns to talk within a piece of spoken word audio. 
The figure below shows an audio timeline, annotated with the regions [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":662,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[38],"tags":[],"class_list":{"0":"post-660","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-spotify"},"_links":{"self":[{"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/posts\/660","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/comments?post=660"}],"version-history":[{"count":0,"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/posts\/660\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/media\/662"}],"wp:attachment":[{"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/media?parent=660"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/categories?post=660"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/tags?post=660"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}