Scalable Dynamic Topic Modeling – Spotify Research


November 15, 2022 Published by Federico Tomasi, Mounia Lalmas and Zhenwen Dai

Dynamic topic modeling is a well-established tool for capturing the temporal dynamics of the topics of a corpus. A limitation of current dynamic topic models is that they can only consider a small set of frequent words, because of their computational complexity and the lack of sufficient data for less frequent words. However, in many applications, the most important words that describe a topic are often the uncommon ones. For instance, in the machine learning literature, words such as “Wishart” are relatively uncommon, but they are strongly related to the posterior inference literature. In this work, we proposed a dynamic topic model that is able to capture the correlation between the topics associated with each document, the evolution of topic correlation and word co-occurrence over time, and word correlation.

This is especially important when we consider textual documents, including spoken documents, because these may contain many uncommon words and be relatively short. In such cases, as standard topic models rely on word co-occurrence, they may fail to identify topics, and may treat different words as noise without appropriate preprocessing.

Dynamic topic modeling with word correlation

In the topic modeling literature, a topic is a probability distribution over the words appearing in a document. Each document is represented as a mixture of multiple topics. The components depicted in the graphical model are: the probability of a topic (µ), the topic correlation (Σ) and the topic-word association (β).
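To make the roles of µ, Σ and β concrete, here is a minimal sketch of how a single document could be sampled from a static correlated topic model. It is an illustration only, not the MIST model itself: the softmax link, the multinomial likelihood, the function name `sample_document` and all numeric values are assumptions chosen for the example, and the temporal dynamics are left out entirely.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def sample_document(mu, sigma, beta, num_words, rng):
    """Sample one bag-of-words document (hypothetical helper).

    mu:    (K,)   mean topic logits, i.e. how probable each topic is
    sigma: (K, K) covariance of the topic logits (topic correlation)
    beta:  (K, V) unnormalized topic-word association
    """
    eta = rng.multivariate_normal(mu, sigma)      # correlated topic logits
    theta = softmax(eta)                          # topic mixture of this document
    word_probs = theta @ softmax(beta, axis=1)    # mixture of per-topic word distributions
    word_probs = word_probs / word_probs.sum()    # guard against rounding drift
    counts = rng.multinomial(num_words, word_probs)
    return theta, counts


rng = np.random.default_rng(0)
num_topics, vocab_size = 3, 8
mu = np.zeros(num_topics)
# Topics 0 and 1 are positively correlated; topic 2 is independent.
sigma = np.array([[1.0, 0.8, 0.0],
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
beta = rng.normal(size=(num_topics, vocab_size))

theta, counts = sample_document(mu, sigma, beta, num_words=50, rng=rng)
print("topic mixture:", np.round(theta, 3))
print("word counts:  ", counts)
```

In a dynamic model these quantities additionally vary over time, which this static sketch does not attempt to show.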

In this work, we consider similar words as concretizations of a single concept. This removes the need for ad-hoc preprocessing (such as tokenization or stemming) to pre-group similar words under the same meaning. This is achieved by using multi-output Gaussian processes to learn a latent embedding for groups of words related to each other. The latent embedding is denoted by H, and its number of components is independent of the size of the vocabulary. On the other hand, the dynamics of each word for each topic are modelled by β, which is conditioned on the latent embedding H.

Evaluation

We compared our model, called MIST (Multi-output with Importance Sampling Topic model), to state-of-the-art topic models, both static ones (LDA and CTM) and dynamic ones (DETM and DCTM). We used a total of seven different datasets with different characteristics (e.g. in the number of documents per time point and the time span).

The table below shows the average per-word perplexity on the test set for each dataset, measured as the exponential of the average negative ELBO (a lower bound on the log-likelihood) per word (the lower, the better). MIST outperforms the baselines on all datasets. Importantly, we can see that for the datasets with more than 10k words, DCTM, while having the second-best performance on the other datasets, could not be trained without running into out-of-memory issues.
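As a small worked example of the metric, the sketch below turns per-document ELBO values into a per-word perplexity. The function name and the toy numbers are hypothetical, and it assumes the average is taken by pooling all test words together; the exact averaging used in the paper may differ. Because the ELBO is a lower bound on the log-likelihood, the resulting value is an upper bound on the true perplexity.

```python
import math


def per_word_perplexity(elbo_per_document, words_per_document):
    """Perplexity = exp(-(sum of ELBOs) / (total number of words))."""
    total_words = sum(words_per_document)
    if total_words <= 0:
        raise ValueError("the test set must contain at least one word")
    return math.exp(-sum(elbo_per_document) / total_words)


# Toy test set: two documents with made-up ELBO values (in nats).
elbos = [-700.0, -350.0]
lengths = [100, 50]
perplexity = per_word_perplexity(elbos, lengths)
print(f"per-word perplexity: {perplexity:.2f}")

# Sanity check: a model that spreads probability uniformly over a
# 1000-word vocabulary has a perplexity equal to the vocabulary size.
uniform = per_word_perplexity([10 * math.log(1 / 1000)], [10])
print(f"uniform baseline:    {uniform:.2f}")
```

The uniform baseline gives an intuition for the scale: a perplexity well below the vocabulary size means the model is concentrating probability on the words that actually occur.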

Computational efficiency

We further investigated the difference in computational efficiency between DCTM and MIST. In DCTM, the words for each topic are modelled independently, hence the model needs to approximate every word for every topic. This is computationally inefficient, and infeasible for large vocabularies. In MIST, we overcome the problem by modelling only a fixed number of latent variables. This change is reflected in the computational difference between the two models.
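The scaling argument can be made tangible with some back-of-the-envelope arithmetic. The sketch below is a deliberate simplification, not a description of either implementation: it only counts how many word-level functions would need to be modelled, and the number of topics (20) and of latent components (50) are hypothetical values picked for illustration.

```python
def independent_function_count(num_topics, vocab_size):
    """One function per (topic, word) pair: grows linearly with the vocabulary."""
    return num_topics * vocab_size


def shared_latent_function_count(num_topics, num_latent):
    """One function per (topic, latent component): independent of the vocabulary."""
    return num_topics * num_latent


num_topics = 20   # hypothetical
num_latent = 50   # hypothetical size of the latent embedding

for vocab_size in (10_000, 100_000, 1_000_000):
    independent = independent_function_count(num_topics, vocab_size)
    shared = shared_latent_function_count(num_topics, num_latent)
    print(f"V={vocab_size:>9,}  independent={independent:>11,}  shared={shared:,}")
```

Under these assumptions the first count grows a hundredfold between 10k and 1M words while the second stays constant, which matches the qualitative behaviour described above.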

We measured the average time needed to compute five epochs on a dataset with an increasing number of words, while keeping a fixed number of samples at different time points. MIST consistently outperforms DCTM across all vocabulary sizes. In particular, the computational benefit of MIST becomes more evident after reaching 100K (and 1M) words.

Modeling infrequent words

We analyze the performance of MIST in terms of modeling infrequent keywords compared to previous methods. We plot the per-year word counts of three keywords (“Wishart”, “Kalman” and “Hebbian”) in one of the datasets (NeurIPS). All three words appear on average fewer than 100 times a year, which makes them infrequent, but they are all clear indicators of machine learning and neuroscience subfields. We then compare the word probabilities of each word in the topic with the strongest connection to the word, as inferred by MIST, DCTM and DETM. We plot the posterior of μ in the topic inferred by MIST where the word is most prominent, and the posterior of β for each of the dynamic topic models we tested. MIST is shown to be the most accurate in modeling these words, capturing the general dynamics of the word in the dataset without overfitting it.

For example, consider the word “Wishart”, corresponding to the Wishart distribution (and process). The word counts are very low (fewer than 50 counts in total each year). However, the word is highly indicative of the Bayesian inference topic, as the Wishart distribution is the conjugate prior of the inverse covariance matrix of a multivariate normal. Indeed, it has a high correlation with the much more common word “posterior”. MIST is able to accurately model the increasing dynamics of such words over time (middle row). Conversely, DCTM and DETM, which consider words independently, do not have enough data to accurately model its dynamics.

Some final words

Our model is able to efficiently model large vocabularies by using multi-output Gaussian processes, sparse Gaussian process approximations, and importance sampling procedures. As the model uses learnt word embeddings to relate words to each other, it does not need to see a large number of word co-occurrences for each document, making it well suited to analyzing short-text documents.

For more information, please check out our paper below and our previous blog post.

Efficient Inference for Dynamic Topic Modeling with Large Vocabularies
Federico Tomasi, Mounia Lalmas and Zhenwen Dai
UAI 2022
