By Ko-Jen Hsiao, Yesu Feng and Sudarshan Lamkhede
Netflix’s personalized recommender system is a complex system, boasting a variety of specialized machine learned models, each catering to distinct needs including “Continue Watching” and “Today’s Top Picks for You.” (Refer to our recent overview for more details.) However, as we expanded our set of personalization algorithms to meet growing business needs, maintenance of the recommender system became quite costly. Furthermore, it was difficult to transfer innovations from one model to another, given that most are independently trained despite using common data sources. This situation underscored the need for a new recommender system architecture where member preference learning is centralized, enhancing accessibility and utility across different models.
Specifically, these models predominantly extract features from members’ recent interaction histories on the platform. Yet many are confined to a short temporal window due to constraints in serving latency or training costs. This limitation has inspired us to develop a foundation model for recommendation. This model aims to assimilate information both from members’ comprehensive interaction histories and from our content at a very large scale. It facilitates the distribution of these learnings to other models, either through shared model weights for fine-tuning or directly through embeddings.
The impetus for building a foundational recommendation model is the paradigm shift in natural language processing (NLP) toward large language models (LLMs). In NLP, the trend is moving away from numerous small, specialized models toward a single, large language model that can perform a variety of tasks either directly or with minimal fine-tuning. Key insights from this shift include:
- A Data-Centric Approach: Shifting focus from model-centric strategies, which rely heavily on feature engineering, to a data-centric one. This approach prioritizes the accumulation of large-scale, high-quality data and, where feasible, aims for end-to-end learning.
- Leveraging Semi-Supervised Learning: The next-token prediction objective in LLMs has proven remarkably effective. It enables large-scale semi-supervised learning using unlabeled data while also equipping the model with a surprisingly deep understanding of world knowledge.
These insights have shaped the design of our foundation model, enabling a transition from maintaining numerous small, specialized models to building a scalable, efficient system. By scaling up semi-supervised training data and model parameters, we aim to develop a model that not only meets current needs but also adapts dynamically to evolving demands, ensuring sustainable innovation and resource efficiency.
At Netflix, user engagement spans a wide spectrum, from casual browsing to committed movie watching. With over 300 million users at the end of 2024, this translates into hundreds of billions of interactions, an immense dataset comparable in scale to the token volume of large language models (LLMs). However, as in LLMs, the quality of data often outweighs its sheer volume. To harness this data effectively, we employ a process of interaction tokenization, ensuring meaningful events are identified and redundancies are minimized.
Tokenizing User Interactions: Not all raw user actions contribute equally to understanding preferences. Tokenization helps define what constitutes a meaningful “token” in a sequence. Drawing an analogy to Byte Pair Encoding (BPE) in NLP, we can think of tokenization as merging adjacent actions to form new, higher-level tokens. However, unlike language tokenization, creating these new tokens requires careful consideration of what information to retain. For instance, the total watch duration might need to be summed or engagement types aggregated to preserve important details.
This tradeoff between granular data and sequence compression is akin to the balance in LLMs between vocabulary size and context window. In our case, the goal is to balance the length of interaction history against the level of detail retained in individual tokens. Overly lossy tokenization risks losing valuable signals, while too granular a sequence can exceed practical limits on processing time and memory.
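As a concrete illustration, here is a minimal, hypothetical sketch of this BPE-style merging: adjacent events on the same title are collapsed into one higher-level token while watch duration is summed and engagement types are aggregated. The `Interaction` fields and the merge rule are illustrative assumptions, not our production tokenizer.

```python
from dataclasses import dataclass, field

@dataclass
class Interaction:
    # Illustrative fields only; the production schema is much richer.
    title_id: int
    action: str              # e.g. "play", "trailer", "browse"
    watch_seconds: int
    actions: set = field(default_factory=set)

def tokenize(events: list[Interaction]) -> list[Interaction]:
    """Merge adjacent events on the same title into one higher-level token,
    summing watch time and aggregating engagement types (BPE-like merging)."""
    tokens: list[Interaction] = []
    for ev in events:
        if tokens and tokens[-1].title_id == ev.title_id:
            prev = tokens[-1]
            prev.watch_seconds += ev.watch_seconds   # preserve total watch duration
            prev.actions.add(ev.action)              # aggregate engagement types
        else:
            tokens.append(Interaction(ev.title_id, ev.action,
                                      ev.watch_seconds, {ev.action}))
    return tokens
```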
Even with such strategies, interaction histories from active users can span thousands of events, exceeding the capacity of transformer models with standard self-attention layers. In recommendation systems, context windows during inference are often limited to hundreds of events, not because of model capability but because these services typically require millisecond-level latency. This constraint is more stringent than what is typical in LLM applications, where longer inference times (seconds) are more tolerable.
To address this during training, we implement two key solutions:
- Sparse Attention Mechanisms: By leveraging sparse attention techniques such as low-rank compression, the model can extend its context window to several hundred events while maintaining computational efficiency. This enables it to process more extensive interaction histories and derive richer insights into long-term preferences.
- Sliding Window Sampling: During training, we sample overlapping windows of interactions from the full sequence. This ensures the model is exposed to different segments of the user’s history over multiple epochs, allowing it to learn from the entire sequence without requiring an impractically large context window (a minimal sampling sketch appears below).
At inference time, when multi-step decoding is required, we can deploy KV caching to efficiently reuse past computations and maintain low latency.
These approaches collectively allow us to balance the need for detailed, long-term interaction modeling with the practical constraints of model training and inference, enhancing both the precision and scalability of our recommendation system.
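The sliding-window sampling above can be sketched in a few lines. This is a simplified illustration under assumed parameters (window size, number of windows per sequence), not the exact training-data pipeline.

```python
import random

def sample_training_windows(sequence: list, window_size: int,
                            num_windows: int, seed: int | None = None) -> list[list]:
    """Sample overlapping fixed-size windows from a long interaction sequence
    so the model sees different segments of the history across epochs."""
    rng = random.Random(seed)
    if len(sequence) <= window_size:
        return [sequence]
    max_start = len(sequence) - window_size
    starts = [rng.randint(0, max_start) for _ in range(num_windows)]
    return [sequence[s:s + window_size] for s in starts]

# Example: a 5,000-event history sampled into 256-event training windows.
history = list(range(5_000))
windows = sample_training_windows(history, window_size=256, num_windows=4, seed=0)
```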
Information in Each ‘Token’: While the first part of our tokenization process focuses on structuring sequences of interactions, the next crucial step is defining the rich information contained within each token. Unlike LLMs, which typically rely on a single embedding space to represent input tokens, our interaction events are packed with heterogeneous details. These include attributes of the action itself (such as locale, time, duration, and device type) as well as information about the content (such as item ID and metadata like genre and release country). Most of these features, especially categorical ones, are directly embedded within the model, embracing an end-to-end learning approach. However, certain features require special attention. For example, timestamps need additional processing to capture both absolute and relative notions of time, with absolute time being particularly important for understanding time-sensitive behaviors.
To improve prediction accuracy in sequential recommendation systems, we organize token features into two categories:
- Request-Time Features: These are features available at the moment of prediction, such as log-in time, device, or location.
- Post-Action Features: These are details available after an interaction has occurred, such as the specific show interacted with or the duration of the interaction.
To predict the next interaction, we combine request-time features from the current step with post-action features from the previous step. This blending of contextual and historical information ensures each token in the sequence carries a comprehensive representation, capturing both the immediate context and user behavior patterns over time.
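A minimal sketch of this pairing follows; the feature names and dictionary representation are purely hypothetical stand-ins for the real feature pipeline.

```python
def build_token_inputs(request_time_feats: list[dict],
                       post_action_feats: list[dict]) -> list[dict]:
    """For step t, pair the request-time features of step t with the
    post-action features of step t-1; the post-action features of step t
    are what the model is trained to predict."""
    tokens = []
    for t, req in enumerate(request_time_feats):
        prev_post = post_action_feats[t - 1] if t > 0 else {}  # nothing observed before step 0
        tokens.append({**req, **{f"prev_{k}": v for k, v in prev_post.items()}})
    return tokens

# Hypothetical two-step example.
requests = [{"device": "tv", "hour": 20}, {"device": "mobile", "hour": 22}]
post_actions = [{"item_id": 123, "watch_seconds": 5400},
                {"item_id": 456, "watch_seconds": 300}]
inputs = build_token_inputs(requests, post_actions)
```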
As previously mentioned, our default approach employs the autoregressive next-token prediction objective, similar to GPT. This strategy effectively leverages the vast scale of unlabeled user interaction data. The adoption of this objective in recommendation systems has shown several successes [1–3]. However, given the distinct differences between language tasks and recommendation tasks, we have made several important modifications to the objective.
Firstly, during the pretraining phase of typical LLMs such as GPT, every target token is generally treated with equal weight. In contrast, in our model, not all user interactions are of equal importance. For instance, a 5-minute trailer play should not carry the same weight as a 2-hour full movie watch. A greater challenge arises when trying to align long-term user satisfaction with specific interactions and recommendations. To address this, we can adopt a multi-token prediction objective during training, where the model predicts the next n tokens at each step instead of a single token [4]. This approach encourages the model to capture longer-term dependencies and avoid myopic predictions focused solely on immediate next events.
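A minimal sketch of such a weighted multi-token loss, assuming PyTorch and a separate prediction head per future offset; the tensor layout and weighting scheme are illustrative assumptions rather than our exact training objective.

```python
import torch
import torch.nn.functional as F

def multi_token_loss(logits: torch.Tensor, targets: torch.Tensor,
                     weights: torch.Tensor) -> torch.Tensor:
    """Weighted multi-token prediction loss.

    logits:  [batch, seq, n_future, vocab]  - one head per future offset
    targets: [batch, seq, n_future]         - the next n item IDs at each step
    weights: [batch, seq, n_future]         - e.g. higher for full watches than trailer plays
    """
    b, s, n, v = logits.shape
    per_token = F.cross_entropy(
        logits.reshape(-1, v), targets.reshape(-1), reduction="none"
    ).reshape(b, s, n)
    return (per_token * weights).sum() / weights.sum()
```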
Secondly, we can use multiple fields in our input data as auxiliary prediction targets in addition to predicting the next item ID, which remains the primary target. For example, we can derive genres from the items in the original sequence and use this genre sequence as an auxiliary target. This approach serves multiple purposes: it acts as a regularizer to reduce overfitting on noisy item ID predictions, provides additional insights into user intentions or long-term genre preferences, and, when structured hierarchically, can improve the accuracy of predicting the target item ID. By first predicting auxiliary targets, such as genre or original language, the model effectively narrows down the candidate list, simplifying subsequent item ID prediction.
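Continuing the PyTorch sketch above, an auxiliary genre head might be folded into the training loss roughly like this; the 0.2 weight and the flat (non-hierarchical) combination are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def combined_loss(item_logits: torch.Tensor, genre_logits: torch.Tensor,
                  item_targets: torch.Tensor, genre_targets: torch.Tensor,
                  aux_weight: float = 0.2) -> torch.Tensor:
    """Primary next-item-ID loss plus an auxiliary genre-sequence loss.
    The auxiliary term acts as a regularizer and captures longer-term taste."""
    item_loss = F.cross_entropy(item_logits.reshape(-1, item_logits.shape[-1]),
                                item_targets.reshape(-1))
    genre_loss = F.cross_entropy(genre_logits.reshape(-1, genre_logits.shape[-1]),
                                 genre_targets.reshape(-1))
    return item_loss + aux_weight * genre_loss
```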
In addition to the infrastructure challenges posed by training bigger models on substantial amounts of user interaction data, which are common when building foundation models, there are several hurdles unique to recommendations that must be addressed to make such models viable. One of these unique challenges is entity cold-starting.
At Netflix, our mission is to entertain the world. New titles are added to the catalog frequently. Therefore the recommendation foundation models require a cold-start capability, which means the models must estimate members’ preferences for newly launched titles before anyone has engaged with them. To enable this, our foundation model training framework is built with two capabilities: incremental training and being able to do inference with unseen entities.
- Incremental training: Foundation models are trained on extensive datasets, including every member’s history of plays and actions, making frequent retraining from scratch impractical. However, our catalog and member preferences continually evolve. Unlike large language models, which can be incrementally trained with stable token vocabularies, our recommendation models require new embeddings for new titles, necessitating expanded embedding layers and output components. To address this, we warm-start new models by reusing parameters from previous models and initializing new parameters for new titles. For example, new title embeddings can be initialized by adding slight random noise to existing average embeddings or by using a weighted combination of similar titles’ embeddings based on metadata. This approach allows new titles to start with reasonable embeddings, facilitating faster fine-tuning. In practice, the initialization method becomes less critical when more member interaction data is used for fine-tuning.
- Dealing with unseen entities: Even with incremental training, the model is not always guaranteed to learn well on new entities (e.g., newly launched titles). It is also possible that some new entities will not be included or seen in the training data, even if we fine-tune foundation models on a frequent basis. Therefore, it is also important to let foundation models use metadata information about entities and inputs, not just member interaction data. Thus, our foundation model combines both learnable item ID embeddings and learnable embeddings computed from metadata. The following diagram illustrates this idea.
To create the final title embedding, we combine this metadata-based embedding with a fully learnable ID-based embedding using a mixing layer. Instead of simply summing these embeddings, we use an attention mechanism based on the “age” of the entity. This approach allows new titles with limited interaction data to rely more on metadata, while established titles can rely more on ID-based embeddings. Since titles with similar metadata can have different user engagement, their embeddings should reflect these differences. Introducing some randomness during training encourages the model to learn from metadata rather than relying solely on ID embeddings. This technique ensures that newly launched or pre-launch titles have reasonable embeddings even without user interaction data.
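A minimal sketch of such an age-based mixing layer, assuming PyTorch. The learned sigmoid gate, the dropout-based randomness, and the raw age feature are simplified stand-ins for the attention mechanism described above, not the production architecture.

```python
import torch
import torch.nn as nn

class AgeGatedTitleEmbedding(nn.Module):
    """Blend a metadata-based embedding with an ID-based embedding using a
    gate computed from the title's age (time since launch). The gate is
    learned; the intent is that new titles lean on metadata while
    established titles lean on their ID embeddings."""

    def __init__(self, num_titles: int, dim: int, id_dropout: float = 0.1):
        super().__init__()
        self.id_embedding = nn.Embedding(num_titles, dim)
        self.gate = nn.Sequential(nn.Linear(1, dim), nn.Sigmoid())
        # Randomly dropping ID signal during training pushes the model
        # to also learn from metadata.
        self.id_dropout = nn.Dropout(id_dropout)

    def forward(self, title_ids: torch.Tensor, metadata_emb: torch.Tensor,
                title_age_days: torch.Tensor) -> torch.Tensor:
        id_emb = self.id_dropout(self.id_embedding(title_ids))
        alpha = self.gate(title_age_days.float().unsqueeze(-1))  # weight on the ID embedding
        return alpha * id_emb + (1.0 - alpha) * metadata_emb
```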
Our recommendation foundation model is designed to understand long-term member preferences and can be used in various ways by downstream applications:
- Direct Use as a Predictive Model: The model is primarily trained to predict the next entity a user will interact with. It includes multiple predictor heads for different tasks, such as forecasting member preferences for various genres. These can be applied directly to meet diverse business needs.
- Utilizing embeddings: The model generates valuable embeddings for members and entities like videos, games, and genres. These embeddings are computed in batch jobs and stored for use in both offline and online applications. They can serve as features in other models or be used for candidate generation, such as retrieving appealing titles for a user. High-quality title embeddings also support title-to-title recommendations. However, one important consideration is that the embedding space has arbitrary, uninterpretable dimensions and is incompatible across different model training runs. This poses challenges for downstream consumers, who must adapt to each retraining and redeployment, risking bugs due to invalidated assumptions about the embedding structure. To address this, we apply an orthogonal low-rank transformation to stabilize the user/item embedding space, ensuring consistent meaning of the embedding dimensions even as the base foundation model is retrained and redeployed (see the sketch after this list).
- Fine-Tuning with Specific Data: The model’s adaptability allows for fine-tuning with application-specific data. Users can integrate the full model or subgraphs into their own models, fine-tuning them with less data and computational power. This approach achieves performance comparable to previous models, despite the initial foundation model requiring significant resources.
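One simple way to stabilize embedding dimensions across retrains is an orthogonal Procrustes alignment against the previous run, computed over a shared set of anchor items. The sketch below (NumPy) is a simplified stand-in for the orthogonal low-rank transformation mentioned above, not the production method.

```python
import numpy as np

def align_embeddings(new_emb: np.ndarray, ref_emb: np.ndarray) -> np.ndarray:
    """Orthogonal Procrustes alignment: find the rotation Q minimizing
    ||new_emb @ Q - ref_emb||_F over shared anchor items, then apply it so
    embedding dimensions keep a consistent meaning across retrains."""
    u, _, vt = np.linalg.svd(new_emb.T @ ref_emb)
    q = u @ vt                     # orthogonal by construction
    return new_emb @ q

# Hypothetical usage: rows are the same anchor titles in both training runs.
prev_run = np.random.randn(1000, 128)
new_run = np.random.randn(1000, 128)
aligned = align_embeddings(new_run, prev_run)
```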
In scaling up our foundation model for Netflix recommendations, we draw inspiration from the success of large language models (LLMs). Just as LLMs have demonstrated the power of scaling in improving performance, we find that scaling is crucial for enhancing generative recommendation tasks. Successful scaling demands robust evaluation, efficient training algorithms, and substantial computing resources. Evaluation must effectively differentiate model performance and identify areas for improvement. Scaling involves data, model, and context scaling, incorporating user engagement, external reviews, multimedia assets, and high-quality embeddings. Our experiments confirm that the scaling law also applies to our foundation model, with consistent improvements observed as we increase data and model size.
In conclusion, our Foundation Model for Personalized Recommendation represents a significant step toward creating a unified, data-centric system that leverages large-scale data to increase the quality of recommendations for our members. This approach borrows insights from Large Language Models (LLMs), particularly the principles of semi-supervised learning and end-to-end training, aiming to harness the vast scale of unlabeled user interaction data. Addressing unique challenges like cold start and presentation bias, the model also acknowledges the distinct differences between language tasks and recommendation tasks. The Foundation Model enables various downstream applications, from direct use as a predictive model to generating user and entity embeddings for other applications, and can be fine-tuned for specific canvases. We are seeing promising results from downstream integrations. This move from multiple specialized models to a more comprehensive system marks an exciting development in the field of personalized recommendation systems.
Contributors to this work (names in alphabetical order): Ai-Lei Sun Aish Fenton Anne Cocos Anuj Shah Arash Aghevli Baolin Li Bowei Yan Dan Zheng Dawen Liang Ding Tong Divya Gadde Emma Kong Gary Yeh Inbar Naor Jin Wang Justin Basilico Kabir Nagrecha Kevin Zielnicki Linas Baltrunas Lingyi Liu Luke Wang Matan Appelbaum Michael Tu Moumita Bhattacharya Pablo Delgado Qiuling Xu Rakesh Komuravelli Raveesh Bhalla Rob Story Roger Menezes Sejoon Oh Shahrzad Naseri Swanand Joshi Trung Nguyen Vito Ostuni Wei Wang Zhe Zhang
- W.-C. Kang and J. McAuley, “Self-Attentive Sequential Recommendation,” 2018 IEEE International Conference on Data Mining (ICDM), Singapore, 2018, pp. 197–206, doi: 10.1109/ICDM.2018.00035.
- F. Sun et al., “BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer,” Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM ’19), Beijing, China, 2019, pp. 1441–1450, doi: 10.1145/3357384.3357895.
- J. Zhai et al., “Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations,” arXiv preprint arXiv:2402.17152, 2024.
- F. Gloeckle, B. Youbi Idrissi, B. Rozière, D. Lopez-Paz, and G. Synnaeve, “Better & Faster Large Language Models via Multi-token Prediction,” arXiv preprint arXiv:2404.19737, Apr. 2024.