Foundation Model for Personalized Recommendation

Netflix Technology Blog | March 2025

By Ko-Jen Hsiao, Yesu Feng and Sudarshan Lamkhede

Netflix’s personalized recommender system is a complex system, comprising a variety of specialized machine-learned models, each catering to distinct needs such as “Continue Watching” and “Today’s Top Picks for You” (refer to our recent overview for more details). However, as we expanded our set of personalization algorithms to meet growing business needs, maintaining the recommender system became quite costly. Furthermore, it was difficult to transfer innovations from one model to another, given that most are independently trained despite using common data sources. This situation underscored the need for a new recommender system architecture where member preference learning is centralized, enhancing accessibility and utility across different models.

In particular, these models predominantly extract features from members’ recent interaction histories on the platform. Yet many are confined to a short temporal window due to constraints in serving latency or training costs. This limitation has inspired us to develop a foundation model for recommendation. The model aims to assimilate information from members’ complete interaction histories and our content at a very large scale, and it facilitates the distribution of these learnings to other models, either through shared model weights for fine-tuning or directly through embeddings.

The impetus for building a foundational recommendation model is the paradigm shift in natural language processing (NLP) toward large language models (LLMs). In NLP, the trend is moving away from numerous small, specialized models toward a single large language model that can perform a variety of tasks either directly or with minimal fine-tuning. Key insights from this shift include:

  1. A Data-Centric Approach: Shifting focus from model-centric strategies, which rely heavily on feature engineering, to a data-centric one. This approach prioritizes the accumulation of large-scale, high-quality data and, where feasible, aims for end-to-end learning.
  2. Leveraging Semi-Supervised Learning: The next-token prediction objective in LLMs has proven remarkably effective. It enables large-scale semi-supervised learning on unlabeled data while also equipping the model with a surprisingly deep understanding of world knowledge.

These insights have shaped the design of our foundation model, enabling a transition from maintaining numerous small, specialized models to building a scalable, efficient system. By scaling up semi-supervised training data and model parameters, we aim to develop a model that not only meets current needs but also adapts dynamically to evolving demands, ensuring sustainable innovation and resource efficiency.

At Netflix, user engagement spans a wide spectrum, from casual browsing to committed movie watching. With over 300 million users at the end of 2024, this translates into hundreds of billions of interactions, an immense dataset comparable in scale to the token volume of large language models (LLMs). However, as with LLMs, the quality of data often outweighs its sheer volume. To harness this data effectively, we employ a strategy of interaction tokenization, ensuring meaningful events are identified and redundancies are minimized.

Tokenizing User Interactions: Not all raw user actions contribute equally to understanding preferences. Tokenization helps define what constitutes a meaningful “token” in a sequence. Drawing an analogy to Byte Pair Encoding (BPE) in NLP, we can think of tokenization as merging adjacent actions to form new, higher-level tokens. However, unlike language tokenization, creating these new tokens requires careful consideration of what information to retain. For instance, the total watch duration might need to be summed, or engagement types aggregated, to preserve important details.
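As a minimal sketch of this idea (with illustrative, hypothetical field names rather than the actual event schema), adjacent actions on the same title can be merged into one token while summing watch duration and aggregating engagement types:

```python
from dataclasses import dataclass
from typing import List, Set

# Hypothetical raw-event and token structures; field names are illustrative only.
@dataclass
class RawEvent:
    title_id: int
    action: str          # e.g. "play", "pause", "rate"
    duration_sec: int    # watch time contributed by this event
    timestamp: int

@dataclass
class Token:
    title_id: int
    actions: Set[str]          # aggregated engagement types
    total_duration_sec: int    # summed across merged events
    first_timestamp: int
    last_timestamp: int

def tokenize(events: List[RawEvent]) -> List[Token]:
    """Merge adjacent events on the same title into a single higher-level token."""
    tokens: List[Token] = []
    for e in events:
        if tokens and tokens[-1].title_id == e.title_id:
            t = tokens[-1]
            t.actions.add(e.action)
            t.total_duration_sec += e.duration_sec
            t.last_timestamp = e.timestamp
        else:
            tokens.append(Token(e.title_id, {e.action}, e.duration_sec,
                                e.timestamp, e.timestamp))
    return tokens
```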

Figure 1. Tokenization of user interaction history by merging actions on the same title, preserving essential information.

This tradeoff between granular data and sequence compression is akin to the balance in LLMs between vocabulary size and context window. In our case, the goal is to balance the length of interaction history against the level of detail retained in individual tokens. Overly lossy tokenization risks losing valuable signals, while an overly granular sequence can exceed practical limits on processing time and memory.

Even with such strategies, interaction histories from active users can span thousands of events, exceeding the capacity of transformer models with standard self-attention layers. In recommendation systems, context windows at inference are often limited to hundreds of events, not because of model capability but because these services typically require millisecond-level latency. This constraint is more stringent than what is typical in LLM applications, where longer inference times (seconds) are more tolerable.

To address this during training, we implement two key solutions:

  1. Sparse Attention Mechanisms: By leveraging sparse attention techniques such as low-rank compression, the model can extend its context window to several hundred events while maintaining computational efficiency. This enables it to process more extensive interaction histories and derive richer insights into long-term preferences.
  2. Sliding Window Sampling: During training, we sample overlapping windows of interactions from the full sequence. This ensures the model is exposed to different segments of the user’s history over multiple epochs, allowing it to learn from the entire sequence without requiring an impractically large context window (a small sketch follows this list).
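A minimal sketch of the sliding-window idea is below; the window length, the number of windows per sequence, and the uniform sampling scheme are illustrative assumptions, not the exact training configuration:

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")

def sample_training_windows(sequence: Sequence[T],
                            window_len: int = 512,
                            num_windows: int = 4) -> List[Sequence[T]]:
    """Sample overlapping, fixed-length windows from a member's full interaction
    sequence, so that across epochs the model sees every segment of the history
    without needing an impractically large context window."""
    if len(sequence) <= window_len:
        return [sequence]
    max_start = len(sequence) - window_len
    starts = sorted(random.sample(range(max_start + 1),
                                  k=min(num_windows, max_start + 1)))
    return [sequence[s:s + window_len] for s in starts]
```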

At inference time, when multi-step decoding is required, we can deploy KV caching to efficiently reuse past computations and maintain low latency.
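As a generic, single-head toy illustration of that idea (not the production implementation), a KV cache stores the keys and values of earlier steps so each new decoding step only computes attention for the newest token:

```python
import numpy as np

class KVCache:
    """Cache keys/values from earlier decoding steps; each new step attends
    from its query over all cached positions without recomputing them."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q_new: np.ndarray, k_new: np.ndarray, v_new: np.ndarray) -> np.ndarray:
        # Append this step's key/value, then attend from the new query.
        self.keys.append(k_new)
        self.values.append(v_new)
        K = np.stack(self.keys)                       # (t, d)
        V = np.stack(self.values)                     # (t, d)
        scores = K @ q_new / np.sqrt(len(q_new))      # (t,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                      # softmax over cached positions
        return weights @ V                            # attention output for the new step
```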

These approaches collectively allow us to balance the need for detailed, long-term interaction modeling against the practical constraints of model training and inference, improving both the precision and the scalability of our recommendation system.

Information in Each ‘Token’: While the first part of our tokenization process focuses on structuring sequences of interactions, the next crucial step is defining the rich information contained within each token. Unlike LLMs, which typically rely on a single embedding space to represent input tokens, our interaction events are packed with heterogeneous details. These include attributes of the action itself (such as locale, time, duration, and device type) as well as information about the content (such as item ID and metadata like genre and release country). Most of these features, especially categorical ones, are directly embedded within the model, embracing an end-to-end learning approach. However, certain features require special handling. For example, timestamps need additional processing to capture both absolute and relative notions of time, with absolute time being particularly important for understanding time-sensitive behaviors.
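To make the timestamp point concrete, here is a small sketch of deriving both absolute and relative time features for one token; the specific derived features (hour of day, day of week, gaps between events) are assumptions for illustration, not the exact feature set:

```python
from datetime import datetime, timezone

def time_features(event_ts: int, prev_event_ts: int, request_ts: int) -> dict:
    """Derive absolute and relative time features for one interaction token
    from Unix timestamps (illustrative feature set)."""
    dt = datetime.fromtimestamp(event_ts, tz=timezone.utc)
    return {
        # Absolute time: captures time-sensitive behaviors (e.g. weekend evenings).
        "hour_of_day": dt.hour,
        "day_of_week": dt.weekday(),
        # Relative time: recency and gaps between interactions.
        "secs_since_prev_event": event_ts - prev_event_ts,
        "secs_before_request": request_ts - event_ts,
    }
```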

To improve prediction accuracy in sequential recommendation systems, we organize token features into two categories:

  1. Request-Time Features: These are features available at the moment of prediction, such as log-in time, device, or location.
  2. Post-Action Features: These are details available only after an interaction has occurred, such as the specific show interacted with or the duration of the interaction.

To predict the next interaction, we combine request-time features from the current step with post-action features from the previous step. This blending of contextual and historical information ensures that each token in the sequence carries a comprehensive representation, capturing both the immediate context and user behavior patterns over time.
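A minimal sketch of this combination, with hypothetical feature names, might look like the following, where position i pairs the current request’s context with the previous step’s post-action details:

```python
from typing import Dict, List

def build_model_inputs(requests: List[Dict], post_actions: List[Dict]) -> List[Dict]:
    """For position i, combine request-time features of step i with
    post-action features of step i-1 (feature names are illustrative)."""
    inputs = []
    for i, req in enumerate(requests):
        prev = post_actions[i - 1] if i > 0 else {}   # empty at the start of the sequence
        inputs.append({
            "login_time": req.get("login_time"),
            "device": req.get("device"),
            "country": req.get("country"),
            "prev_title_id": prev.get("title_id"),
            "prev_watch_duration": prev.get("watch_duration"),
        })
    return inputs
```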

As previously mentioned, our default approach employs the autoregressive next-token prediction objective, similar to GPT. This strategy effectively leverages the vast scale of unlabeled user interaction data, and its adoption in recommendation systems has shown several successes [1–3]. However, given the distinct differences between language tasks and recommendation tasks, we have made several important modifications to the objective.

Firstly, during the pretraining phase of typical LLMs such as GPT, every target token is generally treated with equal weight. In contrast, in our model not all user interactions are of equal importance. For instance, a 5-minute trailer play should not carry the same weight as a 2-hour full movie watch. A greater challenge arises when trying to align long-term user satisfaction with specific interactions and recommendations. To address this, we can adopt a multi-token prediction objective during training, where the model predicts the next n tokens at each step instead of a single token [4]. This approach encourages the model to capture longer-term dependencies and avoid myopic predictions focused solely on the immediate next event.
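The sketch below shows one way such a multi-token prediction loss could be written, in the spirit of [4]; the use of a separate linear head per prediction horizon and the equal weighting across horizons are assumptions, not the exact objective used:

```python
import torch
import torch.nn.functional as F

def multi_token_loss(hidden: torch.Tensor, heads, item_ids: torch.Tensor,
                     n_future: int = 4) -> torch.Tensor:
    """Multi-token prediction: at each position t, head k predicts the item at t+k.

    hidden:   (batch, seq_len, d_model) transformer outputs
    heads:    list of nn.Linear(d_model, num_items) modules, one per horizon
    item_ids: (batch, seq_len) ground-truth item ids
    """
    total = hidden.new_zeros(())
    seq_len = hidden.size(1)
    used = 0
    for k, head in enumerate(heads[:n_future], start=1):
        if seq_len <= k:
            break
        logits = head(hidden[:, :-k])            # predict the item k steps ahead
        targets = item_ids[:, k:]
        total = total + F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                        targets.reshape(-1))
        used += 1
    return total / max(used, 1)
```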

Secondly, we can use multiple fields of our input data as auxiliary prediction targets in addition to predicting the next item ID, which remains the primary target. For example, we can derive genres from the items in the original sequence and use this genre sequence as an auxiliary target. This approach serves several purposes: it acts as a regularizer to reduce overfitting on noisy item-ID predictions, provides additional insight into user intentions or long-term genre preferences, and, when structured hierarchically, can improve the accuracy of predicting the target item ID. By first predicting auxiliary targets, such as genre or original language, the model effectively narrows down the candidate list, simplifying the subsequent item-ID prediction.
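As an illustration (not the exact architecture), an auxiliary genre head can be trained alongside the primary item-ID head; the head shapes and the auxiliary loss weight below are assumed hyperparameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NextItemWithAuxHead(nn.Module):
    """Primary next-item head plus an auxiliary genre head; the auxiliary loss
    acts as a regularizer and a signal of longer-term taste (a sketch)."""
    def __init__(self, d_model: int, num_items: int, num_genres: int):
        super().__init__()
        self.item_head = nn.Linear(d_model, num_items)
        self.genre_head = nn.Linear(d_model, num_genres)

    def forward(self, hidden, item_targets, genre_targets, aux_weight: float = 0.3):
        # hidden: (batch, seq_len, d_model); targets: (batch, seq_len)
        item_loss = F.cross_entropy(
            self.item_head(hidden).reshape(-1, self.item_head.out_features),
            item_targets.reshape(-1))
        genre_loss = F.cross_entropy(
            self.genre_head(hidden).reshape(-1, self.genre_head.out_features),
            genre_targets.reshape(-1))
        return item_loss + aux_weight * genre_loss
```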

Beyond the infrastructure challenges of training larger models on substantial amounts of user interaction data, which are common to any foundation-model effort, there are several hurdles unique to recommendations that must be overcome to make such models viable. One of these unique challenges is entity cold-starting.

At Netflix, our mission is to entertain the world, and new titles are added to the catalog frequently. The recommendation foundation model therefore requires a cold-start capability: it must estimate members’ preferences for newly launched titles before anyone has engaged with them. To enable this, our foundation model training framework is built with two capabilities: incremental training and the ability to do inference with unseen entities.

  1. Incremental training: Foundation models are trained on extensive datasets, including every member’s history of plays and actions, making frequent retraining from scratch impractical. However, our catalog and member preferences continually evolve. Unlike large language models, which can be incrementally trained with stable token vocabularies, our recommendation models require new embeddings for new titles, necessitating expanded embedding layers and output components. To address this, we warm-start new models by reusing parameters from previous models and initializing new parameters for new titles. For example, new title embeddings can be initialized by adding slight random noise to existing average embeddings or by using a weighted combination of similar titles’ embeddings based on metadata. This approach allows new titles to start with reasonable embeddings, facilitating faster fine-tuning. In practice, the initialization method matters less as more member interaction data is used for fine-tuning.
  2. Dealing with unseen entities: Even with incremental training, the model is not always guaranteed to learn well on new entities (for example, newly launched titles). It is also possible that some new entities are not included or seen in the training data even if we fine-tune the foundation model frequently. Therefore, it is important to let the foundation model use entity metadata as input, not just member interaction data. Our foundation model thus combines learnable item-ID embeddings with learnable embeddings derived from metadata. The following diagram illustrates this idea.
Figure 2. Titles are associated with various metadata, such as genres, storylines, and tones. Each type of metadata can be represented by averaging its respective embeddings, which are then concatenated to form the overall metadata-based embedding for the title.

To create the final title embedding, we combine this metadata-based embedding with a fully learnable ID-based embedding using a mixing layer. Instead of simply summing the two, we use an attention mechanism based on the “age” of the entity. This allows new titles with limited interaction data to rely more on metadata, while established titles rely more on their ID-based embeddings. Since titles with similar metadata can have very different user engagement, their embeddings should reflect those differences. Introducing some randomness during training encourages the model to learn from metadata rather than relying solely on ID embeddings. This strategy ensures that newly launched or pre-launch titles have reasonable embeddings even with no user interaction data.
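A minimal sketch of such an age-aware mixing layer is below; the sigmoid gate parameterization and the dropout used to inject randomness are illustrative choices, not the exact mechanism described above:

```python
import torch
import torch.nn as nn

class TitleEmbeddingMixer(nn.Module):
    """Combine metadata-based and ID-based title embeddings with an
    age-dependent gate: new titles lean on metadata, established titles
    lean on their learned ID embedding (a sketch)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.age_gate = nn.Sequential(nn.Linear(1, d_model), nn.Sigmoid())
        self.id_dropout = nn.Dropout(p=0.1)  # randomness that nudges the model toward metadata

    def forward(self, id_emb: torch.Tensor, metadata_emb: torch.Tensor,
                title_age_days: torch.Tensor) -> torch.Tensor:
        # id_emb, metadata_emb: (batch, d_model); title_age_days: (batch,)
        gate = self.age_gate(title_age_days.unsqueeze(-1))   # near 0 for new titles
        id_emb = self.id_dropout(id_emb)
        return gate * id_emb + (1.0 - gate) * metadata_emb
```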

Our recommendation foundation model is designed to understand long-term member preferences and can be utilized by downstream applications in several ways:

  1. Direct Use as a Predictive Model: The model is primarily trained to predict the next entity a user will interact with. It includes multiple predictor heads for different tasks, such as forecasting member preferences for various genres. These can be applied directly to meet diverse business needs.
  2. Utilizing Embeddings: The model generates valuable embeddings for members and for entities like videos, games, and genres. These embeddings are computed in batch jobs and stored for use in both offline and online applications. They can serve as features in other models or be used for candidate generation, such as retrieving appealing titles for a user. High-quality title embeddings also support title-to-title recommendations. However, one important consideration is that the embedding space has arbitrary, uninterpretable dimensions and is incompatible across different model training runs. This poses challenges for downstream consumers, who must adapt to each retraining and redeployment, risking bugs due to invalidated assumptions about the embedding structure. To address this, we apply an orthogonal low-rank transformation to stabilize the user/item embedding space, ensuring a consistent meaning of embedding dimensions even as the base foundation model is retrained and redeployed (see the sketch after this list).
  3. Fine-Tuning with Specific Data: The model’s adaptability allows for fine-tuning with application-specific data. Users can integrate the full model or subgraphs of it into their own models, fine-tuning them with less data and computational power. This approach achieves performance comparable to previous models, despite the initial foundation model requiring significant resources.
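One way to realize the embedding-space stabilization mentioned in item 2 is an orthogonal Procrustes alignment computed on entities shared across model versions; this is a sketch of the general idea under that assumption (and it omits the low-rank aspect), not the exact transformation used in production:

```python
import numpy as np

def align_embeddings(new_emb: np.ndarray, old_emb: np.ndarray) -> np.ndarray:
    """Align a retrained model's embeddings to the previous embedding space via
    an orthogonal rotation, so downstream consumers see a stable space across
    retrainings. `new_emb` and `old_emb` hold embeddings of the same anchor
    entities (rows aligned), shape (num_entities, dim)."""
    # Solve min_R ||new_emb @ R - old_emb||_F subject to R being orthogonal.
    u, _, vt = np.linalg.svd(new_emb.T @ old_emb)
    rotation = u @ vt
    return new_emb @ rotation
```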

In scaling up our foundation model for Netflix recommendations, we draw inspiration from the success of large language models (LLMs). Just as LLMs have demonstrated the power of scale in improving performance, we find that scaling is crucial for improving generative recommendation tasks. Successful scaling demands robust evaluation, efficient training algorithms, and substantial computing resources. Evaluation must effectively differentiate model performance and identify areas for improvement. Scaling involves data, model, and context scaling, incorporating user engagement, external reviews, multimedia assets, and high-quality embeddings. Our experiments confirm that the scaling law also applies to our foundation model, with consistent improvements observed as we increase data and model size.

Figure 3. The relationship between model parameter size and relative performance improvement. The plot demonstrates the scaling law in recommendation modeling, showing a trend of increased performance with larger model sizes. The x-axis is logarithmically scaled to highlight growth across different magnitudes.

In conclusion, our Foundation Model for Personalized Recommendation represents a significant step toward creating a unified, data-centric system that leverages large-scale data to improve the quality of recommendations for our members. This approach borrows insights from large language models (LLMs), notably the principles of semi-supervised learning and end-to-end training, aiming to harness the vast scale of unlabeled user interaction data. The model also addresses challenges unique to recommendation, such as cold start and presentation bias, acknowledging the distinct differences between language tasks and recommendation tasks. The foundation model enables various downstream applications, from direct use as a predictive model to generating user and entity embeddings for other applications, and it can be fine-tuned for specific canvases. We are seeing promising results from downstream integrations. This move from multiple specialized models to a more comprehensive system marks an exciting development in the field of personalized recommendation systems.

Contributors to this work (names in alphabetical order): Ai-Lei Sun, Aish Fenton, Anne Cocos, Anuj Shah, Arash Aghevli, Baolin Li, Bowei Yan, Dan Zheng, Dawen Liang, Ding Tong, Divya Gadde, Emma Kong, Gary Yeh, Inbar Naor, Jin Wang, Justin Basilico, Kabir Nagrecha, Kevin Zielnicki, Linas Baltrunas, Lingyi Liu, Luke Wang, Matan Appelbaum, Michael Tu, Moumita Bhattacharya, Pablo Delgado, Qiuling Xu, Rakesh Komuravelli, Raveesh Bhalla, Rob Story, Roger Menezes, Sejoon Oh, Shahrzad Naseri, Swanand Joshi, Trung Nguyen, Vito Ostuni, Wei Wang, Zhe Zhang

  1. W.-C. Kang and J. McAuley, “Self-Attentive Sequential Recommendation,” 2018 IEEE International Conference on Data Mining (ICDM), Singapore, 2018, pp. 197–206, doi: 10.1109/ICDM.2018.00035.
  2. F. Sun et al., “BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer,” Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM ’19), Beijing, China, 2019, pp. 1441–1450, doi: 10.1145/3357384.3357895.
  3. J. Zhai et al., “Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations,” arXiv preprint arXiv:2402.17152, 2024.
  4. F. Gloeckle, B. Youbi Idrissi, B. Rozière, D. Lopez-Paz, and G. Synnaeve, “Better & Faster Large Language Models via Multi-token Prediction,” arXiv preprint arXiv:2404.19737, Apr. 2024.
