Recommending for Long-Term Member Satisfaction at Netflix

By Jiangwei Pan, Gary Tang, Henry Wang, and Justin Basilico

Our mission at Netflix is to entertain the world. Our personalization algorithms play an important role in delivering on this mission for all members by recommending the right shows, movies, and games at the right time. This goal extends beyond immediate engagement; we aim to create an experience that brings lasting enjoyment to our members. Traditional recommender systems often optimize for short-term metrics like clicks or engagement, which may not fully capture long-term satisfaction. We strive to recommend content that not only engages members in the moment but also enhances their long-term satisfaction, which increases the value they get from Netflix and makes them more likely to remain members.

One simple way to view recommendations is as a contextual bandit problem. When a member visits, that becomes a context for our system, which selects an action of what recommendations to show; the member then provides various types of feedback. These feedback signals can be immediate (skips, plays, thumbs up/down, or adding items to their list) or delayed (completing a show or renewing their subscription). We can define reward functions to reflect the quality of a recommendation from these feedback signals and then train a contextual bandit policy on historical data to maximize the expected reward.
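To make this framing concrete, here is a minimal, self-contained sketch. The `LoggedInteraction` structure, the play-based reward, and the toy per-item "policy" are assumptions made for illustration; a production policy would condition on rich context features rather than averaging per item.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class LoggedInteraction:
    context: Dict[str, float]   # member/session features at recommendation time
    action: str                 # id of the recommended title
    feedback: Dict[str, int]    # immediate and delayed signals, e.g. {"play": 1, "completed": 1}

def reward_fn(feedback: Dict[str, int]) -> float:
    """Placeholder reward: 1 if the recommendation was played, 0 otherwise."""
    return float(feedback.get("play", 0))

def train_policy(logs: List[LoggedInteraction]) -> Dict[str, float]:
    """Toy 'policy': estimate the expected reward of each item from logged data.
    A real policy would score items per context (e.g. a neural scorer)."""
    total, count = defaultdict(float), defaultdict(int)
    for log in logs:
        total[log.action] += reward_fn(log.feedback)
        count[log.action] += 1
    return {action: total[action] / count[action] for action in total}

logs = [
    LoggedInteraction({"hour": 20}, "one_piece", {"play": 1, "completed": 1}),
    LoggedInteraction({"hour": 9}, "one_piece", {"play": 0}),
    LoggedInteraction({"hour": 21}, "squid_game", {"play": 1}),
]
print(train_policy(logs))  # {'one_piece': 0.5, 'squid_game': 1.0}
```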

There are many ways a recommendation model can be improved: more informative input features, more data, different architectures, more parameters, and so on. In this post, we focus on a less-discussed aspect of improving the recommender: the objective, by defining a reward function that tries to better reflect long-term member satisfaction.

Member retention might seem like an obvious reward for optimizing long-term satisfaction, since members should stay if they are satisfied; however, it has several drawbacks:

  • Noisy: Retention can be influenced by numerous external factors, such as seasonal trends, marketing campaigns, or personal circumstances unrelated to the service.
  • Low Sensitivity: Retention is only sensitive for members on the verge of canceling their subscription, so it does not capture the full spectrum of member satisfaction.
  • Hard to Attribute: Members might cancel only after a sequence of bad recommendations.
  • Slow to Measure: We only get one signal per account per month.

Due to these challenges, optimizing for retention alone is impractical.

Instead, we can train our bandit policy to optimize a proxy reward function that is highly aligned with long-term member satisfaction while being sensitive to individual recommendations. The proxy reward r(user, item) is a function of the user's interaction with the recommended item. For example, if we recommend "One Piece" and a member plays, subsequently completes, and gives it a thumbs-up, a simple proxy reward might be defined as r(user, item) = f(play, complete, thumb).
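A toy version of such a proxy reward might look like the sketch below; the `proxy_reward` name and the weights are invented for illustration and are not the actual function used in production.

```python
def proxy_reward(play: bool, complete: bool, thumb_up: bool) -> float:
    """Toy proxy reward r(user, item) = f(play, complete, thumb).
    The weights below are illustrative and would be tuned through
    reward engineering and A/B testing in practice."""
    reward = 0.0
    if play:
        reward += 1.0   # the member at least started the recommendation
    if complete:
        reward += 2.0   # finishing the title is a stronger signal
    if thumb_up:
        reward += 3.0   # explicit positive feedback is weighted highest here
    return reward

# The "One Piece" example from the text: play, then completion, then a thumbs-up.
print(proxy_reward(play=True, complete=True, thumb_up=True))  # 6.0
```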

Click-through rate (CTR)

Click-through rate (CTR), or in our case play-through rate, can be seen as a simple proxy reward where r(user, item) = 1 if the user clicks a recommendation and 0 otherwise. CTR is a common feedback signal that generally reflects user preferences, and it is a simple yet strong baseline for many recommendation applications. In some cases, such as ads personalization where the click is the target action, CTR may even be a reasonable reward for production models. In general, however, over-optimizing CTR can lead to promoting clickbaity items, which may harm long-term satisfaction.
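For comparison with the richer proxy reward above, the CTR (play-through) reward is the simplest possible case, sketched here with an assumed `ctr_reward` helper:

```python
def ctr_reward(played: bool) -> float:
    """Play-through-rate reward: r(user, item) = 1 if the member played the
    recommendation and 0 otherwise. Simple, but blind to what happens after the click."""
    return 1.0 if played else 0.0
```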

Beyond CTR

To align the proxy reward function more closely with long-term satisfaction, we need to look beyond simple interactions, consider all types of user actions, and understand their true implications for user satisfaction.

We give a few examples in the Netflix context; a toy reward that encodes them is sketched after the list:

  • Fast season completion ✅: Completing a season of a recommended TV show in one day is a strong sign of enjoyment and long-term satisfaction.
  • Thumbs-down after completion ❌: Completing a TV show over several weeks, followed by a thumbs-down, indicates low satisfaction despite the significant time spent.
  • Playing a movie for just 10 minutes ❓: In this case, the user's satisfaction is ambiguous. The brief engagement might indicate that the user decided to abandon the movie, or it could simply mean the user was interrupted and plans to finish the movie later, perhaps the next day.
  • Discovering new genres ✅ ✅: Watching more Korean or game shows after “Squid Game” suggests the user is discovering something new. This discovery is likely even more valuable since it leads to a variety of engagements in a new area for the member.
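One hypothetical way to encode observations like these is a hand-crafted reward such as the sketch below; every signal name, threshold, and weight is an assumption made for illustration, not Netflix's actual reward definition.

```python
from typing import Optional

def satisfaction_proxy_reward(
    minutes_watched: float,
    days_to_complete: Optional[float],  # None if the title was never completed
    thumb: int,                         # +1 thumbs-up, -1 thumbs-down, 0 no rating
    new_genre_plays: int,               # follow-on plays in a genre that is new for the member
) -> float:
    """Toy encoding of the qualitative examples above; every weight is made up
    and would be iterated on through reward engineering and A/B testing."""
    reward = 0.0
    if days_to_complete is not None:
        reward += 2.0                   # completion is a positive signal
        if days_to_complete <= 1:
            reward += 2.0               # fast season completion: strong enjoyment
        if thumb < 0:
            reward -= 4.0               # thumbs-down after completion: low satisfaction despite time spent
    elif minutes_watched >= 10:
        reward += 0.5                   # brief partial play: ambiguous, so only small credit
    reward += max(thumb, 0)             # explicit thumbs-up adds on top
    reward += min(new_genre_plays, 3)   # discovering a new genre is extra valuable (capped)
    return reward

print(satisfaction_proxy_reward(600, days_to_complete=1, thumb=1, new_genre_plays=2))    # 7.0
print(satisfaction_proxy_reward(600, days_to_complete=21, thumb=-1, new_genre_plays=0))  # -2.0
```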

Reward engineering is the iterative process of refining the proxy reward function to align with long-term member satisfaction. It is similar to feature engineering, except that it can be derived from data that is not available at serving time. Reward engineering involves four stages: hypothesis formation, defining a new proxy reward, training a new bandit policy, and A/B testing. Below is a simple example.

User feedback used in the proxy reward function is often delayed or missing. For example, a member may decide to play a recommended show for just a few minutes on the first day and take several weeks to fully complete it; this completion feedback is therefore delayed. Additionally, some user feedback may never occur: much as we might wish otherwise, not all members provide a thumbs-up or thumbs-down after completing a show, leaving us uncertain about their level of enjoyment.

We could simply wait, allowing a longer window to observe feedback, but how long should we wait for delayed feedback before computing the proxy rewards? If we wait too long (e.g., weeks), we miss the opportunity to update the bandit policy with the latest data. In a highly dynamic environment like Netflix, a stale bandit policy can degrade the user experience and be particularly bad at recommending newer items.

Solution: predict missing feedback

We aim to update the bandit policy shortly after making a recommendation while also defining the proxy reward function based on all user feedback, including delayed feedback. Since delayed feedback has not been observed at the time of policy training, we can predict it. This prediction is made for each training example with delayed feedback, using the already observed feedback and other relevant information up to the training time as input features. As a result, the prediction also gets better as time progresses.

The proxy reward is then calculated for each training example using both observed and predicted feedback, and these training examples are used to update the bandit policy.
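A minimal sketch of this idea, assuming a hypothetical `proxy_reward_with_predictions` helper and illustrative weights: observed signals are used where available, and the delayed-feedback model's predicted probabilities stand in for the rest.

```python
from typing import Dict, Optional

def fill_in(observed: Optional[int], predicted_prob: float) -> float:
    """Use the observed signal when we have it; otherwise fall back to the model's prediction."""
    return float(observed) if observed is not None else predicted_prob

def proxy_reward_with_predictions(
    observed: Dict[str, Optional[int]],  # delayed signals may still be None at training time
    predicted: Dict[str, float],         # p(final feedback | observed feedback) from a separate model
) -> float:
    """Toy proxy reward over play (observed immediately) plus completion and thumbs-up,
    which may be delayed and therefore predicted. Weights are illustrative only."""
    play = float(observed.get("play") or 0)
    complete = fill_in(observed.get("complete"), predicted.get("complete", 0.0))
    thumb_up = fill_in(observed.get("thumb_up"), predicted.get("thumb_up", 0.0))
    return 1.0 * play + 2.0 * complete + 3.0 * thumb_up

# Two days after the recommendation: the play was observed, but completion and the rating
# are still pending, so the delayed-feedback model's probabilities stand in for them.
print(proxy_reward_with_predictions(
    observed={"play": 1, "complete": None, "thumb_up": None},
    predicted={"complete": 0.7, "thumb_up": 0.3},
))  # 1.0 + 2*0.7 + 3*0.3 ≈ 3.3
```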

But aren't we still only relying on observed feedback in the proxy reward function? Yes, because delayed feedback is predicted from observed feedback. However, it is simpler to reason about rewards using all feedback directly. For instance, the delayed thumbs-up prediction model may be a complex neural network that takes into account all observed feedback (e.g., short-term play patterns). It is more straightforward to define the proxy reward as a simple function of the thumbs-up feedback rather than as a complex function of short-term interaction patterns. The prediction step can also be used to adjust for potential biases in how feedback is provided.

The reward engineering diagram is updated with an optional delayed feedback prediction step.

Two types of ML models

It is worth noting that this approach employs two types of ML models (a sketch of how they fit together follows the list):

  • Delayed Feedback Prediction Models: These models predict p(final feedback | observed feedback). The predictions are used to define and compute proxy rewards for the bandit policy's training examples. As a result, these models are used offline during bandit policy training.
  • Bandit Policy Models: These models are used in the bandit policy π(item | user; r) to generate recommendations online and in real time.
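The sketch below, with stand-in logic in place of real trained models, illustrates how the two model types might fit together; both class names and all numbers are assumptions for illustration.

```python
from typing import Dict, List

class DelayedFeedbackPredictor:
    """Offline model: predicts p(final feedback | observed feedback). Its outputs are only
    used to label training examples for the bandit policy, never served directly."""
    def predict(self, observed_feedback: Dict[str, int]) -> Dict[str, float]:
        # Stand-in logic; a real model would be trained on historical feedback sequences.
        started = observed_feedback.get("play", 0)
        return {"complete": 0.6 * started, "thumb_up": 0.2 * started}

class BanditPolicy:
    """Online model: pi(item | user; r), called in real time to pick recommendations.
    Here it keeps one score per item; a production policy would score items per context."""
    def __init__(self, item_scores: Dict[str, float]):
        self.item_scores = item_scores

    def recommend(self, user_context: Dict[str, float], candidates: List[str]) -> str:
        return max(candidates, key=lambda item: self.item_scores.get(item, 0.0))

# Offline: the predictor fills in missing feedback so proxy rewards can be computed and the
# policy retrained frequently. Online: only the policy runs at recommendation time.
predictor = DelayedFeedbackPredictor()
print(predictor.predict({"play": 1}))  # {'complete': 0.6, 'thumb_up': 0.2}

policy = BanditPolicy({"one_piece": 0.9, "squid_game": 0.8})
print(policy.recommend({"hour": 20}, ["one_piece", "squid_game"]))  # one_piece
```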

Improved input features or neural network architectures often lead to better offline model metrics (e.g., AUC for classification models). However, when these improved models are subjected to A/B testing, we often observe flat or even negative online metrics, which quantify long-term member satisfaction.

This online-offline metric disparity usually occurs when the proxy reward used in the recommendation policy is not fully aligned with long-term member satisfaction. In such cases, a model may achieve higher proxy rewards (offline metrics) but result in worse long-term member satisfaction (online metrics).

Nevertheless, the model improvement is genuine. One way to resolve this is to further refine the proxy reward definition so that it aligns better with the improved model. When this tuning results in positive online metrics, the model improvement can be effectively productized. See [1] for more discussion of this challenge.

In this post, we presented an overview of our reward engineering efforts to align Netflix recommendations with long-term member satisfaction. While retention remains our north star, it is not easy to optimize directly. Our efforts therefore focus on defining a proxy reward that is aligned with long-term satisfaction and sensitive to individual recommendations. Finally, we discussed the unique challenge of delayed user feedback at Netflix and proposed an approach that has proven effective for us. Refer to [2] for an earlier overview of the reward innovation efforts at Netflix.

As we continue to improve our recommendations, several open questions remain:

  • Can we learn a good proxy reward function automatically by correlating behavior with retention?
  • How long should we wait for delayed feedback before using its predicted value in policy training?
  • How can we leverage Reinforcement Learning to further align the policy with long-term satisfaction?

[1] Deep learning for recommender systems: A Netflix case study. AI Magazine 2021. Harald Steck, Linas Baltrunas, Ehtsham Elahi, Dawen Liang, Yves Raimond, Justin Basilico.

[2] Reward innovation for long-term member satisfaction. RecSys 2023. Gary Tang, Jiangwei Pan, Henry Wang, Justin Basilico.
