In-context Exploration-Exploitation for Reinforcement Learning


May 07, 2024 Published by Zhenwen Dai, Federico Tomasi, Sina Ghiassian


Machine learning (ML) models are widely used at Spotify to provide users with a daily personalized listening experience. To ensure these systems continuously adapt to users' preferences and swiftly adjust to changes in their interests, the ML systems need to update themselves rapidly with incoming user interaction data. The process of updating an ML model is illustrated in the following diagram. Ideally, the shorter the update cycle, the quicker the ML systems can learn about changes in users' preferences. However, in typical production ML systems, a model update cycle can range from several hours to a few days.

As depicted in the diagram, the update cycle is primarily driven by the ML model's recommendations. A high-quality recommendation not only suggests content that aligns with a user's preferences but also selects content that efficiently gathers information about the user's real-time interest. This allows the ML model to respond quickly to a user's needs. A principled approach to tackling this challenge is known as Bayesian decision making. This method addresses the problem by maintaining a Bayesian belief about a user's interest and selecting content that decreases the uncertainty of this belief while catering to the user's interest. This balancing act is often referred to as the exploration-exploitation (EE) trade-off.

Conventional methods in this domain accomplish the Bayesian belief update through model updating, typically implemented via gradient-based optimization. Although Bayesian modeling is a well-established theoretical framework, these approaches suffer from high computational complexity and various other limitations, such as the difficulty of specifying a good prior distribution.

In this work, we present a novel approach to model updating known as in-context exploration-exploitation (ICEE). ICEE allows us to achieve Bayesian belief updates through neural network inference, made possible by adopting the concept of in-context learning. With this approach, Bayesian decision making can be accomplished through standard supervised learning and neural network inference, enabling real-time adaptation to users' preferences. The following sections provide a detailed explanation of ICEE.

Return conditioned In-context Policy Learning

Consider the recommendation problem through the lens of reinforcement learning (RL). The personalization of each user's listening experience can be viewed as a distinct task because users' content preferences vary significantly. Each user's listening session can be interpreted as an episode in the episodic RL formulation, since the user's intent may differ from session to session, leading to different actions. These actions can generate statistics used to define the reward for our agent. This setting is often referred to as meta RL.

ICEE is designed to address the aforementioned meta RL problem with one additional constraint: it cannot perform parameter updates while learning to solve individual tasks. ICEE tackles this challenge by extending the framework of return conditioned RL with in-context learning. In this formulation, both policy learning and action prediction are modeled as a sequence prediction problem. For each task, the information from all the episodes is consolidated into a single sequence, as illustrated in the figure below.

In this sequence, a time step in an episode is represented by a triplet: state, return-to-go, and a combination of the action taken and the resulting reward. The return-to-go indicates the agent's future performance. The time steps within an episode are arranged chronologically. There is no specific requirement regarding the order of episodes. For simplicity, we also arrange the episodes in chronological order. This sequence is then fed into a Transformer decoder architecture with a causal attention mechanism. The Transformer decoder's output at the location of the return-to-go is used to predict the action for the corresponding time step.
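The sequence layout described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Step` class, the token tags, and both helper functions are hypothetical names, and real inputs would be embedded vectors rather than tagged tuples.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    """One time step of an episode (hypothetical container for illustration)."""
    state: int
    return_to_go: float
    action: int
    reward: float


def flatten_episodes(episodes: List[List[Step]]) -> List[Tuple[str, object]]:
    """Consolidate all episodes of one task into a single token sequence.

    Each time step contributes a triplet, in this order:
    state, return-to-go, (action, reward).
    Episodes are concatenated in the order given (chronological here).
    """
    tokens: List[Tuple[str, object]] = []
    for episode in episodes:
        for step in episode:
            tokens.append(("state", step.state))
            tokens.append(("rtg", step.return_to_go))
            tokens.append(("action_reward", (step.action, step.reward)))
    return tokens


def action_prediction_positions(tokens: List[Tuple[str, object]]) -> List[int]:
    """Indices whose decoder output would be read to predict the action.

    With causal attention, the output at a return-to-go position sees the
    current state and everything before it, but not the action itself.
    """
    return [i for i, (kind, _) in enumerate(tokens) if kind == "rtg"]


episodes = [
    [Step(0, 1.0, 2, 0.0), Step(1, 1.0, 0, 1.0)],  # episode 1, two steps
    [Step(0, 1.0, 0, 1.0)],                        # episode 2, one step
]
tokens = flatten_episodes(episodes)
positions = action_prediction_positions(tokens)
```

Here three time steps yield nine tokens, and the action for each step is read from positions 1, 4, and 7.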

During inference, we start with a sequence consisting of only the initial state of the first episode and the corresponding return-to-go, from which an action is sampled from the model. After applying the action in the environment, we observe the reward and the next state. This new information is appended to the input sequence and the next action is sampled. This interaction loop continues until the episode concludes. After completing the first episode, we begin with the initial state of a new episode and repeat the process. Through this method, the model can solve a new task after a few episodes. Note that no model parameter updates occur during the task-solving process. All learning occurs by collecting information through actions and using this information to inform future actions.
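The interaction loop can be sketched as below. Everything here is an assumption made for illustration: `DarkGuessEnv` is a toy environment with a hidden rewarded action, and `stub_policy` is a hand-written stand-in for sampling an action from the trained Transformer. The point of the sketch is only the control flow: the context grows across episodes and nothing is ever trained.

```python
import random


class DarkGuessEnv:
    """Toy task (hypothetical): one hidden action yields reward 1, others 0."""

    def __init__(self, n_actions, target, horizon=1):
        self.n_actions = n_actions
        self.target = target
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # the state carries no information about the target

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == self.target else 0.0
        done = self.t >= self.horizon
        return 0, reward, done


def stub_policy(context, state, rtg, n_actions, rng):
    """Stand-in for the model: reads the context, has no trainable state.

    Repeats a previously rewarded action if one exists in the context,
    otherwise tries an action that has not been tried yet.
    """
    rewarded = [a for (_, _, a, r) in context if r > 0]
    if rewarded:
        return rewarded[-1]
    tried = {a for (_, _, a, _) in context}
    untried = [a for a in range(n_actions) if a not in tried]
    return rng.choice(untried or list(range(n_actions)))


def run_in_context(env, n_episodes, seed=0):
    rng = random.Random(seed)
    context = []   # (state, rtg, action, reward); kept across episodes
    returns = []
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        episode_return = 0.0
        while not done:
            rtg = 1.0  # placeholder for the return-to-go fed to the model
            action = stub_policy(context, state, rtg, env.n_actions, rng)
            next_state, reward, done = env.step(action)
            # Append the new observation to the input sequence.
            context.append((state, rtg, action, reward))
            episode_return += reward
            state = next_state
        returns.append(episode_return)
    return returns, context


returns, context = run_in_context(DarkGuessEnv(n_actions=4, target=2), n_episodes=6)
```

With four actions, the stub needs at most four episodes to find the rewarded action and exploits it from then on, which mirrors the explore-then-exploit behaviour the model is trained to produce.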

A key component in the above description is the choice of return-to-go. It allows the model to differentiate between good and bad actions during training and to select only good actions during inference. In ICEE, the return-to-go consists of an in-episode return-to-go and a cross-episode return-to-go. For the in-episode return-to-go, we follow the design of the multi-game decision transformer (MGDT). It defines the return-to-go as the cumulative future reward from the current step and includes a model that predicts the return at each step. During inference, it samples a value from the return distribution skewed towards good returns and uses this value as the return-to-go. For the cross-episode return-to-go, we use a binary value. It is set to 1 if the current episode's return is better than that of all previous episodes, and 0 otherwise. During inference, the value is set to 1 while taking actions and adjusted to the true value after the episode ends.
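A minimal sketch of how the two training-time return-to-go signals could be computed from logged data. The function names are hypothetical, and two details are assumptions rather than statements from the source: the in-episode value includes the reward of the current step, and the first episode is flagged as 1 because it has no earlier episode to beat. The MGDT-style sampling of returns at inference time is not covered here.

```python
from typing import List


def in_episode_return_to_go(rewards: List[float]) -> List[float]:
    """Cumulative future reward from each step to the end of the episode.

    Assumption: the reward obtained at step t is counted in the value at t.
    """
    rtg: List[float] = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]


def cross_episode_flags(episode_returns: List[float]) -> List[int]:
    """Binary flag per episode: 1 if its return beats all earlier episodes.

    Assumption: the first episode gets 1 (nothing earlier to compare with).
    """
    flags: List[int] = []
    best = None
    for ret in episode_returns:
        if best is None or ret > best:
            flags.append(1)
            best = ret
        else:
            flags.append(0)
    return flags


rtg = in_episode_return_to_go([0.0, 1.0, 0.0, 2.0])
flags = cross_episode_flags([1.0, 0.5, 2.0, 2.0])
```

For the reward sequence `[0, 1, 0, 2]` the in-episode values are `[3, 3, 2, 2]`, and for episode returns `[1.0, 0.5, 2.0, 2.0]` the flags are `[1, 0, 1, 0]`: a tie with the best earlier return does not count as an improvement.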

In ICEE, the balance between exploration and exploitation is more important than in previous return conditioned RL approaches, because the model needs to learn to solve a new task in as few episodes as possible. To achieve efficient EE, we propose an unbiased training objective for ICEE. This is based on the observation that the action learned by standard return conditioned RL methods is biased towards the data collection policy. By reformulating the training objective, we obtain an unbiased objective, which corrects the previous one with a probability ratio between the uniform action distribution and the data collection distribution.
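One way to read the correction is as an importance weight on each logged action. The sketch below assumes a discrete action space and a known data collection probability for each logged action; the function names are hypothetical, and the exact objective used in the paper may differ in form.

```python
import math
from typing import List


def importance_weights(behavior_probs: List[float], n_actions: int) -> List[float]:
    """Ratio of the uniform action probability to the data collection probability.

    Assumption: every logged action has behavior probability > 0.
    """
    uniform = 1.0 / n_actions
    return [uniform / p for p in behavior_probs]


def weighted_nll(model_probs: List[float],
                 behavior_probs: List[float],
                 n_actions: int) -> float:
    """Importance-weighted negative log-likelihood of the logged actions.

    model_probs[i] is the probability the model assigns to logged action i.
    Actions the data collection policy favoured are down-weighted, and
    rarely chosen actions are up-weighted.
    """
    weights = importance_weights(behavior_probs, n_actions)
    losses = [-w * math.log(q) for w, q in zip(weights, model_probs)]
    return sum(losses) / len(losses)


weights = importance_weights([0.5, 0.25], n_actions=4)
loss = weighted_nll([0.5, 0.5], [0.25, 0.25], n_actions=4)
```

With four actions, an action logged with probability 0.5 gets weight 0.5, while one logged with probability 0.25 (the uniform rate) keeps weight 1.0.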


Bayesian optimization (BO) is a very successful application of EE. A BO algorithm searches for the optimum of a function with a minimal number of function evaluations. Gaussian process (GP) based BO methods have been widely used in various domains such as hyper-parameter tuning, drug discovery, and aerodynamic optimization. To evaluate the EE performance of ICEE, we apply ICEE to BO and compare it with a GP-based approach using one of the most widely used acquisition functions, expected improvement (EI).
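For reference, the EI baseline has a closed form once the GP posterior at a candidate point is Gaussian. The sketch below uses the standard formula for a maximization problem, with no exploration offset; that sign convention is an assumption (the benchmark may be posed as minimization), and the function name is illustrative.

```python
import math


def expected_improvement(mean: float, std: float, best: float) -> float:
    """Expected improvement over `best` under a Gaussian posterior N(mean, std^2).

    Maximization convention: improvement is max(f - best, 0).
    """
    if std <= 0.0:
        # No uncertainty left: improvement is deterministic.
        return max(mean - best, 0.0)
    z = (mean - best) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best) * cdf + std * pdf


ei_certain = expected_improvement(mean=1.0, std=0.0, best=0.5)
ei_narrow = expected_improvement(mean=0.0, std=1.0, best=0.0)
ei_wide = expected_improvement(mean=0.0, std=2.0, best=0.0)
```

At equal posterior mean, the point with the larger posterior standard deviation has the larger EI, which is how the acquisition function trades exploration against exploitation.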

The above figures show a comparison of ICEE and EI on a set of 2D benchmark functions. The search efficiency of ICEE is on par with the GP-based BO method with EI. This indicates that ICEE can learn to perform EE through in-context learning (ICL). A clear advantage of ICEE is that the whole search is done through model inference without the need for any gradient optimization. In contrast, GP-based BO methods need to fit a GP surrogate function at each step, which results in a significant speed difference.

Grid-world RL. We investigate the in-context policy learning capability of ICEE on sequential RL problems. We focus on families of environments that cannot be solved through zero-shot generalization of a pre-trained model, so in-context policy learning is necessary for solving the tasks. We use two grid-world environments: dark room and dark key-to-door.

The experiment results, shown in the figure above, demonstrate that ICEE is able to solve the sampled games efficiently compared to the baseline methods. The EE capability allows ICEE to search for the missing information efficiently and then act with confidence once the missing information is found. More details of the above experiments can be found in the paper.


We present an in-context EE algorithm by extending the decision transformer formulation to in-context learning and deriving an unbiased training objective. Through the experiments on BO and discrete RL problems, we demonstrate that: (i) ICEE can perform EE in in-context learning without the need for explicit Bayesian inference; (ii) the performance of ICEE is on par with state-of-the-art BO methods without the need for gradient optimization, which leads to a significant speed-up; (iii) new RL tasks can be solved within tens of episodes.

These results show that ICEE could be a promising technology for enhancing the responsiveness of personalization experiences by significantly reducing model update cycles.

For more information, please refer to our paper:
In-context Exploration-Exploitation for Reinforcement Learning
Zhenwen Dai, Federico Tomasi, Sina Ghiassian
ICLR 2024

