Reinforcement Learning For Real Life: Contextual Bandits

Ranko Mosic
4 min read · May 20, 2020

RL is not really a solved area; in particular, there are very simple problems that will break it.

A simple problem breaking all common multistep Reinforcement Learning algorithms (Langford)¹

To work around current RL limitations, various expert-level tricks are used. Trick #4 in John Langford's taxonomy makes it possible to use RL in practice: a substantial step beyond supervised learning that we can apply as routinely as we apply supervised learning.

John coined the term contextual bandits back in 2007. Major cloud providers (Microsoft, Google, AWS) provide services related to Vowpal Wabbit, an open-source project featuring contextual bandits.

It turns out there is a special case: if you just have an observation, a policy that chooses an action, and then you see a reward, and the goal is to maximize the sum of immediate rewards, this is what I would call a contextual bandit. It is the same setup as full Reinforcement Learning, except the reward is directly associated with the action taken in a context.

Viewed as a function, a policy is the same kind of object as a classifier in supervised learning, but the key difference is that a policy acts. The fact that the policy does something in the world is really important to the process of learning, because the way it acts influences the rewards it observes, which in turn influences the training.

Contextual bandits are useful for internet applications (recommenders), where feedback on an action is immediately and explicitly available. The problem with standard supervised-learning-trained recommenders is their inability to generalize, i.e. overfitting (even after you buy an item online you will still be getting ads for that same item).

A fundamental claim is that you need to explore in this setting to succeed in general. Maybe you have a policy which says that a person is interested in a space article, but if you explore a little bit and sometimes display a food article, you may discover they are actually more interested in the food article. That process of gathering information requires some form of exploration.

Policy shows articles related to space exploration
Policy sometimes displays food article
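One standard way to get that exploration is epsilon-greedy: most of the time show what the current policy prefers, but with a small probability show a random article, and log the probability with which the displayed action was chosen. A minimal sketch (function and action names are illustrative):

```python
import random

def epsilon_greedy(greedy_action, actions, epsilon=0.1):
    """Mostly follow the current policy, occasionally try another action.

    Returns the chosen action and the probability with which it was chosen;
    logging that probability is what later makes offline evaluation possible.
    """
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = greedy_action
    if action == greedy_action:
        prob = (1 - epsilon) + epsilon / len(actions)
    else:
        prob = epsilon / len(actions)
    return action, prob

# Example: the policy prefers the space article, but roughly 10% of the
# time a random article (possibly the food one) is shown instead.
action, prob = epsilon_greedy(greedy_action="space_article",
                              actions=["space_article", "food_article"],
                              epsilon=0.1)
```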

The traditional solution is to deploy the policy in an A/B test, wait a couple of weeks and see how it performs. It turns out you can do it a different way: you can use historic (offline) data to evaluate the policy.

Policy evaluated using offline data only

That means that instead of taking two weeks to evaluate a policy, we can take a minute or less. We can iterate much faster and figure out what the right features and the right representation are.
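A common way to do this offline evaluation is an inverse propensity score (IPS) estimate over the exploration log: each logged (context, action, reward, probability) record counts only if the new policy would have chosen the same action, reweighted by how likely the logging policy was to pick it. A minimal sketch, with illustrative field names (this is one standard estimator, not necessarily the exact method used in any particular system):

```python
def ips_value(log, new_policy):
    """Estimate the average reward the new policy would have obtained,
    using only historic exploration data.

    `log` is a list of dicts with keys: context, action, reward, prob,
    where `prob` is the probability the logging policy gave to `action`.
    """
    total = 0.0
    for record in log:
        if new_policy(record["context"]) == record["action"]:
            # Reweight by 1/prob so rarely explored actions are not undercounted.
            total += record["reward"] / record["prob"]
    return total / len(log)

# Example: evaluate a policy that always shows the food article.
log = [
    {"context": ["likes_science"], "action": "space_article", "reward": 1.0, "prob": 0.95},
    {"context": ["likes_science"], "action": "food_article",  "reward": 1.0, "prob": 0.05},
]
always_food = lambda context: "food_article"
print(ips_value(log, always_food))  # 10.0: unbiased, but can be high variance
```

This is also why the exploration step above must log the probability of the action it displayed; without that probability, the reweighting is impossible.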

Microsoft Personalizer, a commercial contextual-bandit-based service

¹ DeepMind's AlphaStar uses imitation learning (a supervised learning technique) to overcome the credit assignment problem, i.e. to bootstrap past the situation where RL is not able to come up with a good initial strategy. RL is then used to further improve the AlphaStar agent. The goal of a StarCraft II game is to build up your own resources while fighting the opposing forces:

AlphaStar vs Human Player

The credit assignment issue in RL is really, really hard. I do believe we could do better, and that is maybe a research challenge for the future.

OpenAI Five, the Dota 2 bot, solves a similar problem entirely with self-play; this was possible because the Dota 2 action space is much smaller and the game itself was simplified and limited.
