Why is Reinforcement Learning so Curious?

Why is reinforcement learning so curious? originally appeared on Quora - the knowledge sharing network where compelling questions are answered by people with unique insights.

Answer by Alex Smola, Professor, Carnegie Mellon University and Chief Scientist, 1-Page, on Quora.

Supervised Learning

In this simplest of all cases we get some data, say pairs of (images, labels), e.g. of cats and humans and we want to build a cat vs. human discriminator. So we go and use our favorite machine learning algorithm to do this. When deployed, we assume that the distribution of cats and humans is the same, so anything that works provably on the training set will work well in practice. Most people focus on this easy case.

Covariate Shift

Now assume that our training set is a bit different than the scenario in which this gets deployed. For instance, imagine that the training set doesn't have a lot of pictures of tabby cats, because they're harder to catch. Now, we have no way of telling whether our algorithm will work well in practice, since we don't have quite so much information about tabby cats. This is often referred to as covariate shift where the training distribution is a bit different. We can fix this by reweighting the training set suitably to give the few tabby cat pictures in the training set a higher weight.

Bandit Algorithms and Explore-Explore

So far we assumed that all the data is given to us and that we don't really have much of a choice of what we do. Bandit algorithms address this. For instance, when Google wants to display an ad for the query 'cat' it has thousands of possible ads at its disposition, ranging from kitty litter to catsuits. Which ad gets the most clicks depends very much on the ad, the user, and the context. Hence it must try out different versions to determine which ones are best. Doing this brute force is very expensive (nobody wants to see too many catsuit ads). A popular strategy is to use what is called UCB (upper confidence bound) algorithms where a decision is made with the benefit for uncertainty. Once we see what happens, we can improve our estimates and find the next best guess for an ad.

Reinforcement Learning

The typical model online is that users have the memory of a goldfish. That is, it doesn't really matter very much what we did in the past to determine what will happen in the future. Obviously this isn't true in reality and reinforcement learning deals with systems that are stateful. E.g. a user might remember seeing catsuit pictures and might therefore be too embarrassed to search for cat pictures in the future ever again. Flying planes, helicopters, and recommending content to users all fall into this category.

One popular way to deal with this is that we try to estimate what the best step is, given the expected value of the position after we take this step and the benefit in taking the next step.

Adversarial Environments

When using this for games we also have to assume that the environment is out to get us. That is, the other player also wants to win at Go, Chess, etc. This makes it a bit harder to estimate what the environment will do in the next step. And this is one of the neat bits that the authors of AlphaGo got right. Side note - there's some discussion whether Lee Sedol managed to get AlphaGo into an unanticipated state by his moves in the fourth game. This just shows that adversarial environments are very hard to handle.

This question originally appeared on Quora. - the knowledge sharing network where compelling questions are answered by people with unique insights. You can follow Quora on Twitter, Facebook, and Google+. More questions:​