Real World Interactive Learning
Alekh Agarwal, John Langford
KDD Tutorial, August 19
Slides and full references at http://hunch.net/~rwil/kdd2018.htm

The Supervised Learning Paradigm

[Figure: handwritten digit images (training examples) and their training labels feed a Supervised Learner, which outputs an accurate digit classifier.]

Supervised Learning is cool

How about news?

Repeatedly:
1. Observe features of user + articles
2. Choose a news article.
3. Observe click-or-not
Goal: Maximize fraction of clicks

A standard pipeline

1. Collect information.
2. Build.
3. Learn.
4. Act: Deploy in an A/B test for 2 weeks.
5. The A/B test fails.
6. Why?

Q: What goes wrong?
Is childproofing interesting to Alekh?
A: Need the Right Signal for the Right Answer

Q: What goes wrong?
[Figure: "Model value over time" — fraction of value retained by a model trained on Day 1 when evaluated on Day 1, Day 2, and Day 3; retained value decays across days.]
A: The world changes!

Interactive Learning

[Diagram: the learning loop — Features (user history, news stories) → Learning → Action (selected news story) → Consequence (click-or-not), which feeds back into Learning.]

Q: Why Interactive Learning?
A: $$$ — use free interaction data rather than expensive labels

Q: Why Interactive Learning?

AI: A function programmed with data →
AI: An economically viable digital agent that explores, learns, and acts

Flavors of Interactive Learning

Full Reinforcement Learning: Special domains. +Right Signal, -Nonstationarity bad, -$$$, +AI
Contextual Bandits: Immediate-reward RL. +Rightish Signal, +Nonstationarity OK, +$$$, +AI
Active Learning: Choose examples to label. -Wrong Signal, -Nonstationarity bad, +$$$, -Not AI

What is RL?

Repeatedly:
1. Observe user profile (Context/state)
2. Recommend news story (Action/decision)
3. Observe user's response (Reward)

Goal: Maximize a user engagement metric, e.g. clicks, dwell time, number of visits. Long-term.

Contextual bandits (CBs)

Repeatedly:
1. Observe user profile (Context/state)
2. Recommend news story (Action/decision)
3. Observe user's response (Reward)
Goal: Maximize a user engagement metric, e.g. clicks, dwell time, number of visits. Long-term.

Contextual bandits (CBs)

Contextual bandit assumption: the choice of actions at one time-step does not affect future contexts and rewards.
e.g.: Recommendations to one user do not influence another's preferences; long-term effects on the same user are ignored.
Many works: contexts and rewards are drawn i.i.d. from the same distribution at each round.
Repeatedly:
1. Observe user profile (Context/state)
2. Recommend news story (Action/decision)
3. Observe user's response (Reward)
Goal: Maximize a user engagement metric, e.g. clicks, dwell time. Immediate.

Ex: Which advice?

Repeatedly:
1. Observe features of user + advice
2. Choose an advice.
3. Observe steps walked
Goal: Healthy behaviors

Other Real-world Applications

News Rec: [LCLS 10]
Ad Choice: [BPQCCPRSS 12]
Ad Format: [TRSA 13]
Education: [MLLBP 14]
Music Rec: [WWHW 14]
Robotics: [PG 16]
Wellness/Health: [ZKZ 09, SLLSPM 11, NSTWCSM 14, PGCRRH 14, NHS 15, KHSBATM 15, HFKMTY 16]

Why CBs?

Reasonable approximation for many applications: news, ads, education, music recommendation, robotics, health/wellness
Eliminates the need for long-term planning
Still need to explore across actions while generalizing across contexts and obtaining high rewards
Application integration remains challenging

Take-aways

1) Good fit for many real problems

Outline

1) Algs & Theory Overview
   1) Evaluate?
   2) Learn?
   3) Explore?
2) Things that go wrong in practice
3) Systems for going right

Contextual Bandits

Repeatedly:
1. Observe features
2. Choose action based on features
3. Observe reward
Goal: Maximize expected reward
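The loop above can be sketched directly; the toy environment and the uniform-random placeholder policy below are assumptions for illustration, not part of the tutorial.

```python
import random

random.seed(1)
K = 3  # number of available actions

def observe_features():
    """Hypothetical context source: a small feature vector."""
    return [random.random() for _ in range(5)]

def choose_action(x):
    """Placeholder policy: choose uniformly at random among K actions."""
    return random.randrange(K)

def observe_reward(x, a):
    """Hypothetical environment: only the chosen action's reward is revealed."""
    return 1.0 if a == 0 else 0.0

total = 0.0
for t in range(1000):
    x = observe_features()    # 1. observe features
    a = choose_action(x)      # 2. choose action based on features
    r = observe_reward(x, a)  # 3. observe reward (for the chosen action only)
    total += r
print(total / 1000)           # average reward of uniform play, ~1/3 here
```

Note that, unlike supervised learning, the loop never sees the reward of actions it did not take — this is what forces exploration later on.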

Policies

A policy maps features to actions. Policy = classifier that acts.

Exploration

[Diagram, three slide builds: a policy maps each context to an action; randomization is layered on top of the policy's choice to explore.]

Inverse Propensity Score (IPS) [HT 52]

Given logged experience (x_1, a_1, r_1, p_1), ..., (x_n, a_n, r_n, p_n) and a policy π, how good is π?

V̂_IPS(π) = (1/n) Σ_t r_t · 1(π(x_t) = a_t) / p_t

The denominator p_t matches the propensity score: the probability with which the logged action a_t was chosen.
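A minimal sketch of the estimator on simulated logs, assuming tuples (x, a, r, p) recorded under a uniform-random behavior policy; the environment and target policy here are made up.

```python
import random

random.seed(0)

def ips_value(policy, log):
    """IPS estimate: average r/p over rounds where the policy agrees with the log."""
    return sum(r / p for (x, a, r, p) in log if policy(x) == a) / len(log)

# Toy environment: reward 1 exactly when the action matches the context's parity.
def true_reward(x, a):
    return 1.0 if a == x % 2 else 0.0

target = lambda x: x % 2  # policy to evaluate; its true value is 1.0

# Log data with a uniform behavior policy over 2 actions (propensity p = 0.5).
log = []
for _ in range(100000):
    x = random.randrange(10)
    a = random.randrange(2)
    log.append((x, a, true_reward(x, a), 0.5))

print(ips_value(target, log))  # concentrates near the true value 1.0
```

Only the rounds where the target policy agrees with the logged action contribute, and dividing by the propensity reweights them to remove the behavior policy's bias.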

What do we know about IPS?

Theorem: For all policies π and all data distributions, IPS is unbiased: E[V̂_IPS(π)] = V(π).
Proof sketch: For each round t, E[r_t · 1(π(x_t) = a_t) / p_t] = E[r_t(π(x_t))], since the indicator fires exactly with probability p_t.

Reward over time

[Figure: the system's actual online performance over time closely tracks the offline IPS estimate of a baseline's performance.]

Better Evaluation Techniques

Doubly Robust: [DLL 11]
Weighted IPS: [K 92, SJ 15]
Clipping: [BL 08]

Learning from Exploration [Z 03]

Given data {(x, a, r, p)}, how do we maximize policy value?
Maximize the IPS estimate instead!
Equivalent to multiclass classification with importance weight r/p: importance-weighted multiclass classification!
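The reduction can be sketched as below; the per-class weight counter is a deliberately trivial stand-in for a real importance-weighted multiclass learner.

```python
from collections import defaultdict

def to_importance_weighted(log):
    """[Z 03] reduction: (x, a, r, p) -> multiclass example (x, label=a, weight=r/p)."""
    return [(x, a, r / p) for (x, a, r, p) in log]

def train_weighted_majority(examples):
    """Stand-in importance-weighted learner: ignores x and picks the class whose
    total importance weight is largest (a real learner would generalize over x)."""
    totals = defaultdict(float)
    for x, label, w in examples:
        totals[label] += w
    best = max(totals, key=totals.get)
    return lambda x: best

# Toy log: action 1 earned reward more often; behavior policy was uniform over 2 actions.
log = [(0, 0, 0.0, 0.5), (1, 1, 1.0, 0.5), (2, 1, 1.0, 0.5), (3, 0, 1.0, 0.5)]
policy = train_weighted_majority(to_importance_weighted(log))
print(policy(42))  # action 1: total weight 4.0 beats action 0's 2.0
```

Any off-the-shelf classifier that accepts example weights can be dropped in where the weight counter sits.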

Vowpal Wabbit: Online/Fast learning

BSD license, 10-year project
Mailing list >500; GitHub: >1K forks, >4K stars, >1K issues, >100 contributors
Command Line/C++/C#/Python/Java/AzureML/Daemon

VW for Contextual Bandit Learning

echo "1:2:0.5 | here are some features" | vw --cb 2

Format: action:cost:probability | features
Implements several algorithms for policy learning from exploration data.

Better Learning from Exploration Data

Policy Gradient: [W 92]
Offset Tree: [BL 09]
Doubly Robust for learning: [DLL 11]
Importance-Weighted Regression: [BAL 18]
Weighted IPS for learning: [SJ 15]

Evaluating Online Learning

Problem: How do you evaluate an online learning algorithm offline?
Answer: Use Progressive Validation [BKL 99, CCG 04]
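Progressive validation can be sketched as test-then-train: the current model first predicts on each incoming example, then trains on it. The frequency-counting "classifier" and the synthetic stream below are illustrative assumptions.

```python
def progressive_validation(stream):
    """Progressive validation: each example is first used to test, then to train."""
    counts = {}  # label frequencies: a trivial online classifier
    correct = 0
    for x, label in stream:
        pred = max(counts, key=counts.get) if counts else None  # test first
        correct += (pred == label)
        counts[label] = counts.get(label, 0) + 1                # then train
    return correct / len(stream)

# Deterministic stream in which label 1 appears 70% of the time.
stream = [(t, 1 if t % 10 < 7 else 0) for t in range(100)]
print(progressive_validation(stream))  # 0.69: near the 0.7 base rate
```

Every prediction is made on data the model has not yet trained on, which is what makes the PV average an honest offline surrogate for online performance.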

Theorem: Expected PV value = uniform expected policy value.

How do you do Exploration?

Simplest algorithm: ε-greedy.
With probability ε, act uniformly at random.
With probability 1-ε, act greedily using the best policy on all data seen so far.

Better Exploration Algorithms

Ensemble-based: Thompson Sampling: [T 33], EXP4: [ACFS 02]
UCB-based: LinUCB: [CLRS 11, A 02], RegCB: [FADLS 18]

More ensemble-based methods: Epoch Greedy: [LZ 07], Polytime: [DHKKLRZ 11], Cover&Bag: [AHKLLS 14], Bootstrap: [EK 14]

Ensemble-based exploration

Maintain a distribution over policies that look good so far
On a new context, sample a policy and take its action
Avoid choosing obviously bad actions*
Keep policies diverse, to cover all plausibly good actions
Refine the distribution as more data arrives — the explore/exploit tradeoff
Algorithms are robust under minimal assumptions
*Typically rely on some uniform exploration

UCB-based exploration

Maintain confidence intervals for the reward of each action given the context
Choose amongst the non-dominated actions
No uniform exploration
Need to predict expected rewards

Evaluating Exploration Algorithms

Problem: How do you take the choice of examples acquired by an exploration algorithm into account?
IPS does not work. Consider evaluating ε-greedy for different ε values from exploration data: smaller ε → more exploit actions → a better IPS score. This does not capture the effect of exploration on future learning.
Answer: Rejection sample from history. [DELL 12]
Theorem: The realized history is unbiased up to the length observed.
Better versions: [DELL 14] & VW code
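The rejection-sampling evaluator can be sketched as below, assuming the log was collected with uniform-random actions; the ε-greedy learner under evaluation, a per-(context, action) averager, is an illustrative stand-in for any exploration algorithm.

```python
import random

random.seed(4)
K, EPS = 3, 0.1

class EpsGreedy:
    """eps-greedy over per-(context, action) empirical means; a stand-in learner."""
    def __init__(self):
        self.stats = {}  # (x, a) -> (reward sum, count)
    def act(self, x):
        if random.random() < EPS:
            return random.randrange(K)
        known = [(a, s / c) for (xx, a), (s, c) in self.stats.items() if xx == x]
        return max(known, key=lambda t: t[1])[0] if known else 0
    def update(self, x, a, r):
        s, c = self.stats.get((x, a), (0.0, 0))
        self.stats[(x, a)] = (s + r, c + 1)

def replay_eval(alg, log):
    """Rejection sampling [DELL 12]: keep a logged round only if alg picks the
    same action; the kept rounds form an unbiased realized history."""
    kept, reward = 0, 0.0
    for x, a_logged, r in log:
        a = alg.act(x)
        if a == a_logged:  # accept on agreement; otherwise reject the round
            alg.update(x, a, r)
            kept += 1
            reward += r
    return reward / kept, kept

# Uniformly logged data: reward 1 iff the action equals x mod K.
log = []
for _ in range(9000):
    x = random.randrange(5)
    a = random.randrange(K)
    log.append((x, a, 1.0 if a == x % K else 0.0))

avg, kept = replay_eval(EpsGreedy(), log)
print(kept, round(avg, 2))  # ~1/3 of rounds kept; average reward well above 1/3
```

Because the learner only updates on accepted rounds, the realized history looks exactly like one the algorithm would have generated online — including the effect of its exploration on its own future learning, which plain IPS misses.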

Simulating Exploration Algorithms

Use supervised classification datasets to simulate CB problems:
1. Algorithm chooses an action
2. Reveal the reward for this action only (e.g. 1 if the correct class, 0 otherwise)
Low variance; many diverse datasets across problem types
Hard to capture non-stationarity and diverse reward structures
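The simulation can be sketched as follows; the tiny synthetic dataset and the context-ignoring ε-greedy baseline are made-up stand-ins for real datasets and CB learners.

```python
import random

random.seed(5)

def simulate_cb(alg, dataset):
    """Supervised -> bandit simulation: reveal only the chosen action's reward."""
    total = 0.0
    for x, label in dataset:
        a = alg.act(x)                  # 1. algorithm chooses an action
        r = 1.0 if a == label else 0.0  # 2. reveal reward for that action only
        alg.update(x, a, r)
        total += r
    return total / len(dataset)

class EpsGreedyMean:
    """Context-ignoring eps-greedy baseline; a real learner would use x."""
    def __init__(self, k, eps=0.1):
        self.k, self.eps = k, eps
        self.sums, self.counts = [0.0] * k, [0] * k
    def act(self, x):
        if random.random() < self.eps or 0 in self.counts:
            return random.randrange(self.k)  # explore, or warm up untried arms
        return max(range(self.k), key=lambda a: self.sums[a] / self.counts[a])
    def update(self, x, a, r):
        self.sums[a] += r
        self.counts[a] += 1

# Tiny made-up 3-class dataset: class 2 is the majority class (75%).
dataset = [(i, 2) for i in range(600)] + [(i, 0) for i in range(200)]
random.shuffle(dataset)
acc = simulate_cb(EpsGreedyMean(3), dataset)
print(acc)  # beats uniform play's ~1/3 by homing in on the majority class
```

The same harness accepts any multiclass dataset and any act/update learner, which is how large-scale comparisons of exploration algorithms are run.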

Simulating Exploration Algorithms

[BAL 18]: Study using ~500 multiclass classification datasets from OpenML
The UCB-based approach of [FADLS 18] wins overall
Greedy(!) and Cover [AHKLLS 14] are also close
Also evaluates the impact of reductions, reward encodings, etc.

More Details!

NIPS tutorial: http://hunch.net/~jl/interact.pdf
John's Spring 2017 Cornell Tech class (http://hunch.net/~mltf) with slides and recordings
Alekh's Fall 2017 Columbia class (http://alekhagarwal.net/bandits_and_rl/) with lecture notes

Take-aways

1) Good fit for many problems
2) Fundamental questions have useful answers
