Real World Interactive Learning

Alekh Agarwal and John Langford
KDD Tutorial, August 19, 2018
Slides and full references at http://hunch.net/~rwil/kdd2018.htm

The Supervised Learning Paradigm

[Figure: handwritten digit images (training examples) and their training labels feed a Supervised Learner, producing an accurate digit classifier]

Supervised Learning is cool

How about news?
Repeatedly:
1. Observe features of user + articles
2. Choose a news article.
3. Observe click-or-not
Goal: Maximize fraction of clicks

A standard pipeline

1. Collect information.
2. Build.
3. Learn.
4. Act: Deploy in A/B test for 2 weeks.
5. A/B test fails.
6. Why?

Q: What goes wrong?
Is childproofing interesting to Alekh?
A: Need Right Signal for Right Answer

Q: What goes wrong?
[Figure: "Model value over time": fraction of value retained by a model trained on Day 1 when evaluated on Day 1, Day 2, and Day 3; the retained value decays day by day]
A: The world changes!

Interactive Learning

[Diagram: the interactive learning loop. Features (user history, news stories) feed Learning, which selects an Action (a news story), which produces a Consequence (click-or-not) that feeds back into learning]

Q: Why Interactive Learning?
A: $$$. Use free interaction data rather than expensive labels.

Q: Why Interactive Learning?

AI: A function programmed with data
AI: An economically viable digital agent that explores, learns, and acts

Flavors of Interactive Learning
- Full Reinforcement Learning: special domains. +Right signal, -nonstationarity bad, -$$$, +AI
- Contextual Bandits (immediate-reward RL): rightish signal, +nonstationarity ok, +$$$, +AI
- Active Learning: choose examples to label. -Wrong signal, -nonstationarity bad, +$$$, -not AI

What is RL?
Repeatedly:
1. Observe user profile (context/state)
2. Recommend news story (action/decision)
3. Observe user's response (reward)

Goal: Maximize a user engagement metric, e.g. clicks, dwell time, number of visits. Long-term.

Contextual bandits (CBs)
Repeatedly:
1. Observe user profile (context/state)
2. Recommend news story (action/decision)
3. Observe user's response (reward)
Goal: Maximize a user engagement metric, e.g. clicks, dwell time, number of visits. Long-term.

Contextual bandits (CBs)
Contextual bandit assumption: the choice of actions at one time step does not affect future contexts and rewards.
- E.g.: recommendations to one user do not influence another's preferences
- Ignores long-term effects on the same user
- Many works additionally assume the (context, rewards) pairs are drawn i.i.d. from the same distribution at each round
Repeatedly:
1. Observe user profile (context/state)
2. Recommend news story (action/decision)
3. Observe user's response (reward)
Goal: Maximize a user engagement metric, e.g. clicks, dwell time. Immediate.

Ex: Which advice?
Repeatedly:
1. Observe features of user + advice
2. Choose advice.
3. Observe steps walked
Goal: Healthy behaviors

Other Real-world Applications
- News Rec: [LCLS 10]
- Ad Choice: [BPQCCPRSS 12]
- Ad Format: [TRSA 13]
- Education: [MLLBP 14]
- Music Rec: [WWHW 14]
- Robotics: [PG 16]
- Wellness/Health: [ZKZ 09, SLLSPM 11, NSTWCSM 14, PGCRRH 14, NHS 15, KHSBATM 15, HFKMTY 16]

Why CBs?
- Reasonable approximation for many applications: news, ads, education, music recommendation, robotics, health/wellness
- Eliminates the need for long-term planning
- Still need to explore across actions while generalizing across contexts and obtaining high rewards
- Application integration remains challenging

Take-aways
1) Good fit for many real problems

Outline
1) Algs & Theory Overview
   1) Evaluate?
   2) Learn?
   3) Explore?
2) Things that go wrong in practice
3) Systems for going right

Contextual Bandits
Repeatedly:
1. Observe features
2. Choose action based on features
3. Observe reward
Goal: Maximize expected reward
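A minimal Python sketch of this protocol; the `env` and `agent` objects and their methods (`observe_features`, `act`, `reward`, `update`) are hypothetical interfaces for illustration, not any particular library's API. Logging the probability with which each action was chosen is what enables the offline evaluation techniques covered next.

```python
def contextual_bandit_loop(agent, env, rounds=10000):
    """Run the repeated CB protocol, logging data for later offline evaluation.

    agent.act(x) returns the chosen action together with the probability it
    was chosen with; recording that propensity makes the logged data usable
    for evaluating other policies offline.
    """
    log = []
    for _ in range(rounds):
        x = env.observe_features()   # 1. observe features
        a, p = agent.act(x)          # 2. choose action based on features
        r = env.reward(x, a)         # 3. observe reward, for the chosen action only
        agent.update(x, a, p, r)
        log.append((x, a, p, r))
    return log
```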

Policies
A policy maps features to actions.
Policy = classifier that acts.

Exploration
[Diagram build-up: a policy maps each context to an action; injecting randomization means the logged action sometimes differs from the policy's choice]

Inverse Propensity Score (IPS) [HT 52]
Given experience $(x, a, p, r)$ and a policy $\pi$, how good is $\pi$?

$\hat{V}_{\mathrm{IPS}}(\pi) = \frac{1}{n} \sum_{(x,a,p,r)} \frac{r \cdot \mathbf{1}[\pi(x) = a]}{p}$

The indicator keeps the records where $\pi$'s action matches the logged one; dividing by $p$, the propensity score, corrects for the logging policy.

What do we know about IPS?
Theorem: For all policies $\pi$ and all distributions over $(x, \vec{r})$, $\mathbb{E}[\hat{V}_{\mathrm{IPS}}(\pi)] = V(\pi)$.
Proof: For all $(x, \vec{r})$, $\mathbb{E}_{a \sim p}\left[\frac{r_a \mathbf{1}[\pi(x)=a]}{p_a}\right] = \sum_a p_a \cdot \frac{r_a \mathbf{1}[\pi(x)=a]}{p_a} = r_{\pi(x)}$.
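The estimator translates directly into code. A minimal sketch, assuming the log is stored as (context, action, propensity, reward) tuples and `pi` is a callable policy:

```python
def ips_value(pi, log):
    """Unbiased IPS estimate of pi's value from logged (x, a, p, r) tuples.

    Each record holds the context x, the logged action a, the probability p
    with which the logger chose a (the propensity score), and the reward r.
    """
    total = 0.0
    for x, a, p, r in log:
        if pi(x) == a:        # only records matching pi's choice contribute...
            total += r / p    # ...reweighted by the inverse propensity
    return total / len(log)
```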

Reward over time
[Figure: the system's actual online performance over time compared with an offline IPS estimate of a baseline's performance]

Better Evaluation Techniques
- Doubly Robust: [DLL 11]
- Weighted IPS: [K 92, SJ 15]

- Clipping: [BL 08]

Learning from Exploration [Z 03]
Given data $\{(x, a, p, r)\}$, how do we maximize $V(\pi)$?
Maximize $\hat{V}_{\mathrm{IPS}}(\pi)$ instead!
Equivalent to multiclass classification with label $a$ and importance weight $r/p$: importance-weighted multiclass classification!
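A sketch of this reduction under the same assumed log format: each logged tuple becomes a multiclass example whose label is the logged action and whose importance weight is r/p, ready for any importance-weighted classifier.

```python
def to_importance_weighted_multiclass(log):
    """Reduce CB learning to importance-weighted multiclass classification [Z 03].

    A policy collects IPS value r/p exactly on the records where it matches
    the logged action, so maximizing the IPS estimate equals minimizing
    importance-weighted classification error with label a and weight r/p.
    Zero-reward records carry zero weight (assuming non-negative rewards).
    """
    return [(x, a, r / p) for (x, a, p, r) in log if r > 0]
```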

Vowpal Wabbit: Online/Fast learning
- BSD license, 10-year project
- Mailing list > 500; GitHub > 1K forks, > 4K stars, > 1K issues, > 100 contributors
- Interfaces: command line / C++ / C# / Python / Java / AzureML / daemon

VW for Contextual Bandit Learning

echo "1:2:0.5 | here are some features" | vw --cb 2
Format: action:cost:probability | features
Implements several algorithms for policy learning from exploration data
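Producing `--cb` input lines from a log is mechanical. A small helper following the format above; note that `--cb` takes a cost (lower is better), so a reward-based log would negate rewards first, an assumption about the encoding rather than something VW does for you.

```python
def to_vw_cb_line(action, cost, probability, features):
    """Format one example as VW --cb input: action:cost:probability | features."""
    return f"{action}:{cost}:{probability} | {' '.join(features)}"

# Reproduces the example above:
#   to_vw_cb_line(1, 2, 0.5, ["here", "are", "some", "features"])
#   -> '1:2:0.5 | here are some features'
```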

Better Learning from Exploration Data
- Policy Gradient: [W 92]
- Offset Tree: [BL 09]
- Doubly Robust for learning: [DLL 11]
- Importance Weighted Regression: [BAL 18]
- Weighted IPS for learning: [SJ 15]

Evaluating Online Learning
Problem: How do you evaluate an online learning algorithm offline?
Answer: Use Progressive Validation [BKL 99, CCG 04]
Theorem: 1) Expected PV value = uniform expected policy value.
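A minimal sketch of progressive validation, assuming a hypothetical online learner with `predict`/`learn` methods: each example is scored before the learner trains on it, so every example serves as a one-shot holdout while training still uses all the data.

```python
def progressive_validation(learner, examples,
                           loss=lambda y_hat, y: float(y_hat != y)):
    """Average online loss: score each example before training on it."""
    total = 0.0
    for x, y in examples:
        total += loss(learner.predict(x), y)  # x is still unseen here
        learner.learn(x, y)                   # only now train on it
    return total / len(examples)
```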

How do you do Exploration?
Simplest algorithm: ε-greedy.
- With probability ε: act uniformly at random.
- With probability 1 - ε: act greedily using the best policy on all data seen so far.
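A sketch of the ε-greedy choice rule; `best_policy` stands in for whatever learner is retrained on the data so far. Returning the propensity alongside the action is what makes the logged data usable by IPS-style evaluation and learning.

```python
import random

def epsilon_greedy_act(best_policy, num_actions, epsilon, x):
    """Choose an action for context x and return it with its propensity."""
    greedy = best_policy(x)                 # best policy on all data seen so far
    if random.random() < epsilon:
        a = random.randrange(num_actions)   # explore: uniform over actions
    else:
        a = greedy                          # exploit
    p = epsilon / num_actions + (1 - epsilon) * (a == greedy)
    return a, p
```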

Better Exploration Algorithms
- Ensemble-based: Thompson Sampling [T 33], EXP4 [ACFS 02], Epoch Greedy [LZ 07], Polytime [DHKKLRZ 11], Cover&Bag [AHKLLS 14], Bootstrap [EK 14]
- UCB-based: LinUCB [CLRS 11, A 02], RegCB [FADLS 18]

Ensemble-based exploration
- Maintain a distribution over policies that look good so far
- On a new context, sample a policy and take its action
- Avoid choosing obviously bad actions*
- Diverse policies cover all plausibly good actions
- Refine the distribution as we see more data: the explore/exploit tradeoff
- Algorithms are robust with minimal assumptions
*Typically still rely on some uniform exploration

UCB-based exploration
- Maintain confidence intervals for the reward of each action given the context
- Choose amongst the non-dominated actions
- No uniform exploration
- Need to predict expected rewards

Evaluating Exploration Algorithms
Problem: How do you take the choice of examples acquired by an exploration algorithm into account?
IPS does not work: consider evaluating ε-greedy for different ε values on exploration data. A smaller ε means more exploit actions and a better IPS score, which does not capture the effect of exploration on future learning.

Evaluating Exploration Algorithms
Problem: How do you take the choice of examples acquired by an exploration algorithm into account?
Answer: Rejection sample from history. [DELL 12]
Theorem: The realized history is unbiased up to the length observed.
Better versions: [DELL 14] & VW code
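A minimal sketch of the replay idea in the simplest setting of uniformly random logging (the [DELL 12] method handles general logging distributions via rejection sampling); the algorithm's `choose`/`update` interface is hypothetical.

```python
def replay_evaluate(algorithm, uniform_log):
    """Evaluate an exploration algorithm on uniformly-logged (x, a, r) records.

    Rounds where the algorithm picks a different action than the log are
    rejected; the accepted rounds form an unbiased realized history up to
    the length observed.
    """
    rewards = []
    for x, a, r in uniform_log:
        if algorithm.choose(x) == a:      # accept only on a match
            rewards.append(r)
            algorithm.update(x, a, r)     # it learns only from accepted rounds
    return sum(rewards) / max(len(rewards), 1), len(rewards)
```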

Simulating Exploration Algorithms
Use supervised classification datasets to simulate CB problems:
1. The algorithm chooses an action
2. Reveal the reward for this action only (e.g. 1 if it is the correct class, 0 otherwise)
Pros: low variance; many diverse datasets across problem types
Cons: hard to capture non-stationarity and diverse reward structures
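A sketch of this simulation with the same hypothetical `choose`/`update` interface; the algorithm never sees the true label, only the bandit reward for the class it picked.

```python
def simulate_bandit_from_multiclass(algorithm, dataset):
    """Turn a supervised multiclass dataset into a simulated CB problem."""
    total_reward = 0
    for x, label in dataset:
        a = algorithm.choose(x)        # 1. algorithm chooses an action (a class)
        r = 1 if a == label else 0     # 2. reveal reward for this action only
        algorithm.update(x, a, r)
        total_reward += r
    return total_reward / len(dataset)
```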

Simulating Exploration Algorithms
[BAL 18]: a study using ~500 multiclass classification datasets from OpenML
- The UCB-based approach of [FADLS 18] wins overall
- Greedy(!) and Cover [AHKLLS 14] are also close
- Also evaluates the impact of reductions, reward encodings, etc.

More Details!
- NIPS tutorial: http://hunch.net/~jl/interact.pdf
- John's Spring 2017 Cornell Tech class (http://hunch.net/~mltf) with slides and recordings

- Alekh's Fall 2017 Columbia class (http://alekhagarwal.net/bandits_and_rl/) with lecture notes

Take-aways
1) Good fit for many real problems
2) Fundamental questions have useful answers
