Real World Interactive Learning
Alekh Agarwal, John Langford
KDD Tutorial, August 19
Slides and full references at http://hunch.net/~rwil/kdd2018.htm

The Supervised Learning Paradigm
[Figure: training examples (images of handwritten digits) and their training labels are fed to a Supervised Learner, which outputs an accurate digit classifier]
Supervised Learning is cool

How about news?
Repeatedly:
1. Observe features of user + articles
2. Choose a news article.
3. Observe click-or-not
Goal: Maximize the fraction of clicks

A standard pipeline:
1. Collect information.
2. Build.
3. Learn.
4. Act: Deploy in A/B test for 2 weeks.
5. A/B test fails.
6. Why?

Q: What goes wrong?
Is childproofing interesting to Alekh?
A: Need the Right Signal for the Right Answer

Q: What goes wrong?
[Plot: "Model value over time", showing the fraction of value retained by a Day 1 model when evaluated on Day 1, Day 2, and Day 3]
A: The world changes!

Interactive Learning
[Diagram: the interactive learning loop. Features (user history, news stories) feed into Learning; Learning selects an Action (selected news story); the Consequence (click-or-not) feeds back into Learning]

Q: Why Interactive Learning?
A: $$$ Use free interaction data rather than expensive labels

Q: Why Interactive Learning?
AI: A function programmed with data
AI: An economically viable digital agent that explores, learns, and acts

Flavors of Interactive Learning
Full Reinforcement Learning: special domains. +Right signal, -Nonstationarity bad, -$$$, +AI
Contextual Bandits: immediate-reward RL. Rightish signal, +Nonstationarity ok, +$$$, +AI
Active Learning: choose examples to label. -Wrong signal, -Nonstationarity bad, +$$$, -Not AI

What is RL?
Repeatedly:
1. Observe user profile (context/state)
2. Recommend news story (action/decision)
3. Observe user's response (reward)
Goal: Maximize a user engagement metric, e.g. clicks, dwell time, number of visits (long-term)

Contextual bandits (CBs)
Repeatedly:
1. Observe user profile (context/state)
2. Recommend news story (action/decision)
3. Observe user's response (reward)
Goal: Maximize a user engagement metric, e.g. clicks, dwell time, number of visits (long-term)

Contextual bandits (CBs)
Contextual bandit assumption: the choice of actions at one time-step does not affect future contexts and rewards.
e.g.: Recommendations to one user do not influence another user's preferences; long-term effects on the same user are ignored.
Many works assume contexts and rewards are drawn i.i.d. from the same distribution at each round.
Repeatedly:
1. Observe user profile (context/state)
2. Recommend news story (action/decision)
3. Observe user's response (reward)
Goal: Maximize a user engagement metric, e.g. clicks, dwell time (immediate)

Ex: Which advice?
Repeatedly:
1. Observe features of user + advice
2. Choose a piece of advice.
3. Observe steps walked
Goal: Healthy behaviors
Why CBs?
Reasonable approximation for many applications: news, ads, education, music recommendation, robotics, health/wellness.
Eliminates the need for long-term planning.
Still need to explore across actions while generalizing across contexts and obtaining high rewards.
Application integration remains challenging.

Take-aways
1) Good fit for many real problems

Outline
1) Algs & Theory Overview
   1) Evaluate?
   2) Learn?
   3) Explore?
2) Things that go wrong in practice
3) Systems for going right

Contextual Bandits
Repeatedly:
1. Observe features
2. Choose action based on features
3. Observe reward
Goal: Maximize expected reward
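A minimal sketch of the loop above, assuming a log of (context, action, probability, reward) tuples; stochastic_policy, observe_features, and observe_reward are hypothetical placeholders for the application's own hooks, not part of the tutorial:

    def cb_loop(stochastic_policy, observe_features, observe_reward, num_rounds):
        """Sketch of the contextual-bandit interaction loop.

        stochastic_policy(x) returns the chosen action and the probability
        with which it was chosen; observe_features / observe_reward stand in
        for the application's own hooks."""
        logged = []
        for _ in range(num_rounds):
            x = observe_features()          # 1. observe features
            a, p = stochastic_policy(x)     # 2. choose action based on features (randomized)
            r = observe_reward(a)           # 3. observe reward for the chosen action only
            logged.append((x, a, p, r))     # keep (context, action, probability, reward)
        return logged

Logging the action probability p alongside the reward is what makes the offline evaluation and learning techniques below possible.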
Policies
A policy maps features to actions.
Policy = classifier that acts.

Exploration
[Diagrams: a policy choosing actions on its own, then the same policy combined with randomization over actions]
Inverse Propensity Score (IPS) [HT 52]
Given experience $(x_1, a_1, p_1, r_1), \ldots, (x_n, a_n, p_n, r_n)$ and a policy $\pi$, how good is $\pi$?
$\hat{V}_{IPS}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{r_i \, \mathbf{1}[\pi(x_i) = a_i]}{p_i}$
The indicator checks whether $\pi$ matches the logged action; $p_i$ is the propensity score of that action.

What do we know about IPS?
Theorem: For all policies $\pi$ and all distributions over contexts and rewards, $\mathbb{E}[\hat{V}_{IPS}(\pi)] = V(\pi)$.
Proof: For each round $i$, with the logged action $a_i$ drawn from the logging distribution $q_i(\cdot)$ (so $p_i = q_i(a_i)$),
$\mathbb{E}_{a_i \sim q_i}\!\left[\frac{r_i(a_i)\,\mathbf{1}[\pi(x_i)=a_i]}{q_i(a_i)}\right] = \sum_a q_i(a)\,\frac{r_i(a)\,\mathbf{1}[\pi(x_i)=a]}{q_i(a)} = r_i(\pi(x_i)).$
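A direct sketch of the IPS estimate over logged (context, action, probability, reward) tuples; the log format matches the loop sketched earlier and the policy is any function from context to action:

    def ips_value(policy, logged):
        """Unbiased IPS estimate of a policy's value from exploration data."""
        total = 0.0
        for x, a, p, r in logged:
            if policy(x) == a:      # only rounds where the policy matches the logged action count
                total += r / p      # reweight by the inverse propensity
        return total / len(logged)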
Reward over time
[Plot: the system's actual online performance alongside the offline estimate of a baseline's performance]

Better Evaluation Techniques
Doubly Robust: [DLL 11]
Weighted IPS: [K 92, SJ 15]
Clipping: [BL 08]
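A sketch of a doubly robust estimator in the spirit of [DLL 11], combining a learned reward model with the IPS correction; reward_model(x, a), an estimate of the expected reward, is an assumption of this sketch rather than something specified on the slide:

    def dr_value(policy, logged, reward_model):
        """Doubly robust estimate of a (deterministic) policy's value."""
        total = 0.0
        for x, a, p, r in logged:
            pi_a = policy(x)
            estimate = reward_model(x, pi_a)              # model-based part
            if pi_a == a:
                estimate += (r - reward_model(x, a)) / p  # IPS correction on matched rounds
            total += estimate
        return total / len(logged)

The estimate stays unbiased whenever either the reward model or the logged propensities are correct, which is where the "doubly robust" name comes from.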
Learning from Exploration [Z 03]
Given data $(x_i, a_i, p_i, r_i)$, how do we maximize $V(\pi)$?
Maximize $\hat{V}_{IPS}(\pi)$ instead!
Equivalent to: multiclass classification with label $a_i$ and importance weight $r_i / p_i$.
Importance-weighted multiclass classification!
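A sketch of this reduction: each logged round becomes a multiclass example whose label is the logged action and whose importance weight is reward / probability, after which any importance-weighted (cost-sensitive) classifier can be trained to approximately maximize the IPS value:

    def to_importance_weighted_multiclass(logged):
        """Reduce CB policy learning to importance-weighted classification."""
        # (features, label, importance weight) triples
        return [(x, a, r / p) for x, a, p, r in logged]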
Learning
echo "1:2:0.5 | here are some features" | vw --cb 2
Format: action:cost:probability | features
(Here action 1 was chosen with probability 0.5 and incurred cost 2; --cb 2 declares 2 possible actions.)
Implements several algorithms for policy learning from exploration data.
Better Learning from Exploration Data
Policy Gradient: [W 92]
Offset Tree: [BL 09]
Doubly Robust for learning: [DLL 11]
Importance Weighted Regression: [BAL 18]
Weighted IPS for learning: [SJ 15]

Evaluating Online Learning
Problem: How do you evaluate an online learning algorithm offline?
Answer: Use Progressive Validation. [BKL 99, CCG 04]
Theorem: 1) Expected PV value = uniform expected policy value.
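A sketch of progressive validation: each example is first scored with the current model and only then used for an update, so the running average behaves like held-out performance while every example is still used for training. The learner.predict / learner.learn interface and the score function are hypothetical placeholders:

    def progressive_validation(learner, examples, score):
        """Average score of predictions made before training on each example."""
        total, n = 0.0, 0
        for x, y in examples:
            total += score(learner.predict(x), y)  # evaluate with the model so far
            learner.learn(x, y)                    # then update on the example
            n += 1
        return total / n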
How do you do Exploration?
Simplest algorithm: ε-greedy.
With probability ε, act uniformly at random.
With probability 1 - ε, act greedily using the best policy on all data seen so far.
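A sketch of the ε-greedy choice, also returning the probability of the realized action so the round can be logged for IPS-style evaluation and learning:

    import random

    def epsilon_greedy_action(best_policy, x, num_actions, epsilon=0.1):
        """Explore uniformly with probability epsilon, otherwise exploit."""
        greedy = best_policy(x)
        a = random.randrange(num_actions) if random.random() < epsilon else greedy
        # probability of the realized action under the full epsilon-greedy rule
        p = (1 - epsilon) + epsilon / num_actions if a == greedy else epsilon / num_actions
        return a, p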
Better Exploration Algorithms
Ensemble-based: Thompson Sampling [T 33], EXP4 [ACFS 02], Epoch Greedy [LZ 07], Polytime [DHKKLRZ 11], Cover&Bag [AHKLLS 14], Bootstrap [EK 14]
UCB-based: LinUCB [CLRS 11, A 02], RegCB [FADLS 18]

Ensemble-based exploration
Maintain a distribution over policies that look good so far.
On a new context, sample a policy and take its action.
Avoid choosing obviously bad actions.*
Diverse policies cover all plausibly good actions.
Refine the distribution as more data arrives: explore/exploit tradeoff.
Algorithms are robust with minimal assumptions.
*Typically rely on some uniform exploration.
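A sketch of the ensemble idea: keep a weighted set of plausibly good policies, sample one per context, and record the total probability of the played action for later off-policy work; the policies list and weights are assumed inputs maintained elsewhere:

    import random

    def ensemble_action(policies, weights, x):
        """Sample a policy from the current distribution and play its action."""
        pi = random.choices(policies, weights=weights, k=1)[0]
        a = pi(x)
        # logging probability of a: total weight of policies that would pick it
        p = sum(w for pol, w in zip(policies, weights) if pol(x) == a)
        return a, p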
UCB-based exploration
Maintain confidence intervals for the reward of actions given the context.
Choose amongst non-dominated actions.
No uniform exploration.
Need to predict expected rewards.
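A sketch of the action choice given per-action confidence intervals; how the intervals themselves are maintained depends on the reward predictor and is not shown here:

    def ucb_choose(bounds):
        """Pick optimistically among non-dominated actions.

        bounds maps each action to a (lower, upper) confidence interval for
        its expected reward given the current context."""
        best_lower = max(lo for lo, hi in bounds.values())
        # an action is dominated if its upper bound is below the best lower bound
        candidates = {a: hi for a, (lo, hi) in bounds.items() if hi >= best_lower}
        return max(candidates, key=candidates.get)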
Evaluating Exploration Algorithms
Problem: How do you take into account the choice of examples acquired by an exploration algorithm?
IPS does not work. Consider evaluating ε-greedy for different ε values from exploration data: a smaller ε means more exploit actions and a better IPS score, but this does not capture the effect of exploration on future learning.

Evaluating Exploration Algorithms
Problem: How do you take into account the choice of examples acquired by an exploration algorithm?
Answer: Rejection sample from the history. [DELL 12]
Theorem: The realized history is unbiased up to the length observed.
Better versions: [DELL 14] & VW code
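A sketch of the rejection-sampling idea, assuming a log collected with uniform-random actions and a hypothetical algorithm.act / algorithm.update interface:

    def replay_evaluate(algorithm, uniform_log):
        """Keep only the rounds where the algorithm matches the logged action."""
        history_rewards = []
        for x, a_logged, r in uniform_log:
            a = algorithm.act(x)
            if a == a_logged:              # accept the round as part of the realized history
                algorithm.update(x, a, r)  # the algorithm learns from the revealed reward
                history_rewards.append(r)
            # otherwise reject the round without revealing the reward
        return history_rewards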
Simulating Exploration Algorithms
Use supervised classification datasets to simulate CB problems:
1. Algorithm chooses an action.
2. Reveal the reward for this action only (e.g. 1 if it is the correct class, 0 otherwise).
Low variance, many diverse datasets across problem types.
Hard to capture non-stationarity and diverse reward structures.
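A sketch of the simulation, again assuming a hypothetical algorithm.act / algorithm.update interface:

    def simulate_bandit(algorithm, dataset):
        """Run a CB algorithm on a multiclass dataset, revealing one reward per round."""
        total = 0.0
        for x, true_class in dataset:
            a = algorithm.act(x)
            r = 1.0 if a == true_class else 0.0  # reward derived from the hidden label
            algorithm.update(x, a, r)            # only this one reward is revealed
            total += r
        return total / len(dataset)              # average reward over the run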
Simulating Exploration Algorithms
[BAL 18]: Study using ~500 multiclass classification datasets from OpenML.
The UCB-based approach of [FADLS 18] wins overall; Greedy(!) and Cover [AHKLLS 14] are also close.
Also evaluates the impact of reductions, reward encodings, etc.

More Details!
NIPS tutorial: http://hunch.net/~jl/interact.pdf
John's Spring 2017 Cornell Tech class (http://hunch.net/~mltf) with slides and recordings
Alekh's Fall 2017 Columbia class (http://alekhagarwal.net/bandits_and_rl/) with lecture notes

Take-aways
1) Good fit for many problems
2) Fundamental questions have useful answers