Pairwise Document Similarty in Large Collection with MapReduce

Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective Tamer Elsayed, Jimmy Lin, and Douglas W. Oard iSchool, Cloud Computing Class Talk, Oct 6th 2008 1 Overview Abstract Problem Trivial Solution MapReduce Solution Efficiency Tricks Identity Resolution in Email Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 2 Abstract Problem

0.20 0.30 0.54 ~~~~~~~~~~ 0.21 ~~~~~~~~~~ 0.00 ~~~~~~~~~~ 0.34 ~~~~~~~~~~ 0.34 0.13 0.74 0.20 0.30 ~~~~~~~~~~ 0.54 ~~~~~~~~~~ 0.21 ~~~~~~~~~~ 0.00 ~~~~~~~~~~ 0.34 0.34 0.13 0.74 0.20 0.30

~~~~~~~~~~ 0.54 ~~~~~~~~~~ 0.21 ~~~~~~~~~~ 0.00 ~~~~~~~~~~ 0.34 0.34 0.13 0.74 0.20 ~~~~~~~~~~ 0.30 ~~~~~~~~~~ 0.54 ~~~~~~~~~~ 0.21 ~~~~~~~~~~ 0.00 0.34 0.34 0.13 0.74 0.20 ~~~~~~~~~~

0.30 ~~~~~~~~~~ 0.54 ~~~~~~~~~~ 0.21 ~~~~~~~~~~ 0.00 0.34 0.34 0.13 0.74 Applications: Clustering Coreference resolution more-like-that queries Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 3 Similarity of Documents

Simple inner product Cosine similarity Term weights di dj Standard problem in IR tf-idf, BM25, etc. Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 4 Trivial Solution load each vector o(N) times load each term o(dft2) times Goal scalable and efficient solution for large collections

Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 5 Better Solution Each term contributes only if appears in Load weights for each term once Each term contributes o(dft2) partial scores Allows efficiency tricks Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 6 Decomposition MapReduce Each term contributes only if appears in reduce index map

Load weights for each term once Each term contributes o(dft2) partial scores Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 7 MapReduce Framework (b) Shuffle (a) Map (k1, v1) input input input input [k2, v2] map map map (c) Reduce (k2, [v2]) Shuffling group values

by: [keys] [(k3, v3)] reduce output reduce output reduce output map handles low-level details transparently Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 8 Standard Indexing (a) Map doc doc

doc doc (b) Shuffle tokenize tokenize tokenize Shuffling group values by: terms (c) Reduce combine posting list combine posting list combine posting list tokenize

Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 9 Indexing (3-doc toy collection) Clinton Clinton Obama Obama Clinton Clinton Clinton 1 2 1 Cheney Clinton Cheney 1 Indexing Barack

1 Clinton Clinton Barack Barack Obama Obama Obama Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 1 1 10 Pairwise Similarity (a) Generate pairs Clinton 1 2 (b) Group pairs (c) Sum pairs

2 1 2 Cheney 1 1 2 2 1 3 Barack 1 1 Obama 1 1

1 Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 11 Pairwise Similarity (b) Group pairs (a) Generate pairs (abstract) term postings multiply term postings multiply term postings multiply term postings multiply

Shuffling group values by: pairs (c) Sum pairs sum similarity sum similarity sum similarity Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 12 Experimental Setup Elsayed, Lin, and Oard, ACL 2008 0.16.0

Cluster of 19 machines Each w/ two processors (single core) Aquaint-2 collection Open source MapReduce implementation 906K documents Okapi BM25 Subsets of collection Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 13 Efficiency (disk space)

Intermediate Pairs (billions) Aquaint-2 Collection, ~ 906k docs 9,000 8 trillion intermediate pairs 8,000 7,000 6,000 5,000 4,000 3,000 2,000 1,000 0 0 10 20 30 40 50 60

70 80 90 100 Corpus Size (%) Hadoop, 19 PCs, each: 2 single-core processors, 4GB memory, 100GB disk Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 14 Terms: Zipfian Distribution each term t contributes o(df t 2 ) partial results doc freq (df) very few terms dominate the computations most frequent term (said) 3%

most frequent 10 terms 15% most frequent 100 terms 57% most frequent 1000 terms 95% ~0.1% of total terms (99.9% df-cut) term rank Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 15 Efficiency (disk space) Intermediate Pairs (billions) Aquaint-2 Collection, ~ 906k doc 9,000 8,000 8 trillion intermediate pairs no df-cut df-cut at 99.999% df-cut at 99.99% df-cut at 99.9% df-cut at 99% 7,000 6,000

5,000 4,000 3,000 0.5 trillion intermediate pairs 2,000 1,000 0 0 10 20 30 40 50 60 70 80 90

100 Corpus Size (%) Hadoop, 19 PCs, each w/: 2 single-core processors, 4GB memory, 100GB disk Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 16 Effectiveness (recent Effect of df-cut on effectiveness work) Medline04 - 909k abstracts- Ad-hoc retrieval 100 Relative P5 (%) 95 90 Drop 0.1% of terms Near-Linear Growth Fit on disk Cost 2% in Effectiveness 85 80 75 70

65 60 55 50 99.00 99.10 99.20 99.30 99.40 99.50 99.60 99.70 99.80 99.90 100.00 df-cut (%) Hadoop, 19 PCs, each w/: 2 single-core processors, 4GB memory, 100GB disk Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 17

Implementation Issues BM25s Similarity Model tf1 tf1 * dl1 dl1 tf1 0.5 1.5 tf1 0.5 1.5 avg dl avg dl N df 0.5 * log df 0.5 TF, IDF Document length DF-Cut

Build a histogram Pick the absolute df for the % df-cut Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 18 Other Approximation Techniques ? Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 19 Other Approximation Techniques (2) Absolute df Consider only terms that appear in at least n (or %) documents An absolute lower bound on df, instead of just removing the % most-frequent terms

Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 20 Other Approximation Techniques (3) tf-Cut Consider only documents (in posting list) with tf > T ; T=1 or 2 OR: Consider only the top N documents based on tf for each term Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 21 Other Approximation Techniques (4) Similarity Threshold Consider only partial scores > SimT Computing Pairwise Document Similarity in Large Collections: A

MapReduce Perspective 22 Other Approximation Techniques: (5) Ranked List Keep only the most similar N documents In the reduce phase Good for ad-hoc retrieval and more-like this queries Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 23 Space-Saving Tricks (1) Stripes Stripes instead of pairs

Group by doc-id not pairs 1 2 1 2 Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 24 Space-Saving Tricks (2) Blocking No need to generate the whole matrix at once Generate different blocks of the matrix at different steps limit the max space required for intermediate results Similarity Matrix Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 25

Identity Resolution in Email Topical Similarity Social Similarity Joint Resolution of Mentions Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 26 Basic Problem Date: Wed Dec 20 08:57:00 EST 2000 From: Kay Mann To: Suzanne Adams Subject: Re: GE Conference Call has be rescheduled Did Sheila want Scott to participate? Looks like the WHO? WHO? call will be too late for him. Computing Pairwise Document Similarity in Large Collections: A

MapReduce Perspective 27 Enron Collection 55 Sheilas !! Message-ID: <[email protected]> Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT) From: [email protected] To: [email protected] weisman maynes jarnot Subject: RE: Shhhh.... it's a SURPRISE ! pardo nacey kirby X-From: Sager, Elizabeth glover ferrarini knudsen richX-To: '[email protected]@ENRON' dey boehringer Hi Shari macleod jones lutz

breeden glover Hope all ishoward well. Count me in for the group present. huckaby darling wollam See ya next week if not earlier tweed watson jortner Liza mcintyre perlick neylon Elizabeth Sager chadwick advani whanger 713-853-6349 birmingham hester nagel

-----Original Message----kahanek kenner graves From: [email protected]@ENRON foraker lewis mclaughlin Sent: Monday, July 30, 2001 2:24 PM To: Sager, walton Elizabeth; Murphy, Harlan; [email protected]; tasman venville [email protected] fisher whitman rappazzo Cc: [email protected] petitt miller ! Subject: berggren Shhhh.... it's a SURPRISE Dombo osowski swatek Please call me (713) 207-5233 Robbins

kelly hollis Thanks! chang Rank Rank Candidates Candidates Shari Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 28 Generative Model 1. Choose person person c to mention p(c) 2. Choose appropriate context context X to mention c p(X | c) 3.

Choose a mention mention l p(l | X, c) sheila Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective GE conference call 29 3-Step Solution (1) Identity Modeling (2) Context Reconstruction (3) Mention Resolution Posterior Distribution Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 30

Contextual Space Topical Context Conversational Context Local Context Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 31 Topical Context Date: Wed Dec 20 08:57:00 EST 2000 From: Kay Mann To: Suzanne Adams Subject: Re: GE GEConference Call has be rescheduled Sheila want Scott to participate? Looks like the call Did Sheila call will be too late for him. Date: Fri Dec 15 05:33:00 EST 2000 From: [email protected] To: vince j kaminski Cc: sheila walton [email protected]

Subject: Re: Grant Masson Great news. Lets get this moving along. Sheila, Sheila can you work out GE GEletter? Vince, I am in London Monday/Tuesday, back Weds late. I'll ask Sheila to fix this for you and if you need me call call me on my cell phone. Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 32 Contextual Space Social Context Topical Context Conversational Context Local Context Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 33 Social Context Date: Wed Dec 20 08:57:00 EST 2000 From: Kay Mann [email protected]

To: Suzanne Adams Subject: Re: GE Conference Call has be rescheduled Did Sheila want Scott to participate? Looks like the call will be too late for him. Date: Tue, 19 Dec 2000 07:07:00 -0800 (PST) From: [email protected] To: [email protected] [email protected] Subject: ESA Option Execution Kay Can you initial the ESA assignment and assumption agreement or should I ask Sheila Sheila Tweed Tweed to do it? I believe she is currently en route from Portland. Thanks, Rebecca Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 34 Contextual Space (mentions) Sheila Tweed [email protected] social social Sheila Walton

Sheila topical topical social sheila Sheila topical conversational sg Joint Resolution of Mentions Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 35 Topical Expansion Each email is a document Index all (bodies of) emails

remove all signature and salutation lines Use temporal constraints Need an email-to-date/time mapping Check for each pair of documents Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 36 Social Expansion Can we use the same technique? For each email: list of participating email addresses comprises the document MessageID: 3563 Date: Wed Dec 20 08:57:00 EST 2000 From: Kay Mann To: Suzanne Adams Subject: Re: GE Conference Call has be rescheduled Did Sheila want Scott to participate? Looks like the call will be too late for him.

2563 [email protected] [email protected] Index the new social documents and apply same topical expansion process Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 37 Social Similarity Models Intersection size Jaccard Coefficent Boolean All given temporal constraints Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective

38 Joint Resolution Sheila Tweed [email protected] social social Sheila Walton Sheila topical topical social sheila Sheila topical conversational sg Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective

39 Joint Resolution Mention Graph Spread Current Resolution Combine Context Info Update Resolution Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 40 Joint Resolution Mention Graph map Work in Progress! shuffle reduce

MapReduce! Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 41 System Design Threads Emails Identity Models Mention Recognition Conv. Expansion Local Expansion Topical Expansion Social Expansion Context-Free Resolution

Conv. Context Local Context Topical Context Social Context Context-Free Resolution Mentions Merging Contexts Prior Resolution Joint Resolution Posterior Computing Pairwise Document Similarity in Large Collections: A Resolution MapReduce Perspective 42

Iterative Joint Resolution Input: Context Graph + Prior Resolution Mapper Consider one mention Takes: 1. 2. Spread context info and prior resolution to all mentions in context Reducer Consider one mention Takes: 1. 2.

out-edges and context info prior resolution in-edges and context info prior resolution Compute posterior resolution Multiple Iterations Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 43 Conclusion Simple and efficient MapReduce solution applied to both topical and social expansion in

Identity Resolution in Email different tricks for approximation Shuffling is critical df-cut controls efficiency vs. effectiveness tradeoff 99.9% df-cut achieves 98% relative accuracy Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 44 Thank You! Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 45 Algorithm Matrix must fit in memory

Works for small collections Otherwise: disk access optimization Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 46

Recently Viewed Presentations

  • Department of Justice - CRI/TA -Reform Update

    Department of Justice - CRI/TA -Reform Update

    Not on My Watch program continues . ... Seeks to increase female success rate at PAT and Academy, which tend to be upper body strength heavy. In 2016, the unit held 243 different recruiting events across the country. ... CRI/TA...
  • PowerPoint Slides for Starting Out With C++ Eearly Objects ...

    PowerPoint Slides for Starting Out With C++ Eearly Objects ...

    PowerPoint Slides for Starting Out With C++ Eearly Objects Eighth Edition Author: ... String Operators string Member Functions 3.9 Using C-Strings C-String Input C-String Initialization vs. Assignment C-String and Keyboard Input Conversions Between Numbers and Strings Conversion Classes ...
  • Transaction Processing Systems (TPS)

    Transaction Processing Systems (TPS)

    REA Diagram for the Revenue Cycle Revenue Cycle Business Activities What are the four basic revenue cycle business activities? Sales order entry Shipping Billing and accounts receivable Cash collections 1. Sales Order Entry This step includes all the activities involved...
  • UNDERSTANDING THE BIBLE 1. 2. 3. 4. 5.

    UNDERSTANDING THE BIBLE 1. 2. 3. 4. 5.

    (Luke 24:25, 26 ESV) … everything written about me in the Law of Moses and the Prophets and the Psalms must be fulfilled." Then he opened their minds to understand the Scriptures, and said to them, "Thus it is written,...
  • Large Molecule Small Molecule - RIC | RIC

    Large Molecule Small Molecule - RIC | RIC

    Discussion. The results indicated that CMTPPB may have bound with the DNA. The evidence for this was: A negative energy on HyperChem. On the Cary Spectrophotometer, there was a shift in the melting curve of the DNA and CMTPPB sample,...
  • Central Idea vs. Theme - robeson.k12.nc.us

    Central Idea vs. Theme - robeson.k12.nc.us

    They are common topics or big ideas on which themes are centered. * Courage * Hope * Honesty * Hard work * Love * Acceptance * Kindness * Jealousy * Fears * Being yourself * Doing the right thing Common...
  • Aim:

    Aim:

    Mini-lesson What was the Neolithic Revolution? From the time of the first hominid (6-7 million years ago), until 8,000 BCE was the Paleolithic (Old Stone Age). In 8,000 BCE the Neolithic Revolution changed the way humans live around the world....
  • Navigation - Chapter 2: Compass

    Navigation - Chapter 2: Compass

    The MONUMENT on NASHAWENA ISLAND can be seen off the starboard bow (045R) while on heading 25lC at 9:30 a.m. Plot the bearing and label it. 0930 276 6-7 . On another day at 6:25 a.m., while running C262C at...