More LSH Application: Entity Resolution Application: Similar News Articles Distance Measures LS Families of Hash Functions Jeffrey D. Ullman Stanford University LSH for Cosine Distance

Entity Resolution Similarity of Records A Simple Bucketing Process Validating the Results Entity Resolution The entity-resolution problem is to examine a collection of records and determine which refer to the same entity. Entities could be people, events, etc. Typically, we want to merge records if their

values in corresponding fields are similar. 3 Matching Customer Records I once took a consulting job solving the following problem: Company A agreed to solicit customers for Company B, for a fee. They then argued over how many customers. Neither recorded exactly which customers were involved.

4 Customer Records (2) Each company had about 1 million records describing customers that might have been sent from A to B. Records had name, address, and phone, but for various reasons, they could be different for the same person. E.g., misspellings, but there are many sources of error.

5 Customer Records (3) Problem: (1 million)2 is too many pairs of records to score. Solution: A simple LSH. Three hash functions: exact values of name, address, phone. Compare iff records are identical in at least one. Misses similar records with a small differences in all three fields.

6 Customer Records (4) Design a measure (score ) of how similar records are: E.g., deduct points for small misspellings (Jeffrey vs. Jeffery) or same phone with different area code. Score all pairs of records that the LSH scheme identified as candidates; report high scores as matches.

7 Aside: Hashing Names, Etc. Problem: How do we hash strings such as names so there is one bucket for each string? Answer: Sort the strings instead. Another option was to use a few million buckets, and deal with buckets that contain several different strings. 8

Aside: Validation of Results We were able to tell what values of the scoring function were reliable in an interesting way. Identical records had an average creation-date difference of 10 days. We only looked for records created within 90 days of each other, so bogus matches had a 45day average difference in creation dates. 9 Validation (2) By looking at the pool of matches with a fixed

score, we could compute the average timedifference, say x, and deduce that fraction (45x)/35 of them were valid matches. Alas, the lawyers didnt think the jury would understand. 10 Validation Generalized Any field not used in the LSH could have been used to validate, provided corresponding values were closer for true matches than false. Example: if records had a height field, we would expect true matches to be close, false matches

to have the average difference for random people. 11 Similar News Articles A New Way of Shingling Bucketing by Length Application: Same News Article The Political-Science Dept. at Stanford asked a team from CS to help them with the problem of

identifying duplicate, on-line news articles. Problem: the same article, say from the Associated Press, appears on the Web site of many newspapers, but looks quite different. 13 News Articles (2) Each newspaper surrounds the text of the article with: Its own logo and text. Ads. Perhaps links to other articles.

A newspaper may also crop the article (delete parts). 14 News Articles (3) The team came up with its own solution, that included shingling, but not minhashing or LSH. A special way of shingling that appears quite good for this application. LSH substitute: candidates are articles of similar

length. 15 Enter LSH I told them the story of minhashing + LSH. They implemented it and found it faster for similarities below 80%. Aside: Thats no surprise. When the similarity threshold is high, there are better methods see Sect. 3.9 of MMDS and/or YouTube videos 8-4, 8-5, and 8-6.

16 Enter LSH (2) Their first attempt at minhashing was very inefficient. They were unaware of the importance of doing the minhashing row-by-row. Since their data was column-by-column, they needed to sort once before minhashing. 17

Specialized Shingling Technique The team observed that news articles have a lot of stop words, while ads do not. Buy Sudzo vs. I recommend that you buy Sudzo for your laundry. They defined a shingle to be a stop word and the next two following words. 18 Why it Works

By requiring each shingle to have a stop word, they biased the mapping from documents to shingles so it picked more shingles from the article than from the ads. Pages with the same article, but different ads, have higher Jaccard similarity than those with the same ads, different articles. 19 Distance Measures Triangle Inequality Euclidean Distance

Cosine Distance Jaccard Distance Edit Distance Distance Measures Generalized LSH is based on some kind of distance between points. Similar points are close.

Example: Jaccard similarity is not a distance; 1 minus Jaccard similarity is. 21 Axioms of a Distance Measure d is a distance measure if it is a function from pairs of points to real numbers such that: 1. 2.

3. 4. d(x,y) > 0. d(x,y) = 0 iff x = y. d(x,y) = d(y,x). d(x,y) < d(x,z) + d(z,y) (triangle inequality). 22 Some Euclidean Distances L2 norm: d(x,y) = square root of the sum of the

squares of the differences between x and y in each dimension. The most common notion of distance. L1 norm: sum of the differences in each dimension. Manhattan distance = distance if you had to travel along coordinates only. 23 Examples of Euclidean Distances L2-norm:

dist(a,b) = (42+32) =5 b = (9,8) 5 4 a = (5,5) 3 L1-norm: dist(a,b) =

4+3 = 7 24 Question For Thought People have defined Lr norms for any r, even fractional r. What do these norms look like as r gets larger? What if r approaches 0? 25 Some Non-Euclidean

Distances Jaccard distance for sets = 1 minus Jaccard similarity. Cosine distance for vectors = angle between the vectors. Edit distance for strings = number of inserts and deletes to change one string into another. 26 Example: Jaccard Distance Consider x = {1,2,3,4} and y = {1,3,5}

Size of intersection = 2; size of union = 5, Jaccard similarity (not distance) = 2/5. d(x,y) = 1 (Jaccard similarity) = 3/5. 27 Why J.D. Is a Distance Measure d(x,y) > 0 because |xy| < |xy|. Thus, similarity < 1 and distance = 1 similarity > 0. d(x,x) = 0 because xx = xx. And if x y, then |xy| is strictly less than |

xy|, so sim(x,y) < 1; thus d(x,y) > 0. d(x,y) = d(y,x) because union and intersection are symmetric. d(x,y) < d(x,z) + d(z,y) trickier next slide. 28 Triangle Inequality for J.D. d(x,z) d(z,y)

d(x,y) 1 - |x z| + 1 - |y z| > 1 -|x y| |x z| |y z| |x y| Remember: |a b|/|a b| = probability that minhash(a) = minhash(b). Thus, 1 - |a b|/|a b| = probability that minhash(a) minhash(b). Need to show: prob[minhash(x) minhash(y)] < prob[minhash(x) minhash(z)] + prob[minhash(z) minhash(y)] 29

Proof Whenever minhash(x) minhash(y), at least one of minhash(x) minhash(z) and minhash(z) minhash(y) must be true. minhash(x) minhash(y minhash(x) minhash(z) minhash(z) minhash(y) 30

Cosine Distance Think of a point as a vector from the origin [0,0,,0] to its location. Two points vectors make an angle, whose cosine is the normalized dot-product of the vectors: p1.p2/|p2||p1|. Example: p1 = [1,0,2,-2,0]; p2 = [0,0,3,0,0]. p1.p2 = 6; |p1| = |p2| = 9 = 3. cos() = 6/9; is about 48 degrees. 31 Edit Distance

The edit distance of two strings is the number of inserts and deletes of characters needed to turn one into the other. An equivalent definition: d(x,y) = |x| + |y| - 2| LCS(x,y)|. LCS = longest common subsequence = any longest string obtained both by deleting from x and deleting from y. 32 Example: Edit Distance x = abcde ; y = bcduve.

Turn x into y by deleting a, then inserting u and v after d. Edit distance = 3. Or, computing edit distance through the LCS, note that LCS(x,y) = bcde. Then:|x| + |y| - 2|LCS(x,y)| = 5 + 6 2*4 = 3 = edit distance. Question for thought: An example of two strings with two different LCSs? Hint: let one string be ab. 33

LSH Families of Hash Functions Definition Combining hash functions Making steep S-Curves Hash Functions Decide Equality There is a subtlety about what a hash function is, in the context of LSH families. A hash function h really takes two elements x and y, and returns a decision whether x and y are candidates for comparison.

Example: the family of minhash functions computes minhash values and says yes iff they are the same. Shorthand: h(x) = h(y) means h says yes for pair of elements x and y. 35 LSH Families Defined Suppose we have a space S of points with a distance measure d. A family H of hash functions is said to be (d1,d2,p1,p2)-sensitive if for any x and y in S:

1. If d(x,y) < d1, then the probability over all h in H, that h(x) = h(y) is at least p1. 2. If d(x,y) > d2, then the probability over all h in H, that h(x) = h(y) is at most p2. 36 LS Families: Illustration p1 p2 High probability;

at least p1 Low probability; at most p2 ??? d1 d2 37

Example: LS Family Let: S = subsets of some universal set, d = Jaccard distance, H formed from the minhash functions for all permutations of the universal set. Then Prob[h(x)=h(y)] = 1-d(x,y). Restates theorem about Jaccard similarity and minhashing in terms of Jaccard distance. 38

Example: LS Family (2) Claim: H is a (1/3, 3/4, 2/3, 1/4)-sensitive family for S and d. Then probability If distance > 3/4 that minhash values (so similarity < 1/4) agree is < 1/4 If distance < 1/3 (so similarity > 2/3) Then probability

that minhash values agree is > 2/3 For Jaccard similarity, minhashing gives us a (d1,d2,(1-d1),(1-d2))-sensitive family for any d1 < d2. 39 Amplifying an LSHFamily The bands technique we learned for signature matrices carries over to this more general setting. Goal: the S-curve effect seen there. AND construction like rows in a band.

OR construction like many bands. 40 AND of Hash Functions Given family H, construct family H whose members each consist of r functions from H. For h = {h1,,hr} in H, h(x)=h(y) if and only if hi(x)=hi(y) for all i. Theorem: If H is (d1,d2,p1,p2)-sensitive, then H is (d1,d2,(p1)r,(p2)r)-sensitive. Proof: Use fact that hi s are independent. Also lowers probability

for small distances (Bad) Lowers probability for large distances (Good) 41 OR of Hash Functions Given family H, construct family H whose members each consist of b functions from H. For h = {h1,,hb} in H, h(x)=h(y) if and only if hi(x)=hi(y) for some i. Theorem: If H is (d1,d2,p1,p2)-sensitive, then

H is (d1,d2,1-(1-p1)b,1-(1-p2)b)-sensitive. Raises probability for small distances (Good) Raises probability for large distances (Bad) 42 Combine AND and OR Constructions By choosing b and r correctly, we can make the lower probability approach 0 while the higher

approaches 1. As for the signature matrix, we can use the AND construction followed by the OR construction. Or vice-versa. Or any sequence of ANDs and ORs alternating. 43 AND-OR Composition Each of the two probabilities p is transformed into 1-(1-pr)b. The S-curve studied before. Example: Take H and construct H by the AND

construction with r = 4. Then, from H, construct H by the OR construction with b = 4. 44 Table for Function 1(1-p4)4 p .2 .3 .4 .5 .6 .7

.8 .9 1-(1-p4)4 .0064 .0320 .0985 .2275 .4260 .6666 .8785 .9860 Example: Transforms a

(.2,.8,.8,.2)-sensitive family into a (.2,.8,.8785,.0064)sensitive family. 45 OR-AND Composition Each of the two probabilities p is transformed into (1-(1-p)b)r. The same S-curve, mirrored horizontally and vertically. Example: Take H and construct H by the OR

construction with b = 4. Then, from H, construct H by the AND construction with r = 4. 46 Table for Function (14 4 (1-p) ) p .1 .2 .3 .4 .5

.6 .7 .8 (1-(1-p)4)4 .0140 .1215 .3334 .5740 .7725 .9015 .9680 .9936

Example: Transforms a (.2,.8,.8,.2)-sensitive family into a (.2,.8,.9936,.1215)sensitive family. 47 Cascading Constructions Example: Apply the (4,4) OR-AND construction followed by the (4,4) AND-OR construction. Transforms a (.2,.8,.8,.2)-sensitive family into a (.2,.8,.9999996,.0008715)-sensitive family.

48 General Use of S-Curves For each AND-OR S-curve 1-(1-pr)b, there is a threshold t, for which 1-(1-tr)b = t. Above t, high probabilities are increased; below t, low probabilities are decreased. You improve the sensitivity as long as the low probability is less than t, and the high probability is greater than t. Iterate as you like. Similar observation for the OR-AND type of S-

curve: (1-(1-p)b)r. 49 Visualization of Threshold Probability Is raised Threshold t Probability Is lowered

p t 50 An LSH Family for Cosine Distance Random Hyperplanes Sketches (Signatures) Random Hyperplanes (1) For cosine distance, there is a technique

analogous to minhashing for generating a (d1,d2,(1-d1/180),(1-d2/180))-sensitive family for any d1 and d2. Called random hyperplanes. 52 Random Hyperplanes (2) Each vector v determines a hash function hv with two buckets. hv(x) = +1 if v.x > 0; hv(x) = -1 if v.x < 0. LS-family H = set of all functions derived from

any vector v. Claim: Prob[h(x)=h(y)] = 1 (angle between x and y divided by 180). 53 Proof of Claim Look in the plane of x and y. Note: what is important is that hyperplane is outside the angle,

v not that the vector is inside. x Hyperplanes for which h(x) = h(y) Hyperplanes (normal to v ) for which h(x) h(y) y

Prob[Red case] = /180 54 Signatures for Cosine Distance Pick some number of vectors, and hash your data for each vector. The result is a signature (sketch) of +1s and 1s that can be used for LSH like the minhash signatures for Jaccard distance. But you dont have to think this way. The existence of the LSH-family is sufficient for

amplification by AND/OR. 55 Simplification We need not pick from among all possible vectors v to form a component of a sketch. It suffices to consider only vectors v consisting of +1 and 1 components. 56