Corpora and Statistical Methods Lecture 12

Corpora and Statistical Methods
Albert Gatt

Part 1 Semantic similarity and the vector space model

Synonymy
Different phonological/orthographic words with highly related meanings: sofa / couch, boy / lad. Traditional definition: w1 is synonymous with w2 if w1 can replace w2 in a sentence, salva veritate. Is this ever the case? Can we substitute one word for another and keep the sentence's meaning unchanged?

The importance of text genre & register
With near-synonyms, there are often register-governed conditions of use, e.g. naive vs gullible vs ingenuous:
You're so bloody gullible []
[] outside on the pavement trying to entice gullible idiots in []
You're so ingenuous. You tackle things the wrong way.
The commentator's ingenuous query could just as well have been prompted []
However, it is ingenuous to suppose that peace process []
(source: BNC)

Synonymy vs. Similarity The contextual theory of synonymy: based on the work of Wittgenstein (1953), and Firth (1957) You shall know a word by the company it keeps (Firth 1957) Under this view, perfect synonyms might not exist. But words can be judged as highly similar if people put them into the same linguistic contexts, and judge the change to be slight. Synonymy vs. similarity: example Miller & Charles 1991:

Weak contextual hypothesis: The similarity of the contexts in which 2 words appear contributes to the semantic similarity of those words. E.g. snake is similar to [resp. a synonym of] serpent to the extent that we find snake and serpent in the same linguistic contexts. It is much more likely that snake/serpent will occur in similar contexts than snake/toad. NB: this is not a discrete notion of synonymy, but a continuous definition of similarity.

The Miller/Charles

experiment
Subjects were given sentences with missing words and asked to place words they felt were OK in each context. Method to compare words A and B:
find sentences containing A
find sentences containing B
delete A and B from the sentences and shuffle them
ask people to choose which sentences to place A and B in
Results: people tend to put similar words in the same context, and this is highly correlated with occurrence in similar contexts in corpora.

Issues with similarity
Similar is a much broader concept than

synonymous:
Contextually related, though differing in meaning: man / woman, boy / girl, master / pupil
Contextually related, but with opposite meanings: big / small, clever / stupid

Uses of similarity
Assumption: semantically similar words

behave in similar ways.
Information retrieval: query expansion with related terms.
K nearest neighbours, e.g.:
given: a set of elements, each assigned to some topic
task: classify an unknown word w by topic
method: find the topic that is most prevalent among w's semantic neighbours

Common approaches
Vector-space approaches: represent word w as a vector containing the words (or other features) in the context of w; compare the vectors of w1, w2; various vector-distance measures are available.
Information-theoretic measures:

w1 is similar to w2 to the extent that knowing about w1 increases my knowledge (decreases my uncertainty) about w2.

Part 2 Vector-space models

Basic data structure
Matrix M, where $M_{ij}$ = no. of times $w_i$ co-occurs with $w_j$ (in some window). We can also have a document × word matrix. We can treat matrix cells as boolean: if $M_{ij} > 0$, then $w_i$ co-occurs with $w_j$, else it does not.
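To make the data structure concrete, here is a minimal sketch of building such a word-by-word co-occurrence matrix; the two-sentence toy corpus and the ±2-token window are illustrative assumptions, not part of the original slides:

```python
from collections import defaultdict

def cooccurrence_matrix(sentences, window=2):
    """Count how often each pair of words co-occurs within +/- `window` tokens."""
    M = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, w in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    M[w][tokens[j]] += 1
    return M

# Invented toy corpus
corpus = [
    "the snake slid through the grass".split(),
    "the serpent slid through the grass".split(),
]
M = cooccurrence_matrix(corpus)
print(dict(M["snake"]))                         # {'the': 1, 'slid': 1, 'through': 1}
print(M["snake"].keys() & M["serpent"].keys())  # contexts shared by snake and serpent
```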

Distance measures
Many measures take a set-theoretic perspective. Vectors can be binary (indicating co-occurrence or not) or real-valued (indicating frequency, or probability); similarity is a function of what two vectors have in common.

Classic similarity/distance measures (boolean vectors, i.e. sets, vs. real-valued vectors)

Dice coefficient
Boolean (set) version: $\mathrm{Dice}(W, V) = \frac{2\,|W \cap V|}{|W| + |V|}$
Real-valued version: $\mathrm{Dice}(\vec{w}, \vec{v}) = \frac{2 \sum_{i=1}^{n} \min(w_i, v_i)}{\sum_{i=1}^{n} w_i + \sum_{i=1}^{n} v_i}$

Jaccard coefficient
Boolean (set) version: $\mathrm{Jaccard}(W, V) = \frac{|W \cap V|}{|W \cup V|}$
Real-valued version: $\mathrm{Jaccard}(\vec{w}, \vec{v}) = \frac{\sum_{i=1}^{n} \min(w_i, v_i)}{\sum_{i=1}^{n} \max(w_i, v_i)}$

Dice vs. Jaccard
Dice(car, truck) on the boolean matrix: $(2 \times 2)/(4 + 2) = 0.66$
Jaccard on the boolean matrix: $2/4 = 0.5$
Dice is more generous; Jaccard penalises lack of overlap more.

Classic similarity/distance measures (boolean vectors vs. real-valued vectors)
Cosine similarity (= the angle between 2 vectors):
$\cos(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\sum_{i=1}^{n} v_i w_i}{\sqrt{\sum_{i=1}^{n} v_i^2}\,\sqrt{\sum_{i=1}^{n} w_i^2}}$
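As an illustration, a small sketch of these three measures over vectors; the car/truck vectors are invented so that the boolean counts match the worked example above:

```python
import math

def dice(v, w):
    """Real-valued Dice: 2 * sum of minima over the total mass of both vectors."""
    return 2 * sum(min(a, b) for a, b in zip(v, w)) / (sum(v) + sum(w))

def jaccard(v, w):
    """Real-valued Jaccard: sum of minima over sum of maxima."""
    return (sum(min(a, b) for a, b in zip(v, w))
            / sum(max(a, b) for a, b in zip(v, w)))

def cosine(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w)))

# Boolean co-occurrence vectors chosen to match the worked example above
car   = [1, 1, 1, 1, 0, 0]   # co-occurs with 4 context words
truck = [0, 1, 1, 0, 0, 0]   # co-occurs with 2, both shared with car
print(dice(car, truck))      # 2*2 / (4+2) = 0.666...
print(jaccard(car, truck))   # 2/4 = 0.5
print(cosine(car, truck))    # 2 / (2 * sqrt(2)) ≈ 0.707
```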

Part 3 Probabilistic approaches

Turning counts into probabilities
Divide each cell by its row total, e.g. P(spacewalking | cosmonaut) = 0.5, P(red | car) = 0.25. NB: this transforms each row into a probability distribution corresponding to a word.

Probabilistic measures of distance
KL-divergence: treat w1 as an approximation of w2:
$D(v \parallel w) = \sum_{x} P(x \mid v) \log \frac{P(x \mid v)}{P(x \mid w)}$
Problems:
asymmetric, $D(p \parallel q) \neq D(q \parallel p)$, so not so useful for word-word similarity
if the denominator $P(x \mid w) = 0$, then $D(v \parallel w)$ is undefined

Probabilistic measures of distance
Information radius (aka Jensen-Shannon divergence): compares the total divergence between p and q to the average of p and q; symmetric!

$\mathrm{IRad}(v, w) = D\!\left(v \,\middle\|\, \frac{v + w}{2}\right) + D\!\left(w \,\middle\|\, \frac{v + w}{2}\right)$
Dagan et al. (1997) showed this measure to be superior to KL-divergence when applied to distributional word similarity.
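A small sketch of both measures, assuming p and q are two rows of the probability matrix above (the actual values here are invented):

```python
import math

def kl(p, q):
    """D(p || q) = sum_x p(x) * log2(p(x) / q(x)); undefined if q(x) = 0 where p(x) > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

def irad(p, q):
    """Information radius: D(p || avg) + D(q || avg), where avg is the mean of p and q."""
    avg = [(px + qx) / 2 for px, qx in zip(p, q)]
    return kl(p, avg) + kl(q, avg)

# Invented context distributions for two words (rows of the probability matrix)
p = [0.5, 0.25, 0.25, 0.0]
q = [0.4, 0.4, 0.1, 0.1]
print(kl(p, q))                # fine: q is non-zero wherever p is
# kl(q, p) would fail here: p(x) = 0 for an outcome where q(x) > 0
print(irad(p, q), irad(q, p))  # symmetric and always defined
```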

Some characteristics of vector-space measures
1. Very simple conceptually;
2. Flexible: can represent similarity based on document co-occurrence, word co-occurrence, etc.;
3. Vectors can be arbitrarily large, representing wide context windows;
4. Can be expanded to take into account grammatical relations (e.g. head-modifier, verb-argument, etc.).

Grammar-informed methods: Lin (1998) Intuition: The similarity of any two things (words, documents, people, plants) is a function of the information gained by having: a joint description of a and b in terms of what they have in common compared to describing a and b separately E.g. do we gain more by a joint description of:

apple and chair (both THINGS)
apple and banana (both FRUIT: more specific)

Lin's definition, cont'd
Essentially, we compare the information content of the joint description to the information content of the separate descriptions:
$sim(a, b) = \frac{\text{information content of joint description}}{\text{information content of separate descriptions}}$
NB: essentially mutual information!

An application to corpora

From a corpus-based point of view, what do words have in common? Context, obviously. How do we define context? Either just bag-of-words (typical of vector-space models), or something more grammatically sophisticated.

Kilgarriff's (2003) application
Definition of the notion of context, following Lin: define F(w) as the set of grammatical contexts in which w occurs. A context is a triple <rel, w, w'>:
rel is a grammatical relation
w is the word of interest
w' is the other word in rel

Grammatical relations can be obtained using a dependency parser.
[Grammatical co-occurrence matrix and example triples for w = cell; source: Jurafsky & Martin (2009), after Lin (1998)]

Observe that each triple f consists of the relation r, the second word in the relation w', and the word of interest w. We can now compute the level of association between the word w and each of its triples f:
$I(w, f) = \log_2 \frac{P(w, f)}{P(w)\,P(r \mid w)\,P(w' \mid w)}$
This is an information-theoretic measure that was proposed as a generalisation of the idea of pointwise mutual information.

Calculating similarity

Given that we have grammatical triples for our words of interest, the similarity of w1 and w2 is a function of:
the triples they have in common
the triples that are unique to each
$sim_{Lin}(w_1, w_2) = \frac{2 \times I(F(w_1) \cap F(w_2))}{I(F(w_1)) + I(F(w_2))}$
I.e. the mutual information of what the two words have in common, divided by the sum of the mutual information of what each word has.
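A toy sketch of Lin-style similarity, assuming we already have association scores I(w, f) for each word's grammatical contexts; the triples and scores below are invented, and the numerator sums both words' scores over the shared contexts, which roughly corresponds to the 2 × I(F(w1) ∩ F(w2)) term above:

```python
def sim_lin(contexts_w1, contexts_w2):
    """Lin-style similarity: association mass of shared contexts over total association mass.
    Each argument maps a (relation, other-word) context to its association score I(w, f)."""
    shared = contexts_w1.keys() & contexts_w2.keys()
    shared_info = sum(contexts_w1[f] + contexts_w2[f] for f in shared)
    total_info = sum(contexts_w1.values()) + sum(contexts_w2.values())
    return shared_info / total_info

# Invented association scores for (relation, other-word) contexts
master = {("subj-of", "read"): 1.2, ("subj-of", "know"): 0.9,
          ("modifier", "good"): 0.7, ("subj-of", "ask"): 1.1}
pupil  = {("subj-of", "read"): 1.0, ("subj-of", "know"): 0.8,
          ("modifier", "good"): 0.6, ("pp_at", "school"): 1.4}
print(sim_lin(master, pupil))   # ≈ 0.68: most of the association mass is shared
```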

Sample results: master & pupil
common: Subject-of: read, sit, know; Modifier: good, form; Possession: interest
master only: Subject-of: ask; Modifier: past (cf. past master)
pupil only: Subject-of: make, find; PP_at-p: school

Concrete implementation
The online SketchEngine gives grammatical relations of words, plus a thesaurus which rates words by similarity to a head word. This is based on the Lin (1998) model.

Limitations (or characteristics)
Only applicable as a measure of similarity between words of the same category: it makes no sense to compare the grammatical relations of words of different categories.
Does not distinguish between near-synonyms and similar words: student ~ pupil, master ~ pupil.
MI is sensitive to low frequency: a relation which occurs only once in the corpus can come out as highly significant.

Part 4 Applications of vector-space models to information retrieval Information retrieval Basic problem definition: Store a (very large) collection of documents Document = Newspaper articles, encyclopedia entries, medline abstracts, html pages... Given a user query (some set of words), retrieve the documents that are most relevant to that query. Most IR systems take a bag of words approach: Document = the words it contains

No syntactic information or higher-order semantic information.

IR architecture
Basic representation
Same as for semantic similarity, except that we use a document-by-term (= word) matrix. A document d is represented as a vector whose cells contain term weights:
$\vec{d_j} = (w_{1,j}, w_{2,j}, \ldots, w_{n,j})$

Example document representation
Fried eggplant recipe: Place the flour, egg, and bread crumbs each in 3 small bowls. Add the 1/2 teaspoon of salt to the egg and whisk to combine. Season the bread crumbs with the tablespoon of Essence and stir with a fork or your hands to thoroughly combine. Dredge each piece of eggplant in the flour, coating thoroughly and then shaking to remove any excess flour. Coat each piece with the egg, then dredge in the bread crumb mixture, pressing to make the bread crumbs adhere. Transfer the eggplant pieces to a rack or to paper towels to let them dry slightly before frying. In a deep, heavy skillet heat 1/2-inch of vegetable oil to 375 degrees F. Fry the eggplant pieces, in batches if necessary, for about 1 minute on each side, or until golden brown. Transfer with tongs to paper towels to drain. Sprinkle lightly with salt before serving. Serve with marinara sauce, or as desired.

Document representation dj:
  flour        3
  egg          3
  bread crumb  4
  eggplant     3
The term weights are just simple frequency counts in the document (for now).

Example query representation
Document representation dj (as above): flour 3, egg 3, bread crumb 4, eggplant 3.
Suppose the user types: egg and breadcrumb. The query could be represented as:
$\vec{q} = (0, 1, 1, 0)$

More generally
Let d1 be the eggplant recipe, and d2 be a fried chicken recipe.

       flour  egg  breadcrumb  eggplant  chicken
  d1     3     3       4          3         0
  d2     2     3       3          0         2

User query: $\vec{q} = (0, 1, 1, 0, 0)$
Note: intuitively, this query should match both docs (both contain egg and breadcrumb). Which doc would the query fried chicken match?
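A minimal sketch of this matching, using the raw counts from the table above and plain cosine similarity; the query vectors are the ones just described ("fried" is absent from the toy vocabulary):

```python
import math

terms = ["flour", "egg", "breadcrumb", "eggplant", "chicken"]
d1 = [3, 3, 4, 3, 0]   # eggplant recipe
d2 = [2, 3, 3, 0, 2]   # fried chicken recipe

def cosine(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w)))

q_egg_bc  = [0, 1, 1, 0, 0]   # "egg and breadcrumb"
q_chicken = [0, 0, 0, 0, 1]   # "fried chicken" ("fried" is not in this toy vocabulary)

print(cosine(q_egg_bc, d1), cosine(q_egg_bc, d2))    # ≈ 0.75 and ≈ 0.83: both docs match
print(cosine(q_chicken, d1), cosine(q_chicken, d2))  # 0.0 and ≈ 0.39: only d2 matches
```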

Query processing
We can use the same model as we used for computing word similarity to compute the degree of match between a query and a doc, e.g. compute the cosine similarity between the query vector and the document vector. Documents can then be ranked by their similarity to the query.

Term weighting
So far, the intuition has been that frequent terms in a document capture the basic meaning of the document. Another intuition: terms that crop up in only a few documents are more discriminatory.

       flour  egg  breadcrumb  eggplant  chicken
  d1     3     3       4          3         0
  d2     2     3       3          0         2

Inverse document frequency (IDF)
A way of giving a higher weight to more discriminative words:
$idf_i = \log \frac{N}{n_i}$
where N = the number of docs in the collection and $n_i$ = the number of documents containing term i. We combine IDF with TF (the term frequency):
$w_{i,j} = tf_{i,j} \times idf_i$

TF-IDF
tf: as in the table above.
idf: flour 0, egg 0, breadcrumb 0, eggplant 0.30, chicken 0.30.

TF-IDF weighting
Properties: weights terms higher if they are frequent in a document AND rare in the document collection as a whole.

Modified similarity for query/document retrieval:
$sim(\vec{q}, \vec{d}) = \frac{\sum_{w \in q, d} tf_{w,q} \; tf_{w,d} \; (idf_w)^2}{\sqrt{\sum_{q_i \in q} (tf_{q_i,q} \, idf_{q_i})^2} \; \sqrt{\sum_{d_i \in d} (tf_{d_i,d} \, idf_{d_i})^2}}$
We only take into account the words actually in the query.
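A small sketch of TF-IDF retrieval on the toy collection above, assuming base-10 logs (which reproduce the idf values of 0.30 in the table) and a single-term query chosen for illustration; the helper names are not from any particular IR library:

```python
import math

terms = ["flour", "egg", "breadcrumb", "eggplant", "chicken"]
docs = {"d1": [3, 3, 4, 3, 0],    # eggplant recipe (term frequencies)
        "d2": [2, 3, 3, 0, 2]}    # fried chicken recipe

N = len(docs)
n = [sum(1 for tf in docs.values() if tf[i] > 0) for i in range(len(terms))]  # doc frequency per term
idf = [math.log10(N / ni) if ni else 0.0 for ni in n]  # [0, 0, 0, 0.30, 0.30] as in the table

def tfidf(tf):
    return [f * w for f, w in zip(tf, idf)]

def cosine(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w))
    return dot / norm if norm else 0.0

query_tf = [0, 0, 0, 1, 0]        # query: "eggplant"
for name, tf in docs.items():
    print(name, cosine(tfidf(query_tf), tfidf(tf)))
# d1 1.0, d2 0.0: terms occurring in every document get idf 0 and contribute nothing
```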

Part 5 Evaluation of IR

Evaluation of IR systems

As with most NLP systems, we require some function that our system should maximise. A lot of NLP evaluation relies on precision and recall.

Basic rationale
For a given classification problem, we have a gold standard against which to compare our system's results. Compared to the target gold standard, the output can contain: false positives (fp), false negatives (fn), true positives (tp), and true negatives (tn).

Performance is typically measured in terms of precision and recall.

Precision
Definition: the proportion of items that are correctly classified, i.e. the proportion of true positives out of all the system's classifications:
$\mathrm{precision}\ (P) = \frac{tp}{tp + fp}$

Recall
Definition: the proportion of the actual target (gold standard) items that our system classifies correctly:
$\mathrm{recall}\ (R) = \frac{tp}{tp + fn}$
where $tp + fn$ is the total number of items that should be correctly classified, including those the system doesn't get.

Combining precision and recall
Typically we use the F-measure as a global estimate of performance against the gold standard.

We need some factor $\alpha$ to weight precision and recall; $\alpha = 0.5$ gives them equal weighting:
$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}$
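A minimal sketch with invented counts:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r, alpha=0.5):
    """F = 1 / (alpha * 1/P + (1 - alpha) * 1/R); alpha = 0.5 weights P and R equally."""
    return 1 / (alpha / p + (1 - alpha) / r)

# Invented counts: the system returns 8 docs, 6 of them relevant, and misses 4 relevant docs
p = precision(tp=6, fp=2)        # 0.75
r = recall(tp=6, fn=4)           # 0.6
print(p, r, f_measure(p, r))     # balanced F ≈ 0.667
```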

Precision/recall in IR
We assume that the results returned by the IR system can be divided into relevant docs (tp) and non-relevant docs (fp).
Precision = the fraction of returned docs that are relevant.
Recall = the fraction of all relevant docs that are returned.
Problem: IR systems tend to rank documents by relevance.

Method 1: interpolated precision and recall
We can split documents into equivalence classes (those at a given rank) and compute P and R at each rank (example from J&M 2009, p. 807).

The example is based on a collection of 9 docs. Recall increases when relevant items are encountered; precision is very variable!

Method 1: interpolated precision and recall (cont'd)
[Plot of maximum precision at different recall intervals]

Method 2: mean average precision
Here, we just compute the average precision at or above a given rank:
$MAP = \frac{1}{|R_r|} \sum_{d \in R_r} \mathrm{Precision}_r(d)$
where $R_r$ = the set of relevant docs at or above rank r, and $\mathrm{Precision}_r(d)$ = the precision at the rank at which document d was found. NB: this metric favours systems that return relevant documents at high ranks. Is this justified?
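A small sketch of average precision for a single ranked list; the relevance judgements are invented, and MAP over a set of queries would simply average these per-query values:

```python
def average_precision(ranked_relevance):
    """Mean of the precision values at the ranks where relevant documents occur.
    `ranked_relevance` is a list of booleans, one per returned document, in rank order."""
    precisions = []
    relevant_so_far = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            relevant_so_far += 1
            precisions.append(relevant_so_far / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Invented relevance judgements for a ranked list of 9 returned documents
ranking = [True, False, True, True, False, False, True, False, False]
print(average_precision(ranking))   # (1/1 + 2/3 + 3/4 + 4/7) / 4 ≈ 0.75
```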

Part 6 Improving IR

Improving IR: simple techniques
A lot of queries will contain:
morphologically inflected words (beans, etc.)
function words (fried and breadcrumb)
Performance usually improves if:
we perform some kind of stemming
we use a stop list

Function words aren't very informative (cf. Zipf's law), while stemming allows us to identify query-doc relatedness in a more general fashion.

Using semantic information
Homonymy and polysemy can reduce precision: a query containing bank will match docs containing both senses of the word, but the user will generally only be interested in one of the senses. Perhaps word sense disambiguation (WSD) can help improve IR? More on WSD next week.

Query expansion
One way to try to improve IR is to expand a query by:

finding similar words to the words in the query (using a thesaurus), e.g. q = (Steve Jobs); qexp = (Steve Jobs, Mac). But this will depend on the quality of our thesaurus. E.g. is it useful to expand the query dog with the related words cat, animal and horse? (These are actual examples from the BNC SketchEngine thesaurus.) Or is it more useful to restrict our thesaurus to only synonyms (dog, canine, etc.)?
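A toy sketch of this kind of expansion; the thesaurus entries echo the examples above, and the neighbour cut-off is an arbitrary assumption:

```python
# Hypothetical distributional thesaurus: word -> neighbours ranked by similarity
thesaurus = {
    "dog": ["cat", "animal", "horse", "canine"],
}

def expand_query(query_terms, thesaurus, max_neighbours=2):
    """Add up to `max_neighbours` thesaurus neighbours for each query term."""
    expanded = list(query_terms)
    for term in query_terms:
        expanded.extend(thesaurus.get(term, [])[:max_neighbours])
    return expanded

print(expand_query(["dog", "walking"], thesaurus))
# ['dog', 'walking', 'cat', 'animal'] -- whether this helps depends entirely on thesaurus quality
```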
