Slide Material for DHS Reverse Site Visit

Information Extraction, Data Mining and Topic Discovery with Probabilistic Models

Andrew McCallum
Computer Science Department, University of Massachusetts Amherst

Joint work with Charles Sutton, Aron Culotta, Wei Li, Xuerui Wang, Andres Corrada, Ben Wellner, Chris Pal, Michael Hay, Natasha Mohanty, David Mimno, Gideon Mann.

Information Extraction with Conditional Random Fields

Goal: Mine actionable knowledge from unstructured text.

An HR office

Jobs, but not HR jobs

Extracting Job Openings from the Web

foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.htm
OtherCompanyJobs: foodscience.com-Job1

A Portal for Job Openings

Data Mining the Extracted Job Information

IE from Research Papers [McCallum et al 99]

Mining Research Papers [Rosen-Zvi, Griffiths, Steyvers, Smyth, 2004]

IE from Chinese Documents regarding Weather

Department of Terrestrial System, Chinese Academy of Sciences
200k+ documents, several millennia old
- Qing Dynasty Archives
- memos
- newspaper articles
- diaries

Why prefer knowledge base search over page search

- Targeted, restricted universe of hits: Don't show resumes when I'm looking for job openings.
- Specialized queries: topic-specific, multi-dimensional, based on information spread on multiple pages.
- Get correct granularity: site, page, paragraph.
- Specialized display: super-targeted hit summarization in terms of DB slot values.
- Ability to support sophisticated data mining

Information Extraction is needed to automatically build the Knowledge Base.

What is Information Extraction

As a task: Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying

IE output:

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

As a family of techniques:

Information Extraction = segmentation + classification + association + clustering

Segments found in the article above, in order of appearance:
* Microsoft Corporation, CEO, Bill Gates, * Microsoft, Gates, * Microsoft, Bill Veghte, * Microsoft, VP, Richard Stallman, founder, Free Software Foundation

Larger Context

Document collection -> Spider -> Filter -> IE (Segment, Classify, Associate, Cluster) -> Database -> Data Mining (Discover patterns: entity types, links / relations, events) -> Actionable knowledge (Prediction, Outlier detection, Decision support)

Landscape of IE Tasks (1/4): Pattern Feature Domain

- Text paragraphs without formatting
- Grammatical sentences and some formatting & links

Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.

- Non-grammatical snippets, rich formatting & links
- Tables

Landscape of IE Tasks (2/4): Pattern Scope

- Web site specific (formatting): Amazon.com book pages
- Genre specific (layout): resumes
- Wide, non-specific (language): university names

Landscape of IE Tasks (3/4): Pattern Complexity

E.g. word patterns:
- Closed set: U.S. states
    He was born in Alabama
    The big Wyoming sky
- Regular set: U.S. phone numbers
    Phone: (413) 545-1323
    The CALD main office can be reached at 412-268-1299
- Complex pattern: U.S. postal addresses
    University of Arkansas P.O. Box 140 Hope, AR 71802
    Headquarters: 1128 Main Street, 4th Floor Cincinnati, Ohio 45210
- Ambiguous patterns, needing context and many sources of evidence: person names
    ...was among the six houses sold by Hope Feldman that year.
    Pawel Opalinski, Software Engineer at WhizBang Labs.

Landscape of IE Tasks (4/4): Pattern Combinations

Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.

Single entity (named entity extraction):
    Person: Jack Welch
    Person: Jeffrey Immelt
    Location: Connecticut

Binary relationship:
    Relation: Person-Title, Person: Jack Welch, Title: CEO
    Relation: Company-Location, Company: General Electric, Location: Connecticut

N-ary record:
    Relation: Succession, Company: General Electric, Title: CEO, Out: Jack Welch, In: Jeffrey Immelt

Evaluation of Single Entity Extraction

TRUTH: Michael Kearns and Sebastian Seung will start Monday's tutorial, followed by Richard M. Karpe and Martin Cooke.
PRED: Michael Kearns and Sebastian Seung will start Monday's tutorial, followed by Richard M. Karpe and Martin Cooke.

Precision = (# correctly predicted segments) / (# predicted segments) = 2/6
Recall = (# correctly predicted segments) / (# true segments) = 2/4
F1 = harmonic mean of Precision & Recall = 1 / (((1/P) + (1/R)) / 2)

State of the Art Performance

- Named entity recognition
    Person, Location, Organization, ...
    F1 in high 80s or low- to mid-90s
- Binary relation extraction
    Contained-in (Location1, Location2)
    Member-of (Person1, Organization1)
    F1 in 60s or 70s or 80s

- Wrapper induction
    Extremely accurate performance obtainable
    Human effort (~30min) required on each site

Outline

- Examples of IE and Data Mining
- IE with Hidden Markov Models
- Introduction to Conditional Random Fields (CRFs)
- Examples of IE with CRFs
- Sequence Alignment with CRFs
- Semi-supervised Learning

Hidden Markov Models

HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, ...

[Figure: an HMM drawn as a finite state model and as a graphical model. Transitions link the state sequence S_{t-1}, S_t, S_{t+1}; each state generates one element of the observation sequence O_{t-1}, O_t, O_{t+1} (o1 o2 o3 o4 o5 o6 o7 o8).]

\[ P(\vec{s}, \vec{o}) = \prod_{t=1}^{|\vec{o}|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t) \]

Parameters, for all states S = {s1, s2, ...}:
- Start state probabilities: P(s_t)
- Transition probabilities: P(s_t | s_{t-1})
- Observation (emission) probabilities: P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet

Training: Maximize probability of training observations (w/ prior)

IE with Hidden Markov Models

Given a sequence of observations:
    Yesterday Pedro Domingos spoke this example sentence.
and a trained HMM with states: person name, location name, background

Find the most likely state sequence (Viterbi):
    Yesterday Pedro Domingos spoke this example sentence.

Any words said to be generated by the designated person name state are extracted as a person name:
    Person name: Pedro Domingos

HMM Example: Nymble [Bikel, et al 1998], [BBN IdentiFinder]

Task: Named Entity Extraction

[Figure: finite state model with states start-of-sentence, Person, Org, (five other name classes), Other, end-of-sentence.]

Train on ~500k words of news wire text.

Transition probabilities: P(s_t | s_{t-1}, o_{t-1})
    Back-off to: P(s_t | s_{t-1}), then P(s_t)

Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1})
    Back-off to: P(o_t | s_t), then P(o_t)

Results:

Case   Language  F1
Mixed  English   93%
Upper  English   91%
Mixed  Spanish   90%

Other examples of shrinkage for HMMs in IE: [Freitag and McCallum 99]

We want More than an Atomic View of Words

Would like richer representation of text: many arbitrary, overlapping features of the words.
- identity of word
- ends in "-ski"
- is capitalized
- is part of a noun phrase
- is "Wisniewski"
- is in a list of city names
- is under node X in WordNet
- is in bold font
- is indented
- is in hyperlink anchor
- last person name was female
- next two words are "and Associates"
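A minimal sketch of what "many arbitrary, overlapping features" can look like in code. The function name `token_features`, the feature-name strings and the `city_names` argument are illustrative assumptions, not something taken from the slides, and only a handful of the listed features are shown.

```python
def token_features(tokens, i, city_names=frozenset()):
    """Return the set of binary features that fire for tokens[i].

    Several features can fire for the same word, and they are not
    independent of each other (e.g. "is-capitalized" and "in-city-list").
    """
    word = tokens[i]
    lower = word.lower()
    feats = {"word=" + lower}                 # identity of word
    if lower.endswith("ski"):                 # ends in "-ski"
        feats.add("ends-in-ski")
    if word[:1].isupper():                    # is capitalized
        feats.add("is-capitalized")
    if lower in city_names:                   # is in a list of city names
        feats.add("in-city-list")
    next_two = [t.lower() for t in tokens[i + 1:i + 3]]
    if next_two == ["and", "associates"]:     # next two words are "and Associates"
        feats.add("next-two-words=and-Associates")
    return feats


tokens = ["Wisniewski", "and", "Associates", "opened", "in", "Amherst"]
print(sorted(token_features(tokens, 0)))
print(sorted(token_features(tokens, 5, frozenset({"amherst"}))))
```

A generative HMM would have to model how all of these features are jointly produced; a conditional model only needs to look at them.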

Problems with Richer Representation and a Generative Model

These arbitrary features are not independent.
- Multiple levels of granularity (chars, words, phrases)
- Multiple dependent modalities (words, formatting, layout)
- Past & future

Two choices:
- Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data!
- Ignore the dependencies. This causes over-counting of evidence (a la naive Bayes). Big problem when combining evidence, as in Viterbi!

[Figures: one graphical model over S_{t-1}, S_t, S_{t+1} and O_{t-1}, O_t, O_{t+1} for each of the two choices.]
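Viterbi decoding, used above to find the most likely HMM state sequence, can be sketched as follows. The two-state model and every probability in it are made-up toy values chosen for illustration; they are not parameters from Nymble or from any trained model.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence for obs under an HMM (log-space)."""
    # delta[s] = best log-probability of any path that ends in state s
    delta = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    backptrs = []
    for o in obs[1:]:
        new_delta, ptr = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: delta[r] + math.log(trans_p[r][s]))
            new_delta[s] = (delta[best_prev] + math.log(trans_p[best_prev][s])
                            + math.log(emit_p[s][o]))
            ptr[s] = best_prev
        delta = new_delta
        backptrs.append(ptr)
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: delta[s])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Hypothetical toy parameters (assumptions, for illustration only).
states = ["background", "person"]
start_p = {"background": 0.8, "person": 0.2}
trans_p = {"background": {"background": 0.7, "person": 0.3},
           "person": {"background": 0.4, "person": 0.6}}
emit_p = {"background": {"Yesterday": 0.4, "spoke": 0.4, "Pedro": 0.1, "Domingos": 0.1},
          "person": {"Yesterday": 0.05, "spoke": 0.05, "Pedro": 0.45, "Domingos": 0.45}}

obs = ["Yesterday", "Pedro", "Domingos", "spoke"]
path = viterbi(obs, states, start_p, trans_p, emit_p)
person_name = [w for w, s in zip(obs, path) if s == "person"]
print(path)
print(person_name)
```

Words assigned to the `person` state are read off as the extracted name, mirroring the "Pedro Domingos" example.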

Conditional Sequence Models

We prefer a model that is trained to maximize a conditional probability rather than joint probability: P(s|o) instead of P(s,o):
- Can examine features, but not responsible for generating them.
- Don't have to explicitly model their dependencies.
- Don't waste modeling effort trying to generate what we are given at test time anyway.

From HMMs to Conditional Random Fields [Lafferty, McCallum, Pereira 2001]

\[ \vec{s} = s_1, s_2, \ldots, s_n \qquad \vec{o} = o_1, o_2, \ldots, o_n \]

Joint:
\[ P(\vec{s}, \vec{o}) = \prod_{t=1}^{|\vec{o}|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t) \]

Conditional:
\[ P(\vec{s} \mid \vec{o}) = \frac{1}{P(\vec{o})} \prod_{t=1}^{|\vec{o}|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t) = \frac{1}{Z(\vec{o})} \prod_{t=1}^{|\vec{o}|} \Phi_s(s_t, s_{t-1}) \, \Phi_o(o_t, s_t) \]
where
\[ \Phi_o(t) = \exp\Big( \sum_k \lambda_k f_k(s_t, o_t) \Big) \]

(A super-special case of Conditional Random Fields.)

Set parameters by maximum likelihood, using optimization method on L.

Linear Chain Conditional Random Fields [Lafferty, McCallum, Pereira 2001]

[Figure: chain of states S_t, S_{t+1}, S_{t+2}, S_{t+3}, S_{t+4}, each connected to the whole observation sequence O = O_t, O_{t+1}, O_{t+2}, O_{t+3}, O_{t+4}.]

Markov on s, conditional dependency on o.

\[ P(\vec{s} \mid \vec{o}) = \frac{1}{Z_{\vec{o}}} \prod_{t=1}^{|\vec{o}|} \exp\Big( \sum_j \lambda_j f_j(s_t, s_{t-1}, \vec{o}, t) \Big) \]

The Hammersley-Clifford-Besag theorem stipulates that the CRF has this form: an exponential function of the cliques in the graph.

Assuming that the dependency structure of the states is tree-shaped (linear chain is a trivial tree), inference can be done by dynamic programming in time O(|o| |S|^2), just like HMMs.

CRFs vs. HMMs

- More general and expressive modeling technique
- Comparable computational efficiency
- Features may be arbitrary functions of any or all observations
- Parameters need not fully specify generation of observations; require less training data
- Easy to incorporate domain knowledge
- State means only "state of process", vs "state of process" and "observational history I'm keeping"

Training CRFs

Maximize log-likelihood of parameters given training data:
\[ L(\{\lambda_k\} \mid \{\langle \vec{o}, \vec{s} \rangle^{(i)}\}) \]

Log-likelihood gradient:
\[ \frac{\partial L}{\partial \lambda_k} = \sum_i C_k(\vec{s}^{(i)}, \vec{o}^{(i)}) - \sum_i \sum_{\vec{s}} P_{\{\lambda_k\}}(\vec{s} \mid \vec{o}^{(i)}) \, C_k(\vec{s}, \vec{o}^{(i)}) - \frac{\lambda_k}{\sigma^2} \]
\[ C_k(\vec{s}, \vec{o}) = \sum_t f_k(\vec{o}, t, s_{t-1}, s_t) \]

The three terms are: feature count using correct labels, minus feature count using predicted labels, minus smoothing penalty.
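The linear-chain form above, and the O(|o| |S|^2) dynamic program for its normalizer, can be sketched in a few lines. The two feature families used here (a transition indicator and a "capitalized word" indicator) and all weights are hypothetical choices for illustration; the slides do not specify a feature set at this point.

```python
import math
from itertools import product


def log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def score(prev, cur, obs, t, weights):
    """sum_j lambda_j * f_j(s_t, s_{t-1}, o, t) for two assumed feature families."""
    total = weights.get(("trans", prev, cur), 0.0)    # fires on the (prev, cur) pair
    if obs[t][:1].isupper():                           # may inspect any part of obs
        total += weights.get(("cap", cur), 0.0)
    return total


def log_partition(obs, states, weights):
    """Forward algorithm: log Z_o in O(|o| |S|^2) time."""
    alpha = {s: score(None, s, obs, 0, weights) for s in states}
    for t in range(1, len(obs)):
        alpha = {s: log_sum_exp([alpha[r] + score(r, s, obs, t, weights) for r in states])
                 for s in states}
    return log_sum_exp(list(alpha.values()))


def log_prob(path, obs, states, weights):
    """log P(s | o) for one label sequence."""
    total, prev = 0.0, None
    for t, cur in enumerate(path):
        total += score(prev, cur, obs, t, weights)
        prev = cur
    return total - log_partition(obs, states, weights)


states = ["person", "background"]
weights = {("cap", "person"): 1.5,
           ("trans", "person", "person"): 1.0,
           ("trans", "background", "background"): 0.5}
obs = ["Yesterday", "Pedro", "Domingos", "spoke"]

# Sanity check: the probabilities of all |S|^|o| label sequences sum to 1.
total = sum(math.exp(log_prob(p, obs, states, weights))
            for p in product(states, repeat=len(obs)))
print(total)
```

The brute-force sum over all label sequences is only there to check the forward recursion; it is exponential in the sequence length, which is exactly what the dynamic program avoids.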

Table Extraction from Government Reports

Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers.

An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households.

      Milk Cows and Production of Milk and Milkfat: United States, 1993-95
--------------------------------------------------------------------------------
      :            :          Production of Milk and Milkfat 2/
      :   Number   :------------------------------------------------------
 Year :     of     :   Per Milk Cow    :  Percentage   :      Total
      :Milk Cows 1/:-------------------: of Fat in All :-----------------
      :            :  Milk  : Milkfat  : Milk Produced :  Milk  : Milkfat
--------------------------------------------------------------------------------
      : 1,000 Head     --- Pounds ---      Percent       Million Pounds
      :
 1993 :   9,589      15,704     575         3.66       150,582   5,514.4
 1994 :   9,500      16,175     592         3.66       153,664   5,623.7
 1995 :   9,461      16,451     602         3.66       155,644   5,694.3
--------------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.

Table Extraction from Government Reports [Pinto, McCallum, Wei, Croft, 2003 SIGIR]

100+ documents from www.fedstats.gov

[Figure: the report text above, with each line labeled by a CRF.]

Labels:
- Non-Table
- Table Title
- Table Header
- Table Data Row
- Table Section Data Row
- Table Footnote
- ... (12 in all)

Features:
- Percentage of digit chars
- Percentage of alpha chars
- Indented
- Contains 5+ consecutive spaces
- Whitespace in this line aligns with prev.
- ...
- Conjunctions of all previous features, time offset: {0,0}, {-1,0}, {0,1}, {1,2}.

Table Extraction Experimental Results [Pinto, McCallum, Wei, Croft, 2003 SIGIR]

                              HMM    Stateless MaxEnt  CRF
Line labels, percent correct  65 %   85 %              95 %
Table segments, F1            64 %                     92 %

IE from Research Papers [McCallum et al 99]

Field-level F1:

Hidden Markov Models (HMMs)       75.6  [Seymore, McCallum, Rosenfeld, 1999]
Support Vector Machines (SVMs)    89.7  [Han, Giles, et al, 2003]
Conditional Random Fields (CRFs)  93.9  [Peng, McCallum, 2004]

Error reduction from SVMs to CRFs: 40%

Chinese Word Segmentation [McCallum & Feng 2003]

~100k words data, Penn Chinese Treebank

Lexicon features: adjective ending character, adverb ending character, building words, Chinese number characters, Chinese period, cities and regions, countries, dates, department characters, digit characters, foreign name chars, function words, job title, locations, money, negative characters, organization indicator, preposition characters, provinces, punctuation chars, Roman alphabetics, Roman digits, stopwords, surnames, symbol characters, verb chars, wordlist (188k lexicon)

Chinese Word Segmentation Results [McCallum & Feng 2003]

Precision and recall of segments with perfect boundaries (testing segmentation):

Method    # training sentences  prec.  recall  F1
[Peng]    ~5M                   75.1   74.0    74.2
[Ponte]   ?                     84.4   87.8    86.0
[Teahan]  ~40k                  ?      ?       94.4
[Xue]     ~10k                  95.2   95.1    95.2
CRF       2805                  97.3   97.8    97.5
CRF       140                   95.4   96.0    95.7
CRF       56                    93.9   95.0    94.4

Error reduction relative to the previous world's best: 50%

Named Entity Recognition

CRICKET - MILLNS SIGNS FOR BOLAND

CAPE TOWN 1996-08-22

South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract.

Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.

Labels  Examples
PER     Yayuk Basuki, Innocent Butare
ORG     3M, KDP, Cleveland
LOC     Cleveland, Nirmal Hriday, The Oval
MISC    Java, Basque, 1,000 Lakes Rally

Automatically Induced Features [McCallum & Li, 2003, CoNLL]

Index  Feature
0      inside-noun-phrase (o_{t-1})
5      stopword (o_t)
20     capitalized (o_{t+1})
75     word=the (o_t)
100    in-person-lexicon (o_{t-1})
200    word=in (o_{t+2})
500    word=Republic (o_{t+1})
711    word=RBI (o_t) & header=BASEBALL
1027   header=CRICKET (o_t) & in-English-county-lexicon (o_t)
1298   company-suffix-word (firstmention_{t+2})
4040   location (o_t) & POS=NNP (o_t) & capitalized (o_t) & stopword (o_{t-1})
4945   moderately-rare-first-name (o_{t-1}) & very-common-last-name (o_t)
4474   word=the (o_{t-2}) & word=of (o_t)

Named Entity Extraction Results [McCallum & Li, 2003, CoNLL]

Method                                               F1
HMMs, BBN's Identifinder                             73%
CRFs w/out Feature Induction                         83%
CRFs with Feature Induction based on LikelihoodGain  90%

Related Work

CRFs are widely used for information extraction ...including more complex structures, like trees:

- [Zhu, Nie, Zhang, Wen, ICML 2007]: Dynamic Hierarchical Markov Random Fields and their Application to Web Data Extraction
- [Viola & Narasimhan]: Learning to Extract Information from Semi-structured Text using a Discriminative Context Free Grammar
- [Jousse et al 2006]: Conditional Random Fields for XML Trees

String Edit Distance

Distance between sequences x and y: cost of lowest-cost sequence of edit operations that transform string x into y.

Applications
- Database Record Deduplication
    Apex International Hotel Grassmarket Street
    Apex Internatl Grasmarket Street
    Records are duplicates of the same hotel?
- Biological Sequences
    AGCTCTTACGATAGAGGACTCCAGA
    AGGTCTTACCAAAGAGGACTTCAGA
- Machine Translation
    Il a acheté une pomme
    He bought an apple
- Textual Entailment
    He bought a new car last night
    He purchased a brand new automobile yesterday evening

Levenshtein Distance [1966]

Edit operations:

copy    Copy a character from x to y          (cost 0)
insert  Insert a character into y             (cost 1)
delete  Delete a character from y             (cost 1)
subst   Substitute one character for another  (cost 1)

Align two strings:

x1 = William W. Cohon
x2 = Willleam Cohen

Lowest cost alignment (c = copy, i = insert, d = delete, s = subst):

x1:    W i l l - i a m _ W . _ C o h o n
x2:    W i l l l e a m - - - _ C o h e n
op:    c c c c i s c c d d d c c c c s c
cost:  0 0 0 0 1 1 0 0 1 1 1 0 0 0 0 1 0

Total cost = 6 = Levenshtein Distance

Dynamic program

D(i,j) = score of best alignment from x1... xi to y1... yj.

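\[ D(i,j) = \min \begin{cases} D(i-1,j-1) + \delta(x_i \neq y_j) \\ D(i-1,j) + 1 \\ D(i,j-1) + 1 \end{cases} \]

DP table for William (columns) and Willleam (rows):

         W  i  l  l  i  a  m
      0  1  2  3  4  5  6  7
   W  1  0  1  2  3  4  5  6
   i  2  1  0  1  2  3  4  5
   l  3  2  1  0  1  2  3  4
   l  4  3  2  1  0  1  2  3
   l  5  4  3  2  1  1  2  3
   e  6  5  4  3  2  2  2  3
   a  7  6  5  4  3  3  2  3
   m  8  7  6  5  4  4  3  2

The lowest-cost path through the table uses one insert and one subst.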
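The dynamic program for Levenshtein distance can be sketched directly from the recurrence D(i,j) = min of the diagonal, upper and left neighbours. The function name `levenshtein` is just a label chosen here, and the unit costs are the ones listed in the edit-operation table (copy 0; insert, delete and subst 1).

```python
def levenshtein(x, y):
    """Edit distance between strings x and y with unit insert/delete/subst costs."""
    # D[i][j] = cost of the best alignment of x[:i] with y[:j]
    D = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        D[i][0] = i          # i unit-cost operations against the empty prefix of y
    for j in range(len(y) + 1):
        D[0][j] = j          # j unit-cost operations against the empty prefix of x
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            mismatch = 0 if x[i - 1] == y[j - 1] else 1   # copy costs 0, subst costs 1
            D[i][j] = min(D[i - 1][j - 1] + mismatch,     # copy / subst
                          D[i - 1][j] + 1,                # skip a character of x
                          D[i][j - 1] + 1)                # skip a character of y
    return D[len(x)][len(y)]


print(levenshtein("Willleam", "William"))                  # 2
print(levenshtein("William W. Cohon", "Willleam Cohen"))   # 6
```

The second call reproduces the total cost of 6 from the William W. Cohon / Willleam Cohen alignment.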

Total cost = distance

Levenshtein Distance with Markov Dependencies

Edit operations: copy, insert, delete, subst, as before, but the cost of each operation now depends on the previous operation. For example, a repeated delete is cheaper.

[Table: cost of each operation (copy, insert, delete, subst) after each previous operation (c, i, d, s).]

Learn these costs from training data.

The DP table for William and Willleam becomes a 3D DP table, with one layer per edit operation (subst, copy, delete, insert).

Ristad & Yianilos (1997)

Essentially a Pair-HMM, generating an edit/state/alignment-sequence and two strings.

[Figure: an alignment a of string 1 and string 2; each alignment step a_t has an edit operation a_t.e and string positions a_t.i1 and a_t.i2.]

Complete data likelihood:
\[ p(a, x_1, x_2) = \prod_t p(a_t \mid a_{t-1}) \, p(x_{1,a_t.i_1}, x_{2,a_t.i_2} \mid a_t) \]

Incomplete data likelihood (sum over all alignments consistent with x1 and x2), used as the match score:
\[ p(x_1, x_2) = \sum_{a : x_1, x_2} \prod_t p(a_t \mid a_{t-1}) \, p(x_{1,a_t.i_1}, x_{2,a_t.i_2} \mid a_t) \]

Given training set of matching string pairs, objective fn is
\[ O = \prod_j p(x_1^{(j)}, x_2^{(j)}) \]

Learn via EM:
- Expectation step: Calculate likelihood of alignment paths
- Maximization step: Make those paths more likely.

Ristad & Yianilos Regrets

- Limited features of input strings
    Examine only single character pair at a time
    Difficult to use upcoming string context, lexicons, ...
    Example: Senator John Green / John Green

- Limited edit operations
    Difficult to generate arbitrary jumps in both strings
    Example: UMass / University of Massachusetts.
- Trained only on positive match data
    Doesn't include information-rich near misses
    Example: ACM SIGIR / ACM SIGCHI

So, consider model trained by conditional probability

Conditional Probability (Sequence) Models

We prefer a model that is trained to maximize a conditional probability rather than joint probability: P(y|x) instead of P(y,x):
- Can examine features, but not responsible for generating them.
- Don't have to explicitly model their dependencies.

CRF String Edit Distance

[Figure: an alignment a of string 1 and string 2; each alignment step a_t has an edit operation a_t.e and string positions a_t.i1 and a_t.i2.]

Joint complete data likelihood:
\[ p(a, x_1, x_2) = \prod_t p(a_t \mid a_{t-1}) \, p(x_{1,a_t.i_1}, x_{2,a_t.i_2} \mid a_t) \]

Conditional complete data likelihood:
\[ p(a \mid x_1, x_2) = \frac{1}{Z_{x_1,x_2}} \prod_t \Phi(a_{t-1}, a_t, x_1, x_2) \]

Want to train from set of string pairs, each labeled one of {match, non-match}:

match      William W. Cohon  Willlleam Cohen
non-match  Bruce D'Ambrosio  Bruce Croft
match      Tommi Jaakkola    Tommi Jakola
match      Stuart Russell    Stuart Russel
non-match  Tom Deitterich    Tom Dean

CRF String Edit Distance FSM

[Figure: finite state machine. From Start, one branch enters the match states (m=1) and the other the non-match states (m=0); each branch has its own subst, copy, delete and insert states.]

Conditional incomplete data likelihood:
\[ p(m \mid x_1, x_2) = \sum_{a \in S_m} \frac{1}{Z_{x_1,x_2}} \prod_t \Phi(a_{t-1}, a_t, x_1, x_2) \]

x1 = Tommi Jaakkola, x2 = Tommi Jakola
    Probability summed over all alignments in match states: 0.8
    Probability summed over all alignments in non-match states: 0.2

x1 = Tom Dietterich, x2 = Tom Dean
    Probability summed over all alignments in match states: 0.1
    Probability summed over all alignments in non-match states: 0.9

Parameter Estimation

Given training set of string pairs and match/non-match labels, objective fn is the incomplete log likelihood:
\[ O = \sum_j \log p(m^{(j)} \mid x_1^{(j)}, x_2^{(j)}) \]

The complete log likelihood:
\[ \sum_j \sum_a \log\big( p(m^{(j)} \mid a, x_1^{(j)}, x_2^{(j)}) \, p(a \mid x_1^{(j)}, x_2^{(j)}) \big) \]

Expectation Maximization
- E-step: Estimate distribution over alignments, p(a | x1^(j), x2^(j)), using current parameters
- M-step: Change parameters to maximize the complete (penalized) log likelihood, with an iterative quasi-Newton method (BFGS)

This is conditional EM, but avoid complexities of [Jebara 1998], because no need to solve M-step in closed form.

Efficient Training

- Dynamic programming table is 3D; |x1| = |x2| = 100, |S| = 12, .... 120,000 entries
- Use beam search during E-step [Pal, Sutton, McCallum 2005]
- Unlike completely observed CRFs, objective function is not convex.
- Initialize parameters not at zero, but so as to yield a reasonable initial edit distance.

What Alignments are Learned?

[Figures: learned alignments drawn on the match / non-match FSM for three string pairs:
    x1 = Tommi Jaakkola, x2 = Tommi Jakola
    x1 = Bruce Croft, x2 = Tom Dean
    x1 = Jaime Carbonell, x2 = Jamie Callan]

Summary of Advantages

- Arbitrary features of the input strings
    Examine past, future context
    Use lexicons, WordNet
- Extremely flexible edit operations
    Single operation may make arbitrary jumps in both strings, of size determined by input features
- Discriminative Training
    Maximize ability to predict match vs non-match
- Easy to Label Data
    Match/Non-Match, no need for labeled alignments

Experimental Results: Data Sets

- Restaurant name, Restaurant address
    864 records, 112 matches
    E.g. Abe's Bar & Grill, E. Main St / Abe's Grill, East Main Street
- People names, UIS DB generator
    synthetic noise
    E.g. John Smith vs Snith, John
- CiteSeer Citations
    In four sections: Reason, Face, Reinforce, Constraint
    E.g. Rusell & Norvig, Artificial Intelligence: A Modern... / Russell & Norvig, Artificial Intelligence: An Intro...

Experimental Results: Features

- same, different
- same-alphabetic, different-alphabetic
- same-numeric, different-numeric
- punctuation1, punctuation2
- alphabet-mismatch, numeric-mismatch
- end-of-1, end-of-2
- same-next-character, different-next-character

Experimental Results: Edit Operations

- insert, delete, substitute/copy
- swap-two-characters
- skip-word-if-in-lexicon
- skip-parenthesized-words
- skip-any-word
- substitute-word-pairs-in-translation-lexicon
- skip-word-if-present-in-other-string

Experimental Results [Bilenko & Mooney 2003]

F1 (harmonic mean of precision and recall)

Distance metric    Restaurant name  Restaurant address  CiteSeer Reason  Face   Reinf  Constraint
Levenshtein        0.290            0.686               0.927            0.952  0.893  0.924
Learned Leven.     0.354            0.712               0.938            0.966  0.907  0.941
Vector             0.365            0.380               0.897            0.922  0.903  0.923
Learned Vector     0.433            0.532               0.924            0.875  0.808  0.913
CRF Edit Distance  0.448            0.783               0.964            0.918  0.917  0.976

Experimental Results

Data set: person names, with word-order noise added

                                         F1
Without skip-if-present-in-other-string  0.856
With skip-if-present-in-other-string     0.981

Related Work

Learned Edit Distance
- [Bilenko & Mooney 2003], [Cohen et al 2003], ...

- [Joachims 2003]: Max-margin, trained on alignments
Conditionally-trained models with latent variables
- [Jebara 1999]: Conditional Expectation Maximization
- [Quattoni, Collins, Darrell 2005]: CRF for visual object recognition, with latent classes for object sub-patches
- [Zettlemoyer & Collins 2005]: CRF for mapping sentences to logical form, with latent parses

Outline
- Examples of IE and Data Mining
- IE with Hidden Markov Models
- Introduction to Conditional Random Fields (CRFs)
- Examples of IE with CRFs
- Sequence Alignment with CRFs
- Semi-supervised Learning

Semi-Supervised Learning
How to train with limited labeled data? Augment with lots of unlabeled data.
Expectation Regularization [Mann, McCallum, ICML 2007]

Supervised Learning
[Figure: decision boundary]

Creation of labeled instances requires extensive human effort.
What if limited labeled data? Small amount of labeled data.

Semi-Supervised Learning: Labeled & Unlabeled data
- Small amount of labeled data
- Large amount of unlabeled data
Augment limited labeled data by using unlabeled data.

More Semi-Supervised Algorithms than Applications
[Figure: number of papers per year (0 to 30), 1998 to 2006, Algorithms vs Applications. Compiled from [Zhu, 2007].]

Weakness of Many Semi-Supervised Algorithms
- Difficult to implement: significantly more complicated than supervised counterparts
- Fragile: meta-parameters hard to tune
- Lacking in scalability: O(n^2) or O(n^3) on unlabeled data

"EM will generally degrade [tagging] accuracy, except when only a limited amount of hand-tagged text is available." [Merialdo, 1994]
"When the percentage of labeled data increases from 50% to 75%, the performance of [Label Propagation with Jensen-Shannon divergence] and SVM become almost same, while [Label propagation with cosine distance] performs significantly worse than SVM." [Niu, Ji, Tan, 2005]

Families of Semi-Supervised Learning
1. Expectation Maximization
2. Graph-Based Methods
3. Auxiliary Functions
4. Decision Boundaries in Sparse Regions

Family 1: Expectation Maximization [Dempster, Laird, Rubin, 1977]
- Fragile: often worse than supervised
Family 2: Graph-Based Methods [Szummer, Jaakkola, 2002] [Zhu, Ghahramani, 2002]
- Lacking in scalability; sensitive to choice of metric
Family 3: Auxiliary-Task Methods [Ando and Zhang, 2005]
- Complicated to find appropriate auxiliary tasks
Family 4: Decision Boundary in Sparse Region
- Transductive SVMs [Joachims, 1999]: sparsity measured by margin
- Entropy Regularization [Grandvalet and Bengio, 2005]: sparsity measured by label entropy
Minimal Entropy Solution!

How do we know the minimal entropy solution is wrong?
We suspect at least some of the data is in the second class! In fact we often have prior knowledge of the relative class proportions:
- 0.8 : Student, 0.2 : Professor
- 0.1 : Gene Mention, 0.9 : Background
- 0.6 : Person, 0.4 : Organization
[Figures: bar chart of class size for each example]

Families of Semi-Supervised Learning
1. Expectation Maximization
2. Graph-Based Methods
3. Auxiliary Functions
4. Decision Boundaries in Sparse Regions
5. Expectation Regularization

Family 5: Expectation Regularization
- Low density region
- Favor decision boundaries that match the prior
[Figure: bar chart of class size]

Label Regularization (special case): p(y)
Expectation Regularization (general case): p(y|feature)

Expectation Regularization
- Simple: easy to implement
- Robust: meta-parameters need little or no tuning
- Scalable: linear in number of unlabeled examples

Discriminative Models
- Predict class boundaries directly
- Do not directly estimate class densities
- Make few assumptions (e.g. independence) on features
- Are trained by optimizing conditional log-likelihood
Logistic Regression
[Equation: constraints, expectations]

Expectation Regularization (XR)
[Equation with two labeled terms]
- Log-likelihood
- XR: KL-divergence between a prior distribution and an expected distribution over the unlabeled data
  - Prior distribution (provided from supervised training or estimated on the labeled data)
  - Model's expected distribution on the unlabeled data

After Training, Model Matches Prior Distribution
[Figure: Supervised only vs Supervised + XR]

Gradient for Logistic Regression
[Equation; annotation: when the gradient is 0]
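The XR term described above can be sketched in a few lines. This is only an illustration, assuming a binary logistic regression model and a label prior; the function names, the toy unlabeled points, and the choice of KL(prior || expected) are assumptions made here, not details taken from the slides.

```python
import math

def predict_proba(weights, bias, x):
    """Binary logistic regression: returns [p(y=0|x), p(y=1|x)]."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    p1 = 1.0 / (1.0 + math.exp(-z))
    return [1.0 - p1, p1]

def expected_label_distribution(weights, bias, unlabeled):
    """Average the model's predicted label distribution over the unlabeled examples."""
    totals = [0.0, 0.0]
    for x in unlabeled:
        for y, p in enumerate(predict_proba(weights, bias, x)):
            totals[y] += p
    return [t / len(unlabeled) for t in totals]

def kl_divergence(prior, expected):
    """KL(prior || expected) for two discrete distributions."""
    return sum(p * math.log(p / q) for p, q in zip(prior, expected) if p > 0)

def xr_term(weights, bias, unlabeled, prior):
    """KL divergence between the label prior and the model's expected label distribution."""
    return kl_divergence(prior, expected_label_distribution(weights, bias, unlabeled))

# Hypothetical unlabeled points and the 0.8 / 0.2 class prior from the Student / Professor example.
unlabeled = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
prior = [0.8, 0.2]

uninformed = xr_term([0.0, 0.0], 0.0, unlabeled, prior)           # model predicts 50/50
matched = xr_term([0.0, 0.0], math.log(0.25), unlabeled, prior)   # model predicts 80/20
```

Here `uninformed` is positive because a 50/50 model disagrees with the prior, while `matched` is essentially zero. The term costs one pass over the unlabeled data, which is the "linear in number of unlabeled examples" property claimed above.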

XR Results for Classification: Secondary Structure Prediction
Accuracy by number of labeled examples ("-" = not reported):

Method                             2        100      1000
SVM (supervised)                   -        55.41%   66.29%
Cluster Kernel SVM                 -        57.05%   65.97%
QC Smartsub                        -        57.68%   59.16%
Naive Bayes (supervised)           52.42%   57.12%   64.47%
Naive Bayes EM                     50.79%   57.34%   57.60%
Logistic Regression (supervised)   52.42%   56.74%   65.43%
Logistic Regression + Ent. Reg.    48.56%   54.45%   58.28%
Logistic Regression + XR           57.08%   58.51%   65.44%

XR Results for Classification: Sliding Window Model
[Figure: CoNLL03 Named Entity Recognition Shared Task]
XR Results for Classification: Sliding Window Model 2
[Figure: BioCreativeII 2007 Gene/Gene Product Extraction]
XR Results for Classification: Sliding Window Model 3
[Figure: Wall Street Journal Part-of-Speech Tagging]
XR Results for Classification: SRAA
[Figure: Simulated/Real Auto/Aviation Text Classification]

Noise in Prior Knowledge
What happens when users' estimates of the class proportions are in error?

Noisy Prior Distribution
[Figure: CoNLL03 Named Entity Recognition Shared Task, 20% change in probability of majority class]

Conclusion
Expectation Regularization is an effective, robust method of semi-supervised training which can be applied to discriminative models, such as logistic regression.

Ongoing and Future Work
- Applying Expectation Regularization to other discriminative models, e.g. conditional random fields
- Experimenting with priors other than class label priors

End of Part 1

Joint Inference in Information Extraction & Data Mining
Andrew McCallum, Computer Science Department, University of Massachusetts Amherst
Joint work with Charles Sutton, Aron Culotta, Wei Li, Xuerui Wang, Andres Corrada, Ben Wellner, Chris Pal, Michael Hay, Natasha Mohanty, David Mimno, Gideon Mann.

From Text to Actionable Knowledge
[Figure: Knowledge Discovery pipeline. Components: Document collection; Spider; Filter; IE (Segment, Classify, Associate, Cluster); Database; Data Mining (Discover patterns: entity types, links / relations, events); Actionable knowledge (Prediction, Outlier detection, Decision support).]

Problem:
Combined in serial juxtaposition, IE and DM are unaware of each other's weaknesses and opportunities.
1) DM begins from a populated DB, unaware of where the data came from, or its inherent errors and uncertainties.
2) IE is unaware of emerging patterns and regularities in the DB.
The accuracy of both suffers, and significant mining of complex text sources is beyond reach.

Solution:
[Figure: the same pipeline, with Uncertainty Info and Emerging Patterns exchanged between IE and Data Mining]

Solution: Unified Model
[Figure: the same pipeline, with IE and Data Mining joined in a single Probabilistic Model]
Discriminatively-trained undirected graphical models
- Conditional Random Fields [Lafferty, McCallum, Pereira]
- Conditional PRMs [Koller], [Jensen], [Getoor], [Domingos]
Complex Inference and Learning: just what we researchers like to sink our teeth into!

Scientific Questions
- What model structures will capture salient dependencies?
- Will joint inference actually improve accuracy?
- How to do inference in these large graphical models?
- How to do parameter estimation efficiently in these models, which are built from multiple large components?
- How to do structure discovery in these models?

Outline
- The need for joint inference
- Examples of joint inference
  - Joint Labeling of Cascaded Sequences (Belief Propagation)
  - Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  - Joint Co-reference Resolution (Graph Partitioning)
  - Joint Segmentation and Co-ref (Sparse BP)
  - Joint Relation Extraction and Data Mining (ICM)
  - Probability + First-order Logic, Co-ref on Entities (MCMC)

Cascaded Predictions
[Figure, three builds: layers stacked over the Chinese character (input observation): Segmentation, Part-of-speech, Named-entity tag. Each layer is first an output prediction, then becomes an input observation for the layer above.]
But errors cascade: must be perfect at every stage to do well.

Joint Prediction: Cross-Product over Labels
- 3 x 45 x 11 = 1485 possible states, e.g. state label = (Wordbeg, Noun, Person)
- O(|V| x 1485^2) parameters
- O(|o| x 1485^2) running time
[Figure: Segmentation+POS+NE (output prediction) over Chinese character (input observation)]

Joint Prediction: Factorial CRF
- O(|V| x 2785) parameters
[Figure: Named-entity tag, Part-of-speech, and Segmentation chains (each an output prediction) over Chinese character (input observation)]

Linear-Chain to Factorial CRFs: Model Definition
Linear-chain:
\[ p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \psi_y(y_t, y_{t-1}) \, \psi_{xy}(x_t, y_t) \]
Factorial:
\[ p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \psi_u(u_t, u_{t-1}) \, \psi_v(v_t, v_{t-1}) \, \psi_w(w_t, w_{t-1}) \, \psi_{uv}(u_t, v_t) \, \psi_{vw}(v_t, w_t) \, \psi_{wx}(w_t, x_t) \]
where
\[ \psi(\cdot) = \exp\Big( \sum_k \lambda_k f_k(\cdot) \Big) \]
[Figure: linear-chain graph over y and x; factorial graph over chains u, v, w and input x]
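The state counts quoted for the cross-product and factorial models can be checked with simple arithmetic. The sketch below uses the label-set sizes from the slide (3 segmentation, 45 part-of-speech, 11 named-entity labels). The decomposition that yields 2785 is an assumption about how that number arises (within-chain transitions plus pairings between adjacent chains); the slides do not spell it out.

```python
# Label-set sizes quoted on the Joint Prediction slide.
n_seg, n_pos, n_ne = 3, 45, 11

# Cross-product model: one state per (segmentation, POS, NE) triple.
cross_product_states = n_seg * n_pos * n_ne
cross_product_transitions = cross_product_states ** 2

# Factorial model (assumed decomposition): transitions within each chain,
# plus label pairings between adjacent chains at the same time step.
within_chain = n_seg ** 2 + n_pos ** 2 + n_ne ** 2
between_chains = n_seg * n_pos + n_pos * n_ne
factorial_configurations = within_chain + between_chains

print(cross_product_states, cross_product_transitions, factorial_configurations)
```

Under this reading the factorial structure replaces roughly 2.2 million joint transitions with 2785 local configurations, which is the saving the slide points to.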

Dynamic CRFs
Undirected conditionally-trained analogue to Dynamic Bayes Nets (DBNs)
[Figure: Factorial, Higher-Order, Hierarchical]

Training CRFs
Maximize log-likelihood of parameters given training data:
\[ L(\{\lambda_k\} \mid \{\langle \vec{o}, \vec{s} \rangle^{(i)}\}) \]
Log-likelihood gradient:
\[ \frac{\partial L}{\partial \lambda_k} = \sum_i C_k(\vec{s}^{(i)}, \vec{o}^{(i)}) - \sum_i \sum_{\vec{s}} P_{\{\lambda_k\}}(\vec{s} \mid \vec{o}^{(i)}) \, C_k(\vec{s}, \vec{o}^{(i)}) - \frac{\lambda_k}{\sigma^2} \]
\[ C_k(\vec{s}, \vec{o}) = \sum_t f_k(\vec{o}, t, s_{t-1}, s_t) \]
The three terms: feature count using correct labels, minus feature count using predicted labels, minus smoothing penalty.

Training DCRFs
Same objective and gradient (same form as general CRFs), with feature counts summed over cliques:
\[ C_k(\vec{s}, \vec{o}) = \sum_t \sum_{c \in \text{Cliques}} f_k(\vec{o}, t, c) \]
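The gradient above (empirical feature counts minus expected feature counts minus a smoothing penalty) can be illustrated on a toy problem. The sketch below computes the expectation by enumerating every label sequence, which is feasible only for tiny inputs. The two features, the label set, and the variance `sigma2` are hypothetical choices made for this illustration.

```python
import itertools
import math

LABELS = ("B", "I")  # hypothetical two-label set

def feature_counts(labels, obs):
    """C_k(s, o): counts of two hypothetical features, summed over positions t."""
    counts = [0.0, 0.0]
    for t, (label, word) in enumerate(zip(labels, obs)):
        if label == "B" and word[0].isupper():               # f_0: B on a capitalized word
            counts[0] += 1.0
        if t > 0 and labels[t - 1] == "B" and label == "I":  # f_1: transition B -> I
            counts[1] += 1.0
    return counts

def gradient(lambdas, data, sigma2=10.0):
    """Gradient of the penalized log-likelihood, by brute-force enumeration."""
    grad = [-lam / sigma2 for lam in lambdas]  # smoothing penalty
    for obs, gold in data:
        seqs = list(itertools.product(LABELS, repeat=len(obs)))
        counts = [feature_counts(s, obs) for s in seqs]
        unnorm = [math.exp(sum(l * c for l, c in zip(lambdas, cs))) for cs in counts]
        z = sum(unnorm)  # partition function Z for this observation sequence
        empirical = feature_counts(gold, obs)
        for k in range(len(lambdas)):
            expected = sum(u / z * cs[k] for u, cs in zip(unnorm, counts))
            grad[k] += empirical[k] - expected
    return grad

data = [(("Rockwell", "unit"), ("B", "I"))]
grad = gradient([0.0, 0.0], data)
print(grad)
```

With all weights at zero every label sequence is equally likely, so the expected counts are 0.5 and 0.25 and the gradient pushes both weights up. A practical implementation would replace the enumeration with forward-backward (linear chain) or belief propagation (DCRFs).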

Experiments
Simultaneous noun-phrase & part-of-speech tagging

Word:  Rockwell  International  Corp.  's  Tulsa  unit  said  it  signed
NP:    B         I              I      B   I      I     O     O   O
POS:   N         N              N      O   N      N     V     O   V

Word:  a  tentative  agreement  extending  its  contract  with  Boeing  Co.
NP:    B  I          I          O          B    I         O     B       I
POS:   O  J          N          V          O    N         O     N       N

Data from CoNLL Shared Task 2000 (Newswire)
- 8936 training instances
- 45 POS tags, 3 NP tags
- Features: word identity, capitalization, regexes, lexicons

Two experiments
- Compare exact and approximate inference
- Compare Noun Phrase Segmentation F1 of
  - Cascaded CRF+CRF
  - Cascaded Brill+CRF
  - Joint Factorial DCRFs

Comparison of Cascaded & Joint
             Cascaded CRF+CRF   Cascaded Brill+CRF   Joint FCRF
POS acc      98.28              N/A                  98.92
Joint acc    95.56              N/A                  96.48
NP F1        93.10              93.33                93.87
CRF+CRF and FCRF trained on 8936 CoNLL sentences.
Brill tagger trained on 30,000+ sentences, including CoNLL test set!
20% error

Accuracy by Training Set Size
[Figure] Joint prediction of part-of-speech and noun-phrase in newswire, matching accuracy with only 50% of the training data.

Jointly labeling distant mentions: Skip-chain CRFs [Sutton, McCallum, SRL 2004]
[Figure: linear chain over "Senator Joe Green said today . Green ran for ..."]
Dependency among similar, distant mentions ignored.
[Figure: the same sentence with skip-chain edges added]
14% reduction in error on most repeated field in email seminar announcements.
Inference: Tree reparameterized BP [Wainwright et al, 2002]
See also [Finkel, et al, 2005]

Joint co-reference among all pairs

Affinity Matrix CRF
Entity resolution, "object correspondence"
[Figure: mentions ". . . Mr Powell . . .", ". . . Powell . . .", ". . . she . . ." with a Y/N variable on each pair and edge values 45, 99, 11]
~25% reduction in error on co-reference of proper nouns in newswire.
Inference: Correlational clustering graph partitioning [Bansal, Blum, Chawla, 2002]
[McCallum, Wellner, IJCAI WS 2003, NIPS 2004]

Coreference Resolution
AKA "record linkage", "database record deduplication", "citation matching", "object correspondence", "identity uncertainty"

Input: news article, with named-entity "mentions" tagged
  Today Secretary of State Colin Powell met with . . . he . . . Condoleezza Rice . . . Mr Powell . . . she . . . Powell . . . President Bush . . . Rice . . . Bush . . .
Output: number of entities, N = 3
  #1: Secretary of State Colin Powell, he, Mr. Powell, Powell
  #2: Condoleezza Rice, she, Rice
  #3: President Bush, Bush

Inside the Traditional Solution: Pair-wise Affinity Metric
Mention (3) ". . . Mr Powell . . ." and Mention (4) ". . . Powell . . .": Y/N?

Feature                                           Fires?   Weight
Two words in common                               N         29
One word in common                                Y         13
"Normalized" mentions are string identical        Y         39
Capitalized word in common                        Y         17
> 50% character tri-gram overlap                  Y         19
< 25% character tri-gram overlap                  N        -34
In same sentence                                  Y          9
Within two sentences                              Y          8
Further than 3 sentences apart                    N         -1
"Hobbs Distance" < 3                              Y         11
Number of entities in between two mentions = 0    N         12
Number of entities in between two mentions > 4    N         -3
Font matches                                      Y          1
Default                                           Y        -19
OVERALL SCORE = 98 > threshold = 0

The Problem

[Figure: mentions ". . . Mr Powell . . .", ". . . Powell . . .", ". . . she . . ." with pairwise affinities 98, 104, and 11 and independent decisions Y, N, Y]
Pair-wise merging decisions are being made independently from each other.
Affinity measures are noisy and imperfect.
They should be made in relational dependence with each other.

A Generative Model Solution [Russell 2001], [Pasula et al 2002]
(Applied to citation matching, and object correspondence in vision)
[Figure: generative model; node labels: N, id, context words, surname, distance, fonts, gender, age, . . .]
Issues:
1) Generative model makes it difficult to use complex features.
2) Number of entities is hard-coded into the model structure, but we are supposed to predict the number of entities! Thus we must modify model structure during inference: MCMC.

A Markov Random Field for Co-reference (MRF) [McCallum & Wellner, 2003, ICML]
Make pair-wise merging decisions in dependent relation to each other by
- calculating a joint prob.
- including all edge weights
- adding dependence on consistent triangles.
\[ P(\vec{y} \mid \vec{x}) = \frac{1}{Z_{\vec{x}}} \exp\Big( \sum_{i,j} \sum_l \lambda_l f_l(x_i, x_j, y_{ij}) + \sum_{i,j,k} \lambda' f'(y_{ij}, y_{jk}, y_{ik}) \Big) \]
[Figure: mentions ". . . Mr Powell . . .", ". . . Powell . . .", ". . . she . . ." with edge weights 45, 30, 11 and a Y/N variable on each edge]
[Figure build: labeling Y, N, Y on the three edges; score shown as infinity]
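A small sketch can make the role of the triangle term concrete: a labeling of the three edges is scored by its edge weights, and any labeling that violates transitivity is ruled out. The edge weights are the ones in the figure, but the minus sign on 30 is an assumption (it is what makes the Y, N, N labeling score 64, as a later build of this slide shows), and which pair carries which weight is also assumed.

```python
import itertools
import math

# Edge weights from the figure. The sign of -30 and the pair assignment are assumptions.
EDGES = {
    ("Mr Powell", "Powell"): 45.0,
    ("Mr Powell", "she"): -30.0,
    ("Powell", "she"): 11.0,
}

def is_consistent(labels):
    """With three mentions, exactly two 'same entity' edges violates transitivity."""
    return sum(labels) != 2

def score(labels):
    """Edge weight counts positively if the edge is labeled Y, negatively if N."""
    if not is_consistent(labels):
        return -math.inf
    return sum(w if y else -w for w, y in zip(EDGES.values(), labels))

# Exhaustive search over the 2^3 labelings (True = Y, False = N).
best = max(itertools.product([True, False], repeat=3), key=score)
print(best, score(best))
```

The inconsistent labeling Y, N, Y gets a score of negative infinity, and the best consistent labeling is Y, N, N with score 64. Exhaustive search works only for a handful of mentions, which is why graph partitioning is used for inference.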

A Markov Random Field for Co-reference (MRF) [McCallum & Wellner, 2003]
[Figure build: labeling Y, N, N on the edges weighted 45, 30, 11; score 64]

Inference in these MRFs = Graph Partitioning [Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]
[Figure: mentions ". . . Mr Powell . . .", ". . . Powell . . .", ". . . Condoleezza Rice . . .", ". . . she . . ." with edge weights 45, 106, 30, 134, 11, 10]
\[ \log P(\vec{y} \mid \vec{x}) \propto \sum_{i,j} \sum_l \lambda_l f_l(x_i, x_j, y_{ij}) = \sum_{i,j \text{ within partitions}} w_{ij} + \sum_{i,j \text{ across partitions}} w'_{ij} \]
[Figure builds: one partitioning scores 22; another scores 314]

Co-reference Experimental Results [McCallum & Wellner, 2003]

Proper noun co-reference

DARPA ACE broadcast news transcripts, 117 stories
                            Partition F1   Pair F1
Single-link threshold       16%            18%
Best prev match [Morton]    83%            89%
MRFs                        88%            92%
Error reduction             30%            28%

DARPA MUC-6 newswire article corpus, 30 stories
                            Partition F1   Pair F1
Single-link threshold       11%            7%
Best prev match [Morton]    70%            76%
MRFs                        74%            80%
Error reduction             13%            17%

Joint Co-reference for Multiple Entity Types [Culotta & McCallum 2005]
[Figure: People mentions "Stuart Russell", "Stuart Russell", "S. Russel" and Organizations mentions "University of California at Berkeley", "Berkeley", "Berkeley", with Y/N co-reference variables between mentions]
Reduces error by 22%

Joint segmentation and co-reference
Extraction from and matching of research paper citations.
[Figure: two citation strings, each with observation o, segmentation s, and citation attributes c; co-reference decisions y; database field values p; World Knowledge]
  Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.
  Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.
35% reduction in co-reference error by using segmentation uncertainty.
6-14% reduction in segmentation error by using co-reference.
Inference: Sparse Generalized Belief Propagation [Pal, Sutton, McCallum, 2005]
[Wellner, McCallum, Peng, Hay, UAI 2004]; see also [Marthi, Milch, Russell, 2003]

Joint segmentation and co-reference
Joint IE and Coreference from Research Paper Citations
Textual citation mentions

(noisy, with duplicates)
Paper database, with fields, clean, duplicates collapsed

AUTHORS             TITLE      VENUE
Cowell, Dawid       Probab     Springer
Montemerlo, Thrun   FastSLAM   AAAI
Kjaerulff           Approxi    Technic

Citation Segmentation and Coreference
  Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, T. Smith (ed), Addison-Wesley, 1990.
  Brenda Laurel. Interface Agents: Metaphors with Character, in Smith, The Art of Human-Computr Interface Design, 355-366, 1990.
1) Segment citation fields
2) Resolve coreferent citations (Y ? N)
3) Form canonical database record, resolving conflicts:
   AUTHOR = Brenda Laurel
   TITLE = Interface Agents: Metaphors with Character
   PAGES = 355-366
   BOOKTITLE = The Art of Human-Computer Interface Design
   EDITOR = T. Smith
   PUBLISHER = Addison-Wesley
   YEAR = 1990
Perform 1), 2), and 3) jointly.

IE + Coreference Model
[Figure: CRF segmentation labels AUT AUT YR TITL TITL]

[Figure builds of the IE + Coreference Model:
- CRF segmentation s over the observed citation x ("J Besag 1986 On the ...")
- Citation mention attributes c (AUTHOR = J Besag, YEAR = 1986, TITLE = On the ...)
- Structure (x, s, c) for each citation mention: "Smyth , P Data mining ...", "Smyth . 2001 Data Mining ...", "J Besag 1986 On the ..."
- Binary coreference variables for each pair of mentions (y, n, n)
- Research paper entity attribute nodes (AUTHOR = P Smyth, YEAR = 2001, TITLE = Data Mining ...)]
Such a highly connected graph makes exact inference intractable, so ...

Approximate Inference 1
- Loopy Belief Propagation: messages passed between nodes
  [Figure: nodes v1 to v6 with messages m1(v2), m2(v3), m2(v1), m3(v2)]
- Generalized Belief Propagation: messages passed between regions
  [Figure: nodes v1 to v9 grouped into regions]
Here, a message is a conditional probability table passed among nodes. But message size grows exponentially with region size!

Approximate Inference 2
Iterated Conditional Modes (ICM) [Besag 1986]
Update one variable at a time while the others are held constant:
\[ v_6^{i+1} = \arg\max_{v_6^i} P(v_6^i \mid v \setminus v_6^i) \]
\[ v_5^{j+1} = \arg\max_{v_5^j} P(v_5^j \mid v \setminus v_5^j) \]
\[ v_4^{k+1} = \arg\max_{v_4^k} P(v_4^k \mid v \setminus v_4^k) \]
[Figure: nodes v1 to v6; shaded nodes = held constant]
But greedy, and easily falls into local optima.
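The ICM update can be sketched as coordinate-wise greedy search. The code below is a minimal illustration on a made-up three-variable model (the variables, domains, and scoring function are hypothetical). It maximizes an unnormalized joint score over one variable at a time, which selects the same value as maximizing the conditional P(v_i | v \ v_i).

```python
def icm(values, domains, score, max_sweeps=20):
    """Iterated Conditional Modes: re-set one variable at a time to its best
    value while all other variables are held constant."""
    values = dict(values)
    for _ in range(max_sweeps):
        changed = False
        for var in list(values):
            current = score(values)
            for val in domains[var]:
                candidate = {**values, var: val}
                candidate_score = score(candidate)
                if candidate_score > current:  # move only on strict improvement
                    values, current, changed = candidate, candidate_score, True
        if not changed:
            break  # a full sweep changed nothing: a (possibly local) optimum
    return values

def toy_score(v):
    """Hypothetical model: neighbors prefer to agree; a and c mildly prefer 1."""
    return 2 * (v["a"] == v["b"]) + 2 * (v["b"] == v["c"]) + (v["a"] == 1) + (v["c"] == 1)

domains = {"a": [0, 1], "b": [0, 1], "c": [0, 1]}
stuck = icm({"a": 0, "b": 0, "c": 0}, domains, toy_score)
better = icm({"a": 1, "b": 1, "c": 0}, domains, toy_score)
print(stuck, toy_score(stuck), better, toy_score(better))
```

Started from all zeros, the search stops at a score of 4 because no single change improves it, even though all ones scores 6. Started from (1, 1, 0) it reaches all ones. This is the greedy behavior noted above.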

Iterated Conditional Sampling or Sparse Belief Propagation
Instead of passing only the argmax, pass a sample of argmaxes of P(v_4^k | v \ v_4^k), e.g. an N-best list (the top N values).
[Figure: nodes v1 to v6]
Can use a "Generalized Version" of this: doing exact inference on a region of several nodes at once.
Here, a message grows only linearly with region size and N!

Inference by Sparse Generalized BP [Pal, Sutton, McCallum 2005]
[Figure: the IE + Coreference model over the citations "Smyth , P Data mining ...", "Smyth . 2001 Data Mining ...", "J Besag 1986 On the ..."]
- Exact inference on these linear-chain regions
- From each chain pass an N-best list into coreference
- Approximate inference by graph partitioning, integrating out uncertainty in samples of extraction
- Make scale to 1M citations with Canopies [McCallum, Nigam, Ungar 2000]

Inference: Sample = N-best List from CRF Segmentation
[Table: three candidate segmentations of "Laurel, B. Interface Agents: Metaphors with Character The Art of Human Computer Interface Design 1990" into Name, Title, Book Title, and Year]
When calculating similarity with another citation, have more opportunity to find correct, matching fields.
[Table: candidate Name and Title splits of "Laurel, B. Interface Agents: Metaphors with Character", compared against another citation (y ? n)]

Inference by Sparse Generalized BP

[Pal, Sutton, McCallum 2005]
[Figure: the IE + Coreference model over the citations "Smyth , P Data mining ...", "Smyth . 2001 Data Mining ...", "J Besag 1986 On the ...", with coreference variables y, n, n]
- Exact (exhaustive) inference over entity attributes
- Revisit exact inference on IE linear chain, now conditioned on entity attributes

Parameter Estimation: Piecewise Training [Sutton & McCallum 2005]
Divide-and-conquer parameter estimation

Component                     Training
IE Linear-chain               Exact MAP
Coref graph edge weights      MAP on individual edges
Entity attribute potentials   MAP, pseudo-likelihood

In all cases: climb MAP gradient with quasi-Newton method.

Results on 4 Sections of CiteSeer Citations
Coreference F1 performance

N         Reinforce   Face    Reason   Constraint
1         0.946       0.967   0.945    0.961
3         0.95        0.979   0.961    0.960
7         0.948       0.979   0.951    0.971
9         0.982       0.967   0.960    0.971
Optimal   0.995       0.992   0.994    0.988

Average error reduction is 35%.
"Optimal" makes best use of N-best list by using true labels.

Indicates that even more improvement can be obtained.

Joint segmentation and co-reference [Wellner, McCallum, Peng, Hay, UAI 2004]
Extraction from and matching of research paper citations.
35% reduction in co-reference error by using segmentation uncertainty.
6-14% reduction in segmentation error by using co-reference.
Inference: Sparse Belief Propagation [Pal, Sutton, McCallum, 2005]

Data
- 270 Wikipedia articles
- 1000 paragraphs
- 4700 relations
- 52 relation types: JobTitle, BirthDay, Friend, Sister, Husband, Employer, Cousin, Competition, Education, ...
- Targeted for density of relations: Bush/Kennedy/Manning/Coppola families and friends

Example text:
- George W. Bush ... his father George H. W. Bush ... his cousin John Prescott Ellis ...
- George H. W. Bush ... his sister Nancy Ellis Bush ...
- Nancy Ellis Bush ... her son John Prescott Ellis ...
Cousin = Father's Sister's Son
[Figure: George HW Bush and Nancy Ellis Bush are siblings; George W Bush (X) is the son of George HW Bush; John Prescott Ellis (Y) is the son of Nancy Ellis Bush; X and Y are cousins]

likely a cousin
John Kerry ... celebrated with Stuart Forbes ...

Name              Son
Rosemary Forbes   John Kerry
James Forbes      Stuart Forbes

Name              Sibling
Rosemary Forbes   James Forbes

[Figure: Rosemary Forbes and James Forbes are siblings; John Kerry is the son of Rosemary Forbes; Stuart Forbes is the son of James Forbes; John Kerry and Stuart Forbes: cousin]

Iterative DB Construction
Joseph P. Kennedy, Sr ... son John F. Kennedy with Rose Fitzgerald
[Figure: the sentence annotated with Son and Wife relations; a Name / Son table with rows (Joseph P. Kennedy, John F. Kennedy) and (Rose Fitzgerald, John F. Kennedy); other entries shown: Ronald Reagan, George W. Bush, (0.3)]
- Fill DB with first-pass CRF
- Use relational features with second-pass CRF

Results
               F1      Prec    Recall
ME             .5489   .6475   .4763
CRF            .5995   .7019   .5232
RCRF           .6100   .6799   .5531
RCRF .9        .6008   .7177   .5166
RCRF .5        .6136   .7095   .5406
RCRF Truth     .6791   .7553   .6169
RCRF Truth.5   .6363   .7343   .5614
ME = maximum entropy; CRF = conditional random field; RCRF = CRF + mined features

Examples of Discovered Relational Features
- Mother: Father -> Wife
- Cousin: Mother -> Husband -> Nephew
- Friend: Education -> Student
- Education: Father -> Education
- Boss: Boss -> Son
- MemberOf: Grandfather -> MemberOf
- Competition: PoliticalParty -> Member -> Competition

Sometimes graph partitioning with pairwise comparisons is not enough.
- Entities have multiple attributes (name, email, institution, location); need to measure "compatibility" among them.
- Having 2 "given names" is common, but not 4. E.g. Howard M. Dean / Martin, Dean / Howard Martin
- Need to measure size of the clusters of mentions.
- Does there exist a pair of lastname strings that differ > 5?
We need measures on hypothesized "entities".
We need First-order logic.

Pairwise Co-reference Features
Mentions: Howard Dean, Dean Martin, Howard Martin
- SamePerson(Dean Martin, Howard Dean)?
- SamePerson(Howard Dean, Howard Martin)?
- SamePerson(Dean Martin, Howard Martin)?
Pairwise Features
- StringMatch(x1,x2)
- EditDistance(x1,x2)

Toward High-Order Representations: Identity Uncertainty
Mentions: Howard Dean, Dean Martin, Howard Martin
- SamePerson(Howard Dean, Howard Martin, Dean Martin)?
First-Order Features
- maximum edit distance between any pair is < 0.5
- number of distinct names < 3
- All have same gender
- There exists a number mismatch
- Only pronouns

Weighted Logic
This model brings together the two main (long-separated) branches of Artificial Intelligence:
- Logic
- Probability

Toward High-Order Representations: Identity Uncertainty
Combinatorial Explosion!
Mentions: Dean Martin, Howard Dean, Howard Martin, Dino, Howie Martin, ...
- SamePerson(x1,x2)
- SamePerson(x1,x2,x3)
- SamePerson(x1,x2,x3,x4)
- SamePerson(x1,x2,x3,x4,x5)
- SamePerson(x1,x2,x3,x4,x5,x6)
- ...
This space complexity is common in first-order probabilistic models.

Markov Logic as a Template to Construct a Markov Network using First-Order Logic
[Richardson & Domingos 2005] [Paskin & Russell 2002]
- Grounding the Markov network requires space O(n^r)
  - n = number of constants
  - r = highest clause arity
How can we perform inference and learning in models that cannot be grounded?

Inference in First-Order Models: SAT Solvers
- Weighted SAT solvers [Kautz et al 1997]
  - Requires complete grounding of network
- LazySAT [Singla & Domingos 2006]
  - Saves memory by only storing clauses that may become unsatisfied
  - Initialization still requires time O(n^r) to visit all ground clauses
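To give a feel for the n^r grounding cost, the following sketch counts ground atoms for a clause of arity r over n constants and checks the count by brute-force enumeration on a tiny case. The function name and the sample sizes are illustrative assumptions.

```python
from itertools import product

def num_groundings(n_constants: int, arity: int) -> int:
    """Number of ways to fill `arity` argument slots from `n_constants` constants."""
    return n_constants ** arity

# Brute-force check on a tiny case: every ordered pair of three mentions.
constants = ["Dean Martin", "Howard Dean", "Howard Martin"]
ground_atoms = list(product(constants, repeat=2))
assert len(ground_atoms) == num_groundings(len(constants), 2)

# Growth with the number of constants and the clause arity (sizes are hypothetical).
for n in (100, 10_000, 1_000_000):
    print(n, [num_groundings(n, r) for r in (2, 3, 4)])
```

Even at arity 2 the count is quadratic in the number of mentions, so any method that must visit every ground clause once pays this cost before inference starts.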

Inference in First-Order Models: MCMC
- Gibbs Sampling
  - Difficult to move between high probability configurations by changing single variables
  - Although, consider MC-SAT! [Poon & Domingos 06]
- An alternative: Metropolis-Hastings sampling [Culotta & McCallum 2006]
  - Can be extended to partial configurations
    - Only instantiate relevant variables
  - Key advantage: can design arbitrary smart jumps
  - Successfully used in BLOG models [Milch et al 2005]
  - 2 parts: proposal distribution, acceptance distribution.

Model
- First-order features f_w: SamePerson(x)
- Z_X: Sum over all possible configurations!
[Figure: mentions shown are Howard Dean, Governor, Howie; Dean Martin, Dino; Howard Martin, Howie Martin.]

Inference with Metropolis-Hastings
- p(y')/p(y): likelihood ratio
  - Ratio of P(Y|X)
  - Z_X cancels!
- q(y'|y): proposal distribution
  - probability of proposing move y → y'
What is nice about this?
- Can design arbitrary smart proposal distributions

Proposal Distribution
[Figures: three slides, each showing a current configuration y and a proposed configuration y' over the mentions Dean Martin, Dino, Howard Martin, Howie Martin.]

Feature List
- Exact Match/Mis-Match
  - Entity type
  - Gender (requires lexicon)
  - Number
  - Case
  - Entity Text
  - Entity Head
  - Entity Modifier/Numerical Modifier
  - Sentence
- WordNet: hypernym, synonym, antonym
- Other
  - Relative pronoun agreement
  - Sentence distance in bins
  - Partial text overlaps
- Quantification
  - Existential
    - a gender mismatch
    - three different first names
  - Universal
    - NER type match
    - named mentions str identical
  - Filters (limit quantifiers to mention type)
    - None
    - Pronoun
    - Nominal (description)
    - Proper (name)

Learning the Likelihood Ratio
Can't normalize over all possible y's.
...temporarily consider the following...
ad hoc training: Maximize p(b|y,x), where b in {TRUE, FALSE}
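Because the normalizer cancels in the Metropolis-Hastings ratio, a sampler only needs unnormalized scores of configurations. The sketch below is a toy illustration under stated assumptions: the scoring weights, the label-vector encoding of a clustering, and the single-mention relabeling proposal are all hypothetical and are not the proposal distribution used in the original work.

```python
import math
import random
from itertools import combinations

MENTIONS = ["Dean Martin", "Dino", "Howard Martin", "Howie Martin"]

def log_score(y):
    """Unnormalized log-score of a label vector y (toy weights, assumed)."""
    total = 0.0
    for i, j in combinations(range(len(y)), 2):
        if y[i] == y[j]:  # only pairs placed in the same cluster contribute
            shared = set(MENTIONS[i].split()) & set(MENTIONS[j].split())
            total += 1.0 if shared else -1.0
    return total

def propose(y, rng):
    """Relabel one uniformly chosen mention with a uniformly chosen label."""
    i = rng.randrange(len(y))
    y_new = list(y)
    y_new[i] = rng.randrange(len(y))
    # The proposal is symmetric, so the forward and backward terms are equal
    # and only their difference (zero) matters in the acceptance ratio.
    log_q = -2.0 * math.log(len(y))
    return tuple(y_new), log_q, log_q

def mh_step(y, log_score, propose, rng):
    """One Metropolis-Hastings step; Z_X never appears."""
    y_new, log_q_fwd, log_q_bwd = propose(y, rng)
    # log of [p(y')/p(y)] * [q(y|y')/q(y'|y)]
    log_accept = (log_score(y_new) - log_score(y)) + (log_q_bwd - log_q_fwd)
    if rng.random() < math.exp(min(0.0, log_accept)):
        return y_new
    return y

rng = random.Random(0)
y = tuple(range(len(MENTIONS)))  # start with every mention in its own cluster
for _ in range(1000):
    y = mh_step(y, log_score, propose, rng)
```

A smarter proposal, such as merging or splitting whole clusters, would plug into the same `mh_step` as long as it returns its forward and backward log-probabilities.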

Error Driven Training
Motivation: where to get training examples?
- Generate all possible partial clusters: intractable
- Sample uniformly?
- Sample from clusters visited during inference?
- Error driven
  - Focus learning on examples that need it most

Error-Driven Training Results
                   B-cubed F1
Non-Error-driven   69
Error-driven       72

Learning the Likelihood Ratio
Given a pair of configurations, learn to rank the better configuration higher.
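One simple way to learn such a ranking is a perceptron-style update that moves the weights toward the better configuration's features whenever the two are mis-ranked. This is a generic technique swapped in for illustration; the source does not state which update rule was used, and the feature names, learning rate, and margin below are assumptions.

```python
def score(weights, feats):
    """Linear score of a configuration given its feature dict."""
    return sum(weights.get(k, 0.0) * v for k, v in feats.items())

def rank_update(weights, better, worse, lr=1.0, margin=1.0):
    """Update weights only when `better` does not outscore `worse` by `margin`."""
    if score(weights, better) - score(weights, worse) < margin:
        for k in set(better) | set(worse):
            delta = better.get(k, 0.0) - worse.get(k, 0.0)
            weights[k] = weights.get(k, 0.0) + lr * delta
    return weights

# Hypothetical features for two candidate clusters of the same mentions.
better = {"all_same_gender": 1.0, "cluster_size": 3.0}
worse = {"exists_gender_mismatch": 1.0, "cluster_size": 3.0}

weights = rank_update({}, better, worse)
```

Called only on pairs where the current model makes a mistake during inference, the same update also gives an error-driven training loop.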

Rank-Based Training
Instead of training
  [Powell, Mr. Powell, he] --> TRUE
  [Powell, Mr. Powell, she] --> FALSE
...Rather...
  [Powell, Mr. Powell, he] > [Powell, Mr. Powell, she]
  [Powell, Mr. Powell, he] > [Powell, Mr. Powell]
  [Powell, Mr. Powell, George, he] > [Powell, Mr. Powell, George, she]

Rank-Based Training Results
                                  B-cubed F1
Non-Ranked-Based (Error-driven)   72
Rank-Based (Error-driven)         79
Previous best in literature       68
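The results above are reported in B-cubed F1. As a reminder of what the metric measures, here is a minimal sketch of the standard B-cubed computation, which averages per-mention precision and recall of the predicted cluster against the gold cluster. The dict-based input format and the example mentions are assumptions made for the illustration.

```python
def b_cubed(predicted, gold):
    """B-cubed precision, recall and F1.

    `predicted` and `gold` map each mention id to a cluster id.
    """
    mentions = list(gold)
    precision = recall = 0.0
    for m in mentions:
        pred_cluster = {x for x in mentions if predicted[x] == predicted[m]}
        gold_cluster = {x for x in mentions if gold[x] == gold[m]}
        overlap = len(pred_cluster & gold_cluster)  # always >= 1 (m itself)
        precision += overlap / len(pred_cluster)
        recall += overlap / len(gold_cluster)
    precision /= len(mentions)
    recall /= len(mentions)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Gold: "she" is a different entity; the prediction merges everything.
gold = {"Powell": "A", "Mr. Powell": "A", "he": "A", "she": "B"}
predicted = {"Powell": 0, "Mr. Powell": 0, "he": 0, "she": 0}
p, r, f1 = b_cubed(predicted, gold)
```

Over-merging keeps recall perfect but lowers precision, which is the kind of error the gender-mismatch ranking examples are meant to penalize.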

Experimental Results
ACE 2005 newswire coreference
- All entity types
- Proper-, common-, pro- nouns
- 443 documents

                                                                  B-cubed
Previous best results, 1997:                                      65
Previous best results, 2002:                                      67
Previous best results, 2005:                                      68
Our new results [Culotta, Wick, Hall, McCallum, NAACL/HLT 2007]   79

Learning the Proposal Distribution by Tying Parameters
- Proposal distribution q(y'|y): cheap approximation to p(y)
- Reuse subset of parameters in p(y)
- E.g. in identity uncertainty model
  - Sample two clusters
  - Stochastic agglomerative clustering to propose new configuration

Scalability
- Currently running on >2 million author name mentions.
- Canopies
- Schedule of different proposal distributions

Weighted Logic Summary
- Bring together
  - Logic
  - Probability
- Inference and learning are incredibly difficult.
- Our recommendation:
  - MCMC for inference
  - Error-driven, rank-based for training

Outline
- The need for joint inference
- Examples of joint inference
  - Joint Labeling of Cascaded Sequences (Belief Propagation)

  - Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  - Joint Co-reference Resolution (Graph Partitioning)
  - Joint Segmentation and Co-ref (Sparse BP)
  - Joint Relation Extraction and Data Mining (ICM)
  - Probability + First-order Logic, Co-ref on Entities (MCMC)

End of Part 2

Sparse Belief Propagation
Beam search in arbitrary graphical models
Different pruning strategy based on variational inference.
1. Compute messages
2. Form marginal b(x_t)
3. Retain 1 - ε of mass
[Figure: node x_t receives message m_st from x_s and message m_vt from x_v; the marginal b(x_t) is shown before and after pruning.]

Sparse Belief Propagation used during training [Pal, Sutton, McCallum, ICASSP 2006]
Like beam search, but modifications from Variational Methods of Inference.
[Figure: accuracy vs. training time for No beam, Sparse BP, and Traditional beam.]

Context
[Figure: pipeline. Document collection → Spider → Filter → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support). Annotations: Joint inference among detailed steps; Leveraging Text in Social Network Analysis.]

Outline
- Social Network Analysis with Topic Models
  - Role Discovery (Author-Recipient-Topic Model, ART)
  - Group Discovery (Group-Topic Model, GT)
- Enhanced Topic Models
  - Correlations among Topics (Pachinko Allocation, PAM)
  - Time Localized Topics (Topics-over-Time Model, TOT)
  - Markov Dependencies in Topics (Topical N-Grams Model, TNG)
- Bibliometric Impact Measures enabled by Topics
- Multi-Conditional Mixtures
