Automatic Extraction of Gene and Protein Synonyms from ...

ISMB 2003 presentation

Extracting Synonymous Gene and Protein Terms from Biological Literature
Hong Yu and Eugene Agichtein
Dept. Computer Science, Columbia University, New York, USA
{hongyu, eugene} 212-939-7028

Significance and Introduction
  • Genes and proteins are often associated with multiple names, e.g. Apo3, DR3, TRAMP, LARD, and lymphocyte associated receptor of death
  • Authors often use different synonyms
  • Information extraction benefits from identifying those synonyms
  • Synonym knowledge sources are not complete
  • Goal: developing automated approaches for identifying gene/protein synonyms from literature

Background: synonym identification
  • Semantically related words
      • Distributional similarity [Lin 98] [Li and Abe 98] [Dagan et al 95]
      • Example: beer and wine occur with similar words (drink, people, bottle, and make)
  • Mapping abbreviations to full forms
      • Example: map LARD to lymphocyte associated receptor of death
      • [Bowden et al. 98] [Hisamitsu and Niwa 98] [Liu and Friedman 03] [Pakhomov 02] [Park and Byrd 01] [Schwartz and Hearst 03] [Yoshida et al. 00] [Yu et al. 02]

Methods for detecting biomedical multiword synonyms
  • Sharing a word(s) [Hole 00]
      • cerebrospinal fluid / cerebrospinal fluid protein assay
  • Information retrieval approach
      • Trigram matching algorithm [Wilbur and Kim 01]
      • Vector space model
      • cerebrospinal fluid -> cer, ere, ..., uid
      • cerebrospinal fluid protein assay -> cer, ere, ..., say

Background: synonym identification
GPE [Yu et al 02]
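For the trigram-matching approach to multiword terms, a minimal sketch is given here. It breaks each term into overlapping character trigrams, word by word, and compares the resulting count vectors with cosine similarity. The function names and the plain, unweighted counts are assumptions made for illustration; this is not the weighting scheme of [Wilbur and Kim 01].

```python
import math
from collections import Counter


def char_trigrams(term):
    """Return a bag of character trigrams, taken word by word."""
    grams = Counter()
    for word in term.lower().split():
        for i in range(len(word) - 2):
            grams[word[i:i + 3]] += 1
    return grams


def cosine(a, b):
    """Cosine similarity between two sparse trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


short = char_trigrams("cerebrospinal fluid")
longer = char_trigrams("cerebrospinal fluid protein assay")
unrelated = char_trigrams("bow lute")

print(sorted(short))
print(round(cosine(short, longer), 3))
print(round(cosine(short, unrelated), 3))
```

Terms that share most of their trigrams score close to 1, while terms with no trigram in common score 0, so a threshold on the cosine could serve as a candidate filter.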

A rule-based approach for detecting synonymous gene/protein terms
  • Manually recognize the patterns authors use to list synonyms
  • Extract synonym candidates, then apply heuristics to filter out unrelated terms
      • Synonym list: Apo3/TRAMP/WSL/DR3/LARD
      • Unrelated terms: ng/kg/min
  • Advantages and disadvantages
      • High precision (90%)
      • Recall might be low; expensive to build

Background: Machine learning
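The rule-based idea of extracting slash-separated lists and then filtering out unrelated ones can be sketched as follows. The regular expression, the unit list, and the function name are hypothetical; the actual GPE patterns and heuristics are not given on the slides.

```python
import re

# Hypothetical filter list: tokens that are measurement units, not gene names.
UNIT_TOKENS = {"ng", "kg", "mg", "ml", "min", "h", "day"}

# Two or more slash-separated alphanumeric tokens, e.g. Apo3/TRAMP/WSL.
SLASH_LIST = re.compile(r"\b\w+(?:/\w+)+\b")


def slash_synonym_candidates(sentence):
    """Return lists of terms that an author wrote as A/B/C."""
    candidates = []
    for match in SLASH_LIST.finditer(sentence):
        terms = match.group(0).split("/")
        # Heuristic filter: drop lists made up only of unit tokens.
        if all(term.lower() in UNIT_TOKENS for term in terms):
            continue
        candidates.append(terms)
    return candidates


text = "Apo3/TRAMP/WSL/DR3/LARD was infused at 5 ng/kg/min."
result = slash_synonym_candidates(text)
print(result)
```

A hand-written filter like this is what makes the approach precise but costly to build: each new source of false positives needs its own rule.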

Machine learning reduces manual effort by automatically acquiring rules from data
  • Unsupervised and supervised
  • Semi-supervised
      • Bootstrapping [Hearst 92, Yarowsky 95] [Agichtein and Gravano 00]
          • Hyponym detection [Hearst 92]
          • "The bow lute, such as the Bambara ndang, is plucked and has an individual curved neck for each string." -> A Bambara ndang is a kind of bow lute
      • Co-training [Blum and Mitchell 98]

Method: Outline
  • Machine learning
      • Unsupervised: Similarity [Dagan et al 95]
      • Semi-supervised: Bootstrapping, SNOWBALL [Agichtein and Gravano 02]
      • Supervised: Support Vector Machine
  • Comparison between machine learning and GPE
  • Combined approach

Method: Unsupervised
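The unsupervised entry in the outline, contextual similarity, can be illustrated with a small sketch. It scores a term against the words to its left and right using mutual information, I(t, w) = log2(N freq(t, w) / (d freq(t) freq(w))), and compares two terms by the ratio of summed minima to summed maxima over a lexicon. Reading N as the corpus size and d as the co-occurrence window size follows [Dagan et al 95] and is an assumption here, as are the toy counts, the clipping of negative values to zero, and all function names.

```python
import math
from collections import Counter


def mutual_information(pair_freq, left_freq, right_freq, n, d):
    """log2(N * freq(a, b) / (d * freq(a) * freq(b))), clipped at 0 (assumption)."""
    if pair_freq == 0:
        return 0.0
    value = math.log2(n * pair_freq / (d * left_freq * right_freq))
    return max(value, 0.0)


def similarity(t1, t2, pair_counts, word_counts, n, d):
    """Ratio of summed min to summed max MI over left and right contexts."""
    def mi(left, right):
        return mutual_information(pair_counts[(left, right)],
                                  word_counts[left], word_counts[right], n, d)

    numerator = denominator = 0.0
    for w in word_counts:  # the lexicon: every counted word except the terms
        if w in (t1, t2):
            continue
        left_1, left_2 = mi(w, t1), mi(w, t2)      # w occurs before the term
        right_1, right_2 = mi(t1, w), mi(t2, w)    # w occurs after the term
        numerator += min(left_1, left_2) + min(right_1, right_2)
        denominator += max(left_1, left_2) + max(right_1, right_2)
    return numerator / denominator if denominator else 0.0


# Toy counts, invented for illustration only.
N_TOKENS, WINDOW = 1000, 2
word_counts = Counter({"Apo3": 4, "DR3": 4, "kinase": 5,
                       "receptor": 6, "binds": 6, "the": 40})
pair_counts = Counter({("the", "Apo3"): 3, ("the", "DR3"): 3,
                       ("the", "kinase"): 4, ("Apo3", "receptor"): 3,
                       ("DR3", "receptor"): 2, ("DR3", "binds"): 1,
                       ("kinase", "binds"): 3})

sim_synonyms = similarity("Apo3", "DR3", pair_counts, word_counts, N_TOKENS, WINDOW)
sim_other = similarity("Apo3", "kinase", pair_counts, word_counts, N_TOKENS, WINDOW)
print(round(sim_synonyms, 3), round(sim_other, 3))
```

In this toy data Apo3 and DR3 share more of their left and right contexts than Apo3 and kinase do, so the first score comes out higher, which is the behaviour the hypothesis about synonyms predicts.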

Contextual similarity [Dagan et al 95]
  • Hypothesis: synonyms have similar surrounding words
  • Mutual information:

$$I(t, w) = \log_2 \frac{N \cdot \mathrm{freq}(t, w)}{d \cdot \mathrm{freq}(t) \cdot \mathrm{freq}(w)}$$

  • Similarity:

$$\mathrm{sim}(t_1, t_2) = \frac{\sum_{w \in \mathrm{lexicon}} \left[ \min(I(w, t_1), I(w, t_2)) + \min(I(t_1, w), I(t_2, w)) \right]}{\sum_{w \in \mathrm{lexicon}} \left[ \max(I(w, t_1), I(w, t_2)) + \max(I(t_1, w), I(t_2, w)) \right]}$$

Methods: Semi-supervised
SNOWBALL [Agichtein and Gravano 02]
  • Bootstrapping: starts with a small set of user-provided seed tuples for the relation, then automatically generates and evaluates patterns for extracting new tuples
  • Example tuples: {Apo3, DR3}, {LARD, Apo3}, {DR3, LARD}
  • Example occurrences: "Apo3, also known as DR3"; "DR3, also called LARD"
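The bootstrapping loop can be sketched in a heavily simplified form. Known pairs are located in tagged sentences, the text between the two terms becomes a pattern, and every other pair joined by a learned pattern is added to the known set. SNOWBALL itself represents left, middle, and right contexts as weighted term vectors and scores patterns by confidence; the exact-match middle context, the `<g>` markup, the toy sentences, and the function names below are assumptions for illustration.

```python
import re
from collections import Counter

TAGGED = re.compile(r"<g>(.+?)</g>")  # hypothetical markup from a name tagger


def term_pairs_with_middle(sentence):
    """Yield (term1, middle_text, term2) for adjacent tagged terms."""
    matches = list(TAGGED.finditer(sentence))
    for first, second in zip(matches, matches[1:]):
        middle = sentence[first.end():second.start()].strip()
        yield first.group(1), middle, second.group(1)


def bootstrap(sentences, seeds, rounds=2, min_support=1):
    """Alternate between learning middle-context patterns and new pairs."""
    known = {frozenset(pair) for pair in seeds}
    patterns = set()
    for _ in range(rounds):
        # Step 1: contexts in which known pairs occur become patterns.
        support = Counter()
        for sentence in sentences:
            for t1, middle, t2 in term_pairs_with_middle(sentence):
                if frozenset((t1, t2)) in known and middle:
                    support[middle] += 1
        patterns |= {m for m, count in support.items() if count >= min_support}
        # Step 2: any term pair joined by a learned pattern becomes known.
        for sentence in sentences:
            for t1, middle, t2 in term_pairs_with_middle(sentence):
                if middle in patterns and t1 != t2:
                    known.add(frozenset((t1, t2)))
    return patterns, known


# Toy sentences, invented for illustration only.
sentences = [
    "<g>Apo3</g>, also known as <g>DR3</g>, is discussed here.",
    "<g>DR3</g>, also called <g>LARD</g>, is discussed here.",
    "<g>TRAMP</g>, also known as <g>WSL</g>, is discussed here.",
]
seeds = [("Apo3", "DR3"), ("DR3", "LARD")]

patterns, known = bootstrap(sentences, seeds)
print(sorted(patterns))
print(sorted(sorted(pair) for pair in known))
```

The pair {TRAMP, WSL} is not among the seeds; it is picked up only because it appears in a context learned from a seed pair.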

Patterns: "<term 1>, also called <term 2>"; "<term 1>, also known as <term 2>"

Method: Supervised
Support Vector Machine
  • State-of-the-art text classification method
      • SVMlight
  • Training sets: the same sets of positive and negative tuples as SNOWBALL
  • Features: the same terms and term weights used by SNOWBALL
  • Kernel function: radial basis function (RBF) kernel

Methods: Combined
  • Rationale
      • Machine-learning approaches increase recall
      • The manual rule-based approach GPE has high precision with lower recall
      • Combining them should boost both recall and precision
  • Method
      • Assume each system is an independent predictor
      • Prob(pair is correct) = 1 - Prob(all systems extracted it incorrectly)

Evaluation: data
Data
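For the combined approach, the independence assumption turns "1 minus the probability that all systems extracted incorrectly" into one minus a product of per-system error probabilities. The sketch below assumes that each system's confidence score can be read directly as its probability of being correct; that reading, the function name, and the example scores are hypothetical.

```python
def combined_confidence(scores):
    """Probability that at least one system is right, assuming independence."""
    all_wrong = 1.0
    for p in scores:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each score must be a probability in [0, 1]")
        all_wrong *= 1.0 - p
    return 1.0 - all_wrong


# Hypothetical per-system confidences for one candidate synonym pair.
pair_scores = {"Snowball": 0.6, "SVM": 0.5, "GPE": 0.9}
combined = combined_confidence(pair_scores.values())
print(round(combined, 4))
```

Under this rule the combined score is never lower than the best individual score, which is why combining systems can raise recall at a given confidence threshold.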

  • GeneWays corpora [Friedman et al 01]
      • 52,000 full-text journal articles
      • Science, Nature, Cell, EMBO, Cell Biology, PNAS, Journal of Biochemistry
  • Preprocessing
      • Gene/protein named-entity tagging: Abgene [Tanabe and Wilbur 02]
      • Segmentation: SentenceSplitter
  • Training and testing
      • 20,000 articles for training: tuning SNOWBALL parameters such as context window, etc.
      • 32,000 articles for testing

Evaluation: metrics
  • Estimating precision
      • Randomly select 20 synonyms with confidence scores in the ranges (0.0-0.1, 0.1-0.2, ..., 0.9-1.0)
      • Biological experts judged the correctness of synonym pairs
  • Estimating recall
      • SWISSPROT gold standard
      • 989 pairs of SWISSPROT synonyms co-appear in at least one sentence in the test set
      • Biological experts judged that 588 pairs were indeed synonyms
      • Example sentence: "... and cdc47, cdc21, and mis5 form another complex, which relatively weakly associates with mcm2"

Results
  • Patterns SNOWBALL found
      [Table: columns Conf, Left, Middle, Right; confidences 0.75, 0.54, 0.47; surviving pattern terms <( 0.55> and <( 0.54>]
  • Of 148 evaluated synonym pairs, 62 (42%) were not listed as synonyms in SWISSPROT

Results
  [Figure: recall versus confidence score for Snowball, SVM, Similarity, GPE, and Combined]

Results
  [Figure: precision versus recall for Snowball, SVM, GPE, and Combined]

Results: System performance

  System      Time
  Tagging     35 min
  Similarity  7 h
  Snowball    40 min
  SVM         2 h
  GPE         1.5 h

Conclusions
  • Extraction techniques can be used as a valuable supplement to resources such as SWISSPROT
  • Extraction of synonym relations can be automated through machine-learning approaches
  • SNOWBALL can be applied successfully for recognizing the patterns
