Language Identification and Part-of-Speech Tagging

Keren Solodkin
Based on a paper by Sarah Schulz and Mareike Keller
Digital Humanities Seminar 2016

Plan
• Introduction and Related Work
• Training Data
• Processing of Mixed Text
• Results
• Tools for Digital Humanities
• Conclusion and Future Work

Introduction: Code Switching
• Two or more linguistic varieties in a single conversation
• Highly frequent in spoken language and in social media
• Can also be observed in medieval writing
• Historical mixed text is an unused source of information

Example
[Slide image not recovered]

Introduction: The Project
• Automatic language identification (LID) and POS tagging
• Mixed Latin-Middle English text
• Make tools available to Humanities scholars
• Analysis of code-switching rules within nominal phrases
• Historical multilingualism research
• Computational linguistics

Related Work: LID
• Lyu and Lyu (2008): Mandarin-Taiwanese
• Solorio and Liu (2008): Spanish-English
• Yeong and Tan (2011): Malay-English

Related Work: POS Tagging
• Solorio and Liu (2008)
• Rodrigues and Kübler (2013)
• Jamatia et al. (2015)

Training Data
• Macaronic sermons (Horner, 2006)
• Mixed Latin-Middle English text
• Annotated with language and part-of-speech information
• The annotated corpus comprises about 3000 tokens
• 159 sentences, average length of 19.4 tokens

Training Data
Table 1: Labels annotated for LID, along with an explanation for each label and its occurrence in percent.

Training Data
Table 2: Labels annotated for POS tagging, along with an explanation for each label and its occurrence in percent.

Processing of Mixed Text
• Two models:
  • POS tagging builds upon the results of the LID
  • POS tagging and LID do not inform each other
• LID is a prerequisite step for any further processing of mixed text
• LID needs to be solved with high accuracy

Processing of Mixed Text: LID
• Solorio and Liu (2008)
• No available lemmatizer for Middle English
• Include POS-informed word lists for both languages
  • Middle English: Penn Parsed Corpora of Historical English
  • Latin: the Universal Dependencies treebank
• In case a word is found in one of the lists, its POS is added

Processing of Mixed Text: CRF Classifiers
• Conditional Random Fields
• Take context into account
• Set of feature functions with weights
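For reference, "a set of feature functions with weights" can be written out in the standard textbook form of a linear-chain CRF. This formula is not taken from the slides; the symbols f_k (feature functions), \lambda_k (their weights), T (sentence length) and Z(x) (normalizer) are the usual notation and are assumptions here.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)
```

Each f_k may look at the whole token sequence x and at the previous label, which is how the model takes context into account.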

Processing of Mixed Text: LID
• CRF classifiers are known to be successful for sequence labeling tasks
• Latin is characterized by a relatively restricted set of suffixes
• A context window of 5 tokens was used on all features

Processing of Mixed Text: LID
Feature functions:
1. Surface form
2. POS tag from the monolingual Latin tagger
3. POS tag from the monolingual Middle English tagger
4. POS from Middle English word list
5. POS from Latin word list
6. Character-unigram prefix
7. Character-bigram prefix
8. Character-trigram prefix
9. Character-unigram suffix
10. Character-bigram suffix
11. Character-trigram suffix
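The eleven features above could be extracted roughly as follows. This is a minimal sketch, not the authors' code: the function names, the "NONE" and "PAD" placeholders, the tag names in the usage example, and the reading of "a context window of 5 tokens" as the current token plus two on each side are all assumptions.

```python
from typing import Dict, List, Mapping, Sequence


def base_features(token: str, latin_tag: str, english_tag: str,
                  english_words: Mapping[str, str],
                  latin_words: Mapping[str, str]) -> Dict[str, str]:
    """Features 1-11 for a single token (names are hypothetical)."""
    key = token.lower()
    feats = {
        "form": token,                                    # 1. surface form
        "pos_latin": latin_tag,                           # 2. Latin tagger output
        "pos_english": english_tag,                       # 3. Middle English tagger output
        "list_english": english_words.get(key, "NONE"),   # 4. word-list POS (Middle English)
        "list_latin": latin_words.get(key, "NONE"),       # 5. word-list POS (Latin)
    }
    for n in (1, 2, 3):
        feats[f"prefix{n}"] = token[:n]    # 6-8. character n-gram prefixes
        feats[f"suffix{n}"] = token[-n:]   # 9-11. character n-gram suffixes
    return feats


def window_features(tokens: Sequence[str], latin_tags: Sequence[str],
                    english_tags: Sequence[str],
                    english_words: Mapping[str, str],
                    latin_words: Mapping[str, str],
                    window: int = 5) -> List[Dict[str, str]]:
    """Copy every base feature from each position in a symmetric window."""
    base = [base_features(t, la, en, english_words, latin_words)
            for t, la, en in zip(tokens, latin_tags, english_tags)]
    half = window // 2  # assumption: 5 tokens = current token +/- 2
    rows = []
    for i in range(len(base)):
        row: Dict[str, str] = {}
        for offset in range(-half, half + 1):
            j = i + offset
            if 0 <= j < len(base):
                for name, value in base[j].items():
                    row[f"{offset:+d}:{name}"] = value
            else:
                row[f"{offset:+d}:pad"] = "PAD"  # position outside the sentence
        rows.append(row)
    return rows


# Usage with made-up tokens, tags and word lists.
english_words = {"the": "DET", "kyng": "NOUN"}
latin_words = {"et": "CONJ"}
rows = window_features(["et", "the", "kyng"],
                       ["CONJ", "FW", "FW"],
                       ["FW", "DET", "NOUN"],
                       english_words, latin_words)
print(rows[1]["-1:list_latin"], rows[1]["+1:suffix2"])
```

Dictionaries of this shape are what common CRF toolkits accept as per-token input; which toolkit the authors used is not stated in the slides.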

Processing of Mixed Text: POS Tagging
• For POS tagging, the same features are used
• Information generated by the LID system is added (feature 12a)
• Performance is also evaluated with the gold LID label as a feature (feature 12b)
• Differences in the quality of LID influence the POS tagging quality

Processing of Mixed Text: POS Tagging
Features (continued):
12a. LID label predicted by the LID system
12b. Gold LID label manually annotated for our corpus

Results
• The evaluation was a 10-fold cross-validation
  • 90% for training
  • 10% for testing
• The reported results are averaged over all folds
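A minimal sketch of such a 10-fold split is shown here. The slides do not say whether folds were built over sentences or tokens, nor how the data was shuffled; splitting whole sentences with a fixed random seed is an assumption, and `k_fold_splits` is a hypothetical helper name.

```python
import random
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def k_fold_splits(items: Sequence[T], k: int = 10,
                  seed: int = 0) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (train, test) pairs; every item is held out exactly once."""
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)        # fixed seed for repeatability
    folds = [order[i::k] for i in range(k)]   # k folds of near-equal size
    for held_out in range(k):
        test = [items[i] for i in folds[held_out]]
        train = [items[i]
                 for f, fold in enumerate(folds) if f != held_out
                 for i in fold]
        yield train, test


def mean(scores: Sequence[float]) -> float:
    """Average a per-fold score, as done for the reported results."""
    return sum(scores) / len(scores)


# Usage: 159 stand-in "sentences" (integers instead of real data).
sentences = list(range(159))
splits = list(k_fold_splits(sentences))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

With 159 sentences each test fold holds 15 or 16 sentences, which is roughly the 90%/10% proportion named on the slide.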

Results: LID
• Majority baseline
  • The text is treated as Latin featuring Middle English insertions
  • A combination of the Latin label and perfect punctuation labeling
• Per-class precision, recall and F-score
• Macro-averages for the overall system
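The baseline and the evaluation measures above can be sketched as follows. The label names "LAT", "ENG" and "PUNCT" are placeholders (the actual label set is in Table 1), the punctuation test is a simple heuristic, and averaging the per-class F-scores to obtain the macro F-score is an assumption about how the macro-average was computed.

```python
from typing import Dict, List, Sequence, Tuple

PRF = Tuple[float, float, float]  # (precision, recall, F-score)


def majority_baseline(tokens: Sequence[str], majority: str = "LAT",
                      punct: str = "PUNCT") -> List[str]:
    """Label punctuation as punctuation and everything else as Latin."""
    return [punct if not any(c.isalnum() for c in tok) else majority
            for tok in tokens]


def per_class_prf(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, PRF]:
    """Precision, recall and F-score for every label seen in gold or pred."""
    scores: Dict[str, PRF] = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        scores[label] = (precision, recall, f_score)
    return scores


def macro_average(scores: Dict[str, PRF]) -> PRF:
    """Unweighted mean over classes, so rare classes count as much as frequent ones."""
    n = len(scores)
    p, r, f = (sum(s[i] for s in scores.values()) / n for i in range(3))
    return (p, r, f)


# Usage with made-up tokens and gold labels.
tokens = ["et", "the", "kyng", "."]
gold = ["LAT", "ENG", "ENG", "PUNCT"]
pred = majority_baseline(tokens)
scores = per_class_prf(gold, pred)
macro = macro_average(scores)
print(pred, scores["LAT"], macro)
```

Because the macro-average weights all classes equally, a baseline that never predicts the minority language is penalized heavily even when most tokens are Latin.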

Results: LID
Table 3: Performance of the CRF system for language identification compared to the baseline. Precision, recall and F-score per class and macro-average of all classes.

Results: LID
Table 4: Percentage of incorrectly labeled tokens per class, along with the distribution of incorrect labels among the other labels.

Results: POS Tagging
• Majority baseline
  • The output of the monolingual Latin tagger (Latin is the majority language)
• Confidence baseline
  • Choose the POS label of the monolingual tagger with the higher level of confidence
  • In case the label indicates that a word is a foreign word, choose the label from the Middle English tagger
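The confidence baseline can be sketched in a few lines. This is an illustration under assumptions: the foreign-word label "FW", the representation of tagger output as (label, confidence) pairs, and breaking ties in favor of the Latin tagger are not specified in the slides.

```python
from typing import List, Sequence, Tuple

Tagged = Tuple[str, float]  # (POS label, tagger confidence)

FOREIGN = "FW"  # hypothetical label a tagger emits for foreign words


def confidence_baseline(latin: Sequence[Tagged],
                        middle_english: Sequence[Tagged],
                        foreign_label: str = FOREIGN) -> List[str]:
    """Per token, keep the label of the more confident monolingual tagger."""
    if len(latin) != len(middle_english):
        raise ValueError("both taggers must label the same tokens")
    result = []
    for (la_tag, la_conf), (me_tag, me_conf) in zip(latin, middle_english):
        # Assumption: ties go to the Latin tagger (the majority language).
        tag = la_tag if la_conf >= me_conf else me_tag
        if tag == foreign_label:
            tag = me_tag  # fall back to the Middle English label
        result.append(tag)
    return result


# Usage with made-up tagger outputs for three tokens.
latin_out = [("CONJ", 0.9), ("FW", 0.8), ("NOUN", 0.3)]
english_out = [("NOUN", 0.4), ("DET", 0.6), ("NOUN", 0.7)]
labels = confidence_baseline(latin_out, english_out)
print(labels)
```

For the second token the Latin tagger is more confident but reports a foreign word, so the Middle English label is used instead.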

Results: POS Tagging
Table 5: Performance of the CRF system for POS tagging compared to the majority baseline (BL1) and the confidence baseline (BL2). CRFbase: system with 11 basic features; CRFpredLID: system with predicted LID as an additional feature; CRFgoldLID: system with gold-standard LID as an additional feature. Precision (P), recall (R) and F-score (F) per class and macro-average of all classes.

Results: POS Tagging
• The high average recall of almost 80% is important for the task
• Precision has lower priority
• The extracted phrases are manually inspected afterwards
• The CRFpredLID system shows an increase in performance over CRFbase
• The CRFgoldLID system yields the best performance
• The differences are not statistically significant

Results: POS Tagging
Table 6: Percentage of incorrectly labeled tokens per class, along with the distribution of incorrect labels among the other labels, for the CRFpredLID system.

Results: POS Tagging
• Incorrectly tagged words appear in POS sequences which rarely appear in the training data
• Adding more training data is expected to decrease errors of this kind

Results: Training Data Size
• Data sparsity in general is an issue when dealing with historical text
• Investigate how different sizes of the training set influence the results
  • 800 tokens
  • 1600 tokens
  • 2400 tokens (the complete training set)

Results: Training Data Size
Table 7: Different portions of the training set, along with precision, recall and F-score for LID and POS tagging.

Tools for Digital Humanities
• The aim is not only to build a system
• Enable Humanities scholars to process their data easily
• A simple web service in Java
• The data is returned in an ICARUS format
• Inspect the data
• Pose complex search requests
• Combine both language information and POS tags

Figure 1: Search interface of ICARUS returning results on a query for an English adjective followed by a Latin noun within the next 3 tokens.
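The kind of query shown in Figure 1 combines a language label and a POS tag per token. The sketch below is a plain Python analogue of that search, not ICARUS's own query syntax; the `Token` type, the label names and the example tokens are hypothetical.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Token(NamedTuple):
    form: str
    lang: str  # language label from the LID system
    pos: str   # label from the POS tagger


def find_pairs(sentence: Sequence[Token],
               first: Tuple[str, str] = ("ENG", "ADJ"),
               second: Tuple[str, str] = ("LAT", "NOUN"),
               max_distance: int = 3) -> List[Tuple[int, int]]:
    """Index pairs (i, j): `first` at i, `second` within the next max_distance tokens."""
    hits = []
    for i, tok in enumerate(sentence):
        if (tok.lang, tok.pos) != first:
            continue
        last = min(i + max_distance, len(sentence) - 1)
        for j in range(i + 1, last + 1):
            if (sentence[j].lang, sentence[j].pos) == second:
                hits.append((i, j))
    return hits


# Usage on a made-up annotated sentence.
sentence = [
    Token("gret", "ENG", "ADJ"),
    Token("and", "ENG", "CONJ"),
    Token("rex", "LAT", "NOUN"),
    Token("est", "LAT", "VERB"),
    Token("populus", "LAT", "NOUN"),
]
hits = find_pairs(sentence)
print(hits)
```

Only the noun at index 2 is reported here, because the noun at index 4 lies four tokens after the adjective and falls outside the window of 3.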

Tools for Digital Humanities
• The method can easily be adapted to other languages
  • Suitable monolingual taggers (TreeTagger)
  • POS-related word lists (if available)
• The code is publicly available on GitHub

Conclusion
• We saw the implementation and application of two systems developed for a specific purpose
• We got reasonable results given the very small size of the training data
• We can extend the training data and correct some errors, for example by adding monolingual Middle English data

Future Work
• Jointly modeling LID and POS tagging
• Dependency parser for mixed text
• Get insights into the constraints on intra-sentential code-switching

Conclusion and Future Work
• Collaboration between Humanities and Computer Science
• Task-oriented tool development
• Immediate feedback on the performance
• Systems are applied to real-world data
• A way to give Computer Science the chance to support other fields and find new and interesting challenges

Questions?
