Outline - University of Connecticut

StringBio 2018
Statistical Mitogenome Assembly with Repeats
Fahad Alqahtani & Ion Măndoiu
10-19-2018

Outline
  Background and prior work
  SMART pipeline
  Results
  Conclusions and future work

Mitochondria: the powerhouse of the cell
  Cellular organelles within eukaryotic cells
  Convert chemical energy from food into adenosine triphosphate (ATP)
  The popular term "powerhouse of the cell" was coined by Philip Siekevitz in 1957
  The second genome
  Source: https://www.fbi.gov/about-us/lab/forensic-science-communications/fsc/july1999/dnalist.htm/dnaf1.htm

Why sequence the mitogenome?
  Important role in disease
    Tuppen, Helen AL, et al. "Mitochondrial DNA mutations and human disease." Biochimica et Biophysica Acta (BBA)-Bioenergetics 1797.2 (2010): 113-128.
  Tracing maternal ancestry
    Source: http://www.norwaydna.no/mtdna_en/
  Inferring human population migrations
    Source: https://blog.23andme.com/ancestry/haplogroups-explained/
  Species tree reconstruction
    Kurabayashi, Atsushi, and Masayuki Sumida. "Afrobatrachian mitochondrial genomes: genome reorganization, gene rearrangement mechanisms, and evolutionary trends of duplicated and rearranged genes." BMC Genomics 14.1 (2013): 633.

Mitogenome assembly
  Most existing pipelines rely on a reference genome or the mitogenome of a related species
  Off-the-shelf de novo assemblers are poorly suited for assembling mtDNA from WGS reads
    Mitochondrial reads are often discarded due to the much higher sequencing depth of mtDNA compared to gDNA
    They do not handle circular genomes and repeats well

Prior work
  Tool                                 Method                         Input requirements
  MITObim [Hahn et al 2013]            Reference-based assembly       Trimmed and interleaved reads, and a reference genome
  MITObim [Hahn et al 2013]            De novo assembly               Trimmed and interleaved reads, and a seed sequence (COI gene)
  NOVOPlasty [Dierckxsens et al 2017]  A seed-extend based assembler  Raw reads, insert size, read length, and a seed sequence (COI gene)
  Norgal [Al-Nakeeb et al 2017]        De novo assembly               Raw reads

Outline
  Background and prior work
  SMART pipeline
  Results
  Conclusions and future work

SMART: Statistical Mitogenome Assembly with RepeaTs

  Input:
    Paired-end WGS reads
    Seed sequence (COI gene)
  Output:
    Complete/circular mitogenome (or largest scaffold)

SMART workflow

Adapter trimming
  Automatic detection of adapters and trimming using Perl/C++ modules from the IRFinder package
  PE overlap allows very precise (single-base resolution) adapter trimming
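The paired-end overlap idea behind the adapter trimming step can be sketched as follows. This is a minimal illustration, not IRFinder's actual implementation: the function names, the `min_insert` and `max_mismatches` parameters, and the assumption that adapters appear only through read-through of a short insert are all hypothetical. When the insert is shorter than the read length, both mates read through the insert into adapter sequence, so the first L bases of read 1 equal the reverse complement of the first L bases of read 2, where L is the insert length.

```python
# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def trim_by_pair_overlap(read1: str, read2: str,
                         min_insert: int = 10,
                         max_mismatches: int = 0) -> tuple[str, str]:
    """Trim both mates to the insert length inferred from their overlap.

    Tries candidate insert lengths from longest to shortest and accepts the
    first one where read1[:L] matches reverse_complement(read2[:L]).
    Returns the reads unchanged if no candidate length fits.
    """
    max_len = min(len(read1), len(read2))
    for insert_len in range(max_len, min_insert - 1, -1):
        left = read1[:insert_len]
        right = reverse_complement(read2[:insert_len])
        mismatches = sum(a != b for a, b in zip(left, right))
        if mismatches <= max_mismatches:
            # Everything past insert_len is read-through into the adapter.
            return read1[:insert_len], read2[:insert_len]
    return read1, read2


if __name__ == "__main__":
    insert = "ACGTTGCAAGGCTTAACCGG"   # hypothetical 20 bp insert
    adapter = "AGATCGGAAG"             # hypothetical adapter prefix
    r1 = insert + adapter
    r2 = reverse_complement(insert) + adapter
    print(trim_by_pair_overlap(r1, r2))
```

Because the boundary is found by comparing the two mates rather than by matching a known adapter sequence, the cut position is exact to a single base even when only one or two adapter bases are present.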

  IRFinder reference: Middleton, Robert, et al. "IRFinder: assessing the impact of intron retention on mammalian gene expression." Genome Biology 18.1 (2017): 51.

Seed (COI) sequences
  A ~648 bp region of the cytochrome c oxidase subunit 1 (COI) gene has been selected as a DNA barcode for taxonomic classification
  The Barcode of Life Data System (BOLD) has >6M barcodes from 194K animal species, 67K plant species, and 21K fungi & other species
  http://www.boldsystems.org/

Coverage based filter
  Reads with 1 error OK

Preliminary assembly
  Reads passing the coverage filter are assembled using Velvet
  De Bruijn graph assembler
  https://en.wikipedia.org/wiki/Velvet_assembler

Preliminary contig filtering
  Contigs aligned against eukaryotic mitogenomes using BLAST
  Keep contigs with significant hits only

Read alignment
  Using HISAT2
    Fast and sensitive aligner for NGS reads
  Pulls out additional mitochondrial reads missed by the coverage filter

Secondary assembly
  Using SPAdes
    Based on multisized de Bruijn graph
    Robust to non-uniformities in read coverage
  Read alignment and SPAdes assembly repeated
    Until the simplified contig graph is Eulerian, or max iterations reached

Max-likelihood search
  Eulerian paths evaluated using the likelihood model implemented in ALE [Clark et al 2013]

ALE likelihood
  Placement scoring: how well read sequences agree with the assembly
  Insert scoring: how well PE insert lengths match those we would expect
  Depth scoring: how well depth at each location agrees with the depth expected after GC-bias correction
  K-mer scoring: how well k-mer counts of each contig match the multinomial distribution estimated from the entire assembly
  https://academic.oup.com/bioinformatics/article/29/4/435/199222

Bootstrapping & clustering
  Process repeated for n=10 bootstrap samples
  Rotation-invariant pairwise distances computed using fitting alignment
  ML sequences clustered using hierarchical clustering
  Consensus computed for each cluster
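One way to read "rotation-invariant pairwise distances computed using fitting alignment" is sketched below. It is an assumption about the approach, not SMART's code. A circular sequence has no fixed start, so one assembly is fitted (aligned end to end) against a doubled copy of the other, which contains every rotation as a substring. The unit edit costs, the function names, and the symmetrization by `max` are illustrative choices.

```python
def fitting_edit_distance(query: str, text: str) -> int:
    """Minimum edit distance between `query` and any substring of `text`.

    Standard fitting alignment with unit costs: gaps before and after the
    matched substring of `text` are free, the whole `query` must be aligned.
    """
    # Row 0: an empty query prefix matches for free at any text position.
    prev = [0] * (len(text) + 1)
    for i, q in enumerate(query, start=1):
        # Column 0: i query characters against nothing costs i.
        curr = [i] + [0] * len(text)
        for j, t in enumerate(text, start=1):
            curr[j] = min(
                prev[j - 1] + (q != t),  # match or substitution
                prev[j] + 1,             # query character unmatched
                curr[j - 1] + 1,         # text character skipped inside match
            )
        prev = curr
    # Free end gap: the match may end at any text position.
    return min(prev)


def rotation_invariant_distance(a: str, b: str) -> int:
    """Distance between two circular sequences given as linear strings.

    Fits each sequence against the doubled other one and takes the larger
    value, so that a short sequence contained in a long one is not scored 0.
    """
    return max(fitting_edit_distance(a, b + b),
               fitting_edit_distance(b, a + a))


if __name__ == "__main__":
    # Same circular sequence written from two different start positions.
    print(rotation_invariant_distance("ACGTAC", "GTACAC"))
```

A matrix of such distances over the bootstrap assemblies could then be passed to any hierarchical clustering routine. The quadratic-time dynamic program is fine for mitogenome-sized sequences in a sketch, though a banded or seeded alignment would be faster in practice.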

MITOS annotation

Galaxy interface @ neo.engr.uconn.edu/?toolid=SMART

Outline
  Background and prior work
  SMART pipeline
  Results
  Conclusions and future work

Coverage filter accuracy
  2.5M reads
  Ground truth determined by bowtie2 alignment to known reference

  Species     Sample_ID  TPR    PPV    F-Score
  Human       HG00501    0.750  0.443  0.557
  Human       HG00524    0.454  0.147  0.222
  Human       HG00581    0.779  0.516  0.620
  Human       HG00635    0.771  0.240  0.366
  Chimpanzee  SRR490082  0.715  0.207  0.321
  Goat        ERR219544  0.875  0.220  0.352

1KGP human datasets

Other datasets
  Sample                 mtDNA sequence  LASTZ pairwise  MUSCLE pairwise  ClustalW pairwise  MAFFT pairwise
                         length (bp)     % identity      % identity       % identity         % identity
  Balearica regulorum    16,742          98.0            98.3             98.3               98.3
  Grus japonensis        16,615          98.4            97.8             97.8               97.8
  Xenopus laevis         17,922          98.0            95.9             96.1               95.7
  Pan troglodytes        16,085          97.5            94.7             94.7               94.7
  Mus musculus           15,802          99.97           96.9             96.7               96.9
  Canis lupus            16,580          97.1            96.7             96.7               96.7
  Capra aegagrus hircus  16,098          99.98           96.7             96.7               96.7
  Saccharina japonica    37,671          100             99.8             99.8               99.8
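The F-Score column of the coverage filter accuracy table is consistent with the usual definition as the harmonic mean of TPR (recall) and PPV (precision). The short check below recomputes it from the reported TPR and PPV values. The slides do not state the formula, so treating F-Score as F1 is an assumption, and the function and variable names are illustrative.

```python
def f_score(tpr: float, ppv: float) -> float:
    """Harmonic mean of recall (TPR) and precision (PPV), i.e. the F1 score."""
    if tpr + ppv == 0:
        return 0.0
    return 2 * tpr * ppv / (tpr + ppv)


# (species, sample ID, TPR, PPV, reported F-Score) as listed in the table.
REPORTED = [
    ("Human", "HG00501", 0.750, 0.443, 0.557),
    ("Human", "HG00524", 0.454, 0.147, 0.222),
    ("Human", "HG00581", 0.779, 0.516, 0.620),
    ("Human", "HG00635", 0.771, 0.240, 0.366),
    ("Chimpanzee", "SRR490082", 0.715, 0.207, 0.321),
    ("Goat", "ERR219544", 0.875, 0.220, 0.352),
]

if __name__ == "__main__":
    for species, sample, tpr, ppv, reported in REPORTED:
        recomputed = f_score(tpr, ppv)
        # The inputs are rounded to three decimals, so allow 0.001 slack.
        agrees = abs(recomputed - reported) < 0.001
        print(f"{species:<11}{sample:<11}{recomputed:.3f}  reported {reported:.3f}  agrees={agrees}")
```

The low PPV values show that the coverage filter alone passes many non-mitochondrial reads, which fits its role as a first-pass filter ahead of the BLAST-based contig filtering and HISAT2 read alignment steps.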

Outline
  Background and prior work
  SMART pipeline
  Results
  Conclusions and future work

Conclusions
  SMART is an automated pipeline for de novo mitogenome assembly from WGS reads
  Based on a statistical framework
    Probabilistic read classifier based on coverage
    Likelihood maximization for resolving ambiguities in the assembly graph
    Assembly confidence estimated by bootstrapping
  Produces complete/circular assemblies even in the presence of repeats
  Available via Galaxy interface at neo.engr.uconn.edu/?toolid=SMART

Ongoing work
  Large-scale pipeline validation
    47 frog species from [Zhang et al 2013]
    Comparison with other tools (MITObim, NOVOPlasty, and Norgal)
  Reconstruction of plant mitochondrial and chloroplast genomes
  Extension to long-read sequencing technologies (PacBio, Nanopore)

Thank you for your attention!
Any questions?
