Winter School on Mathematical Methods in Life Sciences

Introduction to Modeling and Algorithms in Life Sciences Ananth Grama Purdue University http://www.cs.purdue.edu/homes/ayg Acknowledgements

To various sources, including Profs. Mehmet Koyuturk, Michael Raymer, Wiki sources (pictures), and other noted attributions. To the US National Science Foundation and the Center for Science of Information. Central Dogma of Molecular Biology http://en.wikipedia.org/wiki/Central_dogma_of_molecular_biology

Central Dogma of Molecular Biology Mostly valid with some exceptions: Reverse Transcription: Retroviruses such as Feline Leukemia, HIV RNA Replication: RNA to RNA transfer in viruses Direct Translation: DNA to Protein (typically in cell fragments)

Protein Synthesis Transcription: a DNA molecule is converted into a complementary strand of RNA This RNA is also called messenger RNA (mRNA) since it acts as an intermediary between DNA and the Ribosomes Ribosomes are parts of cell that synthesize

proteins from mRNA Eukaryotic Transcription Synthesizing Proteins: Translation mRNA is decoded by the Ribosome to produce specific proteins (polypeptide chains) Polypeptide chains fold to make active proteins

The amino acids are attached to transfer RNA (tRNA) molecules, which enter one part of the ribosome and bind to the messenger RNA sequence. http://en.wikipedia.org/wiki/File:Peptide_syn.png Translation

Translation Some Numbers Human DNA has: 3 billion base pairs The length of DNA in a cell is 1m! This is packed into a nucleus of 3 10 microns Each chromosome (46 in all) is about 2 cm on average.

Some Numbers Models for Lactose Intolerance Haemophilia-A Haemophilia-A

Haemophilia-A is caused by clotting factor VIII deficiency. Factor VIII is encoded by the F8 gene. Analyzing Sequences

Sequences: An Evolutionary Perspective Evolution occurs through a set of modifications to the DNA These modifications include point mutations, insertions, deletions, and rearrangements Seemingly diverse species (say mice and humans) share significant similarity (80-90%) in their genes

The locations of genes may themselves be scrambled Gene Duplication Gene duplication has important evolutionary implications Duplicated genes are not subject to evolutionary

pressures Therefore they can accumulate mutations faster (and consequently lead to specialization) Inversions

Para and pericentric inversions Transposition A group of conserved genes appears in a transposed fashion at a different location

Genomic Sequences Sanger Sequencing Next-Generation Sequencing Illumina Solexa Helicos Solid Roche/454

Sanger Sequencing Sanger Sequencing Sanger Sequencing Helicos NGS

Helicos NGS Helicos NGS Illumina Solexa Illumina Solexa

Illumina Solexa Dealing with Reads From fluorescence to nucleotides (Phread) Error correction Mapping to reference genomes Assembly

Error Correction NGS reads range from 50 300 bps (constantly changing) Error rates range from 1 3% Errors are not uniformly distributed over the read Correcting errors is a critical step before mapping/ assembly

Error Correction Needed coverage on the genome k-mer based error correction Suffix Trees Short Read Alignment Given a reference and a set of reads, report at least

one good local alignment for each read if one exists Approximate answer to: where in genome did read originate? What is good? For now, we concentrate on: Fewer mismatches is

TGATCATA better GATCAA Failing to align a lowTGATATTA GATcaT quality base is better than failing to align a high-quality base

better than TGATCATA GAGAAT better than TGATcaTA GTACAT Pop et al.

Indexing Genomes and reads are too large for direct approaches like dynamic programming Indexing is required Suffix tree Suffix array

Seed hash tables Many variants, incl. spaced seeds Choice of index is key to performance Indexing Genome indices can be big. For human:

> 35 GBs > 12 GBs > 12 GBs Large indices necessitate painful compromises

1. Require big-memory machine 2. Use secondary storage 3. Build new index each run 4. Subindex and do multiple passes Burrows-Wheeler Transform

Reversible permutation used originally in compression T BWT(T) Burrows Wheeler

Matrix Last column Once BWT(T) is built, all else shown here is discarded Matrix will be shown for illustration only Burrows M, Wheeler DJ: A block sorting lossless data compression algorithm. Digital Equipment Corporation, Palo Alto, CA 1994, Technical Report 124; 1994

Burrows-Wheeler Transform Property that makes BWT(T) reversible is LF Mapping ith occurrence of a character in Last column is same text occurrence as the ith occurrence in First column Rank: 2

BWT(T) T Rank: 2 Burrows Wheeler Matrix

Burrows-Wheeler Transform To recreate T from BWT(T), repeatedly apply rule: T = BWT[ LF(i) ] + T; i = LF(i) Where LF(i) maps row i to row whose first character corresponds to is last per LF Mapping Final T Could be called unpermute or walk-left algorithm

FM Index Ferragina & Manzini propose FM Index based on BWT Observed: LF Mapping also allows exact matching within T LF(i) can be made fast with checkpointing and more (see FOCS paper) Ferragina P, Manzini G: Opportunistic data structures with applications. FOCS. IEEE Computer Society; 2000.

Ferragina P, Manzini G: An experimental study of an opportunistic index. SIAM symposium on Discrete algorithms. Washington, D.C.; 2001. Exact Matching with FM Index To match Q in T using BWT(T), repeatedly apply rule: top = LF(top, qc); bot = LF(bot, qc)

Where qc is the next character in Q (right-to-left) and LF(i, qc) maps row i to the row whose first character corresponds to is last character as if it were qc Exact Matching with FM Index In progressive rounds, top & bot delimit the range of rows beginning with progressively longer suffixes of Q

Exact Matching with FM Index If range becomes empty (top = bot) the query suffix (and therefore the query) does not occur in the text Backtracking Consider an attempt to find Q = agc in T = acaacg:

g c gc does not occur in the text Instead of giving up, try to backtrack to a previous position and try a different base

Sequencing Find maximal overlaps between fragments: ACCGT CGTGC TTAC TACCGT

--ACCGT-----CGTGC TTAC-----TACCGT TTACCGTGC Consensus sequence determined by vote Quality Metrics The coverage at position i of the target or

consensus sequence is the number of fragments that overlap that position Target: No coverage Two contigs

The Maximum Overlap Graph Overlap multigraph Each directed edge, (u,v) is weighted with the length of the maximal overlap between a suffix of u and a prefix of v TACGA a

1 2 ACCC CTAAAG b

1 c 1 1

d GACA 0-weight 0-weightedges edgesomitted! omitted!

Paths and Layouts The path dbc leads to the alignment: Superstrings Every path that covers every node is a superstring Zero weight edges result in alignments like: GACA-----------GCCC------------TTAAAG

Higher weights produce more overlap, and thus shorter strings The shortest common superstring is the highest weight path that covers every node Graph formulation of SCS Input: A weighted, directed graph

Output: The highest-weight path that touches every node of the graph NP Hard, Use Greedy Approximation Greedy Example 7 4

5 2 2 2 3 1

6 So we have sequences now! Find genes in sequences.

Query: AGTACGTATCGTATAGCGTAA What does it do? Find similar gene in other species with known

function and reason from it Align sequences with known genes

Find the gene with the best match

Recently Viewed Presentations

  • Une Journe au Collge WALT: To be able

    Une Journe au Collge WALT: To be able

    Ex4. Copy the sentence in full by completing it with the correct word or phrase in italics. The answers are in the text. Score out of 8… 8 = 4a, 7 = 4b, 6 = 4c, 5 = 3a, 4...
  • An element is a substance made from the

    An element is a substance made from the

    Tripod Gauze mat Heat mat Beaker Test tube Boiling tube Test tube holders Conical flask Test tube rack Evaporating dish Filter funnel Elements are arranged in rows called periods and columns called groups. The periodic table has 8 groups: Group...
  • Thriving in a Matrix Organization - PMI Heartland

    Thriving in a Matrix Organization - PMI Heartland

    "A 'mixed' organizational form in which normal hierarchy is 'overlaid' by some form of lateral authority, influence or communication… there are usually two chains of command, one long functional lines and the other along project lines."
  • Welcome to Second Grade!

    Welcome to Second Grade!

    6 Day Rotation Schedule. Please refer to your district calendar for the day number. Included in your packets is a Special Area schedule. Due to the rotating schedule pay careful attention to PE and Library days so that students are...
  • Social Psychology - Faculty Server Contact

    Social Psychology - Faculty Server Contact

    Interdependent View of the Self Gender Differences in Defining the Self Women Men Knowing Ourselves Through Introspection Introspection Focusing on the Self: Self-Awareness Theory Self-Awareness Theory Causal Theories Judging Why We Feel the Way We Do: Telling More than We...
  • Oral Health and Multiple Sclerosis Eleonore D. Paunovich,

    Oral Health and Multiple Sclerosis Eleonore D. Paunovich,

    Oral Health and Multiple Sclerosis Eleonore D. Paunovich, DDS, MS, FASGD, DASCD [email protected] Director, Clinical Oral Health Programs
  • Understanding and Protecting Youth

    Understanding and Protecting Youth

    Understanding and Protecting Youth BSA Mission Statement What is the purpose of Venturing? High adventure Leadership skills Citizenship Community service Social experiences Fitness Understanding Young Adults The first things we notice about young people: Size Behavior Teenagers deal with opposing...
  • Chapter 1

    Chapter 1

    MKTG Lamb, Hair, McDaniel 2008-2009 15 CHAPTER Advertising and Public Relations Designed by Amy McGuire, B-books, Ltd. Prepared by Deborah Baker, Texas Christian University