Biocomputational Puzzles
Dennis Shasha
Courant Institute, New York University
[email protected]

Collaborations with labs of:
(bio, Duke): Philip Benfey
(bio, NYU): Gloria Coruzzi, Ken Birnbaum, Laurence Lejay, Peter Palenchar, Rodrigo Gutierrez
(media lab/infoviz): Crispy, Brad Paley

Overview
Simple Primer in Genomics
Use of form and function for prediction.
Activist Data Mining
Visualization tool for multi-experiment data
A gallery of challenging problems.
Lessons from successful collaborations.

Genomics: a primer
Before genomics: isolate and manipulate genes one at a time.
However, interactions are important (mouse has 40,000 genes; we have 50,000, mostly the same; are we 20% different?)
Genomics: quantitative view of an entire species.
Proteomics: all proteins of a species.
"Omics": anything described on one or more entire species.

Tool: sequencing
Sequencing (find all the DNA of a species).
About 15 million species. We have sequences of about 200.
What does it buy you? Figure out which genes do what based on homology (sequence similarity).

Rate of Full Genome Sequencing
[Figure: genomes sequenced (axis 0 to 250) against year (1995 to 2005).]

Hybridization, sticking gene segments
[Figure labels: mRNA (cDNA) sample; dye; glass slide with thousands of spots/cells; light; scan and process image.]

The microarray revolution: GLOBAL gene expression
Although microarrays are noisy in many and complicated ways (sample preparation, dyes, slide spotting, hybridization, image processing), they allow us to record the simultaneous activity of a very large number of genes, and thus tackle a whole host of questions on gene function, regulation and interactions:
Which genes are involved in the (complex) reactions to certain stimuli/offenses?
What are the (complex) genetic signatures of certain diseases, and can they inform disease taxonomies?
What are the (complex) consequences of knock-outs?
Gene networks.

Overview
Simple Primer in Genomics
Use of form and function for prediction
Activist Data Mining
Visualization tool for multi-experiment data
A gallery of challenging problems.
Lessons from successful collaborations.

Two Uses of Homology
1) Gene X does function F in species S1. Perhaps homologous gene X does F in species S2. (Know gene and its function)
2) Can we infer function of a gene without knowing the function of any homologous gene?

Our Approach (simplified)
Species S1 and S2 have trait T.
Species S3 and S4 are missing T.
Gene G is in S1 and S2 but not in S3 or S4.
So Gene G may be responsible for T.

Orthologs
Orthologs: homologs present in different organisms and whose common ancestor predates the split between the species.
Paralogs: genes related by duplication within a genome.
[Figure residue: gene labels A and B.]

Binary Character Matrix

Refinement 2 (don't require equality)
Definitions
Gc represents the set of genomes (= species) with COGc.
Gt represents the set of genomes with trait t.
So Perfect Match implies that the set of genomes represented by a COG = the set of genomes having the trait t.
X is a threshold value that is adjusted between 0 and 1 to set the stringency of the algorithm. X=1 implies Perfect Match.

31 Bacterial Flagellar COGs in database
A well-developed model
Complex system, thus less likely to be homoplastic
Easy to assay

Summary of Algorithms for Flagella COGs
Total Number of COGs: 2885
Total Number of COGs reported by algorithms: 77
Total Number of Already Known Bacterial Flagella COGs: 31
Total Number of Flagella COGs reported by algorithm: 29 (~94%)
Total Number of putative flagella COGs found by algorithm: 5

Knockout Technique
Strategy
1. Amplify internal fragment of target gene by PCR
2. Clone into pMUTIN
3. Transform B. subtilis. Single cross-over event disrupts target gene (all mutations were confirmed by PCR)
4. Transcriptional fusion is created with the gene's promoter and lacZ
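
The threshold matching defined earlier (Gc, Gt, and the stringency X) can be sketched in a few lines. The slides do not state the scoring formula, so this sketch assumes a Jaccard overlap between Gc and Gt; that choice is only an assumption, picked because a score of 1 then means the two sets are equal, which agrees with "X=1 implies Perfect Match". The COG names and genome sets are made-up toy data.

```python
from typing import Dict, List, Set


def match_score(gc: Set[str], gt: Set[str]) -> float:
    """Overlap between genomes carrying a COG (Gc) and genomes with the trait (Gt).

    ASSUMPTION: Jaccard overlap stands in for the unstated scoring rule.
    A score of 1.0 means Gc == Gt, i.e. a Perfect Match.
    """
    union = gc | gt
    if not union:
        return 0.0
    return len(gc & gt) / len(union)


def candidate_cogs(cog_genomes: Dict[str, Set[str]],
                   trait_genomes: Set[str],
                   x: float) -> List[str]:
    """Return the COGs whose score reaches the stringency threshold x."""
    return sorted(cog for cog, gc in cog_genomes.items()
                  if match_score(gc, trait_genomes) >= x)


# Hypothetical toy data: species S1 and S2 have the trait, S3 and S4 do not.
cog_genomes = {
    "COG_a": {"S1", "S2"},        # present exactly where the trait is
    "COG_b": {"S1", "S2", "S3"},  # one extra genome
    "COG_c": {"S3", "S4"},        # present only where the trait is missing
}
trait_genomes = {"S1", "S2"}

perfect = candidate_cogs(cog_genomes, trait_genomes, x=1.0)
relaxed = candidate_cogs(cog_genomes, trait_genomes, x=0.6)
print(perfect, relaxed)
```

Lowering X trades stringency for recall: COGs that are missing from, or extra in, a few genomes still get reported.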

Motility in Swim Plates (LB with 0.25% agar)
Overnight growth at 37C. Swim medium (LB + 0.25% agar). Similar results at 20C (4 days) and 30C (2 days).
[Figure labels: B. subtilis yuxH, 168; B. subtilis yqeW, 168.]
No visible difference between the wild-type and the ylqH strains, but the yqeW and yuxH strains did not swim out as far from the inoculation point compared to the wild-type.

Lessons from Flagella Work
It doesn't matter that the approach was quite simple. What mattered was that it made sense and led to high probability predictions.
Could in fact be generally useful.
Ref: Mitchell Levesque, Dennis Shasha, Wook Kim, Michael G. Surette, and Philip N. Benfey, "Trait-To-Gene: A Computational Method for Predicting the Function of Uncharacterized Genes", Current Biology, vol. 13, 129-133, January 21, 2003

Overview
Simple Primer in Genomics
Use of form and function for prediction.
Activist Data Mining
Visualization tool for multi-experiment data
A gallery of challenging problems.
Lessons from successful collaborations.

New topic: How to do Data Mining?
Classical approach:
Wait for data to appear.
Find patterns in it.
Hope they are actionable.
Works well when data is pertinent, e.g. Amazon's "other books" recommendations, extrapolation of trends.

Activist Data Mining
Propose initial experiments to explore a subspace of some predefined search space.
Evaluate the results.
Propose new experiments, evaluate, propose, evaluate, propose, and so on.
Iterative and adaptive.

Which is Better for Natural Science?
Classical is obviously right when you have no control over data generation.
When you do, active data mining (active learning) may work much better.
Arises naturally when you have a tight collaboration.

Activist Data Mining Philosophy
Passive Approach: Natural scientists do experiments. Computer scientists help to glean something from them.
Activist Approach: Computer scientists help
(1) Design experiments
(2) Analyze results
(3) Design new experiments based on results

Activist Data Mining Philosophy (Reminder)
Passive Approach: Natural scientists do experiments. Computer scientists help to glean something from them.
Activist Approach: Computer scientists help
(1) Design experiments
(2) Analyze results
(3) Design new experiments based on results
Our particular methodology: Adaptive Combinatorial Design
Our innovation: applying combinatorial design in an iterative way.

What is combinatorial design?
Disciplined sampling.
Suppose you are a thief.
Combinatorial Safe: 10 switches with 3 settings each. Over 59,000 (3^10) possible configurations.
However, there is a certain pair of switches (you don't know which pair) and a certain pair of values of those switches that will open the safe.
Illustration (figure): switches S1 S2 S3 S4 S5 S6 S7 S8 S9 S10, with the values C and A marked.
Challenge: Open the safe in as few switch configurations as possible. How many? How to do it?

Scientific Goal
We want to describe the factors (e.g. light, carbon, nitrogen) that determine whether plants will produce critical amino acids and how those factors interact.
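
Returning to the combinatorial safe: since an unknown pair of switch values opens it, it is enough to try a set of configurations in which every pair of switches takes every pair of values at least once. The sketch below builds such a set with a simple greedy heuristic. This is not the construction used in the talk and makes no claim of being minimal; the function names are hypothetical. It only illustrates that far fewer than 3^10 tries are needed.

```python
from itertools import combinations, product


def _ordered(p, q):
    """Store a pair of (switch, value) items with the lower switch index first."""
    return (p, q) if p[0] < q[0] else (q, p)


def pairwise_cover(n_switches, settings):
    """Greedy sketch: configurations that together cover every pair of
    switches with every pair of values (a heuristic, not an optimal design)."""
    uncovered = {
        ((i, a), (j, b))
        for i, j in combinations(range(n_switches), 2)
        for a, b in product(settings, repeat=2)
    }
    configs = []
    while uncovered:
        # Seed with one uncovered pair so that every round makes progress.
        (i, a), (j, b) = min(uncovered)
        config = {i: a, j: b}
        for k in range(n_switches):
            if k in config:
                continue
            # Pick the value for switch k that covers the most new pairs
            # with the switches already fixed in this configuration.
            config[k] = max(
                settings,
                key=lambda v: sum(
                    _ordered((m, w), (k, v)) in uncovered
                    for m, w in config.items()
                ),
            )
        row = tuple(config[k] for k in range(n_switches))
        configs.append(row)
        uncovered -= {
            ((p, row[p]), (q, row[q]))
            for p, q in combinations(range(n_switches), 2)
        }
    return configs


def covers_all_pairs(configs, n_switches, settings):
    """Independent check that every pair of values appears in some configuration."""
    return all(
        any(row[i] == a and row[j] == b for row in configs)
        for i, j in combinations(range(n_switches), 2)
        for a, b in product(settings, repeat=2)
    )


configs = pairwise_cover(10, "ABC")
print(len(configs), "configurations instead of", 3 ** 10)
```

Any configuration set that passes covers_all_pairs is guaranteed to contain the opening combination, whichever pair of switches it involves.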

[Figure labels: C:N, C4:N2.]

Design Space
Inputs:
* Light
* Starvation to Various Nutrients
* Carbon
* Inorganic N (NO3/NH4)
* Organic N (Glu)
* Organic N (Gln)
If inputs take binary values (first approximation), 6 binary (+/-) inputs = 2^6 or 64 input combinations (treatments).
Use 2-factor combinatorial design to reduce the number of treatment combinations required to cover the experimental space, assuming that important interactions will have to do with two factors.

Combinatorial design finds six conditions to explore every pairwise interaction. Want to discover important factors.

EXPT 1 NO PIVOT
LANE  ILLUMIN  STARVE  CARBON  NO3NH4  GLU  GLN
1     DARK     Y       0       L       H    H
2     DARK     N       L       0       0    H
3     LIGHT    N       0       0       H    0
4     LIGHT    Y       L       L       0    0
5     LIGHT    Y       L       0       H    H
6     DARK     N       0       L       0    0

Notice: for each pair of input factors and combination of values from those factors, some experiment has that combination, e.g. Light with No carbon; Starve with No Glu.
After doing this experiment, certain factors suggest themselves as worth further study.
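
The pairwise-coverage claim for the six-lane no pivot design can be checked mechanically. The sketch below transcribes the six lanes from the table and lists any pair of factor values that no lane contains; the helper name missing_pairs is hypothetical, and the set of levels for each factor is taken to be the values that appear in its column.

```python
from itertools import combinations

FACTORS = ["ILLUMIN", "STARVE", "CARBON", "NO3NH4", "GLU", "GLN"]

# The six lanes of the no pivot design, in table order.
NO_PIVOT = [
    ("DARK",  "Y", "0", "L", "H", "H"),
    ("DARK",  "N", "L", "0", "0", "H"),
    ("LIGHT", "N", "0", "0", "H", "0"),
    ("LIGHT", "Y", "L", "L", "0", "0"),
    ("LIGHT", "Y", "L", "0", "H", "H"),
    ("DARK",  "N", "0", "L", "0", "0"),
]


def missing_pairs(design, factors):
    """List every (factor, value, factor, value) combination absent from the design."""
    n = len(factors)
    # Assumption: each factor's levels are exactly the values seen in its column.
    levels = [sorted({row[c] for row in design}) for c in range(n)]
    missing = []
    for c1, c2 in combinations(range(n), 2):
        seen = {(row[c1], row[c2]) for row in design}
        for a in levels[c1]:
            for b in levels[c2]:
                if (a, b) not in seen:
                    missing.append((factors[c1], a, factors[c2], b))
    return missing


gaps = missing_pairs(NO_PIVOT, FACTORS)
print("lanes:", len(NO_PIVOT), "of", 2 ** len(FACTORS), "possible treatments")
print("uncovered pairs:", gaps)
```

An empty list means all 15 factor pairs, each with 4 value combinations, are covered by 6 of the 64 possible treatments.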

Adapting to Results
Find most important inputs in order to see their effects in more detail.
That is, we focus our search space on those inputs that are likely to exert the most influence over outputs of interest.

Adaptation Following No Pivot Design
Key to activist data mining is adapting to results of experiments already done.
Many ways to do this, e.g.
Tong, S. & Koller, D. Active Learning for Structure in Bayesian Networks. Seventeenth International Joint Conference on Artificial Intelligence, 863-869 (2001). Advocates pool-based active learning: a pool of unlabeled instances (don't know output value). An active learner chooses which instance to query next in the hope that it will reduce the set of possible answers.
Ideker, T. E., Thorsson, V. & Karp, R. M. Discovery of regulatory interactions through perturbation: inference and experimental ...

Three questions of particular interest
1. Is any single factor so important that its presence determines the outcome regardless of the other contexts? (e.g. Light in context X is repressive compared with Dark in context Y for all X, Y)
2. Is a factor important enough that it has an effect for any particular context? (e.g. for all X, Light in context X is repressive compared with Dark in X)
3. Is a factor consistently important when compared with a fixed background? (e.g. for all X, is Light in context X repressive
compared with background?)

Pivot Design 1: Start with no pivot design

EXPT 1 NO PIVOT
LANE  ILLUMIN  STARVE  CARBON  NO3NH4  GLU  GLN
1     DARK     Y       0       L       H    H
2     DARK     N       L       0       0    H
3     LIGHT    N       0       0       H    0
4     LIGHT    Y       L       L       0    0
5     LIGHT    Y       L       0       H    H
6     DARK     N       0       L       0    0

Create dark and light pairs by just setting Illumin to dark and light respectively.

Pivot Design 2: Dark Design

LANE  ILLUMIN  STARVE  CARBON  NO3NH4  GLU  GLN
1     DARK     Y       0       L       H    H
2     DARK     N       L       0       0    H
3     DARK     N       0       0       H    0
4     DARK     Y       L       L       0    0
5     DARK     Y       L       0       H    H
6     DARK     N       0       L       0    0

Exactly the same as no pivot tests but with DARK everywhere. Requires only three more experiments than in the no pivot case.

Pivot Design 3: Light Design

LANE  ILLUMIN  STARVE  CARBON  NO3NH4  GLU  GLN
1     LIGHT    Y       0       L       H    H
2     LIGHT    N       L       0       0    H
3     LIGHT    N       0       0       H    0
4     LIGHT    Y       L       L       0    0
5     LIGHT    Y       L       0       H    H
6     LIGHT    N       0       L       0    0

Exactly the same as DARK tests but with Light everywhere. Again, three more experiments than in the no pivot case.
Important: First experiment for Light = first experiment for Dark except for Illumination itself. Differs only in pivot. Minimal pair.

What Accomplished
A set of well-spaced minimal pairs, differing only in the pivot. Suggests answers for the first two questions:
Is any single factor so important that its presence determines the outcome regardless of the other contexts? (e.g. Light in context X is repressive compared with Dark in context Y for all X, Y). Pivot design shows no for this biological system.
Is a factor important enough that it has an effect for any particular context? (e.g. for all X, Light in context X is repressive compared with Dark in X). Pivot design suggests yes for this biological system.

Half-pivot: Light against a fixed background

LANE  ILLUMIN  STARVE  CARBON  NO3NH4  GLU  GLN
1     LIGHT    Y       0       L       H    H
2     LIGHT    N       L       0       0    H
3     LIGHT    N       0       0       H    0
4     LIGHT    Y       L       L       0    0
5     LIGHT    Y       L       0       H    H
6     LIGHT    N       0       L       0    0
7     DARK     N       0       0       0    0

Exactly the same as LIGHT tests but with one added background. Allows us to create a circuit (binary in this case because inputs are binary).

Adaptive Experimental Design along Borders
Because combinatorial design explores only a (well spread) subset of possibilities, the apparent effects of factors may depend on other factors that haven't been explored.
After constructing boolean circuits, software suggests experiments to clarify the border between inductive and non-inductive, e.g., Starvation_Y, Carbon_N, NH4NO3_N, GLU_Y, GLN_Y

Combinatorial Design vs. Random Sampling
Practical Question: Adaptive Combinatorial Design is a sampling method. How well does it work compared to random sampling?
Simulation experiment: Create simulated data with a single important attribute and microarray-quality noise (factor of 2 to 5 change in biological replicates).
Empirical Conclusions: Random and Adaptive Combinatorial Design did equally well at identifying the important attribute (t-test). However, Random falsely identified other attributes as important about 4 times more often than Adaptive Combinatorial Design. (see cdtables.doc)

Steps of Methodology
No Pivot: Small set of well-spaced experiments to find most important influences on a target. Also, a good method in genomics applications to find clusters because of good spacing. Small? 10 inputs with 4 values gives a no pivot of about 30 experiments.
Pivot: Can find out whether an input is likely to have an effect regardless of context (for all X, for all Y) or for every context (for all contexts X).
Half-pivot: For comparison with a fixed background.
Border Adaptation: Study differences between repressive case and non-repressive one to ...

Applicable to Many other Situations
Tuning an Algorithm Repeatedly and Online: Can't explore the whole parameter search space each time, so use combinatorial design to sample the search space and then use border adaptation to fine-tune the result.
Regression testing: Given many input parameters to software, can't test them all. This is a disciplined approach. No pivot idea only.

Inspiration of this approach
Combinatorial design: Inspired by work in software testing by David Cohen, Siddhartha Dalal, Michael Fredman and Gardner Patton at Bellcore/Telcordia.
Their problem: how to test a good set of inputs to a program to discover whether there are any bugs. Not program coverage, but input coverage. Not all input combinations, but all combinations of every pair of input variables (no pivot design).
Hypothesis: every input combination should give ...

How This Could Help You
Use this approach:
Pose an experimental setting of interest to you. (Names of input variables, possible values).
Describe a no pivot design for your setting.
Based on that result, describe a pivot design to isolate the exact effect of a specific input.
Get a good sense of whether the pivot is decisive by itself or has a consistent strong influence.
Theoretical Guarantee: For k-factor design, if there is a set of k values that dominates the result, you will find it.

Safecracking Solution (X = Don't care)
BCABCABCAB
7: CACBACBACB
8: CBACBACBAC
9: CCBACBACBA
10: XAAABBBCCC
11: XAAACCCBBB
12: XBBBAAACCC
13: XBBBCCCAAA
14: XCCCAAABBB
15: XCCCBBBAAA

Further Reading: combinatorial design widely used in biology
Universal DNA tag systems: a combinatorial design scheme. Amir Ben-Dor, Richard Karp, Benno Schwikowski and Zohar Yakhini. Recomb 2000.
Experimental design for gene expression microarrays. Kerr and Churchill (2001), Biostatistics, 2:183-201.
Normal: N microarrays will be used to test N conditions against a common reference. Authors propose to use the colors to compare N conditions against one another in a looping fashion: 1 with 2, 2 with 3, ..., n with 1.
Result: deconvolves certain effects (e.g. binding affinity of reference dye).

Overview
Simple Primer in Genomics
Use of form and function for prediction.
Activist Data Mining
Visualization tool for multi-experiment data
A gallery of challenging problems.
Lessons from successful collaborations.

Sungear Multifactor Visualization
Joint work with Rodrigo Gutiérrez, Manny Katari, Brad Paley, Chris Poultney, and Gloria Coruzzi

Typical Genomic Questions
Multiple experiments (multiple time points, multiple conditions), many GO categories, or other features of genes: want to know when certain GO categories are highly represented.
Many species: want to know which genes have presence in many species and perhaps which GO categories.

Accepted Way to Compare Results: Venn Diagram
[Figure: Venn diagram of A, B, C, with Intersect(A, B) marked.]

Venn Diagram Doesn't Work Beyond 3, e.g. Intersect(D, B)
[Figure: Venn diagram of A, B, C, D, with Intersect(A, B) marked.]

Computational Desires
Simple, responsive interface
Visualize lots of experiments (more than 3)
Many ways to query
Many different data representations

Sungear Design
Generalizes Venn diagrams.
Visual outline is a polygon having anchors on borders and gears in the interior. Each gear points to its associated anchors.
Linked views to hierarchies, lists, and graphs, so can simultaneously update data depending on user queries (selection events).

Sungear Principle
Sungear is stupid.
Doesn't care what kind of data it is representing, though there is built-in support for genes (because of links to GO and to cytoscape).
Basic Sungear representation could be used to describe anything from yachting gear to demographics.

[Figure captions:]
Genes that respond to N in leaves and C in roots form the largest group (cnlo)
PII and other genes involved in N-metabolism are among these 566
HYPOTHESIS: Most of the regulated genes are involved in metabolism.
This is not the case for other processes
Genes that are regulated by N & L together
Gene networks of NL-responsive genes

Demos
Growth stages showing when genes are transcribed (N-reg AtGenExpSeedDev)
Blast comparison of Arabidopsis against most fully sequenced organisms.
Nitrogen, carbon, light, organ showing regulation: relative expression (cnlo)
Interspecies comparisons that might show which kinds of genes are missing in gymnosperms, for example (Vicogenta)

Sungear Sweet Spot
Collections of data about some common entity (genes, people, goods, whatever) whose interaction you want to visualize.
Biologists like the visual intuition this gives them: size matters, position matters.

Overview
Simple Primer in Genomics
Use of form and function for prediction.
Activist Data Mining
Visualization tool for multi-experiment data
A gallery of challenging problems.
Lessons from successful collaborations.

Overview of Problems
Data integration: integrate data from different labs.
Provenance: remember where conclusions come from.
Faster execution of primitives: alignment, folding
Data Mining/Machine Learning: networks

Data Integration
Different labs produce data having different attribute names, different semantics and various statistical thresholds.
Putting this all together is now done by hand (i.e. perl hacking).
A really good attack on this problem would be extremely useful.

Data Integration Issues
Keep raw data: Gene expression is a floating point number. What do you consider induced?
Keep meta-data: Chips for measurement are different. Growth conditions.
Some Researchers: Zoe Lacroix, Louiqa Raschid; Phokion Kolaitis and WangChiew Tan; Peter Buneman; Phil Bernstein; Alon Halevy

Provenance: where does the data come from?
Fact: some scientists/labs/equipment are better than others. I wouldn't trust SNP (human genetic variation) data from some labs ever anon
If your data mining model is based on bad data, it will produce garbage even if your algorithm is good.

Provenance Issues
Record where data came from.
Record where conclusions came from (truth maintenance).
If data or your belief about data changes, then do something.
Link with statistical quality control.

Curation Challenge (related)
Databases are updated by people.
Often information is copied from one database to another, sometimes selectively. Reasons are in the curator's mind.
Suppose you record every change made. Can you infer intent of the changes? Could you undo all suspicious changes?
Dave Lomet's immortalDB.

Faster Execution of Primitives
Remember: the key resource is people time, so biologists' eyes glaze over at logarithmic time complexity factors that save them 10 seconds.
For some apps, e.g. phylogeny reconstruction and molecular modeling, computation is the bottleneck.
Goal: to handle more species or larger molecules.

Fast Execution: issues
Many of the problems require lots of simulation, so be ready for numerical problems, e.g. multi-body problems done right.
Other problems are NP-complete (e.g. multiple alignment), so find out biologists' tolerance for sub-optimal results.
Ref: AntiClustAl: Multiple sequence alignment by antipole clustering, by DiPietro et al. 2005, Data Mining in Bioinformatics

Data Mining/Machine Learning Example
Network inference: Many edges connect genes (metabolic, protein-protein, coregulation).
Can one infer new edges from existing ones?
Can one predict the outcome of an experiment/mutation?

Multinetwork: differently labeled edges between nodes
[Figure: nodes Gene A, Gene B, Gene C, Gene D.]

Transcription factor (TF) binding: very important
A single TF binds to a single cis element (motif)
Source: U.S. Department of Energy Genomics (http://doegenomestolife.org)

Properties of Edges
Edges have a numerical value, e.g. strength of correlation.
Some edges are directed. Others are not.
Seek rules of the form: if two genes g1 and g2 are coregulated, g1 is a transcription factor and there is a protein-protein interaction between them, then g1 regulates g2. (just an example)

Posed as a Machine Learning Problem
Given metabolic edges (directed), protein-protein interaction (undirected), list of transcription factors, coregulations, expression data for transcription factor knockouts, (1) infer new edges; (2) figure out the effect of knocking out a transcription factor.
DB Researchers: Jagadish, Frank Olken, Mona Singh

Other Database researchers in biology
Example: Raymond Ng has analyzed gene expression for cancer.
Jiawei Han has applied machine learning to graph matching.
Buneman's group (Penn and now Edinburgh) is interested in archiving.
Christos Faloutsos: anything having to do with signal processing.

Metabolic network for Arabidopsis: problem is messy
Nitrate -> (nitrate transporters) -> Nitrate -> (nitrate reductase) -> Nitrite -> (nitrite reductase) -> Ammonium

[Figure labels: Expression and/or growth data; Feedback loop through experiments; Metabolic & Developmental networks; TIME; Treatment and/or Developmental; Organs; Cells; Aim1; Aim2; Machine learning algorithm; Adaptive adjustment of evidence weights and scores; Gene Networks; Aim3; Expression and/or growth data (mutant vs. wild type); Inferred regulator; Metabolic regulator.]
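
The example rule from the Properties of Edges slide (coregulated, g1 is a transcription factor, and a protein-protein interaction between them, so g1 regulates g2) can be applied to a small multinetwork as follows. This is only a sketch of rule application on made-up data: the gene names, edge sets, and the function infer_regulation are hypothetical, and the slides present the rule itself as "just an example" rather than as an established inference method.

```python
from itertools import permutations
from typing import FrozenSet, Set, Tuple

Edge = FrozenSet[str]  # an undirected edge between two genes


def edge(g1: str, g2: str) -> Edge:
    return frozenset((g1, g2))


def infer_regulation(coregulated: Set[Edge],
                     ppi: Set[Edge],
                     transcription_factors: Set[str]) -> Set[Tuple[str, str]]:
    """Apply the example rule: for genes joined by both a coregulation edge
    and a protein-protein interaction edge, emit a directed edge
    (g1 regulates g2) whenever g1 is a transcription factor."""
    inferred = set()
    for pair in coregulated & ppi:
        if len(pair) != 2:  # ignore self-loops
            continue
        for g1, g2 in permutations(sorted(pair), 2):
            if g1 in transcription_factors:
                inferred.add((g1, g2))
    return inferred


# Hypothetical toy multinetwork over four genes.
coregulated = {edge("A", "B"), edge("B", "C"), edge("A", "D")}
ppi = {edge("A", "B"), edge("B", "C")}
transcription_factors = {"A"}

new_edges = infer_regulation(coregulated, ppi, transcription_factors)
print(new_edges)
```

Only the A-B pair satisfies all three conditions: B-C has no transcription factor, and A-D has no protein-protein interaction.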

Simple Primer in Genomics
Use of form and function for prediction.
Activist Data Mining
Visualization tool for multi-experiment data
A gallery of challenging problems.
Lessons from successful collaborations.

Lesson 1: caring about data
Biologists look at data. Computer scientists (even database people) don't.
Data is noisy. Good news: qualitative results are enough. E.g., Ibuprofen fights inflammation.
Ideal: experimental results + algorithms -> testable, likely hypotheses.

Lesson 2: Hard to get assumptions right
Computer scientists solve puzzles. Solve problems with as little regard for semantics as possible (sorting, databases, statistical packages).
Paradigm: design an algorithm, scientists implement it and celebrate the creator. Works for physics, but seldom for biology; biologists trust experiments.

Lessons from a Fruitful Collaboration III
Remember that people time is important. Computer scientists should give fast turnaround (two days or less) on simple tasks.
Interesting tasks are not far behind. Reserve those for your graduate students.

Lessons from a Fruitful Collaboration IV
Meet every week.
Don't be afraid to be ignorant.
Computer scientists should get involved in experimental design (be activist).

Closing words
Fun problem. Fun people. Great food.
You have a lot to offer, but the problems don't come neatly packaged.
When you do discover a problem, keep your solution simple: fewer assumptions.
Remember to be lucky: the experiments have to work out for biologists to