Gene Annotation and GO
SPH 247 Statistical Analysis of Laboratory Data
May 14, 2010

Slide Sources
www.geneontology.org
- Jane Lomax (EBI)
- David Hill (MGI)
- Pascale Gaudet (dictyBase)
- Stacia Engel (SGD)
- Rama Balakrishnan (SGD)

The Gene Ontologies
A Common Language for Annotation of Genes from Yeast, Flies and Mice and Plants and Worms and Humans and anything else!

Gene Ontology Objectives
- GO represents categories used to classify specific parts of our biological knowledge:
  - Biological Process
  - Molecular Function
  - Cellular Component
- GO develops a common language applicable to any organism
- GO terms can be used to annotate gene products from any species, allowing comparison of information across species

Expansion of Sequence Info

Entering the Genome Sequencing Era
Eukaryotic Genome Sequences

Organism                        Year   Genome Size (Mb)   # Genes
Yeast (S. cerevisiae)           1996   12                 6,000
Worm (C. elegans)               1998   97                 19,100
Fly (D. melanogaster)           2000   120                13,600
Plant (A. thaliana)             2001   125                25,500
Human (H. sapiens, 1st Draft)   2001   ~3000              ~35,000

Baldauf et al. (2000) Science 290:972

Comparison of sequences from 4 organisms
MCM3, MCM2, CDC46/MCM5, CDC47/MCM7, CDC54/MCM4, MCM6
These proteins form a hexamer in the species that have been examined.

http://www.geneontology.org/

Outline of Topics
- Introduction to the Gene Ontologies (GO)
- Annotations to GO terms
- GO Tools
- Applications of GO

What is Ontology?
1606, 1700s
- Dictionary: A branch of metaphysics concerned with the nature and relations of being.
- Barry Smith: The science of what is, of the kinds and structures of objects, properties, events, processes and relations in every area of reality.

So what does that mean?
- From a practical view, ontology is the representation of something we know about.
- Ontologies consist of a representation of things that are detectable or directly observable, and the relationships between those things.

Srinija Srinivasan, Chief Ontologist, Yahoo!
The ontology. Dividing human knowledge into a clean set of categories is a lot like trying to figure out where to find that suspenseful black comedy at your corner video store. Questions inevitably come up, like are Movies part of Art or Entertainment? (Yahoo! lists them under the latter.)
- Wired Magazine, May 1996

The 3 Gene Ontologies
- Molecular Function = elemental activity/task
  - the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity
- Biological Process = biological goal or objective
  - broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions
- Cellular Component = location or complex
  - subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme

Example: Gene Product = hammer

Function (what)           Process (why)
Drive nail (into wood)    Carpentry
Drive stake (into soil)   Gardening
Smash roach               Pest Control
Clown's juggling object   Entertainment

Biological Examples
[Figure: examples of Biological Process, Molecular Function, and Cellular Component]

Terms, Definitions, IDs
term: MAPKKK cascade (mating sensu Saccharomyces)
goid: GO:0007244
definition: OBSOLETE. MAPKKK cascade involved in transduction of mating pheromone signal, as described in Saccharomyces.
definition_reference: PMID:9561267
comment: This term was made obsolete because it is a gene product specific term. To update annotations, use the biological process term 'signal transduction during conjugation with cellular fusion ; GO:0000750'.

Ontology Includes:
1. A vocabulary of terms (names for concepts)
2. Definitions
3. Defined logical relationships to each other

[Diagram: chromosome, organelle, nucleus, nuclear chromosome, [other types of chromosomes], [other organelles]]

Ontology Structure
Ontologies can be represented as graphs, where the nodes are connected by edges
- Nodes = terms in the ontology
- Edges = relationships between the concepts
[Diagram: nodes connected by edges]

Parent-Child Relationships
[Diagram: Chromosome; Cytoplasmic chromosome; Mitochondrial chromosome; Nuclear chromosome; Plastid chromosome]
A child is a subset or an instance of its parent's elements

Ontology Structure
- The Gene Ontology is structured as a hierarchical directed acyclic graph (DAG)
- Terms can have more than one parent and zero, one or more children
- Terms are linked by two relationships:
  - is-a (is_a)
  - part-of (part_of)

Directed Acyclic Graph (DAG)
[Diagram: chromosome, organelle, nucleus, nuclear chromosome, [other types of chromosomes], [other organelles], linked by is-a and part-of edges]

http://www.ebi.ac.uk/ego
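
The DAG idea (terms with more than one parent, linked by is-a and part-of) can be sketched in a few lines of Python. The three edges below are assumed for illustration, since the slide diagram itself did not survive; the names EDGES, build_parents and ancestors are hypothetical and not part of any GO tool.

```python
from collections import defaultdict

# Hypothetical edges, assumed for illustration: (child, relationship, parent).
EDGES = [
    ("nuclear chromosome", "is_a", "chromosome"),
    ("nuclear chromosome", "part_of", "nucleus"),
    ("nucleus", "is_a", "organelle"),
]


def build_parents(edges):
    """Map each term to its list of (relationship, parent) pairs."""
    parents = defaultdict(list)
    for child, relationship, parent in edges:
        parents[child].append((relationship, parent))
    return parents


def ancestors(term, parents):
    """Collect every term reachable by following parent edges upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


PARENTS = build_parents(EDGES)
print(sorted(ancestors("nuclear chromosome", PARENTS)))
```

Because "nuclear chromosome" has two parents here, a gene annotated to it is implicitly annotated to both branches, which is what lets annotations be rolled up to broader terms.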

Evidence Codes for GO Annotations
http://www.geneontology.org/GO.evidence.html

Evidence codes
Indicate the type of evidence in the cited source* that supports the association between the gene product and the GO term
*capturing information

Types of evidence codes
- Experimental codes - IDA, IMP, IGI, IPI, IEP
- Computational codes - ISS, IEA, RCA, IGC
- Author statement - TAS, NAS
- Other codes - IC, ND
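
As a small illustration, the four groups of evidence codes can be held in a lookup table and used to filter annotations. This is only a sketch: the annotation tuples and the names category_of and drop_electronic are made up for the example.

```python
# Evidence code groups as listed on the slide.
EVIDENCE_CATEGORIES = {
    "Experimental": ["IDA", "IMP", "IGI", "IPI", "IEP"],
    "Computational": ["ISS", "IEA", "RCA", "IGC"],
    "Author statement": ["TAS", "NAS"],
    "Other": ["IC", "ND"],
}

# Reverse lookup: evidence code -> category name.
CODE_TO_CATEGORY = {
    code: category
    for category, codes in EVIDENCE_CATEGORIES.items()
    for code in codes
}


def category_of(code):
    """Return the category for an evidence code, or 'Unknown'."""
    return CODE_TO_CATEGORY.get(code.upper(), "Unknown")


def drop_electronic(annotations):
    """Keep only annotations whose evidence code is not IEA."""
    return [a for a in annotations if a[2].upper() != "IEA"]


# Hypothetical (gene product, GO ID, evidence code) annotations.
EXAMPLE = [
    ("geneA", "GO:0006915", "IDA"),
    ("geneB", "GO:0006915", "IEA"),
    ("geneC", "GO:0006094", "TAS"),
]
for gene, go_id, code in drop_electronic(EXAMPLE):
    print(gene, go_id, code, category_of(code))
```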

IDA
Inferred from Direct Assay
- direct assay for the function, process, or component indicated by the GO term
- Enzyme assays
- In vitro reconstitution (e.g. transcription)
- Immunofluorescence (for cellular component)
- Cell fractionation (for cellular component)

IMP
Inferred from Mutant Phenotype
- variations or changes such as mutations or abnormal levels of a single gene product
- Gene/protein mutation
- Deletion mutant
- RNAi experiments
- Specific protein inhibitors
- Allelic variation

IGI
Inferred from Genetic Interaction
- Any combination of alterations in the sequence or expression of more than one gene or gene product
- Traditional genetic screens - Suppressors, synthetic lethals
- Functional complementation
- Rescue experiments
- An entry in the "with" column is recommended

IPI
Inferred from Physical Interaction
- Any physical interaction between a gene product and another molecule, ion, or complex
- 2-hybrid interactions
- Co-purification
- Co-immunoprecipitation
- Protein binding experiments
- An entry in the "with" column is recommended

IEP
Inferred from Expression Pattern
- Timing or location of expression of a gene
- Transcript levels: Northerns, microarray
- Exercise caution when interpreting expression results

ISS
Inferred from Sequence or structural Similarity
- Sequence alignment, structure comparison, or evaluation of sequence features such as composition
- Sequence similarity
- Recognized domains/overall architecture of protein
- An entry in the "with" column is recommended

RCA
Inferred from Reviewed Computational Analysis
- non-sequence-based computational method
- large-scale experiments
  - genome-wide two-hybrid
  - genome-wide synthetic interactions
- integration of large-scale datasets of several types
- text-based computation (text mining)

IGC
Inferred from Genomic Context
- Chromosomal position
- Most often used for Bacteria - operons
- Direct evidence for a gene being involved in a process is minimal, but for surrounding genes in the operon, the evidence is well-established

IEA
Inferred from Electronic Annotation
- depend directly on computation or automated transfer of annotations from a database
- Hits from BLAST searches
- InterPro2GO mappings
- No manual checking
- Entry in "with" column is allowed (ex. sequence ID)

TAS
Traceable Author Statement
- publication used to support an annotation doesn't show the evidence
- Review article
- Would be better to track down cited reference and use an experimental code

NAS
Non-traceable Author Statement
- Statements in a paper that cannot be traced to another publication

ND
No biological Data available
- Can find no information supporting an annotation to any term
- Indicate that a curator has looked for info but found nothing
- Place holder
- Date

IC
Inferred by Curator
- annotation is not supported by evidence, but can be reasonably inferred from other GO annotations for which evidence is available
- ex. evidence = transcription factor (function), IC = nucleus (component)

Choosing the correct evidence code
Ask yourself: What is the experiment that was done?

http://www.geneontology.org/GO.evidence.html

Using the Gene Ontology (GO) for Expression Analysis

What is the Gene Ontology?
Set of biological phrases (terms) which are applied to genes:
- protein kinase
- apoptosis
- membrane

What is the Gene Ontology?
- Genes are linked, or associated, with GO terms by trained curators at genome databases
  - known as gene associations or GO annotations
- Some GO annotations are created automatically

GO annotations
[Diagram: GO database; gene -> GO term; associated genes; genome and protein databases]

What is the Gene Ontology?
Allows biologists to make inferences across large numbers of genes without researching each one individually

Eisen, Michael B. et al. (1998) Proc. Natl. Acad. Sci. USA 95, 14863-14868
Copyright 1998 by the National Academy of Sciences

GO structure
- GO isn't just a flat list of biological terms
- terms are related within a hierarchy

GO structure
[Diagram: gene A]

GO structure
- This means genes can be grouped according to user-defined levels
- Allows broad overview of gene set or genome

How does GO work?
- GO is species independent
  - some terms, especially lower-level, detailed terms, may be specific to a certain group, e.g. photosynthesis
  - But when collapsed up to the higher levels, terms are not dependent on species

How does GO work?
What information might we want to capture about a gene product?
- What does the gene product do?
- Where and when does it act?
- Why does it perform these activities?

GO structure
GO terms divided into three parts:
- cellular component
- molecular function
- biological process

Cellular Component
- where a gene product acts
- Enzyme complexes in the component ontology refer to places, not activities.

Molecular Function
- activities or jobs of a gene product
- examples: glucose-6-phosphate isomerase activity; insulin binding; insulin receptor activity; drug transporter activity
- A gene product may have several functions; a function term refers to a single reaction or activity, not a gene product.
- Sets of functions make up a biological process.

Biological Process
- a commonly recognized series of events
- examples: cell division; transcription; regulation of gluconeogenesis; limb development; courtship behavior

Ontology Structure
Terms are linked by two relationships:
- is-a
- part-of

Ontology Structure
[Diagram: cell, membrane, chloroplast, mitochondrial membrane, chloroplast membrane, linked by is-a and part-of edges]

Ontology Structure
- Ontologies are structured as a hierarchical directed acyclic graph (DAG)
- Terms can have more than one parent and zero, one or more children

Ontology Structure
Directed Acyclic Graph (DAG) - multiple parentage allowed
[Diagram: cell, membrane, chloroplast, mitochondrial membrane, chloroplast membrane]

Anatomy of a GO term
id: GO:0006094
name: gluconeogenesis
namespace: process
def: The formation of glucose from noncarbohydrate precursors, such as pyruvate, amino acids and glycerol. [http://cancerweb.ncl.ac.uk/omd/index.html]
exact_synonym: glucose biosynthesis
xref_analog: MetaCyc:GLUCONEO-PWY
is_a: GO:0006006
is_a: GO:0006092
Fields, in order: unique GO ID, term name, ontology, definition, synonym, database ref, parentage

GO tools
- GO resources are freely available to anyone to use without restriction
  - Includes the ontologies, gene associations and tools developed by GO
- Other groups have used GO to create tools for many purposes: http://www.geneontology.org/GO.tools
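
Since the ontologies are freely available as text, a term stanza like the gluconeogenesis one under "Anatomy of a GO term" is easy to read programmatically. The sketch below only splits "tag: value" lines for a single stanza; it is not a full OBO parser, and parse_stanza is a hypothetical helper name.

```python
# The term stanza as shown on the slide.
STANZA = """\
id: GO:0006094
name: gluconeogenesis
namespace: process
def: The formation of glucose from noncarbohydrate precursors, such as pyruvate, amino acids and glycerol. [http://cancerweb.ncl.ac.uk/omd/index.html]
exact_synonym: glucose biosynthesis
xref_analog: MetaCyc:GLUCONEO-PWY
is_a: GO:0006006
is_a: GO:0006092
"""


def parse_stanza(text):
    """Split 'tag: value' lines into a dict of lists (tags may repeat)."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        tag, _, value = line.partition(": ")
        fields.setdefault(tag.strip(), []).append(value.strip())
    return fields


term = parse_stanza(STANZA)
print(term["id"][0], term["name"][0])
print("parents:", term["is_a"])
```

Values are kept as lists because a tag such as is_a can occur more than once, which is how multiple parentage shows up in the file.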

GO tools
Affymetrix also provides a Gene Ontology Mining Tool as part of their NetAffx Analysis Center, which returns GO terms for probe sets

GO tools
Many tools exist that use GO to find common biological functions from a list of genes: http://www.geneontology.org/GO.tools.microarray.shtml

GO tools
Most of these tools work in a similar way:
- input a gene list and a subset of interesting genes
- tool shows which GO categories have most interesting genes associated with them, i.e. which categories are enriched for interesting genes
- tool provides a statistical measure to determine whether enrichment is significant

Microarray process
- Treat samples
- Collect mRNA
- Label
- Hybridize
- Scan
- Normalize
- Select differentially regulated genes
- Understand the biological phenomena involved

Traditional analysis
[Figure: Gene 1 through Gene 100, each listed with its own annotations, for example Gene 1: Apoptosis, Cell-cell signaling, Protein phosphorylation, Mitosis; Gene 2: Growth control, Mitosis, Oncogenesis, Protein phosphorylation; Gene 100: Positive ctrl. of cell prolif, Mitosis, Oncogenesis, Glucose transport. Annotations shown for Gene 3 and Gene 4 include Growth control, Mitosis, Nervous system, Oncogenesis, Pregnancy, Protein phosphorylation.]

Traditional analysis
- gene by gene basis
- requires literature searching
- time-consuming

Using GO annotations
But by using GO annotations, this work has already been done for you!
GO:0006915 : apoptosis

Grouping by process
- Apoptosis: Gene 1, Gene 53
- Positive ctrl. of cell prolif.: Gene 7, Gene 3, Gene 12
- Mitosis: Gene 2, Gene 5, Gene 45, Gene 7, Gene 35
- Glucose transport: Gene 7, Gene 3, Gene 6
- Growth: Gene 5, Gene 2, Gene 6
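
Grouping by process is just an inversion of the gene -> term annotations. A minimal sketch, with a made-up annotation table (GENE_ANNOTATIONS and group_by_term are hypothetical names, and the gene lists are illustrative rather than copied from the slide):

```python
from collections import defaultdict

# Hypothetical gene -> GO term annotations, for illustration only.
GENE_ANNOTATIONS = {
    "Gene 1": ["apoptosis"],
    "Gene 2": ["mitosis", "growth"],
    "Gene 5": ["mitosis", "growth"],
    "Gene 7": ["mitosis", "glucose transport"],
}


def group_by_term(gene_annotations):
    """Invert gene -> terms into term -> genes."""
    groups = defaultdict(list)
    for gene, terms in gene_annotations.items():
        for term in terms:
            groups[term].append(gene)
    return dict(groups)


for term, genes in sorted(group_by_term(GENE_ANNOTATIONS).items()):
    print(f"{term}: {', '.join(genes)}")
```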

GO for microarray analysis
- Annotations give function label to genes
- Ask meaningful questions of microarray data, e.g. genes involved in the same process, same/different expression patterns?

Using GO in practice
- statistical measure: how likely your differentially regulated genes fall into that category by chance
- microarray: 1000 genes
- experiment: 100 genes differentially regulated

[Bar chart, axis 0 to 80: differentially regulated genes per process]
Process               Differentially regulated genes
mitosis               80/100
apoptosis             40/100
p. ctrl. cell prol.   30/100
glucose transp.       20/100

Using GO in practice
However, when you look at the distribution of all genes on the microarray:

Process               Genes on array   # genes expected in 100 random genes   occurred
mitosis               800/1000         80                                      80
apoptosis             400/1000         40                                      40
p. ctrl. cell prol.   100/1000         10                                      30
glucose transp.       50/1000          5                                       20
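
One common choice for the "statistical measure" is a one-sided hypergeometric test: the chance of seeing at least the observed number of category genes when 100 genes are drawn at random from the 1000 on the array. The slides do not name the test, so this is a sketch under that assumption; enrichment_p_value is a hypothetical helper, while the counts are the ones in the table.

```python
from math import comb


def enrichment_p_value(n_array, n_category, n_selected, n_observed):
    """P(X >= n_observed) for a hypergeometric draw without replacement."""
    upper = min(n_category, n_selected)
    numerator = sum(
        comb(n_category, k) * comb(n_array - n_category, n_selected - k)
        for k in range(n_observed, upper + 1)
    )
    return numerator / comb(n_array, n_selected)


# (genes on array in category, differentially regulated genes in category)
CATEGORIES = {
    "mitosis": (800, 80),
    "apoptosis": (400, 40),
    "p. ctrl. cell prol.": (100, 30),
    "glucose transp.": (50, 20),
}

for name, (on_array, observed) in CATEGORIES.items():
    expected = 100 * on_array / 1000
    p = enrichment_p_value(1000, on_array, 100, observed)
    print(f"{name:20s} expected={expected:5.1f} observed={observed:3d} p={p:.3g}")
```

Mitosis and apoptosis occur exactly as often as expected, so their p-values are unremarkable; the two smaller categories are the ones that stand out.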

AmiGO
- Web application that reads from the GO Database (MySQL)
- Allows the user to:
  - browse the ontologies
  - view annotations from various species
  - compare sequences (GOst)
- Ontologies are loaded into the database from the gene_ontology.obo file
- Annotations are loaded from the gene_association files submitted by the various annotating groups
  - Only Non-IEA annotations are loaded

AmiGO
http://www.godatabase.org

Some basics
- Node has children, can be clicked to view children
- Node has been opened, can be clicked to close
- Leaf node or no children
- Is_a relationship
- Part_of relationship
- Pie chart summary of the numbers of gene products associated to any immediate descendants of this term in the tree

Searching the Ontologies

Term Tree View
Click on the term name to view term details and annotations

Term details
- links to representations of this term in other databases
- Annotations from various species

Annotations associated with a term
Annotation data are from the gene_associations file submitted by the annotating groups

Searching by gene product name

Advanced search

GOST - Gene Ontology blaST
- Blast a protein sequence against all gene products that have a GO annotation
- Can be accessed from the AmiGO entry page (front page)
- GOst can also be accessed from the annotations section

Analysis of Gene Expression Data
- The usual sequence of events is to conduct an experiment in which biological samples under different conditions are analyzed for gene expression.
- Then the data are analyzed to determine differentially expressed genes.
- Then the results can be analyzed for biological relevance.

[Diagram: Biological Knowledge, Expression Experiment, Statistical Analysis, Biological Interpretation]

The Missing Link
[Diagram: Biological Knowledge, Expression Experiment, Statistical Analysis, Biological Interpretation]

Gene Set Enrichment Analysis (GSEA)
- Given a set of genes (e.g., zinc finger proteins), this defines a set of probes on the array.
- Order the probes by smallest to largest change (we use p-value, not fold change).
- Define a cutoff for significance (e.g., FDR p-value < .10).
- Are there more of the probes in the group than expected?

P-value 0.0947 (cells show count, then row % / column %)

                  Not in gene set   In gene set   Total
Not significant   30 (91%/75%)      3 (9%/38%)    33
Significant       10 (67%/25%)      5 (33%/62%)   15
Total             40                8             48
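
The slide does not say which test produced 0.0947. A Pearson chi-square test with Yates' continuity correction on the four counts gives a value very close to it, so that is the assumption made in this sketch (yates_chi_square_2x2 is a hypothetical helper name).

```python
import math


def yates_chi_square_2x2(a, b, c, d):
    """Chi-square test with Yates' correction for the table [[a, b], [c, d]].

    Returns (chi-square statistic, p-value on 1 degree of freedom).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    corrected = max(abs(a * d - b * c) - n / 2, 0.0)
    chi2 = n * corrected ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of chi-square with 1 df, via the complementary error function.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Rows: not significant, significant. Columns: not in gene set, in gene set.
chi2, p_value = yates_chi_square_2x2(30, 3, 10, 5)
print(f"chi-square = {chi2:.3f}, p = {p_value:.4f}")
```

With only 8 probes in the gene set the expected counts are small, so an exact test would also be a reasonable choice here and would give a somewhat different p-value.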

GSEA for all cutoffs
- If one does GSEA for all possible cutoffs, and then takes the best result, this is equivalent to an easily performed statistical test called the Kolmogorov-Smirnov test for the genes in the set vs. the genes not in the set.
- Programs on www.broad.mit.edu/gsea/
- However this requires a single summary number for each gene, such as a p-value.
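
To make the Kolmogorov-Smirnov idea concrete, here is a minimal sketch of the two-sample KS statistic: the largest gap between the empirical distribution of the per-gene summary number inside the set and outside it. It computes only the statistic, not its p-value, and it is not the Broad GSEA program; the p-values in the example are made up.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest_gap = 0.0
    for value in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical per-gene p-values, for illustration only.
in_set = [0.001, 0.004, 0.02, 0.03, 0.40]
not_in_set = [0.05, 0.21, 0.33, 0.48, 0.62, 0.77, 0.90]
print(ks_statistic(in_set, not_in_set))
```

Each candidate cutoff corresponds to one value at which the two distributions are compared, which is why scanning all cutoffs and keeping the largest gap amounts to the KS statistic.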

An Example Study
- This study examined the effects of relatively low-dose radiation exposure in vivo in humans with precisely calibrated dose.
- Low LET ionizing radiation is a model of cellular toxicity in which the insult can be given at a single time point with no residual external toxic content as there would be for metals and many long-lived organics.
- The study was done in the clinic/lab of Zelanna Goldberg

The study design
- Men were treated for prostate cancer with daily fractions of 2 Gy for a total dose to the prostate of 74 Gy.
- Parts of the abdomen outside the field were exposed to lower doses.
- These could be precisely quantitated by computer simulation and direct measurements by MOSFETs.
- A 3mm biopsy was taken of abdominal skin before the first exposure, then three more were taken three hours after the first exposure at sites with doses of 1, 10, and 100 cGy.
- RNA was extracted and hybridized on Affymetrix HG U133 Plus 2.0 whole genome arrays.
- The question asked was whether a particular gene had a linear dose response, or a response that was linear in (modified) log dose (0, 1, 10, 100 -> -1, 0, 1, 2).
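
A per-probe test for a response that is linear in modified log dose can be sketched as an ordinary least-squares slope t-statistic on the four biopsies. The sketch assumes the coding -1, 0, 1, 2 for doses 0, 1, 10, 100 cGy; the expression values and the name slope_t_statistic are made up for illustration, and this is not necessarily the exact test used in the study.

```python
import math

# Assumed coding of dose (cGy) as modified log dose.
MODIFIED_LOG_DOSE = {0: -1, 1: 0, 10: 1, 100: 2}


def slope_t_statistic(x, y):
    """t-statistic for the slope of a least-squares line (n - 2 degrees of freedom)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    standard_error = math.sqrt(residual_ss / (n - 2) / sxx)
    return slope / standard_error


doses = [0, 1, 10, 100]
coded = [MODIFIED_LOG_DOSE[d] for d in doses]
expression = [7.0, 7.4, 7.5, 8.1]  # hypothetical log expression, one probe set
print(round(slope_t_statistic(coded, expression), 2))
```

With four points there are only 2 degrees of freedom, so even a fairly large t-statistic for one probe set in one patient is weak evidence by itself.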

Why is this difficult?
- For a single patient, there are only 4 data points, so the statistical test is not very powerful.
- With 54,675 probe sets, apparently very significant results can happen by chance, so the barrier for true significance is very high.
- This happens in any small-sized array study.
- There are reasons to believe that there may be inter-individual variability in response to radiation.
- This means that we may not be able to look for results that are highly consistent across individuals.
- One aspect is the timing of transcriptional cascades.
- Another is polymorphisms that lead to similar probes being differentially expressed, but not the same ones.

[Figure: Gene 1, Gene 2, Gene 3 shown in two panels; time label: 3 Hours]

The ToTS Method
- For a gene group like zinc finger proteins, identify the probe sets that relate to that gene group.
- This was done by hand in the Goldberg lab for this study. Ruixiao Lu in my group is working to automate this.
- ToTS = Test of Test Statistics
- For each probe set, conduct a statistical test to try to show a linear dose response. This yields a t-statistic, which may be positive or negative.
- Conduct a statistical test on the group of t-statistics, testing the hypothesis that the average is zero, vs. leaning to up-regulation or leaning to down-regulation.
- This could be a t-test, but in this case we used the Wilcoxon test.
- This can be done one patient at a time, but we can also accommodate inter-individual variability in a study with more than one individual by testing for an overall trend across individuals.
- This is not possible using GSEA, so the ToTS method is more broadly applicable.
- This was published in October, 2005 in Bioinformatics.

Integrity and Consistency
- For zinc finger proteins, there are 799 probe sets and 8 patients for a total of 6,392 different dose-response t-tests.
- The Wilcoxon test that the median of these is zero is rejected with a calculated p-value of 0.00008.
- We randomly sampled 2000 sets of probe sets of size 799, and in no case got a more significant result. We call this an empirical p-value (0.000 in this case).
- This is needed because the 6,392 tests are all from 32 arrays.
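
The two steps above (a Wilcoxon test on the group's t-statistics, then an empirical p-value from randomly drawn probe sets of the same size) can be sketched as follows. This is a simplified illustration, not the published ToTS code: it uses the normal approximation to the signed-rank test without a tie correction to the variance, treats the t-statistics as one pooled list, and runs on simulated data. All names are hypothetical.

```python
import math
import random


def wilcoxon_signed_rank(values):
    """Two-sided signed-rank test that the median is zero (normal approximation)."""
    nonzero = [v for v in values if v != 0]
    n = len(nonzero)
    order = sorted(range(n), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over any run of tied absolute values and give them the average rank.
        while j + 1 < n and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, v in zip(ranks, nonzero) if v > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return z, math.erfc(abs(z) / math.sqrt(2))


def empirical_p_value(t_stats, group, n_random=2000, seed=0):
    """Fraction of random probe sets at least as significant as the group."""
    rng = random.Random(seed)
    _, p_observed = wilcoxon_signed_rank([t_stats[i] for i in group])
    hits = 0
    for _ in range(n_random):
        sample = rng.sample(range(len(t_stats)), len(group))
        _, p_random = wilcoxon_signed_rank([t_stats[i] for i in sample])
        if p_random <= p_observed:
            hits += 1
    return hits / n_random


# Simulated t-statistics: 500 probe sets, the first 50 shifted upward.
sim = random.Random(1)
t_stats = [sim.gauss(2.0 if i < 50 else 0.0, 1.0) for i in range(500)]
group = list(range(50))
print(wilcoxon_signed_rank([t_stats[i] for i in group]))
print(empirical_p_value(t_stats, group, n_random=200))
```

The resampling step matters because the t-statistics come from a small number of arrays and are not independent, so the nominal Wilcoxon p-value can overstate the evidence.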

Patient   Direction   EPV
1         Up          0.125
2         Down        0.044
3         Down        0.001
4         Up          0.000
5         Up          0.003
6         Up          0.000
7         Up          0.000
8         Up          0.039
All       Up          0.000

Major Advantages
- More sensitive to weak or diffuse signals
- Able to cope with inter-individual variability in response
- Conclusions are solidly based statistically
- Can use a variety of types of biological knowledge

Exercise
- Take the top 10 genes from the keratinocyte gene expression study and map their GO annotations using AmiGO or the R tools.
- Are there any obvious common factors?
- Do you think this would work better if you looked at all the significant genes and all the GO annotations, or would this be too difficult?