An Automated System for Deep Proteome Annotation
Gary Van Domselaar, Savita Shrivastava,
Paul Stothard and David S. Wishart
Department of Computing Science and Biological Sciences
University of Alberta
Edmonton AB T6E 2E9

[email protected]
[email protected]

Most biological databases in existence today are
focused on a narrow biological domain. As such, they
are unable to address biological questions outside of
that domain. Researchers wishing to address broad
biological questions must manually compile data from
several biological data sources.
This poster describes our progress on the development
of an automated system for deeply annotating the
proteomes of model organisms (and others), and an
intuitive data mining and data visualization system that
provides detailed information for broad biological queries.
The Deep Annotation system is part of the PENCE
Proteome Analyst project.

Internal Processing

Deeply Annotated Model Organisms
Human, Mouse, E. coli,
D. melanogaster,
S. cerevisiae, C. elegans, etc.

Sequence Data



Prior to the advent of high-throughput sequencing, most biologists would
annotate or characterize genes and proteins manually, one at a time.
However, for genome-scale annotation it is too time-consuming to predict the
properties of each protein sequence or to organize the results of many
prediction tools by hand. Furthermore, due to the enormous volume of
biological information, the sheer number of different data sources, and their
growing heterogeneity, an 'information labyrinth' has been created, in which
one can easily lose one's way in the search for information. Clearly, a
high degree of automation is required to cope with the analysis of the huge
number of sequences generated by genome sequencing projects and to
ensure consistent and reproducible results. This automation could free the
expert to verify and refine these analyses and to follow up new discoveries.
A number of systems have been developed over the past few years that
permit automated genome-wide or proteome-wide annotation, such as The
ENSEMBL system, PEDANT, Magpie, GeneQuiz, and Proteome

Local Databases:

Sequence Data




1. The system accepts proteomic or genomic data. If the user
submits genomic data, gene predictions can be performed with
Glimmer or Genscan.
2. The unannotated sequences enter the Proteome Annotation
3. Sequences are compared against existing deeply annotated
databases. Sequences with sufficient homology inherit
appropriate annotations. Other annotations are computed locally.
4. Annotations unavailable locally are obtained by querying
servers and databases across the Internet.
5. The annotated sequence data is added to the database of
annotated organisms and made available for viewing and

6. Annotations are viewable over the Web using CGView for
circular chromosomes, and LGView for linear chromosomes.
Broad queries can be made across organisms for an arbitrary
subset of available annotations.
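
Steps 3 to 5 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Proteome Analyst implementation: the names Record, identity, annotate, local_modules and remote_lookup are hypothetical, the position-by-position identity score is a toy stand-in for a real homology search, and the 0.9 threshold is an arbitrary placeholder for "sufficient homology".

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Record:
    """A sequence plus whatever annotations have been attached so far."""
    seq_id: str
    sequence: str
    annotations: Dict[str, str] = field(default_factory=dict)


def identity(a: str, b: str) -> float:
    """Toy similarity: fraction of identical positions (equal lengths only)."""
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def annotate(
    record: Record,
    reference: List[Record],
    local_modules: Dict[str, Callable[[str], str]],
    remote_lookup: Callable[[str, str], Optional[str]],
    wanted: List[str],
    threshold: float = 0.9,  # assumed cutoff, not taken from the poster
) -> Record:
    # Step 3a: inherit annotations from the most similar reference entry.
    best = max(reference,
               key=lambda r: identity(record.sequence, r.sequence),
               default=None)
    if best is not None and identity(record.sequence, best.sequence) >= threshold:
        for key, value in best.annotations.items():
            record.annotations.setdefault(key, value)

    # Step 3b: compute remaining annotations with local modules.
    for key in wanted:
        if key not in record.annotations and key in local_modules:
            record.annotations[key] = local_modules[key](record.sequence)

    # Step 4: fall back to external servers for anything still missing.
    for key in wanted:
        if key not in record.annotations:
            value = remote_lookup(key, record.sequence)
            if value is not None:
                record.annotations[key] = value

    # Step 5: add the annotated record to the reference collection.
    reference.append(record)
    return record
```

In this sketch the caller supplies the local modules as a dictionary of functions and the remote lookup as a single callable, so either side can be swapped for a test stub without touching the workflow itself.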


Mining Software

The above-mentioned systems are web-based tools designed to identify
genes, parse data, translate sequences, search against public databases,
identify domains or motifs and perform predictive analyses. Many of these
packages provide user-customizable searches and graphical, hyperlinked
output. The level of interpretation or inference offered by these annotation
systems varies widely, with some offering only raw data in a consolidated
format and others inferring function or ontology through detailed analysis.
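
As a small illustration of the motif identification step mentioned above, the following Python sketch scans a protein sequence for the PROSITE N-glycosylation pattern N-{P}-[ST]-{P}. It is an example of the general technique only; the function name find_n_glycosylation_sites is hypothetical, and nothing here is taken from any of the systems named in this poster.

```python
import re
from typing import List, Tuple

# PROSITE pattern N-{P}-[ST]-{P} written as a regular expression.
# The lookahead wrapper lets overlapping candidate sites all be reported.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_n_glycosylation_sites(sequence: str) -> List[Tuple[int, str]]:
    """Return (1-based position, matched residues) for each candidate site."""
    return [
        (match.start() + 1, match.group(1))
        for match in N_GLYCOSYLATION.finditer(sequence.upper())
    ]
```

A pattern match of this kind only flags candidate sites; it does not show that a site is actually modified.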

The workflow engine, database comparison, data input/output,
and HTML rendering systems are in place. A number of
annotation computing modules have been implemented (Pfam,
PROSITE, Protein Name Finder, Orthologues, Paralogues,
Molecular Weight, pI, Subcellular Location Prediction, and
Function Prediction). Many more are being written. We are
currently working on improving the data storage and querying
systems. An initial release has been planned for mid-summer.
An early test version of the output (on H. influenzae) is
available at:

A common problem with many existing automated annotation systems is that
the depth of annotation for any given gene or protein is quite limited or
shallow, typically consisting of 10-15 pieces of information. We are
working on an automated system (the Proteome Analyst System) for
deeply annotating the proteomes of model organisms, and developing an
intuitive data mining and data visualization system. Deep annotation
means that the proteome/genome is annotated to a level that includes such
items as predicted protein location, 2D or 3D structure, detailed or specific
functions, post-translational modifications, expression levels, interacting
partners, domains, active sites, substrates, ligands, pathways, cofactors,
copy numbers, etc. An example of this kind of "deep" annotation can be
seen in the CyberCell database. The deep annotation project includes a
software engineering component that integrates existing data and methods
to perform a scientific analysis of the integrated data. The results of this
kind of project are of interest from both the scientific and the
software engineering points of view. This deep annotation system may be
used to support a wide range of biologists and could be a platform for
further developments. Since the similarity of functions between related
proteins varies substantially depending on the species context and
evolutionary distance, the relevant analyses and annotations also differ
between the kingdoms (viruses, archaebacteria, protista, fungi, animalia,
eubacteria, plantae). The major challenge of this project is to develop
custom analysis pipelines for each kingdom.
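
One way to picture a "deep" annotation record is as a structure with one slot per item listed above. The Python sketch below is a hypothetical layout, not the schema used by the project: the class name DeepAnnotation, its field names and types, and the depth method are all assumptions made for illustration.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class DeepAnnotation:
    """Hypothetical record holding the annotation items named in the text."""
    protein_id: str
    kingdom: str  # e.g. "eubacteria"; would select the analysis pipeline
    location: Optional[str] = None
    structure: Optional[str] = None
    function: Optional[str] = None
    expression_level: Optional[float] = None
    copy_number: Optional[int] = None
    modifications: List[str] = field(default_factory=list)
    interacting_partners: List[str] = field(default_factory=list)
    domains: List[str] = field(default_factory=list)
    active_sites: List[str] = field(default_factory=list)
    substrates: List[str] = field(default_factory=list)
    ligands: List[str] = field(default_factory=list)
    pathways: List[str] = field(default_factory=list)
    cofactors: List[str] = field(default_factory=list)

    def depth(self) -> int:
        """Count annotation fields that hold a value.

        The identifier and kingdom are bookkeeping and are not counted.
        """
        count = 0
        for f in fields(self):
            if f.name in ("protein_id", "kingdom"):
                continue
            value = getattr(self, f.name)
            if value is not None and value != []:
                count += 1
        return count
```

A count like depth() gives a rough, per-protein measure of how many kinds of information have been filled in, which is one simple way to compare shallow and deep annotation.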



External Processing

Proteome Analyst:
