Overview
The GeneWeaver project involves the development of a flexible
system for automatic genome analysis and annotation.
The project has been funded by the
BBSRC/EPSRC Bioinformatics Initiative from April 1998 to March
2001.
Genome analysis involves a number of stages, including:
- Assembly of contigs generated by sequencing machines.
- Detection of open reading frames (ORFs) in the assembled
genome.
- Assignment of functional descriptions to the proteins.
- Assignment of structural features to the proteins.
- Detection of regulatory units such as promoters, enhancers and
silencers.
- Construction of metabolic pathways for the organism by
considering the different gene products.
Other groups have already written successful bioinformatics
software to perform the analysis required for a number of these
steps. GeneWeaver provides an architecture which integrates these
applications into a single system which can automatically analyse
genomes and also efficiently manage the data generated.
Developing a large system which integrates heterogeneous
components requires a number of guiding principles. GeneWeaver is
guided by the following:
- It should be easy to integrate new analysis methods. It should
also be easy to extend the data types handled by the system since
new methods and new functionality may require additional data
types.
- It should be possible to distribute both tasks and data across
a network. A number of benefits arise from this. Computationally
intensive tasks can be load-balanced amongst a number of machines
and tasks which require specialized hardware such as large backup
storage can be run on appropriate machines.
- Primary data sources, such as sequence databases, should be
constantly monitored and any changes to the data should be
automatically incorporated. Essentially, the system should watch
its environment and react to the changes which occur.
- All data in the system should have dates of last modification
and last update. Dependencies of some data on other data also need
to be tracked at a fine level of granularity. This will allow the
system to update all relevant information automatically when a
particular item of data changes, without reanalysing the complete
genome.
- All data in the system should have a degree (or may have many
degrees) of confidence associated with it. It has been pointed out
(Karp, 1998) that a key deficiency of current sequence databases is
the lack of a reliability score attached to the functional
annotation. This results in further annotation being based on
annotation which may be unreliable.
- The system should consist of loosely-coupled modules which can
be easily combined to give additional functionality. The software
interface to these modules should be open so that third-party
modules can be incorporated.
- All data should contain histories of how it was derived from
other data. This provides an audit trail which a manual annotator
can use to determine the likely accuracy of any unusual cases.
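Several of these principles (modification dates, fine-grained dependency tracking, confidence scores and derivation histories) can be illustrated together. The following is a minimal sketch only, not GeneWeaver's actual data model: the names `DataItem`, `Store`, `update` and `history`, and the example sequences and confidence values, are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DataItem:
    """One piece of data in a hypothetical annotation store."""
    name: str
    value: str
    confidence: float                                  # 0.0 (unreliable) to 1.0 (certain)
    derived_from: list = field(default_factory=list)   # names of the source items
    method: str = "primary"                            # how the value was produced
    last_modified: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    stale: bool = False                                # True when a source has changed


class Store:
    def __init__(self):
        self.items = {}

    def add(self, item):
        self.items[item.name] = item

    def dependents(self, name):
        """Names of items derived directly from `name`."""
        return [i.name for i in self.items.values() if name in i.derived_from]

    def update(self, name, value):
        """Change one item and mark only its downstream items as stale."""
        item = self.items[name]
        item.value = value
        item.last_modified = datetime.now(timezone.utc)
        item.stale = False
        affected = []
        queue = self.dependents(name)
        while queue:
            current = queue.pop(0)
            if current in affected:
                continue
            affected.append(current)
            self.items[current].stale = True
            queue.extend(self.dependents(current))
        return affected

    def history(self, name):
        """Audit trail: walk back through the derivation chain."""
        item = self.items[name]
        trail = [(item.name, item.method, item.confidence)]
        for source in item.derived_from:
            trail.extend(self.history(source))
        return trail


store = Store()
store.add(DataItem("contig1", "ATGAAATAG", 1.0))
store.add(DataItem("orf1", "ATGAAATAG", 0.9, ["contig1"], "orf-detection"))
store.add(DataItem("orf1.function", "hypothetical protein", 0.4,
                   ["orf1"], "homology-search"))
store.add(DataItem("contig2", "ATGCCCTAA", 1.0))

# Changing contig1 flags only the items derived from it; contig2 is untouched.
affected = store.update("contig1", "ATGAAGTAG")
```

In this sketch, a change to one contig marks only the ORF and the functional assignment derived from it for reanalysis, and `store.history("orf1.function")` returns the chain of methods and confidence values that a manual annotator could inspect.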
These requirements are very naturally satisfied by the model of
multiple interacting software agents. Software agents are becoming
increasingly popular, with a correspondingly large number of
available texts (e.g. Bradshaw, 1997; Knapik and Johnson, 1998) and
a range of different varieties of agent architectures. GeneWeaver
is based on a multi-agent system in which each agent takes on a
particular responsibility or expertise. For example, an agent may be
responsible for keeping a non-redundant database updated, managing
a genome and its data or performing homology searches (using
whatever methods the agent chooses as appropriate to a particular
situation). The agents coordinate their activities by sending
messages to each other to accomplish overall tasks. Each agent can
be viewed as an individual program with the following
properties:
- Persistence: Agents run continuously so that
the system can react to changes in the data.
- Reactivity: Agents can respond to a changing
environment. For example, if an external primary database changes,
the agents can react to it.
- Autonomy: Agents are able to function without
human intervention. This also makes the system robust: because
every agent is autonomous, no agent can rely on a task being
successfully completed by another, so all agents need to be
designed to cope with failure in others.
- Pro-activeness: Agents behave in a goal-directed
fashion. An agent may be told to determine a homologue
for a particular protein, but it will not be explicitly told to run
a particular method such as PSI-BLAST. Expertise in particular tasks
is encapsulated in particular agents, which simplifies system
development.
- Social ability: Agents interact with other
agents by communicating in a high-level language.
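The goal-directed messaging and failure handling described above can be sketched in a few lines. This is an illustrative, in-process toy and not the GeneWeaver agent architecture or its communication language: the classes `MessageBus`, `GenomeAgent` and `HomologyAgent`, the goal names, and the method labels `pairwise-search` and `profile-search` are all hypothetical. Persistence is not modelled; the bus simply runs until its queue is empty.

```python
from collections import deque


class Agent:
    """Base class for the hypothetical agents in this sketch."""
    name = "agent"
    bus = None

    def handle(self, sender, goal, content):
        raise NotImplementedError


class MessageBus:
    """Delivers messages between named agents (in-process stand-in for a network)."""

    def __init__(self):
        self.agents = {}
        self.queue = deque()

    def register(self, agent):
        self.agents[agent.name] = agent
        agent.bus = self

    def send(self, sender, recipient, goal, content):
        self.queue.append((sender, recipient, goal, content))

    def run(self):
        while self.queue:
            sender, recipient, goal, content = self.queue.popleft()
            agent = self.agents.get(recipient)
            if agent is None:
                # The recipient is unavailable: report the failure to the
                # sender, which must be able to cope with it.
                if sender in self.agents:
                    self.agents[sender].handle("bus", "failed", (goal, content))
                continue
            agent.handle(sender, goal, content)


class HomologyAgent(Agent):
    """Receives a goal and decides for itself which method to use."""
    name = "homology"

    def handle(self, sender, goal, content):
        if goal == "find-homologue":
            # Placeholder decision rule; the method labels are made up.
            method = "pairwise-search" if len(content) < 30 else "profile-search"
            self.bus.send(self.name, sender, "homologue-result", (content, method))


class GenomeAgent(Agent):
    """Manages a genome and asks other agents for annotation."""
    name = "genome"

    def __init__(self):
        self.results = {}
        self.failures = []

    def request_annotation(self, protein):
        # States the goal only, never the method to be used.
        self.bus.send(self.name, "homology", "find-homologue", protein)

    def handle(self, sender, goal, content):
        if goal == "homologue-result":
            protein, method = content
            self.results[protein] = method
        elif goal == "failed":
            self.failures.append(content)


bus = MessageBus()
genome = GenomeAgent()
bus.register(genome)
bus.register(HomologyAgent())

genome.request_annotation("MKT")
# No "structure" agent is registered, so this request fails gracefully.
bus.send("genome", "structure", "predict-structure", "MKT")
bus.run()
```

Here the genome agent states only what it wants, the homology agent chooses the method, and a request to a missing agent is recorded as a failure rather than halting the system.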
A multi-agent system should provide a very flexible and open
architecture which allows genome annotation to be kept as
up to date as possible.
Bradshaw, J.M. (ed.) (1997) Software Agents. American
Association for Artificial Intelligence, Menlo Park,
California.
Karp, P.D. (1998) What we do not know about sequence analysis
and sequence databases. Bioinformatics,
14, 753-754.
Knapik, M. and Johnson, J. (1998) Developing Intelligent
Agents for Distributed Systems. McGraw-Hill, New York.