GeneWeaver
"The Development of Intelligent Software Agents for Genome
Analysis and Protein Structure Prediction"
Introduction
One of the most important and pressing challenges faced by
present-day biological scientists is to move beyond the task of
genomic data collection in the sequencing of DNA, and to make sense
of that data so that it may be used, for example, in the
development of therapies to address critical genetic disorders.
The process of identifying genes and predicting the structure of
the encoded proteins involves computer-based tasks, including:
- scanning sequence databases for similar sequences,
- collecting the matching sequences,
- constructing alignments of the sequences, and
- inferring the function of the sequence from annotations of the
matched proteins (for which the function is already known).
Predicting the three-dimensional structure of the proteins
requires analyses of the collected sequence data by a range of
different programs, which sometimes agree but often do not (though
they typically provide confidence scores that enable relatively
easy interpretation).
Many tools are available to perform these tasks, but they are
typically standalone programs that are not integrated with each
other. The expert users perform each stage manually and combine
them in appropriate ways. For example, the process of trying to
find a matching sequence might result in finding an annotated gene,
but the annotations include much spurious information as well as
the important functional information. The problem here is
distilling this relevant information, which is not difficult for an
expert, but which might be problematic for a less experienced user.
With the amount of data that is being generated, this kind of
expertise is critical.
Both the primary data and some of the programs are accessible
only over the Internet — by either electronic mail or the WWW
(and increasingly the latter). This requires the different sources
of information and the different programs to be managed
effectively, and further complicates the efficient processing of
the genomic data.
The raw data has been accumulating at an unprecedented pace, and
a range of computational tools and techniques have been developed
by bioinformaticians, targeted at the problems of storing and
analysing that data. In this sense, much has already been achieved,
but these tools are labour-intensive and usually require expert
manual direction and control, imposing huge restrictions on the
rate of progress. Essentially, however, the problems involved are
familiar from other domains — vast amounts of data and
information, existing programs and databases, complex interactions,
distributed control — pointing strongly to the adoption of a
multi-agent approach.
Project Aims
An agent-based architecture consists of a number of distributed
and autonomous software programs known as agents. These
interact using a standard communication language that allows them
to cooperate with the aim of accomplishing their overall goals.
Each agent can 'wrap' heterogeneous data and methods, presenting them
to the community of agents in a uniform manner by way of the common
agent-communication language. Potentially they provide a framework
within which distributed and autonomous resources can be managed
and integrated. The principal objective of this project was the
application of agent-based concepts to the management and
integration of automatic genome analysis and protein structure
prediction.
A large number of resources, both data and methods, are freely
available over the Internet and potentially can be used for
bioinformatics tasks such as genome annotation. Unfortunately the
actual use of these data and methods, particularly for automatic
systems, is hampered by several factors:
- Data and methods are distributed across the network and are
under the control of third parties who need to be able to modify
them as they wish. Any system wishing to integrate and use these
resources must take into account both the constant changes in the
data and the fact that these resources are autonomous and
unpredictable (a resource may be terminated, for instance).
- The data is largely heterogeneous both in terms of data formats
and the terminologies (or ontologies) employed. This applies both
to primary data stored in various databases and to the data used
for input and output by different methods.
- Access to the various distributed methods through web servers has
greatly increased the availability of methods to human users, but
web servers are not entirely satisfactory for a variety of reasons. For
example, direct communication between two servers is problematic
(and is a common requirement when automatically integrating
methods), due to web servers' reliance on natural language
interpretation. Furthermore, web servers 'hide' important data,
such as the version of any underlying databases employed, which is
essential information when considering automatic updating and
permits the resolution of inconsistencies which arise when
employing a number of servers.
Some of these factors are very wide-ranging and complex, for
instance the development of consistent ontologies to be employed
for bioinformatics. These require a community-wide approach, for
example the Gene-Ontology initiative. Our original proposal did not
aim to cover such aspects but limited itself to the application of
agent-based techniques to facilitate data and method management for
the application fields mentioned above.
The GeneWeaver Architecture
At the start of the project, a number of agent architectures
already existed, and we examined the possibility of re-using one of
them. None of them was suitable for our requirements, and we thus
decided to undertake the development of a specific architecture for
the bioinformatics domain, which we named GeneWeaver.
GeneWeaver is a multi-agent system aimed at addressing many of
the problems in the domain of genome analysis and protein structure
prediction. It comprises a community of agents that interact with
each other, each performing some distinct task, in an effort to
automate the processes involved in, for example, determining gene
function. Agents in the system can be concerned with management of
the primary databases, performing sequence analyses using existing
tools, or with storing and presenting resulting information. The
important point to note is that the system does not offer new
methods for performing these tasks, but organises existing ones for
the most effective and flexible operation.
Adoption of a suitable agent-based language was seen to be
crucial since it acts as a common language between all the agents
in the system. We began with an established agent-based language
(KQML) which we modified to form the BioAgent Language (BAL). Such
a language is the only thing which 'couples' agents together by
allowing one agent to influence another agent towards a particular
goal, and the agents are otherwise completely autonomous in
nature.
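As an illustration of the kind of message such a language carries, the
sketch below builds a KQML-style message in Python. The performative
and the keyword parameters (sender, receiver, language, ontology,
content) follow common KQML usage. The agent names, the query language
name, the ontology name and the content string are hypothetical, and
the actual BAL syntax is not given in this report, so this should be
read as a sketch rather than as the GeneWeaver message format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """A KQML-style message: a performative plus keyword parameters."""
    performative: str   # e.g. "ask-one", "tell", "subscribe"
    sender: str
    receiver: str
    language: str       # language in which `content` is written
    ontology: str       # vocabulary assumed by `content`
    content: str

    def render(self) -> str:
        """Render the message in the parenthesised keyword style of KQML."""
        fields = [
            ("sender", self.sender),
            ("receiver", self.receiver),
            ("language", self.language),
            ("ontology", self.ontology),
            ("content", f'"{self.content}"'),
        ]
        body = " ".join(f":{key} {value}" for key, value in fields)
        return f"({self.performative} {body})"


# Hypothetical exchange: a genome agent asks a database agent for a sequence.
request = Message(
    performative="ask-one",
    sender="genome-agent",
    receiver="swissprot-agent",
    language="simple-query",       # assumed name, not from the report
    ontology="protein-sequence",   # assumed name, not from the report
    content="sequence id=EXAMPLE1",
)
print(request.render())
```

The message is the only coupling between the two agents: the sender
states what it wants and in which language, and the receiver remains
free to decide how, or whether, to respond.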
One of the primary aspects of the design of GeneWeaver was to
make a single agent responsible for both the provision and
management of each resource. The prototype system includes three
primary database agents (SWISSPROT, PIR and PDB) which provide a
number of data services to other agents. The current data services
include simple querying of the data and allowing agents to
'subscribe' to data. The BAL language has been designed to allow a
variety of different data exchange and querying languages to be
employed, any two agents involved in an interaction needing to use
one which is common to both of them. This permits easy future
extension as standards for data exchange of biological data, for
instance XML, emerge.
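The two data services and the choice of a shared language can be
sketched as follows. This is a minimal in-memory illustration, not the
GeneWeaver implementation: the class, the method names, the language
names and the example record are all assumptions made for the sketch.

```python
from typing import Callable, Dict, List, Optional, Sequence


def common_language(offered: Sequence[str],
                    understood: Sequence[str]) -> Optional[str]:
    """Return the first offered language the other agent also understands."""
    for language in offered:
        if language in understood:
            return language
    return None


class DatabaseAgent:
    """Toy database agent: answers queries and notifies subscribers."""

    def __init__(self, name: str, languages: Sequence[str]) -> None:
        self.name = name
        self.languages = list(languages)
        self._records: Dict[str, str] = {}
        self._subscribers: List[Callable[[str, str], None]] = []

    def query(self, identifier: str) -> Optional[str]:
        """Simple querying: look up one record by identifier."""
        return self._records.get(identifier)

    def subscribe(self, callback: Callable[[str, str], None]) -> None:
        """Register a callback to be told about new or changed records."""
        self._subscribers.append(callback)

    def update(self, identifier: str, sequence: str) -> None:
        """Store a record and notify subscribers only if it changed."""
        if self._records.get(identifier) == sequence:
            return
        self._records[identifier] = sequence
        for notify in self._subscribers:
            notify(identifier, sequence)


# Hypothetical usage: one subscriber, one record, two candidate languages.
swissprot = DatabaseAgent("swissprot-agent", ["simple-query", "xml"])
received = []
swissprot.subscribe(lambda identifier, sequence: received.append((identifier, sequence)))
swissprot.update("EXAMPLE1", "MKTAYIAK")
swissprot.update("EXAMPLE1", "MKTAYIAK")  # unchanged, so no second notification
chosen = common_language(["xml", "simple-query"], swissprot.languages)
```

Keeping the language choice outside the agent class reflects the point
made above: a new exchange format only has to be added to the lists
that the two agents advertise.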
A second important feature of GeneWeaver is that the database
agents automatically update their data (currently using FTP sites)
and inform any subscribed agents of relevant changes. The prototype
system employs a non-redundant database agent that provides data
services similar to those of the primary database agents but updates its data
by subscribing to sequences managed by the primary database agents.
Two calculation agents (PSI-BLAST and MEMSAT) are included, which
register meta-data about what particular goals they can achieve
(for instance 'can derive membrane topology for protein sequence')
together with more general data on their methods' accuracy and
speed. This allows the calculation agents to be used by other
agents in two manners: either directly by commanding the agent to
carry out a particular method or by giving a general goal such as
'derive X'.
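The two ways of using a calculation agent can be sketched with a small
registry keyed by goal. Everything here is an assumption made for
illustration: the registry class, the second goal string, the
'other-topology-agent' entry, and in particular the accuracy and
run-time figures, which are placeholders and not measurements of
PSI-BLAST or MEMSAT.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Capability:
    """Meta-data a calculation agent registers about one of its methods."""
    agent: str
    method: str
    goal: str         # what the method can derive
    accuracy: float   # placeholder value between 0 and 1
    seconds: float    # placeholder typical run time


class Registry:
    """Maps goals to the capabilities registered for them."""

    def __init__(self) -> None:
        self._by_goal: Dict[str, List[Capability]] = {}

    def register(self, capability: Capability) -> None:
        self._by_goal.setdefault(capability.goal, []).append(capability)

    def direct(self, agent: str, method: str) -> Capability:
        """Direct use: name the agent and the method to carry out."""
        for capabilities in self._by_goal.values():
            for capability in capabilities:
                if capability.agent == agent and capability.method == method:
                    return capability
        raise LookupError(f"{agent} does not offer {method}")

    def for_goal(self, goal: str) -> Capability:
        """Goal-driven use: pick the most accurate, then fastest, provider."""
        candidates = self._by_goal.get(goal, [])
        if not candidates:
            raise LookupError(f"no agent can achieve goal: {goal}")
        return max(candidates, key=lambda c: (c.accuracy, -c.seconds))


TOPOLOGY = "derive membrane topology for protein sequence"
HOMOLOGY = "derive homologous sequences for protein sequence"  # assumed wording

registry = Registry()
registry.register(Capability("memsat-agent", "memsat", TOPOLOGY, 0.8, 5.0))
registry.register(Capability("other-topology-agent", "other", TOPOLOGY, 0.6, 1.0))
registry.register(Capability("psiblast-agent", "psi-blast", HOMOLOGY, 0.9, 60.0))

by_goal = registry.for_goal(TOPOLOGY)
by_name = registry.direct("psiblast-agent", "psi-blast")
```

A requester that states only the goal never needs to know which agent
answered, which is what allows unfamiliar agents to be used.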
Calculation agents manage their own methods, so the PSI-BLAST
agent updates the underlying databases employed on a regular basis
using the other database agents, and the MEMSAT agent re-trains
itself using new membrane proteins derived from the SWISSPROT
agent. Essentially the agents use services provided by other agents
in the community to improve their own services. This can be viewed
as a rather specialised form of learning. These automatic mechanisms
also give the system a novel level of data and method
consistency.
The system is open, and new calculation agents may join the
community at any time. Even when other agents in the community do
not know the exact nature of the new agents, their services may be
employed since they are described in general terms of 'can derive
X'.
GeneWeaver contains a number of genome agents for simple
bacterial genomes. These use FTP to maintain up-to-date copies of
their data and use the calculation agents to annotate their
data.
The GeneWeaver system is based on a uniform agent model, with a
large degree of common code shared between the agents. The
differences in behaviour between agents result from the initial
loading of different components, such as 'skills', which perform
particular actions, and 'motivations', which drive the agent to
follow particular goals. The uniform design structure adopted for all the
agents should greatly facilitate future expansion of the agent
community since new types of agents may be implemented with only
small additional amounts of code.
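A minimal sketch of such a uniform model is given below: one agent
class, with behaviour determined entirely by the skills and
motivations loaded at start-up. The class layout, the numeric
'strength' used to rank motivations and the example skills are
assumptions made for illustration, not details taken from GeneWeaver.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Skill = Callable[[str], str]  # a skill performs one action on one input


@dataclass(frozen=True)
class Motivation:
    """Drives the agent towards a goal by calling for a named skill."""
    name: str
    skill: str        # name of the skill that serves this motivation
    argument: str     # input handed to the skill
    strength: float   # assumed ranking: higher means more urgent


class Agent:
    """Common agent code; behaviour comes from the loaded components."""

    def __init__(self, name: str, skills: Dict[str, Skill],
                 motivations: List[Motivation]) -> None:
        self.name = name
        self.skills = dict(skills)
        self.motivations = list(motivations)

    def step(self) -> Optional[str]:
        """Act on the strongest motivation for which a skill is loaded."""
        actionable = [m for m in self.motivations if m.skill in self.skills]
        if not actionable:
            return None
        chosen = max(actionable, key=lambda m: m.strength)
        return self.skills[chosen.skill](chosen.argument)


# Two different kinds of agent built from the same class.
database_agent = Agent(
    "database-agent",
    skills={"fetch": lambda source: f"fetched {source}"},
    motivations=[Motivation("stay-current", "fetch", "primary-source", 1.0)],
)
calculation_agent = Agent(
    "calculation-agent",
    skills={"predict": lambda sequence: f"predicted topology for {sequence}"},
    motivations=[Motivation("annotate", "predict", "EXAMPLE1", 0.5)],
)
idle_agent = Agent("idle-agent", skills={}, motivations=[])
```

Adding a new type of agent in this sketch means writing new skill
functions and motivations, while the Agent class itself is unchanged.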
Results
The prototype system has demonstrated the feasibility of this
novel approach and has revealed a number of benefits, some not
envisaged in the original proposal. It succeeds in providing a
limited degree of genome annotation (protein membrane
classification and homology) in which the methods can re-train
themselves as newly discovered data becomes available. It enables
the integration of distributed databases and methods while
permitting them to remain under the control of third parties since
the system assumes all agents are autonomous. It should be noted
that this architecture seems particularly appropriate for the
recently established ideas of GRID-based computing
infrastructures.
It is clear that an architecture such as GeneWeaver requires
considerable investment in design and development. The result is a
substantially more complex system, mainly because of its inherently
decentralised control, but one that offers greater
flexibility. One of the consequences of this feature of multi-agent
systems, recognised within the agent community, is the need for the
wider adoption and development of standards. For example, the FIPA
standard for agent-based communication, together with FIPA-OS
as an open-source agent framework, potentially permits much more
rapid development of agent systems for bioinformatics in the future
(although this remains untested).
In the original proposal, we envisaged a working system for
genome analysis and protein structure prediction. We have developed
such a prototype system that demonstrates the principal concepts
and benefits. Further work on this system is still ongoing at INRA
in France, field-testing and extending the prototype system so that
it may be used to carry out a first-pass annotation of a number of
novel Lactococcus genomes sequenced at INRA.
Whether the encapsulation of databases should be agent-oriented
is a moot point. Certainly, related work in database
integration is relevant, and the database community is addressing
these issues on a somewhat different tack. Nevertheless, the
agent approach offers an over-arching design paradigm, and offers
much more in those areas where databases are not relevant,
particularly for calculation agents for tool support.
Conclusion
The GeneWeaver system has provided a demonstration of the
suitability of the agent approach in bioinformatics to provide
solutions to problems with dramatically large amounts of data and a
vast array of tools to be encapsulated. While only a limited range
of tools has been included in the prototype system, the success of
the architectural design points to its effectiveness in larger
scale systems. Indeed the underlying principles are the subject of
further work at INRA, the focus of an EPSRC E-Science proposal, and
the topic of a workshop in 2002.
Future work might aim to extend the range of calculation agents
and then to assess the entire system in relation to the activity of
human domain experts. Further refinements both to the architecture
and the individual agent control mechanisms could then be
investigated.