My
DNA, a fabulously long chain of nucleotides, a copy of which
is contained in every one of my cells, contains a large
percentage of the information required to produce me, Ben
Goertzel.
This is an amazing thing, really.
Extract my DNA from any one of my cells, and feed it into
a “human producing machine,” and out comes a
clone of Ben Goertzel, lacking my knowledge and experience,
but possessing all my physical and mental characteristics.
Of course, we don’t have a human producing machine
of this nature just yet, but the potential is there: DNA
seems to encode most of the information required to produce
a human being.
This is the glory and the romance underlying the Human Genome
Project, a huge initiative launched in 1990, which aims
to chart the whole human genome, to map every single base
pair in the DNA of some sample of human beings. No one could
doubt the excitement of this quest: It has a simplicity
and grandeur similar to that of putting a man on the moon.
Once you get past the excitement and mystique and into the
details, however, the Human Genome Project slowly begins
to seem a little less tremendous. One realizes that the
actual mapping of the genome is only a very small part of
the task of understanding how people are made, and that,
in fact, the design of the “human-producing machine”
is a much bigger and more interesting job than the complete
mapping of examples of the code that goes into the machine.
In other words, embryology is probably a lot subtler than
genetics, and in the end, much like putting a man on the
moon, the Human Genome Project is a task whose scientific
value is not quite equal to its cultural and psychological
appeal.
But science moves fast these days…. As the excitement
of having mapped the human genome fades into a matter-of-fact
acceptance, the genetics community is looking ahead to what’s
called post-genomic biology. The next big challenge: figuring
out how the genetic code actually does anything. How do
these sequences of nucleotides decode themselves, making
use of other molecules in their environment to create organisms
like us, or even simpler one-celled organisms? The completion
of the Human Genome Project was one of those ends that was
actually a beginning. It put us in a position where we’re
able to finally start asking the really interesting questions.
This is a very exciting area of research – and a tremendously
difficult one as well. As yet there are no tales of tremendous
triumph – only some minor victories, a lot of hard
work and furious innovation, and the tremendous promise
of infinite victories to come. But the progress made so
far has many lessons to teach – for example, regarding
the remarkably tight interrelation between computer technology
and biological research. At the end of the chapter I’ll
briefly discuss some of the work my colleagues and I are
now doing, applying advanced AI technology to the integrated
analysis of various types of biological data, with a focus
on genetics and proteomics.

The
Human Genome Project originally was planned to last 15 years,
but rapid technological advances have accelerated the expected
completion date to 2003. The project goals are multifold:
to identify all the genes in human DNA (then estimated at more than 100,000),
determine the sequences of the 3 billion chemical base pairs
that make up human DNA, store this information in databases,
and develop tools for the analysis of this huge amount of
data. Some resources have also been devoted to exploring
the ethical, legal, and social issues that may arise from
the project.
Of course there were many milestones along the path to completion
of the Human Genome Project. Befitting the accelerating
pace of scientific progress, most of these occurred not
long before the completion of the sequencing of the genome
itself. For instance, I recall the day in mid-2000 when
newspapers announced the mapping of Chromosomes 16 and 19
on the human genome. Human chromosome 19 contains about
2% of the human genome, including some 60 genes in a gene
family involved in detoxifying and excreting chemicals foreign
to the body. Chromosome 16 contains about 98 million bases,
or some 3% of the human genome, including genes involved
in several diseases, such as polycystic kidney disease (PKD),
which is suffered by about 5 million people worldwide and
is the most common potentially fatal disease caused by a
defect in a single gene.
Since that time more and more similar results have piled
up. The initial rough map of the genome is getting refined,
bit by bit, using sophisticated “gene recognition”
software that identifies sequences of base pairs
that represent genes, along with lots of good old biological
intuition.
Clearly, these are major advances in gene mapping, with
potential implications for helping remedy diseases. But
-- what do they really mean?
An analogy may be instructive. Suppose a team of scientists
goes to another planet, and discovers a lot of really long
strips of paper lying around on the ground, each one with
strange markings on it. Suppose they then notice some big
steel machines, with slots that seem to be made to accept
the strips of paper. After some experimentation, they figure
out how the machines work: You feed the strip of paper in
one end, and then after a few hours, the machine spits out
a completely functional living organism. Amazing!
So, the scientists embark on a project to figure out what’s
going on here. Of course, they have no idea what’s
going on inside the machines, and all their efforts to bust
the machines open meet with failure. So, instead, they devote
themselves to completely recording all the markings on the
strips of paper in their notebooks, hoping that eventually
the patterns will come to mean something to them. When they
achieve 10%, then 20%, then 50% completion of their task
of recording these meaningless patterns in their notebooks,
they declare themselves to have made significant scientific
progress.
And occasionally, along the way, they make some small discoveries
about the impact that the markings have on the organisms
the machine produces. If you snip off the first 10% of the
strip, the organism produced is more likely to be defective
than if you snip off the last 10%. The region of the strip
that’s 2000 to 3000 markings from the end seems to
have something to do with the organism’s head: it
seems to be very different for organisms with very different
heads, and so forth. But these kinds of general observations
don’t really get them very far toward an understanding
of what the amazing steel machines are actually doing.
If you’re somewhat familiar with computers, a variation
on this analogy may be instructive. Consider a large computer
program such as Microsoft Windows. This program is produced
via a long series of steps. First, a team of programmers
produces some program code, in a programming language (in
the case of Microsoft Windows, the programming language
is C++, with a small amount of assembly language added in).
Then, a compiler acts on this program code, producing an
executable file – the actual program that we run,
and think of as Microsoft Windows. Just as with human beings,
we have some code, and we have a complex entity created
by the code, and the two are very different things. Mediating
between the code and the product is a complex process –
in the case of Windows, the C++ compiler; in the case of
human beings, the whole embryological and epigenetic biochemical
process, by which DNA grows into a human infant.
Now, imagine a “Windows Genome Project,” aimed
at identifying every last bit and byte in the C++ source
code of Microsoft Windows. Suppose the researchers involved
in the Windows Genome Project managed to identify the entire
source code, within 99% accuracy. What would this mean for
the science of Microsoft Windows?
Well, it could mean two different things.
1) If they knew how the C++ compiler worked, then they’d
be home free! They’d know how to build Microsoft Windows!
2) On the other hand, what if they not only had no idea how
to build a C++ compiler, but also had no idea what the utterances
in the C++ programming language meant? In other words, they
had mapped out the bits and bytes in the Windows Genome,
the C++ source code of Windows, but it was all a bunch of
gobbledygook to them. All they had was a large number
of files of C++ source code, each of which is a nonsense
series of characters. Perhaps they recognized some patterns:
older versions of Windows tend to be different in lines
1000-1500 of this particular file. When file X is different
between one Windows version and another, this other file
tends to also be different between the two versions. This
line of code seems to have some effect on how the system
outputs information to the screen. Et cetera.
Our
situation with the Human Genome Project is much more like
Option 2 than it is like Option 1.
The scientists carrying out the Human Genome Project are
much like the scientists in my first parable above, who
are busily recording the information on the strips of paper
they’ve found, but have no idea whatsoever what’s
going on inside the magical steel machines that actually
take in the strips of paper and produce the alien animals.
Moving beyond analogies, let’s talk briefly about
a real project related to the Human Genome Project: the
Fly Genome Project. In the 24 March 2000 issue of Science
magazine, in a series of articles jointly authored by hundreds
of scientists, technicians, and students from 20 public
and private institutions in five countries, the almost-complete
mapping of the genome of the fruit fly Drosophila melanogaster
was announced. Hurray! Some other species of fly have also
been similarly mapped.
The fruit fly Drosophila has a big history in genetics;
its study has yielded a long series of fundamental discoveries,
beginning with the proof, in 1916, that the genes are located
on the chromosomes. Now all of its 13,601 individual genes
have been enumerated.
This achievement may have some practical value. In a set
of 289 human genes implicated in diseases, 177 are closely
similar to fruit fly genes, including genes that play roles
in cancers, in kidney, blood, and neurological diseases,
and in metabolic and immune-system disorders.
But, my point is: OK, we have the fruit fly genome mapped,
to within a reasonable degree of accuracy. Now what? Wouldn’t
it be nice to understand the process by which this genome
is turned into an actual fly?
The Human Genome Project includes in its umbrella a focus
on data analysis. This refers mainly to designing and implementing
computer programs that study the huge sequences of nucleotides
that biologists have recorded, and look for patterns
in these sequences. This is fascinating work, but it is
a long way from a principled understanding of how DNA is
turned into organisms.
For example, Luis Rocha and his colleagues at Los Alamos
National Labs are working on identifying regions of the
genome that are similar to each other, based on statistical
tests. This kind of similarity mining gives biologists a
hint that two parts of the genome may work together at some
stage during the process of forming an organism. Similar
statistical methods may be useful for recognizing where
genes begin and end in a collection of nucleotide sequences
– a problem that’s surprisingly tricky, and
may require comparison of human sequences with sequences
from related species such as the mouse or the fruit fly.
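
To make the idea of similarity mining concrete, here is a minimal
sketch in Python. It is not the method used at Los Alamos, which the
text does not describe; it simply assumes that two regions can be
compared by the overlap of their short subsequences ("k-mers"), scored
with the Jaccard index. The function names, the choice of k = 3, and
the example sequences are all hypothetical.

```python
def kmers(seq, k=3):
    """Return the set of all length-k substrings of a nucleotide sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_similarity(seq_a, seq_b, k=3):
    """Jaccard index of the two k-mer sets: shared k-mers / all k-mers seen."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    if not a and not b:
        return 0.0  # sequences shorter than k give no evidence either way
    return len(a & b) / len(a | b)


# Hypothetical regions, invented for illustration only.
region_1 = "ATGCGTACGTTAGC"
region_2 = "ATGCGTACGTTAGG"  # differs from region_1 in the last base only
region_3 = "TTTTTTTTTTTTTT"  # shares no 3-mer with region_1

print(jaccard_similarity(region_1, region_2))  # high score: near-identical
print(jaccard_similarity(region_1, region_3))  # zero: nothing in common
```

A real analysis would also need a statistical test for whether a given
score is higher than chance, which this sketch leaves out.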
The relation between 1-D sequences of amino acids and 3-D
structures formed from these sequences is hard for scientists
to understand even on the simplest level. The big problem
here is what’s known as “protein folding.”
Many structures in DNA encode instructions for the formation
of proteins. But no one knows how to predict, from the series
of molecules making up a protein, what that protein is going
to look like once it folds up in three-dimensional space.
This is important because many proteins that look very different
on the one-dimensional, molecular-sequence level may look
almost identical once they’ve folded up in 3 dimensions.
Thus, by focusing on sequence-level analysis, researchers
may be scrutinizing differences that make no difference.
Currently, only very few 3-D protein motifs can be recognized
at the sequence level.
Basically, we barely understand the simplest stages of the
production of 3-dimensional structures out of DNA, let alone
the complex self-organizing processes by which DNA gives
rise to organisms. This is OK – mapping DNA is still
of some value even in this situation – but it must
be clearly understood. In practical terms, our lack of knowledge
of embryological process greatly restricts the use we can
make of observed correlations between genes and human characteristics
such as diseases. There are diseases whose genetic correlates
have been known for decades, without any serious progress
being made toward treatment. For DNA researchers to announce
that they’ve mapped the portion of the human genome
that is correlated to a certain disease, doesn’t mean
very much in medical terms.
Does all this mean that the Human Genome Project is bad
– wasted money, useless science? Of course not. However,
it does suggest that perhaps the government is allocating
its research money in an imbalanced way. By pushing so hard
and so fast for a map of the human genome, while not giving
a proportionate amount of research money to studies in embryology
and the general study of self-organizing pattern formation,
the US government is guaranteeing that we are going to arrive
at a map of the human genome that we cannot use in any effective
way.
And this brings us to some very deep and fascinating questions
in the philosophy of science. As the biological theorist
Henri Atlan pointed out in an essay written right around
the start of the Human Genome Project, the mapping of the
human genome is a very reductionist pursuit. In fact it
is almost the definition of reductionism -- the construction
of a finite list of features characterizing human beings.
All of humanity, reduced to a list of nucleotides in order
– imagine that! Wow!
On the other hand, the formation of organisms out of DNA
is a very non-reductionist process, which biologists from
the last century attributed to a “vital force”
underlying all living beings. Modern scientists have still
not come to grips with the scientific basis for this apparent
vital force, which builds life out of matter. There are
disciplines of science – cybernetics, systems theory,
complexity science – which attempt to solve this problem,
but these have not been funded nearly as generously as gene
mapping, and they have not been linked in any serious way
with the work on data analysis of genetic sequences. I believe
that the study of embryology has the potential to overthrow
many of our established ways of doing science, by shifting
the focus of attention to complex, self-organizing processes
and the emergence of structure. But this “complexity
revolution” is something that the scientific establishment
seems determined to put off as long as it possibly can.
In this sense, one can see the Human Genome Project as an
outgrowth of modern cultural trends extending beyond the
domain of science. It’s an expression of the quest
for understanding, and also of the illusion that reductionism
is the path to understanding. It’s an expression of
our inability as a culture to come to grips with the wholeness
of life and being, and focus on the seemingly magical processes
by which life is formed from the nonliving, and structure
emerges from its absence.
But, the wonderful thing about science is that it’s
self-correcting. Ultimately science is all about the data
and the conclusions that can be drawn from it. We’ll
go ahead collecting data on the human genome, but year by
year, the biological community will place more and more focus
on how the genome interacts with its chemical environment
to self-organize into the organism. Some new biologists
coming into the field already have the feeling that gene
sequencing is old hat. New technologies like microarrays
allow us to study – only partially and haltingly right
now, but it’s a start – how genes interact and
interregulate in the actual living process of the cell.
I think of this as “the new genetics” –
genetics that reaches up and tries to be systems biology.
And in time it will succeed. Eventually, as research along
these lines matures, we really will understand not just
what nucleotides make up a human being’s genetic material,
but how a human being is made.

It’s
hardly shocking that post-genomic biology is enabled by
advanced computer technology every step of the way. After
all, most branches of physical science have become thoroughly
computerized – little of modern chemistry and physics
could exist without computers. But it’s instructive
to see just how many roles computers have played in the
new genetics. Firstly, it’s only because of recent advances
in experimental apparatus design, driven by computer engineering
and robotics, that we are able to gather significant
amounts of data about how genes build organisms. New “microarray”
technologies like DNA chips (built like silicon chips) and
spotted microarrays (built with robot arms) allow us to
collect information regarding the expression of genes at
different times during cell development. But this data is
too massive and too messy for the human mind to fully grasp.
Sophisticated AI software, used interactively by savvy biologists,
is needed to analyze the results.
It’s not hard to see what the trend is here. Biological
experiments, conducted using newfangled computer technology,
are spinning us an increasingly detailed story of the microbiological
world – but it’s a story that only increasingly
advanced AI programs will be able to understand in full.
Only by working with intelligent software will we be able
to comprehend the inner workings of our own physical selves.
Gene therapy, the frontier of modern medicine, relies on
the ability to figure out what combinations of genes distinguish
healthy cells from diseased cells. This problem is too hard
for humans to solve alone, and requires at the very least advanced
statistical methods, and at most full-on computer cognition.
The upshot? Rather than fearing AIs, as movies like
2001 have urged us to do, we may soon be thanking AI programs
for helping find the cure for cancer.
Artificial intelligence programs have never even come close
to equaling humans’ common sense about the everyday
world. There are two main reasons for this. First, most
AI programs have been written to excel only in one specialized
kind of intelligence – like playing chess, or diagnosing
diseases -- rather than to display general intelligence.
And second, even if one does seek to create an AI program
with general intelligence, it still is just a software program
without any intuition for the human world. We Homo sapiens
sapiens have a special feeling for our physical and social
environment -- for simple things like the difference between
a cup and a bowl, or between happiness and contentment.
AI programs, even those that push towards general intelligence,
can’t help lacking this intuition.
But the world of molecular biology is not particularly intuitive
to human beings. In fact it’s complex and forbidding.
It has much of the ambiguity of everyday life – there
is not as much agreement as one would think about the meanings
of various technical terms in genetics and molecular biology.
But this ambiguity is not resolved by a simple tacit everyday
understanding, only by a very advanced scientific intuition.
The number of different patterns of genetic structure and
activity boggles even the ablest human mind. In this domain,
an artificial intelligence has much more to offer than in
the world of everyday human life. Here in the microworld,
human intuition is misleading as often as it is valuable.
Artificial intuition can be tuned specifically to match
the ins and outs of introns and exons, the turns and twists
of DNA.

The
new genetics has many aspects, but perhaps the most exciting
of them all is the emerging study of gene and protein expression.
The terminology here is both evocative and appropriate:
Just as with a person, it’s not what a gene does when
it’s just sitting there that’s interesting,
it’s what a gene does when put in a situation where
it can express itself!
At any given moment, most genes are quiet, doing nothing.
But some are expressed, some are active. Now, using the
new experimental tools, we can tell which. We can see how
many genes are expressed at a given moment … and then
a little later … and then a little later. In this
way we can make a kind of map of genetic dynamics as it
evolves. And by analyzing this map, using advanced computer
software, a lot of information about how genes go about
their business can be understood. Which genes tend to stimulate
which other genes. Which ones tend to act in groups. Which
ones inhibit which other ones, preventing them from being
expressed. And by applying the same analysis tools to proteins
instead of genes, one can answer the same questions about
proteins, the molecules that genes create and send around
to do the actual business of building cells. These kinds
of complex interactions between genes, and between genes
and proteins, are the key to the decoding of genomes into
organisms – which is, after all, what genomes are
all about.
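
As a rough illustration of what "analyzing this map" can mean, the
sketch below compares the expression profiles of pairs of genes over
time using the Pearson correlation coefficient. This is only one
simple statistic among many, and the gene names, the expression
values, and the 0.8 threshold are hypothetical. Note also that a
strong correlation only suggests that two genes rise and fall
together (or in opposition); it does not by itself show that one
stimulates or inhibits the other.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length profiles."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-variation signal
    return cov / sqrt(var_x * var_y)


def classify_pair(profile_a, profile_b, threshold=0.8):
    """Label a gene pair by the sign and strength of its correlation."""
    r = pearson(profile_a, profile_b)
    if r >= threshold:
        return "co-expressed"
    if r <= -threshold:
        return "anti-correlated"
    return "no clear relationship"


# Hypothetical expression levels at five successive time points.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9],  # rises along with geneA
    "geneC": [5.0, 4.1, 2.9, 2.0, 1.1],  # falls as geneA rises
}

print(classify_pair(expression["geneA"], expression["geneB"]))
print(classify_pair(expression["geneA"], expression["geneC"]))
```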
All this complexity is implicit in the genetic code itself,
but we don’t know how to interpret the code. With
microarrays, we can watch the genetic code interpret itself
and create a cell, and by analyzing the data collected in
this process, we can try to figure out exactly how this
process of interpretation unfolds. And the potential rewards
are great: the practical applications are tremendous,
from drug development to disease diagnosis, genetic engineering
and beyond.
It’s a straightforward enough idea, but the practical
pitfalls are many. A huge host of tools from mathematics
and computer science have been unleashed on the problem,
both by researchers at major academic institutions, and by companies
like Rosetta Inpharmatics (recently acquired by Merck, the
major pharmaceutical firm) and Silicon Genetics, a gutsy
and clever California start-up. New data analysis techniques
come out every couple months, each one with its own strengths
and weaknesses.

It
would be hard to overestimate the revolutionary nature of
the new experimental tools – microarrays – underlying
the gene expression revolution. And the same tools, with
minor variations, are also being made to work for proteomic
analysis, the study of protein expression. For the first
time, with these new devices, biologists are able to study
thousands or even millions of different molecules at once,
and collect the results in a systematic way.
Chemists have long had methods for carrying out many simultaneous
chemical reactions. Most simply, trays can be created with
96 or 384 wells, each containing a different chemical and
a unique bar code. The last few years, however, have seen
the development of methodologies that push far further in
this direction – making possible experiments that scientists
only a few years ago would have called impossible. The application
of these new methodologies to the analysis of gene and protein
data has led to a new area of research that may be called
massively parallel genomics and proteomics.
Most of the work done so far has been in genomics; the extension
to proteomic analysis is more recent. So I’ll talk
about microarrays as used for genomic analysis; the proteomics
case is basically the same from the point of view of data
analysis, though vastly more difficult from the point of
view of experimental apparatus biomechanics. (Many proteins
are much more difficult than DNA to induce to stick on the
surfaces used in these instruments.)
There are several types of microarrays used in genomics,
but they all embody a common methodology. Single stranded
DNA/RNA molecules are anchored by one end to some kind of
surface (a chip or a plate depending on the type of apparatus).
The surface is then placed in a solution, and the molecules
affixed to the chip will seek to hybridize with complementary
strands (“target molecules”) floating in the
solution. (Hybridization refers to the formation of base
pairs between complementary regions of two strands of DNA
that were not originally paired).
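
The base-pairing rule behind hybridization is simple enough to sketch
in a few lines of Python. The example below assumes DNA only (A pairs
with T, G with C), treats a probe as binding only when the target
contains its exact reverse complement, and ignores the partial
matches that occur in real experiments. The function names and the
sequences are hypothetical.

```python
# Watson-Crick pairing for DNA; RNA would use U in place of T.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand):
    """Return the strand that would pair perfectly with `strand`."""
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


def can_hybridize(probe, target):
    """True if `target` contains a stretch exactly complementary to `probe`."""
    return reverse_complement(probe) in target.upper()


# Hypothetical probe anchored to the surface, and a target in solution.
probe = "ATGCCGTT"
target = "GGGAACGGCATCCC"

print(reverse_complement(probe))     # AACGGCAT
print(can_hybridize(probe, target))  # True
```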
Affymetrix’s technology, pioneered by Dr. Stephen
Fodor, involves making DNA chips in a manner similar to
the manufacture of semiconductor chips. A process known
as “photolithography” is used to create a huge
number of molecules, directly on a silicon wafer. A single
chip measuring 1.28 cm X 1.28 cm can hold more than 400,000
“probe” molecules. The procedure of gene chip
manufacture has been fully automated for a while now, and
Affymetrix manufactures 5,000-10,000 DNA chips per month.
Affymetrix DNA chips have a significant limitation in terms
of the size of the molecules that can be affixed to them.
So far they’re normally used with DNA/RNA segments
of 25 bases or fewer. Also, they are very expensive. It
currently costs about $500,000 to fabricate the light masks
for a new array design, so their technology is most appropriate
when the same chip needs to be used again and again and
again. The main example of this kind of use case is disease
diagnosis.
On the other hand, spotted microarrays, first developed
by Pat Brown at Stanford, are ordinary microscope slides
on which robot arms lay down rows of tiny drops from racks
of previously prepared DNA/RNA samples. At present this
technology can lay down tens of thousands of probe molecules,
at least an order of magnitude fewer than what Affymetrix
can do. The advantage of this approach is that any given
DNA/RNA probe can be hundreds of bases long, and can, in
principle, be made from any DNA/RNA sample.
Note the key role of computer technology in both of these
cases. Affymetrix uses a manufacturing technique derived
from the computer hardware industry, which depends on thorough
and precise computer control. Spotted microarrays depend
as well on the inhuman precision of robot arms, controlled
by computer software. Massively parallel genomics, like
the mapping of the human genome itself, is a thoroughgoing
fusion of biology and computer science – only here
the emphasis is on computer engineering and hardware, whereas
gene mapping relied upon fancy software algorithms.
There are other approaches as well. For instance, Agilent
Technologies, a spin-off from HP, is manufacturing array
makers using ink-jet printer technology. Their approach
is interesting in that it promises to make practical the
synthesis of a single instance of a given array design.
Lynx Corporation is pursuing a somewhat Affymetrix-like approach,
but circumventing Affymetrix’s patents by using addressable
beads instead of a silicon wafer. And so forth. Over the
next few years we will see a lot of radical computer-enabled
approaches to massively parallel genomics, and time will
tell which are most effective.
So how are these massively parallel molecule arrays used?
Let’s suppose that, one way or another, we have a
surface with a number of DNA/RNA molecules attached to it.
How do we do chemical reactions and measure their results?
First, the target molecules are fluorescently labeled, so
that the spots on the chip/array where hybridization occurs
can be identified. The strength of the fluorescence emanating
from a given region of the surface is a rough indicator
of the amount of target substance that bound to the molecule
affixed to that region. In practical terms, what happens
is that an image file is created, a photograph of some sort
of the pattern of fluorescence emanating from the microarray
itself. Typically the image file is then “gridded”,
i.e. mapped into a pixel array with a pixel corresponding
to each probe molecule. Then, there is a bit of black art
involved in computing the hybridization level for a spot,
involving various normalization functions that seem to have
more basis in trial-and-error than in fundamentals.
This data is very noisy, however. To get more reliable results,
researchers generally work with a slightly more complex
procedure. First, they prepare two related samples, each
of which is colored with a different fluorescent substance
(usually, one green, one red). They then compare the relative
amounts of expressed DNA/RNA in the two samples. The ratio
of green/red at a given location is a very meaningful number.
Using this ratio is a way of normalizing out various kinds
of experiment-specific “noise”, assuming that
these noise factors will be roughly constant across the
two samples.
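
A small Python sketch shows why the ratio helps. It assumes each spot
has one intensity reading per color channel, and it takes the base-2
logarithm of the green/red ratio so that a fourfold increase and a
fourfold decrease come out as +2 and -2. The log transform, the
function name, the intensity floor, and the numbers are assumptions
made for illustration, not details from the experiments described
here.

```python
from math import log2


def log_ratios(green, red, floor=1.0):
    """Per-spot log2(green / red), flooring intensities to avoid log(0)."""
    if len(green) != len(red):
        raise ValueError("both channels must cover the same spots")
    return [log2(max(g, floor) / max(r, floor)) for g, r in zip(green, red)]


# Hypothetical fluorescence intensities for three spots.
green_channel = [200.0, 200.0, 1600.0]
red_channel = [800.0, 200.0, 400.0]

print(log_ratios(green_channel, red_channel))  # [-2.0, 0.0, 2.0]

# A noise factor that scales both channels equally cancels in the ratio.
dimmed_green = [0.5 * g for g in green_channel]
dimmed_red = [0.5 * r for r in red_channel]
print(log_ratios(dimmed_green, dimmed_red))    # [-2.0, 0.0, 2.0]
```

The second call illustrates the normalization argument: halving the
brightness of the whole slide changes every raw intensity but leaves
the ratios untouched.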
But even this ratio data is still highly noise-ridden, for
a number of reasons beyond the usual risk of experimental
error or manufacturing defects in the experimental apparatus.
For one thing, there are many different factors influencing
the strength of the bond formed between two single stranded
DNA/RNA molecules, such as the length of the bonded molecules,
the actual composition of the molecules, and so forth. Errors
will occur due to the ability of DNA to bind to sequences
that are roughly complementary but not an exact match. This
can be controlled to some extent by the application of heat,
which breaks bonds between molecules – getting the
temperature just right will break false positive bonds and
not true positive ones. Other laboratory conditions besides
temperature can have similar effects. Another problem is
that the “probe molecules” affixed to the surface
may fold up and self-hybridize, thus rendering them relatively
inaccessible to hybridization with the target.
All these issues mean that a single data point in a large
microarray data set cannot be taken all that seriously.
The data as a whole is extremely valuable and informative,
but there are a lot of things that can go wrong and lead
to spurious information. This means that data analysis methods,
to be successfully applied to microarray data, have got
to be extremely robust with respect to noise. None of the
data analysis methods in the standard statistical and mathematical
toolkits pass muster, except in very limited ways. Much
more sophisticated technology is needed – yes, even
artificially intelligent technology, software that can build
its own digital intuition as regards the strange ways of
the biomolecular world.
The payoff for understanding this data, if you can do it,
is huge. These data can be used for sequencing variants
of a known genome, or for identifying a specific strain
of a virus (e.g. the Affymetrix HIV-1 array, which detects
a strain of the virus underlying AIDS). They can be used
to measure the differences in gene expression between normal
cells and tumor cells, which helps determine which genes
may cause/cure cancer, or identify which treatment a specific
tumor should respond to best. They can measure differences
in gene expression between different tissue types, to determine
what makes one cell type different from another. And, most
excitingly from a scientific viewpoint, they can be used
to identify genes involved in cell development, and to puzzle
out the dynamic relationships between these genes during
the development process.
We’ve
seen that the actual experimental apparatuses being used
in postgenomic biology all come in one way or another out
of the computer industry, and that the analysis of large,
noisy, complex data sets like the ones microarrays produce
can only be carried out by sophisticated computer programs
running on advanced machines – no human being has
the mind to extract subtle patterns from such huge, messy
tables of numbers. There is also another crucial dependency
on computer technology here: the role of the internet. The
biology community has come to use the Net very heavily for
data communication – without it, there is no way research
could proceed at anywhere near its current furious pace.
Perhaps you’re a bit of a computer hacker and you
want to try out your own algorithms on the data derived
from microarray experiments on the yeast genome during cell
development. Well, you’re in luck: the raw data from
these experiments are available online at http://cmgm.stanford.edu/pbrown/sporulation/additional/spospread.txt.
Download it and give it a try! Or check out Rosetta’s
site, www.rii.com, and download some sample human genome
expression data. Or, perhaps your interests are less erudite,
and you’d simply like to view the whole human genome
itself? No problem, check out the Genome Browser at http://genome.ucsc.edu/goldenPath/hgTracks.html.
But gene sequence information, and the quantitative data
from gene expression experiments, are only the beginning.
There’s also a huge amount of non-numerical data available
online, indispensable to researchers in the field. When
biologists interpret microarray data, they use a great deal
of background knowledge about gene function – more
and more knowledge is coming out every day, and a huge amount
of it is online for public consumption, if you know where
to look. Current automated data analysis tools tend to go
purely by the numbers, but the next generation of tools
is sure to boast the ability to integrate numerical and
non-numerical information about genes and gene expression.
As preparation for this, biologists in some areas are already
working to express their nonquantitative knowledge in unambiguous,
easily computer-comprehensible ways.
This exposes the dramatic effect the Net is having on scientific
language. Yes, the Net is rushing the establishment of English
as the world’s second language, but something more
profound than that is happening simultaneously. The Net
demands universal intercomprehensibility. In biological
science, this feature of Internet communications is having
an unforeseen effect: it’s forcing scientists working
in slightly different areas to confront the idiosyncrasies
of their communication styles.
Compared to ordinary language, scientific language is fairly
unambiguous. But it’s far from completely so. An outsider
would understandably assume that a phrase like “cell
development” has a completely precise and inarguable
meaning – but biology is not mathematics, and when
you get right down to it, some tribes of researchers use
the term to overlap with “cell maintenance”
more than others do. Where is the borderline between development
and maintenance of a cell? This issue and hundreds of others
like it have come up with fresh vigor now that the Internet
is shoving every branch of biological research into the
faces of researchers in every other branch. As a result
of the Internetization of biological data, a strong effort
is underway to standardize the expression of non-numerical
genetic data.
One part of this is the Gene Ontology Project, described
in detail at http://www.geneontology.org/ . In the creation
of this project, one thorny issue after another came up
– a seemingly endless series of linguistic ambiguities
regarding what would at first appear to be very rigid and
solid scientific concepts. What is the relation of “development”
versus “maintenance”, what does “differentiation”
really mean, what is the relation of “cell organization”
and “biogenesis”, etc. The outcome of this quibbling
over language? A much more precise vocabulary, a universal
dictionary of molecular biology. Ambiguity can’t be
removed from the language used to describe cells and molecules,
but it can be drastically reduced through this sort of systematic
effort. And the result is that genes from different species
can be compared using a common unambiguous vocabulary. The
fly, yeast, worm, mouse and mustard genomes have all been
described to a significant extent in standardized Gene Ontology
language, and the human genome can’t be far behind.
Soon enough, every gene of every common organism will be
described in a “Gene Summary Paragraph”, describing
qualitative knowledge about what the gene does in carefully
controlled language -- language ideally suited for digestion
by AI programs.
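Here is a small Python sketch of why a shared vocabulary matters to software. Once genes from different species are tagged with terms from one controlled list, comparing them reduces to comparing sets. The gene names and term labels below are invented for illustration and are not real Gene Ontology identifiers. The Jaccard overlap is just one simple similarity measure, not something the Gene Ontology Project itself prescribes.

```python
def jaccard(a, b):
    """Overlap of two term sets: size of intersection over size of union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical annotations keyed by (species, gene); terms are made-up labels.
annotations = {
    ("yeast", "gene_y1"): {"cell cycle", "DNA replication"},
    ("fly", "gene_f1"): {"cell cycle", "DNA replication", "development"},
    ("mouse", "gene_m1"): {"lipid metabolism"},
}


def most_similar(query, annotations):
    """Return the (species, gene) key whose term set best matches the query's."""
    others = [key for key in annotations if key != query]
    return max(others, key=lambda key: jaccard(annotations[query], annotations[key]))


print(most_similar(("yeast", "gene_y1"), annotations))  # ('fly', 'gene_f1')
```

No understanding of English is needed here. The program only has to check whether two genes carry the same standardized labels.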
The standardization of vocabulary for describing qualitative
aspects of genes and proteins is a critical part of the
computerization of biological analysis. Now AI programs
don’t have to have a sensitive understanding of human
language to integrate qualitative information about gene
function into their analyses of gene sequences and quantitative
gene expression data. It’s only a matter of years
– perhaps even months, in some cutting-edge research
lab -- before the loop is closed between AI analysis of
genomic data and the automated execution of biological experiments.
Now, humans do experiments, use sophisticated algorithms
to analyze the results, and then do new experiments based
on the results the algorithms suggest. But before too long,
the human will become redundant in many cases. Most of the
experiments are predominantly computer-controlled already.
The software will analyze the results of one experiment,
then order another experiment up. After a few weeks of trial
and error, it will present us humans with results about
our own genetic makeup. Or, post the results directly to
the Web, where other AIs can read them, perhaps faster
than humans can.

All this abstract, complicated technology conspires to provide
practical solutions to some very real problems. Genetic
engineering is one of the big potential uses. Understanding
how genes work to build organisms, we’ll be able to
build new kinds of organisms: Frankenfoods, and eventually
new kinds of dogs, cats and people, which raises all kinds
of serious ethical concerns.
But there are also applications that are ethically just
about unquestionable. From an economic point of view, the
main value of microarrays and related technologies right
now is as part of that vast scientific-financial machine
called the drug discovery process. The path from scientific
research to the governmental approval of a new drug is a
long, long, long one, but when it’s successfully traversed,
the financial rewards can be immense.
Gene therapy is a new approach to curing diseases, and one
that hasn’t yet proved its practical worth in a significant
way. Although it hasn’t lived up to the over-impatient
promises that were made for it 10 years ago, biologists
remain widely optimistic about its long-term potential –
not only for curing "classic" hereditary diseases,
but also widespread diseases such as cancer and cardio-vascular
diseases. The concept is scientifically unimpeachable. Many
diseases are caused by problems in an individual’s
DNA. Transplanting better pieces of DNA into the cells of
a living person should be able to solve a lot of problems.
A great deal of research has been done regarding various
methods to implant genes with the desired characteristics
into body cells. Usually the injected gene is introduced
within the cell membrane, but resides outside the nucleus, perhaps
enmeshed in the endoplasmic reticulum. Fascinatingly, the
result of this is that the gene is still expressed when
the appropriate input protein signal is received through
the receptors in the cell membrane even though the gene is not
physically there in the nucleus with the rest of the DNA.
Aside from the practical issues of how to get the DNA in
there in various circumstances, though, there’s also
the major issue of figuring out what DNA is responsible
for various diseases, and what to replace it with. To understand
this, in the case of complex diseases, requires understanding
how DNA is decoded to cause cells of various types to form.
And this is an understanding that has been very, very hard
to come by. The presence of gene and protein expression
data from microarray experiments, and sophisticated bioinformatics
software, renders it potentially achievable, though still
by no means trivial. More precise microarrays and more intelligent
data analysis software may render the problem downright
straightforward 5 or 10 or 20 years from now. No one
knows for sure.
One thing biologists do, in trying to discover gene therapies,
is to compare the genetic material of healthy and disease-affected
individuals. A key concept here is the “genetic marker”
– a gene or short sequence of DNA that acts as a tag
for another, closely linked, gene. Such markers are used
in mapping the order of genes along chromosomes and in following
the inheritance of particular genes: genes closely linked
to the marker will generally be inherited with it. Markers
have to be readily identifiable in the organism the DNA
builds, not just in the DNA – some classic marker
genes are ones that control phenomena like eye color.
Biologists try to find the marker genes most closely linked
to the disease, the ones that occur in the affected individuals
but not in the healthy ones. They narrow the markers’
locations down step by step. First they find the troublesome
chromosome, then they narrow the search and try to find
the particular troublesome gene within that chromosome….
It used to be that genetic markers were very hard to find,
but now that the human genome is mapped and there are technologies
like microarrays, things have become a good bit simpler.
Some markers are now standard -- and Affymetrix sells something
called the HuSNP Mapping Array, a DNA microarray with probes
for many common markers across the human genome already
etched on its surface, ready for immediate use. If you have
samples of diseased tissue, you can use this microarray
to find whether any of a large number of common markers
tend to coincide with it. In the past this would have required
thousands or millions of experiments, and in many cases
it would have been impossible. Now it’s easy, because
we can test in parallel whether any of a huge number of
gene sequences is a marker for a given disease-related gene.
Right now, scientists are using this approach to try to
get to the bottom of various types of cancer, and many other
diseases as well.
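The parallel marker test can be sketched in a few lines of Python. This is a toy under stated assumptions: each tissue sample is reduced to the set of markers detected in it, the marker names M1 to M3 are hypothetical, and the score is simply the difference in detection frequency between the affected and healthy groups. Both groups are assumed to be non-empty. Real studies use proper statistical tests and far more samples.

```python
def marker_scores(affected, healthy):
    """For every marker, the fraction of affected samples carrying it
    minus the fraction of healthy samples carrying it."""
    markers = set().union(*affected, *healthy)
    scores = {}
    for marker in markers:
        freq_affected = sum(marker in s for s in affected) / len(affected)
        freq_healthy = sum(marker in s for s in healthy) / len(healthy)
        scores[marker] = freq_affected - freq_healthy
    return scores


def candidate_markers(affected, healthy, min_diff=0.5):
    """Markers clearly more common in affected samples, strongest first."""
    scores = marker_scores(affected, healthy)
    keep = [m for m, diff in scores.items() if diff >= min_diff]
    return sorted(keep, key=lambda m: (-scores[m], m))


# Hypothetical array read-outs: one set of detected markers per sample.
affected = [{"M1", "M2"}, {"M1", "M3"}, {"M1", "M2"}, {"M1"}]
healthy = [{"M2"}, {"M3"}, {"M2", "M3"}, set()]

print(candidate_markers(affected, healthy))  # ['M1']
```

All markers are scored in one pass over the data. That is the software counterpart of having every probe on the chip at once.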
If a disease is caused by one gene in particular, then the
problem is relatively simple. One has to analyze a number
of tissue samples from affected and healthy people, and
eventually one’s computer algorithms will find the
one gene that distinguishes the two populations. But not
all diseases are tied to one gene in particular –
and this is where things get interesting. Many diseases
are spread across a number of different genes, and have
to do with the way the genes interact with each other. A
disease may be caused by a set of genes, or, worse yet,
by a pattern of gene interaction, which can come out of
a variety of different sets of genes. Here microarrays come
in once again, big time. If a disease is caused by a certain
pattern of interaction, microarray analysis of cell development
can allow scientists to find that pattern of interaction.
Then they can trace back and find all the different combinations
of genes that give rise to that pattern of interaction.
This is a full-on AI application, and it pushes the boundaries
of what’s possible with the current, very noisy microarray
data. But there’s no doubt that it’s the future.
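The difference between a one-gene disease and an interaction pattern can be shown with a deliberately tiny Python example. Everything in it is an assumption made for illustration: genes are reduced to on/off calls, the gene names are hypothetical, and the only "pattern of interaction" considered is "exactly one of two genes is on." Real expression data is continuous and noisy, and a real search would have to tolerate errors and look at far richer patterns.

```python
from itertools import combinations


def separating_genes(samples, labels):
    """Genes whose on/off state alone matches the disease label in every sample."""
    genes = sorted(samples[0])
    return [
        g for g in genes
        if all(s[g] == label for s, label in zip(samples, labels))
    ]


def separating_pairs(samples, labels):
    """Gene pairs for which 'exactly one of the two is on' matches the label."""
    genes = sorted(samples[0])
    return [
        (a, b) for a, b in combinations(genes, 2)
        if all((s[a] != s[b]) == label for s, label in zip(samples, labels))
    ]


# Hypothetical on/off expression calls; label True means "affected".
samples = [
    {"G1": True, "G2": False, "G3": True},
    {"G1": False, "G2": True, "G3": True},
    {"G1": True, "G2": True, "G3": False},
    {"G1": False, "G2": False, "G3": True},
]
labels = [True, True, False, False]

print(separating_genes(samples, labels))  # []
print(separating_pairs(samples, labels))  # [('G1', 'G2')]
```

No single gene tells the affected samples from the healthy ones here, but the pair G1 and G2 does. The number of pairs grows with the square of the number of genes, and larger combinations grow faster still, which is one reason this kind of search calls for smarter software than brute force.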
Gene therapy itself is still in its infancy, and so is microarray
technology, and so is AI-driven bioinformatics. But all
these areas are growing fast – fast like the Internet
grew during the 1990s. Very fast. Exactly where it’s
all going to lead, who knows. But it’s a pretty sure
bet that the intersection of medicine, genetics, proteomics,
computer engineering, AI software, and robotics is going
to yield some fascinating things. We’re beginning
to see the rough outlines of early 21st-century science.

One of the more interesting figures in the history of microarray
technology is Stephen Fodor, the founder of Affymetrix.
Fodor is not interesting in the same way that, say, Hugo
de Garis is – he’s not a colorful character,
wild-eyed with far-off visions. Quite the opposite: he illustrates
the ease with which, in the modern biotech industry, radical
scientific innovation and the conservative ways of the corporate
world have come to seamlessly interact.
The biotech revolution, like the computer revolution with
which it has become thoroughly entangled, involves an unprecedentedly
tight interaction between the worlds of science, engineering
and business. And this interaction is not just about partnerships
between organizations, it’s about individual human
beings stretching their minds and personalities to encompass
diverse, often divergent perspectives.
There’s the creative and exploratory world-view of
the scientist, in which solid experimental results or elegant
theories are the proof of success. There’s the pragmatic
and functional perspective of the engineer, in which a high-quality
working system is worth more than anything. And then there’s
the sometimes cut-throat vantage of the businessman, in
which the bottom line is always a financial one, and Bill
Gates is vastly more valuable than Einstein. Typically,
historically, these different orientations toward life have
resided in different people’s brains. But more and
more people each year, in order to achieve their goals,
are being forced to internalize all these perspectives,
and weave them together into an integrative approach.
In the domain of biotechnology, Fodor wonderfully exemplifies
this emerging synthesis. He is a scientist whose groundbreaking
scientific/engineering achievements led him into the business
world, where he’s now managing the development and
marketing of technology based on his initial breakthroughs.
His firm, Affymetrix, was one of the most promising of the
biotech start-ups of the late 1990s, and shows no sign
of slowing down.
Fodor’s career began like that of a typical overachieving
young bioscientist. He received his B.S. in chemistry in 1978
and his M.S. in biochemistry/biophysics in 1982, both from
Washington State University – a solid school, though
not a world-leading institution. He moved on to Princeton
University for his PhD, which he received in 1985. Following
a post-doctoral fellowship at Berkeley, he wound up at Affymax
Research Institute, where his group led the development
of new technologies, oriented towards creating very dense
arrays of biomolecules by combining photolithographic methods
with traditional chemical techniques. The advantage of packing
biomolecules together in very dense arrays is that one can
then study a large number of molecules all at once, in a
single experiment, as opposed to traditional experimental
biology approaches in which one studies one or a handful
of molecules at a time. This work was an interesting example of interdisciplinary
cross-fertilization of ideas, Fodor’s chemistry and
biophysics background spurring him to think about the problem
differently than traditional biologists would.
As the work became more and more promising, the potential
commercial possibilities became more and more clear. If
one could affix a large number of different segments of
DNA or protein to some surface, in a tightly packed array,
then one could effectively experiment on all of them at
once, gathering millions of times as much data as was possible
using traditional approaches, where one worked with many
fewer pieces of DNA or protein at a time. There were still
a lot of technical issues to be worked out, but the viability
of the idea was clear. With this in mind, in 1993, Fodor
and a group of other Affymax scientists decided to form
the firm Affymetrix, dedicated to the creation and dissemination
of radical new technology for genomic and proteomic data
gathering, based on the research of Fodor and his colleagues.
When Affymetrix was founded, Fodor was Scientific Director,
but over time, he found himself becoming more and more involved
with the business side of the company, and in 1997 became
President and Chief Executive Officer. And the company would
seem to have benefited significantly from having a leader
with a passion for all aspects of its operations, scientific,
engineering, marketing and financial. It’s still the
technology, and its potential to transform bioscience as
a whole, that gets Fodor most excited. But, a consummate
realist, he has realized that focusing on the technology
alone is not the optimal way of going about the process
of transforming bioscience. Getting the technology out there
in use in as many places as possible is just as critical
as making the technology effective.
The original motivation for the gene chip work was to create
a device that would hold thousands of molecules in place
so they could be tested simultaneously to determine which
ones were viable drug candidates. Fodor saw how, as he put
it, the DNA or protein molecules stuck on the chip could
act "as thin strips of molecular Velcro." By seeing
which molecules stick to which other ones, one can discover
all sorts of things about genes -- detecting mutations,
revealing information about diseases or treatments, figuring
out which genes interact with which other ones during cell
development, etc.
All this began as a chancy, complex experimental procedure
and is now fully automated; Affymetrix manufactures 5,000 to 10,000
DNA chips per month.
And in his spare time, among other things, Fodor is thinking
about the ethical aspects of genomics, a discipline that
is well-known as an ethical minefield, with new issues like
stem cell research and human cloning popping up every day.
In the ethics of genetic research, commercial, scientific
and engineering perspectives intersect with humanistic and
even spiritual issues, and Stephen Fodor and others with
his diverse background are uniquely positioned to deal with
such issues in an integrative way.
At a 1999 Princeton University symposium on bioethics, he
observed that “Having a commercial background brings
a different bent to the ethics around the subject….”
To illustrate this, he offered an amusing anecdote. “I
was talking to a friend of mine,” he said, “whose
father used to run a dry cleaning company near New York
City. Every day when the clothes came in, he would go through
the pockets of the clothes, and see what he found in there.
One day he found a hundred dollar bill. He said this raised
a serious ethical question -- whether he was going to tell
his partner. [i.e., whether he was going to share the $100
with his partner or not] So, ethics is in the eye of the
beholder….”
This little story has the empirical directness of the scientist
about it. As a pragmatic businessman, Fodor has long since
realized that humanistic sentiments don’t make the
business world go around. As a rule, the only principle
that can be relied upon to mean anything to a corporation
is the maximization of shareholder value. As a scientist,
he sees this situation quite plainly as an empirical fact.
Why, then, as an ethically concerned individual, is he relatively
unworried by the consequence of this cut-throat attitude
for the development of biotechnology? It’s simple.
He believes that the power of the technology to do good
is far greater than its power to do damage.
One ethical worry associated with genetic analysis is that
ambitious parents will use it to overengineer their progeny
– killing a fetus if, for example, its genes indicate
that it won’t be sufficiently musically or athletically
talented. Some people find this unproblematic, others find
it repellent. As for Fodor, when asked if there should be
regulations on using DNA chips for prenatal screening, he
hems and haws a bit, observing that “Prenatal screening
is a bigger question than just these chips….”
His main concern in this connection is “… personal
privacy. I’m not an advocate whatsoever of the possibility
of health care organizations doing screening and databasing
and letting you know what the options are. I think the best
case is that people get the information themselves and decide
what to do with it, that the information is in their control.
The levels of privacy I think have to be worked out.”
While information privacy is an important concern, it’s
perhaps idealistic to think that genetic information, among
all medical data, is going to be kept from the vast health
care establishment. All in all, where this sort of issue
is concerned, one gets the impression that Fodor is slightly
bored and not 100% engaged. Rather than worrying about what-ifs,
his focus is on building the technology and doing the best
things he can with it, and what society as a whole makes
of it is indeed beyond his control. The key point, in his
view, is the vastness and diversity of “wonderful
commercial and scientific possibilities. We’re in
the early days of this…. There’s a tremendous
number of medical and scientific applications…. What
are the good things you can do? What are the values you
can create for people going forwards?” This is what
gets Stephen Fodor excited, not worrying about negative
possibilities.
With a salary and bonus package pushing a half million dollars
a year, and many tens of millions in stock options (much
of it fully vested), Fodor is certainly profiting personally
from his turn towards the commercial world. And he is clearly
experiencing many money and business oriented distractions,
such as a recent lawsuit against Incyte (a particularly
perplexing lawsuit given Affymetrix’s ongoing partnership
with this firm). But in practical terms, in spite of the
inevitable over-busyness of his multifaceted role, he is
doing his best to work toward realizing beneficial applications
of his technology as well as toward personal and corporate
profit.

So far the facts would seem to support Fodor’s views
about the potential for good in his technology. It is tremendous.
The medical applications of DNA chips may well be revolutionary.
As Fodor says, "Affymetrix was founded on the belief
that understanding the correlation between genetic variability
and its role in health and disease would be the next step
in the genomics revolution." And the results to back
up this vision have started coming in.
For instance, two years ago researchers at the Whitehead Institute
used DNA chips to distinguish different forms of leukemia
based on patterns of gene activity found in cancerous blood
cells. This approach has led to real practical benefits,
for example in some cases reversing the incorrect diagnoses
made by other, cruder methods. And this is only the barest
beginning. As Dr. Lander of the Whitehead Institute says,
"the research program aims to lay a foundation for
the `post-genome' world, when scientists know the complete
sequence of DNA building blocks that make up the human genome."
Mapping not only what is in the genome, but what the things
in the genome do, is the real secret to comprehending and
ultimately curing cancer and other diseases.
One of the more interesting developments in the medical
application of DNA chips is the creation of the Affymetrix
spin-off company, Perlegen Sciences Inc. Perlegen’s
goal is to use DNA chips to help understand the dynamics
underlying various diseases – starting out with the
rare disease “ataxia telangiectasia” (A-T),
with which the two sons of Perlegen co-founder Brad Margus
are afflicted.
Ataxia is a word for loss of muscular coordination; telangiectasia
refers to the small blood vessels that pop up on the skin
and eyes of A-T victims. A-T typically affects youths; 40%
of A-T children develop cancer, and few live past their
20s. Margus was the boss of a $100 million-a-year shrimp
processing company when he discovered his sons were afflicted
with A-T -- and, in a remarkably systematic and dedicated
fashion, began to devote more and more of his life to researching
the biological foundations of the disease. He helped raise
millions of dollars for research on A-T and its genetic
basis, a quest that ultimately led him to Stephen Fodor.
Affymetrix array chips, it seemed to Margus and his bioscientist
collaborators, could be used to study the way different
individuals with A-T would react to different medications.
It could vastly accelerate the drug discovery process, by
allowing so many experiments to be run in parallel. Of course,
this is exactly the kind of humanistically valuable application
of DNA chip technology that makes Stephen Fodor happiest.
It didn’t take much effort to convince Fodor that
Affymetrix should help Margus in his quest, by helping to
form Perlegen.
With humanistic applications like this swirling all around
him, it’s not hard to see why Fodor is relatively
unruffled by the ethical dilemmas that some find in genetic
research. Are there potential dangers in this technology?
To be sure. But there is also tremendous potential to help
people. And so far there is no doubt that the positive far
outweighs the negative. DNA chips have helped find cures
for diseases, and they haven’t harmed anybody.
Of course, as Fodor says, this is just the beginning. We’ve
mapped the genome, and now, baby step by baby step, we’re
starting to understand the process by which strands of genetic
material interact with other molecules to form organisms
like us. As we move along this path of understanding, we’ll
be able to cure more and more diseases, and more dramatic
possibilities for genetic screening and genetic modification
will open up. One can only hope that the optimism and focus
on positive applications that Dr. Stephen Fodor embodies
will continue to carry the day.
“…these new tools and high throughput techniques have unleashed
a flood of biological data - information that continues
to double in size every 12 months. …Looking forward,
we are confident that informatics will represent the next
quantum leap in drug discovery. We expect the market for
informatics to reach 4 Billion by 2004”
Michael Clulow, UBS Warburg
“…There’s a concern on the part of biotech and pharmaceutical companies
that they’re paying millions of dollars to generate
millions of data points but not getting the value out of
that data because they can’t analyze it with contemporary
tools”
David K. Stone, AGTC Funds
I always thought genetics was fascinating – what curious
young nerd wouldn’t? – but it wasn’t until
mid-2000 that I began to seriously consider genetics and
proteomics as a research area I might want to focus on.
Well before Webmind Inc. folded, I had grown seriously disenchanted
with the application areas toward which we’d chosen
to orient our products. Business success was proving elusive
in spite of the fact that our products outperformed the
competition’s, and – a largely separate issue
– it didn’t seem that our products were evolving
in directions that would make maximal use of our most original
technology, the Webmind AI Engine. Financial prediction
was fun, and the Webmind MP seemed to work outstandingly
well -- but the essence of our approach was the use of news
to predict market movements, and I wanted to get away from
natural-language-processing-centric applications. The other
products we were making – Webmind Classification System
and Webmind Search (a search engine that was never released
but was used internally within the company) – were
even more human-language-centric. But the more we worked
on our AI system, the more we on the R&D side realized
that starting out with human-language-based products was
putting the cart before the horse. We needed an application
domain that had a rich variety of nonlinguistic data, that
the system could reason about, building up a domain-specific
knowledge base that could then be used to experientially
ground linguistic knowledge – little by little, step
by step, much as a human baby grounds its early linguistic
knowledge in its observations about the nonlinguistic physical
world it’s embedded in.
The finance domain did have its strong points – words
about market movements could be correlated with actual observed
market movements, for example, providing an elementary form
of symbol grounding. But too often, in financial texts,
the language was imprecise and evocative rather than precisely
descriptive. More and more often, my thoughts began shifting
to biology. There was so much biological data being generated
– and it was so diverse. It was exciting to think
of all this data being fed into an integrative AI system,
a system capable of using it to draw new and interesting
conclusions.
Of course, it didn’t take long to realize that biological
data wasn’t really an ideal application domain either.
It’s a great testing ground for integrative cognition,
and even perception, but there are too few opportunities
for an AI system to act. Actions such as sending information
to human users are obviously present, and on a much slower
time scale, a bio-focused AI can control robot arms running
biological lab experiments. But all this is nothing similar
to the intense perception/action/cognition interactivity
that a baby gets from the physical world. So the idea of
biological data as an application domain definitely doesn’t
displace the EIL, Baby Novamente approach. But it is a worthy
complement, and I’ve spent a decent portion of my
time over the last year thinking about how to apply Novamente
to analyze genetics data, and designing products and running
prototype data analysis experiments in this direction.
Whether these products will ever get built, and this line
of research continued, is not certain at this point. At
the time of writing (early 2002), we’re seeking business
funding for Biomind LLC, a company focused on these Novamente-based
biology applications – but our quest for funding may
fail, or we may be pushed in a different direction for one
reason or another. But no matter how the cards fall, the
process of exploring the potential applications of Novamente
to genetics and biology generally will not have been fruitless.
In working out these potential Novamente applications, we
have seen a great deal of the future of AI-enhanced biology
– and it’s an exciting future indeed.
The amount of data modern biologists are collecting is truly
immense. About 100 microorganisms have been completely sequenced
with many more in the pipeline. The human genome and other
eukaryotic genomes such as yeast, Drosophila, and C. elegans
are now available online. New sequencing projects begin
almost daily. Microarrays and mass spectrometry produce
massive datasets, which lead to massive data analysis problems.
With each genomic sequence, there are more genes, more RNAs,
more proteins, more phenotypes, and more data in databases.
It is a blessing to have such data, but only if it can be
accessed, integrated, and used to develop new knowledge.
In May 2001, a couple months into my post-Webmind-Inc. phase,
I realized that this was a mission worthy of a real AI.
And what better way to start off an AI with a good attitude
toward humans, than to have it focus its early energies
on analyzing human cells, with a view toward curing human
diseases and helping humans to live longer?
There is no doubt that existing biological databases contain
the secrets to hundreds of undiscovered drugs. What I gradually
realized during 2001 was what was needed to draw these secrets
out. Something simple yet elusive: data analysis software
that automatically deploys this massive data pool within
the experimental data analysis process. This new kind of
feedback between wet lab work and advanced data analysis,
once it’s achieved, will lead to a raft of new discovery,
making the pharmaceutical progress of the last 10 years
seem like the merest beginning. As a single, very important
example, current tools make it very difficult to find sets
of genes that can collectively function as drug targets
(sites where gene-therapy drugs can act to interfere with
a particular disease process) – whereas an integrative
data analysis framework will in time make this kind of discovery
routine.
Throughout Fall 2001 I talked extensively about these ideas
with Maggie Werner-Washburne, a deeply insightful yeast
geneticist in the University of New Mexico biology department.
The more we talked, the more I realized what an excellent
Novamente application this could be. For no single magic
bullet, no one bioinformatic trick, is going to provide
the deep, dynamic, goal-directed information integration
that modern biology requires. What is needed is a combination
of four ingredients:
· database integration
· visualization tools
· natural language processing (NLP) tools that extract
information from research papers, adding information to
databases
· automated inference tools that synthesize information
from different databases
The biggest conclusion I drew from our conversations was this:
Whoever can deliver these ingredients in a user-friendly
package will be the one leading bioscience into the new
millennium. This realization crystallized the vague bioinformatics
ideas my Novamente colleagues and I had been tossing around.
We began designing an ambitious bioinformatics software
system called Biomind, which – if and when it’s
completed – will deploy the Novamente AI system toward
the goal of helping biologists understand their experimental
data in the context of the massive amount of general biological
“background information” that now exists.
From a pure AI point of view, Biomind, like any practical
application, is a bit of a digression from the straight
road toward real AI. However, unlike any application we’d
worked on before, all of us on the Novamente team find it
tremendously scientifically fascinating in its own right.
And Danny Hillis’s point about the value of practical
applications is not to be overlooked. We are learning a
lot right now by stress-testing Novamente cognition on bio
databases.
One difference between Biomind work and pure Novamente “real
AI” work is that the focus in the former case is not
entirely on artificial intelligence, but equally on
intelligence augmentation – helping biological scientists
to use their expertise and intuition to follow pathways
to discovery. Biomind is not intended, in the short term,
to think about biology better than the biologists do. It’s
intended to integrate broad information from databases better
than biologists do, so that rather than spending their time
sifting through huge databases and journal paper archives,
biologists can spend their time thinking about biology.
(Which they will continue to do better than Biomind, at
least for a decade or two or three!)
To get a little more nitty-gritty, what we’re planning
with Biomind falls into two categories: database-building,
and AI-powered data analysis.
The database-building part is the most straightforward.
We plan to use Novamente to integrate the knowledge contained
in public biological DB’s, creating a massive database
called the BiomindDB. And then, on the analytical side,
we are creating a set of data-mining processes called the
Biomind Toolkit – whose techniques, far from just
analyzing each dataset on its own in the manner of other
bioinformatics products, will analyze each dataset in the
context of all the information in the BiomindDB.
And, in the slightly longer term, we can build BiomindDB’s
not only for public biological data, but also for the proprietary
data of client firms. Right now, many of the more progressive
pharmaceutical and genomics/proteomics research companies
are involved in massive projects of internal database integration.
Large amounts of money are being poured into these projects,
but the end result is neither a body of new knowledge nor
a new and better approach to drug discovery. Rather, the
end result is a common user interface for diverse databases.
This is a valuable thing, but what is perhaps most valuable
about it is that it sets the stage for Biomind and similar
systems. Once a biotech firm has enabled their scientists
to talk to all their databases through a common interface,
they are ready to allow their own data to transform their
own discovery process. They are ready:
· To create a new database consisting of knowledge formed
by combining pieces of information from their various databases
(a private BiomindDB)
· To analyze their experimental data (such as gene
expression and mass spec data) making full use of the biological
background knowledge contained in both the public BiomindDB
and their proprietary BiomindDB
What is really fascinating here is that the BiomindDB, if and
when we complete it, will be a valuable biological database
that does not contain any novel “primary information.”
All the information in it will be derived indirectly either
from other databases or from research papers. However, it
will contain novel pieces of information that are synthesized
by combining pieces of information found elsewhere. Indeed
this is its reason for being.
Very basic examples of information found in the BiomindDB
would be:
· Genes that are similar to each other (overall, or along
specific “axes” such as having similar sequences,
similar promoter elements, or similar involvements in pathways
and regulatory networks)
· Proteins that are similar to each other (overall,
or along specific “axes” as with genes)
The ability to submit a gene or protein (or a set of genes or
proteins) as a query, along with an axis of similarity,
and receive back a list of similar genes/proteins, is a
simple but remarkably powerful functionality.
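As a rough illustration of what such a query might look like, here is a minimal Python sketch. It is not the BiomindDB interface: the gene names, the feature vectors, and the `similar_genes` function are all hypothetical, and cosine similarity is just one plausible choice of measure for an “axis.”

```python
from math import sqrt

# Hypothetical toy records: one feature vector per gene per similarity axis.
GENES = {
    "geneA": {"sequence": [1.0, 0.0, 1.0], "pathway": [1.0, 1.0, 0.0]},
    "geneB": {"sequence": [1.0, 0.0, 0.9], "pathway": [0.0, 0.0, 1.0]},
    "geneC": {"sequence": [0.0, 1.0, 0.0], "pathway": [1.0, 1.0, 0.1]},
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similar_genes(query, axis, top_n=5):
    """Return up to top_n (gene, score) pairs most similar to query on one axis."""
    q = GENES[query][axis]
    scored = [
        (name, cosine(q, feats[axis]))
        for name, feats in GENES.items()
        if name != query
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]
```

Calling `similar_genes("geneA", "sequence")` ranks the toy genes by sequence-style similarity, while passing `"pathway"` as the axis can give a quite different ranking for the same query gene.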
The BiomindDB will also contain more specific information,
of course. To give the details would require a distractingly
long biology lesson, but some examples for the bio-savvy
reader would be:
· Transcription factors that are activated by transport from
the cytoplasm to the nucleus
· Enzymes with specific cofactors that are expressed
in response to starvation
· Genes that are induced in response to some but
not all starvations or stresses
· Enzymes that are active in different pathways and are
activated through a specific signal transduction pathway
· Proteins with homologs in procaryotic systems that
interact and are required for survival under some conditions
· Proteins with homologs only in the fungi, that
are coexpressed and interact with some of the same proteins
· Sets of interacting proteins, whose genes are all
induced by the TOR pathway
All this is knowledge that a biology PhD could derive by reading
the research literature and scanning relevant databases
and keeping careful notes in a spreadsheet. But this kind
of data surfing and integration can be very time-consuming
and tedious, with the result that few scientists do it as
thoroughly as they should. The task is often offloaded to
assistants or (in the academic context) graduate students
who lack the background knowledge to do a truly thorough
and insightful job. Having this sort of integrative knowledge
at one’s fingertips will be extremely valuable, and
just as indispensable to the discovery path as a new source
of primary physical data.
What will the Biomind Toolkit do with all this knowledge?
It will do a huge number of things, varying based on the
particular kind of experimental data being fed into it.
The case we’ve worked with predominantly so far
is gene expression data, generated by gene chips and spotted
microarrays as discussed above.
Current toolkits for analyzing gene expression data, such
as the excellent GeneSpring and Spotfire products, focus
on the production of clusters or category models. Clusters
are produced by statistical analysis of the expression profiles
of genes, grouping together genes with similar expression
profiles. Category models are produced when one provides
the software with gene expression profiles of cells falling
into different categories (e.g. cancerous vs. non-cancerous).
A category model may tell you, for instance, that a certain
set of 25 genes generally are more active in the cancerous
cells than in the noncancerous cells.
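To make the idea of a category model concrete, here is a deliberately crude Python sketch. It is not how GeneSpring or Spotfire work; it simply ranks genes by the difference in mean expression between one category and the rest. The function name, the data layout, and the sample values are assumptions made for illustration.

```python
from statistics import mean

def genes_more_active(expression, labels, category, top_n=25):
    """Rank genes by (mean level inside category) - (mean level outside).

    expression: dict mapping gene name -> list of levels, one per sample
    labels:     list of category labels, one per sample, in the same order
    Returns up to top_n genes whose mean level is higher inside the category.
    """
    scores = {}
    for gene, levels in expression.items():
        inside = [x for x, lab in zip(levels, labels) if lab == category]
        outside = [x for x, lab in zip(levels, labels) if lab != category]
        scores[gene] = mean(inside) - mean(outside)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [gene for gene in ranked if scores[gene] > 0][:top_n]

# Hypothetical toy data: two cancerous samples, two non-cancerous samples.
labels = ["cancer", "cancer", "normal", "normal"]
expression = {
    "g1": [5.0, 6.0, 1.0, 1.0],
    "g2": [1.0, 1.0, 1.0, 1.0],
    "g3": [2.0, 3.0, 1.0, 2.0],
}
more_active = genes_more_active(expression, labels, "cancer")
```

A real category model would also need some test of statistical significance; a plain difference of means is only the simplest possible stand-in.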
The two big problems with these clustering and categorization
tools are:
· The clusters and category models produced are often “wrong,”
i.e., not biologically meaningful
· Even when meaningful, the clusters and categories
found are often too large. For instance, having a set of
25 genes identified as important is not nearly as useful
as having a set of 3-5 genes identified as important, because
it may take weeks of wet lab work to explore each potentially
important gene in detail.
Two of the main initial functions that we’re building
into the Biomind Toolkit are aimed at overcoming these problems.
The first problem is overcome by invoking knowledge from
the BiomindDB to guide the clustering and category model
building process. For instance, if it’s known that
two genes are involved in many of the same metabolic or
regulatory pathways, this should bias the clustering process
to suspect that perhaps these genes should be clustered
together.
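One simple way to picture this kind of bias is sketched below in Python. It is an assumption about how such a bias could work, not a description of the Toolkit's actual method: the distance between two expression profiles is shrunk in proportion to how many pathways the two genes share. The function names and the `weight` parameter are hypothetical.

```python
from math import sqrt

def expression_distance(p, q):
    """Euclidean distance between two expression profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def pathway_overlap(a, b):
    """Jaccard overlap between two sets of pathway names (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def biased_distance(profile_a, profile_b, pathways_a, pathways_b, weight=0.5):
    """Expression distance, shrunk when the genes share background pathways.

    weight=0 ignores background knowledge entirely; weight=1 makes genes
    with identical pathway membership look identical to the clusterer.
    """
    d = expression_distance(profile_a, profile_b)
    return d * (1.0 - weight * pathway_overlap(pathways_a, pathways_b))
```

Feeding `biased_distance` to any distance-based clustering method in place of the raw expression distance nudges genes with shared pathways toward the same cluster, without forcing them together when their expression profiles strongly disagree.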
The second problem is overcome by what we call “post-clustering/categorization-analysis.”
Suppose clustering or category-model-building has produced
a set of genes of likely interest. One then wants to explore
this gene set in detail. The Toolkit will give you a report
summarizing the surprising and significant properties of
this set of genes. For instance, it may note that there
are many more fungal essential genes in the set than would
be expected. It may note that of the 25 genes in the set,
5 are probably in the same signal transduction pathway,
and 3 others are extremely similar to elements of this set
of 5, based on various criteria. It may identify another
gene, not in the set, which produces a protein that interacts
with the proteins generated by several genes in the set.
In this way the user’s attention is guided to 9 genes
rather than 25, one of which was not even produced by the
original clustering/categorization algorithms. The user
may then use their own knowledge to infer that, if the one
extra gene suggested by the Toolkit is interesting, perhaps
a couple other similar genes may be interesting too. So
they may search the BiomindDB to find genes similar to this
recommended gene. This kind of post-clustering/categorization
analysis can be done without the Toolkit and BiomindDB,
but it may literally take days or weeks of tedious work.
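The “many more than would be expected” judgment in such a report can be made precise with a standard enrichment test. The Python sketch below uses a hypergeometric tail probability; this is a common choice in bioinformatics, but it is an assumption here rather than a description of what the Toolkit computes, and the function name and example numbers are hypothetical.

```python
from math import comb

def enrichment_p_value(genome_size, annotated_in_genome, set_size, annotated_in_set):
    """Probability of seeing at least annotated_in_set annotated genes
    when set_size genes are drawn at random from the genome
    (upper tail of the hypergeometric distribution)."""
    total = comb(genome_size, set_size)
    upper = min(annotated_in_genome, set_size)
    tail = sum(
        comb(annotated_in_genome, k)
        * comb(genome_size - annotated_in_genome, set_size - k)
        for k in range(annotated_in_set, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 6000 genes, 1000 annotated as essential,
# and 12 essential genes turning up in a set of 25.
p = enrichment_p_value(6000, 1000, 25, 12)
```

With these made-up numbers, chance alone would put about four essential genes in the set, so finding twelve yields a very small `p`, which is the kind of property a report would flag as surprising.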
The Biomind Toolkit, as it currently exists in prototype form,
has several different clustering methods built into it,
including k-means, EM, and other standard techniques. It
also has a less-well-known method that we have found to
be uncommonly effective on gene expression data, which is
force-directed clustering as embodied in the VxInsight product,
a piece of software created at Sandia Labs and commercialized
by Viswave Inc. This technique is particularly valuable
in that it provides a user-friendly topographical visualization
of gene expression data, in which each cluster of similar
genes appears as a “hill” and related clusters
are depicted as nearby hills. In our experience, this is
a particularly intuitive metaphor for biologists to use
to explore gene expression data, and we feel that this aspect
of the Toolkit is sure to be a hit with customers. A landscape
visualization of raw gene expression data, followed by a
landscape visualization of the data as interpreted in the
context of the BiomindDB, will be an extremely instructive
part of the discovery process.
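For readers who have not met k-means, the simplest of the standard techniques mentioned above, here is a bare-bones Python sketch. It is a textbook version, not the Toolkit's code; the gene names, profiles, and starting centers are hypothetical, and the centers are supplied by the caller so that the result is repeatable.

```python
def kmeans(profiles, centers, iterations=10):
    """Assign each gene to the nearest of the given cluster centers.

    profiles: dict mapping gene name -> expression profile (list of floats)
    centers:  list of initial center vectors, one per cluster
    Returns a dict mapping gene name -> cluster index.
    """
    def dist2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    centers = [list(c) for c in centers]
    assignment = {}
    for _ in range(iterations):
        # Assignment step: each gene joins its nearest center.
        assignment = {
            gene: min(range(len(centers)), key=lambda i: dist2(profile, centers[i]))
            for gene, profile in profiles.items()
        }
        # Update step: each center moves to the mean of its members.
        for i in range(len(centers)):
            members = [profiles[g] for g, c in assignment.items() if c == i]
            if members:
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Hypothetical two-condition expression profiles for four genes.
profiles = {"g1": [1.0, 1.0], "g2": [1.2, 0.9], "g3": [8.0, 9.0], "g4": [9.0, 8.0]}
clusters = kmeans(profiles, centers=[[0.0, 0.0], [10.0, 10.0]])
```

Each resulting cluster corresponds to what the landscape view would draw as a single “hill.”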
The third initial function of the Toolkit, approximate regulatory
network inference, goes beyond what current products do.
It is applicable specifically to time course gene expression
data, e.g. data reporting the expression levels of genes
at 20-100 points during the cell cycle. Currently not that
many labs are creating time course data sets of sufficient
length to enable meaningful approximate regulatory network
inference, but the presence of a software tool capable of
carrying out this sort of inference will have an influence
on experimental methodol