Predicting How Genes and Proteins Function

By Maureen Curran and Doug Ramsey

AFP 2006 co-chair Adam Godzik
San Diego, CA, September 8, 2006 -- The growth rate in high-throughput DNA sequencing has been exponential, but there are too few experimental biologists to analyze the flood of sequences and work out how they function. Indeed, scientists do not know the function (the biological significance) of roughly 40 percent of known DNA sequences.

"One of the most challenging problems in computational biology is how to provide information about these millions and millions of sequences,” said Adam Godzik, professor and director of the Bioinformatics and Systems Biology program at the Burnham Institute for Medical Research. “Biologists are getting good at predicting the functions of genes, but there are not enough researchers to go through the 17-20 million proteins now discovered, so the only solution is to extract knowledge from experts and write computer programs that can do it automatically.”

This relatively new field of automated function prediction (AFP) was the focus of an international conference at Calit2 on the UC San Diego campus last week. AFP 2006 ran from Aug. 30 through Sept. 1. Eight keynote speeches and 19 contributed talks reflected the wide variety of new approaches now emerging for predicting gene and protein function. (See further below for direct links to streaming video of all the keynote presentations at AFP.)

The conference was organized and co-chaired by Godzik and Iddo Friedberg, a postdoctoral researcher in Godzik's lab. "No single method of computational function annotation will tell us all we need to know about the function of a biomolecule," said Friedberg. "We are therefore required to put those methods together in order to get comprehensive answers. The first step towards doing that is to get people to talk, and that is what this meeting was all about."

Keynote speakers Shoshana Wodak and Steven Brenner chat during the AFP poster session.
Protein Data Bank director Phil Bourne (left).
UCSD computer science professor Yoav Freund (right) with Nir Ben-Tal.

Researchers from UCSD and other institutions stressed the value of an integrative approach. "We need to find ways of working together, to bootstrap using the different and complementary approaches, to advance the power of our predictions," said UCSD associate vice chancellor for research John Wooley. "The large turnout for this meeting by computational biologists from around the world reflects a shared recognition that approaches that serve to integrate the disparate methods and provide a community-centric perspective are essential for us to uncover the functional information buried within the treasure trove of sequences."

One of the keynote speakers, Terry Gaasterland of the Scripps Institution of Oceanography, was impressed with the variety of approaches to computational function prediction aired at AFP 2006. "I really like the idea of bringing to bear all of the different approaches and applying them to the large bodies of putative protein sequences, and trying to figure out what they do," said Gaasterland, Director of the Scripps Genome Center. "I say 'putative', because we don't really know if these genes are real proteins."

Calit2 was a logical place to hold the AFP meeting. Godzik is a member of the Steering Committee of CAMERA, Calit2's new joint venture in marine microbial metagenomics with the J. Craig Venter Institute, funded by the Gordon and Betty Moore Foundation. CAMERA is developing cyberinfrastructure to handle masses of metagenomic data, and automating the process of function prediction will be an essential ingredient if the project is to succeed.

UCSD's Wooley, who was instrumental in pulling together the Calit2-led CAMERA project, believes that the AFP meeting underscored the need for sophisticated computational tools and cyberinfrastructure for understanding the information implicit in genes. "We have the DNA sequences, but until we get to knowing the function of the proteins encoded by the genes, the information means little to us and cannot be applied to benefit society," said Wooley. "Automated function prediction has been a key step in the application of conventional genomics, but it is absolutely essential for metagenomics."

AFP 2006 organizer and co-chair Iddo Friedberg (center)

"Taxpayers have paid a lot of money to decode the human genome and the genomes of a lot of model organisms, and half of what we found, we know almost nothing about," observed Russ B. Altman, professor of genetics, bioengineering and medicine at Stanford University, who co-organized the three-day event and gave one of the keynotes. "This conference is focused on methods to predict the function of the 50 percent of genes that we know very little about. There is an urgent need to understand what those molecules are doing in the cell: Are they potential drug targets in the future? Are they doing something we can interfere with and affect the response to disease?"

According to Altman, AFP attendees discussed the need to test the predictive power of the various techniques. One suggested solution: underwrite a Grand Challenge competition in which experimental biologists would already know the function of a gene, and AFP researchers would be invited to apply their respective techniques, then publish and analyze their predictions. Noted Altman: "This could be one of the best ways to remove biases."

AFP for Structural Genomics and
Adam Godzik, Burnham Institute
Length: 37:45 [video]

CorrIE: Probabilistic Proteins Sequence Annotation Based on Functional Classifications
Christos Ouzounis, European Bioinformatics Institute
Length: 36:59 [video]

Problems and Proposals in Protein Molecular Function Prediction
Steven Brenner, UC Berkeley
Length: 36:54 [video]

Clustering Protein Microenvironments for Automated Function Prediction
Russ Altman, Stanford
Length: 22:19 [video]

Revisiting Function Prediction at CASP
Anna Tramontano, University of Rome
Length: 36:11 [video]

Identifying Meaningful Functional Modules in the Yeast Protein-Protein Interaction Network
Shoshana Wodak, University of Toronto
Length: 55:01 [video]

Function Prediction in the RNA World
Terry Gaasterland, UCSD Scripps Institution of Oceanography
Length: 42:01 [video]

Novel Ways to Think about Protein and its Impact on Function
Phil Bourne, UC San Diego and SDSC
Length: 44:53 [video]

Researchers also discussed the possible creation of a central repository to pull together all function predictions and the methodologies used to arrive at them.

According to the Burnham Institute's Godzik, predicting the function of newly sequenced genes and proteins is hampered by the lack of a standard way to annotate DNA sequences. "The speed at which sequences are being collected is exponential, while the ability to annotate them is, at best, linear," said Godzik. "With new technologies such as structural genomics and metagenomics, and their very large data sets, many more people are realizing that it's a problem."

"The projects that pump out sequences do some annotation, but it is totally uneven, there is no consistency, and up to half of the annotations are just plain wrong," added Godzik. "NIH rules for annotation date back to when sequencing was a project in itself. The rule was that whoever deposits a sequence also annotates it. But, for instance, high-throughput sequencers are not also high-throughput annotators. So we are discussing ways to develop a uniform set of standards."

"What is interesting about this meeting is that all of the keynote speakers have known each other for more than ten years," added Gaasterland. "This is a field and community of researchers that has evolved over time, and it's interesting to see that this is a group willing to forge ahead with their research, even when the specific extent (and even the name) of their field remains somewhat fuzzy."

According to AFP 2006 co-chair Adam Godzik, there is substantial support for making the conference an annual event: "We hope to make a profound difference in this field, inspired by the impact that CASP (Critical Assessment of techniques for protein Structure Prediction) has had on validating and improving computational methods for predicting the structure of proteins."

Related Links
AFP 2006
Burnham Institute for Medical Research