By Stephanie Sides
Hilton Head Island, South Carolina, October 18, 2006 -- As befits the emerging discipline of metagenomics, this week’s meeting at Hilton Head Island, South Carolina, is characterized by a very broad range of topics and frequent surprise at the results of data analysis. The conference featured a presentation and poster on the CAMERA project, and it provided the opportunity for the CAMERA development team to discuss progress and make future plans in more depth.
The sessions, typically with three presentations each, have focused on topics such as marine microbial genomics, cancer genomics, infectious diseases, environmental genomics, and synthetic biology. Some 300 people are in attendance.
The element of surprise was perhaps expressed most dramatically by Mitchell Sogin, Woods Hole Oceanographic Institute, in discussing the International Census of Marine Microbes, the goal of which is to report “what is known, what is unknown but knowable, and what may be unknowable with respect to the diversity of marine microorganisms.” “We’re underestimating diversity dramatically,” he said, “which is frightening.”
The session on emerging genomic technologies held late Monday afternoon, led by Calit2 director and CAMERA PI Larry Smarr, featured two members of Calit2: Mark Ellisman, professor of Neuroscience and Bioengineering at UCSD, and Paul Gilna, executive director of the CAMERA project.
Ellisman discussed leveraging convergent revolutions in biological science and information technology to pursue his vision of enabling better understanding of the brain by linking data about macroscopic brain function to its molecular and cellular underpinnings. A big part of the challenge, he said, is that a scientist analyzing a single brain at one-micron resolution has to cope with 4.5 petabytes. “Even so, that doesn’t get you near the cellular level,” he said. “We still need to get to 20-nanometer resolution.”
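To put the gap Ellisman describes in rough numbers: moving from one-micron to 20-nanometer resolution shrinks the sampling interval by a factor of 50. The sketch below assumes, purely for illustration, that the data is volumetric and that storage grows with the cube of the linear resolution at constant bytes per voxel. The article does not say how the 4.5-petabyte figure was derived, and the function name is hypothetical.

```python
def scaled_volume_pb(base_pb, base_res_nm, target_res_nm, dims=3):
    """Estimate data volume (in petabytes) at a finer resolution.

    Assumes storage grows with (base_res / target_res) ** dims, i.e.
    isotropic sampling and the same bytes per voxel. This is an
    illustrative assumption, not a figure from the talk.
    """
    linear_factor = base_res_nm / target_res_nm
    return base_pb * linear_factor ** dims


base_pb = 4.5            # petabytes at one-micron resolution (from the article)
base_res_nm = 1000       # one micron expressed in nanometers
target_res_nm = 20       # resolution Ellisman says is still needed

estimate_pb = scaled_volume_pb(base_pb, base_res_nm, target_res_nm)
print(f"linear factor: {base_res_nm / target_res_nm:.0f}x")
print(f"estimated volume: {estimate_pb:,.0f} PB")
```

Under these assumptions the estimate comes to 562,500 petabytes for a single brain, which suggests why Ellisman framed the cellular level as still out of reach.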
He discussed his team’s work building cyberinfrastructure to provide access, through a flexible web interface, to instruments such as a one-of-its-kind, ultra-high-voltage electron microscope in Osaka, Japan, and linking that with computational and storage capabilities for data collection and analysis.
Ellisman also described his Biomedical Informatics Research Network (BIRN) project, started seven years ago, which involves a long list of important universities across the U.S., the U.K., and Japan. This network allows researchers to contribute and share data sets, including those related to Alzheimer’s disease and schizophrenia, to support more wide-ranging studies.
“This project is about half computer scientists and half biological scientists with the occasional physical scientist,” said Ellisman. The project has also helped drive development of specialized endpoints called “OptIPortals,” which build on technology developed in Smarr’s OptIPuter project. OptIPortals can display large amounts of data, for example images from multiple microscopes simultaneously along with high-definition video streams, to support distributed collaboration.
How does this system work? A database is created at each participating site. The data is linked conceptually to a shared ontology describing the relationships between sets of information, and it is situated in a common spatial framework. Users then rely on a “mediator” to navigate and query across the data. “We’ve adapted technology from the geosciences,” said Ellisman. “Like that world, we use location and ontological information across scales.”
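The arrangement Ellisman outlines (per-site databases, a shared ontology, a common spatial framework, and a mediator that queries across them) can be sketched in a few lines. This is a toy illustration only, not BIRN's actual design: the ontology terms, class names, site names, and coordinates are all hypothetical, and real site databases would not be in-memory lists.

```python
from dataclasses import dataclass

# Hypothetical shared ontology: child concept -> parent concept.
ONTOLOGY = {
    "purkinje_cell": "neuron",
    "pyramidal_cell": "neuron",
    "neuron": "cell",
    "astrocyte": "cell",
}


def is_a(concept, ancestor):
    """Return True if `concept` equals `ancestor` or descends from it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = ONTOLOGY.get(concept)  # walk up; None at the root
    return False


@dataclass
class Record:
    site: str      # which participating site holds the record
    concept: str   # term drawn from the shared ontology
    xyz: tuple     # position in the common spatial framework


class Mediator:
    """Fans a query out to every site database and merges the answers."""

    def __init__(self, site_databases):
        self.site_databases = site_databases  # {site name: list of Record}

    def query(self, concept, lo, hi):
        """Return records of `concept` (or a subtype) inside the box lo..hi."""
        hits = []
        for records in self.site_databases.values():
            for rec in records:
                in_box = all(a <= c <= b for a, c, b in zip(lo, rec.xyz, hi))
                if in_box and is_a(rec.concept, concept):
                    hits.append(rec)
        return hits


# Made-up example data for two hypothetical sites.
sites = {
    "site_a": [Record("site_a", "purkinje_cell", (1.0, 2.0, 3.0))],
    "site_b": [
        Record("site_b", "astrocyte", (1.5, 2.5, 3.5)),
        Record("site_b", "pyramidal_cell", (9.0, 9.0, 9.0)),
    ],
}
mediator = Mediator(sites)
results = mediator.query("neuron", lo=(0, 0, 0), hi=(5, 5, 5))
print([(r.site, r.concept) for r in results])
```

The point of the sketch is that a user asks one question in shared terms ("neurons in this region") and never needs to know which site holds which record or what local vocabulary it uses.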
While Ellisman works at the cellular level, he invited conference attendees to think about how to contribute, for example, structural genomics data to extend the range of scales of data available through the system.
Paul Gilna then described the CAMERA project and the context for its importance. “Where are we on the growth curve?” he asked. “The Sargasso Sea experiment, which has already yielded one billion base pairs of non-redundant sequence, demonstrates the power of environmental metagenomics.” He added that whole-genome sequencing is exploding and that the Venter Institute is sequencing biological material collected from some 150 sampling sites across the world’s oceans in its Global Ocean Survey (GOS) project. “Genomic data is growing rapidly, but metagenomic data will vastly increase the amount of data available for analysis,” he said. Hence the need for the CAMERA project.
CAMERA development is driven by user needs, focusing on data sets, applications, tools, and workflows. Data sets that will be made available include the GOS and Sargasso Sea sets, related environmental (including terrestrial) data sets from the Joint Genome Institute (JGI), the Moore Foundation-funded microbial genomes, and research community-submitted data sets (on the model of the BIRN project described by Ellisman).
Site metadata will include the location of the sampling site (lat/long, country, water depth), its physical and chemical characteristics, and the experimental parameters describing how the data was collected. Over time, the plan is to include access to satellite imagery, oceanographic databases, high-definition real-time video streams, and microscopy images such as those described in Ellisman’s work.
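As a rough illustration of what such a site record might look like, the sketch below groups the fields named above into a single structure. The field names, units, validation rules, and sample values are assumptions made for the example; the article does not describe CAMERA's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SiteMetadata:
    """Hypothetical record for one sampling site (not CAMERA's real schema)."""

    latitude: float                    # decimal degrees, -90..90
    longitude: float                   # decimal degrees, -180..180
    country: Optional[str]             # None for open-ocean sites
    water_depth_m: float               # depth of the sample, in meters
    physical_chemical: dict = field(default_factory=dict)  # e.g. temperature
    experimental: dict = field(default_factory=dict)       # how data was taken

    def __post_init__(self):
        # Basic sanity checks so bad coordinates are caught at load time.
        if not -90 <= self.latitude <= 90:
            raise ValueError(f"latitude out of range: {self.latitude}")
        if not -180 <= self.longitude <= 180:
            raise ValueError(f"longitude out of range: {self.longitude}")
        if self.water_depth_m < 0:
            raise ValueError(f"negative depth: {self.water_depth_m}")


# Placeholder values, invented for illustration only.
site = SiteMetadata(
    latitude=32.0,
    longitude=-64.0,
    country=None,
    water_depth_m=5.0,
    physical_chemical={"temperature_c": 20.0, "salinity_psu": 36.0},
    experimental={"filter_size_um": 0.8},
)
print(site)
```

Keeping the physical, chemical, and experimental details as open-ended dictionaries is one way to accommodate community-submitted data sets whose measurements differ from site to site.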
Tools and workflows will include BLAST, clustering, HMM/Profile, neighborhood analysis, multiple sequence alignments, and assembly.
Development includes building out the server room at Calit2 where the CAMERA complex resides. The current 32-node server will be expanded to some 800 CPUs in the next couple of months. Interactive access to the CAMERA complex will be available via optical networking at 10 Gbps, “much faster than the ‘shared party line’ of today’s commercial Internet,” said Gilna. The core architecture will be a flat file server and database farm accessed by a web portal. The system is expected to grow to some 1,000 CPUs, with tens of thousands of additional CPUs available through the NSF-funded TeraGrid project.
Like Ellisman, Gilna pointed to the value of OptIPortals, which typically consist of 20 dual-CPU nodes and twenty 24-inch monitors. Such a system provides a quarter of a teraflop in processing speed, 5 TBytes of storage, and 45 megapixels of display real estate, and it is based on the Scalable Adaptive Graphics Environment, work led by Jason Leigh at the Electronic Visualization Lab at the University of Illinois at Chicago. “Considerable effort has been made to shrink wrap the instruction set to make this environment available in your lab,” said Gilna. The cost is about $50K.
For conference attendees, said Gilna, the benefit is a way to view data such as microbial genomes interactively. Several members of the CAMERA Science Advisory Board, seeing the promise of such technology, are implementing these systems in their labs.
Outreach, said Gilna, is an important part of the project so that the CAMERA infrastructure stays current with the needs of the scientific community. To that end, the Scientific Advisory Board has already met, early adopters will be identified by December, workshops will target users and visualization tools that might be developed, talks and posters will be presented at scientific meetings, partnerships are being established with other metagenomic organizations such as JGI, and training and user support are being planned.
Projecting into the future, Gilna said, “We’ll be seeing real-time genomic streaming of data and feeding it back to guide the experimental process. You’ve seen the movie; soon you’ll be able to click here and see the genome.”