By Stephanie Sides
Tampa, FL, November 15, 2006 -- Yesterday Calit2 director and CAMERA PI Larry Smarr gave an invited presentation in the National LambdaRail exhibit at SC06, “Metagenomics at Light Speed,” describing how optical networking is beginning to support a new Calit2 research project in environmental metagenomics.
The OptIPuter project is linking high-resolution “OptIPortals” over dedicated optical networking channels, such as those provided by the U.S. National LambdaRail and the Global Lambda Integrated Facility, to support access to and interaction with global science data. “Exciting results of this project,” said Smarr, the project’s PI, “are reflected in various exhibits here on the show floor.” Now in its fifth year, the OptIPuter links project leads Calit2 and the University of Illinois at Chicago with many other universities in the U.S., the Netherlands, Japan, and Canada, plus several industrial partners, to advance technology in key application projects in biomedical informatics and earth sciences.
Furthermore, the OptIPuter is now serving as a foundation for infrastructure to support a large-scale project in marine microbial metagenomics, to be debuted early in 2007. Citing the “Tree of Life” derived from 16S rRNA sequences developed by Smarr’s early mentor Carl Woese at the University of Illinois at Urbana-Champaign, Smarr pointed out that most of evolutionary time was spent in the microbial world, but relatively little is yet known genomically about that part of the tree.
To begin addressing that gap, the J. Craig Venter Institute, with a grant from the DOE Office of Science, conducted the Sargasso Sea Experiment, in which they collected more than one billion base pairs of non-redundant DNA sequences from at least 1,800 genomic species, including at least 148 that were previously unknown. From this data, they were able to identify more than 1.2 million previously unknown genes. This work was published in a seminal paper led by J. Craig Venter in Science (2 April 2004, Vol. 304, pp. 66-74).
Venter is following up this work with the Global Ocean Survey (GOS), which is in the process of collecting oceanic and other (sea and fresh) water samples at more than 150 sites around the world to continue measuring the diversity of ocean microbes. This project is expected to double the number of proteins in GenBank!
So how do you make all this new data available to the scientific community for broader benefit? That’s where the CAMERA project comes in. CAMERA stands for the Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis project, funded in January this year with a seven-year, $24.5M grant from the Gordon and Betty Moore Foundation. It’s a partnership among Calit2, the Venter Institute, and UCSD’s Scripps Institution of Oceanography, the Center for Earth Observations and Applications, the Scripps Genome Center, and the San Diego Supercomputer Center. This project builds on a host of long-standing projects implementing national cyberinfrastructure for medical and ocean sciences with funding from the National Science Foundation (NSF) and the National Institutes of Health.
This project will incorporate data from the Sargasso Sea and GOS expeditions, the Joint Genome Institute sequencing project, the Moore marine microbial project, and community microbial metagenomics projects (some early candidates have already been identified).
The new computational environment, being implemented this month, includes 128 Dell nodes (with a total of 512 CPUs) for a total performance of 4.7 Tflops and 4 GBytes of memory/node. Sixty-four nodes are connected with a low-latency interconnect (InfiniBand at 10 Gbps), and all nodes are connected at 1 Gbps. This system includes an 8-node web farm. The storage system is based on 200 TBytes in 8 Sun X4500 “Thumper” nodes, each connected into the fabric at 10 Gbps. This resolves to 80 TB of RAID-5 storage, then replicated. The network infrastructure consists of a Force10 E1200 Layer 3 switch/router with 192 GigE ports and 16 10GigE ports. This system is highly expandable. This infrastructure complements CAMERA’s current development environment, which is roughly 1/10 the scale of the production environment. There is also the expectation that particularly large computations in the future will be able to link to tens of thousands of CPUs through the NSF-funded TeraGrid backplane.
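For readers who want to sanity-check these figures, the short Python sketch below derives a few per-unit numbers from the totals quoted above. It is only an illustration: the constant and function names are made up for this example, decimal units are assumed (1 Tflops = 1,000 Gflops), and the "port capacity" figure is simply the sum of the port line rates, not a statement about the switch's actual backplane throughput.

```python
# Totals as quoted in the article for the CAMERA production environment.
NODES = 128
CPUS = 512
PEAK_TFLOPS = 4.7
MEM_GB_PER_NODE = 4
RAW_STORAGE_TB = 200
STORAGE_NODES = 8
GIGE_PORTS = 192       # 1 Gbps each
TEN_GIGE_PORTS = 16    # 10 Gbps each


def cluster_summary():
    """Derive per-unit and aggregate figures from the quoted totals.

    Assumption: decimal units throughout (1 Tflops = 1000 Gflops).
    """
    return {
        "cpus_per_node": CPUS // NODES,
        "total_memory_gb": NODES * MEM_GB_PER_NODE,
        "gflops_per_cpu": PEAK_TFLOPS * 1000 / CPUS,
        "raw_tb_per_storage_node": RAW_STORAGE_TB / STORAGE_NODES,
        # Sum of port line rates only; not a measured throughput.
        "switch_port_capacity_gbps": GIGE_PORTS * 1 + TEN_GIGE_PORTS * 10,
    }


if __name__ == "__main__":
    for name, value in cluster_summary().items():
        print(f"{name}: {value:g}")
```

Under these assumptions the quoted totals work out to 4 CPUs per node, 512 GBytes of aggregate memory, and 25 TBytes of raw disk per storage node.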
Although the Calit2 data complex will be available over the shared Internet, several leading marine microbial biologists are planning much more advanced access from their laboratories. They will be deploying “OptIPortals” and connecting them over dedicated fiber from their labs to the nearest hub of the National LambdaRail, over which they will access the CAMERA data complex at Calit2, said Smarr.
An OptIPortal is a tiled display wall driven by a graphics cluster. It can be built from various flavors of hardware and operating systems and in various sizes (in terms of the number of display panels wide by tall), creating an affordable “termination device” for the OptIPuter global backplane. A typical example (shown in the figures to the right and displayed at SC06) consists of 20 dual-CPU nodes running at ¼ teraflop with 5 terabytes of storage and twenty 24-inch displays with a total of 45 megapixels of real estate, driven by the Scalable Adaptive Graphics Environment (SAGE) developed by Jason Leigh’s team at UIC’s Electronic Visualization Lab. “Consider this the next-generation PC,” said Smarr. Approximate cost is $50k.
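As a rough illustration of what those numbers imply, the Python sketch below computes a few ratios for the typical OptIPortal described above. The names are invented for this example, and it assumes the approximate $50k figure covers the whole system (nodes, displays, and storage), which the article does not break down.

```python
# Figures as quoted in the article for a typical OptIPortal.
DISPLAYS = 20
NODES = 20
TOTAL_MEGAPIXELS = 45
APPROX_COST_USD = 50_000  # assumed to cover the whole system


def optiportal_ratios():
    """Derive simple per-display and per-megapixel ratios."""
    return {
        "megapixels_per_display": TOTAL_MEGAPIXELS / DISPLAYS,
        "displays_per_node": DISPLAYS / NODES,
        "usd_per_megapixel": APPROX_COST_USD / TOTAL_MEGAPIXELS,
        "usd_per_display": APPROX_COST_USD / DISPLAYS,
    }


if __name__ == "__main__":
    for name, value in optiportal_ratios().items():
        print(f"{name}: {value:.2f}")
```

On these assumptions, each display contributes about 2.25 megapixels and each node drives one display.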
This system can be used to interactively visualize, for example, microbial genome data derived from the various projects described above. It will also be used for collaboration between researchers using high-definition video integration into SAGE, which was demonstrated on a number of OptIPortals on the SC06 show floor.