By Stephanie Sides
San Diego, CA, October 4, 2006 -- “There’s a revolution going on in cyberinfrastructure,” said Calit2 director Larry Smarr, “to handle the vast increase in the quantity and types of data arising from metagenomics. We’ll use that to build a global metagenomics community.” Smarr, as the first speaker on today's agenda of Metagenomics 2006, addressed some 130 attendees, nearly double the attendance of the day before.
Smarr referred back to the beginnings of the shared Internet in 1985, when the National Science Foundation adopted the protocols of NSFnet’s predecessor, ARPANET. NSFnet served as the backbone linking the five supercomputer centers and was then extended first to regional and then to campus networks. It’s good for e-mail and web browsing, of course. “But we’re in a period in networking comparable to the days of computer mainframes and card decks in computing, which many of you may be too young to remember, when the PC came out.
"We had to compute with everyone sharing a mainframe that had to be kept running at up to 95% capacity. Because we were all competing for the same resource, we each got a small fraction of the total and couldn't tell when our respective jobs would complete. Then the personal computer emerged, which was yours alone. With only your job on the computer, turnaround became predictable and much faster. Today we all share the same Internet, so every time you download a file, it takes an unpredictable amount of time. Even though the optical-fiber backbone is 10,000 megabits per second, an individual user will typically see tens of megabits per second. What you would like is a personal lightpath that gives you the full 10 gigabits per second, so it is predictable and allows you to interact visually with very large remote scientific datasets."
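The bandwidth gap Smarr describes is easy to quantify. The sketch below, a back-of-the-envelope calculation in Python, compares how long a large dataset takes to move over a congested shared link versus a dedicated 10-gigabit lightpath. The 1-terabyte dataset size and the 50-megabit-per-second shared rate are illustrative assumptions, not figures from the talk.

```python
# Back-of-the-envelope comparison (illustrative numbers, not from the talk):
# moving a 1-terabyte dataset over a congested shared link versus a
# dedicated 10-gigabit-per-second lightpath.

def transfer_time_seconds(size_bytes: float, rate_bits_per_sec: float) -> float:
    """Idealized transfer time: payload size in bits divided by line rate."""
    return size_bytes * 8 / rate_bits_per_sec

DATASET_BYTES = 1e12      # assumed 1 TB of metagenomic sequence data
SHARED_RATE = 50e6        # assumed ~50 Mbit/s share of a congested backbone
LIGHTPATH_RATE = 10e9     # 10 Gbit/s dedicated lambda

shared_hours = transfer_time_seconds(DATASET_BYTES, SHARED_RATE) / 3600
lightpath_minutes = transfer_time_seconds(DATASET_BYTES, LIGHTPATH_RATE) / 60

print(f"shared link: {shared_hours:.1f} hours")          # ~44.4 hours
print(f"lightpath:   {lightpath_minutes:.1f} minutes")   # ~13.3 minutes
```

The two-hundred-fold gap between the rates is why a dedicated lightpath makes interactive work with large remote datasets practical rather than an overnight batch job.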
What will help meet those needs are dedicated optical channels. Enter the National LambdaRail (NLR), which has a backbone composed initially of four 10-gigabit-per-second light paths or “lambdas” and can grow to more than 40. "Our work at Calit2 with the Venter Institute in Maryland," said Smarr, "uses one of these dedicated 10-gigabit-per-second lambdas, so we can investigate how this will change metagenomics research."
Smarr described a project he put together five years ago, called OptIPuter, involving a team of computer and applications scientists to create high-resolution portals over dedicated optical channels to enable global science data. “It’s time to rethink the Internet as 1,000 times more powerful than it is today,” said Smarr. “Think of it as a distributed virtual computer that couples your lab with other resources on the network, sort of like a global-sized personal computer.”
Then Smarr walked the audience through the adoption process of a disruptive technology like optical networking, which starts with the initial innovation and proceeds through development and spread until the technology becomes fully mature. “It typically takes decades for a new technology to prove its usefulness so that it becomes ubiquitous,” said Smarr. “The Internet after all is 35 years old. Calit2 works at the beginning of this type of ‘S’ curve to envision the future and work on particular areas important to California.”
Metagenomics is at the early part of such an S curve. According to the foundational “Tree of Life” of Carl Woese, a mentor of Smarr’s at the University of Illinois at Urbana-Champaign, almost all the diversity of life lies in the microbial domain. So there’s lots of work to be done now that the cyberinfrastructure is evolving to support its needs.
Calit2 and the J. Craig Venter Institute were jointly approached by the Gordon and Betty Moore Foundation in August 2005 to build the next-generation science server as a global home for metagenomics data and analysis. The result was the Moore grant to the Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis (called CAMERA for short). The project subsequently hired Paul Gilna as its executive director; Gilna is the former director of the LANL branch of the Joint Genome Institute and one of the co-founders of GenBank.
“Our initial focus,” said Smarr, who’s PI on the CAMERA project, “is the marine environment, but in actuality we’re thinking more broadly than that.” He referred to the Global Ocean Survey the Venter Institute is conducting to measure the genetic diversity of ocean microbes across 155 sampling sites 200 miles apart. “Sorcerer II [the ship being used] will double the number of proteins in GenBank,” said Smarr, “and the number of protein families continues to increase with each sample taken.”
The computational system being set up in CAMERA will not be based on a flat file system but instead on the notion of “web services” that separate the functions a scientist wants to perform from the computers they run on. The center of the data complex, in keeping with the idea of the network as the centerpiece, will be optical fabric (parallel 10-Gigabit Ethernet pipes), several hundred terabytes of rotating storage, and several hundred CPUs clustered together. This system should be in place in 6-8 weeks. (Additional computational power will be available from the NSF-funded TeraGrid, which will grow to 100,000 processors.)
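The web-services idea described above can be sketched in a few lines of Python. The point is the separation of concerns: the scientist names the analysis to run, and a dispatcher decides which back-end resource actually executes it. All of the names here (the registry, the `sequence_search` service, its output) are hypothetical illustrations, not CAMERA's actual API.

```python
# Minimal sketch of the "web services" notion: the caller requests an
# analysis by name and never needs to know which machine runs it.
# Every identifier below is a hypothetical illustration, not CAMERA's API.

from typing import Callable, Dict

class ServiceRegistry:
    """Maps analysis names to back-end implementations, hiding from the
    caller which cluster node or data center does the work."""

    def __init__(self) -> None:
        self._services: Dict[str, Callable[[str], str]] = {}

    def register(self, name: str, fn: Callable[[str], str]) -> None:
        self._services[name] = fn

    def call(self, name: str, payload: str) -> str:
        if name not in self._services:
            raise KeyError(f"no such service: {name}")
        return self._services[name](payload)

registry = ServiceRegistry()
# A stand-in "analysis": in a real deployment this would dispatch to a
# compute cluster; here it just echoes part of the input sequence.
registry.register("sequence_search", lambda seq: f"hits for {seq[:10]}...")

# The client names the function it wants; where it runs is hidden.
print(registry.call("sequence_search", "ATGCGTACCGTTAG"))
```

Replacing a flat file system with this kind of indirection is what lets the same query run against data and compute resources anywhere on the network, which is the sense in which the network itself becomes the centerpiece.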
Data sets that will be made available include data from the Sargasso Sea, the Global Ocean Survey, the JGI Community Sequencing Project, the Moore Marine Microbial Project, NASA Goddard satellites, and the Community Microbial Metagenomics Data project.
In addition to genomic studies, this infrastructure will also support structural genomics (proteomics) studies.
One of the keys to data analysis is interactive visualization. In this regard, Calit2 is implementing 3-D stereo and non-stereo, interactive, high-resolution tiled wall displays. “Think of this technology as a 'hot gaming PC,'” said Smarr. “It’s got good graphics, costs about $2,000 per screen, and runs on SDSC Rocks cluster software. The entire system comes with the latest version of Red Hat Linux on a CD for easy installation – and it’s free.”
Such “OptIPortals” are springing up in various places. Ginger Armbrust, University of Washington, and Ed DeLong, MIT – both members of the CAMERA Science Advisory Board – are implementing them. The devices enable high-resolution analysis with telepresence to support distributed, real-time research collaboration.