Advancing Computational Research with Scientific Workflows
San Diego, July 10, 2014 — It was a lark that brought Ilkay Altintas to San Diego. The year was 2001. She had just finished her M.S. thesis and was working at the Middle East Technical University in Ankara, Turkey, when she discovered an open position at the San Diego Supercomputer Center (SDSC). That job, related to scientific data management, launched a career in which, in a relatively short time, she has carved out a particularly useful specialty and made an impressive impact helping computational scientists in a wide variety of disciplines.
“Pi” Person of the Year
Cut to 2013, when Altintas, now with a Ph.D. and serving as Director of a Center of Excellence, received the first “Pi Person of the Year” award at SDSC. The pi stands not for principal investigator but for the mathematical constant π. In this case, it underscores that Altintas’ work spans both scientific applications (many of them, in fact) and computer science (cyberinfrastructure). She has one “pi” leg in each camp.
Furthermore, in an article in Procedia Computer Science titled “Exploring the e-Science Knowledge Base through Co-citation Analysis”, Altintas is cited as one of the top-10 “turning-point” authors. The paper’s authors used the knowledge domain visualization software CiteSpace to analyze the e-Science knowledge base pertaining to grid, desktop grid, and cloud computing, to identify landmark articles and authors irrespective of the number of times their articles have been cited.
Focus on Scientific Workflows
It’s Altintas’ work as Director of SDSC’s Scientific Workflow Automation Technologies Laboratory that has earned her particular acclaim. A scientific workflow is software composed of a series of computational and/or data-manipulation steps, chained into an application that can be run, often on high-performance computers, to produce data for subsequent analysis or comparison with other data sets. These workflows are proving to be science accelerators: they reduce, in some cases dramatically, scientists’ time to results.
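To make the idea concrete, here is a minimal sketch, in Python, of a workflow as a chain of steps in which each step consumes the previous step’s output. The function and step names are hypothetical illustrations, not part of Kepler or its API:

```python
# Illustrative sketch only: a scientific workflow as an ordered chain of
# processing steps, each consuming the previous step's output.
# All names here are hypothetical, not Kepler's actual API.

def run_workflow(data, steps):
    """Run each processing step in order, threading results through."""
    for step in steps:
        data = step(data)
    return data

# Toy "tools": clean raw readings, then summarize them.
def remove_outliers(values):
    return [v for v in values if abs(v) < 100]

def mean(values):
    return sum(values) / len(values)

result = run_workflow([3.0, 5.0, 250.0, 4.0], [remove_outliers, mean])
print(result)  # average of the readings that survive filtering: 4.0
```

A real workflow system adds far more on top of this (scheduling, distributed execution, data movement), but the core notion is the same chaining of reusable steps.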
When Altintas entered the field in the late 1990s, grid computing was coming into its own and beginning to support middleware tools. Service-oriented computing was also becoming popular. With the emergence of distributed systems, software developers needed to integrate resources and pass data among them. Against this backdrop of computational complexity, Altintas became interested in more intuitive ways to program a string of processes. She also began to notice commonalities in user requirements across what seemed like very different application areas. This, to her, suggested the notion of re-use.
Putting these needs together, Altintas envisioned the way forward: a grassroots workflow effort based on an open-source platform.
The Kepler Workflow System
Altintas and colleagues built such a system on top of Ptolemy II, a modeling tool for engineering named after the second-century mathematician-astronomer. Following this naming tradition, they named their system Kepler, after the revolutionary 17th-century scientist known for his laws of planetary motion. The name provided brand recognition and, in retrospect, anticipated the wide-ranging impact the system was to have.
The Kepler project was initiated in August 2003 with a first Beta release in 2004 and ongoing release cycles since (the latest is version 2.4), managed by SDSC. One of the keys to the system’s success is that it goes one step beyond open-source software: Anyone can become part of the Kepler community and offer core functionality or modules to be deployed on top of Kepler releases. Altintas says that “development is applications-driven, with all functionality suggested by the community using, or wanting to use, it.”
Workflows in NBCR
In addition to her other responsibilities, Altintas is co-principal investigator (with Philip Papadopoulos) of Core 4 in the National Biomedical Computation Resource (NBCR), based in the Qualcomm Institute. In this role, she focuses on developing practical cyberinfrastructure for multi-scale refinement, modeling, and analysis workflows. “Kepler helps integrate and build NBCR applications so they can execute transparently on a variety of computing resources,” notes Altintas. “The software modules can be mixed and matched depending on the scientist’s purpose and goals. I’m always listening for inputs and outputs as a mechanism to guide development of a particular workflow.”
NBCR, in fact, serves as the application hub for Kepler. “Kepler has reusable building blocks – we’ve used most of them a fair number of times,” says Altintas. “It’s easy to put them together in rapid application prototypes and, from there, scale up execution or publish them as software packages that others can use. We do all of that at NBCR.” Within NBCR, Kepler now supports everything from bioinformatics and drug design applications, to complex microscopy and imaging applications, to patient-specific cardiac modeling. Application of Kepler in such diverse biomedical environments pushed further development, resulting in bioKepler.
Like applications that push the boundaries of technology in computer science, NBCR provides the ideal scientific applications to give bioKepler a demanding workout. bioKepler provides a graphical user interface (GUI) to connect big data with domain-specific biomedical tools to analyze that data in a scalable fashion. The GUI can be used to link tools together to create an application logic that can then be run in batch mode on high-performance computing and in cloud environments.
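The idea of linking tools into an “application logic” that a batch system can then execute can be sketched as a dependency graph. This is an illustration only, with hypothetical tool names, not bioKepler’s actual interface:

```python
# Illustrative sketch (hypothetical names, not bioKepler's API): tools are
# linked into a dependency graph, then executed in an order that respects
# the links, much as a batch run would.

from graphlib import TopologicalSorter

def execute(graph, tools, inputs):
    """Run each tool once all the tools it depends on have produced output."""
    results = dict(inputs)
    for name in TopologicalSorter(graph).static_order():
        if name in results:          # raw input, nothing to run
            continue
        deps = graph[name]
        results[name] = tools[name](*(results[d] for d in deps))
    return results

# Toy bioinformatics-flavored pipeline: align reads, count variants, report.
graph = {"reads": set(),
         "align": {"reads"},
         "variants": {"align"},
         "report": {"variants"}}
tools = {"align": lambda r: r.upper(),
         "variants": lambda a: a.count("A"),
         "report": lambda n: f"{n} variant(s) found"}

out = execute(graph, tools, {"reads": "acgta"})
print(out["report"])  # "2 variant(s) found"
```

In a system like bioKepler, the graph is assembled visually in the GUI rather than in code, and the nodes wrap real domain tools running on HPC or cloud resources.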
Significantly, Kepler also helps address a hot topic in the science community: provenance, that is, the ability to accurately reproduce the scientific breadcrumb trail that produced the results. Given the occasional scandal over scientific conclusions based on analysis of false data, scientists are paying increasing attention to reproducing others’ results and verifying their integrity. Reproducibility is especially important, and challenging, for multi-scale modeling, which is NBCR’s niche. It’s a field in which a single computational experiment may require upwards of 200 steps. Kepler workflows not only support reproducibility; in addition to final results, they promote the sharing of accurate scientific methods.
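One way to picture provenance capture is to wrap each workflow step so that the chain of step names, inputs, and outputs is recorded alongside the final result. The sketch below is a toy illustration with hypothetical names, not Kepler’s actual provenance model:

```python
# Illustrative sketch (hypothetical names, not Kepler's provenance model):
# wrap each workflow step so the chain of inputs, outputs, and step names
# is recorded, leaving the "breadcrumb trail" needed to reproduce a result.

import hashlib, json

def run_with_provenance(data, steps):
    trail = []
    for step in steps:
        out = step(data)
        trail.append({
            "step": step.__name__,
            # Hash inputs/outputs so the record is compact but verifiable.
            "input_hash": hashlib.sha256(json.dumps(data).encode()).hexdigest(),
            "output_hash": hashlib.sha256(json.dumps(out).encode()).hexdigest(),
        })
        data = out
    return data, trail

def double(values):
    return [v * 2 for v in values]

result, trail = run_with_provenance([1, 2, 3], [double])
print(result)            # [2, 4, 6]
print(trail[0]["step"])  # "double"
```

With a record like this, a second researcher can re-run the same steps on the same inputs and check, hash by hash, that each intermediate result matches; a 200-step experiment is auditable in the same way, one step at a time.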
Synergy between Developers and Application Scientists
Kepler has always fueled synergy between its developers and the application scientists who use it. As scientists become trained in its use, they, in turn, bring more challenging scientific questions to the table, which, in turn, spur more development. “What if…?” is probably the most common question in Altintas’ lab.
“The scientists also advertise for us,” explains Altintas. “As their applications grow in number, we are able to test the platform more. Everyone wins. The more scientists learn about and use this technology, the more useful, robust, and comprehensive the ecosystem we’re developing becomes.”
But to get past the “black box” problem that has held back progress, scientists need to understand workflow components and how they are put together, so that they can vouch for the validity of their results. Further, they need to understand how workflows work so they can begin developing their own to address more complicated scientific questions. The impact of Kepler will scale as more scientists gain this understanding.
Altintas and her team provide training in various ways. Sometimes it’s a formal “bootcamp” for informatics and computational data science in which they focus on end-to-end processes to achieve specific results. Training is also provided through academic projects and for industry on a recharge basis. NBCR just co-sponsored a scalable bioinformatics bootcamp in late May, which will return in the fall, and a “hackathon”—an event at which scientists will gather with Kepler experts to develop their workflows—is scheduled for later this month (July 2014).
One satisfied bootcamp participant recently reported that, after learning how to use Kepler, he was able to achieve, in two days, results that previously would have taken him two years. And his experience is hardly unique.
While UCSD doesn’t offer workflow classes per se, it has just approved a new M.S. degree program in Data Science, in which the study and application of workflows will be part of some project-based courses. Altintas expects to be part of the team that teaches these classes. In addition, she and NBCR Director Rommie Amaro are exploring the possibility of using online training, such as Massive Open Online Courses (MOOCs), to enable researchers more broadly to make effective use of tools like Kepler that NBCR develops and makes publicly available.
Workflows for Data Science: A Center of Excellence
In April of this year, Altintas inaugurated a Center of Excellence at SDSC, called Workflows for Data Science, or WorDS. Its goals include providing the ability to access and query data; scaling computational analysis to high-performance computers; increasing software re-use and workflow reproducibility; saving time, energy, and money; and formalizing and standardizing workflow processes. “Our focus is on use cases, not technology,” says Altintas. In addition to the eye-catching amount of funding Altintas has attracted (currently $8M), the center has published an impressive list of peer-reviewed papers.
Here the application areas served are much broader than biomedical science and include environmental observatories, oceanography, geoinformatics, and computational chemistry. The areas of expertise represented by center staff include research on scientific workflow automation technologies, big data applications, workflows adapted for cloud systems, development and consulting services, and workforce development.
One of the most grounded projects that Altintas and her colleagues are working on is WiFire, a project funded in 2013 by the National Science Foundation. Its goal is to predict where a fire will head while it’s burning. The team is building a cyberinfrastructure that integrates cameras and other data sensors mounted on radio antennas throughout San Diego County, high-speed communications networks, high-resolution imagery, and high-performance computing. When a fire starts, data from the sensor network along with satellite and weather data will be fed into an SDSC supercomputer to generate a model of the fire’s behavior. The system will be able to compute the progress of the flames faster than real time, providing advance warning that helps firefighters decide how to deploy their resources most effectively. The system got its first test in May 2014, when 11 fires raged across northern San Diego County over a period of a few days.
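To give a flavor of what “computing the progress of the flames faster than real time” involves, here is a toy grid-based spread step, vastly simpler than WiFire’s actual models and purely illustrative: each burning cell ignites its dry neighbors, and a computer can advance many such steps in the time a real fire takes to advance one.

```python
# Toy illustration only, far simpler than WiFire's actual fire models:
# one time step of grid-based fire spread, where each burning cell
# ignites adjacent dry fuel and then burns out.

def spread(grid):
    """One time step: '*' = burning, '.' = dry fuel, ' ' = burned out."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == "*":
                new[r][c] = " "  # this cell burns out
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols and grid[rr][cc] == ".":
                        new[rr][cc] = "*"  # neighbor ignites
    return new

# A fire ignites in the center of a small fuel field.
grid = [list("..."), list(".*."), list("...")]
grid = spread(grid)
print("".join(grid[0]))  # ".*." — the fire has spread upward
```

Real models fold in wind, terrain, fuel type, and live sensor data, which is where the supercomputer and the sensor network come in.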
The WiFire team involves various UC San Diego labs, notably SDSC, Calit2's Qualcomm Institute, the Computer Science and Engineering department, the Mechanical and Aerospace Engineering department, and the High Performance Wireless Research and Education Network. The team envisions the WiFire testbed as a precursor to a national, and ultimately global, fire-fighting cyberinfrastructure.
By Stephanie Sides, for the National Biomedical Computation Resource