Workflow Exemplars


Workflows for Data-Intensive Image Processing

With collaborators from the Department of Biomedical Informatics at the Ohio State University, we are studying how to design workflows for data-intensive applications that must process large amounts of data. To do this efficiently, the data are split into smaller chunks that can be processed in parallel. It is difficult to extract from existing running code an appropriate design for the workflow and its components, one in which the parameters that control data splitting and processing are explicit in the workflow structure.

The application we are using is for microscopy image correction. The end product of the design process is shown in the figures below.

The design process with commentary is documented in this slide deck: OSU Workflow Design.

To expose parallelism in the computations explicitly in the workflow, application components may need to be rewritten to separate functionality and to expose parameters that can be used to manage the parallel computations.
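A minimal sketch of this idea, with entirely hypothetical component names: the chunk dimensions and the degree of parallelism become explicit parameters of the workflow rather than constants buried inside the application code.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(image, chunk_rows, chunk_cols):
    """Split a 2-D image (a list of pixel rows) into rectangular chunks."""
    chunks = []
    for r in range(0, len(image), chunk_rows):
        for c in range(0, len(image[0]), chunk_cols):
            chunks.append([row[c:c + chunk_cols] for row in image[r:r + chunk_rows]])
    return chunks

def correct_chunk(chunk):
    """Placeholder correction step; a real component would, e.g.,
    normalize stain intensity within the chunk."""
    return [[pixel + 1 for pixel in row] for row in chunk]

def run_workflow(image, chunk_rows=2, chunk_cols=2, workers=4):
    """Chunk sizes and worker count are explicit workflow parameters."""
    chunks = split_into_chunks(image, chunk_rows, chunk_cols)
    # Threads stand in here for the distributed workers that a real
    # data-intensive deployment would dispatch chunks to.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(correct_chunk, chunks))
```

Because `chunk_rows`, `chunk_cols`, and `workers` appear in the workflow interface, a workflow system can reason about and optimize them without inspecting the component internals.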

This work is reported in the following publications:
* "Distributed Out-of-Core Preprocessing of Very Large Microscopy Images for Efficient Querying".
Benjamin Rutt, Vijay S. Kumar, Tony Pan, Tahsin Kurc, Ümit Çatalyürek, Yujun Wang, and Joel Saltz.
Proceedings of the IEEE International Conference on Cluster Computing (Cluster'05), September 2005.
Available as a preprint.
* "An Integrated Framework for Parameter-Based Optimization of Scientific Workflows".
Vijay S. Kumar, P. Sadayappan, Gaurang Mehta, Karan Vahi, Ewa Deelman, Varun Ratnakar, Jihie Kim,
Yolanda Gil, Mary Hall, Tahsin Kurc, and Joel Saltz. Proceedings of the International Symposium on
High Performance Distributed Computing (HPDC), Munich, Germany, June 11-13, 2009.
Available as a preprint.
* "Parameterized Specification, Configuration, and Execution of Data-Intensive Scientific Workflows".
Vijay Kumar, Tahsin Kurc, Varun Ratnakar, Jihie Kim, Gaurang Mehta, Karan Vahi, Yoonju Lee Nelson, P. Sadayappan,
Ewa Deelman, Yolanda Gil, Mary Hall, and Joel Saltz. Cluster Computing, Vol. 13, 2010.
Available from the publisher.
* "High Performance Computing and Grid Computing for Integrative Biomedical Research".
Tahsin Kurc, Shannon Hastings, Vijay Kumar, Stephen Langella, Ashish Sharma, Tony Pan, Scott Oster, David Ervin,
Justin Permar, Sivaramakrishnan Narayanan, Yolanda Gil, Ewa Deelman, Mary Hall, and Joel Saltz.
To appear in the International Journal of High Performance Computing Applications, 2010. Available as a preprint.

Workflows for Discovery, Production, and Distribution

With collaborators from NASA JPL and the Software Engineering group at USC’s Computer Science Department, we studied how the characteristics of scientific workflow systems vary significantly, making it hard for scientists interested in adopting workflow technology to determine which workflow system to select. This problem is of particular interest to NASA scientists who need to process large amounts of data collected across a variety of missions. The initial research results of this collaboration are described in detail in an article that will appear in IEEE Software.

In this collaboration, we developed a taxonomy of workflow systems based on three phases of “in silico” research: discovery, production, and distribution. Discovery encompasses the rapid investigation of a scientific principle, in which hypotheses are formed, tested, and iterated on quickly. Production involves the application of a newly formed scientific principle to large data sets for further validation. Finally, distribution involves the sharing of the resulting data for vetting by the larger scientific community. Each of these three phases has distinct scientific workflow requirements. By making scientists aware of these requirements, our taxonomy helps inform their decisions about adopting a particular workflow technology.

This taxonomy describes three distinct types of workflow environments:

  1. Discovery workflows are rapidly re-parameterized, allowing the scientist to explore alternatives quickly and to iterate over the experiment until hypotheses are validated. Discovery workflow environments support this kind of dynamic experimentation: they should aid scientists in formulating abstract workflow models and should transform those abstract models into executable workflow instances.
  2. Production workflow environments are focused on repeatability. These environments should be capable of staging remote, high volume data sets, cataloging results, and logging or handling faults. Unlike discovery workflows, production workflows must incorporate substantial data management facilities. Scientists using production workflow environments care less about the means of abstract workflow representation than they do about the ability to automatically reproduce an experiment. The high-level requirements of a production workflow environment include handling the non-orchestration aspects of workflow formation such as data resource discovery and data provenance recording. Additionally, production workflow environments should aid the scientist in converting existing scientific executables into workflow stages, including providing means of accessing ancillary workflow services.
  3. Distribution workflow environments focus on the retrieval of data. Distribution workflows are used to combine and re-format data sets and deliver these data sets to remote scientists for further analysis. The requirements for distribution workflow environments include support for rapidly specifying abstract workflows and for remotely executing the resulting abstract workflows.
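As an illustration of the first requirement, not taken from any of the systems discussed here: a discovery environment maintains an abstract workflow template whose parameters can be rapidly re-bound, producing a new concrete workflow instance on each iteration. The class and parameter names below are assumptions for the sake of the sketch.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One stage of a workflow, with its (possibly unbound) parameters."""
    name: str
    parameters: dict = field(default_factory=dict)

@dataclass
class AbstractWorkflow:
    """An abstract workflow model: steps whose parameters may be unbound."""
    steps: list

    def instantiate(self, bindings):
        """Bind concrete values to parameters, yielding a workflow instance.
        Unmentioned parameters keep their template defaults."""
        bound = []
        for step in self.steps:
            params = {k: bindings.get(k, v) for k, v in step.parameters.items()}
            bound.append(Step(step.name, params))
        return AbstractWorkflow(bound)
```

Re-running an experiment with a different hypothesis then amounts to calling `instantiate` again with new bindings, which is the rapid re-parameterization the discovery phase demands.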

In this collaboration, we also illustrated the taxonomy with workflow environments currently under development at USC and NASA JPL, including the Workflow Instantiation and Generation System (WINGS), the Scientific Workflow Software Architecture (SWSA), and the Object Oriented Data Technology (OODT) data distribution framework.

This work is reported in the following journal article:

* "Scientific Software as Workflows: From Discovery to Distribution". 
David Woollard, Nenad Medvidovic, Yolanda Gil, and Chris Mattmann. IEEE Software, 
Special Issue on Developing Scientific Software, July/August 2008.

Available from the publisher.

Workflows for Life Sciences

We also investigated a variety of workflow structures in life sciences applications including:

  • population genomics
  • epigenomics
  • gene expression analysis

These workflows typically reuse popular software packages, including Plink, PennCNV, GNOSIS, Allegro, STRUCTURE, FastLink, and the Burrows-Wheeler Aligner (BWA), in addition to R and MATLAB. With the advent of next-generation sequencing techniques there is a general trend towards scalable data management steps (for example, using SAMtools) and towards parallelizing computations because of the large sizes of the datasets.
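A sketch of how such a parallelized alignment stage might be set up: the commands mirror common `bwa mem` and `samtools sort` usage, but the file names and the per-chunk split are illustrative assumptions, and the function only builds the command lines that a workflow engine would dispatch.

```python
def alignment_commands(reference, read_chunks, out_prefix="aligned"):
    """Build one 'bwa mem | samtools sort' command line per chunk of reads,
    so that the chunks can be aligned by parallel workers."""
    commands = []
    for i, chunk in enumerate(read_chunks):
        bam = f"{out_prefix}_{i}.bam"
        commands.append(f"bwa mem {reference} {chunk} | samtools sort -o {bam}")
    return commands
```

Splitting the reads into chunks up front is what lets the workflow system schedule the alignments independently; a final merge step (e.g., `samtools merge`) would combine the per-chunk BAM files.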
