Workflows for Data-Intensive Image Processing
With collaborators from the Department of Biomedical Informatics at The Ohio State University, we are studying how to design workflows for data-intensive applications that must process large amounts of data. To do this efficiently, the data are split into smaller chunks that can be processed in parallel. Starting from existing running code, however, it is hard to derive an appropriate design for the workflow and its components in which the parameters that control data splitting and processing are made explicit in the workflow structure.
Our driving application is microscopy image correction. The end product of the design process is shown in the figures below.
The design process with commentary is documented in this slide deck: OSU Workflow Design.
To expose the parallelism in the computations explicitly in the workflow, application components may need to be rewritten to separate their functionality and to expose parameters that can be used to manage the parallel computations.
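As a minimal illustration of this idea (a sketch of ours, not the actual OSU application code), the fragment below shows a workflow stage in which the chunking parameter is explicit rather than buried inside the processing code, so that a workflow system can expose and tune it. The tiling function, the placeholder per-tile correction, and all names are assumptions for illustration.

```python
# Sketch: a workflow stage with an explicit chunking parameter,
# so data splitting is visible in the workflow structure rather
# than hidden inside the application code.
from concurrent.futures import ProcessPoolExecutor
from typing import List, Tuple

Tile = Tuple[int, int, int, int]  # (x, y, width, height)

def split_into_chunks(width: int, height: int, chunk_size: int) -> List[Tile]:
    """Tile a width x height image into chunk_size x chunk_size regions,
    with smaller tiles at the right and bottom edges."""
    tiles = []
    for y in range(0, height, chunk_size):
        for x in range(0, width, chunk_size):
            tiles.append((x, y, min(chunk_size, width - x),
                          min(chunk_size, height - y)))
    return tiles

def correct_tile(tile: Tile) -> Tile:
    # Placeholder for the per-tile image correction step
    # (e.g., normalization); here it just passes the tile through.
    return tile

def run_stage(width: int, height: int, chunk_size: int, workers: int = 4):
    """Split the image, then process tiles in parallel; both chunk_size
    and workers are parameters a workflow system could manage."""
    tiles = split_into_chunks(width, height, chunk_size)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(correct_tile, tiles))
```

Because `chunk_size` and `workers` are stage parameters rather than hard-coded constants, the trade-off between per-chunk overhead and degree of parallelism becomes something the workflow, not the application code, controls.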
This work is reported in the following publications:

- "Distributed Out-of-Core Preprocessing of Very Large Microscopy Images for Efficient Querying". Benjamin Rutt, Vijay S. Kumar, Tony Pan, Tahsin Kurc, Ümit Çatalyürek, Yujun Wang, and Joel Saltz. Proceedings of the IEEE International Conference on Cluster Computing (Cluster'05), September 2005. Available as a preprint.
- "An Integrated Framework for Parameter-Based Optimization of Scientific Workflows". Vijay S. Kumar, P. Sadayappan, Gaurang Mehta, Karan Vahi, Ewa Deelman, Varun Ratnakar, Jihie Kim, Yolanda Gil, Mary Hall, Tahsin Kurc, and Joel Saltz. Proceedings of the International Symposium on High Performance Distributed Computing (HPDC), Munich, Germany, June 11-13, 2009. Available as a preprint.
- "Parameterized Specification, Configuration, and Execution of Data-Intensive Scientific Workflows". Vijay Kumar, Tahsin Kurc, Varun Ratnakar, Jihie Kim, Gaurang Mehta, Karan Vahi, Yoonju Lee Nelson, P. Sadayappan, Ewa Deelman, Yolanda Gil, Mary Hall, and Joel Saltz. Cluster Computing Journal, Vol. 13, 2010. Available from the publisher.
- "High Performance Computing and Grid Computing for Integrative Biomedical Research". Tahsin Kurc, Shannon Hastings, Vijay Kumar, Stephen Langella, Ashish Sharma, Tony Pan, Scott Oster, David Ervin, Justin Permar, Sivaramakrishnan Narayanan, Yolanda Gil, Ewa Deelman, Mary Hall, and Joel Saltz. To appear in the International Journal of High Performance Computing Applications, 2010. Available as a preprint.
Workflows for Discovery, Production, and Distribution
With collaborators from NASA JPL and the Software Engineering group in USC's Computer Science Department, we studied how the characteristics of scientific workflow systems vary significantly, making it hard for scientists interested in adopting workflow technology to determine which system to select. This problem is of particular interest to NASA scientists who need to process large amounts of data collected through a variety of missions. The initial research results of this collaboration are described in detail in an article in IEEE Software.
In this collaboration, we developed a taxonomy of workflow systems based on the phases of "in silico" research: discovery, production, and distribution. Discovery encompasses the rapid investigation of a scientific principle, in which hypotheses are formed, tested, and iterated on quickly. Production involves the application of a newly formed scientific principle to large data sets for further validation. Finally, distribution involves the sharing of the resulting data for vetting by the larger scientific community. Each of these three phases has distinct scientific workflow requirements. By making scientists aware of these requirements, our taxonomy helps better inform their decisions about adopting a particular workflow technology.
This taxonomy describes three distinct types of workflow environments:
- Discovery workflows are rapidly re-parameterized, allowing the scientist to explore alternatives quickly in order to iterate over the experiment until hypotheses are validated. Discovery workflow environments support this type of dynamic experimentation. The high-level requirements for a discovery workflow environment include aiding scientists in the formulation of abstract workflow models and transforming those abstract models into workflow instances.
- Production workflow environments are focused on repeatability. These environments should be capable of staging remote, high-volume data sets, cataloging results, and logging or handling faults. Unlike discovery workflows, production workflows must incorporate substantial data management facilities. Scientists using production workflow environments care less about the means of abstract workflow representation than about the ability to automatically reproduce an experiment. The high-level requirements of a production workflow environment include handling the non-orchestration aspects of workflow formation, such as data resource discovery and data provenance recording. Additionally, production workflow environments should aid the scientist in converting existing scientific executables into workflow stages, including providing means of accessing ancillary workflow services.
- Distribution workflow environments focus on the retrieval of data. Distribution workflows are used to combine and re-format data sets and to deliver these data sets to remote scientists for further analysis. The requirements for distribution workflow environments include support for rapidly specifying abstract workflows and for remotely executing the resulting abstract workflows.
In this collaboration, we also illustrated the taxonomy with workflow environments currently under development at USC and NASA JPL, including the Workflow Instantiation and Generation System (WINGS), the Scientific Workflow Software Architecture (SWSA), and the Object Oriented Data Technology (OODT) data distribution framework.
This work is reported in the following journal article:

- "Scientific Software as Workflows: From Discovery to Distribution". David Woollard, Nenad Medvidovic, Yolanda Gil, and Chris Mattmann. IEEE Software, Special Issue on Developing Scientific Software, July/August 2008. Available from the publisher.
Workflows for Life Sciences
We also investigated a variety of workflow structures in life sciences applications including:
- population genomics
- gene expression analysis
These workflows typically reuse popular software packages such as PLINK, PennCNV, GNOSIS, Allegro, STRUCTURE, FastLink, and the Burrows-Wheeler Aligner (BWA), in addition to R and MATLAB. With the advent of next-generation sequencing, there is a general trend toward scalable data management steps (for example, using SAMtools) and toward parallelizing computations, because of the large sizes of the datasets.
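As a hedged sketch of that parallelization trend (the names and structure here are our own illustration, not taken from any of the packages above), sequencing workflows often scatter work per chromosome and gather the results, since most analyses decompose naturally along reference regions:

```python
# Sketch: scatter/gather over chromosomes, a common way to
# parallelize sequencing analyses over a large dataset.
from concurrent.futures import ThreadPoolExecutor

CHROMOSOMES = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]

def analyze_region(chrom: str):
    # Placeholder for a per-chromosome analysis step; a real pipeline
    # might shell out here to a tool such as samtools restricted to
    # this region. The dummy result stands in for a per-region metric.
    return (chrom, 0)

def scatter_gather(regions, workers: int = 8) -> dict:
    """Scatter: one task per region; gather: merge the (region, result)
    pairs into a single dictionary."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(analyze_region, regions))
```

Because each region is independent, the same pattern scales from a multicore workstation to a cluster simply by changing how the scattered tasks are dispatched.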