Use of Data Collections in Parallel Constructs of Workflows
We began to investigate the processing of data collections in workflows. Data collections are amenable to parallel processing within a workflow. In many workflows, a data collection is created from a single dataset. For example, an image may be input to a component that splits it into a collection of tiles that can be processed in parallel to make the analysis of the image more efficient.
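The tiling example above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the image is assumed to be a plain list of rows, and the names `split_into_tiles`, `mean_intensity`, and `process_in_parallel` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_tiles(image, tile_size):
    """Split a 2D image (a list of rows) into tiles, in row-major order."""
    tiles = []
    for top in range(0, len(image), tile_size):
        for left in range(0, len(image[0]), tile_size):
            tile = [row[left:left + tile_size]
                    for row in image[top:top + tile_size]]
            tiles.append(tile)
    return tiles


def mean_intensity(tile):
    """Stand-in analysis component: average pixel value of one tile."""
    values = [value for row in tile for value in row]
    return sum(values) / len(values)


def process_in_parallel(tiles, component, max_workers=4):
    """Apply the component to every tile concurrently, keeping tile order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(component, tiles))


# A 4x4 test image whose pixel values are 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tiles = split_into_tiles(image, 2)
results = process_in_parallel(tiles, mean_intensity)
```

A thread pool is used here only to keep the sketch self-contained. A workflow system would more likely dispatch each tile to a separate job or process.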
We analyzed the processing of data collections in several scientific domains:
- First, we looked at biomedical image processing, where the management of data collections yields significant performance improvements. We found a need to express and process nested data collections. For example, a 3D image may be broken into layers, and each layer broken into chunks.
- A second domain of investigation was statistical techniques for bioinformatics. Here we found many manipulations of the ordering of elements within data collections, as well as different ways of associating results with the original data collections.
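The two needs identified above, nested collections and ordering with result association, can be illustrated as follows. This is a sketch under simplifying assumptions: collections are plain Python lists, and `map_nested`, `split_layer`, and `process_in_order` are hypothetical helpers, not part of any workflow system.

```python
def split_layer(layer, chunk_size):
    """Break one layer (a flat list of pixels) into fixed-size chunks."""
    return [layer[i:i + chunk_size] for i in range(0, len(layer), chunk_size)]


def map_nested(component, collection, depth):
    """Apply a component at the given nesting depth, preserving the nesting."""
    if depth == 0:
        return component(collection)
    return [map_nested(component, item, depth - 1) for item in collection]


def process_in_order(component, collection, order):
    """Process elements in a chosen order, but store each result at the
    position of the element it came from."""
    results = [None] * len(collection)
    for index in order:
        results[index] = component(collection[index])
    return results


# A toy "3D image": two layers, each split into chunks of two pixels.
volume = [[1, 2, 3, 4], [5, 6, 7, 8]]
nested = [split_layer(layer, 2) for layer in volume]
chunk_sums = map_nested(sum, nested, 2)  # one result per chunk, per layer

# Visit samples in sorted order while keeping results aligned to the input.
samples = ["c", "a", "b"]
order = sorted(range(len(samples)), key=lambda i: samples[i])
aligned = process_in_order(str.upper, samples, order)
```

Keeping results indexed by original position is one simple way to preserve the association between outputs and inputs after a reordering.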
Based on this analysis, we designed a formalism to express the treatment of data collections in workflows. This formalism expresses how data collections are processed by workflow components in terms of how the elements of the collection should be mapped to the component inputs.
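To give a sense of what mapping collection elements to component inputs can mean, the sketch below contrasts two common strategies: pairing elements position by position, and invoking the component on every combination. This is an illustrative assumption, not the formalism itself. The function names and the `scale` component are hypothetical.

```python
from itertools import product


def invoke_pairwise(component, *collections):
    """Call the component once per position: the i-th element of each
    collection is bound to the corresponding input."""
    if len({len(c) for c in collections}) > 1:
        raise ValueError("pairwise mapping needs collections of equal length")
    return [component(*args) for args in zip(*collections)]


def invoke_all_combinations(component, *collections):
    """Call the component once per combination of elements, one drawn
    from each collection."""
    return [component(*args) for args in product(*collections)]


def scale(value, factor):
    """Stand-in component with two inputs."""
    return value * factor


pairwise = invoke_pairwise(scale, [1, 2, 3], [10, 20, 30])
combined = invoke_all_combinations(scale, [1, 2], [10, 20])
```

With pairwise mapping the component runs once per position. With the combination mapping it runs once for each pairing of elements, so the number of invocations is the product of the collection sizes.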
This work is ongoing: we are developing a collection of exemplar workflows, each with distinct needs for the treatment of data collections. We are also investigating additional application areas.
This work is reported in the following publications:
- "Expressive Reusable Workflow Templates". Yolanda Gil, Paul Groth, Varun Ratnakar, and Christian Fritz. Proceedings of the Fifth IEEE International Conference on e-Science, Oxford, UK, December 9-11, 2009. Available as a preprint.
- "Characterization of Scientific Workflows". Shishir Bharathi, Ann Chervenak, Ewa Deelman, Gaurang Mehta, Mei-Hui Su, and Karan Vahi. 3rd Workshop on Workflows in Support of Large-Scale Science (WORKS08), Austin, TX, November 2008.