Discovery of Workflows
A major area of investigation is workflow discovery in scientific research communities. Workflow repositories have the potential to transform computational science. Today, much effort is invested in re-implementing scientific analysis methods described in publications. If scientists routinely shared workflows, those methods would be readily reusable and that effort would be saved. Sharing would also facilitate reproducibility of scientific results, a cornerstone of the scientific method. Effective techniques for workflow matching and discovery are key incentives for scientists to share analytic methods as reusable computational workflows.
As scientific workflows become more commonplace, workflow repositories are emerging with contributions from a variety of scientists. Provenance systems record the details of workflow executions so they can be retrieved later (e.g., http://twiki.ipaw.info). Because workflow executions contain many details that make them harder to reuse, scientists also share workflow templates, which describe a general kind of analysis and can be reused more easily (e.g., http://www.myexperiment.org). Workflow repositories can also contain best practices for common types of scientific analyses (e.g., http://genepattern.broadinstitute.org). A series of studies has shown that scientists wish to discover workflows given properties of workflow data inputs, intermediate data products, and data results. These data-centered queries are challenging to address because workflows typically lack this information when contributed to a repository.
In our work, we investigate mechanisms for workflow retrieval given data-centered queries and their combination with other constraints on components and workflow structure. One important challenge is that workflow catalogs typically specify too little information to answer data-centered queries. Although semantic annotations of workflows have been explored in prior work, that work assumes that any semantic information in a workflow is provided manually. In practice, users rarely add such information when they create a workflow. We believe that many semantic annotations can instead be extracted from component catalogs, which describe the individual components reused across different workflows. We have investigated how to use such component descriptions to automatically enrich user-created workflows into semantic workflows that contain the inferred properties needed to support data-centered queries. We designed workflow matching algorithms that use these semantic descriptions to answer data-centered queries from users. An important consideration is that scientific applications are developed using data and component catalogs that are independent of workflow catalogs (e.g., www.nvo.org, www.earthsystemgrid.org, cabig.cancer.gov). Therefore, our workflow matching algorithms identify the reasoning tasks that are specific to datasets and components, and submit requests to external data and component catalog services to carry them out.
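The enrichment and matching steps described above can be illustrated with a minimal sketch. It is not the algorithm from the publication: the catalog entries, type names, and the functions enrich and matches are hypothetical. It also simplifies heavily by treating data properties as plain type labels, ignoring the dataflow links between steps, and using an in-memory dictionary where the text describes requests to external catalog services.

```python
from dataclasses import dataclass, field

# Hypothetical component catalog: for each component, the semantic types of
# the data it consumes and produces. In the setting described above this
# information would come from an external component catalog service.
COMPONENT_CATALOG = {
    "Discretize": {"inputs": {"ContinuousDataset"},
                   "outputs": {"DiscreteDataset"}},
    "TrainDecisionTree": {"inputs": {"DiscreteDataset"},
                          "outputs": {"DecisionTreeModel"}},
    "Classify": {"inputs": {"DecisionTreeModel", "DiscreteDataset"},
                 "outputs": {"Classification"}},
}


@dataclass
class Workflow:
    name: str
    components: list  # component names, as contributed by the user
    # Inferred properties, empty until enrich() is called.
    inputs: set = field(default_factory=set)
    intermediates: set = field(default_factory=set)
    results: set = field(default_factory=set)


def enrich(workflow, catalog):
    """Infer data properties of a workflow from its component descriptions."""
    consumed, produced = set(), set()
    for component in workflow.components:
        description = catalog[component]
        consumed |= description["inputs"]
        produced |= description["outputs"]
    # Assumption: a type that is consumed but never produced is a workflow
    # input; produced and consumed is intermediate; produced only is a result.
    workflow.inputs = consumed - produced
    workflow.intermediates = consumed & produced
    workflow.results = produced - consumed
    return workflow


def matches(workflow, inputs=(), intermediates=(), results=()):
    """Check whether an enriched workflow satisfies a data-centered query."""
    return (set(inputs) <= workflow.inputs
            and set(intermediates) <= workflow.intermediates
            and set(results) <= workflow.results)


repository = [
    enrich(Workflow("classify-with-tree",
                    ["Discretize", "TrainDecisionTree", "Classify"]),
           COMPONENT_CATALOG),
    enrich(Workflow("discretize-only", ["Discretize"]), COMPONENT_CATALOG),
]

# Query: workflows that take continuous data and produce a classification.
hits = [w.name for w in repository
        if matches(w, inputs={"ContinuousDataset"}, results={"Classification"})]
```

Neither workflow in this sketch carries any data annotations when it is contributed. The query can be answered only because the properties are inferred from the component descriptions, which is the point of the enrichment step.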
This work is reported in the following publication:
* "Workflow Matching Using Semantic Metadata". Yolanda Gil, Jihie Kim, Gonzalo Florez, Varun Ratnakar, and Pedro A. Gonzalez Calero. Proceedings of the Fifth International Conference on Knowledge Capture (K-CAP), Redondo Beach, CA, September 1-4, 2009. Available as a preprint.