Workflows Across Organizations
To be shared across organizations, workflows need to be represented at a conceptual level that is common to all organizations and conveys what kinds of tasks need to be carried out. They need to be described at a high level that is independent of the execution environment available in any given organization.
Organizations differ in how they are set up to manage workflows. It is well understood that each organization manages its own resources, so a workflow will execute in whatever execution environment the organization controls. Existing workflow mapping and execution systems can manage this kind of resource-level heterogeneity.
Another source of heterogeneity across organizations is the installed software. Each organization has a different set of software codes, including proprietary software that the organization has licenses for, open software that its users have installed, and perhaps its own software base. Consider scenarios for workflow sharing in a science context. The methods may be very similar across institutions, because they refer to scientific methods that are discussed and published to communicate the research. The software used to carry them out, however, varies with the preferences of each research group or lab. For example, some labs strongly prefer MATLAB, because it is a stable commercial product with facilities to manage and visualize data, while other labs strongly prefer R, because it is open source and more extensible. Yet both frameworks implement many of the same basic algorithms for data analysis.
An important point to consider is that different organizations may also use different workflow management software. This makes it imperative that standard languages are used to facilitate workflow sharing across organizations. However, a standard workflow language only provides a common syntax for interchange.
Workflows of web services present an interesting case. Web services allow workflows to be expressed so that they invoke a service without any reference to the details of how the service is implemented. However, the invocation is specific to the particular service API that implements a software component. There is no domain-level description of workflows that is independent of both implementation and execution.
In summary, given a shared conceptual, domain-level description of a workflow, we can expect that each organization will have its own collection of software codes and its own collection of execution resources.
Our goal is to develop a framework where workflows can be described in a way that makes them shareable and executable across heterogeneous organization infrastructure. We define three levels of description of workflows:
- Domain-level: The workflow steps at this level are concerned with methods, algorithms, and other domain-relevant aspects of the workflow. For example, "CorrelationScoring" is a type of method and is appropriate to this level. None of the workflow steps at this level are executable, because they are conceptual steps rather than actual codes. The workflow system has these steps represented as a domain component catalog. We represent domain-level steps as a collection of classes organized in a domain component ontology as a hierarchy or lattice. This catalog has no instances.
- Software-level: At this level, the workflow steps refer to specific software components that can be executed. For each domain-level workflow step, a software-level step is chosen because it implements the method or function for that step. The workflow system has these steps represented as a software component catalog. This catalog is separate from the domain component catalog, but its components map to it. We represent the software component catalog as a set of classes and instances of the domain component ontology. We use classes to create useful abstractions when a component has several related implementations, for example the same algorithm implemented in several languages. Therefore, this catalog has both classes and instances.
- Execution-level: At this level, a workflow execution paradigm is chosen. One execution paradigm is a localhost: if this is chosen, all the workflow steps are configured to run in the architecture and setup of the localhost. Alternatively, a distributed workflow mapping and execution engine could be used, which would select resources in a grid, cluster, or cloud for the software components selected. The workflow system has an execution component catalog, which represents all the execution requirements of each software component, including architecture, OS, and runtime libraries.
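The three catalogs described above can be pictured as plain data structures. The following is only a minimal sketch under stated assumptions: apart from "CorrelationScoring", every name in it (the parent class "Scoring", the component names, the requirement values, and the helper functions) is hypothetical and chosen for illustration. It also simplifies the software catalog to instances only, whereas the catalog described above holds both classes and instances.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class DomainComponent:
    """A conceptual, non-executable step type (a class in the domain ontology)."""
    name: str
    parent: Optional[str] = None  # more general domain class, if any


@dataclass(frozen=True)
class SoftwareComponent:
    """An executable code that implements one domain-level component."""
    name: str
    implements: str  # name of a DomainComponent
    language: str


@dataclass(frozen=True)
class ExecutionRequirements:
    """What a software component needs in order to run."""
    architecture: str
    os: str
    runtime_libraries: Tuple[str, ...] = ()


# Domain component catalog: classes only, arranged in a hierarchy.
# "Scoring" is a hypothetical parent class added for illustration.
DOMAIN_CATALOG: Dict[str, DomainComponent] = {
    "Scoring": DomainComponent("Scoring"),
    "CorrelationScoring": DomainComponent("CorrelationScoring", parent="Scoring"),
}

# Software component catalog: hypothetical implementations of the same method.
SOFTWARE_CATALOG: List[SoftwareComponent] = [
    SoftwareComponent("correlation_scoring_matlab", "CorrelationScoring", "MATLAB"),
    SoftwareComponent("correlation_scoring_r", "CorrelationScoring", "R"),
]

# Execution component catalog: hypothetical requirements per software component.
EXECUTION_CATALOG: Dict[str, ExecutionRequirements] = {
    "correlation_scoring_matlab": ExecutionRequirements("x86_64", "Linux", ("matlab-runtime",)),
    "correlation_scoring_r": ExecutionRequirements("x86_64", "Linux", ("R",)),
}


def is_a(domain_name: str, ancestor: str) -> bool:
    """Return True if domain_name equals ancestor or is one of its descendants."""
    current: Optional[str] = domain_name
    while current is not None:
        if current == ancestor:
            return True
        current = DOMAIN_CATALOG[current].parent
    return False


def implementations_of(domain_name: str) -> List[SoftwareComponent]:
    """List software components implementing a domain class or any subclass of it."""
    return [c for c in SOFTWARE_CATALOG if is_a(c.implements, domain_name)]
```

Calling `implementations_of("CorrelationScoring")` in this sketch returns both the MATLAB and the R component, which mirrors the idea that one domain-level step can be realized by different software in different organizations.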
Today, workflow systems describe workflows at the execution level or at the software level. There are no workflow descriptions being exchanged that describe workflows in a way that is independent of the organization infrastructure.
A workflow described at the domain level will be shareable across organizations. These workflows can then be mapped to the software base of an organization, and then to its resource base.
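The two-stage mapping just described (domain level to software base, then software base to resource base) could be sketched as follows. This is an illustrative sketch, not the framework's actual implementation: the two organizations, the step name "TermWeighting", the file names, the resource names, and all function names are hypothetical, and execution requirements are reduced to a single required runtime per component.

```python
from typing import Dict, List, Set

# A shared domain-level workflow: an ordered list of conceptual steps.
# "TermWeighting" is a hypothetical step name used for illustration.
DOMAIN_WORKFLOW: List[str] = ["TermWeighting", "CorrelationScoring"]

# Hypothetical organization A: MATLAB codes, runs everything on localhost.
ORG_A = {
    "software": {"TermWeighting": "term_weighting.m",
                 "CorrelationScoring": "correlation_scoring.m"},
    "requirements": {"term_weighting.m": "matlab",
                     "correlation_scoring.m": "matlab"},
    "resources": {"localhost": {"matlab"}},
}

# Hypothetical organization B: R codes, runs on a small cluster.
ORG_B = {
    "software": {"TermWeighting": "term_weighting.R",
                 "CorrelationScoring": "correlation_scoring.R"},
    "requirements": {"term_weighting.R": "R",
                     "correlation_scoring.R": "R"},
    "resources": {"cluster-node-1": {"python"},
                  "cluster-node-2": {"R", "python"}},
}


def map_to_software(workflow: List[str], software_base: Dict[str, str]) -> List[str]:
    """Replace each domain-level step with the organization's software component."""
    missing = [step for step in workflow if step not in software_base]
    if missing:
        raise LookupError(f"no software component for domain steps: {missing}")
    return [software_base[step] for step in workflow]


def map_to_resources(components: List[str],
                     requirements: Dict[str, str],
                     resources: Dict[str, Set[str]]) -> Dict[str, str]:
    """Assign each software component to a resource that offers its runtime."""
    plan: Dict[str, str] = {}
    for component in components:
        needed = requirements[component]
        # Sort resource names so the choice is deterministic.
        candidates = sorted(r for r, runtimes in resources.items() if needed in runtimes)
        if not candidates:
            raise LookupError(f"no resource provides {needed!r} for {component}")
        plan[component] = candidates[0]
    return plan


def map_domain_workflow(workflow: List[str], org: dict) -> Dict[str, str]:
    """Map a domain-level workflow to software, then to execution resources."""
    components = map_to_software(workflow, org["software"])
    return map_to_resources(components, org["requirements"], org["resources"])
```

In this sketch the same `DOMAIN_WORKFLOW` yields MATLAB codes on `localhost` for organization A and R codes on `cluster-node-2` for organization B, without any change to the shared description.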
We also investigated the publication of workflows as web objects, using Linked Data principles and the Open Provenance Model as a standard representation.
We have implemented the above framework using text analytics workflows with a diversity of software and execution environments.
This work is reported in the following publications:
- "A Framework for Efficient Text Analytics through Automatic Configuration and Customization of Scientific Workflows." Matheus Hauder, Yolanda Gil, and Yan Liu. Proceedings of the Seventh IEEE International Conference on e-Science, Stockholm, Sweden, December 5-8, 2011. Available as a preprint.
- "A New Approach for Publishing Workflows: Abstractions, Standards, and Linked Data." Daniel Garijo and Yolanda Gil. Proceedings of the Sixth Workshop on Workflows in Support of Large-Scale Science, held in conjunction with SC-11, Seattle, WA, November 12-18, 2011. Available as a preprint.