Capabilities of Workflow Systems
To disseminate results and foster adoption of workflow technologies, and to highlight the advantages of using semantic representations in workflow design, we have created an overview that describes the need for workflow systems in current cyberinfrastructure environments, the benefits that existing workflow systems have already demonstrated, and the possible additional benefits if workflow systems are adopted as common cyberinfrastructure components and are extended with semantic representations.
Workflow systems today can assist scientists by automating non-experiment-critical tasks, systematically exploring the hypothesis space, managing parallelism and execution in distributed shared resources, and enabling low-cost reproducibility. If more broadly adopted, workflow systems will have an empowering, leveling effect in terms of the scientific processes supported. Today, semantic representations of scientific datasets are becoming more commonly used in cyberinfrastructure architectures to enable integration and reasoning over data. Similarly, knowledge-rich representations of workflows capture scientific principles and constraints that will enable a variety of artificial intelligence techniques to be brought to bear for validation, automation, hypothesis generation, and guarantees of data quality and pedigree. Knowledge-rich workflow systems open the doors to significant new capabilities for automated discovery, ever more integrative research that broadens the scope of scientific endeavors, education in science at all levels, and novel paradigms for interaction of scientists with cyberinfrastructure to fully exploit its capabilities.
This work is reported in the following publications:
* "From Data to Knowledge to Discoveries: Scientific Workflows and Artificial Intelligence". Yolanda Gil. Scientific Programming, Volume 17, Number 3, 2009. Available as a preprint and from the publisher.
* "Workflows and e-Science: An Overview of Workflow System Features and Capabilities". Ewa Deelman, Dennis Gannon, Matthew Shields, Ian Taylor. Future Generation Computer Systems, Volume 25, 2009. Available from the publisher.
* "Assisting Scientists with Complex Data Analysis Tasks through Semantic Workflows". Yolanda Gil, Varun Ratnakar, and Christian Fritz. Proceedings of the AAAI Fall Symposium on Proactive Assistant Agents, Arlington, VA, November 2010. Available as a preprint.
* "A Semantic Framework for Automatic Generation of Computational Workflows Using Distributed Data and Component Catalogs". Yolanda Gil, Pedro Antonio Gonzalez-Calero, Jihie Kim, Joshua Moody, and Varun Ratnakar. To appear in the Journal of Experimental and Theoretical Artificial Intelligence, 2011. Available as a preprint.
* "Wings: Intelligent Workflow-Based Design of Computational Experiments". Yolanda Gil, Varun Ratnakar, Jihie Kim, Pedro Antonio Gonzalez-Calero, Paul Groth, Joshua Moody, and Ewa Deelman. IEEE Intelligent Systems, 26(1), 2011. Available as a preprint.
Capabilities of Current Workflow Systems
Workflow systems should accept workflows and requests from users at the highest layers of abstraction, and then automatically elaborate each workflow into the lower layers of abstraction and their corresponding details. The higher the abstraction layer, the closer the workflow representation is to how a scientist may view the process or the request that triggers the process.
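The elaboration from an abstract request to concrete steps can be sketched as follows. This is only an illustration under stated assumptions: the step names, the `REFINEMENTS` and `EXECUTABLES` catalogs, and the `elaborate` function are hypothetical and are not taken from any particular workflow system.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical catalog: each abstract step maps to the more concrete
# steps that refine it. All names here are illustrative assumptions.
REFINEMENTS: Dict[str, List[str]] = {
    "analyze_samples": ["normalize_data", "cluster_data"],
    "cluster_data": ["run_kmeans"],
}

# Hypothetical execution details attached to fully concrete steps.
EXECUTABLES: Dict[str, str] = {
    "normalize_data": "normalize.py",
    "run_kmeans": "kmeans.py",
}


@dataclass(frozen=True)
class ConcreteStep:
    name: str
    executable: str


def elaborate(step: str) -> List[ConcreteStep]:
    """Recursively expand an abstract step into executable steps."""
    if step in EXECUTABLES:
        # The step is already concrete: attach its execution details.
        return [ConcreteStep(step, EXECUTABLES[step])]
    if step not in REFINEMENTS:
        raise ValueError(f"no known refinement for step {step!r}")
    concrete: List[ConcreteStep] = []
    for sub_step in REFINEMENTS[step]:
        concrete.extend(elaborate(sub_step))
    return concrete


# The user states only the high-level request; the lower layers are derived.
plan = elaborate("analyze_samples")
for concrete_step in plan:
    print(concrete_step.name, "->", concrete_step.executable)
```

The user-facing request stays at the level a scientist would recognize, while the details needed for execution are filled in by the system.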
We can categorize the capabilities being developed for workflow systems in two separate layers: the knowledge layer and the symbol layer. This distinction was proposed by Allen Newell to describe intelligent systems [Newell 1982]. The knowledge layer is concerned with any characterization of that system in terms of its response to requests or goals and what knowledge it uses to solve them. In contrast, a symbol layer is concerned with the implementation of the knowledge and the reasoning mechanisms that are used to exploit it. For example, a symbol-level description of an autonomous vehicle would be in terms of whether it uses a genetic algorithm, a neural network, or a rule set. A knowledge-level description of that same system would be in terms of its ability to pursue standing goals of going to a destination, to incorporate opportunistic goals when a lane opens, and to defend itself from other drivers through fast reactive behaviors.
- The knowledge layer provides the higher level of abstraction, and is centered around what behaviors the system can exhibit, and the knowledge required to accomplish those behaviors. Knowledge includes constraints that must be satisfied by a workflow in order for it to be valid, strategies to complete or specialize a high-level workflow, effects-centered knowledge to accomplish a given experimental goal, and descriptions of data and their characteristics. Techniques include constraint reasoning, hierarchical decomposition and abstraction reasoning, automated search, heuristics that focus exploration of possibilities, and ontology-based reasoning about classes of data and computations. In considering this knowledge layer, we leave behind the realm of parallel programming and distributed systems and enter the realm of artificial intelligence as an enabler of significant new capabilities in workflow systems. Artificial intelligence techniques can play an important role in representing complex scientific knowledge, automating processes involved in scientific discovery, and helping scientists manage the complexity of the hypothesis space.
- The symbol layer provides the mechanisms that support the behaviors exhibited at the knowledge layer.
The capabilities of workflow systems to assign resources and execute workflows are concerned with the architecture at the symbol layer. Scalability and secure sharing are enabled by the symbol-level architecture through infrastructure services and resources.
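The separation between the two layers can be illustrated with a small sketch. It assumes a deliberately simple form of constraint (each component accepts one data type and produces one), and every component name, type name, and function below is hypothetical; real systems use much richer constraint and ontology reasoning.

```python
from typing import Any, Callable, Dict, List, Tuple

# Knowledge layer: what each component requires and produces.
# Component and data type names are illustrative assumptions.
COMPONENT_SIGNATURES: Dict[str, Tuple[str, str]] = {
    "parse_csv": ("RawFile", "Table"),
    "remove_outliers": ("Table", "Table"),
    "fit_model": ("Table", "Model"),
}


def validate(workflow: List[str], input_type: str) -> str:
    """Check that each step accepts the type produced by the previous one.

    Returns the type of the final output, or raises ValueError.
    """
    current = input_type
    for step in workflow:
        required, produced = COMPONENT_SIGNATURES[step]
        if required != current:
            raise ValueError(f"{step} expects {required} but receives {current}")
        current = produced
    return current


# Symbol layer: the mechanisms that actually carry out each step.
# These lambdas are toy stand-ins for real executables.
IMPLEMENTATIONS: Dict[str, Callable[[Any], Any]] = {
    "parse_csv": lambda text: [float(x) for x in text.split(",")],
    "remove_outliers": lambda rows: [x for x in rows if abs(x) < 100],
    "fit_model": lambda rows: sum(rows) / len(rows),
}


def execute(workflow: List[str], data: Any) -> Any:
    """Run each step in order, passing each output to the next step."""
    for step in workflow:
        data = IMPLEMENTATIONS[step](data)
    return data


workflow = ["parse_csv", "remove_outliers", "fit_model"]
output_type = validate(workflow, "RawFile")  # knowledge-level check
result = execute(workflow, "1,2,3,1000")     # symbol-level execution
print(output_type, result)
```

In this sketch, `validate` can reject an ill-formed workflow before any resources are used, while `execute` knows nothing about why the workflow is valid.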
The symbol level is concerned with carrying out the tasks specified in a given workflow. In contrast, the knowledge level of a workflow system would be concerned with the kinds of tasks that it is able to accomplish for a scientist. This suggests a level of workflow descriptions and capabilities that affect what scientific tasks the workflow system can accomplish. This level would be concerned with what scientific tasks it can undertake, what workflows are selected for a task, what workflows are available in the system, and what their coverage is with respect to a set of tasks. The more knowledge, the more kinds of tasks the system can undertake. More knowledge about how to use and integrate workflows will result in improved behavior of the system in terms of solving more tasks and being capable of producing new kinds of results.
The diagram below illustrates this distinction, showing the correspondence to these two layers of capabilities being developed by current research in workflow systems.
New Capabilities Enabled by Semantic Representations
Cyberinfrastructure had its roots in the High Performance Computing community and large-scale scientific computing, where large data repositories and high-end computing facilities needed to reside at specific locations while being effectively accessible by remote users. Cyberinfrastructure broadly construed includes not only data and computing facilities but also instruments, tools, and often the people involved in forming and using all this combined infrastructure. A variety of middleware software enables access and exploitation of these facilities, including remote access services, interface portals, and data and tool repositories.
Workflow systems have already demonstrated the benefits of automatic management of computations. Workflows should become first-class citizens in science and cyberinfrastructure. They provide explicit representations of computational analyses and provenance information for new data. Workflow systems today assist scientists by automating non-experiment-critical tasks, systematically exploring the hypothesis space, managing parallelism and execution in distributed shared resources, and enabling low-cost reproducibility.
Using semantic representations of workflows will have an empowering, leveling effect in terms of the scientific processes supported. Today, semantic representations of scientific datasets are becoming more commonly used in cyberinfrastructure architectures to enable integration and reasoning over data. Similarly, knowledge-rich representations of workflows capture scientific principles and constraints that will enable a variety of artificial intelligence techniques to be brought to bear for validation, automation, hypothesis generation, and guarantees of data quality and pedigree. Knowledge-rich workflow systems open the doors to significant new capabilities for automated discovery, ever more integrative research that broadens the scope of scientific endeavors, education in science at all levels, and novel paradigms for interaction of scientists with cyberinfrastructure to fully exploit its capabilities.
Current cyberinfrastructure capabilities are shown in blue in the figure below. New capabilities enabled by workflow systems are shown in yellow.