Annotating and Mining Bioinformatics Workflows

As the demand grows for more transparent and reproducible scientific research [1], it becomes increasingly urgent to adopt more formal strategies for recording scientific methodology. The most common approach to such explicit process modelling takes the form of a scientific workflow. These digital artifacts formally describe the series of steps by which a scientific experiment was, or will be, conducted. Workflows may exist at a variety of levels of abstraction, ranging from general process overviews to detailed specifications of the tools used, the data-flow connections between them, and their associated execution parameters.

The primary workflow repository in the Life Sciences is myExperiment [2], which archives workflows from most design and enactment environments. To facilitate discovery, workflows submitted to myExperiment can be annotated with both a block of descriptive text and free-text keywords or "tags"; however, there is little-to-no control over the quality or quantity of these annotations. Task-appropriate workflow discovery, then, relies largely on matching keywords against these free-form, sometimes very limited, textual sources. Similarly, detailed comprehension of the functionality and suitability of a discovered workflow depends largely on human interpretation of these textual annotations.

One approach to improving the status quo would be to semantically annotate both workflows and their sub-components. Semantic annotations have numerous benefits over keyword and free-text annotations: they support query expansion and filtering, improve search precision, and make workflow structures computationally tractable for formal verification and validation. At present, however, semantic annotations are mainly generated manually, either at the time of service/workflow authoring or as part of a legacy migration and curation process. It would therefore be highly desirable to "bootstrap" the semantic annotation of legacy workflows and workflow sub-components through some form of automated semantic annotation.
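To make the contrast between keyword matching and query expansion concrete, the following is a minimal sketch rather than a description of myExperiment or of any real ontology. The term hierarchy, the workflow names, and the functions `keyword_search` and `semantic_search` are all hypothetical and chosen only for illustration. It assumes each workflow carries a set of annotation terms and that terms are arranged in a simple child-to-parent hierarchy.

```python
from typing import Dict, List, Set

# Hypothetical toy term hierarchy (child -> parent); not taken from a real ontology.
PARENT: Dict[str, str] = {
    "multiple sequence alignment": "sequence alignment",
    "pairwise sequence alignment": "sequence alignment",
    "sequence alignment": "sequence analysis",
    "phylogenetic tree construction": "phylogenetic analysis",
}

# Hypothetical workflows, each annotated with a set of terms.
WORKFLOWS: Dict[str, Set[str]] = {
    "wf_clustal": {"multiple sequence alignment"},
    "wf_blast_pair": {"pairwise sequence alignment"},
    "wf_tree": {"phylogenetic tree construction"},
}


def descendants(term: str) -> Set[str]:
    """Return the term itself plus every term below it in the hierarchy."""
    found = {term}
    changed = True
    while changed:  # repeat until no new child terms are discovered
        changed = False
        for child, parent in PARENT.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


def keyword_search(query: str) -> List[str]:
    """Exact tag matching: a workflow is found only if it carries the query term."""
    return sorted(name for name, tags in WORKFLOWS.items() if query in tags)


def semantic_search(query: str) -> List[str]:
    """Query expansion: also match workflows annotated with more specific terms."""
    expanded = descendants(query)
    return sorted(name for name, tags in WORKFLOWS.items() if tags & expanded)


if __name__ == "__main__":
    print(keyword_search("sequence alignment"))
    print(semantic_search("sequence alignment"))
```

Under these assumptions, an exact keyword search for the general term finds no workflow, because none is tagged with that literal string, whereas the expanded query also retrieves the workflows annotated with more specific alignment terms.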

We propose an approach to semantically annotating legacy workflows in the myExperiment repository, as well as their component services, with biomedical ontology terms, in an attempt to describe these artifacts in a structured way.

1. Micheel, C.M., Nass, S.J., Omenn, G.S., eds.: Evolution of Translational Omics: Lessons Learned and the Path Forward. The Institute of Medicine of the National Academies (2012)
2. Goble, C.A., Bhagat, J., Aleksejevs, S., et al.: myExperiment: a repository and social network for the sharing of bioinformatics workflows. Nucleic Acids Research 38(suppl 2), W677–W682 (2010)