Citation: “Assessing Data Workflows for Common Data ‘Moves’ Across Disciplines.” Alan Liu, 6 May 2017. doi: 10.21972/G21593.

This is a slightly revised version of my position paper for the “Always Already Computational: Collections as Data” Forum, UC Santa Barbara, March 1-3, 2017. (The original version is included among a collection of such position statements by participants in the conference.) A further revised version was later published as “Data Moves: Libraries and Data Science Workflows,” in Libraries and Archives in the Digital Age, ed. Susan L. Mizruchi (Cham: Palgrave Macmillan, 2020), 211–19, https://doi.org/10.1007/978-3-030-33373-7_15.

6 May 2017

In considering how library collections can serve as data for ingest, transformation, analysis, reproduction, presentation, and circulation, it may be useful to compare examples of data workflows across disciplines. Such a comparison could identify common data “moves” as well as points in the data trajectory that are, for a variety of reasons, brittle and therefore especially in need of library support.


Fig. 1 – Example of data workflow visualized in the Wings workflow system (from the Wings tutorial)

We might take a page from current research on scientific workflows and on data provenance in such workflows. Scientific workflow management is now a whole ecosystem that includes integrated systems and tools for creating, visualizing, manipulating, and sharing workflows (e.g., Wings, Apache Taverna, and Kepler). At the front end, such systems typically model workflows as directed acyclic graphs whose nodes represent entities (including data sets and results), activities, processes, algorithms, etc. at many levels of granularity, and whose edges represent causal or logical dependencies (e.g., source, output, derivation, generation, or transformation) (see fig. 1). Data provenance (or “data lineage,” as it has also been called in relation to workflows) complements that ecosystem through standards, frameworks, and tools, including the Open Provenance Model (OPM), the W3C’s PROV model, and ProvONE. Linked-data provenance models have also been proposed for understanding data-creation and -access histories of relations between “actors, executions, and artifacts.” In the digital humanities, the in-progress “Manifest” workflow management system combines workflow management and provenance systems. . . .
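To make the graph model concrete, the following is a minimal sketch in Python of a workflow represented as a directed acyclic graph, with a simple lineage query over it. It is an illustration only, not how Wings, Taverna, Kepler, or any provenance standard actually stores workflows. The node names (`raw_collection`, `topic_model`, and so on) and the helper functions `execution_order` and `lineage` are hypothetical, invented for this example; the only library used is `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical workflow (all node names are assumptions for illustration).
# Each key is a node (a data set, process, or result); each value is the
# set of nodes it directly depends on, i.e. its incoming edges.
WORKFLOW = {
    "raw_collection": set(),
    "cleaned_text": {"raw_collection"},
    "stopword_list": set(),
    "topic_model": {"cleaned_text", "stopword_list"},
    "visualization": {"topic_model"},
}


def execution_order(workflow):
    """Return one valid order in which the nodes could be run.

    Dependencies always come before the nodes that use them.
    Raises graphlib.CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(workflow).static_order())


def lineage(workflow, node):
    """Return every upstream node that `node` was derived from."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in workflow[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


order = execution_order(WORKFLOW)
upstream = lineage(WORKFLOW, "visualization")
```

In this sketch, `order` lists the steps so that every input precedes whatever consumes it, and `upstream` answers a basic provenance question: which sources and intermediate steps does the final visualization depend on? Real provenance models record much more (agents, timestamps, kinds of derivation), but the underlying traversal of dependency edges is of this general shape.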