Investigations of Data Representation

Overview

Preservation can be thought of as communication with the future. The records we preserve today need to be accessible and displayable by future technology. Beyond maintaining the accessibility of the raw bits of the digital data, preservation requires maintaining an ability to interpret the data as meaningful structures, relationships, and visual representations.

We are contributing to the development of a preservation system that would dramatically lower the per-file-format effort required for preservation. In particular, we are contributing to the development of a format description language (the Data Format Description Language, DFDL) and a format-independent parser (Daffodil) to support interpretation of arbitrary binary or ASCII formatted files in terms of well-defined logical models.

Intellectual Merit
The explicit, declarative, descriptive model we are developing through this project significantly reduces the amount of machine- and operating-system-dependent software that must be maintained to preserve access to file content, and it minimizes the effort needed to support new formats.

Broader Impacts
While preserving access to file content is a primary motivation for the development of DFDL and Daffodil, both are useful throughout the curation and preservation process and, more broadly, in e-Science.

The Technology

The Data Format Description Language (DFDL) Standard
Our team has participated in the development of the Data Format Description Language (DFDL), a new standard specification from the Open Grid Forum, released in January 2011 [7,8].

DFDL is a language for describing existing data formats, both binary and text, in a manner that makes the data accessible through generic mechanisms. The DFDL specification is based on XML Schema (http://www.w3.org/XML/Schema.html), which is used to define the structure and semantics of XML documents and to annotate schemas for the benefit of human readers and applications. A DFDL parser takes a sequence of bytes as input and produces an XML Information Model as output. For more information, see [7,8].
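
The idea of driving a generic parser from a declarative description, with bytes as input and an XML structure as output, can be illustrated with a minimal Python sketch. This is not DFDL syntax and not the Daffodil implementation: the RECORD_FORMAT dictionary, its field names, and the parse_record function are hypothetical and exist only for illustration.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical declarative description of a fixed-layout binary record.
# NOT DFDL syntax; it only mimics the idea of describing a layout as data.
RECORD_FORMAT = {
    "name": "sample",
    "byte_order": ">",  # big-endian, no padding
    "fields": [
        ("id", "H"),           # unsigned 16-bit integer
        ("temperature", "f"),  # 32-bit IEEE float
        ("count", "I"),        # unsigned 32-bit integer
    ],
}


def parse_record(data, fmt):
    """Interpret raw bytes using a format description; return an XML element."""
    layout = fmt["byte_order"] + "".join(code for _, code in fmt["fields"])
    size = struct.calcsize(layout)
    values = struct.unpack(layout, data[:size])
    root = ET.Element(fmt["name"])
    # One child element per described field, in the described order.
    for (field_name, _), value in zip(fmt["fields"], values):
        ET.SubElement(root, field_name).text = str(value)
    return root


# Build ten bytes of sample input, then parse them generically.
raw = struct.pack(">HfI", 7, 21.5, 3)
xml_text = ET.tostring(parse_record(raw, RECORD_FORMAT), encoding="unicode")
print(xml_text)
```

Supporting a different layout in this sketch means changing only the description, not the parsing code, which is the property that lowers the per-format effort.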

Parser Development
In previous work, Talbott and others at Pacific Northwest Labs developed the Defuddle parser, which implemented an early version of the DFDL specification [6]. Subsequently, this project updated and extended the Defuddle parser [1-3].

At the time of the release of Version 1 of the DFDL Specification, we reviewed the Defuddle parser and determined that it needed to be completely revised [4].

The Daffodil parser is a completely new implementation, based on Version 1 of the DFDL, as well as lessons learned from Defuddle [5]. The Daffodil parser will be available in August 2011.

Semantic Extensions
While the XML Schema language is well suited for describing the layout of data (the "syntax"), interoperability and robust archiving require semantic markup as well. This project will extend the DFDL model to support mapping to semantic web languages: the Resource Description Framework (RDF) and the Web Ontology Language (OWL). We are exploring a two-step mechanism based on the Gleaning Resource Descriptions from Dialects of Languages (GRDDL) specification (http://www.w3.org/TR/grddl/), which is used to associate XML-to-RDF mapping instructions, written, for example, in XSLT, with the DFDL description file [2,3].
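
The mapping step can be sketched as follows, under loudly stated assumptions: a Python dictionary stands in for the mapping instructions (which in the GRDDL approach would be written, for example, in XSLT), and the example.org URIs, the element names, and the xml_to_ntriples function are hypothetical. The sketch takes XML of the kind a parser might produce and emits RDF statements in N-Triples form.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary and subject URIs, used only for illustration.
VOCAB = "http://example.org/vocab#"
BASE = "http://example.org/sample/"

# Step 1: mapping instructions associated with the format description.
# A plain dict stands in for an XSLT transformation here.
ELEMENT_TO_PREDICATE = {
    "temperature": VOCAB + "temperature",
    "count": VOCAB + "count",
}


def xml_to_ntriples(xml_text):
    """Step 2: apply the mapping to parsed XML and return N-Triples lines."""
    root = ET.fromstring(xml_text)
    # The <id> element is assumed to identify the record (the RDF subject).
    subject = "<" + BASE + root.findtext("id") + ">"
    triples = []
    for child in root:
        predicate = ELEMENT_TO_PREDICATE.get(child.tag)
        if predicate is None:
            continue  # elements without a mapping are skipped
        # Values are emitted as plain literals; no escaping is attempted.
        triples.append('%s <%s> "%s" .' % (subject, predicate, child.text))
    return triples


parsed = "<sample><id>7</id><temperature>21.5</temperature><count>3</count></sample>"
triples = xml_to_ntriples(parsed)
for line in triples:
    print(line)
```

Keeping the mapping separate from the format description means the same parsed XML could be mapped to different vocabularies without touching the description of the byte layout.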

Team Members

Collaborations & Communities

Publications and Presentations

Acknowledgements

Citations

  1. McGrath, R. E., J. Kastner, A. Rodriguez, and J. Myers, "Defuddle: a Tool for Format Translation and Metadata Extraction (Poster)". Microsoft E-science Workshop (2009).
  2. McGrath, R. E., J. Kastner, A. Rodriguez, and J. Myers, "Experiments in Data Format Interoperation Using Defuddle". National Center for Supercomputing Applications, June 2009, http://cet.ncsa.illinois.edu/publications/Data_Interoperation.pdf
  3. McGrath, R. E., J. Kastner, A. Rodriguez, and J. Myers, "Towards a Semantic Preservation System". National Center for Supercomputing Applications, June 2009, http://arxiv.org/abs/0910.3152
  4. Rodriguez, A. and R. E. McGrath, "Some Notes of comparison between DFDL and Defuddle". National Center for Supercomputing Applications, October 2010, http://cet.ncsa.uiuc.edu/publications/Review_of_Defuddle.pdf
  5. Rodriguez, A. and R. E. McGrath, "Daffodil: A New DFDL Parser". National Center for Supercomputing Applications, October 2010, http://cet.ncsa.illinois.edu/publications/Daffodil-ANewDFDLParser.pdf
  6. Talbott, T. D., K. L. Schuchardt, E. G. Stephan, and J. D. Myers, "Mapping Physical Formats to Logical Models to Extract Data and metadata: The Defuddle Parsing Engine". International Provenance and Annotation Workshop, 2006, Springer: Heidelberg, p. 73-81.
  7. Wikipedia, "Data Format Description Language". 2011, http://en.wikipedia.org/wiki/Data_Format_Description_Language
  8. Powell, A. W., M. J. Beckerle, and S. M. Hanson, "Data Format Description Language (DFDL) v1.0 Specification". GFD-P-R.174, Open Grid Forum, 2011, http://www.ogf.org/documents/GFD.174.pdf

Related Links