Why does data format matter?
Digital formats offer valuable capabilities for data storage and manipulation, but they carry inherent vulnerabilities that threaten long-term accessibility and usability. Format obsolescence, the process by which digital file formats become unusable as technology advances and support lapses, is one of the most significant risks to digital preservation. It can occur through several mechanisms: software updates that drop support for older formats, hardware that can no longer read certain storage media, or the disappearance of the technical documentation needed to interpret the data.
The challenges extend beyond purely technical considerations, as historical cases show. A well-known example is the data from the Viking Mars missions of the 1970s, some of which became almost inaccessible because of obsolete storage formats and hardware (Layne et al., 2012). The incident showed how the cutting-edge technology of one era can become a preservation liability in another.
Research data preservation faces significant obstacles in maintaining access to decades of experimental data. Proprietary formats often depend on specific software versions or platforms, creating vendor lock-in and preservation risks. Moreover, the growing volume and complexity of research data demand formats that can handle large-scale datasets efficiently while maintaining their integrity and preserving essential metadata. These challenges underscore the importance of strategic format selection in research data management.
The selection of appropriate data formats thus plays a crucial role in ensuring the long-term preservation and accessibility of research data. While specific requirements may vary by discipline and institution, following the guidelines below promotes interoperability and sustained access to research outputs. This document outlines key considerations and recommendations for format selection in research data preservation.
Main Principles
When selecting data formats for long-term preservation, the scientific team (including researchers, data managers and development specialists) should prioritise formats that maximise data reusability and long-term accessibility. Key attributes for success include:
1. Openness and Documentation
- Open formats rather than proprietary ones (e.g., CSV over proprietary database formats);
- Well-documented specifications, preferably published as ISO (International Organization for Standardization) standards;
- Format specifications freely available;
- Complete technical documentation available to the public.
2. Format Characteristics
- Text-based formats preferred over binary (e.g., plain text files readable in standard text editors);
- Simple formats preferred over complex ones (e.g., CSV rather than XLSX for tabular data);
- Machine-readable structure (e.g., standard hierarchies, clear semantic meaning);
- Ability to export to open formats (e.g., DOCX and XLSX files can be unpacked into XML).
3. Technical Sustainability
- Widespread adoption by relevant (research) communities;
- Regular maintenance and updates;
- Built-in error detection capabilities;
- Support for metadata embedding.
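The remark under Format Characteristics that DOCX and XLSX files can be unpacked into XML can be illustrated with a short sketch. Office Open XML documents are ZIP containers holding XML parts, so Python's standard `zipfile` module is enough to inspect them. The function names `list_xml_parts` and `make_demo_container` are illustrative assumptions, and the demo container only mimics the layout of a spreadsheet; it is not a valid XLSX file.

```python
import io
import zipfile


def list_xml_parts(source):
    """Return the names of the XML parts stored inside a ZIP-based document.

    `source` may be a file path or a binary file-like object.
    Relationship files (.rels) are included because they are XML as well.
    """
    with zipfile.ZipFile(source) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.endswith((".xml", ".rels"))
        )


def make_demo_container():
    """Build a tiny in-memory ZIP that mimics the layout of an XLSX file.

    Hypothetical stand-in for demonstration only, not a real spreadsheet.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("xl/workbook.xml", "<workbook/>")
        archive.writestr("xl/media/image1.png", b"placeholder bytes")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # Pass a real path such as "report.xlsx" instead of the demo container.
    for part in list_xml_parts(make_demo_container()):
        print(part)
```

Being able to reach the XML parts with generic tools is what makes such formats recoverable even if the original application is no longer available, although the XML itself still has to be interpreted against the published specification.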
Recommended formats by data type
The choice of format depends on the type of data, as presented below. For a comprehensive overview of technical considerations and best practices, including file formats and standards, refer to the Digital Preservation Handbook from the Digital Preservation Coalition.
**Textual Data**
- Plain text (UTF-8, ASCII)
- PDF/A (ISO 19005)
**Tabular Data**
- CSV (RFC 4180 compliant)
- ODS (OpenDocument Spreadsheet)
- Database dumps in SQL format
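As a minimal sketch of what writing RFC 4180 style CSV can look like, the example below uses Python's standard `csv` module. Its default `excel` dialect ends each record with CRLF and escapes embedded quotes by doubling them, which matches the conventions the RFC describes. The function name `rows_to_csv` and the sample column names are illustrative assumptions.

```python
import csv
import io


def rows_to_csv(header, rows):
    """Serialise a header and rows to CSV text.

    Uses the csv module's default 'excel' dialect: comma-separated fields,
    CRLF record endings, and quotes doubled inside quoted fields.
    """
    buffer = io.StringIO(newline="")  # keep CRLF exactly as written
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    text = rows_to_csv(
        ["sample_id", "note"],
        [["S1", 'said "ok"'], ["S2", "a, b"]],
    )
    # repr() makes the CRLF line endings and the quoting visible.
    print(repr(text))
```

When saving the result to disk, opening the file with `encoding="utf-8"` and `newline=""` keeps both the character encoding and the line endings predictable across platforms.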
**Images**
- TIFF (uncompressed)
- PNG
- JPEG2000 (lossless)
- SVG (for vector graphics)
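PNG is a convenient illustration of the "built-in error detection" principle listed earlier: every chunk in a PNG file ends with a CRC-32 computed over the chunk type and data. The sketch below walks the chunks and reports any whose stored CRC does not match. The function names and the 1x1 demo image are illustrative assumptions; this is an integrity spot check, not a full PNG validator.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type, data):
    """Assemble one PNG chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(chunk_type + data)
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def build_demo_png():
    """Build a minimal 1x1 greyscale PNG in memory (demo data only)."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)  # width, height, depth, ...
    idat = zlib.compress(b"\x00\x00")  # filter byte + one black pixel
    return (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))


def find_corrupt_chunks(png_bytes):
    """Return the types of chunks whose stored CRC does not match their content."""
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file: signature mismatch")
    corrupt = []
    offset = len(PNG_SIGNATURE)
    while offset + 12 <= len(png_bytes):
        (length,) = struct.unpack(">I", png_bytes[offset:offset + 4])
        chunk_type = png_bytes[offset + 4:offset + 8]
        data = png_bytes[offset + 8:offset + 8 + length]
        stored = png_bytes[offset + 8 + length:offset + 12 + length]
        if len(stored) < 4:
            # The file ends before the chunk does: report it and stop.
            corrupt.append(chunk_type.decode("latin-1"))
            break
        (stored_crc,) = struct.unpack(">I", stored)
        if stored_crc != zlib.crc32(chunk_type + data):
            corrupt.append(chunk_type.decode("latin-1"))
        offset += 12 + length
    return corrupt


if __name__ == "__main__":
    png = build_demo_png()
    print(find_corrupt_chunks(png))   # intact file
    damaged = bytearray(png)
    damaged[41] ^= 0xFF               # flip the first byte of the IDAT data
    print(find_corrupt_chunks(bytes(damaged)))
```

Formats without such internal checks can still be protected by storing external checksums alongside the files, but a built-in check travels with the data itself.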
**Audio**
- WAVE
- FLAC
- AIFF
**Video**
- Motion JPEG 2000
- FFV1/MKV
- Uncompressed AVI
- MPEG-4 (H.264) for access copies
**Scientific/Statistical Data**
- HDF5
- netCDF
- Portable formats of statistical software (e.g., SPSS, Stata, R, SAS)