Data Preservation and Sharing
This section reviews the core concepts of data preservation and sharing. It covers why funders and organisations promote data sharing and what incentives exist for researchers. We identify common data sharing models and summarise the roles of data licences, access policies, and data use conditions.
Key concepts
Research data is “any information that has been collected, observed, generated or created to validate original research findings” (ULeeds, 2024). Research data includes lab notebooks, software, schematics, media recordings, presentation materials, methods, SOPs, and other documents. Researchers should consider this broad range when preserving and sharing data.
Data preservation consists of “a series of activities necessary to ensure safety, integrity and accessibility of data for as long as necessary, even decades” (RDMkit, 2024). The main goal is to maintain accessibility, which is at risk because data formats may become outdated, storage technologies may change, files may get corrupted, and the contexts of data production and reuse may vary. Preservation requires coordinated action by dedicated technology and curation teams to ensure that datasets are kept together with the metadata essential for future discovery, understanding, and reuse. The recommended way of preserving research data is to deposit it in a suitable repository. It is important to distinguish data preservation from data archival, a term organisations often use. Archival typically refers to long-term cold storage to meet legal or organisational data retention requirements, often without guarantees of continued data accessibility. While archiving data is better than losing it, such data may have low reuse potential, and its accessibility may decline over time.
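One of the risks mentioned above, file corruption, is commonly detected by recording a checksum at deposit time and recomputing it later. The following is a minimal sketch of that idea, assuming SHA-256 as the digest; the function names `sha256_of` and `is_unchanged` are hypothetical, and a given repository may use different algorithms or tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large data files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, recorded_digest: str) -> bool:
    """Compare the file's current digest with the one recorded at deposit time."""
    return sha256_of(path) == recorded_digest
```

A check like this only reveals that a file has changed; it does not address outdated formats or missing metadata, which is why preservation also relies on curation.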
Data sharing “is making your data known to others so that they can (potentially) re-use it” (RDMkit, 2024). Funders, as part of their open science policies, encourage data sharing “as openly as possible but as closed as necessary” (EC, 2024). Data can be shared with the global research community, as in the case of open data, or it can be shared under controlled access subject to conditions. Data sharing often involves moving data from your research environment to a suitable repository; this could be an institutional or national repository, or a discipline-specific community database.
Data repositories are dedicated platforms for safekeeping research data for future reference and reuse. Depositing research data in repositories helps ensure data are Findable, Accessible, Interoperable, and Reusable (FAIR) (Wilkinson, 2016). While the FAIR principles are aspirational and no repository guarantees full compliance, repositories do ensure that data are at least findable and, to a minimum extent, accessible. Sharing data through repositories or data papers is a more principled and FAIR way of sharing, as it
- makes your data findable, often through a searchable catalogue,
- informs users of the means to access the data, e.g. by providing a download link or pointing to a contact for requesting access,
- provides a citation entry for your data, allowing you to take credit for data as well as your research findings.
We describe repositories and their characteristics in detail on our dedicated page here.
The rationale for data preservation and sharing
Institutions and funders require researchers to share data for several reasons:
- Reproducibility: Ensures the integrity of research conducted under their roof.
- Public Accountability: Most research is publicly funded and should be accessible to the public.
- Leveraging Investments: Data may have unexpected future uses, creating future value.
- Advancing Research: Sharing data can accelerate research and innovation, as seen in drug development at times of crisis, such as the COVID-19 pandemic.
For researchers, sharing data offers several benefits:
- Citation Advantage: Studies show data sharing can increase citations by up to 25% in some disciplines.
- Increased Visibility: Your data can reach a broader audience compared to your paper alone.
- Compliance: Funders, organisations, and publishers may require data sharing per policy.
- Cost Savings: Storing data in repositories can save on storage costs.
- Future Reference: Properly documenting and publishing data helps you and your team remember details and ensures the future accessibility of your research artefacts.
Models of Data Sharing
The standard models of data sharing are listed below; data repositories across the world typically support one or more of these models:
- Open-Access: Publicly accessible data.
- Registered-Access: Vetted researchers with institutional or research project affiliations are granted access.
- Controlled-Access: A data access committee oversees access by reviewing an access request.
- Consortium-Access: Researchers can join a consortium to access data collected in long-running programs.
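The four models above differ mainly in the condition a requester must meet before access is granted. The sketch below illustrates that difference under strong simplifying assumptions: the `Requester` fields and the `may_access` function are hypothetical, and real repositories implement vetting and committee review through far richer procedures.

```python
from dataclasses import dataclass
from enum import Enum


class AccessModel(Enum):
    OPEN = "open-access"
    REGISTERED = "registered-access"
    CONTROLLED = "controlled-access"
    CONSORTIUM = "consortium-access"


@dataclass
class Requester:
    # Hypothetical flags standing in for real vetting and review outcomes.
    is_vetted_researcher: bool = False
    has_committee_approval: bool = False
    is_consortium_member: bool = False


def may_access(model: AccessModel, requester: Requester) -> bool:
    """Return True if the requester meets the condition attached to the model."""
    if model is AccessModel.OPEN:
        return True  # publicly accessible data
    if model is AccessModel.REGISTERED:
        return requester.is_vetted_researcher
    if model is AccessModel.CONTROLLED:
        return requester.has_committee_approval
    if model is AccessModel.CONSORTIUM:
        return requester.is_consortium_member
    raise ValueError(f"Unknown access model: {model}")
```

For example, `may_access(AccessModel.CONTROLLED, Requester(is_vetted_researcher=True))` evaluates to `False` in this sketch, because being a vetted researcher does not replace a data access committee's decision on a specific request.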
The following models of data sharing are not recommended:
- Peer-to-peer informal access: This is the widely known practice of including a “data available upon request” statement in research manuscripts. Studies show that the odds of such data still being available fall by 17% per year, and the odds that the email addresses provided in manuscripts for requesting the data still work fall by 7% per year (Vines, 2014). Funders and publishers increasingly discourage this method of sharing.
- Exclusive access: In practice, data are not shared beyond the principal investigators who generate or collect them first-hand.
Using group websites or publishing data as supplementary material is also not recommended due to poor adherence to FAIR principles.
Data Licences, Access Policies and Use Conditions
When researchers share data, it's crucial to communicate the terms and conditions to prospective users. There are two main ways to do this:
- Pre-defined Data Licences: Using licences like Creative Commons or Open Data Commons allows data creators to grant usage rights. Generalist repositories, often open access, use this method and require data depositors to associate licences with their submissions.
- Access Policies of Data Repositories: Common in discipline-specific repositories, this approach involves a repository-wide access policy that applies to all datasets within it.
In controlled-access scenarios, conditions can also be attached to the data being shared. These data use conditions are an essential part of a dataset's administrative metadata. They outline the boundary conditions under which data use can occur:
- Who can use the data
- What purposes it can be used for
- Where it can be used (e.g., GDPR restrictions in Europe)
- When it can be used (e.g., embargo periods)
- How it can be used (e.g., reporting of incidental findings to data subjects)
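The five questions above can be captured as a small machine-readable record stored alongside the dataset's administrative metadata. The following is a minimal sketch, not a standard schema: the class `DataUseConditions`, its field names, and the example values are all hypothetical and chosen only to mirror the list.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date
from typing import Optional


@dataclass
class DataUseConditions:
    """Hypothetical record; fields mirror the who/what/where/when/how list."""
    who: str                      # who can use the data
    what: str                     # permitted purposes
    where: str                    # geographical or legal restrictions
    embargo_until: Optional[str]  # "when": ISO date the embargo ends, or None
    how: str                      # obligations on the manner of use


def is_under_embargo(conditions: DataUseConditions, today: date) -> bool:
    """Return True while the embargo date, if any, has not yet been reached."""
    if conditions.embargo_until is None:
        return False
    return today < date.fromisoformat(conditions.embargo_until)


# Illustrative values only; they do not describe a real dataset.
example = DataUseConditions(
    who="Researchers affiliated with a recognised institution",
    what="Disease-specific research",
    where="Processing within the European Economic Area",
    embargo_until="2026-01-01",
    how="Incidental findings must be reported to data subjects",
)

# Serialise the record so it can travel with the dataset description.
print(json.dumps(asdict(example), indent=2))
```

Keeping the conditions structured rather than as free text makes it possible to check some of them automatically, as `is_under_embargo` does for the "when" condition.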
Research communities, like those in human genomics, have developed standard codes to represent use conditions (Lawson, 2024). These codes are included in dataset descriptions and can be referenced when searching for datasets in repositories. Examples of such use codes are:
- General research uses only
- Disease-specific research only
- Non-profit use only
- Geographical restrictions
- Return (results) to the database
- Collaboration required
- Ethical approval required
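To illustrate how such codes can be referenced when searching, the sketch below filters a small in-memory catalogue by use code. The code labels are made-up strings based on the examples above, not the identifiers of any published standard, and the `Dataset` class, the two helper functions, and the catalogue entries are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

# Illustrative labels only; these are NOT identifiers from a published standard.
GENERAL_RESEARCH = "general-research-use"
DISEASE_SPECIFIC = "disease-specific-research"
NON_PROFIT = "non-profit-use-only"
ETHICS_APPROVAL = "ethical-approval-required"


@dataclass(frozen=True)
class Dataset:
    name: str
    use_codes: FrozenSet[str]


def with_code(datasets: List[Dataset], code: str) -> List[Dataset]:
    """Datasets whose description carries the given use code."""
    return [d for d in datasets if code in d.use_codes]


def without_codes(datasets: List[Dataset], unmet: FrozenSet[str]) -> List[Dataset]:
    """Datasets that carry none of the conditions the requester cannot meet."""
    return [d for d in datasets if not (d.use_codes & unmet)]


# Hypothetical catalogue entries.
catalogue = [
    Dataset("cohort-a", frozenset({GENERAL_RESEARCH})),
    Dataset("cohort-b", frozenset({DISEASE_SPECIFIC, ETHICS_APPROVAL})),
    Dataset("cohort-c", frozenset({GENERAL_RESEARCH, NON_PROFIT})),
]

# A commercial requester cannot satisfy a non-profit-only condition.
for dataset in without_codes(catalogue, frozenset({NON_PROFIT})):
    print(dataset.name)
```

A filter like this only narrows the search; whether a particular request satisfies a dataset's conditions is still decided under the repository's access policy.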
When sharing restricted-access data, it is essential that such conditions are specified in the dataset description, the data access policy, or the associated licence.