Selecting a Data Repository

Learn how to evaluate and select appropriate data repositories.

Overview

As outlined in NIH's Supplemental Policy Information: Selecting a Repository for Data Resulting from NIH-Supported Research, using a quality data repository generally improves the FAIRness (Findability, Accessibility, Interoperability, and Reusability) of the data. For that reason, NIH strongly encourages the use of established repositories to the extent possible for preserving and sharing scientific data.

While NIH supports many data repositories, there are also many biomedical data repositories and generalist repositories supported by other organizations, both public and private. Researchers may wish to consult experts in their own institutions (e.g., librarians, data managers) for assistance in selecting an appropriate data repository.

NIH encourages researchers to select data repositories that exemplify the desired characteristics below, including when a data repository is supported or provided by a cloud-computing or high-performance computing platform. These desired characteristics aim to ensure that data are managed and shared in ways that are consistent with FAIR data principles.

Selecting a Data Repository

  • For some programs and types of data, NIH and/or Institute, Center, Office (ICO) policy(ies) and funding opportunities identify particular data repositories (or sets of repositories) to be used to preserve and share data.
    • For data generated from research subject to such policies or funded under such opportunities, researchers should use the designated data repository(ies).
  • For data generated from research for which no data repository is specified by NIH, researchers are encouraged to select a data repository that is appropriate for the data generated from the research project. Be sure to consult the list of desirable characteristics and the following guidance:
    • Primary consideration should be given to data repositories that are discipline or data-type specific to support effective data discovery and reuse. For a list of NIH-supported repositories, visit Repositories for Sharing Scientific Data.
    • If no appropriate discipline or data-type specific repository is available, researchers should consider a variety of other potentially suitable data sharing options:
      • Small datasets (up to 2 GB in size) may be included as supplementary material to accompany articles submitted to PubMed Central (see the PubMed Central instructions).
      • Data repositories, including generalist repositories or institutional repositories, that make data available to the larger research community, institutions, or the broader public.
      • Large datasets may benefit from cloud-based data repositories for data access, preservation, and sharing.
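
The selection order above can be sketched as a small decision function. This is only an illustration, not an NIH tool: the `Dataset` fields, the function name, and the `large_gb` threshold are hypothetical, since the guidance does not define how large a "large" dataset is. Only the 2 GB limit for supplementary material comes from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

# "Up to 2 GB" comes from the guidance above.
SUPPLEMENTARY_LIMIT_GB = 2.0


@dataclass
class Dataset:
    """Hypothetical description of a dataset to be shared."""
    size_gb: float
    # Repository named by NIH/ICO policy or the funding opportunity, if any.
    designated_repository: Optional[str] = None
    # Discipline- or data-type-specific repository that fits the data, if any.
    domain_repository: Optional[str] = None


def suggest_sharing_options(ds: Dataset, large_gb: float = 100.0) -> List[str]:
    """Return sharing options in the order the guidance considers them.

    large_gb is an assumed cutoff for "large"; the guidance gives none.
    """
    # 1. A designated repository takes precedence over everything else.
    if ds.designated_repository:
        return ["designated repository: " + ds.designated_repository]
    # 2. Otherwise prefer a discipline- or data-type-specific repository.
    if ds.domain_repository:
        return ["domain-specific repository: " + ds.domain_repository]
    # 3. Otherwise list the remaining options that apply.
    options = []
    if ds.size_gb <= SUPPLEMENTARY_LIMIT_GB:
        options.append("supplementary material with a PubMed Central article")
    options.append("generalist or institutional repository")
    if ds.size_gb >= large_gb:
        options.append("cloud-based data repository")
    return options
```

For example, `suggest_sharing_options(Dataset(size_gb=0.5))` lists supplementary material first, followed by a generalist or institutional repository. The result is a starting point for discussion with institutional experts, not a substitute for checking the applicable policy.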

See Repositories for Sharing Scientific Data for a listing of NIH-supported data repositories.

Desirable Characteristics for All Data Repositories

When choosing a repository to manage and share data resulting from Federally funded research, here are some desirable characteristics to look for:

  • Unique Persistent Identifiers: Assigns datasets a citable, unique persistent identifier, such as a digital object identifier (DOI) or accession number, to support data discovery, reporting, and research assessment. The identifier points to a persistent landing page that remains accessible even if the dataset is de-accessioned or no longer available.
  • Long-Term Sustainability: Has a plan for long-term management of data, including maintaining integrity, authenticity, and availability of datasets; building on a stable technical infrastructure and funding plans; and having contingency plans to ensure data are available and maintained during and after unforeseen events.
  • Metadata: Ensures datasets are accompanied by metadata to enable discovery, reuse, and citation of datasets, using schema that are appropriate to, and ideally widely used across, the community(ies) the repository serves. Domain-specific repositories would generally have more detailed metadata than generalist repositories.
  • Curation and Quality Assurance: Provides, or has a mechanism for others to provide, expert curation and quality assurance to improve the accuracy and integrity of datasets and metadata.
  • Free and Easy Access: Provides broad, equitable, and maximally open access to datasets and their metadata free of charge in a timely manner after submission, consistent with legal and ethical limits required to maintain privacy and confidentiality, Tribal sovereignty, and protection of other sensitive data.
  • Broad and Measured Reuse: Makes datasets and their metadata available with the broadest possible terms of reuse, and provides the ability to measure attribution, citation, and reuse of data (i.e., through assignment of adequate metadata and unique persistent identifiers, or PIDs).
  • Clear Use Guidance: Provides accompanying documentation describing terms of dataset access and use (e.g., particular licenses, need for approval by a data use committee).
  • Security and Integrity: Has documented measures in place to meet generally accepted criteria for preventing unauthorized access to, modification of, or release of data, with levels of security that are appropriate to the sensitivity of data.
  • Confidentiality: Has documented capabilities for ensuring that administrative, technical, and physical safeguards are employed to comply with applicable confidentiality, risk management, and continuous monitoring requirements for sensitive data.
  • Common Format: Allows datasets and metadata downloaded, accessed, or exported from the repository to be in widely used, preferably non-proprietary, formats consistent with those used in the community(ies) the repository serves.
  • Provenance: Has mechanisms in place to record the origin, chain of custody, and any modifications to submitted datasets and metadata.
  • Retention Policy: Provides documentation on policies for data retention within the repository.
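
When comparing candidate repositories, the characteristics above can be treated as a checklist. The sketch below is one possible way to do that; the function name and the idea of passing in a set of "documented" characteristics are assumptions for illustration, and only the characteristic names are taken from the list above.

```python
from typing import Iterable, List

# Characteristic names as listed above.
DESIRABLE_CHARACTERISTICS = (
    "Unique Persistent Identifiers",
    "Long-Term Sustainability",
    "Metadata",
    "Curation and Quality Assurance",
    "Free and Easy Access",
    "Broad and Measured Reuse",
    "Clear Use Guidance",
    "Security and Integrity",
    "Confidentiality",
    "Common Format",
    "Provenance",
    "Retention Policy",
)


def missing_characteristics(documented: Iterable[str]) -> List[str]:
    """Return the listed characteristics a repository has not documented.

    `documented` holds the characteristic names a reviewer was able to
    confirm for the repository (a hypothetical input format).
    """
    documented_set = set(documented)
    unknown = documented_set - set(DESIRABLE_CHARACTERISTICS)
    if unknown:
        # Catch typos rather than silently ignoring them.
        raise ValueError("unrecognized characteristics: " + ", ".join(sorted(unknown)))
    # Preserve the order used in the guidance.
    return [c for c in DESIRABLE_CHARACTERISTICS if c not in documented_set]
```

A gap in the result does not rule a repository out, since the guidance describes these as desirable characteristics rather than requirements; it simply shows where to look more closely.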

Additional Considerations for Human Data

When working with human participant data, including de-identified human data, here are some additional characteristics to look for:

  • Fidelity to Consent: Uses documented procedures to restrict dataset access and use to those that are consistent with participant consent and changes in consent.
  • Restricted Use Compliant: Uses documented procedures to communicate and enforce data use restrictions, such as preventing reidentification or redistribution to unauthorized users.
  • Privacy: Implements and provides documentation of measures (for example, tiered access, credentialing of data users, security safeguards against potential breaches) to protect human subjects’ data from inappropriate access.
  • Plan for Breach: Has security measures that include a response plan for detected data breaches.
  • Download Control: Controls and audits access to and download of datasets (if download is permitted).
  • Violations: Has procedures for addressing violations of terms-of-use by users and data mismanagement by the repository.
  • Request Review: Makes use of an established and transparent process for reviewing data access requests.

Repositories for Scientific Data

See Repositories for Sharing Scientific Data for a listing of NIH-affiliated data repositories.
