Data Management

Proper data management is crucial for maintaining scientific rigor and research integrity. Learn about best practices for scientific data management.

Data management is the process of validating, organizing, protecting, maintaining, and processing scientific data to ensure the accessibility, reliability, and quality of the data for its users.

Proper data management helps maintain scientific rigor and research integrity. Keeping good track of data and associated documentation lets researchers and collaborators use data consistently and accurately. Carefully storing and documenting data also allows more people to use the data in the future, potentially leading to more discoveries beyond the initial research.

NIH emphasizes the importance of good data management practices and encourages researchers to follow the practices established within their specific research communities.

Refer to Writing a Data Management & Sharing Plan for the aspects to address in a Data Management and Sharing (DMS) Plan.

FAIR Principles

NIH encourages data management and sharing practices to be consistent with the FAIR (Findable, Accessible, Interoperable, and Reusable) data principles.

These principles make it easier for computers to process and analyze datasets, which is important when reusing or repurposing datasets for secondary research.

To learn more, visit the GO FAIR initiative or read The NIH Strategic Plan for Data Science.

Length of Time to Maintain Data

Per Section 8.4.2 of the NIH Grants Policy Statement, grantee institutions are required to keep the data for 3 years following closeout of a grant or contract agreement. Contracts may specify different time periods. Please note that the grantee institution may have additional policies and procedures regarding the custody, distribution, and required retention period for data produced under research awards. 
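
As a rough illustration of the arithmetic, the sketch below computes the date on which a 3-year retention period would end, given a closeout date. The function name and the handling of a 29 February closeout are assumptions made for this example; contracts and institutional policies may require a different period, so the result is a planning aid rather than an authoritative deadline.

```python
from datetime import date


def retention_end(closeout, years=3):
    """Return the date that falls `years` after the closeout date.

    Hypothetical helper: the default of 3 years mirrors the period stated
    above, but a contract or institution may specify something different.
    """
    try:
        return closeout.replace(year=closeout.year + years)
    except ValueError:
        # The closeout date is 29 February and the target year is not a
        # leap year; this sketch falls back to 28 February.
        return closeout.replace(year=closeout.year + years, day=28)


# Example with an illustrative closeout date.
example_end = retention_end(date(2024, 6, 30))
```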

Metadata and Other Associated Documentation

Metadata and other documentation associated with a dataset allow users to understand how the data were collected and how to interpret them. This documentation helps others use the dataset correctly and reduces the risk of misuse, misinterpretation, and confusion.

The exact metadata or other associated documentation will vary by scientific area, study design, the type of data collected, and characteristics of the dataset.

Here are examples of metadata or other information that may be included with research data:

  • Methodology and procedures used to collect the data
  • Data labels
  • Definitions of variables
  • Any other information necessary to reproduce and understand the data
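
One possible way to keep this kind of information with a dataset is a small machine-readable metadata record stored next to the data file. The sketch below is only an illustration: the study, the field names, and the variable definitions are all hypothetical, and many research communities have their own metadata standards that should take precedence.

```python
import json

# Hypothetical metadata record for an illustrative dataset. Every field name
# and value here is an assumption made for this sketch.
metadata = {
    "title": "Example blood pressure study",
    "methodology": "Seated measurements with an automated cuff; two readings averaged",
    "variables": {
        "participant_id": {"label": "Participant identifier", "type": "string"},
        "sbp_mmhg": {"label": "Systolic blood pressure", "type": "integer", "units": "mmHg"},
        "visit_date": {"label": "Date of visit", "type": "date", "format": "YYYY-MM-DD"},
    },
}


def undocumented_columns(columns, record):
    """Return the column names that have no definition in the metadata record."""
    documented = record["variables"]
    return [name for name in columns if name not in documented]


def to_sidecar_json(record):
    """Serialize the record so it can be saved alongside the data file."""
    return json.dumps(record, indent=2, sort_keys=True)


# A column that was added to the data but never documented is flagged.
missing = undocumented_columns(["participant_id", "dbp_mmhg"], metadata)
```

Checking column names against the record before sharing is a simple way to catch variables that were added to the data without a definition.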

Naming Conventions

Before starting a project that will generate large amounts of data, the project team may find it useful to agree on naming conventions or unique identifiers for objects, files, and file versions.
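
As an illustration, a team could encode its convention in a small helper so that names are generated and checked consistently. The pattern used here (project, description, date, and two-digit version) is a hypothetical convention chosen for this sketch, not one recommended by NIH.

```python
import re
from datetime import date

# Hypothetical convention: <project>_<description>_<YYYYMMDD>_v<NN>.<ext>
NAME_PATTERN = re.compile(r"^[a-z0-9]+_[a-z0-9-]+_\d{8}_v\d{2}\.[a-z0-9]+$")


def build_name(project, description, day, version, ext):
    """Build a file name that follows the hypothetical convention above."""
    # Lowercase the description and collapse anything else into hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"{project.lower()}_{slug}_{day:%Y%m%d}_v{version:02d}.{ext}"


def follows_convention(filename):
    """Return True if the file name matches the agreed pattern."""
    return NAME_PATTERN.match(filename) is not None


example_name = build_name("BP01", "Baseline Visits", date(2024, 3, 5), 2, "csv")
```

Putting the date in year-month-day order and zero-padding the version number means that an alphabetical sort of the file names also sorts them chronologically and by version.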

Common Data Elements

Common data elements (CDEs) are pieces of data common to multiple datasets across different studies. NIH encourages researchers to use CDEs, which help improve accuracy, consistency, and interoperability among datasets within various areas of health and disease research. For more information, see this National Library of Medicine article about CDEs. NIH also maintains a repository of NIH CDEs.

Data Storage Format

There are many storage formats for different types and sizes of datasets. For instance, small and simple datasets can be managed in a spreadsheet program. More complicated or larger datasets may need to be managed in a database. Remember that some types of data storage incur costs, which may be part of the project budget. See Budgeting for Data Management & Sharing for details. 
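
To illustrate the step from a spreadsheet to a database, the sketch below loads a small CSV export into SQLite using only the Python standard library. The table name, column names, and sample values are hypothetical, and an in-memory database is used so the example leaves no files behind; a real project would choose a storage system suited to its data and budget.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content standing in for a small spreadsheet export.
CSV_TEXT = """participant_id,sbp_mmhg
P001,118
P002,131
P003,125
"""


def load_into_database(csv_text, connection):
    """Create a typed table and insert every CSV row; return the row count."""
    rows = [
        (row["participant_id"], int(row["sbp_mmhg"]))
        for row in csv.DictReader(io.StringIO(csv_text))
    ]
    connection.execute(
        "CREATE TABLE IF NOT EXISTS readings ("
        "participant_id TEXT PRIMARY KEY, sbp_mmhg INTEGER NOT NULL)"
    )
    connection.executemany(
        "INSERT INTO readings (participant_id, sbp_mmhg) VALUES (?, ?)", rows
    )
    connection.commit()
    return len(rows)


connection = sqlite3.connect(":memory:")
loaded = load_into_database(CSV_TEXT, connection)

# Once the data are in a database, they can be queried instead of filtered by hand.
high = connection.execute(
    "SELECT participant_id FROM readings WHERE sbp_mmhg >= 130 ORDER BY participant_id"
).fetchall()
```

The table definition also enforces rules a spreadsheet does not, such as unique participant identifiers and non-missing readings.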

Data Security

Maintaining multiple copies of data can help protect against unforeseen events. Similarly, version control can help maintain the integrity of data. For those storing data in a repository, see Selecting a Data Repository for guidance on selecting an appropriate repository.
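
When keeping multiple copies, one common way to confirm that a copy still matches the original is to compare checksums. The sketch below is a minimal example using SHA-256 from the Python standard library; the helper names and file names are hypothetical, and the files are created in a temporary directory purely for demonstration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, destination):
    """Copy a file, then confirm the copy has the same checksum as the source."""
    shutil.copy2(source, destination)
    return sha256_of(source) == sha256_of(destination)


with tempfile.TemporaryDirectory() as workdir:
    original = Path(workdir) / "readings.csv"
    backup = Path(workdir) / "readings_backup.csv"
    original.write_text("participant_id,sbp_mmhg\nP001,118\n")

    # A fresh copy should match the original.
    copy_ok = copy_and_verify(original, backup)

    # Simulate a later change to the backup; the checksums no longer agree.
    backup.write_text("participant_id,sbp_mmhg\nP001,181\n")
    still_matches = sha256_of(original) == sha256_of(backup)
```

Recording the checksum alongside each copy makes it possible to detect silent corruption or unintended edits later, which complements version control rather than replacing it.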