NIH hosts many genomic and phenotypic data repositories. Learn about some of these repositories and the types of available datasets.
Accessing Genomic Data from NIH Repositories
NIH maintains a number of human and non-human genomic data repositories at the National Center for Biotechnology Information (NCBI). In addition, some NIH Institutes or Centers (ICs) maintain repositories aligned with their area of interest. Some repositories store both genomic and non-genomic (for example, imaging) data.
The process for requesting a dataset depends on how a repository manages access to its stored data:
- Open or Unrestricted Access: Some repositories store open-access or unrestricted access genomic data and consequently no special credentials are required for downloading data. These datasets are available for the public to access. Individuals downloading these data are expected to use the datasets responsibly.
- Registered Access: Some repositories allow access to their data only if users are registered with the repository. In addition, the repository might monitor the usage.
- Controlled Access: Some repositories, such as the Database of Genotypes and Phenotypes (dbGaP), require credentialed users to apply for access to data. Learn how to request a dataset from dbGaP.
- Mixed: Some repositories contain both open- and controlled-access datasets.
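The four access categories above can be sketched as a small lookup, which may help when planning a data request. This is only an illustrative model: the `AccessLevel` enum, the `REQUIREMENTS` table, and the helper functions are hypothetical names introduced here, not part of any NIH system or API, and the requirement summaries simply paraphrase the list above.

```python
from enum import Enum


class AccessLevel(Enum):
    """Hypothetical labels for the access categories described above."""
    OPEN = "open"
    REGISTERED = "registered"
    CONTROLLED = "controlled"
    MIXED = "mixed"


# Paraphrased from the list above; not an official statement of policy.
REQUIREMENTS = {
    AccessLevel.OPEN: "No special credentials; download and use responsibly.",
    AccessLevel.REGISTERED: "Register with the repository; usage may be monitored.",
    AccessLevel.CONTROLLED: "Credentialed users must apply for access to the data.",
    AccessLevel.MIXED: "Depends on the dataset: some are open, some are controlled.",
}


def needs_application(level: AccessLevel) -> bool:
    """Return True when at least some datasets at this level require an access request."""
    return level in (AccessLevel.CONTROLLED, AccessLevel.MIXED)


def describe(level: AccessLevel) -> str:
    """Return a one-line summary such as 'open: No special credentials; ...'."""
    return f"{level.value}: {REQUIREMENTS[level]}"


if __name__ == "__main__":
    for level in AccessLevel:
        print(describe(level), "| application may be needed:", needs_application(level))
```

Treating "mixed" as possibly requiring an application is a conservative assumption: for a mixed repository, the access level of the specific dataset still has to be checked.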
Trans-NIH Genomic Data Repositories
NCBI hosts repositories that contain genomic data from humans as well as many other organisms. The table below lists several frequently used repositories along with the type of data hosted at the repository, how the repository manages access, and a link to the repository’s access portal.
|NIH Repository||Repository Description||Access Level/Type||Access Portal|
|Database of Genotypes and Phenotypes (dbGaP)||An archive and distribution center for the description and results of studies that investigate the interaction of genotype and phenotype. These studies include genome-wide association (GWAS), medical resequencing, and molecular diagnostic assays, as well as associations between genotype and non-clinical traits.||Controlled: Summary-level data are open. Credentialed users must apply for access to individual-level data.||dbGaP Authorized Access System|
|Database of Short Genetic Variations (dbSNP)||dbSNP contains human single nucleotide variations, microsatellites, and small-scale insertions and deletions along with publication, population frequency, molecular consequence, and genomic and RefSeq mapping information for both common variations and clinical mutations.||Open||dbSNP home page|
|Database of Genomic Structural Variation (dbVar)||dbVar is NCBI’s database of human genomic Structural Variation — large variants >50 bp including insertions, deletions, duplications, inversions, mobile elements, translocations, and complex variants.||Open||dbVar home page|
|GenBank||GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences.||Open||GenBank access portal|
|Gene Expression Omnibus (GEO)||The Gene Expression Omnibus (GEO) is a public repository that archives and freely distributes comprehensive sets of microarray, next-generation sequencing, and other forms of high-throughput functional genomic data submitted by the scientific community.||Open||Gene Expression Omnibus (GEO) access portal|
|Sequence Read Archive (SRA)||The Sequence Read Archive (SRA) is NIH’s primary archive of high-throughput sequencing data.||Open||Sequence Read Archive (SRA) download portal|
NIH Institute and Center Supported Repositories
Some individual NIH Institutes and Centers (ICs) support repositories that contain human genomic as well as other types of data that are relevant to their specific area of interest.
The table below is a non-exhaustive list of repositories currently supported by individual institutes, centers, or offices. The table also lists who supports the repository, what type of data is hosted at the repository, how the repository manages access, and a link to the repository’s access portal.
|IC Repository||Institute, Center, or Office||Repository Description||Access||Access Portal|
|AccessClinicalData@NIAID||National Institute of Allergy and Infectious Diseases||AccessClinicalData@NIAID is an NIAID cloud-based, secure data platform that enables sharing of and access to anonymized individual, patient-level clinical data sets from NIAID-sponsored clinical trials. The goal is to harness the power of data to generate new knowledge to understand, treat, and prevent infectious diseases such as COVID-19.||Controlled: Summary-level data are open. Researchers must apply for access to individual-level data.||Accessing NIAID Clinical Trials Data|
|All of Us||NIH Office of the Director||The All of Us Research Program is part of an effort to advance individualized health care by enrolling one million or more participants to contribute their health data over many years.||Controlled: Summary-level data are open. Researchers must apply for access to individual-level data.||All of Us Research Hub|
|AnVIL||National Human Genome Research Institute||The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-Space, or AnVIL, provides a cloud environment for the analysis of large genomic and related datasets.||Mixed||AnVIL Data Portal|
|BioData Catalyst||National Heart, Lung, and Blood Institute||NHLBI BioData Catalyst is a cloud-based platform providing tools, applications, and workflows in secure workspaces.||Mixed||Accessing BioData Catalyst Data|
|Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) Data and Specimen Hub (DASH)||Eunice Kennedy Shriver National Institute of Child Health and Human Development||The NICHD Data and Specimen Hub (DASH) is a centralized resource that allows researchers to share and access de-identified data from studies funded by NICHD. DASH also serves as a portal for requesting biospecimens from selected DASH studies.||Mixed||DASH Data Request Tutorial|
|FaceBase||National Institute of Dental and Craniofacial Research||FaceBase is a collaborative NIDCR-funded project that houses comprehensive data in support of advancing research into craniofacial development and malformation.||Mixed||FaceBase: Request Access to Controlled Data|
|Kids First||NIH Office of Strategic Coordination - The Common Fund||The Gabriella Miller Kids First Data Resource Center (Kids First DRC) is a new, collaborative, pediatric research effort with the goal of understanding the genetic causes and links between childhood cancer and structural birth defects.||Mixed||Kids First Data Resource Center: Getting Started|
|NCI Cloud Resources: Broad Institute FireCloud||National Cancer Institute||FireCloud is an open, standards-based platform for performing production-scale data analysis in the cloud. Built on the Google Cloud Platform, FireCloud empowers analysts, tool developers, and production managers to run large-scale analysis and to share results with collaborators.||Mixed||FireCloud “getting started” guide|
|NCI Cloud Resources: Institute for Systems Biology ISB Cloud||National Cancer Institute||The ISB Cancer Genomics Cloud, leveraging many aspects of the Google Cloud Platform, allows scientists to interactively define and compare cohorts, examine underlying molecular data for specific genes and pathways, and share insights with collaborators.||Mixed||ISB Cancer Genomics Cloud guide|
|NCI Cloud Resources: Seven Bridges Cancer Genomics Cloud||National Cancer Institute||The Seven Bridges Cancer Genomics Cloud, hosted on Amazon, has a rich user interface that allows researchers to find data of interest and combine it with their own private data. Data can be analyzed using more than 200 preinstalled, curated bioinformatics tools and workflows.||Mixed||Seven Bridges Cancer Genomics Cloud Access Guide|
|National Institute on Aging (NIA) Genetics of Alzheimer's Disease Data Storage Site (NIAGADS)||National Institute on Aging||NIAGADS is the National Institute on Aging Genetics of Alzheimer's Disease Data Storage Site. NIAGADS is a national genetics repository created by NIA to facilitate access by qualified investigators to genotypic data for the study of genetics of late-onset Alzheimer's disease.||Controlled: Summary-level data are open. Credentialed users must apply for access to individual-level data.||NIAGADS access request portal|
|National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) Central Repository||National Institute of Diabetes and Digestive and Kidney Diseases||The NIDDK Central Repository enables scientists to test new hypotheses without the need to collect any new data or biospecimens, and provides the opportunity to pool data across several studies to increase the power of statistical analyses. In addition, most NIDDK-funded studies are collecting genetic biospecimens and carrying out high-throughput genotyping, making it possible for other scientists to use Central Repository resources to match genotypes to phenotypes and to perform informative genetic analyses.||Controlled: Summary-level data are open. Credentialed users must apply for access to individual-level data.||NIDDK Central Repository data request instructions|
|National Institute of Mental Health Data Archive (NDA)||National Institute of Mental Health||The National Institute of Mental Health Data Archive (NDA) makes available human subjects data collected from hundreds of research projects across many scientific domains. NDA provides infrastructure for sharing research data, tools, methods, and analyses enabling collaborative science and discovery. De-identified human subjects data, harmonized to a common standard, are available to qualified researchers. Summary data are available to all.||Mixed||NDA access portal|