Data Codebook

CRITICAL Dataset

The CRITICAL dataset is the first cross-Clinical and Translational Science Award (CTSA) initiative to create a multi-site, multi-modal, de-identified clinical dataset. It combines data depth with data breadth, addressing a major unmet need in healthcare research. The dataset encompasses comprehensive longitudinal inpatient and outpatient data, covering the periods before, during, and after ICU admission, for approximately 400,000 distinct critical-care patients. This diverse dataset supports the exploration of urgent clinical problems and facilitates the development of fair and generalizable AI tools for advanced patient monitoring and decision support.

The dataset has been curated to serve the research community, fostering innovations in AI/machine learning (ML), outcomes research, and other translational science domains. Its unique combination of size, diversity, and comprehensiveness makes it a valuable resource for tackling long-standing clinical challenges.

The CRITICAL dataset is designed for:

AI/ML research to develop and validate predictive models and decision support systems.
Outcomes-related research for better understanding of critical-care patient trajectories.
Exploration of clinical translation in broader research communities.

Key Features:

Largest Benchmark Dataset: Contains data from approximately 400,000 critical-care patients, making it the largest publicly shared, disease-independent clinical dataset of its kind.
Diversity: Includes extensive racial, ethnic, and geographic profiles to support fair and generalizable AI development.
Comprehensive Coverage: Features longitudinal data capturing inpatient and outpatient records before, during, and after ICU admissions.

Users of the dataset are bound by the signed Data Use Agreement (DUA), which strictly prohibits any attempt to reidentify patients, among other requirements.

Structure

CRITICAL is released as a collection of CSV files. It structures its data according to the Observational Medical Outcomes Partnership (OMOP) Common Data Model v5.3. Due to data availability at participating sites, CRITICAL includes a subset of the OMOP v5.3 tables.

Version 1.0 of the CRITICAL Dataset contains the following 17 OMOP tables, stored as CSVs. Links provided below are to the original OMOP documentation for each table, with any CRITICAL-specific conventions or differences noted.

The .zip file containing all of the tables is approximately 40GB. Unzipped, the files are approximately 300GB in total.
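Because the unzipped files total roughly 300GB, most tables will not fit in memory on a typical workstation. The following is a minimal sketch of streaming a table row by row with Python's standard csv module. It assumes the files are comma-delimited, UTF-8 encoded, and begin with a header row; none of this is confirmed by the codebook. The helper names iter_rows and count_by_column are hypothetical.

```python
import csv
from collections import Counter
from typing import Dict, Iterable, Iterator


def iter_rows(path: str) -> Iterator[Dict[str, str]]:
    """Yield one row at a time as a dict, without loading the whole file."""
    # Assumption: comma-delimited, UTF-8, first line is the header.
    with open(path, newline="", encoding="utf-8") as handle:
        yield from csv.DictReader(handle)


def count_by_column(rows: Iterable[Dict[str, str]], column: str) -> Counter:
    """Count how often each value of `column` occurs across the rows."""
    counts: Counter = Counter()
    for row in rows:
        counts[row[column]] += 1
    return counts
```

For example, count_by_column(iter_rows(path), "person_id") would tally records per patient for any table that carries the OMOP person_id column, reading the file in a single pass.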

Adjustments

The following adjustments have been applied to all records in every table:

Date shifting: Date shifting is applied to all date and date/time fields.
Suppress source fields: All *_source_value fields have been set to NULL.
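One way to confirm which columns are suppressed in a given file is to scan its rows and keep only the columns that are empty in every row. The sketch below assumes NULL is serialized as an empty field or the literal text NULL; the codebook does not state the actual representation, so adjust NULL_MARKERS as needed. The function name suppressed_columns is hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Set

# Assumption: how NULL appears in the CSV files (not stated in the codebook).
NULL_MARKERS = {"", "NULL"}


def suppressed_columns(rows: Iterable[Dict[str, Optional[str]]]) -> List[str]:
    """Return the columns whose value is a NULL marker in every row seen."""
    candidates: Optional[Set[str]] = None
    for row in rows:
        empty = {
            name
            for name, value in row.items()
            if value is None or value.strip() in NULL_MARKERS
        }
        # Keep only columns that have been empty in all rows so far.
        candidates = empty if candidates is None else candidates & empty
    return sorted(candidates or [])
```

The rows can come from csv.DictReader. A column that is reported here but is not documented as suppressed may simply be unpopulated at the contributing sites.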

Tables

PERSON
OBSERVATION_PERIOD
VISIT_OCCURRENCE
VISIT_DETAIL
CONDITION_OCCURRENCE
DRUG_EXPOSURE
- Columns stop_reason, sig, and lot_number have been set to NULL for all records.
PROCEDURE_OCCURRENCE
DEVICE_EXPOSURE
MEASUREMENT
OBSERVATION
- Column value_as_string has been set to NULL for all records.
DEATH
SPECIMEN
LOCATION
- Only location_id is provided; all other columns are set to NULL.
CARE_SITE
PROVIDER
- Columns provider_name, npi, dea, and year_of_birth have been set to NULL for all records.
DRUG_ERA
CONDITION_ERA
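
The table-specific notes above, together with the *_source_value suppression, can be encoded as a small lookup so that analysis code skips columns known to be NULL. This is an illustrative sketch, not part of the CRITICAL release; the names NULLED_COLUMNS and informative_columns are hypothetical.

```python
from typing import Dict, List, Set

# Table-specific columns documented above as NULL for all records.
NULLED_COLUMNS: Dict[str, Set[str]] = {
    "DRUG_EXPOSURE": {"stop_reason", "sig", "lot_number"},
    "OBSERVATION": {"value_as_string"},
    "PROVIDER": {"provider_name", "npi", "dea", "year_of_birth"},
}


def informative_columns(table: str, header: List[str]) -> List[str]:
    """Return the columns of `header` that are not documented as NULL."""
    name = table.upper()
    if name == "LOCATION":
        # Only location_id is provided for LOCATION.
        return [col for col in header if col == "location_id"]
    nulled = NULLED_COLUMNS.get(name, set())
    return [
        col
        for col in header
        if col not in nulled and not col.endswith("_source_value")
    ]
```

Passing a table name and the header row of its CSV returns the columns worth loading, which also reduces memory use when reading the larger tables.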

How to Cite/Acknowledge

If you use the CRITICAL dataset in your research, please acknowledge it as follows:
CRITICAL Dataset. Sponsored by the NIH National Center for Advancing Translational Sciences (NCATS) through grant number U01TR003528.

Additionally, please include the following DOI citation in your publications: 10.5281/zenodo.14532192

License/Terms of Use

The CRITICAL dataset is provided under the terms outlined in the signed DUA. Access to the CRITICAL dataset requires institutions to sign a DUA and designate administrators (DRI Admins) to oversee user access. Individual users from approved institutions submit applications through the CRITICAL website, including their role, affiliation, and intended use of the data. After institutional administrators review and approve the application, users receive instructions to download the data for a 30-day period. For details, refer to the DUA.

Key provisions include but are not limited to:

Prohibition of Reidentification: Users must not attempt to reidentify patients in the dataset under any circumstances.
Responsible Use: Users must adhere to ethical research practices and comply with applicable regulations and institutional review board (IRB) requirements.
