Data Dictionary
Overview of the CMS DE-SynPUF dataset used by this pipeline, including table schemas, column definitions, and guidance for using other samples.
Table of Contents
- About the Dataset
- New System (Under Test)
- Beneficiary Summary Table
- Carrier Claims Table
- Schema Comparison: Old vs New
- Derived / Internal Tables
- Using Other CMS Samples
- Codebook Reference
About the Dataset
CMS DE-SynPUF (Data Entrepreneurs' Synthetic Public Use File) is a synthetic Medicare claims dataset published by the Centers for Medicare & Medicaid Services (CMS). It was created to enable software development and research without exposing real patient data.
| Property | Value |
|---|---|
| Publisher | CMS (Centers for Medicare & Medicaid Services) |
| Coverage | Calendar years 2008, 2009, 2010 |
| Population | Synthetic Medicare beneficiaries (fee-for-service) |
| Samples | 20 independent samples (Sample 1 through Sample 20), each ~115K beneficiaries |
| Download | cms.gov DE-SynPUF page |
| Codebook | DE-SynPUF Codebook (PDF) |
How the data was synthesized: CMS took real Medicare claims from 2008–2010, applied statistical disclosure limitation techniques (noise injection, field swapping, synthetic generation), and produced files that preserve the statistical properties of real data (distributions, correlations, claim patterns) without containing any actual patient records.
What this pipeline uses: Sample 1 only — 5 CSV files:
| File | Table | Records | Description |
|---|---|---|---|
DE1_0_2008_Beneficiary_Summary_File_Sample_1.csv |
beneficiary_summary |
~115K | Year 2008 beneficiary demographics + financials |
DE1_0_2009_Beneficiary_Summary_File_Sample_1.csv |
beneficiary_summary |
~115K | Year 2009 |
DE1_0_2010_Beneficiary_Summary_File_Sample_1.csv |
beneficiary_summary |
~113K | Year 2010 |
DE1_0_2008_to_2010_Carrier_Claims_Sample_1A.csv |
carrier_claims |
~2.37M | Carrier claims batch A |
DE1_0_2008_to_2010_Carrier_Claims_Sample_1B.csv |
carrier_claims |
~2.37M | Carrier claims batch B |
New System (Under Test)
The new system is the replacement claims processing pipeline whose outputs are validated against the old system (ground truth). The new system data is provided as a password-protected zip file as part of the assessment.
Source: Assessment-provided zip file (password: SlJxqMl9)
File Inventory
| File | DuckDB Table | Records | Size |
|---|---|---|---|
DE1_0_2008_Beneficiary_Summary_File_Sample_1_NEWSYSTEM.csv |
new_beneficiary_summary |
~116K | 14.6 MB |
DE1_0_2009_Beneficiary_Summary_File_Sample_1_NEWSYSTEM.csv |
new_beneficiary_summary |
~115K | 14.5 MB |
DE1_0_2010_Beneficiary_Summary_File_Sample_1_NEWSYSTEM.csv |
new_beneficiary_summary |
~113K | 14.1 MB |
DE1_0_2008_to_2010_Carrier_Claims_Sample_1A_NEWSYSTEM.csv |
new_carrier_claims |
~2.37M | 1.24 GB |
DE1_0_2008_to_2010_Carrier_Claims_Sample_1B_NEWSYSTEM.csv |
new_carrier_claims |
~2.37M | 1.24 GB |
Key Observations
The new system intentionally contains issues. At a high level:
- Same file structure as the old system — 3 beneficiary summary files (one per year) + 2 carrier claims files
- Same column names — all 33 beneficiary columns and 142 carrier claims columns are present
- Row count differences — the new system produces 4,777 more carrier claims than the old system (phantom claims that exist only in the new system)
- One type difference —
CLM_IDisBIGINTin the old system andVARCHARin the new system (handled automatically by DuckDB during matching via cast) - Intentional data discrepancies — detailed in the Schema Comparison section below
The pipeline auto-detects these files by reading CSV headers — the _NEWSYSTEM filename suffix is not required.
Beneficiary Summary Table
DuckDB table name: beneficiary_summary (old system), new_beneficiary_summary (new system)
Grain: One row per beneficiary per calendar year.
Primary key: (DESYNPUF_ID, summary_year)
| Column | Type | Description | Example |
|---|---|---|---|
DESYNPUF_ID |
VARCHAR | Unique beneficiary identifier (16-char hex) | 00013D2EFD8E45D1 |
summary_year |
INTEGER | Calendar year (derived from filename during ingestion) | 2008 |
BENE_BIRTH_DT |
INTEGER | Date of birth (YYYYMMDD format) | 19230501 |
BENE_DEATH_DT |
INTEGER | Date of death, 0 if alive (YYYYMMDD) | 0 or 20090715 |
BENE_SEX_IDENT_CD |
INTEGER | Sex: 1 = Male, 2 = Female | 1 |
BENE_RACE_CD |
INTEGER | Race: 1=White, 2=Black, 3=Other, 5=Hispanic | 1 |
SP_STATE_CODE |
INTEGER | State SSA code (1–53) | 26 |
BENE_COUNTY_CD |
INTEGER | County SSA code | 999 |
BENE_HI_CVRAGE_TOT_MONS |
INTEGER | Months of Part A (Hospital Insurance) coverage | 12 |
BENE_SMI_CVRAGE_TOT_MONS |
INTEGER | Months of Part B (Supplementary Medical Insurance) coverage | 12 |
BENE_HMO_CVRAGE_TOT_MONS |
INTEGER | Months of HMO coverage | 0 |
PLAN_CVRG_MOS_NUM |
INTEGER | Months of Part D (prescription drug) coverage | 12 |
Chronic Condition Flags
Each flag is 1 (condition present) or 2 (absent/unknown). The pipeline checks value 1 for prevalence.
| Column | Condition |
|---|---|
SP_ALZHDMTA |
Alzheimer's Disease / Related Dementia |
SP_CHF |
Heart Failure |
SP_CHRNKIDN |
Chronic Kidney Disease |
SP_CNCR |
Cancer (Breast, Colorectal, Prostate, Lung, Endometrial) |
SP_COPD |
Chronic Obstructive Pulmonary Disease |
SP_DEPRESSN |
Depression |
SP_DIABETES |
Diabetes |
SP_ISCHMCHT |
Ischemic Heart Disease |
SP_OSTEOPRS |
Osteoporosis |
SP_RA_OA |
Rheumatoid Arthritis / Osteoarthritis |
SP_STRKETIA |
Stroke / Transient Ischemic Attack |
Financial Summary Columns
Annual reimbursement totals per beneficiary, broken down by service type. These should reconcile with the sum of individual claim line items (validated by the financial reconciliation checks).
| Column | Description |
|---|---|
MEDREIMB_IP |
Medicare reimbursement — Inpatient |
BENRES_IP |
Beneficiary responsibility — Inpatient |
PPPYMT_IP |
Primary payer payment — Inpatient |
MEDREIMB_OP |
Medicare reimbursement — Outpatient |
BENRES_OP |
Beneficiary responsibility — Outpatient |
PPPYMT_OP |
Primary payer payment — Outpatient |
MEDREIMB_CAR |
Medicare reimbursement — Carrier |
BENRES_CAR |
Beneficiary responsibility — Carrier |
PPPYMT_CAR |
Primary payer payment — Carrier |
Financial reconciliation formula (per codebook):
MEDREIMB_CAR = SUM(LINE_NCH_PMT_AMT) for lines where:
LINE_PRCSG_IND_CD = 'A'
OR (LINE_PRCSG_IND_CD IN ('R','S') AND LINE_ALOWD_CHRG_AMT > 0)
BENRES_CAR = SUM(LINE_BENE_PTB_DDCTBL_AMT + LINE_COINSRNC_AMT) [same filter]
PPPYMT_CAR = SUM(LINE_BENE_PRMRY_PYR_PD_AMT) [same filter]
Carrier Claims Table
DuckDB table name: carrier_claims (old system), new_carrier_claims (new system)
Grain: One row per carrier claim. Each claim can have up to 13 line items (denormalized — line item columns are suffixed _1 through _13).
Primary key: CLM_ID
Claim Header Columns
| Column | Type | Description | Example |
|---|---|---|---|
DESYNPUF_ID |
VARCHAR | Beneficiary ID (FK to beneficiary_summary) | 00013D2EFD8E45D1 |
CLM_ID |
BIGINT/VARCHAR | Unique claim identifier | 196661176960050 |
CLM_FROM_DT |
INTEGER | Claim start date (YYYYMMDD) | 20090812 |
CLM_THRU_DT |
INTEGER | Claim end date (YYYYMMDD) | 20090812 |
ICD9_DGNS_CD_1 through _8 |
VARCHAR | ICD-9 diagnosis codes (up to 8) | 4019 |
PRF_PHYSN_NPI_1, _2 |
VARCHAR | Performing physician NPI | 0000000000 |
TAX_NUM_1 through _13 |
VARCHAR | Provider tax number per line |
Claim Line Columns (×13)
Each of the following columns is repeated 13 times with suffix _1 through _13:
| Column Pattern | Type | Description |
|---|---|---|
LINE_NCH_PMT_AMT_{n} |
DOUBLE | Medicare payment amount for this line |
LINE_BENE_PTB_DDCTBL_AMT_{n} |
DOUBLE | Beneficiary Part B deductible |
LINE_COINSRNC_AMT_{n} |
DOUBLE | Beneficiary coinsurance amount |
LINE_ALOWD_CHRG_AMT_{n} |
DOUBLE | Allowed charge amount |
LINE_BENE_PRMRY_PYR_PD_AMT_{n} |
DOUBLE | Primary payer payment |
LINE_PRCSG_IND_CD_{n} |
VARCHAR | Processing indicator: A=Allowed, R=Rejected, S=Secondary |
LINE_ICD9_DGNS_CD_{n} |
VARCHAR | Line-level ICD-9 diagnosis code |
HCPCS_CD_{n} |
VARCHAR | HCPCS procedure code |
LINE_PLACE_OF_SRVC_CD_{n} |
VARCHAR | Place of service code |
Why 142 columns: 8 diagnosis codes + 2 NPIs + 13 tax numbers + (13 lines × 9 columns per line) = 142 total. This is a denormalized layout — a normalized design would have a separate claim_lines table.
Schema Comparison: Old vs New
The pipeline's Step 2 (Schema Validate) and Step 5 (Compare) verify schema compatibility. Here is the full comparison:
Beneficiary Summary
| Property | Old System | New System | Match? |
|---|---|---|---|
| Column count | 33 | 33 | ✅ Identical |
| Column names | All 33 present | All 33 present | ✅ Identical |
| Column types | All match | All match | ✅ Identical |
| Row count | 343,644 | 343,644 | ✅ Identical |
| Unique beneficiaries | ~115K | ~115K | ✅ (3 years × ~115K) |
Carrier Claims
| Property | Old System | New System | Match? |
|---|---|---|---|
| Column count | 142 | 142 | ✅ Identical |
| Column names | All 142 present | All 142 present | ✅ Identical |
| CLM_ID type | BIGINT |
VARCHAR |
⚠️ Type mismatch — cast during match |
| Row count | 4,741,335 | 4,746,112 | ❌ +4,777 new-only (phantom claims) |
Discrepancies Found by the Pipeline
| Category | Count | Description |
|---|---|---|
| Beneficiary field mismatches | 446 records | At least one field differs between old and new |
| BENE_BIRTH_DT mismatches | 178 | Largest single field — systematic date parsing bug |
| Financial column mismatches | 47–46 per column | MEDREIMB_OP, BENRES_CAR, PPPYMT_CAR, etc. |
| Claims payment mismatches | 10,411 on LINE_NCH_PMT_AMT_1 | Primary payment amount — systematic calculation error |
| Phantom claims | 4,777 | Exist in new system only, 0 lost from old system |
| Total financial divergence | $35,624.71 | All positive (new system overstates) |
For the full analysis of what these discrepancies mean, see the Key Findings narrative in the HTML report and the Analysis Findings section in SOLUTION.md.
Derived / Internal Tables
These tables are created by the pipeline during execution and are available for querying in the DuckDB database.
| Table | Created By | Grain | Description |
|---|---|---|---|
_match_beneficiary |
Step 4 | One row per unique beneficiary-year key | FULL OUTER JOIN result: match_status = matched / old_only / new_only |
_match_claims |
Step 4 | One row per unique CLM_ID | FULL OUTER JOIN result for claims |
_discrepancy_detail |
Step 5 | One row per matched beneficiary-year | Field-by-field diff flags (diff_*), dollar deltas (delta_*), and total_diffs count |
_financial_recon |
Validation | One row per beneficiary-year with claims | Reported vs. computed financial amounts from claim line aggregation |
_discrepancy_detail Columns
| Column | Type | Description |
|---|---|---|
DESYNPUF_ID |
VARCHAR | Beneficiary ID |
summary_year |
INTEGER | Calendar year |
SP_STATE_CODE |
INTEGER | State code (for geographic analysis) |
diff_{field} |
INTEGER (0/1) | 1 if old value ≠ new value for this field |
delta_{financial_field} |
DOUBLE | Dollar difference (new − old) for financial columns |
total_diffs |
INTEGER | Count of fields with mismatches for this record |
Using Other CMS Samples
The pipeline is designed to work with any DE-SynPUF sample (1–20). To use a different sample:
-
Download the desired sample files from the CMS DE-SynPUF page.
-
Place the ZIP archives in
data/original_downloads/:
bash mv *.zip data/original_downloads/
The pipeline auto-extracts ZIPs intodata/old_system/during Step 1. -
Run the pipeline — file detection is automatic (matches glob patterns
*Beneficiary*and*Carrier*):
bash python -m src.main -
For new system comparison, place the comparison CSVs in a separate directory:
bash python -m src.main --new-data path/to/new/csvs/
Compatibility notes:
- All 20 samples share the same schema (column names and types are identical).
- Sample sizes are roughly equal (~115K beneficiaries, ~4.7M claims per sample).
- Samples are independent — different synthetic beneficiaries, no overlap.
- The pipeline auto-detects file types from CSV headers, so filenames can vary.
Extending to Non-CMS Data
The pipeline can be adapted for other paired-CSV comparison use cases. The key requirements:
- Both systems produce CSVs with overlapping schemas.
- There is a joinable primary key across systems.
- Configuration changes needed: TABLE_PAIRS in src/compare.py (column lists) and KEY_CONFIG in src/pipeline/step4_match.py (join keys).
Codebook Reference
The CMS DE-SynPUF Codebook provides the authoritative definitions for all columns, value codes, and business rules. Key sections used by this pipeline:
| Codebook Section | Pipeline Usage |
|---|---|
| Table 1: Beneficiary Summary | Column definitions, chronic condition flag values (1 = yes, 2 = no) |
| Table 4: Carrier Claims | Claim structure, line item layout, processing indicator codes |
| Financial Reconciliation Rules | How MEDREIMB_CAR is computed from LINE_NCH_PMT_AMT with LINE_PRCSG_IND_CD filter |
| SSA State Codes | Mapping of SP_STATE_CODE to U.S. states (used in geographic analysis) |
Download: DE-SynPUF Codebook (PDF)