Source: REQUIREMENTS_TRACEABILITY.md

# Requirements Traceability

Maps every requirement from the assessment brief (02A: Data Engineer Take-Home Assessment) to the part of our solution that fulfills it.


## Purpose Statement Requirements

### REQ-1: Build a data comparison and validation tool

"Build a small, reliable data comparison and validation tool that compares the outcomes of the two claims processing systems."

Status: FULFILLED

| Component | Where | What It Does |
|---|---|---|
| 6-step pipeline | src/pipeline/step1_receive.py through step6_report.py | Orchestrated by src/pipeline/runner.py with gate logic — halts early on bad input |
| Validation engine | src/validate.py | 19 internal consistency checks (identity, temporal, demographic, coverage, clinical, financial) |
| Comparison engine | src/compare.py | 110 cross-system checks (schema, row-level, field-level, aggregate) |
| Record matching | src/pipeline/step4_match.py | FULL OUTER JOIN on composite keys to align old/new records |
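
The key-alignment logic behind the matching step can be sketched without a database. The real step4_match.py runs a FULL OUTER JOIN in DuckDB SQL; this stdlib sketch shows the same alignment, with illustrative record shapes (the key field names and the amt column are assumptions, not the real schema):

```python
# Sketch of the record-matching idea: align two record sets on a composite
# key, like a FULL OUTER JOIN. Field names here are illustrative.
def match_records(old_rows, new_rows, key_fields):
    key = lambda r: tuple(r[f] for f in key_fields)
    old_by_key = {key(r): r for r in old_rows}
    new_by_key = {key(r): r for r in new_rows}
    matched, old_only, new_only = [], [], []
    for k in old_by_key.keys() | new_by_key.keys():
        if k in old_by_key and k in new_by_key:
            matched.append((old_by_key[k], new_by_key[k]))
        elif k in old_by_key:
            old_only.append(old_by_key[k])   # lost records
        else:
            new_only.append(new_by_key[k])   # "phantom" records
    return matched, old_only, new_only

old = [{"DESYNPUF_ID": "A", "CLM_ID": "1", "amt": 10.0}]
new = [{"DESYNPUF_ID": "A", "CLM_ID": "1", "amt": 12.0},
       {"DESYNPUF_ID": "B", "CLM_ID": "2", "amt": 5.0}]
matched, old_only, new_only = match_records(old, new, ["DESYNPUF_ID", "CLM_ID"])
```

The three buckets map directly onto the matched / old-only / new-only counts reported elsewhere in this document.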

### REQ-2: Surface issues and inconsistencies

"The data from the new system intentionally contains issues and inconsistencies, and your goal is to surface them."

Status: FULFILLED

Issues surfaced:
- 446 beneficiary records with at least one field mismatch (out of 343,485 matched)
- 178 BENE_BIRTH_DT mismatches — largest single source of error
- 10,411 claims with mismatched LINE_NCH_PMT_AMT_1 (primary payment)
- 4,777 phantom claims — exist in new system only, zero lost from old
- $35,624.71 total financial divergence across 9 reimbursement columns
- All dollar deltas are positive — new system overstates, never understates

Evidence: reports/report_data.json → chart_data.analysis_findings, chart_data.field_mismatches, chart_data.claims_payment_mismatches
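
The "all dollar deltas are positive" claim above is a cheap invariant to verify: when every per-record delta (new minus old) is non-negative, the net and absolute totals coincide. A minimal sketch with illustrative deltas (not real report values):

```python
# One-directional bias check: net total equals absolute total only when
# no delta is negative, i.e. the new system never understates.
deltas = [0.50, 1.25, 0.00, 3.10]        # illustrative new_amt - old_amt values
net_delta = sum(deltas)
abs_delta = sum(abs(d) for d in deltas)
one_directional = net_delta == abs_delta  # True -> systematic overstatement
```

In the real report both totals come out to $35,624.71, which is what flags the divergence as a systematic bias rather than random noise.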

### REQ-3: Quantify their impact

"quantify their impact"

Status: FULFILLED

| Metric | Value | Location in Report |
|---|---|---|
| Record-level accuracy | 99.87% | Discrepancy Dashboard KPI + Key Findings |
| Beneficiaries mismatched | 446 / 343,485 | Discrepancy Dashboard KPI |
| Claims payment mismatches | 10,411 on LINE_NCH_PMT_AMT_1 | Discrepancy Dashboard chart |
| Financial divergence | $35,624.71 (0.003% of $1.24B total) | Key Findings narrative + risk assessment |
| Phantom claims | 4,777 new-only, 0 old-only | Key Findings narrative |
| Field-level breakdown | 521 total field diffs across 15+ columns | Field Mismatches by Column chart |

### REQ-4: Identify trends

"identify any trends you see"

Status: FULFILLED

Trends identified:
- Year-over-year stability: Discrepancy rate is 0.12%–0.14% across 2008–2010, suggesting bugs are inherent to migration logic, not time-dependent
- One-directional financial bias: Every dollar delta is positive (net = abs = $35,624.71), indicating systematic calculation bias, not random noise
- Field concentration: BENE_BIRTH_DT accounts for 34.2% of all field mismatches — consistent across years → systematic date parsing bug
- Payment column concentration: LINE_NCH_PMT_AMT_1 has 5x more mismatches than the next column → targeted calculation bug
- Geographic uniformity: 50 of 52 state codes have mismatches, no geographic concentration → code-level not data-level issue
- Service type breakdown: OP ($14.3K) > IP ($12.4K) > CAR ($8.9K) financial divergence

Evidence: Discrepancy Dashboard charts (trend by year, field mismatches, geographic), Key Findings narrative, SOLUTION.md Analysis Findings section
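
The field-concentration trend can be computed in a few lines. In this sketch only the BENE_BIRTH_DT count (178) comes from this document; the other column counts are illustrative stand-ins:

```python
from collections import Counter

# Share of all field mismatches attributable to each column.
# Only the BENE_BIRTH_DT figure is real; the rest are illustrative.
mismatches = Counter({"BENE_BIRTH_DT": 178,
                      "BENE_SEX_IDENT_CD": 60,
                      "SP_STATE_CODE": 40})
total = sum(mismatches.values())
shares = {col: n / total for col, n in mismatches.items()}
top_col, top_n = mismatches.most_common(1)[0]
```

A single column dominating the mismatch distribution, stable across years, is what supports the "systematic date parsing bug" hypothesis over random corruption.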

### REQ-5: Communicate what they mean for accuracy

"communicate what they mean for accuracy"

Status: FULFILLED

The report includes a Key Findings narrative section with:
- Overall accuracy assessment: "99.87% record-level accuracy"
- Root cause hypotheses: "systematic date parsing bug", "systematic calculation bias"
- Risk assessment: "$35,624.71 represents 0.003% of $1.24B total reimbursements — negligible in aggregate"
- Actionable recommendation: "not ready for production cutover" with two specific bugs to fix
- "What This Means for Accuracy" subsection with risk level classification

Evidence: HTML report → "Key Findings" + "What This Means for Accuracy" sections (JS-rendered from analysis_findings data), SOLUTION.md → Analysis Findings section

### REQ-6: Use appropriate language and database

"Use the language and database you think are most appropriate to the task."

Status: FULFILLED

| Choice | Rationale (documented in SOLUTION.md) |
|---|---|
| Python 3.14 | Current stable release (October 2025); full binary wheel support for all dependencies (DuckDB, pandas, plotly, jinja2) |
| DuckDB | Embedded analytical database — zero-config, handles 2.4GB CSVs natively, columnar engine optimized for aggregations on wide tables |

Alternatives considered and rejected: SQLite (row-oriented), PostgreSQL (requires server), Spark (overkill), Pandas-only (fragile at scale). All documented in docs/SOLUTION.md.

### REQ-7: Provide a way to review results

"There should be a way for users to review the results, it could be an excel, charts, basic web view, tableau or whatever medium you find best communicates your findings."

Status: FULFILLED — multiple review interfaces provided

| Interface | Location | Description |
|---|---|---|
| Interactive HTML report | reports/comparison_report.html | Self-contained single file with sidebar nav, Plotly charts, sortable tables, collapsible sections |
| JSON data artifact | reports/report_data.json | Machine-readable canonical data — can be consumed by any tool |
| CSV exports | reports/exports/*.csv | Raw diff tables (discrepancy_detail, financial_recon, match results) — openable in Excel |
| Web UI | web/server.py | FastAPI drag-and-drop upload interface for running the pipeline from a browser |
| React diff viewer | viewer/ | Interactive record-level diff viewer (demo data) |
| DuckDB database | data/database/cms_claims.duckdb | Query directly with the DuckDB CLI for ad-hoc analysis |
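
Because the JSON artifact is the canonical machine-readable output, any downstream tool can consume it with a few lines. The nested keys below (chart_data.field_mismatches) and the 178 figure match what this document describes, but the exact schema shown is an assumption for illustration:

```python
import json

# Minimal sketch of consuming reports/report_data.json downstream.
# The payload here is a hand-built stand-in, not the real artifact.
raw = '{"chart_data": {"field_mismatches": {"BENE_BIRTH_DT": 178}}}'
report = json.loads(raw)  # in practice: json.load(open("reports/report_data.json"))
birth_dt_mismatches = report["chart_data"]["field_mismatches"]["BENE_BIRTH_DT"]
```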

### REQ-8: Be prepared to explain code and design decisions

"Be prepared to explain your code and design decisions."

Status: FULFILLED

| Document | Content |
|---|---|
| docs/SOLUTION.md | Architecture decisions, design rationale for DuckDB/pipeline/Docker/report, analysis findings, deployment strategy, testing approach |
| docs/PIPELINE.md | 469-line reference documenting all 6 pipeline steps, data model, 129 validation/comparison checks, output artifacts |
| docs/DATA_DICTIONARY.md | Dataset overview, column definitions for all tables, codebook reference |
| docs/architecture.html | Interactive Mermaid.js architecture diagrams |
| Inline code | Docstrings, type hints, clear module separation |

### REQ-9: Depth of analysis, analysis strategy, supporting documentation

"your depth of analysis, analysis strategy, and supporting documentation providing evidence of your work will be key factors"

Status: FULFILLED

| Dimension | Evidence |
|---|---|
| Depth | 6 named findings with root cause hypotheses, financial breakdown by service type, geographic distribution analysis, trend stability assessment, risk quantification ($35K on $1.24B) |
| Strategy | Old system as ground truth → FULL OUTER JOIN matching → field-level diff flags → aggregate analysis → interpretive narrative. Documented in SOLUTION.md and PIPELINE.md |
| Documentation | 5 docs: README.md, SOLUTION.md, PIPELINE.md, DATA_DICTIONARY.md, FEEDBACK.md. Architecture diagrams. 95 automated tests. |

## Comparison Tool Objectives

### REQ-10: Ingest and clean the data

"Ingest and clean the data."

Status: FULFILLED

| Step | Source | What It Does |
|---|---|---|
| Step 1: Receive | step1_receive.py | File discovery, zip extraction, SHA-256 checksums, file inventory |
| Step 2: Schema Validate | step2_schema_validate.py | Header validation, file classification (beneficiary vs carrier), schema compatibility checks |
| Step 3: Ingest & Profile | step3_ingest.py + src/ingest.py | CSV → DuckDB loading with type inference, null detection, summary_year derivation from filename, column profiling (nulls, distinct, min/max/mean) |

Cleaning: DuckDB handles type coercion during ingest. The pipeline validates headers before loading (fail-fast on malformed input). Column profiling identifies data quality issues (high-null columns, unexpected values).
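The per-column profiling described above reduces to a handful of aggregates. The real step computes them in DuckDB SQL over the loaded tables; this stdlib sketch shows the same statistics on an illustrative two-column CSV (the column names are real, the values are not):

```python
import csv
import io

# Illustrative sample standing in for a loaded beneficiary table.
sample = io.StringIO("DESYNPUF_ID,MEDREIMB_IP\nA1,2500\nA2,\nA3,2500\n")
rows = list(csv.DictReader(sample))

def profile(rows, col):
    """Nulls, distinct count, and min/max for one numeric column."""
    vals = [r[col] for r in rows]
    non_null = [v for v in vals if v != ""]
    nums = [float(v) for v in non_null]
    return {"nulls": len(vals) - len(non_null),
            "distinct": len(set(non_null)),
            "min": min(nums), "max": max(nums)}

p = profile(rows, "MEDREIMB_IP")
```

High-null or constant columns surfaced by this kind of profile are exactly the data quality issues the pipeline flags before comparison.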

### REQ-11: Build a simple ETL pipeline into a database

"Build a simple ETL pipeline into a database or lightweight datastore of your choice."

Status: FULFILLED

Tables created: beneficiary_summary, carrier_claims, new_beneficiary_summary, new_carrier_claims, _match_beneficiary, _match_claims, _discrepancy_detail, _financial_recon

### REQ-12: Identify and quantify discrepancies

"Identify and quantify discrepancies, keep an eye out for identifiable trends."

Status: FULFILLED

See REQ-2 (surface), REQ-3 (quantify), and REQ-4 (trends) above.

### REQ-13: Provide specific metrics

"Provide metrics that helps explain where and why the systems diverge. Examples: Total number of Beneficiaries whose data does not match, Total number of Claims whose payments don't match, Other discrepancy trends."

Status: FULFILLED

| Requested Metric | Our Value | Where |
|---|---|---|
| Beneficiaries whose data does not match | 446 out of 343,485 (0.13%) | Discrepancy Dashboard KPI card + Key Findings |
| Claims whose payments don't match | 10,411 on LINE_NCH_PMT_AMT_1; 657 on LINE_ALOWD_CHRG_AMT_1; 651 on LINE_COINSRNC_AMT_1; 635 on LINE_BENE_PTB_DDCTBL_AMT_1 | Discrepancy Dashboard "Claims Payment Mismatches" chart |
| Other discrepancy trends | Year-over-year stability, one-directional financial bias, field concentration (BENE_BIRTH_DT = 34.2%), geographic uniformity, phantom claims (4,777 new-only) | Key Findings narrative, Discrepancy Dashboard charts, SOLUTION.md Analysis Findings |

### REQ-14: Generate a report

"Generate a report that communicates your findings."

Status: FULFILLED

The HTML report (reports/comparison_report.html) includes:
- Sidebar navigation with active section highlighting
- Data Context — what's being compared (file inventories, match results)
- Executive Summary — 8 KPI cards, failed checks highlighted
- Discrepancy Dashboard — 4 KPI cards, 4 interactive charts, Key Findings narrative with accuracy assessment
- Validation Results — 19-check table + bar chart
- Year-over-Year Trends — beneficiary count, claims count charts
- Financial Analysis — reimbursement trends, payment distribution, chronic condition prevalence
- System Comparison — 110-check collapsible table
- Data Profiles — collapsible column-level quality profiles for all tables


## Submission Requirements

### REQ-15: Source code

"Source code."

Status: FULFILLED

All source code in src/, web/, cloud/, infra/, viewer/, tests/. Git history preserved.

### REQ-16: Comparison report or UI link

"Your comparison report or a link to your UI if applicable."

Status: FULFILLED — the interactive HTML report and Web UI described under REQ-7 serve as this deliverable.

### REQ-17: Screenshots

"Screenshots of your report, output or UI."

Status: FULFILLED

The screenshots/ directory contains 12 screenshots captured via scripts/capture_screenshots.py (Playwright):

| File | Content |
|---|---|
| screenshot_1.png | Pipeline run — all 6 steps completing |
| screenshot_2.png | Test suite — 95 passed |
| screenshot_3.png | Executive Summary with KPIs |
| screenshot_4.png | Key Findings narrative |
| screenshot_5.png | Validation Results — Issues by Check bar chart |
| screenshot_6.png | Financial Reconciliation (MEDREIMB/BENRES/PPPYMT rows) |
| screenshot_7.png | System Comparison — 110 checks table |
| screenshot_8.png | Year-over-Year Trends — Plotly charts |
| screenshot_9.png | Documentation Hub landing page |
| screenshot_10.png | Schema Explorer ERD |
| screenshot_11.png | SQL Explorer with query results |
| screenshot_12.png | Parquet Viewer with data preview |

### REQ-18: README.md

"A README.md that describes your project and how to run it."

Status: FULFILLED

README.md (395 lines) includes:
- Project description
- Quick Start (Docker + local)
- Web UI instructions
- Data setup instructions with download links
- Architecture overview (components, dependencies)
- Input/output descriptions
- Validation checks reference
- CLI usage
- AWS cloud deployment instructions
- Project structure tree

### REQ-19: FEEDBACK.md

"A FEEDBACK.md that includes: Feedback on the assignment, Duration, How expertise/skillsets fit, Anything to help better evaluate candidates."

Status: FULFILLED

| Section | Status |
|---|---|
| Duration | ✅ "Approximately 6 hours coding and another hour for writing and taking screenshots" |
| Approach | ✅ 5 numbered points explaining key decisions |
| How My Expertise Fits | ✅ USDS-relevant context with production pipeline thinking, domain awareness, communication focus, cloud-ready architecture |
| Skills Demonstrated | ✅ 6 categories (data engineering, SQL, Python, data analysis, communication, infrastructure) |
| Assessment Feedback | ✅ Constructive feedback on scope and fairness |
| Suggestions | ✅ Scoring rubric and multi-sample testing suggestions |

## Summary of Gaps

| # | Gap | Severity | Status |
|---|-----|----------|--------|
|  | No open gaps |  | All requirements fulfilled ✅ |