Provenance & auditing¶
Tango ingests data from multiple upstream systems (SAM.gov, USAspending, FPDS, agency feeds, etc.). Those sources overlap, disagree, and update on different cadences.
To keep the dataset consistent (and debuggable), Tango tracks two related concepts:
- Field provenance: “Which source last updated this specific field, and when?”
- Change logs (CDC): “Which fields changed on this row, by what operation (INSERT/UPDATE), and from what source?”
Even though we do not currently expose provenance or changelog tables in the public API, this tracking matters to API consumers because it helps us:
- Explain behavior: why a value changed (or why it didn’t).
- Prevent regressions: avoid overwriting authoritative sources with lower-quality sources.
- Improve reliability: tighten guarantees around “data is ready / stable” moments for downstream consumers.
Terminology¶
- CDC (Change Data Capture): a pattern for recording row-level changes (insert/update/delete) as data is ingested.
change_source: a normalized identifier of the ingest source (e.g."sam","dsbs","usaspending").changed_fields: the list of field names that were changed by an operation.- For UPDATE, this is a true diff.
- For INSERT, this is a best-effort list of fields set by the ingest path.
batch_id/etl_job_id: identifiers used to correlate changes to a specific loader run.
What models have this (so far)¶
Today, Tango tracks provenance/auditing for:
- Organizations (
agencies.Organization) - CDC changelog:
agencies.OrganizationChangeLog - Source/field priority rules:
agencies.OrganizationSourceFieldAuthority - Entities (
entities.Entity) - CDC changelog:
entities.EntityChangeLog - Source/field priority rules:
entities.EntitySourceFieldAuthority - Entity relationships (
entities.EntityRelationship) - CDC changelog:
entities.EntityRelationshipChangeLog
How it works (high level)¶
At ingest time, Tango typically writes incoming data into a staging representation, computes a diff against the current table, records the audit log, and then applies the update.
flowchart LR
A[Upstream source file/API] --> B[Loader parses + normalizes]
B --> C[(Staging / temp representation)]
C --> D[Compute diff vs target]
D --> E[(ChangeLog rows)]
D --> F[(Field provenance updates)]
C --> G[Apply upsert/update to target] Why this is useful¶
- Multiple sources, different trust levels: provenance lets us define “who wins” per field (and why).
- Better debugging: we can query “what changed the name for UEI X?” or “why did this org’s code change?”
- Future-facing: this lays groundwork for exposing provenance in a controlled way (e.g. support tooling, exports, or opt-in API surfaces).
Example questions we can answer¶
- “What last updated an Organization’s
name(and from which source)?” - “What last updated
legal_business_namefor an entity UEI?” - “Which fields changed on organization FH key 12345 in the last sync?”
- “Did loader X actually change data, or was it a no-op?”