Schema drift killed our pipeline — three contract tests that catch it
A vendor flipped a JSON field from integer to string. No deprecation notice, no version bump, no email. The scraper happily kept ingesting. Fourteen hours later a downstream report read "42" where it expected 42, the dashboard rendered nonsense, and a paying customer noticed before our pager did.
That incident cost about half a day of triage and one apology email. The fix was thirty lines of contract-tests against the upstream response, run on every scheduled job. This post is the pattern.
What schema drift actually looks like
Schema drift is when the shape of data changes without the meaning changing — at least in the producer’s mind. From their side, “we just changed how the API serializes IDs.” From the consumer’s side, every assumption downstream is now wrong.
Three drift modes I’ve seen in the last year of running scrapers and ETL jobs against twenty-plus third-party data sources:
- Type flip.
42→"42", ortrue→1, ornull→"". Schema-validators that only check field presence pass. Anything that does math, comparison, or hashing fails silently. - Field rename or relocation.
created_atbecomescreatedAt. Or it moves from the top level into ametaobject. The old field disappears or becomes a phantom of itself. - Cardinality flip. A field that used to be a single object becomes an array of one. Code that did
record.author.namenow needsrecord.author[0].name.
None of these will trigger an HTTP error. None will appear in your APM dashboards. The job runs, the rows insert, the metrics look healthy. The data is wrong.
Why “validate against a JSON schema” isn’t enough
The first instinct is to write a JSON Schema and validate every payload. I’ve done it. It catches obvious breakage — a field that disappears entirely, a wrong top-level type — but it misses the subtle drift that actually causes outages.
Two reasons.
First, a JSON Schema is usually too permissive on purpose. If you wrote {"id": {"type": ["string", "integer"]}} because the upstream is inconsistent, you’ve already lost the type-flip detection.
Second, JSON Schema tells you whether the document conforms, not whether the document means the same thing as last week. Drift is a temporal property. Validators are stateless.
The fix is to add three checks that are explicitly comparative.
The three contract tests
Below is the rough shape of each in Python, pseudocode-level. The actual implementation lives in a contract_tests.py next to the scraper.
Test 1: Type stability per field path
For every leaf field in the response, record (path, python_type) and compare to the previous run’s recording. Mismatches are alerted, not auto-corrected.
import json, hashlib
from pathlib import Path
def collect_types(obj, prefix=""):
if isinstance(obj, dict):
for k, v in obj.items():
yield from collect_types(v, f"{prefix}.{k}")
elif isinstance(obj, list):
for v in obj[:1]: # sample first item
yield from collect_types(v, f"{prefix}[]")
else:
yield (prefix, type(obj).__name__)
def check_types(payload, snapshot_path):
current = dict(collect_types(payload))
if not Path(snapshot_path).exists():
Path(snapshot_path).write_text(json.dumps(current, indent=2))
return []
previous = json.loads(Path(snapshot_path).read_text())
return [
(path, previous[path], current[path])
for path in current
if path in previous and previous[path] != current[path]
]
The first run records the snapshot. Every subsequent run compares. If a vendor flips id from int to str, you get a single-line diff: ('.results[].id', 'int', 'str'). Reviewing a typo-sized diff is a thirty-second job.
Test 2: Field presence floor
Vendors love to “clean up” optional fields. The field stays in the docs, the value stops appearing in 2% of records, and three weeks later your aggregate count is silently wrong.
def check_presence(records, required_fields, threshold=0.95):
if not records:
return []
failures = []
for field in required_fields:
present = sum(1 for r in records if field in r and r[field] is not None)
ratio = present / len(records)
if ratio < threshold:
failures.append((field, ratio))
return failures
Set the threshold per field based on what’s actually true today, not what the docs claim. If phone_number shows up in 12% of records right now, set the floor to 10% and let it alert when it drops to 5%.
Test 3: Distribution sanity
The most subtle drift is the one where types and presence are unchanged but the values shift. A vendor expands their dataset to a new region, suddenly 30% of country_code values are "BR" where it was 0% yesterday.
from collections import Counter
def check_distribution(records, field, baseline_top, max_new_share=0.20):
current = Counter(r.get(field) for r in records if r.get(field))
if not current:
return []
total = sum(current.values())
new_keys = set(current) - set(baseline_top)
new_share = sum(current[k] for k in new_keys) / total
if new_share > max_new_share:
return [(field, new_share, list(new_keys)[:5])]
return []
The baseline is just “top 20 values from last week’s run, persisted as a JSON file.” The check fires when the share of not-in-baseline values exceeds 20%. That’s a heuristic, not a law — adjust per field, but don’t skip it for any field your business logic branches on.
How to wire this into a scheduled job
Two practical notes on putting this in production.
First: run the tests before the data lands in your warehouse. If the contract test fails, you want to halt the load, not roll back inserted rows. In Apify Actors I run them in the actor itself before pushing to the dataset; in Airflow DAGs they sit as a task between extract and load.
Second: don’t auto-correct. If a type flips, halt and page. Auto-coercion is how you end up with "42" parsed as the integer 42 in three places and as the string "42" in two — same field, two semantics, eternal pain. The whole point of the test is that a human looks at the diff before any data moves.
After we wired these three tests into the scraper that runs against the trustpilot endpoint, schema-related incidents on that pipeline went from “every other month” to zero in the last 90 days. Same code, same vendor, same volume — just the addition of a snapshot-and-compare step.
The cost is a couple of small JSON files persisted between runs and one extra DAG task. The benefit is that a vendor’s silent change announces itself before it eats your weekend.
If you’ve fought your own version of this — how do you catch it? Would happily compare notes. spinov001@gmail.com.
More postmortems on web scraping, schemas, and Apify production patterns → https://t.me/scraping_ai