5 Apify dataset deduplication patterns that stop double-billing your customers
I run 32 public Apify actors (79 total in the portfolio). The Trustpilot review scraper has 968 lifetime runs across 3 paying users; the Reddit discussion scraper has 98; the email extractor has 145. The single most expensive bug class I’ve fixed in the last six months isn’t crashes or rate limits — it’s silent duplication. A run finishes, the dataset has 4,000 rows, the customer pays, and 800 of those rows are the same record scraped twice through different URLs.
Duplicates corrupt analytics. They double-bill anyone on a pay-per-result tier. And on Apify they cost real compute units, because dedup-after-the-fact still pays the storage round-trip. Below are the five patterns I now bake into every actor before pushing the first row to a dataset.
1. Use uniqueKey on the request queue — never trust URL hashes
If you call requestQueue.addRequest({ url }) without a uniqueKey, the queue derives one from the URL string. As far as I can tell, the default normalization only covers the easy cases (query parameter order, the #fragment, utm_* parameters). That works for static sites and breaks the moment a session token, an affiliate ID, or a ?ref= lands in the wild. Two requests to /product/123?ref=email and /product/123?ref=twitter will both fire, both hit the dataset, both bill the customer.
import { Actor } from 'apify';

await Actor.init();
const requestQueue = await Actor.openRequestQueue();

const url = 'https://example.com/product/123?ref=email';
const canonicalId = new URL(url).pathname; // '/product/123'

await requestQueue.addRequest({
    url,
    uniqueKey: canonicalId, // dedup on path, not full URL
});
Setting uniqueKey explicitly tells Apify “two requests with this same key are the same request.” The framework drops duplicates at enqueue time — they never run, they never bill. On the Trustpilot scraper this single change cut duplicate review rows from 6% of output to under 0.1%.
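Using the bare pathname works for product pages, but it also drops the host and every query parameter, so /search?page=2 and /search?page=3 would collapse into one request. A minimal sketch of a slightly safer key, assuming a hypothetical canonicalKey helper and a blocklist of tracking parameters that you would tune per site (neither is part of the Apify SDK):

```javascript
// Hypothetical helper: build a uniqueKey that ignores tracking noise
// but keeps parameters that change the page content.
const TRACKING_PARAMS = new Set(['ref', 'fbclid', 'gclid', 'sessionid']); // assumption: tune per site

function canonicalKey(rawUrl) {
    const url = new URL(rawUrl);
    const kept = [...url.searchParams.entries()]
        .filter(([name]) => {
            const lower = name.toLowerCase();
            return !lower.startsWith('utm_') && !TRACKING_PARAMS.has(lower);
        })
        // Sort so ?a=1&b=2 and ?b=2&a=1 produce the same key
        .sort(([a, av], [b, bv]) => (a === b ? av.localeCompare(bv) : a.localeCompare(b)));
    const query = new URLSearchParams(kept).toString();
    // Drop a trailing slash so /product/123/ and /product/123 match
    const path = url.pathname.replace(/\/+$/, '') || '/';
    // The fragment is never included, so #reviews does not create a new key
    return url.hostname + path + (query ? `?${query}` : '');
}
```

In an actor this would be passed as uniqueKey: canonicalKey(url) in the addRequest call. The blocklist is the part to get right: strip too little and duplicates return, strip too much and distinct pages get merged.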
2. Maintain a Set in actor state — not in memory
Memory dedup looks correct in development and fails in production. As soon as your actor has more than ~50,000 items, the heap balloons; if the run gets restarted (Apify migrates instances under load), the in-memory Set dies and you start emitting duplicates of everything you already pushed.
The fix is to persist the dedup set in the actor’s key-value store and reload it every time the actor process starts, including after a migration:
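const store = await Actor.openKeyValueStore();
// Reload whatever an earlier (possibly migrated) process already persisted
const seen = new Set((await store.getValue('SEEN_IDS')) ?? []);

// Flush the set before the platform moves the run to another server
Actor.on('migrating', async () => {
    await store.setValue('SEEN_IDS', [...seen]);
});

async function pushIfNew(item) {
    if (seen.has(item.id)) return;
    seen.add(item.id);
    await Actor.pushData(item);
    // Persist every 100 items so a crash doesn't lose dedup state
    if (seen.size % 100 === 0) {
        await store.setValue('SEEN_IDS', [...seen]);
    }
}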
The 951 → 954 delta on the Trustpilot scraper this week is partly because runs that previously crashed mid-batch and restarted were re-emitting reviews they’d already pushed. Persisting SEEN_IDS made restarts idempotent.
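To check that restarts really are idempotent, it helps to keep the bookkeeping separate from the storage calls. A minimal sketch, assuming a hypothetical SeenTracker class (not part of the Apify SDK) whose snapshot you would pass to store.setValue('SEEN_IDS', ...) yourself:

```javascript
// Hypothetical helper: dedup bookkeeping with no I/O, so a restart can be simulated locally.
class SeenTracker {
    constructor(initialIds = [], flushEvery = 100) {
        this.seen = new Set(initialIds);
        this.flushEvery = flushEvery;
        this.sinceFlush = 0;
    }

    // Returns true the first time an id is seen, false on every repeat.
    markIfNew(id) {
        if (this.seen.has(id)) return false;
        this.seen.add(id);
        this.sinceFlush += 1;
        return true;
    }

    // True once enough new ids have accumulated to justify a write.
    shouldFlush() {
        return this.sinceFlush >= this.flushEvery;
    }

    // Serializable copy of the set; resets the flush counter.
    snapshot() {
        this.sinceFlush = 0;
        return [...this.seen];
    }
}

// Simulated run: three items, flushing after every two new ids.
const firstRun = new SeenTracker([], 2);
let persisted = [];
for (const id of ['r1', 'r2', 'r1']) {
    if (firstRun.markIfNew(id) && firstRun.shouldFlush()) {
        persisted = firstRun.snapshot(); // in an actor: await store.setValue('SEEN_IDS', persisted)
    }
}

// Simulated restart: the new process reloads what was persisted.
const restarted = new SeenTracker(persisted, 2);
const emittedAfterRestart = ['r1', 'r2', 'r3'].filter((id) => restarted.markIfNew(id));
```

After the simulated restart only r3 is emitted. Anything added after the last snapshot would be emitted again, so the flush interval bounds the worst-case number of duplicates per restart.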
3. Hash the content, not the metadata
Some pages serve identical content under different identifiers — A/B test variants, mirror domains, paginated views that overlap on the boundary row. URL-level dedup misses all of these. The dataset has 4,000 rows; 200 of them are the exact same review under different review IDs because the merchant has a quirky CMS.
Hash the canonical content instead:
import crypto from 'node:crypto';

function contentHash(item) {
    // Whitelist the fields that define identity
    const canonical = JSON.stringify({
        author: item.author?.trim().toLowerCase(),
        date: item.date,
        text: item.text?.trim().slice(0, 200), // first 200 chars
    });
    return crypto.createHash('sha1').update(canonical).digest('hex');
}

// `seen` is the persisted Set from pattern 2
async function pushIfNewContent(review) {
    const hash = contentHash(review);
    if (seen.has(hash)) return;
    seen.add(hash);
    await Actor.pushData({ ...review, _contentHash: hash });
}
The trick is the field whitelist. Don’t hash everything — scrapedAt, source_page, request_id will differ on every run and defeat the dedup. Pick the fields that define “this is the same record” semantically.
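The same review can also arrive with different whitespace or casing depending on which template rendered it, and that alone changes the hash. A sketch of a stricter normalization step, assuming a hypothetical canonicalFields helper and the same author, date and text fields as above. Its output is what would go into JSON.stringify before hashing:

```javascript
// Hypothetical helper: normalize the identity fields before hashing.
function normalizeText(value) {
    if (typeof value !== 'string') return '';
    // Collapse runs of whitespace (including non-breaking spaces) and trim
    return value.replace(/\s+/g, ' ').trim();
}

function canonicalFields(item) {
    return {
        author: normalizeText(item.author).toLowerCase(),
        // Assumption: the calendar day (YYYY-MM-DD prefix) is enough to identify a review
        date: typeof item.date === 'string' ? item.date.slice(0, 10) : '',
        text: normalizeText(item.text).slice(0, 200),
    };
}

const a = canonicalFields({
    author: '  Jane Doe ',
    date: '2024-03-01T10:15:00Z',
    text: 'Great\u00a0service,\n fast delivery.',
    scrapedAt: '2024-06-01T00:00:00Z', // volatile field, ignored by the whitelist
});
const b = canonicalFields({
    author: 'jane doe',
    date: '2024-03-01',
    text: 'Great service, fast delivery.',
    scrapedAt: '2024-06-08T00:00:00Z',
});

// Identical canonical strings mean identical hashes downstream
const sameRecord = JSON.stringify(a) === JSON.stringify(b);
```

Truncating the date to the day is a judgment call: it merges records whose timestamps differ only in precision, at the cost of also merging two identical texts posted by the same author on the same day.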
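4. Accept a previousRunId input for cross-run dedup
When customers re-run an actor weekly, they often want only new data, not the same 4,000 reviews they already paid for last week. The naïve solution is “let them filter post-download,” which means they pay for the same results every week. Instead, add an optional previousRunId field to the actor’s input (it is a custom field you define, not something Apify provides) and skip anything that run already delivered: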
const input = (await Actor.getInput()) ?? {}; // getInput() resolves to null when no input is set
const previousIds = new Set();

if (input.previousRunId) {
    // Fetch only the `id` field of the previous run's default dataset
    const prevDataset = await Actor.apifyClient
        .run(input.previousRunId)
        .dataset()
        .listItems({ fields: ['id'] });
    for (const item of prevDataset.items) {
        previousIds.add(item.id);
    }
}

// ... during scraping, inside the request handler ...
if (previousIds.has(item.id)) return; // already delivered last run
This pattern cut one Trustpilot customer’s monthly Apify bill by 64% — they were re-running weekly to pick up new reviews and getting the entire history back every time. With previousRunId the actor only emits the delta. They didn’t ask for this feature; I shipped it because the alternative was them eventually noticing the bill and churning.
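A single listItems call pulls every ID in one response, which gets heavy for large histories. A sketch of a paginated variant, assuming a hypothetical collectPreviousIds helper that takes the list function as a parameter (in an actor that would be something like (opts) => Actor.apifyClient.run(input.previousRunId).dataset().listItems(opts)). The page size of 1000 is an arbitrary choice:

```javascript
// Hypothetical helper: page through a previous run's dataset and collect ids.
async function collectPreviousIds(listItems, pageSize = 1000) {
    const ids = new Set();
    let offset = 0;
    while (true) {
        const page = await listItems({ offset, limit: pageSize, fields: ['id'] });
        for (const item of page.items) {
            if (item.id !== undefined) ids.add(item.id);
        }
        offset += page.items.length;
        // Stop on an empty page or once the reported total is reached
        if (page.items.length === 0 || offset >= page.total) break;
    }
    return ids;
}

// Only emit items the previous run did not already deliver.
function selectDelta(items, previousIds) {
    return items.filter((item) => !previousIds.has(item.id));
}
```

Injecting the list function keeps the helper testable without network access, and it lets the same code read from a named dataset instead of a run if that suits the customer better.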
5. Final-pass Map collapse before output — the safety net
The first four patterns prevent most duplicates at enqueue or push time. Pattern five catches whatever still slipped through. Collapse the dataset through a Map keyed by your dedup field before the final write:
// Count fields that carry a value; 0 and false still count as populated
const populated = (item) =>
    Object.values(item).filter((v) => v != null && v !== '').length;

async function finalizeDataset() {
    const dataset = await Actor.openDataset();
    const { items } = await dataset.getData();
    const collapsed = new Map();
    for (const item of items) {
        const key = item._contentHash || item.id;
        // Keep the version with the most populated fields
        const existing = collapsed.get(key);
        if (!existing || populated(item) > populated(existing)) {
            collapsed.set(key, item);
        }
    }
    // Write the collapsed dataset to a separate named output
    const outputDataset = await Actor.openDataset('CLEAN_OUTPUT');
    await outputDataset.pushData([...collapsed.values()]);
}
The “keep the most populated version” tie-breaker matters more than it looks. Two scrapes of the same review may differ only in whether the optional merchantReply field was populated — keeping the richer record means downstream consumers don’t see fields randomly disappearing on re-runs.
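The tie-breaker is easy to check on two scrapes of the same review. A sketch, assuming a hypothetical collapseByKey function that mirrors the collapse loop but runs on a plain array, and that treats only null, undefined and empty strings as unpopulated:

```javascript
// Hypothetical pure version of the final-pass collapse, for local testing.
function countPopulated(item) {
    return Object.values(item).filter((v) => v !== null && v !== undefined && v !== '').length;
}

function collapseByKey(items) {
    const collapsed = new Map();
    for (const item of items) {
        const key = item._contentHash || item.id;
        const existing = collapsed.get(key);
        // Strictly greater: on a tie the first-seen version wins
        if (!existing || countPopulated(item) > countPopulated(existing)) {
            collapsed.set(key, item);
        }
    }
    return [...collapsed.values()];
}

// Sample data is made up for illustration
const scrapes = [
    { id: 'rev-1', rating: 4, text: 'Slow shipping', merchantReply: null },
    { id: 'rev-1', rating: 4, text: 'Slow shipping', merchantReply: 'Sorry about that!' },
    { id: 'rev-2', rating: 0, text: 'Never arrived', merchantReply: '' },
];
const clean = collapseByKey(scrapes);
```

Here clean holds two records, and the surviving rev-1 is the one that carries the merchant reply. A rating of 0 still counts as a populated field, which a plain truthiness check would miss.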
What this is worth in a real bill
For the Trustpilot scraper at 954 lifetime runs, patterns 1, 2, and 5 alone reduced output volume by ~12% with zero data loss. On any pay-per-result pricing tier that’s a 12% margin recovery for the same delivered value. On compute-tier pricing it’s lower — but the bigger win there is correctness: customers stop catching us with bad data.
I built the Trustpilot review scraper, the Reddit discussion scraper, and the email extractor using exactly this dedup playbook. If you have an Apify actor where the dataset feels “almost clean but occasionally weird,” the missing piece is almost certainly pattern 1 or pattern 3.
Need a custom Apify actor or a clean-up audit on an existing one? Pilot pricing: $100 for one actor or $150 for a 3-actor bundle (audit + dedup-pattern retrofit + run-log instrumentation, delivered in 7 days). Email spinov001@gmail.com with the actor URL — I’ll quote within 24 hours. Recently delivered a paid 3-article series for Proxy-Seller (April 2026, $150) — Article #1 live on Dev.to (2,320 words, sponsored, full disclosure).
More tactical Apify writeups → @scraping_ai