Apr 30, 2026

Dead features in your own code: a self-audit story from my Apify actor

Two features that were documented but didn’t exist: a self-audit story from my own Apify actor

I run 31 published actors on Apify. One of them — email-extractor-pro — has a small, mostly happy user base (around 60 production runs at the time of writing). Last week I sat down to do a full README-vs-code audit on it, expecting a routine pass.

I found two features that the README documented, the INPUT_SCHEMA exposed in the UI, and the code… silently ignored.

This is a postmortem of that audit and the generalizable lesson behind it: documentation drift is real even in tiny code bases, and the only way to catch it is to read the code line-by-line as if you’d never seen it before.

The actor in 90 seconds

email-extractor-pro is a small crawler that walks a domain (or a list of domains), scrapes pages, and pulls out:

email addresses (regex + mailto: parse)
phone numbers (regex)
social profile URLs (LinkedIn, Twitter/X, GitHub, etc.)

It’s built on top of Crawlee with a Cheerio crawler. The user provides:

urls — list of starting URLs
maxPagesPerDomain — page budget per domain
maxConcurrency — global concurrency cap
deduplicateEmails — boolean, “deduplicate the output rows by email”
and a few other knobs

The README confidently described two features:

“Optional email deduplication” — toggle deduplicateEmails: false to allow duplicates.
“Priority routing” — pages like /contact, /about, /team, /imprint, /legal are processed first, ahead of generic content pages.

Neither was true. Here’s how I caught both.

Bug #1: `deduplicateEmails` was destructured and then ignored

The audit started by reading main.js top-to-bottom. Right at the top of the entrypoint:

const input = await Actor.getInput();
const {
    urls,
    maxPagesPerDomain,
    maxConcurrency,
    deduplicateEmails,   // ← read from input
    extractPhones,
    extractSocials,
} = input;

So far, fine. Then I Cmd-F’d for deduplicateEmails in the rest of the file.

One result. The destructure line. Nothing else.

The actual dedup logic looked like this:

const seen = new Map();   // key: email lowercased, value: first record
for (const record of records) {
    const key = record.email.toLowerCase();
    if (!seen.has(key)) {
        seen.set(key, record);
    }
}
const deduped = [...seen.values()];
await Actor.pushData(deduped);

Notice the absence of if (deduplicateEmails) anywhere. The Map-based dedup runs unconditionally. Setting deduplicateEmails: false in the actor input has zero effect on the output. You will always get a deduplicated result.

This is a classic dead input. The schema exposes a knob in the UI, the README promises behavior, but the code doesn’t read the variable beyond destructuring it. Type-checkers and linters happily allow this — the variable is “used” in the destructure expression even if never referenced again.

The fix (a one-liner) is:

const records = collectAllRecords(crawl);
const output = deduplicateEmails ? dedupByEmail(records) : records;
await Actor.pushData(output);

But a more important lesson: a feature exists only when both the schema AND the code agree on what it does. README is not enough. UI exposure is not enough. The function or branch has to actually run.

Bug #2: “Priority routing” was syntactically real but functionally dead

This one was sneakier. The README said:

Pages like /contact, /about, /team, /imprint, /legal are crawled first to reduce average time-to-first-email.

And the code had what looked like the right implementation:

const PRIORITY_PATHS = ['/contact', '/about', '/team', '/imprint', '/legal'];

const isPriorityUrl = (url) => {
    const path = new URL(url).pathname.toLowerCase();
    return PRIORITY_PATHS.some(p => path.endsWith(p) || path.includes(p + '/'));
};

const crawler = new CheerioCrawler({
    transformRequestFunction: (request) => {
        if (isPriorityUrl(request.url)) {
            request.userData.priority = 1;   // ← "high priority"
        }
        return request;
    },
    // ...
});

That looks like a “tag the request, queue picks the high-priority one first” pattern. It’s not.

Here’s the part the README author (me) missed: Crawlee’s RequestQueue does not consult userData.priority when picking the next request. userData is a free-form bag for user code — the queue itself uses request order (FIFO) and the forefront flag.

The relevant Crawlee API for “process this URL before the others already in the queue” is the forefront: true parameter on addRequests (or addRequest). Setting userData.priority = 1 on a request just attaches a property to the request; the queue ignores it.

So in practice the crawler walks pages in roughly the order they’re discovered by the in-page link extractor — which on most sites means home → top-nav links → footer links → deeper pages. /contact and /about get processed when the crawler reaches them organically, not earlier.

This is a “dead mechanism”. Each individual line is correct JavaScript. The constant array is real. The check function works. The userData assignment runs. None of it changes the queue order.

The fix is to add discovered priority links with forefront: true:

async requestHandler({ request, $, enqueueLinks }) {
    // discover priority links first, push to forefront
    const priorityLinks = $('a')
        .map((_, el) => $(el).attr('href'))
        .get()
        .filter(href => href && isPriorityUrl(href));

    await enqueueLinks({
        urls: priorityLinks,
        forefront: true,
    });

    // discover everything else, append to back of queue
    await enqueueLinks({
        selector: 'a',
    });

    // ... extract emails from current page
}

Now the queue actually front-loads /contact and /about pages, exactly as the README has been promising for months.

Why does this happen at all?

Both bugs share the same underlying cause: the code grew in two layers, and the README was written from intent rather than from observed behavior.

The first version of the actor had no dedup and no priority routing. Both were added later — but added piecemeal. The dedup branch got coded, then someone (me, six months ago) refactored the output pipeline and accidentally pulled the dedup out of its if (deduplicateEmails) branch. The priority routing was added when I read a Crawlee tutorial that mentioned userData and assumed it interacted with queue order — it doesn’t, and I never wrote a test that would have caught the difference.

The README, meanwhile, was written from the design doc — what the features were supposed to do — not from a fresh code read.

That gap can survive for a long time on a quietly-running actor with a happy user base. Nobody filed a bug, because:

The dedup case: users got dedup’d output by default, so setting deduplicateEmails: true “worked” (it was the always-on default they weren’t aware of).
The priority case: users got their emails eventually anyway; the order pages were crawled in didn’t change the final result, just the wall-clock time-to-first-email — which nobody had a baseline for.

Silent bugs in features users don’t notice are still bugs. If you advertise a feature, it has to exist.

The audit checklist I now run on every actor

After this exercise I added a small checklist that I run on every README before I publish updates:

Grep every input field name across the source tree. If the field appears only in the destructure line, it’s a dead input. Either wire it up or remove it from the schema.
Read every “first” / “before” / “priority” claim against the actual queue/scheduling code. If the README says “X happens first,” there has to be a forefront: true, a sort, or an explicit ordering call. Mental models are not implementation.
Reverse-test the schema. For every schema field, ask “if I flip this to its non-default value, what should change?” Then run the actor twice and check it actually changed.
Treat README and INPUT_SCHEMA as a single artifact. They lie together or they tell the truth together. Audit them in the same pass.

Doing this on email-extractor-pro surfaced the two dead features above. Doing it on the next 12 actors I audited surfaced ~40 smaller drifts (wrong default values, fields that silently capped at lower limits, retry counts the README claimed but Crawlee defaults overrode, etc.).

The broader takeaway

If you’re shipping anything with an input schema — Apify actors, Cloud Functions, Lambda, Cloudflare Workers, anything that exposes a “configure me” UI — your schema is a public API. Users read it as a contract. Documentation drift turns it into a lie that costs you trust the next time someone tries to use a knob that doesn’t work.

The audit takes maybe 30 minutes per actor. The cost of not doing it is the moment a paying customer realizes a feature you advertised doesn’t exist.

I run 31 published Apify actors at apify.com/knotless_cadence (78 total in portfolio). The flagship Trustpilot scraper has 949 lifetime production runs. I write code-heavy postmortems and case studies — recent example: a paid 3-article series for a client in the proxy industry ($150).

Need a custom scraper or an audit of an existing one? Pilot pricing $100/article or $150 for a 3-article series. Email spinov001@gmail.com.

More tips on web scraping, APIs, and Apify development → https://t.me/scraping_ai