<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"><channel><title>Spinov · Web Scraping &amp; AI Research</title><description>Practical case studies, code-heavy tutorials, and production-grade Apify actors for data extraction at scale. Author: Aleksei Spinov.</description><link>https://blog.spinov.online/</link><item><title>3 Telegram Channels Worth Following for Production Data Engineering</title><link>https://blog.spinov.online/blog/3-telegram-channels-production-data-engineering/</link><guid isPermaLink="true">https://blog.spinov.online/blog/3-telegram-channels-production-data-engineering/</guid><description>Three Telegram channels (@dataeng, @apache_airflow, @bigdata_en) I keep reading for distributed-systems patterns, orchestration depth, and downstream feedback that improves real production scrapers.</description><pubDate>Mon, 11 May 2026 00:00:00 GMT</pubDate></item><item><title>5 Apify run-log patterns that make production debugging 10x faster</title><link>https://blog.spinov.online/blog/5-apify-run-log-patterns-faster-debugging/</link><guid isPermaLink="true">https://blog.spinov.online/blog/5-apify-run-log-patterns-faster-debugging/</guid><description>Five Apify-actor logging patterns I use across 78 production scrapers — tag-prefix, structured retries, soft-block detection, dedup checkpointing, summary-line — turning 30-min log greps into 30-sec lookups.</description><pubDate>Fri, 01 May 2026 00:00:00 GMT</pubDate></item><item><title>5 Apify Scheduler Mistakes That Quietly Burn Compute Units (And the Cron Fixes)</title><link>https://blog.spinov.online/blog/5-apify-scheduler-mistakes-burning-compute-units/</link><guid isPermaLink="true">https://blog.spinov.online/blog/5-apify-scheduler-mistakes-burning-compute-units/</guid><description>Five scheduler misconfigurations I&apos;ve made or watched customers make on Apify, with exact cron/actor.json fixes and cost-of-mistake math from running 31 published actors.</description><pubDate>Fri, 01 May 2026 00:00:00 GMT</pubDate></item><item><title>5 production scraping failures from 1000+ runs (and the fixes that actually shipped)</title><link>https://blog.spinov.online/blog/5-production-scraping-failures-1k-runs/</link><guid isPermaLink="true">https://blog.spinov.online/blog/5-production-scraping-failures-1k-runs/</guid><description>Real failure modes from 2190 lifetime Apify runs across 32 actors — schema drift caught silently, retry self-DDoS, concurrency WAF traps, memory creep on long runs, silent webhook failures. With code.</description><pubDate>Tue, 12 May 2026 00:00:00 GMT</pubDate></item><item><title>A Budget Brake That Stops a Scraper Before $200</title><link>https://blog.spinov.online/blog/a-budget-brake-stops-your-scraper-before-200/</link><guid isPermaLink="true">https://blog.spinov.online/blog/a-budget-brake-stops-your-scraper-before-200/</guid><description>Spend alerts fire after the money is gone. A budget brake refuses the next run before it spends. Here&apos;s a 40-line preventive fuse, run locally, with the real output, plus where it stops working.</description><pubDate>Wed, 03 Jun 2026 00:00:00 GMT</pubDate></item><item><title>I write production scrapers. AI made 30% of them worse. Here&apos;s the rule of thumb.</title><link>https://blog.spinov.online/blog/ai-for-production-scrapers-rule-of-thumb/</link><guid isPermaLink="true">https://blog.spinov.online/blog/ai-for-production-scrapers-rule-of-thumb/</guid><description>After 1,819 production runs across 32 Apify actors — a practical map of where AI helps with scraper code, where it&apos;s neutral, and the 30% where it quietly breaks production.</description><pubDate>Mon, 11 May 2026 00:00:00 GMT</pubDate></item><item><title>5 Apify dataset deduplication patterns that stop double-billing your customers</title><link>https://blog.spinov.online/blog/apify-dataset-deduplication-patterns/</link><guid isPermaLink="true">https://blog.spinov.online/blog/apify-dataset-deduplication-patterns/</guid><description>Five production patterns to prevent silent dataset duplication on Apify — uniqueKey, content hashing, KV-store guards, and SQL-backed dedup. Real numbers from 968 Trustpilot runs.</description><pubDate>Sun, 17 May 2026 00:00:00 GMT</pubDate></item><item><title>5 Apify run-log patterns that make production debugging 10× faster</title><link>https://blog.spinov.online/blog/apify-run-log-patterns-debugging/</link><guid isPermaLink="true">https://blog.spinov.online/blog/apify-run-log-patterns-debugging/</guid><description>Five log patterns I added to a production Apify scraper after 951 runs: fatal markers, pagination cursors, proxy audit, retry telemetry, run summary.</description><pubDate>Fri, 01 May 2026 00:00:00 GMT</pubDate></item><item><title>5 Apify scheduler mistakes that quietly burn compute units</title><link>https://blog.spinov.online/blog/apify-scheduler-mistakes-cost-cu/</link><guid isPermaLink="true">https://blog.spinov.online/blog/apify-scheduler-mistakes-cost-cu/</guid><description>Five real Apify scheduler misconfigurations from 32-actor portfolio + cron / actor.json fixes + cost-of-mistake math, ordered by cost-impact.</description><pubDate>Fri, 15 May 2026 00:00:00 GMT</pubDate></item><item><title>Apify vs. self-hosted: the three numbers I use to decide</title><link>https://blog.spinov.online/blog/apify-vs-self-hosted-decision/</link><guid isPermaLink="true">https://blog.spinov.online/blog/apify-vs-self-hosted-decision/</guid><description>A decision framework for when to use Apify Store vs build a self-hosted scraper, based on run data from 31 public actors including one at 949 production runs.</description><pubDate>Thu, 30 Apr 2026 00:00:00 GMT</pubDate></item><item><title>5 Apify webhook patterns that turn one-off scrapers into reliable data pipelines</title><link>https://blog.spinov.online/blog/apify-webhook-integration-patterns/</link><guid isPermaLink="true">https://blog.spinov.online/blog/apify-webhook-integration-patterns/</guid><description>Five production-tested Apify webhook patterns from 1818 lifetime runs across 79 actors: signed payloads, idempotency, dead-letter queues, retry budgets, and schema-drift detection.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate></item><item><title>Conditional GET in production scrapers: what I learned wiring it into 3 actors</title><link>https://blog.spinov.online/blog/conditional-get-incremental-scraping/</link><guid isPermaLink="true">https://blog.spinov.online/blog/conditional-get-incremental-scraping/</guid><description>Real numbers from 2,190 lifetime runs: 304 Not Modified saved 32-71% of bandwidth across Trustpilot, exchange-rate and npm-package actors. Code, failure modes, and when to skip it.</description><pubDate>Tue, 19 May 2026 00:00:00 GMT</pubDate></item><item><title>Cost per result: a 4-line worksheet for Apify actors</title><link>https://blog.spinov.online/blog/cost-per-result-apify-worksheet/</link><guid isPermaLink="true">https://blog.spinov.online/blog/cost-per-result-apify-worksheet/</guid><description>What does one record actually cost end-to-end? A simple 4-line worksheet that surfaces hidden costs across every Apify actor in your portfolio.</description><pubDate>Thu, 30 Apr 2026 00:00:00 GMT</pubDate></item><item><title>Dead features in your own code: a self-audit story from my Apify actor</title><link>https://blog.spinov.online/blog/dead-features-in-your-own-code/</link><guid isPermaLink="true">https://blog.spinov.online/blog/dead-features-in-your-own-code/</guid><description>An honest postmortem on finding two README-documented features that didn&apos;t exist in my own production scraper, with the audit script I now run on every actor.</description><pubDate>Thu, 30 Apr 2026 00:00:00 GMT</pubDate></item><item><title>Description drift in serverless function catalogs — a monthly refresh playbook</title><link>https://blog.spinov.online/blog/description-drift-serverless-catalogs/</link><guid isPermaLink="true">https://blog.spinov.online/blog/description-drift-serverless-catalogs/</guid><description>Why function-catalog descriptions go stale within months, and a 30-second monthly refresh playbook with Python code, drawn from a 32-actor Apify portfolio.</description><pubDate>Tue, 12 May 2026 00:00:00 GMT</pubDate></item><item><title>A 30-Line Probe That Tells You If a Page Needs a Browser</title><link>https://blog.spinov.online/blog/does-this-page-need-a-browser/</link><guid isPermaLink="true">https://blog.spinov.online/blog/does-this-page-need-a-browser/</guid><description>Half the &apos;you don&apos;t need a browser&apos; takes on my feed are right and none of them tell you how to check. Here&apos;s a stdlib probe that reads the raw HTTP response and votes NO_BROWSER, JS_REQUIRED, or MAYBE. I ran it on 10 named public URLs; 6 returned their data without Chrome.</description><pubDate>Fri, 05 Jun 2026 00:00:00 GMT</pubDate></item><item><title>DuckDB + dbt: a zero-cost analytics warehouse for projects under 100 GB</title><link>https://blog.spinov.online/blog/duckdb-dbt-zero-cost-analytics/</link><guid isPermaLink="true">https://blog.spinov.online/blog/duckdb-dbt-zero-cost-analytics/</guid><description>Why I run dbt-duckdb on a  VM instead of paying 90/month for Snowflake, with the full repo layout and CI workflow.</description><pubDate>Thu, 30 Apr 2026 00:00:00 GMT</pubDate></item><item><title>I&apos;ve Run 2,190 Production Scrapes — &quot;Ethical&quot; Isn&apos;t a robots.txt Question, It&apos;s a Rate-Limit One</title><link>https://blog.spinov.online/blog/ethical-scraping-is-a-rate-limit-question/</link><guid isPermaLink="true">https://blog.spinov.online/blog/ethical-scraping-is-a-rate-limit-question/</guid><description>Ethics-of-scraping posts argue about robots.txt and ToS. After 2,190 production runs across 32 scrapers, the line between &apos;ethical&apos; and &apos;banned&apos; turned out to be the same line — and it&apos;s drawn by conditional GET and a sane rate limit, not by a checkbox. Here&apos;s the working pattern.</description><pubDate>Mon, 25 May 2026 00:00:00 GMT</pubDate></item><item><title>Spoofing Your Scraper&apos;s Fingerprint Is a Losing Arcade</title><link>https://blog.spinov.online/blog/fingerprint-spoofing-is-a-losing-arcade/</link><guid isPermaLink="true">https://blog.spinov.online/blog/fingerprint-spoofing-is-a-losing-arcade/</guid><description>Spoofing JA3, TLS and header order is a race you lose by design. After 2,190 production scraper runs, the thing that survives is how the run behaves — not how human its fingerprint looks.</description><pubDate>Tue, 02 Jun 2026 00:00:00 GMT</pubDate></item><item><title>Five Apify Input Schema Mistakes And The Fixes That Stuck</title><link>https://blog.spinov.online/blog/five-input-schema-mistakes-and-fixes/</link><guid isPermaLink="true">https://blog.spinov.online/blog/five-input-schema-mistakes-and-fixes/</guid><description>Five real input-schema mistakes I shipped across 78 Apify actors, what each cost in support emails and re-runs, and the exact schema patterns I use now.</description><pubDate>Fri, 01 May 2026 00:00:00 GMT</pubDate></item><item><title>I&apos;ve Run 2,190 Production Scrapes. The Framework You Pick Isn&apos;t What Breaks — Here&apos;s What Actually Does</title><link>https://blog.spinov.online/blog/framework-isnt-what-breaks-your-scraper/</link><guid isPermaLink="true">https://blog.spinov.online/blog/framework-isnt-what-breaks-your-scraper/</guid><description>After 2,190 production scraper runs, the framework almost never decided whether a job lived or died. Three disciplines did: element-targeted waiting (not networkidle), browser memory recycling, and bounded retry with backoff and jitter — with reproducible stdlib code.</description><pubDate>Thu, 28 May 2026 00:00:00 GMT</pubDate></item><item><title>9 Free LLM APIs in 2026 You Can Use Without a Credit Card</title><link>https://blog.spinov.online/blog/free-llm-apis-2026-no-credit-card/</link><guid isPermaLink="true">https://blog.spinov.online/blog/free-llm-apis-2026-no-credit-card/</guid><description>Nine LLM APIs with a genuinely free tier and no credit card in 2026 — limits, OpenAI-compatibility, and which ones survive an extraction workload. Verified May 2026.</description><pubDate>Sun, 31 May 2026 00:00:00 GMT</pubDate></item><item><title>HTTP 200 Is a Lie: A 30-Line Schema Canary for Source Drift</title><link>https://blog.spinov.online/blog/http-200-is-a-lie-schema-canary/</link><guid isPermaLink="true">https://blog.spinov.online/blog/http-200-is-a-lie-schema-canary/</guid><description>Your scraper returns 200, the parser doesn&apos;t crash, and your corpus quietly rots. Across 962 runs on one source, the failure that bit me wasn&apos;t a block — it was the source reshaping its output. Here&apos;s a stdlib schema canary that asserts the shape of the data, not just the response.</description><pubDate>Sat, 30 May 2026 00:00:00 GMT</pubDate></item><item><title>Idempotent webhook receivers in 50 lines of Python</title><link>https://blog.spinov.online/blog/idempotent-webhooks-in-50-lines/</link><guid isPermaLink="true">https://blog.spinov.online/blog/idempotent-webhooks-in-50-lines/</guid><description>Stop losing duplicate Stripe/GitHub/Slack webhooks. A 50-line Python + Postgres pattern that survives retries and crashed workers — code, schema, and a 5-minute reproducible test.</description><pubDate>Thu, 30 Apr 2026 00:00:00 GMT</pubDate></item><item><title>Three memory-leak patterns in long-running scrapers (and how I caught them after 968 Trustpilot runs)</title><link>https://blog.spinov.online/blog/memory-leaks-long-running-scrapers/</link><guid isPermaLink="true">https://blog.spinov.online/blog/memory-leaks-long-running-scrapers/</guid><description>Production scraping memory leaks: BeautifulSoup retention, growing URL queues, and connection-pool exhaustion. Real fixes with measured before/after from 968+ live Trustpilot scraper runs.</description><pubDate>Mon, 18 May 2026 00:00:00 GMT</pubDate></item><item><title>Automate Your Backups with MinIO: Free S3-Compatible Storage for Everything</title><link>https://blog.spinov.online/blog/minio-backup-automation/</link><guid isPermaLink="true">https://blog.spinov.online/blog/minio-backup-automation/</guid><description>Replace paid S3 with self-hosted MinIO. Step-by-step backup automation in Python with 50 lines of code.</description><pubDate>Wed, 29 Apr 2026 00:00:00 GMT</pubDate></item><item><title>Three operational rules I added after my Trustpilot scraper crossed 100 runs</title><link>https://blog.spinov.online/blog/operational-rules-after-100-runs/</link><guid isPermaLink="true">https://blog.spinov.online/blog/operational-rules-after-100-runs/</guid><description>Schema-drift detection, IP-budget enforcement, and golden-diff snapshots — three rules I added to a production Apify actor after 100+ runs revealed where silent failures hide.</description><pubDate>Thu, 30 Apr 2026 00:00:00 GMT</pubDate></item><item><title>Building a Proxy Health Monitor for 24/7 Scraper Uptime</title><link>https://blog.spinov.online/blog/proxy-health-monitor-247-scraper-uptime/</link><guid isPermaLink="true">https://blog.spinov.online/blog/proxy-health-monitor-247-scraper-uptime/</guid><description>A production-ready Python proxy health monitor: detects failures, rotates dead proxies, and alerts you before scrapers go down. Built from a 100K-page incident postmortem.</description><pubDate>Wed, 13 May 2026 00:00:00 GMT</pubDate></item><item><title>Feeding Raw HTML to Your LLM Is a Token Tax. I Measured It on 10 Real Pages — Median 7.4×, and It Hits Every Scheduled Run</title><link>https://blog.spinov.online/blog/raw-html-is-a-token-tax-i-measured-it/</link><guid isPermaLink="true">https://blog.spinov.online/blog/raw-html-is-a-token-tax-i-measured-it/</guid><description>Everyone says &apos;markdown beats HTML for tokens.&apos; Nobody shows a number. I tokenized raw HTML vs extracted text across 10 public pages with tiktoken. Here&apos;s the real multiplier, the 30-line meter, and what it costs when it repeats on a schedule.</description><pubDate>Fri, 29 May 2026 00:00:00 GMT</pubDate></item><item><title>Why your retry logic is broken (and the 30-line fix)</title><link>https://blog.spinov.online/blog/retry-logic-broken-30-line-fix/</link><guid isPermaLink="true">https://blog.spinov.online/blog/retry-logic-broken-30-line-fix/</guid><description>Most Python services retry network calls wrong — fixed delay, wrong errors, no upper bound. Here is the 30-line jitter+deadline pattern I use in production after a 4-minute incident.</description><pubDate>Thu, 30 Apr 2026 00:00:00 GMT</pubDate></item><item><title>Schema drift killed our pipeline — three contract tests that catch it</title><link>https://blog.spinov.online/blog/schema-drift-3-contract-tests/</link><guid isPermaLink="true">https://blog.spinov.online/blog/schema-drift-3-contract-tests/</guid><description>When a vendor silently flips a JSON field type or drops a value, your scraper keeps running and your data lies. Three small contract tests that catch schema drift before it lands in your warehouse.</description><pubDate>Thu, 30 Apr 2026 00:00:00 GMT</pubDate></item><item><title>Scraping All the Text Is the Easy 10%. Keeping the Corpus Worth Training On Is the Other 90% — Notes From 962 Runs</title><link>https://blog.spinov.online/blog/scraping-text-is-the-easy-10-percent-dedup-and-decay/</link><guid isPermaLink="true">https://blog.spinov.online/blog/scraping-text-is-the-easy-10-percent-dedup-and-decay/</guid><description>Getting the text out is the easy 10%. After 962 production scraper runs, the hard 90% is deduplication, re-collection, and decay — with a reproducible stdlib-only corpus deduper.</description><pubDate>Wed, 27 May 2026 00:00:00 GMT</pubDate></item><item><title>Token Bucket vs Exponential Backoff: What Changed After 966 Runs</title><link>https://blog.spinov.online/blog/token-bucket-vs-exponential-backoff/</link><guid isPermaLink="true">https://blog.spinov.online/blog/token-bucket-vs-exponential-backoff/</guid><description>After 966 production runs of the Trustpilot scraper, I replaced exponential backoff with a token bucket in five actors. Code, numbers, and the failure modes that disappeared.</description><pubDate>Fri, 15 May 2026 00:00:00 GMT</pubDate></item><item><title>Token Economics of Agent-Driven Scraping: When LLM Agents Cost 50× More Than a Cron Job</title><link>https://blog.spinov.online/blog/token-economics-agent-driven-scraping/</link><guid isPermaLink="true">https://blog.spinov.online/blog/token-economics-agent-driven-scraping/</guid><description>Six months of production scrapers (970r on a single actor) showed LLM agent loops cost 30-80× more than deterministic crawlers above ~50 pages. Real token math, two narrow agent-win cases, and the fallback-only pattern.</description><pubDate>Mon, 18 May 2026 00:00:00 GMT</pubDate></item><item><title>Traefik + Docker: Zero-Config Reverse Proxy That Discovers Your Containers Automatically</title><link>https://blog.spinov.online/blog/traefik-docker-reverse-proxy/</link><guid isPermaLink="true">https://blog.spinov.online/blog/traefik-docker-reverse-proxy/</guid><description>Traefik watches the Docker socket, auto-discovers new containers, and routes traffic to them based on labels. No config files to edit. No reloads. Just docker compose up and go.</description><pubDate>Wed, 29 Apr 2026 00:00:00 GMT</pubDate></item><item><title>How my Trustpilot scraper survived 949 production runs (and the 3 things that almost killed it)</title><link>https://blog.spinov.online/blog/trustpilot-scraper-949-runs-postmortem/</link><guid isPermaLink="true">https://blog.spinov.online/blog/trustpilot-scraper-949-runs-postmortem/</guid><description>A post-mortem of three production failures from the Trustpilot scraper that crossed 949 runs on Apify Store: silent selector death, IP-block math, and the residential-proxy temptation.</description><pubDate>Wed, 29 Apr 2026 00:00:00 GMT</pubDate></item><item><title>What 250 runs of a Trustpilot scraper taught me about anti-bot patterns</title><link>https://blog.spinov.online/blog/trustpilot-scraper-production/</link><guid isPermaLink="true">https://blog.spinov.online/blog/trustpilot-scraper-production/</guid><description>Real numbers from a public Apify actor: which anti-bot tricks actually mattered, what the error budget looked like, and what I would do differently.</description><pubDate>Sat, 25 Apr 2026 00:00:00 GMT</pubDate></item><item><title>Welcome — what this blog is for</title><link>https://blog.spinov.online/blog/welcome-and-roadmap/</link><guid isPermaLink="true">https://blog.spinov.online/blog/welcome-and-roadmap/</guid><description>A code-first blog about web scraping, data extraction, and AI research, written by someone shipping production scrapers on Apify.</description><pubDate>Mon, 27 Apr 2026 00:00:00 GMT</pubDate></item><item><title>When NOT to scrape: 3 patterns where I now reach for an API instead</title><link>https://blog.spinov.online/blog/when-not-to-scrape/</link><guid isPermaLink="true">https://blog.spinov.online/blog/when-not-to-scrape/</guid><description>After shipping 79 Apify actors, here&apos;s the 60-second decision rule I now apply before writing any selector code.</description><pubDate>Thu, 30 Apr 2026 00:00:00 GMT</pubDate></item><item><title>You Pay for the Bandwidth That Returns Nothing</title><link>https://blog.spinov.online/blog/you-pay-for-the-bandwidth-that-returns-nothing/</link><guid isPermaLink="true">https://blog.spinov.online/blog/you-pay-for-the-bandwidth-that-returns-nothing/</guid><description>A per-GB proxy bill charges you for failed requests and retries too. On one config 53% of the bytes returned zero rows. Here&apos;s a model you can run with your own numbers.</description><pubDate>Thu, 04 Jun 2026 00:00:00 GMT</pubDate></item><item><title>Your Scraper Collected 50 Rows. There Were 4,000.</title><link>https://blog.spinov.online/blog/your-scraper-collected-50-rows-there-were-4000/</link><guid isPermaLink="true">https://blog.spinov.online/blog/your-scraper-collected-50-rows-there-were-4000/</guid><description>A scraper can finish green, return only valid rows, and still hand you a quarter of the dataset. Pagination cutoffs are silent. Here is a 40-line completeness probe that catches them.</description><pubDate>Sun, 07 Jun 2026 00:00:00 GMT</pubDate></item><item><title>Your Scraper Died at Row 12,000. The Rerun Pattern.</title><link>https://blog.spinov.online/blog/your-scraper-died-at-row-12000/</link><guid isPermaLink="true">https://blog.spinov.online/blog/your-scraper-died-at-row-12000/</guid><description>A long scrape that dies three hours in didn&apos;t lose one request — it lost the whole run, and rerunning from zero means paying twice for data you already had. Here&apos;s the ~40-line stdlib pattern that resumes a crashed job, fetches only the missing delta, and writes zero duplicates. Real captured output of a crash and a clean resume.</description><pubDate>Sat, 06 Jun 2026 00:00:00 GMT</pubDate></item><item><title>Your Scraper Got Clean Data. The Site Lied to It.</title><link>https://blog.spinov.online/blog/your-scraper-got-clean-data-the-site-lied/</link><guid isPermaLink="true">https://blog.spinov.online/blog/your-scraper-got-clean-data-the-site-lied/</guid><description>A site can detect your scraper and serve a 200 with a perfect schema and plausible values that are deliberately false. Status codes and sanity checks are blind to it by design. Here&apos;s a 30-line probe that grounds each row to an independent invariant — and why naive cross-source consensus gets fooled too.</description><pubDate>Tue, 09 Jun 2026 00:00:00 GMT</pubDate></item><item><title>Your Scraper Passes Every Run. It&apos;s Still Rotting.</title><link>https://blog.spinov.online/blog/your-scraper-passes-every-run-its-still-rotting/</link><guid isPermaLink="true">https://blog.spinov.online/blog/your-scraper-passes-every-run-its-still-rotting/</guid><description>Your scraper exits 0 on every run. Schema valid, row count plausible. And the yield has been sliding for weeks. A 20-line lagged-baseline probe over your own run log catches the drift before it becomes a breakage.</description><pubDate>Mon, 08 Jun 2026 00:00:00 GMT</pubDate></item><item><title>Your Scraper Re-Downloads Everything. Most Didn&apos;t Change.</title><link>https://blog.spinov.online/blog/your-scraper-re-downloads-everything-most-didnt-change/</link><guid isPermaLink="true">https://blog.spinov.online/blog/your-scraper-re-downloads-everything-most-didnt-change/</guid><description>A scheduled scraper re-downloads its whole corpus every run, even though almost nothing changed since last time. The fix isn&apos;t faster fetching — it&apos;s deciding FETCH/SKIP/CONDITIONAL from a manifest before the first request. A 30-line planner, its real output, and the production trap (weak rotating ETags) that fakes the savings.</description><pubDate>Wed, 10 Jun 2026 00:00:00 GMT</pubDate></item><item><title>Your Scraper Returned a Clean Row. It Was Wrong.</title><link>https://blog.spinov.online/blog/your-scraper-returned-a-clean-row-it-was-wrong/</link><guid isPermaLink="true">https://blog.spinov.online/blog/your-scraper-returned-a-clean-row-it-was-wrong/</guid><description>HTTP was 200, the selectors held, the JSON parsed — and the LLM still returned a plausible, syntactically valid, semantically false value. A 60-line field-level sanity check catches the lie that schema validation can&apos;t see.</description><pubDate>Mon, 01 Jun 2026 00:00:00 GMT</pubDate></item></channel></rss>