How I Engineered a 1.7M Row Product Safety Database to Serve Pages in 66ms on a Single Free VM

Let me tell you what happens when you shove 1.7 million dense PostgreSQL rows onto a 6GB ARM instance and then try to serve real user traffic. Sequential table scans that choke the server for 45 seconds. COUNT(*) queries that make the query planner weep. Multi-language LLM translation costs that compound into a budget black hole before your first hundred users hit the site. That's where LemonKnows started — a consumer safety and food intelligence platform tracking 1.7M products, 41,000 global government recalls, and 100,000+ community complaints, all running on Oracle Cloud Free Tier with zero recurring hosting cost. Today, user-facing pages assemble in 66ms. Here is the exact engineering playbook that got it there. \ The Problem With COUNT(*) If you are running SELECT COUNT(*) on tables with millions of records to power front-end pagination or dashboard metrics, you are actively killing your database. PostgreSQL handles MVCC (Multi-Version Concurrency Control) by physically scanning rows to check visibility for the active transaction. On a 1.7M row dataset, this obliterates the query planner. Under load, a single COUNT(*) call was stalling the server for up to 45 seconds. The standard advice is to use pg_class estimates: SELECT reltuples::bigint FROM pg_class WHERE relname = 'products'; Faster, but useless the moment you add WHERE clauses. pg_class doesn't know about your filtered subsets. The actual fix: I stopped asking Postgres to count things at query time entirely. I built a lean stats_cache lookup table — a flat key-value store of pre-aggregated counters. Every time a background ingestion pipeline runs or a product's safety status changes, the backend increments or decrements the relevant counter asynchronously. Front-end requests execute a 2ms primary key lookup instead of triggering a multi-second scan: -- The killer: SELECT COUNT(*) FROM products WHERE category = 'automotive' AND status = 'active'; -- The replacement: SELECT total_count FROM stats_cache WHERE key = 'products_automotive_active' LIMIT 1; The tradeoff is eventual consistency — counts lag slightly behind real-time state. On a consumer safety platform, that's an acceptable contract. The recall data itself is always accurate. The surrounding aggregates don't need to be. \ Why Single-Column Indexes Are Mostly Theater Single-column indexing is practically worthless when your queries filter by category, verify a regulatory safety flag, and sort by a reliability score simultaneously. This was the index setup that was quietly destroying performance: CREATE INDEX idx_products_category ON products (category); CREATE INDEX idx_products_status ON products (status); CREATE INDEX idx_products_score ON products (score); Three indexes. None of them useful for the actual query pattern hitting production: SELECT * FROM products WHERE category = 'automotive' AND status = 'active' ORDER BY score DESCLIMIT 20; Postgres would pick one index, apply it, then filter the rest in memory — effectively falling back to a partial sequential scan on a massive intermediate result set. The fix: drop the single-column noise entirely and deploy one highly specific composite index tailored directly to the actual filter execution order: CREATE INDEX idx_products_cat_status_score ON products (category, status, score DESC); Column order is not cosmetic. It must mirror the exact sequence of the WHERE and ORDER BY clauses. When it does, the query planner bypasses sequential scans completely in favor of Index Condition lookups — the difference between scanning 1.7M rows and scanning the exact 20 you need. Building a Zero-Cost Multi-Language AI Translation Layer \ LemonKnows serves data across 7 languages — English, French, German, Spanish, Italian, Portuguese, and Dutch. Wiring an LLM API directly into the request stream to handle translations dynamically introduces two immediate dealbreakers: compounding network latency and an unpredictable API bill that scales with traffic. The architecture I built instead treats localization as a content-addressed cache layer. The data flow: An international request arrives for a product defect breakdown in German The Node.js backend generates a SHA-256 hash of the raw English source text It executes a lookup on ugc_translations where hash = SHA256 and lang = 'de' Cache hit: localized text returned from local storage. API cost: $0. Lookup time: sub-5ms Cache miss: string is piped via background worker to Gemini 2.5 Flash Lite, translated with regulatory jargon preservation, written to the local DB cache, and returned to the user The key design constraint: the background worker runs asynchronously. The user never waits for a live LLM call. On a cache miss, they get the translation on this request. Every subsequent request for that string — from any user, in any country, including search engine crawlers — costs nothing and returns in under 5ms. A specific dataset gets translated exactly once. After that it's free to serve forever. The UGC Image Pipeline and Database-Level Abuse Prevention Managing 1.7M+ products means dealing with massive content fragmentation — specifically missing product imagery. Manually sourcing photos for a million items is impossible for a solo operator, so I engineered a crowdsourced UGC loop with abuse prevention built at the database layer rather than the application layer. The core constraint lives in the schema: -- Hard limit: one pending suggestion per user, per product ALTER TABLE product_image_suggestions ADD CONSTRAINT unique_user_product UNIQUE (user_id, product_id); The database engine itself enforces the limit. No application-level rate limiting logic to maintain, no race conditions to handle, no way to bypass it with concurrent requests. The constraint is atomic. Submissions enter an isolated pending state and are audited via a lean admin dashboard before touching production data. A clean PATCH /products/image-suggestions/:id with { "status": "approved" } triggers the migration. Core production data stays completely untainted by unverified community input until explicit approval. What the Numbers Actually Look Like 1.7M product records across all categories \n 41,000+ global government recall records \n 100,000+ community safety complaints \n 7 languages served \n 1 Oracle Cloud Free Tier VM (6GB ARM) \n $0 recurring hosting cost \n 66ms average page assembly time None of this required exotic infrastructure. It required being precise about where real-time accuracy actually matters versus where eventual consistency is a perfectly rational engineering choice. The database doesn't need to count your rows in real time. Your indexes need to match your actual query patterns, not your intuitions about which columns matter. And your LLM API calls should happen exactly once per unique input, then never again. What I'd Do Differently The stats_cache approach works but requires disciplined counter maintenance across every ingestion pipeline. A missed decrement on a status change creates silent drift that's hard to debug. I'd consider a scheduled reconciliation job that periodically recomputes counters from source truth and patches any drift — something I haven't built yet but probably should. On the translation layer, the SHA-256 hash approach breaks if the source English text changes by even a single character — it generates a new cache miss and retranslates. Smarter fuzzy matching or a semantic similarity threshold would reduce redundant translation calls on minor content edits. LemonKnows is live at lemonknows.com . If you have thoughts on the architecture, a better approach to any of these problems, or just want to tell me what I got wrong — the comments are open. \ \ \ \

View original source — Hacker Noon ↗

ShareShare on X Share on Facebook