I Reverse-Engineered a Government Portal to Save My Father Days of Manual Work

From family errand to engineering case study: how a data engineer implemented dynamic, user-friendly filters to efficiently navigate a website that stores its data in flat static files rather than a live database.

Introduction

My father needed to find a name. Not on Google. Not in a spreadsheet. It was somewhere inside a 23-year-old electoral roll: a set of official voter lists published as PDFs on the government website.

The website provided a search filter, and my dad had four of the five values he needed. Because the final filter value was unknown, the only manual option left was to click through every combination, open each PDF, and look for the name. We were looking at opening 305 separate multi-page PDFs and reading through them line by line. It was truly an exhausting task.

I'm a data engineer, and I usually use AI as an advanced enabler for my day job. My world is normally built around enterprise pipelines, data curation, and analytics. This wasn't that kind of problem. It was smaller. But more personal.

Instead of writing complex custom code from scratch, I turned to AI as my partner. I used it to quickly script the browser automation, parse the local-language text layers, and scan through those hundreds of pages. What would have been a multi-day manual struggle took just 2 to 3 hours of effort.

The moment my dad saw the result meant more to me than any production release delivering massive business value or customer delight ever has. All of a sudden, I felt more valuable, and I realized that data engineering truly makes life easier and more convenient, especially at a time when content creation is at an all-time high.

This is the story of how I solved it.

The Challenge: Manual Search in a Heap of PDFs

The portal looks simple enough.
It gives a form:

- Filter 1: Year
- Filter 2: District
- Filter 3: Assembly Constituency
- Filter 4: Part Number ← the unknown

After picking the first three filters, look at the last dropdown: 305 options, each one a different polling station. Select one, click View, wait for a PDF, scroll through scanned voter entries, close it, and repeat. Another hurdle: the language you search in and the language of the PDFs are different.

My father had been doing this manually for a long time. Open. Search. Close. Next part. Hope. Repeat. It is the kind of task that feels like it should take five minutes and quietly steals evenings for weeks.

The emotional weight isn't technical. It's watching someone you love do repetitive, fragile work because the system wasn't built for the question they're actually asking: "Is this person on the roll?"

What a Data Engineer Actually Does (When Nobody's Watching)

People ask me what data engineering is. I usually say something about reliably moving data from A to B. The honest answer is closer to this:

- Find where the data lives.
- Figure out how to reach it.
- Automate the boring parts.
- Curate the data for downstream consumption.

World-scale problems (climate models, fraud detection, healthcare analytics) all start with the same instinct my dad needed: there has to be a better way than doing this by hand.

Government data is often public but not friendly. PDFs behind forms. Dropdowns that depend on other dropdowns. Bot protection that treats scripts like intruders.

The "world problem" in our living room was just one name in one constituency in one year. But the shape of the problem was the same:

- Discovery: Where is the list of parts?
- Extraction: How do I get the PDFs without clicking 305 times?
- Search: How do I find a needle in 305 haystacks?
- Trust: Can my father believe the result?

That's data engineering. Not always Spark clusters. Sometimes it's Python, Chrome, and a father waiting in the next room.
The Solution: What I Built

I ended up with two small scripts (~200 lines total) and a lot of failed experiments behind them. The final design is a hybrid pipeline: use the API where the server allows it, use a real browser where it doesn't.

    ┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
    │ ECI REST API    │────▶│ Playwright +     │────▶│ Local PDF       │
    │ (part list)     │     │ Chrome (download)│     │ files (305)     │
    └─────────────────┘     └──────────────────┘     └────────┬────────┘
                                                              │
                                                              ▼
                                                     ┌─────────────────┐
                                                     │ PyMuPDF search  │
                                                     │ (name matching) │
                                                     └─────────────────┘

Stack: Python 3, requests, Playwright, PyMuPDF (fitz)

Step 1: Reverse-engineering the portal

The page at <government site> is a React single-page app. There are no plain HTML <select> dropdowns; the form uses react-select comboboxes with hidden inputs:

| Visible control | Hidden field | Example value |
|----|----|----|
| State | stateCd | S04 |
| District | district | 26 |
| Assembly | constituency | 189 |
| Part | partNoAndName | 1 |

My first instinct was browser-only scraping. Then I downloaded the site's minified JavaScript bundle (main.*.js, ~1.7 MB) and searched it for API path strings. That's where I found references to < gateway-****> and endpoints under /api/v1/citizen/sir/.

The breakthrough endpoint:

    GET https:<>
    Header: state: S04
    Accept: application/json

This returns JSON like:

    {
      "status": "Success",
      "payload": [
        {
          "partNumber": 1,
          "partName": "<>",
          "acNumber": 189,
          "distNo": 26,
          "oldPdfUrl": "<>"
        }
      ]
    }

One HTTP call. 305 parts. Each with a local-language polling-station name and a direct PDF URL. Interestingly, although the API exposes oldPdfUrl, I still couldn't bulk-download the PDFs with requests.
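To make the part-enumeration step concrete, here is a minimal sketch of how that response could be turned into a sorted list of parts. It assumes the payload shape shown above. The names `Part`, `parse_parts`, and `fetch_parts` are hypothetical and not taken from the original scripts, and because the endpoint URL is elided in this article, the sketch takes it as a parameter rather than guessing it.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Part:
    """One polling-station part from the API payload (hypothetical container)."""
    number: int
    name: str
    pdf_url: str


def parse_parts(response_text: str) -> List[Part]:
    """Parse the JSON body shown above into Part records, sorted by part number."""
    data = json.loads(response_text)
    if data.get("status") != "Success":
        raise ValueError(f"unexpected status: {data.get('status')!r}")
    parts = [
        Part(number=item["partNumber"], name=item["partName"], pdf_url=item["oldPdfUrl"])
        for item in data["payload"]
    ]
    return sorted(parts, key=lambda part: part.number)


def fetch_parts(url: str, state_code: str = "S04") -> List[Part]:
    """Call the part-list endpoint; `url` must be supplied by the caller (elided here)."""
    import requests  # imported lazily so parse_parts works without the dependency

    response = requests.get(
        url,
        headers={"state": state_code, "Accept": "application/json"},
        timeout=30,
    )
    response.raise_for_status()
    return parse_parts(response.text)
```

Keeping the parsing separate from the HTTP call means the parsing logic can be checked against a saved response without touching the portal.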
Step 2: Downloading PDFs through a real browser

What failed first

| Approach | Result |
|----|----|
| requests.get(oldPdfUrl) | 403 Forbidden |
| requests + Referer + cookies from portal | 403 on batch requests |
| Playwright headless Chromium | 403 (HTML error page, not PDF) |
| Playwright headless Chrome (--headless=new) | Same 403 |
| In-page fetch() from JavaScript | CORS / network blocked |

What worked

Playwright driving visible Chrome (channel="chrome"), simulating exactly what my father did manually:

1. Open the portal
2. Select State → District → Assembly (once)
3. For each part: open the Part dropdown → pick option → click View
4. Capture the PDF bytes before they disappear into a new tab

The View button is button[type='submit']. Clicking it opens the PDF in a new browser tab, with no confirmation modal in our case.

For react-select, standard [role='option'] selectors didn't work. The working selector was:

    page.locator("[role='combobox']").nth(3).click()  # open Part dropdown
    page.locator("[id*='-option-']").nth(part_number - 1).click()  # pick part

The dropdowns are cascading: District options only load after State is selected, AC after District, Parts after AC. Each step needs a wait (~1–3 seconds) for the next API call to populate options.

The PDF capture trick: route interception

Instead of trying to download from the new tab's URL (also blocked outside the browser session), we intercept the network response:

    latest_pdf = {}  # holds the body of the most recent eci-backend response

    def capture_pdf_route(route):
        # Let unrelated requests pass through untouched.
        if "eci-backend" not in route.request.url:
            route.continue_()
            return
        # Perform the request inside the browser session and keep the bytes.
        response = route.fetch()
        latest_pdf["body"] = response.body()
        # Hand the same response back to the page so it behaves normally.
        route.fulfill(response=response)

    context.route("**/eci-backend/**", capture_pdf_route)

When View is clicked, Playwright catches the PDF response from eci-backend, stores the raw bytes, and validates them:

    if not body.startswith(b"%PDF"):
        raise RuntimeError("PDF download blocked")

Valid PDF magic bytes. Not an HTML error page disguised as success.
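The magic-byte check above could be wrapped into a small save helper along these lines. This is a sketch under assumptions: the names `is_pdf`, `safe_filename`, and `save_pdf` are hypothetical, and the rule for cleaning the part name is my own guess. Only the `%PDF` check, the error message, and the `part_NNN_` prefix come from the article. The cleaning step only replaces characters that are unsafe in file names, so local-language names are kept intact.

```python
import re
from pathlib import Path

PDF_MAGIC = b"%PDF"  # every valid PDF file starts with these bytes


def is_pdf(body: bytes) -> bool:
    """Return True when the captured bytes start with the PDF magic number."""
    return body.startswith(PDF_MAGIC)


def safe_filename(part_number: int, part_name: str) -> str:
    """Build a readable name such as part_041_<name>.pdf (cleaning rule is an assumption)."""
    # Collapse whitespace and path-unsafe characters into single underscores.
    slug = re.sub(r'[\\/:*?"<>|\s]+', "_", part_name).strip("_")
    return f"part_{part_number:03d}_{slug}.pdf"


def save_pdf(body: bytes, part_number: int, part_name: str, out_dir: Path) -> Path:
    """Validate the captured bytes, then write them under out_dir."""
    if not is_pdf(body):
        # An HTML error page served in place of the PDF ends up here.
        raise RuntimeError("PDF download blocked")
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / safe_filename(part_number, part_name)
    path.write_bytes(body)
    return path
```

Validating before writing means a blocked download never leaves a file on disk that a later "skip existing files" check would mistake for a good PDF.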
Files are saved with readable names: part_041_<>.pdf

Reliability features

- manifest.json: tracks every part as downloaded, skipped, or failed (we can resume with --start 304)
- Skip existing files: re-runs don't re-download

Total runtime for the full constituency: ~31 minutes.

Step 3: Searching 300+ PDFs for a name

Once PDFs are local, the problem is pure data engineering: extract → transform → load (filter).

    import fitz  # PyMuPDF

    doc = fitz.open(pdf_path)
    # Concatenate the text layer of every page into one string.
    text = "\n".join(page.get_text() for page in doc)

Each voter row in the extracted text contains fields such as name, relative (husband/father), age, serial number, and EPIC reference, all concatenated into a single string.

    KEYWORDS = [
        "xxx", "xxx",      # English
        "yyy", "yyy",      # local language
        "xx xx", "xx xx",  # two-word variant (for filtering)
    ]

Text is normalized (lowercased, whitespace collapsed) before regex matching. Each match is reported with its surrounding context, for example:

    ... 1 <first name> <last name> <address> 1283.0 f 41 189 41 ...

That context is how we distinguished (<>, one name) from ( , two words, likely a different person).

Results go to search_results_<name>.json with file path, keyword matched, and context snippet. Scanning 305 PDFs took under a minute. We found <> in 4 parts across the constituency, each with a different husband's name and age, so my father could pick the right record.

How the Solution Helped

When we ran the search, we found <> in multiple parts, with enough detail (husband's name, age, serial number) to identify the right entry.

My father had been looking manually. I scanned 305 PDFs programmatically in under a minute.

I watched his face when the matches appeared on screen. That look, surprise mixed with relief, is something no Jira ticket has ever given me.

He wasn't surprised because Python is magic. He was surprised because the thing that had been eating his time for so long suddenly wasn't impossible anymore.

The data was always there. We just stopped asking a human to be the loop.

What Worked vs. What Didn't

Worked:

- Public API for part enumeration
- Browser automation with real Chrome
- Intercepting PDF responses in Playwright
- Local PDF text search with Hindi + English keywords

Didn't work:

- Plain HTTP downloads (blocked)
- Headless automation (blocked)
- Guessing undocumented POST APIs from minified JavaScript
- Assuming every "<>" match is the same person

If you're learning from this: failure is most of the job. We tried the elegant solutions first. The working solution was slightly ugly and completely effective. That's normal.

The Part That Doesn't Go in the README

My father spent a long time on this. Not because he lacked intelligence or effort, but because the system was designed to look up one part, not to search for a name across a constituency.

I helped him the way I know how: by treating his problem like data. Where does the list live? Can I automate retrieval? Can I search at scale? Can I show him the answer clearly?

When it worked, I wasn't proud because I wrote clever code. I was proud because I gave my dad his time back, and because he saw, maybe for the first time, what his son actually does for a living.

Data engineering isn't always about petabytes. Sometimes it's about one name. One family. One evening, when your parent stops struggling and starts smiling.

That's the world problem I solved this week. And I'll remember it longer than any pipeline I ship this quarter.

Closing Thought

Public data should be accessible. Until it is, engineers will keep writing scripts in living rooms: for parents, for communities, for anyone stuck clicking "View" for the 200th time.

If you have a similar story, I'd love to hear it.

A Father's Delight is incomparable to Customer Delight.

A data engineer who finally got to impress his dad

Tags: #Python #DataEngineering #WebScraping #Automation #ElectionData #Family

View original source — Hacker Noon ↗
