
I lead product on a browser agent that fills out job applications for candidates. Where previously the best option for candidates was a browser extension, job boards can now use LLM-driven browser automation to give candidates a better experience. Our product traverses enterprise applicant tracking systems (e.g., Workday, iCIMS, Oracle, Ashby) and submits tens of thousands of qualified applications every month on behalf of real candidates.

If you've ever built software that touches the web, you may be familiar with the truth: everything that can break, will break. Constantly. And it breaks in ways that may be new to you, even on the 1000th run of the same process. This article explains how we stopped throwing human time at that problem and started pointing AI agents at our own AI's failures instead.

The reality of browser automation

Every applicant tracking system (ATS) we add to our coverage brings its own flavor of flakiness: different maintenance windows, dropdowns that change their item order, success states that are impossible to detect reliably, various types of bot detection, and the occasional form that throws an error in its own JavaScript that never reaches our logger.

This quickly results in a queue of unstructured failures. A good chunk may genuinely be blocked and need engineering effort to solve, but most browser issues can be solved with a simple retry. First, though, a human has to open the trace, wait for it to load, diagnose the failure, and come to that conclusion.

Previously, the instinct may have been to hire: throw a working student or two at it. However, this scales linearly with the number of systems and pages you support, and our whole reason for existing is to scale not linearly, but exponentially. If headcount is our only lever for reliability, we don't have a product.

Why an agent introduces new failures

With an LLM-driven system, a new class of error comes into play.
Whereas for a conventional software product it may be sufficient to set up Sentry and fix bugs as they come in, an agent will continue to surprise you with its nondeterministic output.

In a nondeterministic web-driving system, the stack trace is often the least interesting artifact. The same form can fail three different ways in an hour, and the 'error' is a symptom three layers downstream of the actual cause. The error string may be the same (e.g., "Couldn't find option to select in the dropdown"), but the cause could range from a maintenance mode, to the page load stalling, to the dropdown options having actually changed. Each of these calls for a completely different fix. The real work isn't reading the error message, but opening the trace and tracking the steps that led to the failure.

The solution: classify via LLM

In eval writing, there's a concept called "LLM as a judge". Using an LLM as a judge means that you're writing a test for a function you know is not deterministic (for example, an AI chatbot response), and your test uses another LLM call to check whether the chatbot answers the way you expect. For example, your test may be "the chatbot never blames the customer": in the test you try to jailbreak the chatbot into blaming the customer, and then your LLM 'judge' answers the question "did the answer blame the customer?"

We can do the same for our failures: we use another LLM to evaluate the unknown output of the agent. I think of it in two tiers:

Tier 1: an in-process classifier. A single LLM completion fires inside our apply process the moment something fails, while the live page, the form graph, and the error are all still in scope. It writes a structured label to the database:

    {
      "error_cause": "DROPDOWN_NO_MATCH",
      "description": "Country 'Burkina Faso' not present in 49-entry option list",
      "stack_trace": "..."
    }

This costs a fraction of a cent, but it labels everything. It's not trying to fix anything; it's trying to make the next step possible.
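As a rough illustration of the Tier 1 classifier described above, here is a minimal sketch in Python. It is not our production code. The `complete` parameter is a hypothetical stand-in for whatever LLM completion call you use, the prompt wording is invented, and every label other than `DROPDOWN_NO_MATCH` is an assumed example. The point it shows is that the classifier should always produce a valid label, even when the model returns something unparseable.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable

# Hypothetical label set; only DROPDOWN_NO_MATCH appears in the article.
KNOWN_CAUSES = {"DROPDOWN_NO_MATCH", "MAINTENANCE_MODE", "PAGE_LOAD_STALLED", "UNKNOWN"}


@dataclass
class FailureLabel:
    error_cause: str
    description: str
    stack_trace: str


def build_prompt(error_message: str, page_summary: str) -> str:
    """Assemble the classification prompt from what is still in scope."""
    causes = ", ".join(sorted(KNOWN_CAUSES))
    return (
        "Classify this browser-automation failure.\n"
        f"Allowed error_cause values: {causes}\n"
        f"Error: {error_message}\n"
        f"Page state: {page_summary}\n"
        'Reply with JSON: {"error_cause": "...", "description": "..."}'
    )


def classify_failure(
    error_message: str,
    page_summary: str,
    stack_trace: str,
    complete: Callable[[str], str],  # assumed: takes a prompt, returns model text
) -> FailureLabel:
    """Run one completion and coerce its output into a structured label."""
    raw = complete(build_prompt(error_message, page_summary))
    try:
        parsed = json.loads(raw)
        cause = parsed.get("error_cause", "UNKNOWN")
        description = str(parsed.get("description", ""))
    except (json.JSONDecodeError, AttributeError):
        # Not JSON, or JSON that is not an object: still write a label.
        cause, description = "UNKNOWN", "classifier returned unparseable output"
    if not isinstance(cause, str) or cause not in KNOWN_CAUSES:
        cause = "UNKNOWN"
    return FailureLabel(cause, description, stack_trace)


def fake_complete(prompt: str) -> str:
    """Stub used here instead of a real LLM call."""
    return json.dumps({
        "error_cause": "DROPDOWN_NO_MATCH",
        "description": "Country 'Burkina Faso' not present in 49-entry option list",
    })


label = classify_failure(
    "Couldn't find option to select in the dropdown",
    "Country dropdown with 49 options",
    "...",
    fake_complete,
)
print(json.dumps(asdict(label), indent=2))
```

Falling back to `UNKNOWN` instead of raising keeps the classifier from becoming a new failure source inside the apply process, and a closed label set is what makes clustering in the next tier possible.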
Tier 2: the agent loop. We have a cron job that watches the errors flowing in from the previous tier and clusters them. When a cluster crosses a threshold (say, five sibling failures in 24 hours across multiple processes), it spawns a Cursor cloud agent to solve the issue. Spawning an agent for a single failure is never worth the cost. Spawning one for a recurring class of failures is.

Getting the context right

Everyone fixates on the agent. In practice, the agent is the easy 20%, and it gets better with each model release. The other 80% is the context.

We were lucky: we already had access to the Playwright trace when we started building this. The trace contains all of the information we'd ever need, so it was just a question of prompt engineering until our test suite of 30 traces all evaluated to the correct failure diagnosis. The most impactful items ended up being:

- Application logs, so that the agent can trace the steps that your system took
- Network requests, so that the agent can debug failed resource requests or captcha failures
- Screenshots taken before the actions that led to the failure state

Keeping humans on the one-way doors

One principle we have not yet moved past is that the human makes the final decision. How much to automate depends on how sensitive your product is, but even with a human in the loop, the classification of errors and the proposed PRs to fix root causes are already saving us 90% of the time spent. Making the final decision ourselves gives us the security and freedom to explore the bounds of our classification system without fearing that it will suddenly make a bunch of wrong decisions that we can't undo.

We think about every recommended action in terms of blast radius. A retry has a low blast radius and is almost always reversible, so it's a one-click human confirm today and a great candidate for full automation later. A PR that changes our codebase has the potential for a much higher blast radius, so we treat it that way.
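The Tier 2 threshold and the blast-radius gate described above could be sketched roughly as follows. This is an illustration under assumptions, not our actual implementation: clustering by the pair of ATS and error cause, the `Failure` fields, the action names `retry` and `open_pr`, and the approval levels are all hypothetical. The default numbers mirror the "five sibling failures in 24 hours across multiple processes" example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Failure:
    error_cause: str   # the Tier 1 label
    ats: str           # which applicant tracking system it happened on
    process_id: str    # which apply run produced it
    at: datetime       # when it happened


def hot_clusters(failures, now, window=timedelta(hours=24),
                 min_failures=5, min_processes=2):
    """Return the cluster keys that crossed the threshold inside the window."""
    clusters = defaultdict(list)
    for f in failures:
        if timedelta(0) <= now - f.at <= window:
            # Assumed cluster key: same ATS and same classified cause.
            clusters[(f.ats, f.error_cause)].append(f)
    return sorted(
        key for key, members in clusters.items()
        if len(members) >= min_failures
        and len({m.process_id for m in members}) >= min_processes
    )


# Hypothetical action names; the article only mentions retries and PRs.
BLAST_RADIUS = {"retry": "low", "open_pr": "high"}


def required_approval(action: str) -> str:
    """A human always decides; the blast radius sets how heavy the review is."""
    radius = BLAST_RADIUS.get(action, "high")  # unknown actions count as high
    return "one_click_confirm" if radius == "low" else "full_review"


now = datetime(2025, 1, 2, 12, 0)
failures = [
    Failure("DROPDOWN_NO_MATCH", "workday", f"run-{i}", now - timedelta(hours=i))
    for i in range(5)
]
failures.append(Failure("PAGE_LOAD_STALLED", "icims", "run-9", now - timedelta(hours=1)))

for ats, cause in hot_clusters(failures, now):
    print(f"spawn agent for {cause} on {ats}; PR needs {required_approval('open_pr')}")
```

Requiring several distinct processes keeps one bad run that fails five times in a row from triggering an agent, and treating unknown actions as high blast radius keeps the gate conservative by default.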
Before you build this

As mentioned above, classification is the big unlock. You cannot cluster, prioritize, or automate anything until every failure is cheaply labeled at runtime. I highly recommend that you build that first.

Once you have the classifications, don't skip the boring deterministic fixes. Classifying bugs is only useful if you also balance it with the root-cause fixes that lower your overall error rate.

While this back and forth feels incredibly natural now, I can't say we always had the foresight. We spent a full year building an AI that operated software with humans primarily overseeing it, but pointing a second AI at the first one's mistakes was the big unlock.
View original source — Hacker Noon ↗
