What Healthcare Engineers Know About Building Reliable Systems That Web Engineers Don’t

This is going to sound combative. It isn’t meant to be. Web engineers, broadly, are excellent at the work they do. The systems they build operate at scales and with feature velocities that healthcare engineers, broadly, don’t approach. The web engineering tradition has produced some of the most impressive software infrastructure in computing history.

The thing healthcare engineers know that web engineers tend to underestimate isn’t about technical skill. It’s about a specific category of failure mode and the reliability discipline that comes from designing around it. The discipline is invisible most of the time. It only becomes visible when the failure mode would have manifested.

I want to lay out what I think the difference actually is, with examples, in the spirit of cross-discipline learning. Healthcare engineers can learn things from web engineers; most healthcare codebases I’ve worked on would benefit from a serious dose of modern operational practice. The reverse is also true, and the reverse is what’s less often discussed.

The frame: silent failure as the worst kind

Web engineering has a strong “fail loud” tradition. Errors get raised. Stack traces get logged. Sentry pages somebody. A 5xx response gets returned. The standard is that the system either works or fails in a way that’s visible.

Healthcare engineering takes that further. The standard isn’t just “fail loud.” The standard is “fail loud, fail safe, and never fail silent.”

The distinction matters. A web service that returns a 500 error is failing loud; the user sees an error, the engineer sees a stack trace, the system gets corrected. A web service that returns a 200 with the wrong data is failing silent; the user thinks they got what they asked for, and the wrongness might not be discovered until much later, if at all.

In typical web applications, silent failures are usually annoying or expensive. The wrong recommendation. The wrong total in the cart.
The stale cached data that shows up when fresh data should have. The user notices, files a support ticket, and the issue gets resolved.

In healthcare applications, silent failures are sometimes much worse. The wrong medication in a chart. The missed allergy alert. The lab result that didn’t make it to the chart before the next clinical decision. The audit log entry that wasn’t recorded for an access that did happen. These don’t generate support tickets in the same way. They generate adverse outcomes.

The healthcare engineering discipline is biased toward catching silent failures even when the cost is friction. Synchronous confirmation where async would be faster. Strict validation where lenient would be more permissive. Acknowledgments at every step where fire-and-forget would be cheaper. The bias produces slower, more conservative systems. It also produces systems that fail loudly when they fail, which is the property that matters most.

A Kafka queue filled up during a high-volume period. The API continued returning 200s (correctly, from its perspective: it had handed the message off to Kafka). What it hadn’t done was verify that Kafka was processing. The queue was backed up days deep. Downstream systems weren’t receiving the data. No alerts fired. No errors surfaced. Everything looked healthy from the outside because the entry point was healthy. The failure was silent for days until someone noticed the downstream system was stale. The fix was straightforward. The harder conversation was about what “success” means when you’re returning 200 before the work is done.

The artifact: the audit log as a primary concern

Web engineering produces audit logs. They’re usually treated as a security and compliance concern, separate from the primary application logic.

Healthcare engineering treats the audit log as a primary application concern. Every PHI access is supposed to be logged. The log entry is part of the operation, not a side effect of it.
If the log write fails, the operation might need to fail too, depending on the regulatory framework.

This changes the design. The transaction boundary around a data access expands to include the audit write. The reliability of the audit log has to be at least as high as the reliability of the data access. The audit log infrastructure has to be queryable in ways that ad-hoc audit infrastructure usually isn’t.

The web engineer’s instinct, when they see the audit log requirements, is often to fire the audit event into a queue and let it be processed asynchronously. The healthcare engineer’s instinct is to ask what happens when the queue is down. If the answer is “we lose audit events,” the design isn’t acceptable. The audit log has to be durable, ordered, and complete, at the same level as the financial transaction log in a banking system.

I’ve watched the asymmetric instinct play out repeatedly. A web engineer joins a healthcare project, looks at the audit log infrastructure, and proposes simplifying it by moving to async. The healthcare engineers explain why the current synchronous design exists. The web engineer is initially skeptical; the synchronous design is more expensive, both in latency and in operational complexity. After working through specific failure scenarios, the design usually stays synchronous.

This isn’t because the web engineer is wrong. They’re applying patterns that work in their context. The patterns just don’t transfer cleanly to a context where audit completeness is a regulatory requirement.

A new team member, looking to reduce latency in a high-volume clinical workflow, proposed deferring audit writes: completing all the clinical events first and writing the audit entries afterward. The logic was reasonable on its face: audit is overhead, do it last, do it async. The problem was that the audit entries weren’t just logs; they were timestamped records of who did what and when, tied to the specific moment each event occurred.
Deferring them broke that fidelity. If the process failed between the clinical event and the deferred audit write, the event happened but was never recorded. We weren’t auditing after the fact; we were auditing in real time or not at all. The conversation took a while. The design stayed synchronous.

The reliability target: data integrity over availability

Web engineering is usually optimized for availability. The site has to be up. The user has to get a response. The trade-off is sometimes that the response is approximate or stale.

Healthcare engineering is usually optimized for data integrity. The site can be down briefly; that’s recoverable. Wrong data is harder to recover from. The trade-off goes the other direction.

This is the principle behind a lot of the architectural choices that look slow or conservative to a web engineer. The choice to do synchronous processing of medication orders rather than optimistic-update-and-async-confirm. The choice to refuse requests rather than degrade gracefully when a downstream system is unavailable. The choice to fail a clinical workflow rather than risk an inconsistent state.

The web engineering instinct in each case is the opposite. Optimistic updates because users hate waiting. Graceful degradation because errors are user-hostile. Continue-on-error because partial functionality beats no functionality. These instincts produce better web applications. They produce worse healthcare applications, because the failure modes they accept are unacceptable in this domain.

Engineers moving from web to healthcare have to learn this. The instinct that’s served them well has to be re-evaluated. Most manage it. Some struggle, particularly if they’ve internalized the web engineering culture as universally correct rather than as appropriate to a specific context.

The schema: integrity over flexibility

Web engineering, particularly in the modern stack, tends toward schema flexibility. JSON documents. Schemaless databases. Optional fields.
Forward- and backward-compatible APIs.

Healthcare engineering tends toward schema rigidity, partly because the data has clinical semantic content that has to be preserved exactly. A lab result has units, a reference range, a method, a timestamp, a specimen source. Changing the schema isn’t just a data format change; it’s a clinical data change.

The schema rigidity produces slower iteration. Changing the data model is harder. New features that require schema changes are a multi-team negotiation. The web engineering instinct of “ship the schema change, migrate later” doesn’t work when the schema represents clinical content.

This isn’t always positive. Some healthcare systems are over-rigid; the schema becomes a bottleneck for everything. The healthy version is more rigid than web is and less rigid than legacy clinical systems are. A schema that’s debated, validated, and committed to with care, but that’s not impossible to change.

We modified a field in the patient model. It broke a downstream student health system. The system had stored procedures with static references to the old field structure: undocumented, unowned, invisible until they failed in production. No one on the current team knew they existed; the engineer who wrote them was long gone. The schema change took a day. Untangling the dependency nobody had written down took considerably longer.

The deployment: smaller, slower, more careful

Web engineering has produced extraordinary deployment infrastructure. Continuous deployment. Feature flags. Progressive rollouts. Rollback in seconds. Big tech web engineering teams deploy hundreds of times per day.

Healthcare engineering deploys less. The reasons are partly cultural (healthcare is risk-averse), partly regulatory (some clinical systems require validated builds), and partly technical (clinical workflows are sensitive to changes in ways consumer workflows aren’t).
The web engineer arriving at a healthcare project sometimes wants to introduce continuous deployment. The instinct is right; healthcare deployment is often slower and more painful than it needs to be. The instinct is also wrong if applied uncritically; some of the slowness is appropriate to the risk.

The version that works is faster than legacy healthcare deployment but slower than typical web deployment. Multiple deployments per week, maybe per day, with progressive rollouts. Feature flags for client-facing changes. But also: change advisory for clinically significant changes. Validation runs for changes that affect data. Rollback plans that are tested.

The web engineering tradition’s contribution here is real. It’s pulled healthcare deployment toward modernity. The healthcare engineering tradition’s contribution is the discipline that says “but not for everything; some changes warrant slower process.”

The testing: edge cases as primary

Web engineering typically writes unit tests for happy paths and a sample of error paths. The coverage target is usually 70-80%. The tests run in seconds. The test data is synthetic and small.

Healthcare engineering tends to test more thoroughly because the edge cases are clinically significant. A lab result with a value outside the normal reference range isn’t an edge case; it’s the case that matters most. A patient with an unusual demographic combination (an infant with a complex name structure, a patient with a placeholder DOB, a patient with multiple insurance coverages) isn’t a curiosity; it’s something the system encounters daily.

The test data has to capture this variety. The test suite has to validate against known-difficult cases. Production-like load tests have to use production-shape data, not just production-volume data. This is harder, slower, and more expensive than typical web engineering testing. It catches things that the typical approach misses.

The test suite was thorough by standard measures.
What it didn’t cover was the international patient: someone visiting the US whose medical history was scarce or entirely absent. Every code path had been written under the implicit assumption that a patient record was reasonably complete: prior diagnoses, medication history, allergy records, previous encounters. For domestic patients, that assumption held often enough that nobody questioned it. For international patients, the record was frequently near empty. The code didn’t fail loudly. It made decisions (flagging, filtering, calculating risk) on incomplete data without signaling that the data was incomplete. The assumption of completeness was never written down as an assumption. It was just baked in. We found it when a clinician flagged an unexpected result for a patient who had no US medical history. The fix required both code changes and test data that reflected the variety of patients the system would encounter in production.

What I’d suggest cross-pollinating

For web engineers entering healthcare:

Learn to think about silent failure modes. Audit logs that aren’t written. Messages that don’t make it. Data that’s almost-right. The web engineering tradition doesn’t emphasize these as much.

Learn to be suspicious of optimistic updates and async confirmation patterns in clinical paths. They’re fine for some operations and dangerous for others. The discrimination is the work.

Learn to read the regulations directly. Don’t rely on a summary from somebody who didn’t read them either. The text is more specific than people think, and the specificity is where the engineering work lives.

Learn to value data integrity over availability when they conflict. The web engineering instinct is the opposite, and that instinct is correct for web applications. It’s not correct here.

For healthcare engineers learning from web:

Modern operational practice is mostly underutilized in healthcare.
Continuous deployment, feature flags, progressive rollouts, observability: all of these can be applied carefully to healthcare workflows. Most legacy healthcare systems have none of them.

The web engineering tradition’s investment in tooling, process, and operational rigor is a model worth emulating. The “keep it working” muscle is real and worth importing.

Schema flexibility for non-clinical data is fine. Audit log fields, operational metadata, reporting attributes: these don’t need the same rigidity as clinical content.

Faster feedback loops between detection and response are achievable in healthcare contexts. The legacy healthcare patterns of “we’ll fix it next quarter” are not necessary; they’re cultural inheritance.

The honest answer: web engineering needs healthcare caution more than healthcare needs web engineering discipline. Healthcare’s operational gaps are visible and fixable: the tooling exists, the patterns exist, the path is clear. What web engineering is missing is harder to import because it requires accepting a different definition of success. A system that returns fast and fails silently is not a successful system. A system that’s slow because it’s verifying, confirming, and auditing every step is not a poorly engineered system. That reframing doesn’t come naturally to engineers who’ve been rewarded for speed and scale. The healthcare engineering instinct, that correctness is non-negotiable and that silent failure is the worst kind, is the thing I’d most want web engineers to carry back with them. It doesn’t slow you down as much as it sounds like it does. It just makes you think differently about what done means.

What I’d want a web engineer to know about working in this space

If a web engineer is moving into healthcare engineering, the things I’d want them to know:

The work is real engineering. It’s not “boring corporate software” or “easier than web scale.” The constraints are different, not lesser.

The conservatism isn’t paranoia.
It’s a reasoned response to a failure mode that’s worse than typical web failure modes.

The legacy systems aren’t always stupid. They’ve been validated against decades of edge cases. Replacing them is harder than it looks.

The clinicians know things you don’t. The most useful conversations are the ones where you listen for longer than you talk.

The compliance work is engineering work. It’s not a checkbox somebody else takes care of. If you’re not engaging with it, somebody else is doing it badly on your behalf.

The reward is the kind of impact that’s harder to measure than ad CTR or user retention. The work matters in ways that are sometimes invisible but real. People who care about that find healthcare engineering meaningful in a way that web engineering rarely is.

The cross-pollination between web engineering and healthcare engineering goes both ways. Web has things to teach healthcare about velocity, observability, and modern operational practice. Healthcare has things to teach web about silent failure, audit completeness, data integrity, and the discipline of building systems that have to be right rather than fast.

The engineers who can hold both traditions in mind, and apply each where appropriate, are the ones I’ve found most valuable to work with. The ones who treat their original tradition as universally correct tend to underperform in the new context. The skill is contextual judgment. The judgment is the part that takes years to develop. The traditions are starting points, not endpoints.

Healthcare systems are frustrating to work on. The legacy debt is real. The constraints are real. The pace is slower than it needs to be. I’ve sat in rooms debating schema changes for weeks and watched simple features take months. None of that is romantic. What cuts through it is simpler than any of that: somewhere, a radiologist using a tool I helped build caught something on a scan. An oncologist got a result faster than they would have otherwise.
A patient got into treatment sooner. I don’t know which patient. I don’t get to know. But the possibility that it has happened, statistically, across the volume of cases these systems touch, is enough. It’s a different kind of motivation than shipping a feature and watching the metrics move. It’s quieter. It lasts longer.

View original source — Hacker Noon ↗
