Why Vulnerability Reduction Percentages Can Be Misleading

You see this kind of number in security writing all the time. “We reduced vulnerabilities by 80%.” “Critical findings dropped 90%.” “Open Highs reduced by 75% over six months.” The numbers travel through case studies, vendor pitches, conference talks, executive summaries. I’ve been on the inside of programs that produced numbers like this. I’ve also been on the inside of programs that produced these numbers in ways that, looking back, were not entirely honest. The number is real, in some sense; what the number means is the more interesting question.

This article is an honest accounting: what an 80% vulnerability reduction actually looks like, what work produced it, what was excluded from the numerator and denominator, and what the work actually cost. I’m not going to give you a specific organization’s specific number. I’m going to walk through the kinds of decisions that go into producing a number like that, and what each decision implies about how much you should trust the headline.

What the number is measuring

The first question to ask, when somebody quotes a vulnerability reduction percentage, is what’s being measured. There are several plausible denominators:

- Total findings open at the start of the measurement period.
- Total findings reported during the measurement period.
- Findings of a particular severity (Critical, High, Medium).
- Findings in production code (excluding test code, sample code, deprecated modules).
- Findings that survived triage (excluding marked false positives, risk-accepted findings).
- Findings against a particular tool’s output, normalized to that tool’s rule set.

Each denominator produces a different percentage from the same underlying work. An 80% reduction in “Critical findings in production code that survived triage” is a different statement from an 80% reduction in “all open findings across all severity levels.” The first is much harder to achieve and a much stronger signal.
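
To make the denominator point concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the `Finding` record, the `reduction_pct` helper, and the counts are assumptions, not data from any real program. It computes a “reduction” over the same set of findings under two different scopes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    # Hypothetical record; real scanners export far more fields than this.
    severity: str        # "Critical", "High", or "Medium"
    in_production: bool  # False for test, sample, or deprecated code
    triaged_out: bool    # marked false positive or risk-accepted
    open_at_end: bool    # still open when the measurement period ends


def reduction_pct(findings, in_scope):
    """Percent of in-scope findings that are no longer open at period end."""
    scoped = [f for f in findings if in_scope(f)]
    if not scoped:
        return None  # an empty denominator has no meaningful percentage
    closed = sum(1 for f in scoped if not f.open_at_end)
    return 100.0 * closed / len(scoped)


# Invented snapshot: 100 findings, all open at the start of the period.
findings = (
    [Finding("Medium", False, True, False)] * 60      # noise suppressed in triage
    + [Finding("High", True, True, False)] * 20       # risk-accepted, code unchanged
    + [Finding("Critical", True, False, False)] * 4   # actually fixed
    + [Finding("Critical", True, False, True)] * 6    # still open
    + [Finding("High", True, False, False)] * 6       # actually fixed
    + [Finding("High", True, False, True)] * 4        # still open
)

# Scope 1: every finding, every severity.
all_open = reduction_pct(findings, lambda f: True)

# Scope 2: Critical findings in production code that survived triage.
critical_prod = reduction_pct(
    findings,
    lambda f: f.severity == "Critical" and f.in_production and not f.triaged_out,
)

print(f"All findings: {all_open:.0f}% reduction")                      # 90%
print(f"Critical, production, triaged: {critical_prod:.0f}% reduction")  # 40%
```

In this made-up data set the same body of work reads as 90% under one scope and 40% under the other. Neither figure is wrong; they answer different questions.
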
The all-findings figure, by contrast, can be produced by aggressive false-positive triage without fixing anything.

The numerator has the same problem. “Reduction” can mean:

- Findings closed.
- Findings remediated (fixed).
- Findings whose status changed from open to any of closed, fixed, risk-accepted, or false-positive.

A program that aggressively risk-accepts can show large “reductions” without changing the underlying code. This is the most common manipulation I have seen. In a RIS integration project connecting two radiology systems, our first quarterly report showed a 70% vulnerability reduction. What it reflected was risk-accepted findings in OAuth token handling and MongoDB connection strings, documented as accepted risks, not remediated. When we tightened the definition to confirmed fixes only, the number dropped to 35%. The 35% required real engineering work. The 70% was a reporting decision.

What the work usually looks like

When the program is real, when the underlying code actually has fewer vulnerabilities at the end than it did at the start, the work that produced the change usually has a recognizable shape.

The first month or two are mostly about triage and tooling. The team filters out false positives, writes custom suppression rules, gets the signal-to-noise ratio under control. This phase produces large headline reductions because a 200-finding false-positive class disappears in a single suppression rule. The number looks dramatic. The actual security improvement is real but bounded; you’ve eliminated noise, not vulnerabilities.

The next phase is the prioritization phase. The team distinguishes between findings that need fixing and findings that don’t. Severity is reranked against operational risk. Risk acceptance is documented. The remaining work is what actually needs to be fixed. This phase doesn’t produce dramatic headline numbers; it produces a cleaner backlog.

The third phase is the remediation phase. Engineers fix the findings that survived triage.
This is where the real reduction happens, but it’s slow. A team can sustainably remediate maybe 50-100 findings a quarter, depending on complexity. If the surviving backlog is 400, you’re looking at four to eight quarters to drain it.

The fourth phase, which most programs don’t reach, is prevention. The team works on the patterns that produced the findings in the first place. Framework upgrades. Helper libraries. Lint rules. Code review checklists. This phase reduces the inflow of new findings, which is the only way the program can actually maintain a low backlog over time.

The 80% reduction stories that are real are usually combinations of all four phases over a multi-quarter period. The 80% reduction stories that are misleading are usually phase one accomplishments labeled as comprehensive program achievements.

Our timeline looked different from the typical pattern: before the patient portal integration began (pulling imaging data from multiple RIS systems, running AI diagnostic predictions against that data), we ran a TDD proof of concept specifically to establish secure coding patterns across the team. We invested two months in tooling and standards before writing production code. The result was a backlog that never grew out of control. The prevention phase wasn’t phase four. It was phase one.

What gets left out of the headline

There are a few things that are usually left out of the percentage when it appears in a case study or pitch.

The findings that weren’t found. The 80% reduction is against the findings the tools produced. Findings the tools didn’t produce (business logic flaws, design flaws, vulnerabilities the rules don’t cover) aren’t in the denominator. The program might have great numbers and still have categories of vulnerabilities the team is blind to.

The findings that the program created. A program that introduces new tooling sometimes generates new findings as the new tooling sees code that was previously unscanned.
The headline reduction can mask growth in the denominator.

The findings that were merged into other findings. Triage often consolidates similar findings. “We had 400 SQL injection findings; after triage we have 1 finding ‘SQL injection in DAO module’ covering all of them” is a 99.75% reduction by count. The vulnerability count in the underlying code didn’t change.

The work done by other teams. Library upgrades, framework migrations, refactorings done for non-security reasons sometimes eliminate categories of findings as a side effect. The security program gets credit. The credit isn’t quite false, since somebody did the work, but the program didn’t do it.

The reduction that looked best on the dashboard was one the security program had nothing to do with. Upgrading to Java 8 eliminated an entire class of dependency findings in the RIS integration: CVEs that lived in the JDK itself, resolved by the version upgrade, not by any remediation work. Migrating to .NET 4.6 and updating Entity Framework and the associated NuGet packages removed vulnerabilities that had been open for quarters. Both showed up as program wins. Neither was a security decision; they were platform upgrades driven by other roadmaps.

What the work actually costs

For honesty, the cost side. What does it cost to drive a real reduction?

Triage time. A senior security engineer spending 5-10 hours per week on triage, custom rule development, and prioritization. For a year, this is meaningful effort.

Remediation time. Each finding that’s actually fixed takes engineer time. Average over a population: probably 2-6 hours per finding for typical findings, more for complex ones. If you remediate 200 findings, that’s 400-1200 engineering hours.

Tool costs. SAST, DAST, dependency scanning, secret scanning, infrastructure-as-code scanning. The licensing for a mature security tool stack is non-trivial.

Process costs. Triage meetings. Remediation tracking. Reporting infrastructure.
Probably one engineer’s worth of time spread across multiple people, sustained.

Cultural costs. Friction with the development team. Time spent on pushback, negotiation, finding-by-finding context. The relationship work is real and continuous.

For a mid-sized organization producing the kind of reduction that makes a case study, the cost is probably 1-3 full-time equivalents of dedicated security engineering time, plus a fraction of every developer’s time, plus tool licensing, plus opportunity cost on the features that didn’t get built, all sustained over the multi-quarter period the reduction takes.

When a vendor cites a reduction in their case study, the cost side is usually not in the headline. The customer paid for the licenses, paid for the engineering time, and produced the reduction. The vendor’s contribution is the tool that surfaced the findings; the customer’s contribution is the work to fix them.

What the headline is good for

Despite all the caveats, the headline number isn’t useless. It’s a coarse signal, like most aggregate metrics. A program that’s reduced findings by 80% is probably doing more than a program that’s reduced them by 5%. The directionality is real even when the magnitude is questionable.

The headline is also useful for executive communication. Leadership doesn’t want to read a six-page methodology document. They want to know whether the security investment is producing results. A trend line (open Critical and High findings over time, declining) is a usable executive metric, even though it elides everything we just discussed.

The headline is dangerous when it’s used as a target. The moment you set “reduce findings by X%” as a quarterly goal, the team is incentivized to produce that number, and the cheapest way to produce the number is usually triage and risk acceptance, not remediation and prevention. Goodhart’s law applies aggressively here.
The teams that have used the metric well treat it as a leading indicator that triggers investigation. Why is the number what it is? What’s driving it up or down? Is the work that’s producing the number the work we want to be doing? The metric is the conversation starter, not the conclusion.

The metric that worked best in practice was a 70% quality gate in SonarQube. It wasn’t a reduction target; it was a threshold the code had to pass before it shipped. That distinction mattered. The team stopped asking how we close findings and started asking how we write code that doesn’t produce them. It changed the conversation from remediation to prevention. That shift in thinking is the most useful thing a vulnerability metric can do.

What I’d want from a security program

If I were evaluating a security program, the questions I’d ask aren’t about the headline number.

What’s the inflow rate? How many new findings per release, broken down by severity? If the inflow is high, the team is producing vulnerabilities faster than they’re fixing them, regardless of the snapshot reduction.

What’s the false positive rate? If the rate is high and not improving, the team is fighting noise rather than vulnerabilities.

What’s the mean time to fix, broken out by severity? If Criticals take more than two weeks, the program is moving too slowly. If Mediums take more than 90 days, the operational severity ranking might not be aligned with business priorities.

What’s the backlog age? Findings that have been open for more than a year are usually either accepted (in which case they should be documented as such) or stale (in which case they should be revisited). A high-age backlog is a warning sign.

What categories are dominating? If broken access control is 60% of findings, that’s a structural issue, not a triage issue. The fix is architectural.

What’s the program’s relationship to development? Is the development team a partner or an adversary?
The relationship is the leading indicator of program health that doesn’t show up in any metric.

In practice, two things told me more than the headline number. First, which vulnerability categories kept repeating: if the same class of finding was showing up sprint after sprint, that was a pattern problem, not a developer problem, and it needed an architectural fix, not a ticket. Second, SonarQube’s per-developer diff view. We made each developer responsible for the findings introduced by their own changes. That accountability changed behavior faster than any team-level metric. You can’t hide a pattern when the report has your name on it.

What I’d say if you asked me about a real number from my work

If somebody asks me whether I’ve personally been involved in programs that achieved an 80% vulnerability reduction, the honest answer is “yes, by the metrics that were being tracked at the time, with caveats about what those metrics measured.”

Working on RIS integration between two radiology systems, our SonarQube scores improved substantially. Some of it was real remediation: OAuth implementation, dependency upgrades, findings that got fixed. Some of it was triage eliminating noise without touching the underlying code. Some of it was a Java version bump and a .NET migration that the security program took credit for but didn’t initiate. The 70% quality gate was the most honest metric we had. A threshold the code had to pass, not a number the team could game. The reduction was real. The headline, without context, would imply more than the work accomplished.

The version I’d want to share is something like: yes, the program reduced findings substantially, and the reduction was real in the sense that the underlying code had fewer scanner-detectable vulnerabilities than it did at the start.
The reduction was less real in the sense that some of the headline came from triage work that eliminated findings without changing the code, and some came from prevention work that we couldn’t directly attribute to the security program. The program produced real security value. The headline number, if quoted out of context, would imply more than the work actually accomplished. Both of these things are true at the same time.

What this means for evaluating other people’s numbers

When you read a vulnerability reduction number in a case study, vendor pitch, or conference talk, the questions to ask:

- What’s the denominator?
- What’s the numerator?
- What time period?
- What got included and excluded?
- What other work was happening in parallel that might have produced the same effect without security program intervention?
- What’s the cost side that isn’t in the headline?
- Is the trend sustained, or is it a one-time effect that will revert?
- What’s the program’s posture on findings the tools don’t catch?

The number is rarely a lie. It’s also rarely the whole story. The interesting questions are downstream of the number, and most reporting doesn’t get there.

What changed over time is not skepticism; the numbers are real, the work behind them is real. What changed is knowing that the number is always a summary of something more complicated. After a decade in healthcare systems where the code you ship has a patient on the other end of it, you stop leading with the headline and start asking what produced it.

The longer version is where the actual information lives. The honest version of “we reduced vulnerabilities by 80%” is something like: “Over a multi-quarter program, we eliminated a large class of findings through triage, remediated several hundred of the surviving findings, and changed our development practices in ways that reduced the inflow of new findings.
The headline reduction reflects all of these together, and the underlying code is genuinely more secure than it was at the start, though the magnitude of the change isn’t fully captured by any single metric.” That’s longer. It’s also more useful for somebody trying to understand whether the work was real and replicable.

Most security programs don’t make the longer version available. They give you the number. The longer version is usually only available to people who were inside the program. Sharing it more broadly, when we get the chance, is one of the few ways the field gets honest about what works and what doesn’t.

The 80% reduction is real. It’s also more complicated than the headline. Both, at the same time. The work behind the number is the part worth understanding.

View original source — Hacker Noon ↗
