Wait, Are We Winning?!: ​​Measuring Software Security

Multicolored waves similar to an investment graph in the colors of blue and red.
March 19, 2026

Columbia University’s New York Cyber Task Force, framed “winning” in terms of who had more system-wide advantages, attackers or defenders.

In the earliest days of computing, defenders had the advantage: if you locked the door and vetted everyone with access to your computer, then it would not matter much if your hardware or software was insecure. As early as 1970 networking became more prevalent which meant software vulnerabilities went from being a local issue, only exploitable by those with access to the physical computer, to a global and systemic problem, accessible by any other Internet-connected computer.

In this new environment, it mattered a great deal that software was insecure. A 1972 report concluded that the design of “hardware and software is totally inadequate to withstand malicious attack” and that “unless security is designed into a system from its inception, there is little chance that it can be made secure by retrofit.” This insecurity is driven in large part by misaligned incentives. As Bruce Schneier noted as far back as 2003, “there are no real consequences for having bad security, or having low-quality software,” and software that prioritizes additional features and timely release dates is often rewarded, “even if they come at the expense of quality.”

This bleak history can lead many security practitioners and policymakers to pessimism or despair. But new data suggests defenders are finally making progress in software security.

To measure software security, we looked at insights from a range of sources, including from Veracode’s State of Software Security Report, which has been reporting on software security for 15 years and now includes insights from over 1.3 million unique applications across over 2,000 customers.

We can report substantial progress across several important lines: 1) the severity of flaws, 2) their ability to be exploited, 3) the remediation speed of applying patches to the most severe flaws, 4) adversary speed and use of zero-day exploits, and 5) reductions in the use of memory-unsafe languages. The bad news is that security debt is increasing and Artificial Intelligence is not yet delivering more secure software. We will cover all these in turn.

According to Veracode data from software security testing, the level of flaws remains high: around 80% of applications were found to have at least one security flaw. However, the proportion of applications with a major flaw (that is, one in the OWASP Top 10 and CWE Top 25 lists of critical vulnerabilities) has been declining for years and at an increasing rate. From 2010 to 2020, there was an improvement of only about 1 percent a year. Since 2025, this has accelerated to 4 percent a year. Now over half of the software scanned lacks obvious, major flaws; up from less than a quarter in 2010.

Indeed, Veracode has reported that “every measurement” of security improved “over the last six years,” including a 30 percent improvement in software without a “flaw in an open-source library when they are first scanned,” which has since further improved. This last metric is especially useful, as it sheds light on both success at secure-by-design and open-source, third-party software and shows a reduction of risk from third-party code. This could be either from more-secure open-source code, or, more likely, quicker updates to remediated versions of the  third-party packages used by software builders.

In addition to there being fewer severe vulnerabilities, the average vulnerability is less exploitable. According to data provided by the Exploit Prediction Scoring System, and analyzed by Veracode, the “exploit potential” of vulnerabilities has decreased by nearly a third in less than five years (see Figure 1.) Exploit potential is the likelihood that a given software vulnerability will be actively used by attackers in the real world. In other words, it estimates how attractive and feasible it is for attackers to create or use an exploit for that vulnerability.

Figure 1:

A chart provided by Exploit Prediction Scoring System showcasing the decline of exploit potential of vulnerabilities from 2020 to 2024.

But are these gains offset by the fact that there are now far more applications out there?

It’s possible, but initial insights suggest this is not as severe as might be expected. If defenders were being flooded with insecure software, the numbers of exploits used by attackers might be expected to increase. But instead, the number of known exploitable vulnerabilities (KEVs) has been either flat (as reported by EPSS) or even decreasing (as reported by CISA).

The previous statistics illustrate success in the software security development process to create less exploitable code. There are successes as well in how quickly enterprises fix their own code and apply patches from those developers.

Cobalt, a penetration testing company, reported substantial improvements in patching the most serious vulnerabilities, which “has decreased dramatically—from 112 days [in 2017] down to 37 days” in 2024. The proportion of serious findings discovered in penetration tests “declined by about half,” from 20 percent to 11 percent in that time.

Veracode, using a different set of data, agrees that remediation times are still very long (with security bugs having a half-life of 243 days, see Figure 2) but those flaws are being fixed slightly faster, with a 4% year over year improvement from 2025.

Figure 2:

Figure 2 details a chart by Veracode on remediation times.

Attackers have budgets and bosses too” and would prefer to use the same vulnerabilities for years at a time, as it is easier and cheaper. Instead their costs are increasing, because code is more secure and patched more quickly. They have been forced into exploiting newly disclosed vulnerabilities (the age of which has dropped from 63 days in 2018 to just five days in 2023) and using more zero-day vulnerabilities (from about 30 a year up to 2020 to roughly 85 a year since then, according to Google). Threat actors are having to use these expensive flaws, known only to them and not the defenders, rather than relying on older, commodity flaws and are having to replace them more quickly.

Another major improvement, just since last year, is an 11 percent reduction of flaws in third-party, open-source packages. The majority of critical severity secure debt still comes from third-party code, but the security of the software supply chain is at least improving, with 62 percent of applications having vulnerabilities in the open-source packages on which they rely.

Only recently have “memory safe” languages become a priority, despite having been raised in the same 1972 report which introduced the concept of secure by design. While memory vulnerabilities typically represent about 70 percent of vulnerabilities that Microsoft fixes per year, there have been substantial improvements, with Google and Microsoft reducing by half the number of memory-safe vulnerabilities in Android.

Not all the trends are shifting in favor of more-secure software.

Though patch rates are improving, the security backlog is growing faster than remediation capacity can eliminate it. The percentage of applications with “security debt” (known vulnerabilities left unresolved for more than a year) has surged from 42 percent to 49 percent. Fully 60 percent of organizations have a critical debt (flaws that are both severe and highly exploitable) has grown by 20 percent. Much of this delay is from using engineering capacity to push new features rather than reducing security debt. So, even though patch rates are improving, defenders are falling farther behind in their security debt. 

What About Generative AI? So far, generative AI is not substantially improving software security. Four independent studies (NYU, Wuhan, Stanford, Purdue) indicate that AI-generated code is slightly less secure than human code, likely due to it being trained on whatever open-source data happens to be available.

Veracode has confirmed this with its own study of 100 LLMs released over the past 25 months which found that while LLMs have made great strides in creating syntactically correct code, they have not improved in writing secure code, as shown in Figure 3. The average code completion task across four major programming languages produces a vulnerability 45% of the time. Yet there has been some recent improvement. The late 2025 reasoning models wrote insecure code only 30% of the time.

Figure 3:

Figure 3 details a chart showcasing the Security Pass Rate versus LLM Release Date from September of 2022 to December 2025. This chart is from Veracode.

That said, the potential good news is that it is far easier to train AI to code more securely than it is to train the perhaps 10 million software developers in the world, who are often trained to prioritize functionality over security.

Software is becoming far more secure across a range of important measurements. Years of patient investments and improvements finally seem to not only be paying off but making it more difficult for attackers.

To keep the progress going, this data suggests three key recommendations.

First, industry should further invest in ensuring that generative AI tools help create secure code. This not only helps their developers improve their own internal software security, but also the software of their users.

Second, governments and industry should continue to tackle flaws in third-party, open-source code. These investments will likely have an outsized payoff, as this seems to be an important source of remaining critical severity security debt.

Third, to truly succeed, defenders need to invest more in measuring these successes. Measurement is far simpler and less expensive yet has received comparatively little attention.

To improve measurement of ecosystem-level improvements, defenders and developers need to leverage data from existing sources, such as Veracode’s SoSS reports as well as those from Cyentia, EPSS, and others, using a common framework across threat, vulnerability, and consequence, such as the one in Columbia’s “winning” structure.

Those with relevant data should be encouraged by governments, industry associations, and other stakeholders to report defensibility relevant statistics, in time series and according to the framework. Such data can then be analyzed and collated into a periodic report to measure progress. These could, for example, be the basis of the annual posture reports required by Congress of the White House’s Office of the National Cyber Director.

Perhaps the most important activity for improving software security would be to move from metrics to goals. We might agree, for example, to drive the proportion of software with critical flaws from around 50 percent to 25 percent in five years and just 5 percent in ten years. The average age of vulnerabilities used in successful exploits should be less than a single year. Political and technical goals such as these can drive ideas and innovations giving a target for defenders and developers to aim for.

Success is possible, let us aim for it and measure our progress.

The views represented herein are those of the author(s) and do not necessarily reflect the views of the Aspen Institute, its programs, staff, volunteers, participants, or its trustees.