Measuring and why it is important
Peter Drucker is usually credited with saying “If you can’t measure it, you can’t manage it.” Yet others say that he never put it quite like that, and that the statement itself is a fallacy[1]. Like Drucker, we believe that the first role of a manager is a personal one: “It is the relationship with people, the development of mutual confidence, the identification of people, the creation of a community. This is something only you can do.”[2]
However, this does not mean that measuring is unimportant; it only means that the inverse of the argument does not hold: measuring is very important for management, but it is not the only important factor of management. Without measuring, how would you know whether you are progressing towards your goals and success criteria? And if you see that you are not progressing, or not progressing as planned, aren’t metrics at least one good option for understanding the issue and looking into ways to improve the situation? So measuring is important, and these questions also give us some indications of how metrics should be designed.
Link to goals and objectives
But let us come back to what this means for an AppSec program: when we design AppSec programs, we view it as the foundation to know what the specific goals and objectives of the program are. By clarifying the goals and breaking them down into objectives (or milestones), we can define better success criteria and then plan how to work towards them. Working towards the success criteria involves soft factors such as job satisfaction and organizational culture, in our case secure development culture and the job satisfaction of the stakeholders of the software development organization, but there are also factors that boil down to questions about hard numbers such as “How many critical vulnerabilities do we have?”, “Is the number of vulnerabilities going up or down?” and “Is the risk from application-level vulnerabilities[3] at an acceptable level to the business?”
Learning from other disciplines
Sometimes it is valuable to lean on the shoulders of other disciplines that have the same or similar requirements. For a software application in production, it matters little whether an outage is caused by a bug or by a directly exploitable vulnerability. Therefore, it stands to reason that the ways to measure progress in managing bugs are similar to those for managing vulnerabilities. There are of course differences, for example that risks to information disclosure or data integrity are usually caused by vulnerabilities, and that the damage to the business from disclosed or altered data can be even bigger than that of a system outage; but that mainly means the problem is larger, so there is even more motivation to address an exploitable vulnerability than a bug. Drawing on long-standing background and experience in software quality assurance (QA) testing, there are three important ways to measure this at a business level:
1.) Escaped defects
This metric refers to the number of defects that have “escaped” the QA process and are found in production, i.e., in software that is released. This is one of the most important metrics in QA since it is tied to the performance of the QA practice.[4]
2.) Defect distribution
Usually, this metric refers to when defects are found in the development cycle, and of course the aim is to find them as early as possible, which means shifting left (a note to the AppSec experts among the readers: does that sound familiar?). Defect distribution measures how many defects were found earlier compared to later in the testing cycle, for example in unit testing vs. integration testing vs. escaped defects.
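To make this concrete, here is a minimal sketch in Python of how defect distribution could be computed; the phase names and the shape of the defect records are illustrative assumptions, not the output of any particular tool.

```python
from collections import Counter

# Illustrative phases, ordered from earliest ("most shifted left") to latest.
PHASES = ["unit testing", "integration testing", "escaped (production)"]

def defect_distribution(defects):
    """Return the share of defects found in each phase.

    `defects` is assumed to be a list of dicts with a 'phase' key,
    e.g. [{"id": 1, "phase": "unit testing"}, ...].
    """
    counts = Counter(d["phase"] for d in defects)
    total = sum(counts.values()) or 1  # avoid division by zero on an empty list
    return {phase: counts.get(phase, 0) / total for phase in PHASES}

# Made-up example: 70% of defects found in unit testing would indicate a healthy shift left.
example = (
    [{"id": i, "phase": "unit testing"} for i in range(7)]
    + [{"id": 7, "phase": "integration testing"},
       {"id": 8, "phase": "integration testing"},
       {"id": 9, "phase": "escaped (production)"}]
)
print(defect_distribution(example))
# {'unit testing': 0.7, 'integration testing': 0.2, 'escaped (production)': 0.1}
```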
3.) Defect density
This metric refers to the number of defects in relation to the size of the release or of the software. The problem, however, is how to measure the size of a release or of software. There are different ways, none of them perfect, but the best available today is still to count lines of code (LOC). Other options are to count the amount of time that went into developing the software or to count story points[5], but it is not always easy to get reliable data for these measures of size. LOC data is normally straightforward to obtain, although it can be skewed by auto-generated code, and how much custom code is needed may depend on the programming language.
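As a minimal sketch, assuming defects are simply counted and the release size is measured in lines of code, defect density per thousand lines of code (KLOC) could be computed like this (the numbers are made up for illustration):

```python
def defect_density_per_kloc(defect_count, lines_of_code):
    """Defects per 1,000 lines of code (KLOC)."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return defect_count / (lines_of_code / 1000)

# Illustrative example: 12 defects in a 250,000 LOC release.
print(defect_density_per_kloc(12, 250_000))  # 0.048 defects per KLOC
```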
How to design KPIs for the business/strategic perspective of AppSec
As stated above, the challenge of bugs/defects in production software is very similar to that of vulnerabilities in production software. Therefore, we can apply best practices from QA testing to AppSec testing as well. Some of the key requirements for designing and using KPIs are:
- Measure only what is part of the goals and objectives
- Be able to “drill in”
- Create a baseline and track values over time
- Make the metrics available to all stakeholders and keep them as live as possible
- Indicators should not fluctuate unnecessarily
- Combine measures to give executive leadership one score to track
(1) The business goals and objectives for AppSec typically revolve around ensuring that the risk from application-level attacks is at an acceptable level to the business, i.e., the metric should endeavour to measure and quantify risk. What constitutes risk, e.g., whether it is compliance risk, reputational risk, legal risk, or risk from a loss of revenue directly tied to a production outage caused by software being unavailable, requires a more detailed analysis. When you design your metrics or key performance indicators (KPIs), you should make sure that you only measure what is included in your goals and objectives and nothing else. The aim should be to cover as much as possible of the goals and objectives in as few KPIs as possible, ideally in just one metric at executive level.
You should be able to “drill in” to the metric (2). This means you need to be able to break the number down into its individual components. For example, when using the defect density of the whole software development organization as a metric, you should be able to drill down into the defect density of each application that the organization is developing, so that areas with lower performance can be looked into, addressed, and improved. This is the best and sometimes the only way to work towards overall improvements. This also requires (3) creating a baseline and tracking values over time; otherwise you cannot compare the performance of one area (e.g., one application) with the performance of the same area in a previous time period, and therefore do not know whether the area has progressed or regressed. In order for the organization to work together on improvements it is crucial that everyone is able to monitor the metrics (4) and therefore able to act if the values decline. This can also help drive a level of healthy competition between teams, because no team normally wants to be at the bottom of the list in terms of performance, or worse than another team they relate to.
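To illustrate the drill-in and baseline ideas, here is a minimal sketch, assuming findings carry the application they belong to and that size is measured in LOC; the application names, field names, and numbers are made up. The organization-wide density rolls up from the per-application values, so the same data serves both the executive view and the drill-down.

```python
from collections import defaultdict

def density_by_application(findings, loc_by_app):
    """Per-application finding density (findings per 1,000 lines of code)."""
    counts = defaultdict(int)
    for f in findings:
        counts[f["app"]] += 1
    return {app: counts[app] / (loc / 1000) for app, loc in loc_by_app.items()}

def org_density(findings, loc_by_app):
    """Organization-wide density: total findings over total KLOC (the roll-up of the above)."""
    return len(findings) / (sum(loc_by_app.values()) / 1000)

# Illustrative data: compare the current period against a baseline period.
loc = {"payments": 120_000, "web-shop": 300_000}
baseline = [{"app": "payments"}] * 9 + [{"app": "web-shop"}] * 12
current = [{"app": "payments"}] * 4 + [{"app": "web-shop"}] * 15

print("organization:", org_density(baseline, loc), "->", org_density(current, loc))
for app in loc:
    before = density_by_application(baseline, loc)[app]
    now = density_by_application(current, loc)[app]
    print(f"{app}: {before:.3f} -> {now:.3f} findings/KLOC")
# The org-wide number improves slightly, but the drill-down reveals that
# web-shop has regressed while payments has improved.
```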
Furthermore, indicators should not fluctuate unnecessarily (5). This is perhaps best explained by an example of a KPI with a serious flaw in this sense: mean time to remediate (MTTR) is a common KPI that measures how much time (usually in days) it takes to remediate an issue after it was found. However, this metric only captures issues that have actually been remediated. This means that when a software organization or team starts remediating issues that were identified a long time ago, the value will temporarily go up. The metric therefore penalizes good behaviour if it is not at least viewed in conjunction with the number of vulnerabilities that are still open. The amount of time that a vulnerability has been open is of course relevant, but our recommendation is to look first at the vulnerability age (independently of whether the vulnerability has already been fixed or not). Only in practices with high maturity can it also be interesting to look at MTTR, but it should not be the first and most important KPI.
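A minimal sketch, with made-up findings, of why MTTR can mislead while vulnerability age does not: fixing a very old finding pushes MTTR up, whereas the average age across all findings still rewards the clean-up.

```python
from datetime import date

today = date(2024, 6, 1)  # illustrative reporting date

# Illustrative findings: date opened, and date fixed if already remediated.
findings = [
    {"opened": date(2023, 1, 10), "fixed": date(2024, 5, 20)},  # very old finding, finally fixed
    {"opened": date(2024, 5, 1),  "fixed": date(2024, 5, 15)},  # recent finding, fixed quickly
    {"opened": date(2024, 4, 1),  "fixed": None},               # still open
]

def mttr_days(findings):
    """Mean time to remediate, computed over fixed findings only."""
    fixed = [(f["fixed"] - f["opened"]).days for f in findings if f["fixed"]]
    return sum(fixed) / len(fixed) if fixed else 0.0

def mean_age_days(findings, as_of):
    """Mean age of all findings, open or fixed (fixed findings stop aging at the fix date)."""
    ages = [((f["fixed"] or as_of) - f["opened"]).days for f in findings]
    return sum(ages) / len(ages)

print(f"MTTR: {mttr_days(findings):.0f} days")              # inflated by the old finding that was just fixed
print(f"Mean age: {mean_age_days(findings, today):.0f} days")  # lower than it would be had the old finding stayed open
```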
Lastly (6), it is a practice that has proven effective to combine measurements into one weighted risk score. For example, relevant measurements from different testing tools such as SAST and SCA, and findings of different criticalities, should be combined into one weighted score. That way executive leadership can review this score on a regular basis, e.g., monthly, and see whether it is developing in the right direction. The metric only needs to be agreed once; afterwards it can be tracked and reported regularly, which should mean that the business can rest assured that the risk from application-level attacks is appropriately managed. If it does not develop in the right direction, you can drill into it (see above), understand the root cause, and address it.
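A minimal sketch of such a combined score, assuming illustrative criticality weights and finding records that carry a source (e.g. SAST or SCA) and a severity; the actual weights are something each organization has to agree on once.

```python
# Illustrative criticality weights; the real weighting must be agreed with the business.
WEIGHTS = {"critical": 10, "high": 5, "medium": 2, "low": 1}

def weighted_risk_score(findings):
    """Single weighted score across sources (e.g. SAST, SCA) and criticalities."""
    return sum(WEIGHTS.get(f["severity"], 0) for f in findings)

def score_by_source(findings):
    """Drill-in: the same score broken down per testing source."""
    scores = {}
    for f in findings:
        scores[f["source"]] = scores.get(f["source"], 0) + WEIGHTS.get(f["severity"], 0)
    return scores

# Illustrative monthly snapshot combining SAST and SCA findings.
snapshot = [
    {"source": "SAST", "severity": "critical"},
    {"source": "SAST", "severity": "medium"},
    {"source": "SCA",  "severity": "high"},
    {"source": "SCA",  "severity": "low"},
]
print(weighted_risk_score(snapshot))  # 18
print(score_by_source(snapshot))      # {'SAST': 12, 'SCA': 6}
```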
KPIs for different levels – leadership/strategic and management/tactical
In QA as well as in AppSec, there are KPIs/metrics that are relevant at different levels of the organization. For example, in QA testing, metrics such as test coverage and code coverage are important at a tactical level and should be improved over time, but they matter more to QA management than to executive leadership. The number of escaped defects will already reflect whether the test coverage is sufficient. Therefore, executive leadership does not have to track test coverage, whereas for a QA manager this metric is important in working towards reducing escaped defects. Similarly, in AppSec testing there are findings detected in code that is already in production and findings in development branches that are not yet in production (feature branches). The latter do not yet pose risk to the business, but the density of such issues is relevant from a tactical perspective to make sure fewer new vulnerabilities are introduced in newly developed source code.
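As a small sketch of that split, assuming each finding records the branch it was detected on and that release branches follow a naming convention (which will differ per organization), findings can be separated into those that already carry business risk and those that are only tactically relevant.

```python
def split_by_branch(findings, release_prefixes=("main", "release")):
    """Split findings into those on production/release branches and those only on feature branches."""
    in_production, pre_production = [], []
    for f in findings:
        # Assumption: each finding carries the branch it was detected on.
        if any(f["branch"].startswith(prefix) for prefix in release_prefixes):
            in_production.append(f)
        else:
            pre_production.append(f)
    return in_production, pre_production

# Illustrative findings from different branches.
findings = [
    {"id": 1, "branch": "main"},
    {"id": 2, "branch": "feature/checkout-redesign"},
    {"id": 3, "branch": "release/2024.06"},
]
prod, pre = split_by_branch(findings)
print(len(prod), "findings carry business risk today;", len(pre), "are tactical, pre-production findings")
```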
Conclusion for AppSec KPIs
In conclusion, our recommendation is to measure only the defect density of vulnerabilities from different testing types such as SAST and SCA for code that is already in production (i.e., release branches). The defect density should be weighted by the criticality of the finding, i.e., higher-criticality findings should have a higher weight in the calculation. This metric is the equivalent of a combination of “escaped defects” and “defect density” in QA and can serve as the single value to track at executive level.
Other metrics, such as vulnerability age and defect distribution, are also important at an AppSec management level (but not necessarily at executive level). Defect distribution can compare how many defects were identified in code that is not yet in production with vulnerabilities detected in source code that is already in production, and the ratio of production findings to pre-production findings should be reduced over time. Vulnerability age is a metric that should be reduced over time, and it will contribute to reducing the vulnerability density: the shorter the time vulnerabilities stay open, the fewer of them there will be at any point. It is therefore a tactical KPI that helps improve the long-term strategic KPI.
Conclusion
Finding the right KPIs for AppSec testing can be difficult, but it is crucially important as one handle for managing application-level risk over time, and this risk is one of the biggest risks for many organizations whose main business process relies on software or is software. In our opinion it is valuable to lean on related disciplines and learn from the past to inform decisions about KPIs. We hope this article gave some insights from our practice into which metrics and KPIs to use. Please contact the authors if you have comments or further questions. The considerations presented in this article are among a number of best practices in AppSec management and practice that we gather and use in the Checkmarx AppSec Program Methodology and Assessment (APMA) Framework. For more information, please see our APMA page.
This article originated from the joint work of Yoav Ziv and Carsten Huth at Checkmarx, Yoav with a long-standing background and experience in QA management, Carsten with a similar background and experience in application security. Working together on the customer-facing application of the topics discussed in this article, we gained insights that we find valuable to share with practitioners in application security and information security.
[1] Anne-Laure Le Cunff: “The fallacy of ‘what gets measured gets managed’”, see https://nesslabs.com/what-gets-measured-gets-managed
[2] Peter Drucker, according to Paul Zak, see https://www.drucker.institute/thedx/measurement-myopia/
[3] Despite the esoteric introduction, we talk about AppSec in this article.
[4] https://www.testim.io/blog/qa-metrics-an-introduction/
[5] Story points are used in Agile and Scrum methodologies for expressing an estimate of the effort required to implement a product backlog item or any other piece of work.