Common statistical metrics are not applicable in performance testing. Incorrect metrics may cause you to ignore real problems and to optimize where it isn’t necessary. Let us look at the proper way to do performance testing.
What to measure
The most common advice given on performance optimization — and rightfully so — is to measure what you are planning to improve.
When talking about application performance, what are the most typical performance issues we need to address first? Luckily, RebelLabs has already asked this question. Here are the answers:
If we discard the multiple-choice confusion and look at the raw data, 86% of respondents have experienced at least one of those critical issues, indicated by red bubbles. This suggests that most code-related performance issues come down to inefficient application code and poor database access patterns. Let’s start with them.
How to measure
We know our suspects: inefficient application code, slow database queries and inefficient data access patterns. But how do we measure them? A typical recommendation is to apply statistical methods. Is that a good recommendation, though?
When your goal is to find out about an incident after it has happened, statistics work.
However, this won’t provide a lot of information. An increase of the average value will tell you that some users of your application were affected. A spike in the 50th percentile will tell you, again, that some users suffered. But how bad was it, really?
Now consider a better approach to performance management: catching performance issues before they put your customers through latency misery. For that goal, statistical methods simply don’t work.
To see why, I need to introduce the concept of latency profiles.
A latency profile is made up of all request latencies, sorted from fastest to slowest and plotted on a chart. In this chart, the Y axis shows the latency value and the X axis represents percentiles:
The green part of the line shows the requests that became faster than the baseline (grey). Red, in turn, shows requests which are slower.
A latency profile shows every single request. This is an important distinction from statistical metrics that consist of a single number. The average of 100 requests is one number. The median of 1000 requests is another single number. These metrics are lossy. This is why an increase in an average does not tell you anything about the proportion of affected users.
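To make the lossiness concrete, here is a minimal sketch with made-up numbers (the latencies and the helper name `affected_share` are assumptions for illustration, not data from any real test run). Two very different runs produce exactly the same increase in the average:

```python
from statistics import mean

# Hypothetical latencies in milliseconds; not taken from a real system.
baseline = [100.0] * 100

# Run A: every single request is 10 ms slower.
run_a = [110.0] * 100

# Run B: one request in a hundred is 1000 ms slower, the rest are unchanged.
run_b = [100.0] * 99 + [1100.0]


def affected_share(base, run):
    """Fraction of requests slower than the baseline, pairing requests by rank."""
    pairs = zip(sorted(base), sorted(run))
    return sum(1 for b, r in pairs if r > b) / len(base)


# Both runs raise the average by the same amount...
mean_increase_a = mean(run_a) - mean(baseline)
mean_increase_b = mean(run_b) - mean(baseline)

# ...but the proportion of affected requests is completely different.
share_a = affected_share(baseline, run_a)
share_b = affected_share(baseline, run_b)
```

In both cases the average goes up by 10 ms, yet in run A every request is affected and in run B only one in a hundred is. The average alone cannot tell these two situations apart.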
The latency profile, on the other hand, is lossless. All requests are taken into account and no information is lost. You can even see which segment was affected, as in the example above: 80% of users got faster response times and 20% experienced a degradation.
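As a rough sketch of how such a profile could be built and compared, the snippet below sorts the raw latencies, assigns each one a percentile, and reports the percentile ranges where the current run is slower than the baseline. The function names, the rank-based pairing and the sample numbers are my own assumptions for illustration; this is not how any particular tool implements it.

```python
def latency_profile(latencies):
    """Return (percentile, latency) points, sorted from fastest to slowest."""
    ordered = sorted(latencies)
    n = len(ordered)
    return [(100.0 * (i + 1) / n, value) for i, value in enumerate(ordered)]


def slower_segments(baseline, current):
    """Return (start_percentile, end_percentile) ranges where current is slower.

    Assumes both runs contain the same number of requests, so that the
    two profiles can be compared point by point.
    """
    if len(baseline) != len(current):
        raise ValueError("both runs must contain the same number of requests")
    segments = []
    start = None      # percentile where the current slow segment began
    prev_pct = 0.0    # percentile of the previous point
    for (pct, base), (_, cur) in zip(latency_profile(baseline),
                                     latency_profile(current)):
        if cur > base and start is None:
            start = prev_pct
        elif cur <= base and start is not None:
            segments.append((start, prev_pct))
            start = None
        prev_pct = pct
    if start is not None:
        segments.append((start, 100.0))
    return segments


# Hypothetical data: the fastest 80% of requests improved, the slowest 20% regressed.
baseline = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
current = [90, 100, 110, 120, 130, 140, 150, 160, 250, 300]
segments = slower_segments(baseline, current)  # [(80.0, 100.0)]
```

Every request contributes a point, so the output says not only that something got slower but also where in the distribution it happened.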
How not to measure
A single metric does not tell you anything.
The charts below show page load time metrics after refactoring.
When we only look at the median (left), which is the point on the line right above the 50% mark, we might conclude that the page load times are faster. The refactoring is ready to be shipped. You won’t find out about the regression — 20% of requests above the baseline — until it hits production.
However, looking at the entire picture (right) leads to a different decision. Is the improvement for 80% of users acceptable — at the price of genuine performance degradation for the remaining red 20%? It might be — or might not be, depending on your goal. The important thing is to ask this question early. Before it becomes another production firefighting story.
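The difference between the two views can be sketched as two release checks. This is an illustration under assumptions: `median_gate`, `profile_gate`, the `max_slower_share` parameter and the page load times are all hypothetical, chosen to mimic the 80/20 split described above.

```python
from statistics import median


def median_gate(baseline, current):
    """Naive check: pass as long as the median did not get worse."""
    return median(current) <= median(baseline)


def profile_gate(baseline, current, max_slower_share=0.0):
    """Pass only if the share of requests slower than baseline is acceptable.

    Requests are paired by rank (fastest with fastest, and so on), which
    assumes both runs contain the same number of requests.
    """
    base, cur = sorted(baseline), sorted(current)
    slower = sum(1 for b, c in zip(base, cur) if c > b)
    return slower / len(base) <= max_slower_share


# Hypothetical page load times in ms: after the refactoring the fastest 80%
# of requests are 20% faster, while the slowest 20% are 50% slower.
before = [100 + i for i in range(100)]
after = [b * 0.8 for b in before[:80]] + [b * 1.5 for b in before[80:]]

ship_by_median = median_gate(before, after)    # True: the median improved
ship_by_profile = profile_gate(before, after)  # False: 20% of requests regressed
```

Whether a 20% regression is acceptable is a judgment call, which is why the threshold is a parameter here: `profile_gate(before, after, max_slower_share=0.25)` would pass the same change. The point is that the question gets asked at all.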
Now let us apply this approach to testing.
Ideally, you need to run the same test suite before and after a set of code changes. For example, if a test suite gets run after each new build in a CI environment, you can compare those runs with each other.
When I use statistical metrics for comparison, I might learn that my new code caused the average response time to increase. As explained above, a single metric gives me no additional information on whether I should act on this increase.
On the other hand, the latency profile differential tells me so much more. Comparing latency profiles from two builds produces an easy-to-understand, actionable diff. Here are some examples.
All tests became slower. I should be very worried.
A sporadic small bump. I should not worry too much.
An outlier problem. I should investigate.
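The three diff shapes above could be told apart automatically along these lines. Treat this as a sketch only: the function `classify_diff` and all three thresholds (`widespread_share`, `outlier_ratio`, `noise`) are assumptions I picked for illustration, and a real test suite would need its own tuning.

```python
def classify_diff(baseline, current,
                  widespread_share=0.9, outlier_ratio=2.0, noise=0.05):
    """Classify the difference between two latency profiles.

    Assumes equally sized runs with strictly positive latencies.
    All thresholds are illustrative defaults, not recommended values.
    """
    if not baseline or len(baseline) != len(current):
        raise ValueError("runs must be non-empty and of equal size")
    base, cur = sorted(baseline), sorted(current)
    # Relative slowdown at each rank of the profile (1.0 means unchanged).
    ratios = [c / b for b, c in zip(base, cur)]
    # Ignore changes small enough to be measurement noise.
    slower = [r for r in ratios if r > 1.0 + noise]
    if not slower:
        return "no regression"
    if len(slower) / len(ratios) >= widespread_share:
        return "all slower"    # be very worried
    if max(slower) >= outlier_ratio:
        return "outlier"       # investigate
    return "small bump"        # do not worry too much


# Hypothetical test durations in ms from two builds.
build_1 = [100.0] * 20
all_slower = classify_diff(build_1, [120.0] * 20)
small_bump = classify_diff(build_1, [100.0] * 18 + [110.0, 112.0])
outlier = classify_diff(build_1, [100.0] * 19 + [900.0])
```

Run after every CI build, a check like this turns the visual diff into a signal that can fail the build or just leave a note, depending on the category.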
For your app
Now let’s summarize what we learned.
Statistics don’t cut it when it comes to performance testing. Averages and percentiles compress all testing data into a single number, which hides the whole picture and can lead to poor decisions. Instead, you should be looking at all the information available. Latency profiles show the complete performance behavior across all your tests. Comparing latency profiles gives a clear understanding of the performance implications of your code changes. When done during testing, this leads to proactive regression detection.
The best part of all this: the ideas outlined in this article are not theoretical. They are not even just abstractly practical. They are tangible.
Try XRebel Hub