An RCU bug resulted in about one “near miss” per hour, and I used ten-hour test runs to guide debugging.
So why ten hours?
Stand back! Obtaining this answer will require statistics!!!
And the statistical tool appropriate for this situation is the celebrated Poisson distribution. If you have a way of calculating the cumulative Poisson distribution, you give it the number of events seen in the test and the number of events expected in that test, and it gives you the probability of that many events (or fewer) occurring due to random chance.
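For reference, the Poisson distribution with expected event count lambda assigns probability

    P(k) = lambda^k * exp(-lambda) / k!

to seeing exactly k events, so the probability of seeing no events at all in an interval where lambda events are expected is simply exp(-lambda).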
In this case, I am proposing running a ten-hour test that has historically
resulted in about one near miss per hour, which means that I would expect
ten near misses in my ten-hour test.
If I instead see zero near misses, what is the probability that the lack of
errors was a fluke, AKA a false negative?
We can ask this question of the maxima open-source symbolic-math package as follows:
load(distrib); bfloat(cdf_poisson(0,10));
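The load(distrib) pulls in maxima's distrib package, which is what provides cdf_poisson().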
The bfloat() converts the answer to floating point, as opposed to the somewhat unhelpful gamma_incomplete_regularized(1, 10) that maxima would otherwise give you.
The cdf_poisson(0,10) computes the Poisson cumulative distribution function, which gives the probability of zero or fewer events occurring in a time interval in which ten events would normally occur. This probability turns out to be 4.539992976248485b-5, which means that we should have about 99.995% confidence (AKA “four nines”) that a failure-free ten-hour run is not a false negative.
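This is just exp(-10) in the formula above, so typing bfloat(exp(-10)); at maxima should produce the same value.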
Of course, statistics cannot prove that the bug went away entirely. For example, if an alleged fix decreased the bug's rate of occurrence by an order of magnitude (as opposed to fixing the bug entirely), then a ten-hour failure-free run would be quite probable, occurring about 37% of the time. This of course means that we should use a longer test to validate any proposed fix.
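To check that 37% figure, the same recipe applies with an expectation of one near miss rather than ten:

    bfloat(cdf_poisson(0,1));

This should print approximately 3.6788b-1, that is, exp(-1), or about 37%.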
In this case, I used a 1,000-hour test run, which would be expected
to have 1,000 near misses with the bug still in place.
Typing bfloat(cdf_poisson(0,1000)); at maxima resulted in 5.075958897549416b-435.
In other words, the percentage representing the confidence that such
a run is not a fluke has 434 nines, two before the decimal point and 432 after.
We could therefore be extremely confident that an error-free 1,000-hour
run means that our proposed fix did something.
But we can conclude more.
For example, let's suppose that the proposed fix was incomplete, so that
it reduced the near-miss rate by only two orders of magnitude.
In that case, we would expect ten near misses in the 1,000-hour run,
so that the false-negative probability is again computed using bfloat(cdf_poisson(0,10)), meaning that we can be 99.995% confident that a near-miss-free 1,000-hour run indicates that the proposed fix reduced the near-miss rate by at least two orders of magnitude.
In short, although we cannot prove the absence of errors by testing, we can establish high degrees of confidence in greatly decreased probabilities of occurrence!