Many testing groups avoid the painful process of hypothesis testing. When they skip it, they take a high risk of making serious errors, especially Type II Errors. These can be devastating for pilot plant work and can lead to significant “scope drift,” as explained below.
It is important to note that whenever testing is done there are implied hypotheses – whether stated or not. Even something as simple as looking at the gas gauge on our car involves a number of hypotheses or assumptions as we try to determine if we should stop and get gas or keep on driving. We instinctively start evaluating the accuracy of our gauge, the reliability of our gas mileage, the probability of finding another station and the consequences of being wrong. If we are about to cross Death Valley we are going to act much differently than if we are driving to the corner grocery.
In the case of a pilot plant the consequences can be very big. They could involve safety issues or even the potential for financial ruin. Hence, understanding the hypotheses and consequences in any testing program can be a serious matter. Unfortunately, when we try to translate our intuitions about hypotheses or assumptions into objective statistical terms we often land in a state of confusion. The primary cause of our confusion is the rather circular and somewhat contradictory language surrounding the topic. We will try to keep it simple and point out some of the language traps while not losing the benefits of objective statistical terms.
Let’s begin with the idea of a hypothesis. This is a statement that you believe, hope, or at least assume to be true. It is called the “null hypothesis,” or H0. Here’s the first warning to the reader about language: some authors try to give meaning to the term “null.” By doing so they hope to influence how hypotheses are stated and to “standardize” the process. In many cases that isn’t very helpful; it often results in double negatives that add confusion. It is better to think of H0 simply as the basic assumption and go from there. The “alternative hypothesis,” or H1, is simply all other possibilities (the logical antithesis of H0). If H0 is not true, then H1 must be true.
We can give a simple example:
H0: Anna is the tallest girl in her math class
H1: Anna is not the tallest girl in her math class
These are simple enough and even imply some things about how to “prove” or “disprove” them. Notice that we could, and probably should, add more details to make it very clear how we intend to measure and interpret results. We could add, “…as measured by the school nurse using the usual method.” It cannot be emphasized enough just how important it is to clearly state hypotheses. This goes a long way toward driving a successful testing program.
Things get more complicated as soon as we start making measurements and realize that there is some fuzziness around them. If we measure Anna’s height multiple times we will get slightly different numbers. This is not a problem if Anna is 3 inches taller than everyone else, but it becomes quite a problem if one or more girls are very close to Anna’s height. We now have to pass judgment on the probability that Anna is actually the tallest. This leads to two possible errors, shown below in a “truth table”:
| Hypothesis | Actual State | Measurements Say Anna Is Tallest | Measurements Say Anna Is Not the Tallest |
| --- | --- | --- | --- |
| H0 = Anna is the tallest | H0 true | Right answer | We incorrectly reject H0: Type I Error (α error) |
| H1 = Anna is not the tallest | H0 false | We incorrectly accept H0: Type II Error (β error) | Right answer |
In words the two types of errors are:
Type I: H0 is actually true but our data tells us that it is false. Hence, we draw a wrong conclusion. We think H1 is true when it is not.
Type II: H0 is actually false but our data tells us that it is true. Hence, we draw the wrong conclusion. We think H0 is true when it is not.
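To make the two error types concrete, here is a small Monte Carlo sketch in Python. All of the numbers (the girls’ true heights, the measurement scatter, and the three measurements per girl) are invented for illustration; the point is simply to show how noisy measurements produce each kind of wrong answer.

```python
# Minimal Monte Carlo sketch of Type I and Type II errors, with made-up numbers.
import numpy as np

rng = np.random.default_rng(0)
n_trials = 20_000
n_measurements = 3          # measurements averaged per girl (assumed)
sigma = 0.4                 # measurement standard deviation, inches (assumed)

def measurements_say_anna_tallest(anna_true, rival_true):
    """Return True if the averaged noisy measurements say Anna is the taller girl."""
    anna_obs = rng.normal(anna_true, sigma, n_measurements).mean()
    rival_obs = rng.normal(rival_true, sigma, n_measurements).mean()
    return anna_obs > rival_obs

# H0 is true: Anna really is 0.3 inches taller than her closest rival.
type_i = sum(not measurements_say_anna_tallest(65.3, 65.0)
             for _ in range(n_trials)) / n_trials

# H0 is false: the rival is actually 0.3 inches taller than Anna.
type_ii = sum(measurements_say_anna_tallest(65.0, 65.3)
              for _ in range(n_trials)) / n_trials

print(f"Type I rate  (reject a true H0):  {type_i:.3f}")
print(f"Type II rate (accept a false H0): {type_ii:.3f}")
```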
Let’s now consider a situation that is a little more interesting, an example that might occur with a pilot plant:
Case 1: We believe that the yield in our pilot plant is highly dependent on how we start up the catalytic reactor. There are several different ways to do this that add no significant cost or risk to our operation. We decide we want to compare our current procedure (Procedure A) to a new procedure (Procedure B). We realize that measuring yield is difficult and expensive, but small changes in yield could make us a lot of money in the long run.
Because we would really hate to lose out on potential earnings from a better start-up procedure, we set up our hypotheses and testing as follows:
- We set our H0 as “Procedure B gives us higher yield than Procedure A,” and
- We set our α error quite low – say 0.05 (or 5%).
We set up our testing program so that Procedure B will be rejected only if we are 95% sure that it is not better than Procedure A. This way we are pretty sure that we will not erroneously reject a better procedure. This is a very typical approach in a development process. There are many statistical programs for doing just this type of analysis; in fact, many of them are set up to deal only with Type I Error.
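A minimal sketch of this kind of analysis is shown below, with invented yield numbers and the comparison treated as a one-sided two-sample t-test (one reasonable translation of the stated H0). Note that the calculation controls only the α risk; nothing in it says anything about β.

```python
# Sketch of the usual "Type I Error only" analysis for Case 1 (hypothetical data).
from scipy import stats

yields_a = [91.8, 92.1, 91.9]   # Procedure A (current), % yield (invented)
yields_b = [92.4, 92.7, 92.3]   # Procedure B (new), % yield (invented)

alpha = 0.05

# H0 as framed in the text: Procedure B yields at least as much as Procedure A.
# We reject Procedure B only if the data say, at the 5% level, that it is worse.
result = stats.ttest_ind(yields_b, yields_a, equal_var=False, alternative='less')
print(f"t = {result.statistic:.2f}, one-sided p = {result.pvalue:.3f}")

if result.pvalue < alpha:
    print("Reject H0: the data say B is worse; drop Procedure B.")
else:
    print("No grounds to reject H0; keep Procedure B.")
    # Nothing above controls beta, the risk of keeping a B that is not really better.
```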
There’s a problem here that often gets missed. We have just biased our development program to accept hypotheses that may not be true. As we reduce our α error (the chance of making a Type I Error), we actually increase our chance of making a Type II Error, the chance of accepting a hypothesis that is not true. If we persist in this, we will frequently be making changes that are not demonstrably better and might even be worse than our current process. This happens with surprising frequency in development projects. It is a random scope drift that stems directly from a flawed handling of hypotheses. In the current example we could imagine many different start-up procedures that would give similar yields. These might be accepted or rejected more or less randomly, and we would never actually settle on a “best” procedure but would continually cycle between similar, hardly distinguishable procedures.
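The trade-off is easy to see with a quick power calculation. The sketch below assumes a fixed design (a true improvement of one standard deviation and four replicates of each procedure, both made-up numbers) and uses the conventional “no improvement” null that power calculators work with; the same arithmetic shows β growing as α is tightened.

```python
# Sketch of the alpha/beta trade-off for a fixed design (assumed effect size and n).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
effect_size = 1.0       # assumed true improvement divided by the measurement std dev
n_per_procedure = 4     # assumed replicates of each start-up procedure

for alpha in (0.20, 0.10, 0.05, 0.01):
    power = analysis.power(effect_size=effect_size, nobs1=n_per_procedure,
                           alpha=alpha, ratio=1.0, alternative='larger')
    print(f"alpha = {alpha:.2f}  ->  beta = {1 - power:.2f}")
```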
One approach would be to change the way we state our hypotheses. We could state H0 in the negative (i.e., Procedure B is NOT better than Procedure A). This would have the effect of exactly reversing Type I and Type II Errors. In other words, we would rarely accept a procedure that wasn’t demonstrably better than the current procedure. That has some advantages, but we would also reject many procedures that actually were better than the current one. This approach would do little to help us make a balanced assessment of the comparability of different procedures. The only way to really make a balanced assessment is to deal with the issue of Type II Error directly. This is rarely done in beginning statistics textbooks. In fact, one intermediate university-level text flatly states that “a thorough treatment of Type II Errors is beyond the scope of this book.” This is not unexpected, since a rigorous treatment of Type II Errors requires some fairly sophisticated mathematics. Fortunately, there are useful ways to estimate the probability of a Type II Error without resorting to the exact treatment.
The key to controlling Type II Errors is to set up our testing so that any real difference between H0 and H1 that could go undetected is small enough that we no longer care about it. In other words, even if we draw the wrong conclusion about H0, it really doesn’t matter much to us. We are setting a “level of indifference.” We then look at our ability to measure the key responses and increase the number of tests until we can just “see” that level of indifference with some measurable (or estimable) probability.
In our current example, there is probably some level of difference between Procedure A and Procedure B below which we really don’t care. That is to say, we aren’t going to change to Procedure B for just any slight improvement. That becomes our “level of indifference,” and we set up our testing so that we can detect differences down to that level. To do this we must evaluate our ability to measure (the standard deviation of our measurements) and combine that with an adequate number of tests (i.e., trials) to get an acceptable level of confidence.
We realize that with our testing set up this way (α = 0.05), we will accept a lot of possible start-up procedures, even some that do us very little good. In fact, we are 95% sure that we won’t be throwing out a new procedure that might be good. Nevertheless, we don’t want to accept any new procedure that doesn’t have a reasonably good chance of giving us some useful improvement. We generally set this probability quite a bit higher than our α level. This is called our β level, and for general screening work it is often set at 0.20. This means that we have an 80% chance that an accepted new procedure will give us an improvement equal to or greater than our “level of indifference.” When we approach Type II Error in this way we obtain a relatively simple procedure for dealing with it.
The process is:
- Set the “level of indifference,”
- Measure or estimate the standard deviation of the pertinent measurement,
- Set the level of acceptable Type II Error (often 20%),
- Estimate the number of samples required using the formula below.

n = (zα + zβ)² s² / (uα – uβ)²

where:

n = number of replicates
zα and zβ = the “z scores” for the α and β probabilities, respectively
s = the standard deviation of the measurement of u
(uα – uβ) = the “level of indifference,” expressed as a difference in means
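A direct implementation of the formula above is sketched below; the z scores come from the standard normal distribution, and the input numbers are purely hypothetical.

```python
# Sketch: estimate the number of replicates from alpha, beta, s, and the
# level of indifference, using one-sided z scores (hypothetical inputs).
from scipy.stats import norm

def replicates_needed(alpha, beta, s, indifference):
    """n = (z_alpha + z_beta)**2 * s**2 / (u_alpha - u_beta)**2."""
    z_alpha = norm.ppf(1 - alpha)   # z score for the alpha probability
    z_beta = norm.ppf(1 - beta)     # z score for the beta probability
    return ((z_alpha + z_beta) * s / indifference) ** 2

# Hypothetical example: detect a 0.5% yield change with a 0.6% measurement std dev.
n = replicates_needed(alpha=0.05, beta=0.20, s=0.6, indifference=0.5)
print(f"replicates needed: {n:.1f}  (round up to the next whole test)")
```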
Several software programs will perform the calculation above, including StatEase Design Expert 9.0.
Let us now work through a typical example using the scenario of Case 1 above. Let’s say that we are indifferent to start-up procedures that produce less than a 1% change in yield. Let’s also assume that we have measured yields many times in our pilot plant and can hold a standard deviation of 0.25%. Finally, let’s assume that we are screening many start-up procedures and want to set our α at 0.05 and our β at 0.20. When we analyze these numbers we see that we need 5 to 6 runs to reduce our Type II Error to less than 20%. If this were a very simple test with only these two procedures (Procedure A and Procedure B), we would set up our testing with three randomized comparisons of the two procedures, for a total of six start-up tests.
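One way to sanity-check a run count like this is to compute the β that the proposed design actually delivers. The sketch below assumes the six runs will be analyzed as a one-sided two-sample t-test with three replicates per procedure; with the numbers above, the calculated β should come out well under the 20% target.

```python
# Sketch: beta achieved by the proposed 3-vs-3 design, assuming the yields are
# compared with a one-sided two-sample t-test.
from statsmodels.stats.power import TTestIndPower

indifference = 1.0      # smallest yield change (%) we still care about
s = 0.25                # standard deviation of a yield measurement (%)
alpha = 0.05
n_per_procedure = 3     # three start-ups with each procedure (six runs total)

effect_size = indifference / s
power = TTestIndPower().power(effect_size=effect_size, nobs1=n_per_procedure,
                              alpha=alpha, ratio=1.0, alternative='larger')
print(f"power = {power:.2f}, so beta = {1 - power:.2f} at the level of indifference")
```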
Would we be done? Probably not. If our testing showed that Procedure B was much better than Procedure A, say by more than 5%, we would feel quite confident that we were on to something. Even then, we would probably want to repeat a verification test from scratch. There is always the outside chance that our whole procedure was screwed up (did we label the procedures incorrectly?) or was compromised by some exogenous, uncontrolled factor (a change in catalyst?).
If Procedure B was only slightly more than 1% better than Procedure A, we would be even more cautious. There is still nearly a 20% chance that we are wrong about Procedure B. If we decide that we really want to go after that extra 1%+ in yield, we would want to repeat our test and perhaps even redesign it to be more convincing.