We often need to know how randomness in several variables interacts to create randomness in a derived result. This can occur when we are trying to project complex outcomes from multiple factors or to estimate the error in a measurement that is the result of several independent measurements. An example of the first type might be trying to predict future profitability from estimated future costs and incomes. An example of the second type might be trying to estimate chemical yield when we have uncertainty in the measurements of both raw materials and finished products.
These are two typical examples of what is known as “propagation of error.” Sometimes the most practical approach is to resort to a Monte Carlo Simulation to estimate the “propagated error.”
Just about a year ago we gave a general method for estimating the propagated error in a mathematical model. We began with simple examples such as the difference between two measured or estimated values. One example might be the estimation of moisture content from a simple drying experiment. The moisture content might simply be the difference in mass of the sample before and after drying. This is the difference of two measured numbers. For sums and differences we showed that the variance of the total measurement is simply the sum of the variances of the individual measurements.
We then proceeded to more complex calculations and finally to a general statement. We stated the final conclusion in terms of a function of random variables. We repeat the conclusions below:
Let x, y, z, ... represent random variables whose true values are X, Y, Z, .... Let u represent a derived quantity whose true value is given by:
$U = f(X, Y, Z, \ldots)$ Equation 1
Let $\varepsilon_1, \varepsilon_2, \varepsilon_3, \ldots$ represent the statistically independent and relatively small errors of x, y, z, ... respectively. Then the error induced in u, which is denoted as $\xi$, as a result of the errors $\varepsilon_1, \varepsilon_2, \varepsilon_3, \ldots$, has a variance equal to:
$V(\xi) = (\partial f/\partial X)^2 V(\varepsilon_1) + (\partial f/\partial Y)^2 V(\varepsilon_2) + (\partial f/\partial Z)^2 V(\varepsilon_3) + \ldots$ Equation 2
Where $\partial f/\partial X$ is the partial derivative of f with respect to X, evaluated at the true values.
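To make Equation 2 concrete, here is a minimal Python sketch that approximates the partial derivatives numerically with central differences. The function name `propagate_variance`, the step size, and the example weights and variances are all assumptions made for illustration; they are not taken from the earlier post.

```python
def propagate_variance(f, means, variances, rel_step=1e-6):
    """Approximate Equation 2: sum of (partial derivative)^2 * variance.

    f         -- the model, called as f(x, y, z, ...)
    means     -- best estimates of the inputs (used as the "true values")
    variances -- variance of each input, in the same order
    """
    total = 0.0
    for i, (m, v) in enumerate(zip(means, variances)):
        h = rel_step * max(abs(m), 1.0)      # small step for this input
        up = list(means)
        up[i] = m + h
        down = list(means)
        down[i] = m - h
        dfdx = (f(*up) - f(*down)) / (2 * h)  # central-difference derivative
        total += dfdx ** 2 * v
    return total


# Hypothetical drying experiment: moisture = wet mass - dry mass,
# each weighing assumed to have a variance of 0.01 g^2.
moisture_var = propagate_variance(lambda wet, dry: wet - dry,
                                  [10.0, 8.5], [0.01, 0.01])
print(moisture_var)  # close to 0.02, the sum of the two variances
```

For a simple difference the result reduces to the sum of the two variances, which matches the rule for sums and differences quoted above.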
You may want to review the full Blog by visiting:
http://tek-dev.typepad.com/technology-development/2015/02/estimating-total-process-error.html .
It was noted that for this approach to be valid the following conditions needed to be met:
- The mathematical model must be known,
- The variance of each factor must be known,
- The model equation must be differentiable, and
- The variance of each factor must be “small”
The first two conditions are not trivial, but they often can be met. The third condition occasionally cannot be met with non-linear functions (logarithmic, inverse, or power functions), for example at points where the function or its derivative is undefined. The fourth condition is frequently difficult to meet, especially with Binomial and Poisson distributed factors. When this occurs, a Monte Carlo Simulation may give the best results.
A Monte Carlo Simulation consists of basically four parts:
1. Construct a mathematical model that expresses the interaction of all the variables of interest,
2. Measure or estimate the statistical distribution of the variables of interest,
3. Create independent tables of the variables of interest that represent a “fair” distribution of possible inputs, and
4. Create a distribution of results by populating the model with randomly selected possible inputs.
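The four parts can be sketched in a few lines of Python using only the standard library. This is an illustrative outline, not the Excel procedure used later in this post; the helper name `monte_carlo`, the toy model, and its input distributions are assumptions chosen only to show the flow.

```python
import random
import statistics


def monte_carlo(model, samplers, n_trials, seed=0):
    """Run n_trials of model(), drawing one value from each sampler per trial.

    model    -- part 1: the mathematical model, called as model(a, b, ...)
    samplers -- part 2: one function per input; each takes a random.Random
                and returns one "fair" draw from that input's distribution
    """
    rng = random.Random(seed)  # fixed seed so the run can be repeated
    results = []
    for _ in range(n_trials):
        inputs = [draw(rng) for draw in samplers]  # part 3: fair inputs
        results.append(model(*inputs))             # part 4: populate the model
    return results


# Toy example (hypothetical): the sum of two normally distributed inputs.
results = monte_carlo(
    model=lambda a, b: a + b,
    samplers=[lambda r: r.gauss(10.0, 1.0), lambda r: r.gauss(20.0, 2.0)],
    n_trials=10_000,
)
print(statistics.mean(results), statistics.variance(results))
```

Passing the samplers as functions keeps the loop the same no matter which distributions the inputs follow.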
This is best illustrated by an example. Let us suppose that we have come up with a very spiffy model that predicts the recovery of a liquid product from a gas stream by a gas impingement system. Let us suppose that the recovery depends on the square of the particle velocity and the velocity of the gas stream. To keep it simple let us assume that the yield in useful units to us can be given by:
$\text{Yield (kg/hr)} = X^2 + Y$ Equation 3
Where:
X = Velocity of the particle (meters/sec)
Y = Velocity of the gas (meters/sec)
Now let us suppose that we cannot control the liquid particle and gas flows perfectly and we are wondering how much our yield would vary. Of course we could just build our plant and try things out, but this could be a pretty expensive way to go at the problem. Suppose we are told by a manufacturer of various nozzles and systems that we could expect the linear velocity of the particles to be 5 meters/sec with a variance of 0.25 (meters/sec)^2 and the gas velocity to be 8 meters/sec with a variance of 0.25 (meters/sec)^2. What would we expect the variability of the Yield to be? Let us estimate this both ways.
This particular example more or less meets the criteria for a propagation of error calculation. Hence, we can use Equations 1 and 2. When we do the partial derivatives and apply them to Equation 3 above we get:
$V(\text{Yield}) = 4X^2 \, V(X) + V(Y)$ Equation 4
Thus the Yield is estimated to be 33 kg/hr with a variance of 25.25 (kg/hr)^2, which is a standard deviation of about 5.0 kg/hr (of course we would round to no more than 2 digits).
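For readers who want to check the arithmetic, the calculation can be written out step by step, using the manufacturer's figures of X = 5, Y = 8 and a variance of 0.25 for each:

```latex
\begin{aligned}
\frac{\partial f}{\partial X} &= 2X, \qquad \frac{\partial f}{\partial Y} = 1 \\
V(\text{Yield}) &= (2X)^2 \, V(X) + (1)^2 \, V(Y) \\
                &= (2 \cdot 5)^2 (0.25) + 0.25 = 25 + 0.25 = 25.25 \\
\sqrt{V(\text{Yield})} &= \sqrt{25.25} \approx 5.02 \\
\text{Yield} &= 5^2 + 8 = 33
\end{aligned}
```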
Now let us show how to solve this problem using a Monte Carlo simulation. Let us first create a table of values for X and Y that represent a number of “fair” values distributed like our manufacturer said. This is actually easily done in Excel using a Data Analysis Add-In. Rather than try to explain how this is done, the reader is encouraged to view the short YouTube video posted at https://www.youtube.com/watch?v=PXDRTl8_fVM or click on the graphic below:
If you view the YouTube video you will see that we created 300 random numbers (X’s) with a mean of 5 and a standard deviation of 0.5, 300 random numbers (Y’s) with a mean of 8 and a standard deviation of 0.5, and then 300 results according to the equation $X^2 + Y$. Then we calculated the mean, standard deviation and variance of the 300 results. We got:
Mean = 33.38697
STD = 4.957101
VAR = 24.57285
You can see that these results are very close to our propagation of error calculation. This makes sense, of course, because the criteria for the propagation of error calculation are met. If, however, these criteria had not been met, the Monte Carlo Simulation would have given us much more useful results.
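For readers who would rather not use Excel, the same experiment can be sketched in Python with the standard library. This is not the spreadsheet from the video; the seed is an arbitrary choice, and because the random numbers differ, the mean, standard deviation and variance will not match the figures above digit for digit.

```python
import random
import statistics

rng = random.Random(2016)  # arbitrary fixed seed so the run is repeatable
N = 300                    # same number of draws as in the video

# Independent tables of "fair" inputs: mean 5 and 8, standard deviation 0.5.
x_table = [rng.gauss(5.0, 0.5) for _ in range(N)]
y_table = [rng.gauss(8.0, 0.5) for _ in range(N)]

# Populate the model (Equation 3) row by row.
yields = [x ** 2 + y for x, y in zip(x_table, y_table)]

mean_yield = statistics.mean(yields)
std_yield = statistics.stdev(yields)     # sample standard deviation
var_yield = statistics.variance(yields)  # sample variance

print(f"Mean = {mean_yield:.3f}")
print(f"STD  = {std_yield:.3f}")
print(f"VAR  = {var_yield:.3f}")
```

Changing the seed and running again gives a feel for how much the three statistics move from one set of 300 draws to the next.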
The Monte Carlo Simulation methodology has a number of advantages over a simple propagation of error estimation. It can be especially helpful when the variables behave very differently from one another. This happens frequently when the variables are a mix of continuous variables, which are usually Normally distributed, and discrete variables, which are often Binomial or Poisson distributed. The Monte Carlo Simulation can be used with any variable where a pile of representative results can be tabulated and then drawn from at random to form multiple result calculations. Hence, it can be used when a random variable fits no particular statistical model at all. And finally, a Monte Carlo Simulation can give indications of skewness that are not apparent in a simple propagation of error estimation.
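As a sketch of the mixed case, the Python below combines a discrete Binomial input with an input that follows no named distribution and is simply resampled from a table of past results. Everything in it is hypothetical: the model (units passed times mass per unit), the table of observed masses, and the Binomial parameters are invented for illustration only.

```python
import random
import statistics

rng = random.Random(42)  # arbitrary fixed seed so the run is repeatable

# Hypothetical "pile of representative results": past mass per unit, in kg.
observed_mass = [1.02, 0.98, 1.05, 0.97, 1.01, 1.10, 0.95, 1.00]


def draw_binomial(rng, n, p):
    """One Binomial(n, p) draw, built by counting n yes/no trials."""
    return sum(1 for _ in range(n) if rng.random() < p)


results = []
for _ in range(5000):
    mass = rng.choice(observed_mass)      # resample the tabulated results
    passed = draw_binomial(rng, 20, 0.9)  # units passing out of 20 (assumed)
    results.append(mass * passed)         # hypothetical model

mean_result = statistics.mean(results)
var_result = statistics.variance(results)
print(mean_result, var_result)
```

No derivative is needed anywhere, which is why the approach still works when the inputs are discrete or fit no model at all.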
In our current example, the non-linear X^2 term introduces skewness that would not be evident in a simple propagation of error estimation. If we again use an Excel Data Analysis Add-In (this time the Histogram) we can plot a frequency chart that shows us that the distribution of results is not symmetric. This is shown below:
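As a numerical companion to the histogram, the asymmetry can also be checked with a skewness statistic and a rough text histogram. The Python sketch below is an assumption-laden illustration rather than the Excel chart: it uses 20,000 draws instead of 300 so that the skewness estimate is less noisy, and the seed and the bin width of 2 kg/hr are arbitrary choices.

```python
import random
import statistics

rng = random.Random(7)  # arbitrary fixed seed
N = 20000               # more draws than the 300 in the post (assumption)

yields = [rng.gauss(5.0, 0.5) ** 2 + rng.gauss(8.0, 0.5) for _ in range(N)]

mean = statistics.mean(yields)
std = statistics.pstdev(yields)
# Skewness: average cubed standardized deviation; 0 for a symmetric shape.
skewness = sum(((y - mean) / std) ** 3 for y in yields) / N
print(f"skewness = {skewness:.3f}")

# Rough text histogram with bins 2 kg/hr wide, one '#' per 100 results.
bin_width = 2.0
counts = {}
for y in yields:
    left = bin_width * (y // bin_width)  # left edge of the bin holding y
    counts[left] = counts.get(left, 0) + 1
for left in sorted(counts):
    bar = "#" * (counts[left] // 100)
    print(f"{left:5.0f} to {left + bin_width:3.0f} | {bar}")
```

A positive skewness means the right-hand tail is the longer one, which is what squaring X would lead us to expect.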
One word of caution is needed here. It can take a surprisingly large number of data points for a Monte Carlo Simulation to give good results, especially for the variance. Just remember from the Central Limit Theorem that the mean of any data set derived from normally distributed variables quickly approaches the same mean as that of a very large set. This is not true of the variance. Even for normally distributed variables, the number of simulations needed before the variance becomes stable can be large. When the variables have mixed distributions, the numbers grow larger and less predictable. It is best to run the simulation several times with what seems to be a large number of data points and compare the resulting means and variances. If they seem stable, then you probably have enough data points. If not, you should double the number of points and try again. You should repeat the process until the variability in outcomes meets your data quality objectives.
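The run-compare-double routine just described can be automated. The Python sketch below is one possible reading of it, applied to the Yield example: the five repeats per size, the 10% tolerance on the spread of the variances, and the upper limit on the number of points are assumptions standing in for your own data quality objectives.

```python
import random
import statistics


def simulate_variance(n, rng):
    """Variance of n simulated yields for Equation 3 (X^2 + Y)."""
    ys = [rng.gauss(5.0, 0.5) ** 2 + rng.gauss(8.0, 0.5) for _ in range(n)]
    return statistics.variance(ys)


def find_stable_n(n_start=300, repeats=5, tolerance=0.10,
                  max_n=200_000, seed=1):
    """Double n until repeated runs agree on the variance within tolerance."""
    rng = random.Random(seed)
    n = n_start
    while True:
        variances = [simulate_variance(n, rng) for _ in range(repeats)]
        # Relative spread: (largest - smallest) / average of the variances.
        spread = (max(variances) - min(variances)) / statistics.mean(variances)
        if spread <= tolerance or n >= max_n:
            return n, spread, variances
        n *= 2  # not stable yet: double the number of points and retry


n_needed, spread, variances = find_stable_n()
print(n_needed, round(spread, 3))
```

A tighter tolerance or more repeats per size will push the required number of points higher, so the stopping rule should be set from the precision you actually need.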
In conclusion, Monte Carlo Simulations can:
1. Be very helpful in estimating the error and distribution of error for derived calculations from experimental data or predicting the error and distribution of error around projections,
2. Be easily set up in Excel based on your knowledge of input variables and their relationships,
3. Be applied to a wide range of distributions and relationships, and
4. Give additional information about the resulting distribution curve that might not otherwise be apparent.
Stites & Associates, LLC, is a group of technical professionals who work with clients to improve laboratory performance and evaluate and improve technology by applying good management judgment based on objective evidence and sound scientific thinking. For more information see: www.tek-dev.net.