Next: 5.2 Linear regression Up: 5. Data analysis Previous: 5. Data analysis   Contents

5.1 Averages, medians, standard deviation, and all that

In both experimental and computational physics we often deal with data or results that have uncertainties. For example, suppose we attempt to measure the distance to the Moon by bouncing a laser beam off a mirror left behind by the Apollo astronauts. Because we cannot exert perfect control over the timing operation, repeated measurements will not give exactly the same answer each time. A systematic way to deal with uncertainties is to apply the ideas of statistical analysis. If we take a large number of measurements, about half will be too small and half too large, so the average of all our measurements (also called the mean) will be close to the ``true'' answer. Figure 5.1 shows two sets of data for the Moon-Earth distance experiment, perhaps taken with different equipment or by different scientists. Estimate the average pulse travel time for each data set. Which data set has the larger average?
Figure: Histograms of Moon-Earth distance data. The measurements record the time for a pulse to travel to the Moon and back to the Earth. The Moon-Earth distance is computed from the round-trip time by $d=\frac{1}{2}ct$. The upper and lower panels are two different data sets for the same experiment.
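As a quick check of the caption's formula, a measured round-trip time can be converted to a distance in MATLAB (the variable names here are illustrative, not from the original):

```matlab
c = 2.998e8;   % speed of light (m/s)
t = 2.53;      % measured round-trip pulse time (sec)
d = 0.5*c*t    % Moon-Earth distance from d = (1/2)ct, roughly 3.8e8 m
```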
The mean, or average, is calculated by summing all the data and dividing by the number of data points. In this case, both sets of data have similar averages. Both data sets shown in the figure look roughly like the well-known bell curve. This kind of distribution of data is ubiquitous; it is called a normal distribution or a Gaussian distribution.

Though the averages of these two data sets are nearly the same, we have a sense that the lower data set is less accurate because of the ``spread'' of the data. We can quantify the ``spread'' with the standard deviation of the data. The averages of both data sets are about 2.53 sec, but the standard deviation of the first is 0.1 sec while that of the second is 0.3 sec. We interpret this to mean that a typical measurement in the first data set lies within about 0.1 seconds of the true or actual value. In other words, we expect most measurements to fall between 2.43 and 2.63 seconds; the uncertainty in an individual measurement is $\pm0.1$ sec.

More important is the uncertainty in the mean, rather than in individual measurements, since it is the mean that gives us an estimate of the actual value. What we would like to know is the standard deviation of the mean, which is simple to calculate if we already know the standard deviation. If the standard deviation is $\sigma_x$ and there are $N$ measurements in the data set, then the standard deviation of the mean is simply $\overline{\sigma}_x=\frac{\sigma_x}{\sqrt{N}}$. For the data in the top of Fig. 5.1, the standard deviation of the mean is 0.01 seconds, so we would report the average value as $2.53\pm0.01$, indicating that we believe the actual value to be between 2.52 and 2.54 seconds.

We can generate some sample data similar to that shown in Fig. 5.1 and then calculate the averages, medians, standard deviations, etc., with MATLAB's built-in commands. MATLAB has many built-in functions for data analysis and statistics; try help datafun. Here are some examples to try.
% Generate sample time measurement data
clear all; close all;
data1 = 0.1*randn([100 1]) + 2.55;   % mean 2.55 sec, standard deviation 0.1 sec
data2 = 0.3*randn([100 1]) + 2.50;   % mean 2.50 sec, standard deviation 0.3 sec

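With the sample data in hand, the statistics discussed above can be computed directly (the variable names here are my own; the exact values will vary from run to run because the data are random):

```matlab
m1 = mean(data1);                  % average of each data set
m2 = mean(data2);
s1 = std(data1);                   % standard deviation, near 0.1 sec
s2 = std(data2);                   % standard deviation, near 0.3 sec
sem1 = s1/sqrt(length(data1));     % standard deviation of the mean, sigma/sqrt(N)
sem2 = s2/sqrt(length(data2));
med1 = median(data1);              % median, for comparison with the mean
fprintf('Data set 1: %.3f +/- %.3f sec\n', m1, sem1)
fprintf('Data set 2: %.3f +/- %.3f sec\n', m2, sem2)
```

Note that sem1 should come out roughly ten times smaller than s1, since $N=100$ and $\sqrt{100}=10$.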
If your data values are measurements of a quantity that should, in principle, be the same for each measurement, it is useful to plot a histogram of the values to get a sense of the average, the ``spread'', and the ``shape'' of the distribution. Try hist(data1,20) and hist(data2,20). (Notice the different scales on the x-axis of each histogram.)
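To reproduce the stacked layout of Fig. 5.1, the two histograms can be drawn in one figure with subplot (a sketch; the labels and titles are my own additions):

```matlab
figure
subplot(2,1,1)
hist(data1,20)                        % upper panel: the narrower data set
xlabel('pulse travel time (sec)')
title('Data set 1')
subplot(2,1,2)
hist(data2,20)                        % lower panel: the wider data set
xlabel('pulse travel time (sec)')
title('Data set 2')
```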
Gus Hart 2005-01-28