In both experimental physics and computational physics we often deal with
data or results in which there are uncertainties. For example, suppose we
attempt to measure the distance to the Moon by bouncing a laser beam off a
mirror left behind by the Apollo astronauts. Because we cannot exert
perfect control over the timing operation, we won't get exactly the same
answer each time we make a measurement. A systematic way to deal with
uncertainties is to apply the ideas of statistical analysis. If we take a
large number of measurements and the errors are random, about half will be
too small and half will be too large, so the average of all of our
measurements (also called the mean) will be close to the ``true'' answer.
Figure 5.1 shows two sets of data for your Moon-Earth distance
experiment, perhaps taken with different equipment or by different
scientists. Estimate the average pulse travel time for each data set. Which
data set has the larger average?
Figure 5.1:
Histograms of Moon-Earth distance data. The measurements
record the time $t$ for a pulse to travel to the Moon and back to the
Earth. The Moon-Earth distance is computed by $d = ct/2$, where $c$ is
the speed of light. The upper and lower panels are two different data
sets for the same experiment.
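As a quick illustration of how a travel time turns into a distance, here is a minimal sketch. It assumes the distance is half the round-trip time multiplied by the speed of light; the travel time used is an illustrative value, not one read from the figure.

```matlab
c = 299792458;   % speed of light in vacuum (m/s)
t = 2.5;         % illustrative round-trip time (s), not taken from the figure
d = c * t / 2;   % one-way distance: the pulse covers the distance twice
fprintf('distance = %.4g m (about %.0f km)\n', d, d/1000);
```

The factor of one half matters: forgetting it doubles the computed distance.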
The mean or the average is calculated merely by summing all the data and
dividing by the number of data points. In this case, both sets of data seem
to have similar averages. Both of the data sets
shown in the figure look roughly like the well-known bell curve. This kind
of distribution of data is ubiquitous. It
is called a normal distribution or a Gaussian
distribution. Though the averages of these two data sets are nearly the
same, we have a sense that the lower data set is less precise because of
the ``spread'' of the data. We can quantify what we mean by the ``spread''
by referring to the standard deviation of the data.
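The two quantities can be written out by hand and compared with the built-in functions. The sketch below uses a handful of made-up travel times (not the data in the figure) and assumes the usual sample standard deviation, normalized by N-1, which is the default convention of std.

```matlab
% Illustrative round-trip times in seconds (made-up values)
t = [2.43 2.55 2.61 2.50 2.56];
N = numel(t);

% Mean: sum the data and divide by the number of points
tbar = sum(t) / N;

% Sample standard deviation: root-mean-square deviation from the mean,
% normalized by N-1 (the default convention of std)
sigma = sqrt(sum((t - tbar).^2) / (N - 1));

% Compare with the built-in functions
fprintf('mean: %.4f (by hand)  %.4f (built-in)\n', tbar, mean(t));
fprintf('std : %.4f (by hand)  %.4f (built-in)\n', sigma, std(t));
```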
The average for both data sets is about 2.53 sec, but the standard deviation
of the first is 0.1 sec while it is 0.3 sec for the second. We interpret
this to mean that a typical measurement in the first data set lies within
about 0.1 seconds of the mean, our best estimate of the true value. In
other words, we expect most measurements (about two thirds of them, for
normally distributed data) to fall between 2.43 and 2.63 seconds. Or, the
uncertainty in each of our measurements is $\pm 0.1$ sec.
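This interpretation can be checked numerically. For a normal distribution, about 68% of the values lie within one standard deviation of the mean and about 95% within two. The sketch below simulates measurements with the mean and standard deviation quoted for the first data set; the sample size is an arbitrary choice.

```matlab
% Simulated measurements: mean 2.53 s, standard deviation 0.1 s
% (the sample size N is an arbitrary choice for this check)
N = 10000;
t = 0.1*randn(N, 1) + 2.53;

tbar = mean(t);
sigma = std(t);

% Fraction of measurements within one and two standard deviations of the mean
frac1 = sum(abs(t - tbar) <= sigma) / N;
frac2 = sum(abs(t - tbar) <= 2*sigma) / N;

fprintf('within 1 sigma: %.3f\n', frac1);
fprintf('within 2 sigma: %.3f\n', frac2);
```

Because the data are random, the printed fractions vary slightly from run to run.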
More important is the question of the uncertainty in the mean
rather than in the individual measurements, since it's the mean that gives
us an idea of the actual value. What we would like to know is the standard
deviation of the mean. The standard deviation of the mean is simple to
calculate if we already know the standard deviation. If the standard
deviation is $\sigma$ and there are $N$ measurements in the data set,
then the standard deviation of the mean is simply $\sigma/\sqrt{N}$. For
the data in the top of Fig. 5.1, the standard deviation of the mean is 0.01
seconds, so we would report the average value as $2.53 \pm 0.01$ sec,
indicating that we believe the actual value to be between 2.52 and 2.54
seconds.
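One way to see where the standard deviation of the mean comes from is to repeat the whole experiment many times and watch how much the mean itself scatters. The sketch below assumes 100 measurements per data set (the value implied by 0.1 sec and 0.01 sec) and an arbitrary number of repetitions.

```matlab
sigma = 0.1;              % standard deviation of a single measurement (s)
N = 100;                  % measurements per data set (assumed)
sdom = sigma / sqrt(N);   % standard deviation of the mean
fprintf('sigma/sqrt(N)        = %.4f s\n', sdom);

% Numerical check: simulate many repetitions of the experiment and
% measure the scatter of the resulting means.
M = 2000;                             % number of repetitions (arbitrary)
trials = sigma*randn(N, M) + 2.53;    % each column is one simulated data set
means = mean(trials, 1);              % one mean per column
fprintf('scatter of the means = %.4f s\n', std(means));
```

The two printed numbers should agree closely, which is the content of the square-root rule: quadrupling the number of measurements only halves the uncertainty in the mean.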
We can generate some sample data similar to that shown in
Fig. 5.1 and then calculate the averages, medians, standard
deviations, etc. with MATLAB's built-in commands. MATLAB has many
built-in functions for data analysis and statistics.
Try help datafun.
Here are some examples to try.
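% Generate sample time measurement data (100 points per data set)
clear all; close all;
data1 = 0.1*randn([100 1]) + 2.55;   % standard deviation 0.1 s, centered at 2.55 s
data2 = 0.3*randn([100 1]) + 2.50;   % standard deviation 0.3 s, centered at 2.50 s
% Averages
mean(data1)
mean(data2)
% Medians
median(data1)
median(data2)
% Standard deviations
std(data1)
std(data2)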
If your data values are measurements of a quantity that in principle should
be the same for each different measurement, it is useful to plot a
histogram of the values to get a sense of the average, the ``spread'', and
the ``shape'' of the distribution. Try hist(data1,20) and
hist(data2,20). (Notice the different scales on the x-axis of each
histogram.)
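To compare the two spreads directly, it can help to draw both histograms with the same bins and the same x-axis limits. The following is a minimal sketch; the bin range of 1.5 to 3.5 seconds is an assumption chosen to cover both data sets, and the data are regenerated here so the block runs on its own.

```matlab
% Sample data generated the same way as in the examples above
data1 = 0.1*randn(100, 1) + 2.55;
data2 = 0.3*randn(100, 1) + 2.50;

% Common bin centers so that both histograms share the same bins
centers = linspace(1.5, 3.5, 41);

subplot(2, 1, 1);
hist(data1, centers);
xlim([1.5 3.5]);
title('data1');

subplot(2, 1, 2);
hist(data2, centers);
xlim([1.5 3.5]);
title('data2');
xlabel('pulse travel time (s)');

% With an output argument, hist returns the bin counts instead of plotting
counts1 = hist(data1, centers);
counts2 = hist(data2, centers);
```

With a shared axis, the narrower first data set stands out immediately against the broader second one.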
Gus Hart
2005-01-28