Next: 5.3 Fitting with MATLAB
Up: 5. Data analysis
Previous: 5.1 Averages, medians, standard
  Contents
If the data you are analyzing depends on a parameter (e.g., the position of
a falling particle depends on time) then we can't take an average of the
data (not all the points should be the same!). In cases like these, we must
``fit'' a function to our data. Picking the kind of function to fit (a
line, a polynomial, exponential, etc.) relies on intuition and an
understanding of the physics involved. In this section, we will discuss how
to fit a straight line to a set of data that appears to have a linear form.
Recall a lab from PHY 161L where you rolled a racquetball down a ramp and
then recorded the elapsed time when the ball rolled down the hall past each
of a series of stations placed at 1 meter intervals in the hall. A plot of
sample data is shown in Fig. 5.2. Since the ball experiences
very little friction, it's velocity is nearly constant. The velocity of the
ball can be deduced by determining the slope of the
line. But
because of imperfect measurement of the times the ball passed each station,
the data doesn't follow a straight line exactly. What we need to do is a
draw a line through the data, that is, a line that comes as near as
possible to each of the points. How do we draw a line through the data that
gives the ``best fit'' to the data and what do we mean by ``best fit''?
Figure 5.2:
data for a ball rolling with very little friction.
|
|
Since we suspect that the function we want is a straight line, we know that
its form will be
where
is the y-intercept of the
line and
is the slope of the line. Our problem of determining the
``best fit'' line then becomes the problem of finding the values
and
that define a line that comes as close as possible to each point in
the data set. We can quantify what we mean by ``coming close'' by
calculating, for each data point, the difference between the data point and
our chosen line. The
-th data point at
is
and the value of our
function (a straight line) at
is
. The difference between
these two y-values is the ``error'' in the fit for this point:
. This is the error in the fit for one data
point. What we would like is an ``average'' error for all of the
data points. The obvious way to do this is to just take the mean of all of
the errors,
. Obvious but
perilous. This doesn't work because the average might turn out to be zero
even though individual errors could be large (negative and positive errors
can cancel each other out). Thus a line that doesn't come near any of the
point could have a near zero average error using this approach. Instead, we
square the individual errors before summing them so that negative
and positive errors won't cancel. Thus, the average error that we want to
minimize is
where we have now replaced
with
to be
consistent with normal notation. Our task now becomes finding a way to
choose
and
so that
is minimized. This is an
example of the technique known as least squares fitting for reasons
that should be obvious now.
How are we going to do that? Use the idea we learned in our calculus class
for finding minima/maxima. Take the derivative of the function and set the
result equal to zero (recall that the slope of a function is zero at all its
minima and maxima). Then solve for the constants we are trying to
determine,
and
in our case. Do this for our case yields
and
Note the slight differences here--the second expression has an extra
factor of
. Simplifying gives
and
The sums in these two expressions are easy to evaluate and the expression
for
and
are simply
where the indices on the sums have been dropped for simplicity.
We can generalize this approach for sets of data points with non-uniform
error bars (in our example we assumed that all points were equally
important, i.e., had the same error) or to other functions besides a
straight line. For more details you should look in a real book.
Next: 5.3 Fitting with MATLAB
Up: 5. Data analysis
Previous: 5.1 Averages, medians, standard
  Contents
Gus Hart
2005-01-28