
5.2 Linear regression

If the data you are analyzing depends on a parameter (e.g., the position of a falling particle depends on time), then we can't simply take an average of the data (the points are not all supposed to be the same!). In cases like these, we must ``fit'' a function to our data. Picking the kind of function to fit (a line, a polynomial, an exponential, etc.) relies on intuition and an understanding of the physics involved. In this section, we will discuss how to fit a straight line to a set of data that appears to have a linear form. Recall a lab from PHY 161L where you rolled a racquetball down a ramp and then recorded the elapsed time as the ball rolled down the hall past each of a series of stations placed at 1 meter intervals. A plot of sample data is shown in Fig. 5.2. Since the ball experiences very little friction, its velocity is nearly constant. The velocity of the ball can be deduced by determining the slope of the $x(t)$ line. But because of imperfect measurement of the times at which the ball passed each station, the data doesn't follow a straight line exactly. What we need to do is draw a line through the data, that is, a line that comes as near as possible to each of the points. How do we draw a line that gives the ``best fit'' to the data, and what do we mean by ``best fit''?
Figure 5.2: $x(t)$ data for a ball rolling with very little friction.
\includegraphics[width=0.5\linewidth]{balldata.eps}
Since we suspect that the function we want is a straight line, we know that its form will be $f(x) = a_1 + a_2x$, where $a_1$ is the y-intercept of the line and $a_2$ is the slope. Our problem of determining the ``best fit'' line then becomes the problem of finding the values $a_1$ and $a_2$ that define a line coming as close as possible to each point in the data set. We can quantify what we mean by ``coming close'' by calculating, for each data point, the difference between the data point and our chosen line. The $i$-th data point at $x_i$ is $y_i$, and the value of our function (a straight line) at $x_i$ is $a_1+a_2x_i$. The difference between these two y-values is the ``error'' in the fit for this point: $\Delta_i=a_1+a_2x_i-y_i$. This is the error in the fit for one data point. What we would like is an ``average'' error for all of the data points. The obvious way to do this is to just take the mean of all of the errors, $\overline\Delta=\frac{1}{N}\sum_{i=1}^N\Delta_i$. Obvious but perilous. This doesn't work because the average might turn out to be zero even though individual errors could be large (negative and positive errors can cancel each other out). Thus a line that doesn't come near any of the points could have a near-zero average error using this approach. Instead, we square the individual errors before summing them so that negative and positive contributions cannot cancel. Thus, the quantity that we want to minimize is

\begin{displaymath}
\chi^2(a_1,a_2)=\sum_{i=1}^N\Delta_i^2=\sum_{i=1}^N[a_1+a_2x_i-y_i]^2
\end{displaymath}
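To make this concrete, here is a minimal MATLAB sketch that evaluates $\chi^2$ for one candidate line (the data vectors and the trial coefficients are made up for illustration):

\begin{verbatim}
% Evaluate chi-squared for one trial line.
% The data (x,y) and the coefficients a1, a2 are
% made up for illustration.
x = [1 2 3 4 5];            % independent variable (e.g., time)
y = [2.1 3.9 6.2 7.8 10.1]; % measured values
a1 = 0.1;  a2 = 2.0;        % trial intercept and slope
Delta = a1 + a2*x - y;      % error at each data point
chi2 = sum(Delta.^2)        % sum of squared errors
\end{verbatim}

A smaller value of chi2 means the trial line lies closer to the data; the fitting problem is to find the $a_1$ and $a_2$ that make it as small as possible.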

In defining $\chi^2(a_1,a_2)$ above, we have replaced $\overline\Delta$ with $\chi^2$ to be consistent with standard notation (we have also dropped the overall factor of $1/N$, which has no effect on where the minimum occurs). Our task now becomes finding a way to choose $a_1$ and $a_2$ so that $\chi^2$ is minimized. This is an example of the technique known as least-squares fitting, for reasons that should now be obvious. How are we going to do that? Use the idea we learned in calculus class for finding minima and maxima: take the derivative of the function and set the result equal to zero (recall that the slope of a function is zero at all of its minima and maxima), then solve for the constants we are trying to determine, $a_1$ and $a_2$ in our case. Doing this for our case yields

\begin{displaymath}
\frac{\partial\chi^2}{\partial a_1}=2\sum^N_{i=1}(a_1+a_2x_i-y_i)=0
\end{displaymath}

and

\begin{displaymath}
\frac{\partial\chi^2}{\partial a_2}=2\sum^N_{i=1}(a_1+a_2x_i-y_i)x_i=0
\end{displaymath}

Note the slight difference here: the second expression has an extra factor of $x_i$. Dividing out the common factor of 2 and noting that $\sum^N_{i=1}a_1=Na_1$, these simplify to

\begin{displaymath}
Na_1+a_2\sum^N_{i=1}x_i-\sum^N_{i=1}y_i=0
\end{displaymath}

and

\begin{displaymath}
a_1\sum^N_{i=1}x_i+a_2\sum^N_{i=1}x_i^2-\sum^N_{i=1}x_iy_i=0
\end{displaymath}
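These two conditions are simply a pair of simultaneous linear equations for $a_1$ and $a_2$ (often called the normal equations). In matrix form,

\begin{displaymath}
\left(\begin{array}{cc} N & \sum^N_{i=1}x_i \\ \sum^N_{i=1}x_i & \sum^N_{i=1}x_i^2 \end{array}\right)
\left(\begin{array}{c} a_1 \\ a_2 \end{array}\right)=
\left(\begin{array}{c} \sum^N_{i=1}y_i \\ \sum^N_{i=1}x_iy_i \end{array}\right)
\end{displaymath}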

The sums appearing in these equations are easy to evaluate from the data, and solving the pair of equations for $a_1$ and $a_2$ gives

\begin{displaymath}
a_1=\frac{\sum y\sum x^2-\sum x\sum xy}{N\sum x^2-(\sum x)^2};\qquad
a_2=\frac{N\sum xy-\sum x\sum y}{N\sum x^2-(\sum x)^2}
\end{displaymath}

where the indices on the sums have been dropped for simplicity. We can generalize this approach to sets of data points with non-uniform error bars (in our example we assumed that all points were equally important, i.e., had the same error) or to functions other than a straight line. For more details you should look in a real book.
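As a quick numerical check of these formulas, here is a minimal MATLAB sketch (the data vectors are made up for illustration); MATLAB's built-in polyfit function, which performs a least-squares polynomial fit, should return the same slope and intercept:

\begin{verbatim}
% Least-squares slope and intercept from the formulas above.
% The data vectors x and y are made up for illustration.
x = [1 2 3 4 5];
y = [2.1 3.9 6.2 7.8 10.1];
N = length(x);
Sx  = sum(x);    Sy  = sum(y);
Sxx = sum(x.^2); Sxy = sum(x.*y);
D  = N*Sxx - Sx^2;           % common denominator
a1 = (Sy*Sxx - Sx*Sxy)/D     % y-intercept
a2 = (N*Sxy - Sx*Sy)/D       % slope
p  = polyfit(x,y,1)          % check: p = [slope intercept]
\end{verbatim}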