The graph shows a simple straight-line example (Y = a + bX), a small set of data where X = concentration and Y = instrument reading. The blue dots are the data points. The least-squares algorithm computes the values of a (intercept) and b (slope) of the straight line that is a least-squares best fit to the data points. For this set of data, the slope is 9.7926 and the intercept is 0.199. The least squares procedure also calculates R2, the coefficient of determination, also called the correlation coefficient, which is an indicator of the "goodness of fit". R2 is exactly 1.0000 when the fit is perfect, less than that when the fit is imperfect. The closer to 1.0000 the better. The points do not all fall on the straignt line because of random noise in the data. (In this example, the "true" value of the slope is 10 and of the intercept is zero, so the presense of the noise caused this particular measurement of slope to be off by about 2%. Had there been a larger number of points in this data set, the calculated values of slope and intercept would probably have been better).
A simple way to evaluate a curve fit is to look at a plot of the
"residuals", the differences between the y values in the original data
and the y values computed by the fit equation. If the residuals are
completely random and show no systematic variation, then the fit is as
good as possible for that level of noise; but if the residual plot
shows a regular curve, for example a U-shaped or an S-shaped curve, it may mean that another fitting function would
be more appropriate.
Least-squares best fits can be calculated by some hand-held calculators, spreadsheets, and dedicated computer programs (see below for details). Although it is possible to estimate the best-fit straight line by visual estimation and a straightedge, the least-square method is more objective and easier to automate. (If you were to give this set of data to five different people and ask them to estimate the best-fit line visually, you'd get five slightly different answers, but it you gave the data set to five different computer programs, you'd get the exact same answer every time).
In some cases a fundamentally non-linear relationship can be transformed into a form that is amenable to polynomial curve fitting by means of a coordinate transformation (e.g. taking the log or the reciprocal of the data), and then least-squares method can be applied to the resulting linear equation. For example, the points in Figure 16 are from the simulation of an exponential decay (X=time, Y=signal intensity) that has the mathematical form Y = a exp(bX), where a is the Y-value at X=0 and b is the decay constant. In this particular example, a = 1 and b = -1. This is a fundamentally non-linear problem because Y is a non-linear function of the parameter b. However, by taking the natural log of both sides of the equation, we obtain ln(Y)=a + bX. In this equation, Y is a linear function of both parameters a and b, so it can be fit by the least squares method in order to estimate a and b.
The best fit equation, shown by the green solid line in the figure, is Y = 0.9897
exp(- 0.98896 X), that is a = 0.9897 and b
= -0.98896, very close to the
expected values. Thus, even in the presence of substantial random
noise, it is possible to get good estimates of the parameters of the
underlying
equation. The most important requirement is that the model be good,
that is, that
the equation selected for the model accurately describes the underlying
behavior of
the system (except for noise). Often that is the most difficult aspect,
because
the underlying models are not always known with certainty.
Other examples of non-linear relationships that can be linearized by coordinate transformation include the logarithmic (Y = a ln(bX)) and power (Y=aXb) relationships. Methods of this type used to be very common back in the days before computers, when fitting anything but a straight line was difficult. It is still used today to extend the range of functional relationships that can be handled by common linear least-squares routines available in spreadsheets hand-held calculators. (Only a few non-linear relationships can be handled this way, however. To fit any arbitrary custom function, you may have to resort to the more difficult non-linear iterative curve fitting method).
Another example of the use of transformation to convert a non-linear relationship into a form that is amenable to polynomial curve fitting is the use of the natural log (ln) transformation to convert a Gaussian peak, which has the fundamental functional form exp(-x2), into a parabola of the form -x2, which can be fit with a second order polynomial (quadratic) function (y = a + bx + cx2). (The equation for a Gaussian peak is y = height*exp(-((x-position)./(0.6006*width)) ^2), where height is the peak height, position is the x-axis location of the peak maximum, and width is the width of the peak at half-maximum). All three parameters of the peak (height, maximum position, and width) can be calculated from the three quadratic coefficients a, b, and c; the peak height is given by exp(a-c*(b/(2*c))^2), the peak position by -b/(2*c), and the peak half-width by 2.35703/(sqrt(2)*sqrt(-c)).
An
example is of this type of Gaussian curve fitting is shown in the
figure on the left. The signal is a Gaussian
peak with a true peak height of 100 units, a true peak position of 100
units, and a true half-width of 100 units, but it is sampled only every
31 units on the x-axis. The resulting data set, shown by the red points
in the upper left, has only 6 data points on the peak, too few for
accurate measurement of peak height, position, and width. The
maximum
of those 6 points is at x=87 and has an amplitude of 95, not that close
to the true values of peak position and height (100). However, taking
the
natural log of the data (upper right) produces a parabola that can be
fit with a quadratic least-squares fit (shown by the blue line in the
lower left). From the three coefficients of the quadratic fit we can
calculate much more accurate values of the Gaussian peak parameters,
shown at the bottom of the figure. The plot in the lower right shows
the resulting Gaussian fit (in blue) displayed with the original data
(red points). The accuracy is limited only by the noise in the data
(about 1% in this example). This figure was created in Matlab, using this script. A simple user-defined Matlab function that uses this method is gaussfit.m, which takes the form [Height,Position,Width]=gaussfit(x,y). You can also download an OpenOffice spreadsheet that implements this method (Download link; Screen shot). Note:
in order for this method to work properly, the data set must not
contain any zeros or negatives points, and the original Gaussian peak
signal must be a single isolated peak with a zero baseline, that is, must tend to zero far from
the peak center. In practice, this means that any non-zero baseline
must
be subtracted from the data set before applying this method. A
more general approch to fitting Gaussian peaks, which works for data
sets with zeros and negative numbers and also for data with multiple
overlapping peaks, is the non-linear iterative curve fitting method. (By the way, a quick way to test the above methods is to use the simplest Gaussian data set: x=1,2,3 and y=1,2,1, which has a height, width, and position all equal to 2).
Note 1: you can right-click on any of the m-file links above and select Save Link As... to download them to your computer for use within Matlab.
Note 2:
In the curve fitting techniques described here and in the next two
sections, there is no requirement that the x-axis interval between data
points be uniform, as is the assumption in many of the other signal
processing techniques previously covered. Curve fitting
algorithms typically accept a set of arbitrarily-spaced x-axis values
and a corresponding set of y-axis values.
The
least-squares best fit for an x,y data set can be computed using only
basic arithmetic. Here are the relevant equations for computing
the slope and intercept of the first-order best-fit equation, y
= intercept + slope*x, as well as the coefficient of
determination, R2, which is an indicator of the "goodness of fit". (R2 is 1.0000 if the fit is perfect and less than that if the fit is imperfect).
| n = number of x,y data points sumx = Σx sumy = Σy sumxy = Σx*y sumx2 = Σx*x meanx = sumx / n meany = sumy / n slope = (n*sumxy - sumx*sumy) / (n*sumx2 - sumx*sumx) intercept = meany-(slope*meanx) ssy = Σ(y-meany)^2 ssr = Σ(y-intercept-slope*x)^2 R2 = 1-(ssr/ssy) |
(In
these equations, Σ represents summation; for example, Σx means the sum
of all the x values, and Σx*y means the sum of all the x*y products,
etc).
These calculations could be performed by hand, with the aid of a
calculator or a spreadsheet, or with a program written in any programming language. A similar set of equations can be written to fit a second-order (quadratic or parabolic) equations to a set of data.
You can also download spreadsheets in OpenOffice Calc format, pictured above, that automate the computation of those equations and also plot the data and the best-fit line, requiring only that you type in (or paste in) the x-y data. There is one spreadsheet for linear fits and also a version for quadratic (parabolic) fits. For the application to analytical calibration, there are specific versions of these spreadsheets that also calculate the concentrations of the unknowns.
Recent versions of Matlab have a convenient tool for interactive manually-controlled (rather than programmed) polynomial curve fitting in the Figure window. Click for a video example: (external link to YouTube).