The graph shows a simple example taken from analytical chemistry: a straight-line calibration data set where X = concentration and Y = instrument reading (Y = a + bX). Click to download that data. The blue dots are the data points. They don't all fall in a perfect straight line because of random noise and measurement error. The least-squares algorithm computes the values of a (intercept) and b (slope) of the straight line that is a "least-squares best fit" to the data points. For this set of data, the measured slope is 9.7926 and the intercept is 0.199. The least squares procedure also calculates R2, the coefficient of determination, also called the correlation coefficient, which is an indicator of the "goodness of fit". R2 is exactly 1.0000 when the fit is perfect, less than that when the fit is imperfect. The closer to 1.0000 the better. The points do not all fall on the straight line because of random noise in the data. (In this example, the "true" value of the slope is 10 and of the intercept is zero, so the presence of the noise caused this particular measurement of slope to be off by about 2%. Had there been a larger number of points in this data set, the calculated values of slope and intercept would probably have been better).
A simple way to evaluate a curve fit is to look at a plot of the
"residuals", the differences between the y values in the original data
and the y values computed by the fit equation. If the residuals are
completely random and show no systematic variation, then the fit is as
good as possible for that level of noise; but if the residual plot
shows a regular curve, for example a U-shaped or an S-shaped curve, it may mean that another fitting function would
be more appropriate.
Least-squares best fits can be calculated by some hand-held calculators, spreadsheets, and dedicated computer programs (see below for details). Although it is possible to estimate the best-fit straight line by visual estimation and a straightedge, the least-square method is more objective and easier to automate. (If you were to give this set of data to five different people and ask them to estimate the best-fit line visually, you'd get five slightly different answers, but it you gave the data set to five different computer programs, you'd get the exact same answer every time).
In some cases a fundamentally non-linear relationship can be transformed into a form that is amenable to polynomial curve fitting by means of a coordinate transformation (e.g. taking the log or the reciprocal of the data), and then least-squares method can be applied to the resulting linear equation. For example, the signal in Figure 16 is from the simulation of an exponential decay (X=time, Y=signal intensity) that has the mathematical form Y = a exp(bX), where a is the Y-value at X=0 and b is the decay constant. This is a fundamentally non-linear problem because Y is a non-linear function of the parameter b. However, by taking the natural log of both sides of the equation, we obtain ln(Y)=a + bX. In this equation, Y is a linear function of both parameters a and b, so it can be fit by the least squares method in order to estimate a and b. In this particular example, a = 1 and b = -1, but random noise has been added to each data point, with a standard deviation equal to 10% of the value of that data point. The best estimate of the values of a and b, knowing only the noise data points, can be determined by least-ssquares curve fitting of ln(Y) vs X.
The best fit equation, shown by the green solid line in the figure, is Y = 0.9897
exp(- 0.98896 X), that is, a = 0.9897 and b
= -0.98896, which are reasonably close to the
expected values of 1 and -1, respectively. Thus, even in the presence of substantial random
noise, it is possible to get reasonable estimates of the parameters of the
underlying
equation. The most important requirement is that the model be good,
that is, that
the equation selected for the model accurately describes the underlying
behavior of
the system (except for noise). Often that is the most difficult aspect,
because
the underlying models are not always known with certainty.
Other examples of non-linear relationships that can be linearized by coordinate transformation include the logarithmic (Y = a ln(bX)) and power (Y=aXb) relationships. Methods of this type used to be very common back in the days before computers, when fitting anything but a straight line was difficult. It is still used today to extend the range of functional relationships that can be handled by common linear least-squares routines available in spreadsheets hand-held calculators. (Only a few non-linear relationships can be handled this way, however. To fit any arbitrary custom function, you may have to resort to the more difficult non-linear iterative curve fitting method).
An interesting example of the use of transformation to convert a non-linear relationship into a form that is amenable to polynomial curve fitting is the use of the natural log (ln) transformation to convert a Gaussian peak, which has the fundamental functional form exp(-x2), into a parabola of the form -x2, which can be fit with a second order polynomial (quadratic) function (y = a + bx + cx2). The equation for a Gaussian peak is y = height*exp(-((x-position)./(0.6005615*width)) ^2), where height is the peak height, position is the x-axis location of the peak maximum, and width is the width of the peak at half-maximum. A little algebra will show that all three parameters of the peak (height, maximum position, and width) can be calculated from the three quadratic coefficients a, b, and c; the peak height is given by exp(a-c*(b/(2*c))^2), the peak position by -b/(2*c), and the peak half-width by 2.35703/(sqrt(2)*sqrt(-c)).
An
example is of this type of Gaussian curve fitting is shown in the
figure on the left. The signal is a Gaussian
peak with a true peak height of 100 units, a true peak position of 100
units, and a true half-width of 100 units, but it is sparsely sampled
only every
31 units on the x-axis. The resulting data set, shown by the red points
in the upper left, has only 6 data points on the peak itself. If
we were to take the
maximum
of those 6 points (the 3rd point from the left, with x=87, y=95) as the
peak maximum, that would not be very close to the true values of peak
position and height (100). If we were to take the distance
between the 2nd the 5th data points as the peak width, we'd get
3*31=93, again not very close to the true value of 100.
However, taking the natural log of the data (upper right) produces a parabola that can be fit with a quadratic least-squares fit (shown by the blue line in the lower left). From the three coefficients of the quadratic fit we can calculate much more accurate values of the Gaussian peak parameters, shown at the bottom of the figure (height=100.57; position=98.96; width=99.2). The plot in the lower right shows the resulting Gaussian fit (in blue) displayed with the original data (red points). The accuracy of those peak parameters (about 1% in this example) is limited only by the noise in the data.
This above figure was created in Matlab, using this script. A simple user-defined Matlab function that uses this method is gaussfit.m, which takes the form [Height,Position,Width]=gaussfit(x,y). You can also download an OpenOffice spreadsheet that implements this method (Download link; Screen shot). Note: in order for this method to work properly, the data set must not contain any zeros or negatives points, and the original Gaussian peak signal must be a single isolated peak with a zero baseline, that is, must tend to zero far from the peak center. In practice, this means that any non-zero baseline must be subtracted from the data set before applying this method. A more general approch to fitting Gaussian peaks, which works for data sets with zeros and negative numbers and also for data with multiple overlapping peaks, is the non-linear iterative curve fitting method. (By the way, a quick way to test the above methods is to use the simplest Gaussian data set: x=1,2,3 and y=1,2,1, which has a height, width, and position all equal to 2, assuming a baseline of zero).
Note 1: you can right-click on any of the m-file links above and select Save Link As... to download them to your computer for use within Matlab.
Note 2:
In the curve fitting techniques described here and in the next two
sections, there is no requirement that the x-axis interval between data
points be uniform, as is the assumption in many of the other signal
processing techniques previously covered. Curve fitting
algorithms typically accept a set of arbitrarily-spaced x-axis values
and a corresponding set of y-axis values.
The
least-squares best fit for an x,y data set can be computed using only
basic arithmetic. Here are the relevant equations for computing
the slope and intercept of the first-order best-fit equation, y
= intercept + slope*x, as well as the coefficient of
determination, R2, which is an indicator of the "goodness of fit". (R2 is 1.0000 if the fit is perfect and less than that if the fit is imperfect).
| n = number of x,y data points sumx = Σx sumy = Σy sumxy = Σx*y sumx2 = Σx*x meanx = sumx / n meany = sumy / n slope = (n*sumxy - sumx*sumy) / (n*sumx2 - sumx*sumx) intercept = meany-(slope*meanx) ssy = Σ(y-meany)^2 ssr = Σ(y-intercept-slope*x)^2 R2 = 1-(ssr/ssy) |
(In
these equations, Σ represents summation; for example, Σx means the sum
of all the x values, and Σx*y means the sum of all the x*y products,
etc).
These calculations could be performed by hand, with the aid of a
calculator or a spreadsheet, or with a program written in any programming language. A similar set of equations can be written to fit a second-order (quadratic or parabolic) equations to a set of data.
You can also download spreadsheets in OpenOffice Calc format, pictured above, that automate the computation of those equations and also plot the data and the best-fit line, requiring only that you type in (or paste in) the x-y data. There is one spreadsheet for linear fits and also a version for quadratic (parabolic) fits. For the application to analytical calibration, there are specific versions of these spreadsheets that also calculate the concentrations of the unknowns.
Recent versions of Matlab have a convenient tool for interactive manually-controlled (rather than programmed) polynomial curve fitting in the Figure window. Click for a video example: (external link to YouTube).