index previous next

Curve fitting A: Linear Least Squares

[Basics]  [Transforming non-linear relationships]  [Math details]  [Spreadsheets]  [Matlab]

LinearFit.GIF The objective of curve fitting is to find a mathematical equation that describes a set of data and that is minimally influenced by random noise. The most common approach is the "linear least squares" method, also called "polynomial least squares", a well-known mathematical procedure for finding the coefficients of polynomial equations that are a "best fit" to a set of X,Y data. A polynomial equation expresses the dependent variable Y as a polynomial in the independent variable X, for example as a straight line (Y = a + bX, where a is the intercept and b is the slope), or a quadratic (Y = a + bX + cX2), or a cubic (Y = a + bX + cX2 + dX3), or higher-order polynomial. In all these cases, Y is a linear function of the parameters a,b,c, and d. This is why this is called "linear" least-squares fit, not because the plot of X vs Y is linear. Only for the first-order polynomial is the plot of X vs Y linear. (Sometimes this type of curve fitting is called "curvilinear").

The graph shows a simple example taken from analytical chemistry: a straight-line calibration data set where X = concentration and Y = instrument reading (Y = a + bX). Click to download that data. The blue dots are the data points. They don't all fall in a perfect straight line because of random noise and measurement error. The least-squares algorithm computes the values of a (intercept) and b (slope) of the straight line that is a "least-squares best fit" to the data points.  For this set of data, the measured slope is 9.7926 and the intercept is 0.199. The least squares procedure also calculates R2, the coefficient of determination, also called the correlation coefficient, which is an indicator of the "goodness of fit".  R2 is exactly 1.0000 when the fit is perfect, less than that when the fit is imperfect.  The closer to 1.0000 the better.  The points do not all fall on the straight line because of random noise in the data. (In this example, the "true" value of the slope is 10 and of the intercept is zero, so the presence of the noise caused this particular measurement of slope to be off by about 2%. Had there been a larger number of points in this data set, the calculated values of slope and intercept would probably have been better).  

Residuals.GIF A simple way to evaluate a curve fit is to look at a plot of the "residuals", the differences between the y values in the original data and the y values computed by the fit equation. If the residuals are completely random and show no systematic variation, then the fit is as good as possible for that level of noise; but if the residual plot shows a regular curve, for example a U-shaped or an S-shaped curve, it may mean that another fitting function would be more appropriate.

Least-squares best fits can be calculated by some hand-held calculators, spreadsheets, and dedicated computer programs (see below for details). Although it is possible to estimate the best-fit straight line by visual estimation and a straightedge, the least-square method is more objective and easier to automate. (If you were to give this set of data to five different people and ask them to estimate the best-fit line visually, you'd get five slightly different answers, but it you gave the data set to five different computer programs, you'd get the exact same answer every time).

Transforming non-linear relationships

In some cases a fundamentally non-linear relationship can be transformed into a form that is amenable to polynomial curve fitting by means of a coordinate transformation (e.g. taking the log or the reciprocal of the data), and then least-squares method can be applied to the resulting linear equation. For example, the signal in Figure 16 is from the simulation of an exponential decay (X=time, Y=signal intensity) that has the mathematical form Y = a exp(bX), where a is the Y-value at X=0 and b is the decay constant. This is a fundamentally non-linear problem because Y is a non-linear function of the parameter b. However, by taking the natural log of both sides of the equation, we obtain ln(Y)=a + bX. In this equation, Y is a linear function of both parameters a and b, so it can be fit by the least squares method in order to estimate a and b. In this particular example, a = 1 and b = -1, but random noise has been added to each data point, with a standard deviation equal to 10% of the value of that data point. The best estimate of the values of a and b, knowing only the noise data points, can be determined by least-ssquares curve fitting of ln(Y) vs X. 


An exponential least-squares fit (solid line) applied to a noisy data set (points) in order to estimate the decay constant.

The best fit equation, shown by the green solid line in the figure, is Y = 0.9897 exp(- 0.98896 X), that is, a = 0.9897 and b = -0.98896, which are reasonably close to the expected values of 1 and -1, respectively. Thus, even in the presence of substantial random noise, it is possible to get reasonable estimates of the parameters of the underlying equation. The most important requirement is that the model be good, that is, that the equation selected for the model accurately describes the underlying behavior of the system (except for noise). Often that is the most difficult aspect, because the underlying models are not always known with certainty.

Other examples of non-linear relationships that can be linearized by coordinate transformation include the logarithmic (Y = a ln(bX)) and power (Y=aXb) relationships. Methods of this type used to be very common back in the days before computers, when fitting anything but a straight line was difficult. It is still used today to extend the range of functional relationships that can be handled by common linear least-squares routines available in spreadsheets hand-held calculators.  (Only a few non-linear relationships can be handled this way, however.  To fit any arbitrary custom function, you may have to resort to the more difficult non-linear iterative curve fitting method).

An interesting example of the use of transformation to convert a non-linear relationship into a form that is amenable to polynomial curve fitting is the use of the natural log (ln) transformation to convert a Gaussian peak, which has the fundamental functional form exp(-x2), into a parabola of the form -x2, which can be fit with a second order polynomial (quadratic) function (y = a + bx + cx2). The equation for a Gaussian peak is y = height*exp(-((x-position)./(0.6005615*width)) ^2), where height is the peak height, position is the x-axis location of the peak maximum, and width is the width of the peak at half-maximum. A little algebra will show that all three parameters of the peak (height, maximum position, and width) can be calculated from the three quadratic coefficients a, b, and c; the peak height is given by exp(a-c*(b/(2*c))^2), the peak position by -b/(2*c), and the peak half-width by 2.35703/(sqrt(2)*sqrt(-c)).

Click to view enlarged figureAn example is of this type of Gaussian curve fitting is shown in the figure on the left. The signal is a Gaussian peak with a true peak height of 100 units, a true peak position of 100 units, and a true half-width of 100 units, but it is sparsely sampled only every 31 units on the x-axis. The resulting data set, shown by the red points in the upper left, has only 6 data points on the peak itself.  If we were to take the maximum of those 6 points (the 3rd point from the left, with x=87, y=95) as the peak maximum, that would not be very close to the true values of peak position and height (100).  If we were to take the distance between the 2nd the 5th data points as the peak width, we'd get 3*31=93, again not very close to the true value of 100.  

However, taking the natural log of the data (upper right) produces a parabola that can be fit with a quadratic least-squares fit (shown by the blue line in the lower left). From the three coefficients of the quadratic fit we can calculate much more accurate values of the Gaussian peak parameters, shown at the bottom of the figure (height=100.57; position=98.96; width=99.2). The plot in the lower right shows the resulting Gaussian fit (in blue) displayed with the original data (red points). The accuracy of those peak parameters (about 1% in this example) is limited only by the noise in the data. 

This above figure was created in Matlab, using this script.  A simple user-defined Matlab function that uses this method is gaussfit.m, which takes the form [Height,Position,Width]=gaussfit(x,y). You can also download an OpenOffice spreadsheet that implements this method (Download link; Screen shot). Note: in order for this method to work properly, the data set must not contain any zeros or negatives points, and the original Gaussian peak signal must be a single isolated peak with a zero baseline, that is, must tend to zero far from the peak center. In practice, this means that any non-zero baseline must be subtracted from the data set before applying this method.  A more general approch to fitting Gaussian peaks, which works for data sets with zeros and negative numbers and also for data with multiple overlapping peaks, is the non-linear iterative curve fitting method.  (By the way, a quick way to test the above methods is to use the simplest Gaussian data set: x=1,2,3 and y=1,2,1, which has a height, width, and position all equal to 2, assuming a baseline of zero).

Note 1: you can right-click on any of the m-file links above and select Save Link As... to download them to your computer for use within Matlab.

Note 2: In the curve fitting techniques described here and in the next two sections, there is no requirement that the x-axis interval between data points be uniform, as is the assumption in many of the other signal processing techniques previously covered.  Curve fitting algorithms typically accept a set of arbitrarily-spaced x-axis values and a corresponding set of y-axis values.


Math details.

The least-squares best fit for an x,y data set can be computed using only basic arithmetic.  Here are the relevant equations for computing the slope and intercept of the first-order best-fit equation, y = intercept + slope*x, as well as the coefficient of determination, R2, which is an indicator of the "goodness of fit". (R2 is 1.0000 if the fit is perfect and less than that if the fit is imperfect).

n = number of x,y data points   
sumx = Σx   
sumy = Σy   
sumxy = Σx*y   
sumx2 = Σx*x   
meanx = sumx / n   
meany = sumy / n   
slope = (n*sumxy - sumx*sumy) / (n*sumx2 - sumx*sumx)   
intercept = meany-(slope*meanx)   
ssy = Σ(y-meany)^2   
ssr = Σ(y-intercept-slope*x)^2   
R2 = 1-(ssr/ssy)

(In these equations, Σ represents summation; for example, Σx means the sum of all the x values, and Σx*y means the sum of all the x*y products, etc).  These calculations could be performed by hand, with the aid of a calculator or a spreadsheet, or with a program written in any programming language. A similar set of equations can be written to fit a second-order (quadratic or parabolic) equations to a set of data.

LeastSquaresSmallQuadraticLeastSquaresSmall.GIF

You can also download spreadsheets in OpenOffice Calc format, pictured above, that automate the computation of those equations and also plot the data and the best-fit line, requiring only that you type in (or paste in) the x-y data.  There is one spreadsheet for linear fits and also a version for quadratic (parabolic) fits.  For the application to analytical calibration, there are specific versions of these spreadsheets that also calculate the concentrations of the unknowns.  


Popular spreadsheets have built-in facilities for computing polynomial least-squares curve fits of any order. For example, the LINEST function in both Excel and OpenOffice Calc can be used to compute polynomial and other curvilinear least-squares fits.  For some examples of applications to analytical chemistry, see Calibration Curve Fitting Methods in Absorption Spectroscopy and Error propagation in Analytical Calibration.
SPECTRUM, the freeware signal-processing application that accompanies this tutorial, includes least-squares curve fitting for linear (straight-line), polynomials of order 2 through 5, and exponential, logarithmic, and power relationships.
 Matlab has simple built-in functions for least-squares curve fitting: polyfit and polyval. For example, if you have a set of x,y data points in the vectors "x" and "y", then the coefficients for the least-squares fit are given by coef=polyfit(x,y,n), where "n" is the order of the polynomial fit: n = 1 for a straight-line fit, 2 for a quadratic (parabola) fit, etc. For a straight-line fit (n=1), coef(1) is the slope ("b") and coef(2) is the intercept ("a"). For a quadratic fit (n=2), coef(1) is the x2 term ("c"), coef(2) is the x term ("b") and coef(3) is the constant term ("a"). The fit equation can be evaluated using the function polyval, for example fity=polyval(coef,x). This works for any order of polynomial fit ("n"). You can plot the data and the fitted equation together using the plot function: plot(x,y,'ob',x,polyval(coef,x),'-r'), which plots the data as blue circles and the fitted equation as a red line. You can plot the residuals by writing Click to view enlarged figureplot(x,y-polyval(coef,x)). (When the number of data points is small, you can get a smoother plot of the fitted equation, evaluated at more finely-divided values of x, by defining xx=linspace(min(x),max(x)); and then using xx rather than x to evaluate and plot the fit: plot(x,y,'ob',xx,polyval(coef,xx),'-r')).

The graph on the left (click to see a larger size) was generated by a simple user-defined function plotfit(x,y,polyorder), that uses all the techniques mentioned in the previous paragraph.  It accepts data in the form of a single vector, or a pair of vectors "x" and "y", or a 2xn or nx2 matrix with x in first row or column and y in the second, and plots the data points as red dots, then if the optional input argument "polyorder" is provided, it fits a polynomial of order "polyorder" to the data and plots the fit as a green line and displays the fit coefficients and the goodness-of-fit measure R2 in the upper left corner of the graph.

Recent versions of Matlab have a convenient tool for interactive manually-controlled (rather than programmed) polynomial curve fitting in the Figure window. Click for a video example: (external link to YouTube).


index previous next
Last updated February, 2012. This page is maintained by Prof. Tom O'Haver , Department of Chemistry and Biochemistry, The University of Maryland at College Park. Comments, suggestions and questions should be directed to Prof. O'Haver at toh@umd.edu.
Unique visits since May 17, 2008: