(1) Select a model for the data;This continues until the fitting error is less than the specified error. One popular technique for doing this is called the Nelder-Mead Modified Simplex. This is essentially a way of organizing and optimizing the changes in parameters (step 4, above) to shorten the time required to fit the function to the required degree of accuracy. With contemporary personal computers, the entire process typically takes only a fraction of a second to a few seconds, depending on the complexity of the model and the number of independently adjustable parameters in the model. (The animation on the right shows the working of the iterative process for a 2-peak unconstrained Gaussian fit to a small set of x,y data. In order to allow the process to be observed in action, it's slowed down artificially with a "pause()" statement, is given a bad initial guess, and is plotted step-by-step The entire process normally takes only about 0.05 seconds on a standard PC).
(2) Make first guesses of all the non-linear parameters;
(3) A computer program computes the model and compares it to the data set, calculating a fitting error;
(4) If the fitting error is greater than the required fitting accuracy, the program systematically changes the parameters and loops back around to the previous step and repeats until the required fitting accuracy is achieved or the maximum number or iterations is reached.
A simple example is fitting the blackbody equation to the spectrum of an incandescent body for the purpose of estimating its color temperature. In this case there is only one nonlinear parameter, temperature. The script BlackbodyDataFit.m demonstrates the technique, placing the experimentally measured spectrum in the vectors "wavelength" and "radiance" and then calling fminsearch with the fitting function fitblackbody.m.
Another application is demonstrated by Matlab's built-in demo
fitdemo.m and its corresponding
fitting function fitfun.m, which model
the sum of two exponential decays. To see this, just type
"fitdemo" in the Matlab command window. (Octave does not have
this demo function).
Fitting peaks. Many instrumental methods of measurement produce signals in the form of peaks of various shapes; a common requirement is to measure the positions, heights, widths, and/or areas of those peaks, even when they are noisy or overlapped with one another. This cannot be done by linear least-squares methods, because such signals can not be modeled as polynomials with linear coefficients (the positions and widths of the peaks are not linear functions), so iterative curve fitting techniques are used instead, often using Gaussian, Lorentzian, or some other fundamental simple peak shapes as a model.
The Matlab/Octave demonstration script Demofitgauss.m demonstrates fitting a Gaussian function to a set of data, using the fitting function fitgauss.m. In this case there are two non-linear parameters: peak position and peak width (the peak height is a linear parameter and is determined by regression in a single step in line 9 of the fitting function fitgauss.m and is returned in the global variable "c"). Compared to the simpler polynomial least-squares methods for measuring peaks, the iterative method has the advantage of using all the data points across the entire peak, including zero and negative points, and it can be applied to multiple overlapping peaks as demonstrated in in Demofitgauss2.m (shown on the left.
To accommodate the possibility that the baseline may shift, we can add a column of 1s to the A matrix, just as was done in the CLS method. This has the effect of introducing an additional "peak" into the model that has an amplitude but no position or width. The baseline amplitude is returned along with the peak heights in the global vector c; Demofitgaussb.m and fitgauss2b.m illustrates this addition. (Demofitlorentzianb.m and fitlorentzianb.m for Lorentzian peaks).
This peak fitting technique is easily extended to any number of
overlapping peaks of the same type using the same
fitting function fitgauss.m, which easily adapts to any number
of peaks, depending on the length of the first-guess "start"
vector lambda that
is passed to the function as input arguments, along with the
data vectors t and y:
1 function err = fitgauss(lambda,t,y) 2 % Fitting functions for a Gaussian band spectrum. 3 % T. C. O'Haver, 2006 Updated to Matlab 6, March 2006 4 global c 5 A = zeros(length(t),round(length(lambda)/2)); 6 for j = 1:length(lambda)/2, 7 A(:,j) = gaussian(t,lambda(2*j-1),lambda(2*j))'; 8 end 9 c = A\y'; 10 z = A*c; 11 err = norm(z-y');If there are n peaks in the model, then the length of lambda is 2n, one entry for each iterated variable ([position1 width1 position2 width2....etc]). The "for" loop (lines 5-7) constructs a n ื length(t) matrix containing the model for each peak separately, using a user-defined peak shape function (in this case gaussian.m), then it computes the n-length peak height vector c by least-squares regression in line 9, using the Matlab shortcut "\" notation. The resulting peak heights are used to compute z, the sum of all n model peaks, by matrix multiplication in line 10, and then "err", the root-mean-square difference between the model z and the actual data y, is computed in line 11 by the Matlab 'norm' function and returned to the calling function ('fminsearch'), which repeats the process many times, trying different values of the peak positions and the peak widths until the value of "err" is low enough.
Variable shape types, such as the Voigt
profile, Pearson, Breit-Wigner-Fano,
Gauss-Lorentz blend, and the exponentially-broadened Gaussian
and Lorentzian, are defined not only by a peak position,
height, and width, but also by an additional parameter that
fine-tunes the shape of the peak. If that parameter is equal
and known for all peaks in a group, it can be passed as
an additional input argument to the shape function, as shown
in VoigtFixedAlpha.m. If the
shape parameter is allowed to be different for each
peak in the group and is to be determined by iteration (just
as is position and width), then the routine must be modified
to accommodate three, rather than two, iterated
variables, as shown in VoigtVariableAlpha.m.
Although the fitting error is lower with variable alphas, the
execution time is longer and the alphas values so determined
are not very stable, with respect to noise in the data and the
starting guess values, especially for multiple peaks. (These
are self-contained functions). Version 7 of the downloadable
Matlab/Octave function peakfit.m
includes variable shape types for the Pearson, ExpGaussian,
Voigt, and Gaussian/Lorentzian blend, as well as the 3-parameter logistic or Gompertz function, whose three parameters are labeled Bo, Kh, and L, rather than position, width, and shape factor).
Signals with peaks of different shape types
in one signal can be fit by the fitting function fitmultiple.m,
which takes as input arguments a vector of peak types and a
vector of shape variables. The sequence of peak types and
shape parameters must be determined beforehand. To see how
this is used, see Demofitmultiple.m.
You can create your own fitting functions for any purpose; they are not limited to single algebraic expressions, but can be arbitrarily complex multiple step algorithms. For example, in the TFit method for quantitative absorption spectroscopy, a model of the instrumentally-broadened transmission spectrum is fit to the observed transmission data, using Matlab's fminsearch function, using a fitting function that performs Fourier convolution of the transmission spectrum model with the known slit function of the spectrometer. The result is an alternative method of calculating absorbance that extends the dynamic range and calibration linearity of absorption spectroscopy far beyond the normal limits.
The uncertainty of the peak parameters determined by iterative least-squares fitting is readily estimated by the bootstrap sampling method. A simple demonstration of bootstrap estimation of the variability of an iterative least-squares fit to a single noisy Gaussian peak is given by the custom downloadable Matlab/Octave function "BootstrapIterativeFit.m", which creates a single x,y data set consisting of a single noisy Gaussian peak, extracts bootstrap samples from that data set, performs an iterative fit to the peak on each of the bootstrap samples, and plots the histograms of peak height, position, and width of the bootstrap samples. The syntax is BootstrapIterativeFit(TrueHeight, TruePosition, TrueWidth, NumPoints, Noise, NumTrials) where TrueHeight is the true peak height of the Gaussian peak, TruePosition is the true x-axis value at the peak maximum, TrueWidth is the true half-width (FWHM) of the peak, NumPoints is the number of points taken for the least-squares fit, Noise is the standard deviation of (normally-distributed) random noise, and NumTrials is the number of bootstrap samples. An typical example for BootstrapIterativeFit(100,100,100,20,10,100); is displayed in the figure on the right.
Peak Height Peak Position Peak Width
mean: 99.27028 100.4002 94.5059
STD: 2.8292 1.3264 2.9939
IQR: 4.0897 1.6822 4.0164
IQR/STD Ratio: 1.3518
A similar demonstration function
for two overlapping
Gaussian peaks is available in "BootstrapIterativeFit2.m".
Type "help BootstrapIterativeFit2" for
more information. In both these simulations, the standard
deviation (STD) as well as the interquartile
range (IQR) of each of the peak parameters are
computed. This is done because the interquartile range is
much less influenced by outliers. The distribution
of peak parameters measured by iterative fitting is often
non-normal, exhibiting a greater fraction of large deviations
from the mean than is expected for a normal distribution. This
is because the iterative procedure sometimes converges on an
abnormal result, especially for multiple peak fits with a large
number of variable parameters. (You may be able to see this in
the histograms plotted by these simulations, especially for the
weaker peak in BootstrapIterativeFit2).
In those cases the standard deviation will be too high
because of the outliers, and the IQR/STD
ratio will be much less than the value of 1.34896 that is expected for a normal distribution. In that
case a better estimate of the standard deviation of the
central portion of the distribution (without the outliers)
For the quantitative measurement of peaks, it's
to compare the iterative least-squares method with
simpler, less computationally-intensive, methods. For example,
the measurement of the peak height of a single peak of uncertain
width and position could be done simply by taking the maximum of
the signal in that region. If the signal is noisy, a more
accurate peak height will be obtained if the signal is smoothed beforehand. But smoothing
can distort the signal and reduce peak heights. Using an
iterative peak fitting method, assuming only that the peak shape
is known, can give the best possible accuracy and precision,
without requiring smoothing even under high noise conditions,
e.g. when the signal-to-noise ratio is 1, as in the demo
True peak height = 1 NumTrials =
100 SmoothWidth = 50
Method Maximum y Max Smoothed y Peakfit
Average peak height 3.65 0.96625 1.0165
Standard deviation 0.36395 0.10364 0.11571
If peak area is measured rather than peak height, smoothing is unnecessary (unless to locate the peak beginning and end) but peak fitting still yields the best precision. See SmoothVsFitArea.m.
also instructive to compare the iterative least-squares method
least-squares curve fitting, discussed in the previous section, which can
also fit peaks in a signal. The difference is that in
the classical least squares method, the positions, widths, and
shapes of all the individual components are all known
beforehand; the only
unknowns are the amplitudes (e.g. peak heights) of the
components in the mixture. In non-linear iterative curve
fitting, on the other hand, the positions, widths, and heights
of the peaks are all unknown
beforehand; the only
thing that is known is the fundamental underlying shape of the
peaks. The non-linear iterative curve
fitting is more difficult to do (for the computer, anyway) and more prone to error,
but it's necessary if you need to track shifts in peak
position or width or to decompose a complex overlapping peak
signal into fundamental components knowing only their shape.
The Matlab/Octave script CLSvsINLS.m
compares the classical least-squares (CLS) method with three
different variations of the iterative method (INLS) method for
measuring the peak heights of three Gaussian peaks in a noisy
test signal, demonstrating that the fewer the number of
unknown parameters, the faster and more accurate is the peak
Another comparison of multiple measurement techniques is presented in Case Study C.
can right-click on any of the m-file links on this page and
select Save Link As... to download them
to your computer, then place them in the Matlab path for use
Iterative curve fitting is often used to measure the
position, height, and width of peaks in a signal, especially
when they overlap significantly. There are four major sources
of error in measuring these peak parameters by iterative curve
a. Model errors.
Peak shape. If you have the wrong model for your peaks, the results can not be expected to be accurate; for instance, if your actual peaks are Lorentzian in shape, but you fit them with a Gaussian model, or vice versa. For example, a single isolated Gaussian peak at x=5, with a height of 1.000 fits a Gaussian model virtually perfectly, using the Matlab user-defined peakfit function, as shown on the right. (The 5th input argument for the peakfit function specifies the shape of peaks to be used in the fit; "1" means Gaussian).
>> [FitResults,MeanFitError]=peakfit([x' y'],5,10,1,1)
Peak# Position Height Width Area
1 5 1 1.6649 1.7724
The "FitResults" are, from left to right, peak number, peak position, peak height, peak width, and peak area. The MeanFitError, or just "fitting error", is the square root of the sum of the squares of the differences between the data and the best-fit model, as a percentage of the maximum signal in the fitted region. (Note the good agreement of the area measurement (1.7724) with the theoretical area under the curve of exp(-x2), which is the square root of p, or about 1.7725...).
But this same peak, when fit with a Logistic model (peak shape number 3, above right), gives a fitting error of 1.4% and height and width errors of 3% and 6%, respectively:
Peak# Position Height Width Area
1 5.0002 0.96652 1.762 1.7419
When fit with a Lorentzian model
(peak shape 2, shown on the right), this peak gives a 6%
fitting error and height and width errors of 8% and 20%,
Peak# Position Height Width Area
1 5 1.0876 1.3139 2.0579
So clearly the larger the fitting
errors, the larger are the parameter errors, but the parameter
errors are of course not equal
to the fitting error (that would just be too easy). Also,
it's clear that the peak width
and area are
the parameters most susceptible to errors. The peak positions, as you can see
here, are measured accurately even if the model is way wrong,
as long as the peak is symmetrical and not highly overlapping
with other peaks.
If you do not know the
shape of your peaks, you can use use peakfit.m or ipf.m to try
different shapes to see if one of the standard shapes included
in those programs fits the data; try to find a peak in your
data that is typical, isolated, and that has a good
signal-to-noise ratio. For example, the Matlab function ShapeTest.m creates a test signal
consisting of a single (asymmetrical) peak, adds random white
noise, fits it with six different candidate model peak shapes
using peakfit.m, plots each fit in a separate figure window,
and prints out a table of fitting errors in the command
window. In this particular case, the last two model shapes
fit almost equally well (because they are mathematically the same, just parameterized differently). You can set the noise level in line 5. If there is too much noise, the results can be misleading; for example, if Noise=.2, the "three Gaussians" model is likely to fit slightly better
because it has more degrees of freedom and can "fit the noise". ShapeTest.m has only six potential
candidate shape in its current fotm; the Matlab function peakfit.m has many more
built-in shapes to choose from, but still it is a finite list
and there is always the possibility that the actual underlying
peak shape is not available in the software you are using.
A good fit is not by itself proof that the shape function you have chosen is the correct one; in some cases the wrong function can give a fit that looks perfect. For example, this fit of a real data set to a 5-peak Gaussian model has a low percent fitting error and the residuals look random - usually an indicator of a good fit. But in fact in this case the model is wrong; that data came from an experimental domain where the underlying shape is fundamentally non-Gaussian but in some cases can look very like a Gaussian. As another example, a data set consisting of peaks with a Voigt profile peak shape can be fit with a weighted sum of a Gaussian and a Lorentzian almost as well as an with an actual Voigt model, even though those models are not the same mathematically; the difference in fitting error is so small that it would likely be obscured by the random noise if it were a real experimental signal. The same thing can occur in sigmoidal signal shapes: a pair of simple 2-parameter logistic functions seems to fit this example data pretty well, with a fitting error of less than 1%; you would no reason to doubt the goodness of fit unless the random noise is low enough so you can see that the residuals are wavy. Alternatively, a 3-parameter logistic (Gompertz function) fits much better, and the residuals are random, not wavy. In such cases you can not depend solely on what looks like a good fit to determine whether the fit is model is optimum; sometimes you need to know more about the peak shape.
Number of peaks. Another
of model error occurs if you have the wrong number of peaks in your
model, for example if the signal actually has two peaks but
you try to fit it with only one peak. In the example below, a
line of Matlab code generates a simulated signal with of two
Gaussian peaks at x=4 and x=6 with peaks heights of 1.000 and
0.5000 respectively and widths of 1.665, plus random noise
with a standard deviation 5% of the height of the largest peak
(a signal-to-noise ratio of 20):
In a real experiment you would not usually know the peak positions, heights, and widths; you would be using curve fitting to measure those parameters. Let's assume that, on the basis of previous experience or some preliminary trial fits, you have established that the optimum peak shape model is Gaussian, but you don't know for sure how many peaks are in this group. If you start out by fitting this signal with a single-peak Gaussian model, you get:
The residual plot shows a "wavy"
structure that's visible in the random scatter of points due
to the random noise in the signal. This means that the fitting
error is not limited by the random noise; it is a clue that
the model is not quite right.
But a fit with two peaks yields much better results (The 4th input argument for the peakfit function specifies the number of peaks to be used in the fit).
Now the residuals have a random
scatter of points, as would be expected if the signal is
accurately fit except for the random noise. Moreover, the
fitting error is much lower (less that half) of the error with
only one peak. In fact, the fitting error is just about what
we would expect in this case based on the 5% random noise in
the signal (estimating the relative standard deviation of the
points in the baseline visible at the edges of the signal).
Because this is a simulation in which we know beforehand the
true values of the peak parameters (peaks at x=4 and x=6 with
peaks heights of 1.0 and 0.50 respectively and widths of
1.665), we can actually calculate the parameter errors (the
difference between the real peak positions, heights, and
widths and the measured values). Note that they are quite
accurate (in this case within about 1% relative on the peak
height and 2% on the widths), which is actually better than
the 5% random noise in this signal because of the averaging
effect of fitting to multiple data points in the signal.
But if going from one peak to two
peaks gave us a better fit, why not go to three peaks? If
there were no noise in the data, and if the underlying peak
shape were perfectly matched by the model, then the fitting
error would have already been essentially zero with two model
peaks, and adding a third peak to the model would yield a
vanishingly small height for that third peak. But in our
examples here, as in real data, there is always some random
noise, and the result is that the third peak height will not
be zero. Changing the number of peaks to three gives these
Peak# Position Height Width Area
1 4.0748 0.51617 1.7874 0.98212
2 6.7799 0.089595 2.0455 0.19507
3 5.9711 0.94455 1.53 1.5384
MeanFitError = 4.3878
The fitting algorithm has now tried to fit an additional low-amplitude peak (numbered peak 2 in this case) located at x=6.78. The fitting error is actually lower that for the 2-peak fit, but only slightly lower, and the residuals are no less visually random that with a 2-peak fit. So, knowing nothing else, a 3-peak fit might be rejected on that basis alone. In fact, there is a serious downside to fitting more peaks than are actually present in the signal: it increases the parameter measurement errors of the peaks that are actually present. Again, we can prove this because we know beforehand the true values of the peak parameters: clearly the peak positions, heights, and widths of the two real peaks than are actually in the signal (peaks 1 and 3) are significantly less accurate than those of the 2-peak fit.
Moreover, if we repeat that fit
with the same signal
but with a different
sample of random noise (simulating a repeat measurement of a
stable experimental signal in the presence or random noise),
the additional third peak in the 3-peak fit will bounce around
all over the place (because the third peak is actually fitting
the random noise,
not an actual peak in the signal).
>> [FitResults,MeanFitError]=peakfit([x' y'],5,10,3,1)
Peak# Position Height Width Area
1 4.115 0.44767 1.8768 0.89442
2 5.3118 0.09340 2.6986 0.26832
3 6.0681 0.91085 1.5116 1.4657
MeanFitError = 4.4089
With this new set of data, two of
the peaks (numbers 1 and 3) have roughly the same position,
height, and width, but peak number 2 has changed substantially
compared to the previous run. Now we have an even more
compelling reason to reject the 3-peak model: the 3-peak
solution is not stable.
And because this is a simulation in which we know the right
answers, we can verify that the accuracy of the peak heights
is substantially poorer (about 10% error) than expected with
this level of random noise in the signal (5%). If we were to
run a 2-peak fit on the same new data, we get much better
measurements of the peak heights.
Peak# Position Height Width Area
1 4.1601 0.49981 1.9108 1.0167
2 6.0585 0.97557 1.548 1.6076
MeanFitError = 4.4113
If this is repeated several times,
it turns out that the peak parameters of the peaks at x=4
and x=6 are, on average, more accurately measured by the
2-peak fit. In practice, the best way to evaluate a proposed
fitting model is to fit several repeat measurements
of the same signal (if that is practical experimentally) and
to compute the standard deviation of the peak parameter
In real experimental work, of course, you usually don't know the right answers beforehand, so that's why it's important to use methods that work well when you do know. The real data example mentioned above was fit with a succession of 2, 3, 4 and 5 Gaussian models, until the residuals became random. Beyond that point, there is little to be gained by adding more peaks to the model. Another way to determine the minimum number of models peaks needed is to plot the fitting error vs the number of model peaks; the point at which the fitting error reaches a minimum, and increases afterward, would be the fit with the "ideal combination of having the best fit without excess/unnecessary terms" (http://en.wikipedia.org/wiki/Coefficient_of_determination#Adjusted_R2). The Matlab/Octave function testnumpeaks.m (R = testnumpeaks(x, y, peakshape, extra, NumTrials, MaxPeaks)) applies this idea by fitting the x,y data to a series of models of shape peakshape containing 1 to MaxPeaks model peaks. The correct number of peaks underlying peaks is either the fit with the lowest fitting error, or, if two or more fits have about the same fitting error, the fit with the least number of peaks. The Matlab/Octave demo script NumPeaksTest.m uses this function with noisy computer-generated signals containing a user-selected 3, 4, 5 or 6 underlying peaks. With very noisy data, however, the technique is not always reliable.
Peak width constraints.
Finally, there is one more thing that we can do that
might improve the peak parameter measurement accuracy, and it
concerns the peak widths. In all the above simulations, the
basic assumption that all
the peak parameters were unknown and independent of one
another. In some types of measurements, however, the peak
widths of each group of adjacent peaks are all expected to be
equal to each other, on the basis of first principles or
previous experiments. This is a common situation in analytical
chemistry, especially in atomic spectroscopy and in
chromatography, where the peak widths are determined largely
by instrumental factors. In the current simulation, the true
peaks widths are in fact both equal to 1.665, but all the
results above show that the measured
peak widths are close but not quite equal, due to random noise
in the signal. The unequal peak widths are a consequence of the random noise, not real differences in peak width.
But we can introduce an equal-width constraint
into the fit by using peak shape 6 (Equal width Gaussians) or
peak shape 7 (Equal-width Lorentzians) in peakfit
version 1.4 and later. Using peak shape 6 on the same
set of data as the previous example:
Peak# Position Height Width Area
1 4.0293 0.52818 1.5666 0.8808
2 5.9965 1.0192 1.5666 1.6997
MeanFitError = 4.5588
This "equal width" fit forces all the peaks within one group to have exactly the same width, but that width is determined by the program from the data. The result is a slightly higher fitting error (in this case 4.5 rather than 4.4), but - perhaps surprisingly - the peak parameter measurements are usually more accurate and more reproducible (Specifically, the relative standard deviations are on average lower for the equal-width fit than for an unconstrained-width fit to the same data, assuming of course that the true underlying peak widths are really equal). This is an exception to the general expectation that lower fitting errors result in lower peak parameter errors. It is an illustration of the general rule that the more you know about the nature of your signals, and the closer your chosen model adheres to that knowledge, the better the results. In this case we knew that the peak shape was Gaussian (although we could have verified that choice by trying other candidate peaks shapes). We determined that the number of peaks was 2 by inspecting the residuals and fitting errors for 1, 2, and 3 peak models. And then we introduced the constraint of equal peak widths within each group of peaks (based on prior knowledge of the experiment rather than on inspection of residuals and fitting errors). Not every experiment can be expected to yield peaks of equal width, but when it does, it's better to make use of that constraint.
Fixed-width shapes. Going one step beyond equal widths (in peakfit version 7.6 and later), you can also specify a fixed-width shapes (shape numbers 11, 12, 34-37), in which the width of the peaks are known beforehand, but are not necessarily equal, and are specified as a vector in input argument 10, one element for each peak, rather than being determined from the data as in the equal-width fit above. Introducing this constraint onto the previous example, and supplying an accurate width as the 10th input argument:
Peak# Position Height Width Area
1 3.9943 0.49537 1.666 0.8785
2 5.9924 0.98612 1.666 1.7488
MeanFitError = 4.8128
Comparing to the previous
equal-width fit, the fitting error is larger here (because
there are fewer degrees of freedom to minimize the error), but
the parameter errors, particularly the peaks heights, are more
accurate because the width information provided in the input
argument was more accurate (1.666) than the width determined
by the equal-width fit (1.5666). Again, not every
experiment yields peaks of known width, but when it does, it's
better to make use of that constraint. For example, see Example 35 and the Matlab/Octave script WidthTest.m (typical results for a Gaussian/Lorentzian blend shape shown below, showing that the more constraints, the greater the fitting error but the lower the parameter errors, if the constraints are accurate).
|Relative percent error||Fitting error||Position Error||Height Error||Width Error|
|Unconstrained shape factor and widths: shape 33||0.78||0.39||0.80||1.66|
|Fixed shape factor and variable widths: shape 13||0.79||0.25||1.3||0.98|
|Fixed shape factor and fixed widths: shape 35|| 0.8||0.19||0.69||0.0|
Finally, note that if the peak positions are also known,
and only the peak heights
are unknown, you don't even need to use the iterative fitting
method at all; you can use the much easier and faster
multilinear regression technique (classical least squares).
Here's a a
screen video (MorePeaksLowerFittingError.mp4)
of a real-data experiment using the interactive peak fitter ipf.m with a complex experimental signal in which
several different fits were performed using models from 4 to 9
variable-width, equal-width, and fixed-width Gaussian peaks.
The fitting error gradually decreases from 11% to 1.4% as more
peaks are used, but is that really justified? If the objective
is simply to get a good fit, then do whatever it takes. But if
the objective is to extract some useful information from the
model peak parameters, then more specific knowledge about that
particular experiment is needed: how many peaks are really
expected; are the peak widths really expected to be
constrained? Note that in this particular case the
residuals (bottom panel) are never really random and
always have a distinct "wavy" character, suggesting that the
data may have been smoothed before curve fitting
(usually not a good idea: see http://wmbriggs.com/blog/?p=195). Thus there is a real possibility
that some of those 9 peaks are simply "fitting the noise", as
will be discussed further in Appendix A.
The peaks that are measured in many scientific instruments are sometimes superimposed on a non-specific background or baseline. Ordinarily the experiment protocol is designed to minimize the background or to compensate for the background, for example by subtracting a "blank" signal from the signal of an actual specimen. But even so there is often a residual background that can not be eliminated completely experimentally. The origin and shape of that background depends on the specific measurement method, but often this background is a broad, tilted, or curved shape, and the peaks of interest are comparatively narrow features superimposed on that background. The presence of the background has little effect on the peak positions, but it is impossible to measure the peak heights, width, and areas accurately unless something is done to account for the background.
There are various methods
described in the literature for estimating and subtracting the
background in such cases. The simplest assumption is that the
background can be approximated as a simple function in the
local region of group of peaks being fit together, for example
as a constant (flat), straight line (linear) or curved line
(quadratic). This is the basis of the "autozero" modes in the
iSignal.m, and iPeak.m functions,
which are selected by the T key to cycle thorough OFF,
linear, quadratic, and flat modes. In
the flat mode, a constant baseline is included in the
curve fitting calculation, as described above.
In linear mode, a straight-line baseline connecting
the two ends of the signal segment in the upper panel will be
automatically subtracted before the iterative curve fitting. In quadratic mode, a
parabolic baseline is subtracted. In the last two modes, you
must adjust the pan and zoom controls to isolate the group of
overlapping peaks to be fit, so that the signal returns to the
local background at the left and right ends of the window.
Example of an
experimental chromatographic signal. From left to right, (1)
Raw data with peaks superimposed on a tilted baseline. One
group of peaks is selected using the the pan and zoom
controls, adjusted so that the signal returns to the local
background at the edges of the segment displayed in the upper
window; (2) The linear baseline is subtracted when the
autozero mode set to 1 in ipf.m by pressing the T key;
(3) Fit with a three-peak Gaussian model, activated by
pressing 3, G, F (3 peaks, Gaussian, Fit).
Alternatively, it may
be better to subtract the background from the entire
signal first, before further operations are performed. As
before, the simplest assumption is that the background is
piecewise linear, that is, can be approximated as a series of
small straight line segments. This is the basis of the
multiple point background subtraction mode in ipf.m,
iPeak.m, and in iSignal. The user enters the
number of points that is thought to be sufficient to define
the baseline, then clicks where the baseline is thought to be
along the entire length of the signal in the lower
whole-signal display (e.g. on the valleys between the
peaks). After the last point is clicked, the program
interpolates between the clicked points and subtracts the
piecewise linear background from the original signal.
From left to right, (1) Raw
data with peaks superimposed on baseline. (2) Background
subtracted from the entire signal using the multipoint
background subtraction function in iPeak.m (ipf.m and iSignal have the
Sometimes, even without an actual baseline
present, the peaks may overlap enough so that the signal never
return to the baseline, making it seem that there is a
baseline to be corrected. This can occur especially with peaks
shapes that have gradually sloping sides, such as the
Lorentzian, as shown in
this example. Curve fitting without baseline
correction will work in that case.
In some cases the background may
be modeled as a broad peak whose maximum falls outside of
the range of data acquired, as in the real data example on the
left. It may be possible to fit the off-screen peak simply by
including extra peak in the model to account for the baseline.
(Don't use the equal-width shapes for this, because it's
likely that the background peak is broader than the measured
peaks). In the example on the left, there are three clear
peaks visible, superimposed on a tilted baseline. In
this case the signal was fit nicely with four, rather than
three, variable-width Gaussians, with an error of only 1.3%.
The additional broad Gaussian, with a peak at x = -38.7,
serves as the baseline.
The Matlab/Octave function peakfit.m can employ a peakshape input argument that is a vector of
different shapes, which can be useful for baseline correction, as in these examples:
Weak Gaussian peak on sloped straight-line baseline, 2-peak fit with one Gaussian and one variable-slope straight line ('slope', shape 26, peakfit version 6 only).
[FitResults,GOF]= peakfit([x;y],0,0,2,[1 26],[1 1],1,0)
1 10 1 1.6651 1.7642
2 4.485 0.22297 0.05 40.045
If the baseline seems to be curved, you can model the baseline with a quadratic (shape 46) rather than a linear slope (peakfit version 8 and later).
If the baseline seems to be different on either side of the peak, it might be useful to to model the baseline with an S-shape (sigmoid), either an up-sigmoid, shape 10 (click for graphic), peakfit([x;y],0,0,2,[1 23],[0 0], or a down-sigmoid, shape 23 (click for graphic), peakfit([x;y],0,0,2,[1 23],[0 0].
If the signal is very weak compared to the baseline, the fit can be helped by adding rough first guesses ('start') using the 'polyfit' function to generate automatic first guesses for the sloping baseline. For example, with two overlapping signal peaks and a 3-peak fit with peakshape=[1 1 26].
start=[8 1 10 1 polyfit(x,y,1)];
peakfit([x;y],0,0,3,[1 1 26],[1 1 1],1,start)
The downside to including the baseline as a variable component is that it increases the number of degrees of freedom, increases the execution time, and increases the possibility of unstable fits.
Random noise in the signal.
Any experimental signal has a certain amount of random noise, which means that the individual data points scatter randomly above and below their mean values. The assumption is ordinarily made that the scatter is equally above and below the true signal, so that the long-term average approaches the true mean value; the noise "averages to zero" as it is often said. The practical problem is that any given recording of the signal contains only one finite sample of the noise. If another recording of the signal is made, it will contain another independent sample of the noise. These noise samples are not infinitely long and therefore do not represent the true long-term nature of the noise. This presents two problems: (1) an individual sample of the noise will not "average to zero" and thus the parameters of the best-fit model will not necessarily equal the true values, and (2) the magnitude of the noise during one sample might not be typical; the noise might have been randomly greater or smaller than average during that time. This means that the mathematical "propagation of error" methods, which seek to estimate the likely error in the model parameters based on the noise in the signal, will be subject to error (underestimating the error if the noise happens to be lower than average and overestimating the errors if the noise happens to be larger than average).
A better way to estimate the parameter errors is to record multiple samples of the signal, fit each of those separately, compute the models parameters from each fit, and calculate the standard error of each parameter. But if that is not practical, it is possible to simulate such measurements by adding random noise to a model with known parameters, then fitting that simulated noisy signal to determine the parameters, then repeating the procedure over and over again with different sets of random noise. This is exactly what the script DemoPeakfit.m (which requires the peakfit.m function) does for simulated noisy peak signals such as those illustrated below. It's easy to demonstrate that, as expected, the average fitting error precision and the relative standard deviation of the parameters increases directly with the random noise level in the signal. But the precision and the accuracy of the measured parameters also depend on which parameter it is (peak positions are always measured more accurately than their heights, widths, and areas) and on the peak height and extent of peak overlap (the two left-most peaks in this example are not only weaker but also more overlapped than the right-most peak, and therefore exhibit poorer parameter measurements). In this example, the fitting error is 1.6% and the percent relative standard deviation of the parameters ranges from 0.05% for the peak position of the largest peak to 12% for the peak area of the smallest peak.
Overlap matters: The errors in the values of peak parameters measured by curve fitting depend not only on the characteristics of the peaks in question and the signal-to-noise ratio, but also upon other peaks that are overlapping it. From left to right: (1) a single peak at x=100 with a peak height of 1.0 and width of 30 is fit with a Gaussian model, yielding a relative fit error of 4.9% and relative standard deviation of peak position, height, and width of 0.2%, 0.95%, and 1.5% , respectively. (2) The same peak, with the same noise level but with another peak overlapping it, reduces the relative fit error to 2.4% (because the addition of the second peak increases overall signal amplitude), but increases the relative standard deviation of peak position, height, and width to 0.84%, 5%, and 4% - a seemingly better fit, but with poorer precision for the first peak. (3) The addition of a third (non-overlapping) peak reduces the fit error to 1.6% , but the relative standard deviation of peak position, height, and width of the first peak are still 0.8%, 5.8%, and 3.64%, about the same as with two peaks, because the third peak does not overlap the first one significantly.
If the average noise noise in the signal is not known or its probability distribution is uncertain, it is possible to use the bootstrap sampling method to estimate the uncertainty of the peak heights, positions, and widths, as described above. Version 8 of the keypress operated interactive version of ipf.m has added a function (activated by the 'v' key) that estimates the expected standard deviation of the peak parameters using this method.
One way to reduce the effect of
noise is to take more data. If the experiment makes it
possible to reduce the x-axis interval between points, or to
take multiple readings at each x-axis values, then the
resulting increase in the number of data points in each peak
should help reduce the effect of noise. As a
demonstration, using the script DemoPeakfit.m
to create a simulated overlapping peak signal like that shown
above right, it's possible to change the interval between x
values and thus the total number of data points in the signal.
With a noise level of 1% and 75 points in the signal, the
fitting error is 0.35 and the average parameter error is 0.8%.
With 300 points in the signal and the same noise level, the
fitting error is essentially the same, but the average
parameter error drops to 0.4%, suggesting that the accuracy of
the measured parameters varies inversely with the square root
of the number of data points in the peaks.
on the right illustrates the importance of sampling interval
and data density. You can download the data file "udx" in TXT format or in Matlab MAT format. The signal consists of two
Gaussian peaks, one located at x=50 and the second at x=150.
Both peaks have a peak height of 1.0 and a peak half-width of
10, and normally-distributed random white noise with a
standard deviation of 0.1 has been added to the entire signal.
The x-axis sampling interval, however, is different for the
two peaks; it's 0.1 for the first peak and 1.0 for the second
peak. This means that the first peak is characterized by
ten times more points than the second peak. When you fit
these peaks separately to a Gaussian model (e.g., using
peakfit.m or ipf.m), you will find that all the parameters of
the first peak are measured more accurately than the second,
even though the fitting error is not much different:
First peak: Second peak:
Percent Fitting Error = 7.6434% Percent Fitting Error = 8.8827%
Peak# Position Height Width Peak# Position Height Width
1 49.95 1.0049 10.111 1 149.64 1.0313 9.941
So far this discussion has applied
to white noise. But other noise colors have
different effects. Low-frequency weighted (pink) noise has a
greater effect on the accuracy of peak parameters
measured by curve fitting, and, in a nice symmetry,
high-frequency blue noise has a smaller effect on
the accuracy of peak parameters that would be expected on the
basis of its standard deviation, because the information in a
smooth peak signal
is concentrated at low frequencies. An example of this
occurs when curve fitting is applied to a signal that has been
deconvoluted to remove a
broadening effect. This is why smoothing before curve fitting does not help, because the peak signal information is concentrated in the low frequency range, but smoothing reduces mainly the noise in the high frequency range.
Sometime you may notice that the residuals in a curve fitting operation are structured into bands or lines rather than being completely random. This can occur if either the independent variable or the dependent variable is quantized into discrete steps rather than continuous. It may look strange, but it has little effect on the results as long as the random noise is larger than the steps.
d. Iterative fitting
Unlike multiple linear regression curve fitting, iterative methods may not always converge on the exact same model parameters each time the fit is repeated with slightly different starting values (first guesses). The Interactive Peak Fitter ipf.m makes it easy to test this, because it uses slightly different starting values each time the signal is fit (by pressing the F key in ipf.m, for example). Even better, by pressing the X key, the ipf.m function silently computes 10 fits with different starting values and takes the one with the lowest fitting error. A basic assumption of any curve fitting operation is that the fitting error (the root-mean-square difference between the model and the data) is minimized, the parameter errors (the difference between the actual parameters and the parameters of the best-fit model) will also be minimized. This is generally a good assumption, as demonstrated by the graph to the right, which shows typical percent parameters errors as a function of fitting error for the left-most peak in one sample of the simulated signal generated by DemoPeakfit.m (shown in the previous section). The variability of the fitting error here is caused by random small variations in the first guesses, rather than by random noise in the signal. In many practical cases there is enough random noise in the signals that the iterative fitting errors within one sample of the signal are small compared to the random noise errors between samples.
Remember that the variability in measured peak parameters from fit to fit of a single sample of the signal is not a good estimate of the precision or accuracy of those parameters, for the simple reason that those results represent only one sample of the signal, noise, and background. The sample-to-sample variations are likely to be much greater than the within-sample variations due to the iterative curve fitting. (In this case, a "sample" is a single recording of signal). To estimate the contribution of random noise to the variability in measured peak parameters when only a single sample if the signal is available, the bootstrap method can be used.
e. A difficult case. As
a dramatic example of the ideas in parts c and d, consider this
simulated example signal, consisting of two Gaussian peaks of
equal height = 1.00, overlapping closely enough so that their
sum is a single symmetrical peak that looks very much like a
If there were no noise in the signal, the peakfit.m or ipf.m routines could easily extract the two equal Gaussian components to an accuracy of 1 part in 1000.
>> peakfit([x y],5,19,2,1)Peak# Position Height Width Area
But in the presence of even a little noise (for example, 1% RSD), the results are uneven; one peak is almost always significantly higher than the other:
1 4.4117 0.83282 1.61 1.43
2 5.4022 1.1486 1.734 2.12
The fit is stable with any one
sample of noise (if peakfit.m was run again with slightly
different starting values, for example by pressing the F key several times in
ipf.m), so the problem is not iterative fitting errors caused
by different starting values. The problem is the noise:
although the signal is completely symmetrical, any particular
sample of the noise is not perfectly symmetrical (e.g. the
first half of the noise usually averages a slightly higher or
lower than the second half, resulting in an asymmetrical fit
result). The surprising thing is that the error in the peak
heights are much larger (about 15% relative, on average) than
the random noise in the data (1% in this example). So even
though the fit looks good
- the fitting error is low (less than 1%) and the residuals
are random and unstructured - the model parameters can
still be very far off. If you were to make another
measurement (i.e. generate another independent set of noise),
the results would be different but still inaccurate (the
first peak has an equal chance of being larger or smaller than
the second). Unfortunately, the expected error is not
accurately predicted by the bootstrap method,
which seriously underestimates the standard deviation of the
peak parameters with repeated measurements of independent
signals (because a bootstrap sub-sample of asymmetrical
noise is likely to remain asymmetrical). A Monte Carlo
simulation would give a more reliable estimation of
uncertainty in such cases.
Better results can be obtained in cases where the peak widths are expected to be equal, in which case you can use peak shape 6 (equal-width Gaussian) instead of peak shape 1: peakfit([x y],5,19,2,6). It also helps to provide decent first guesses (start) and to set the number of trials (NumTrials) to a number above 1): peakfit([x,y],5,10,2,6,0,10,[4 2 5 2],0,0). The best case will be if the shape, position, and width of the two peaks are known accurately, and if the only unknown is their heights. Then the Classical Least Squares (multiple regression) technique can be employed and the results will be much better.So, to sum up, we can make the following observations about the accuracy of model parameters: (1) the parameter errors depend on the accuracy of the model chosen and on number of peaks; (2) the parameter errors are directly proportional to the noise in the data (and worse for low-frequency or pink noise); (3) all else being equal, parameter errors are proportional to the fitting error, but a model that fits the underlying reality better, e.g. equal or fixed widths or shapes) often gives lower parameter errors even if the fitting error is larger; (4) the errors are typically least for peak position and worse for peak width and area; (5) the errors depend on the data density (number of independent data points in the width of each peak) and on the extent of peak overlap (the parameters of isolated peaks are easier to measure than highly overlapped peaks); (6) if only a single signal is available, the effect of noise on the standard deviation of the peak parameters in many cases can be predicted approximately by the bootstrap method, but if the overlap of the peaks is too great, the error of the parameter measurements can be much greater than predicted.
|Actual peak parameters
|Gaussian fit to
|ExpGaussian fit tosignal||2800.1302||0.51829906