[Introduction] [Signal arithmetic] [Signals and noise] [Smoothing] [Differentiation] [Peak Sharpening] [Harmonic analysis] [Fourier convolution] [Fourier deconvolution] [Fourier filter] [Wavelets] [Peak area measurement] [Linear Least Squares] [Multicomponent Spectroscopy] [Iterative Curve Fitting] [Hyperlinear quantitative absorption spectrophotometry] [Appendix and Case Studies] [Peak Finding and Measurement] [iPeak] [iSignal] [Peak Fitters] [iFilter] [iPower] [List of downloadable software] [Interactive tools]
Here are two examples that show that the answer to this question is yes. The first example is shown on the left. This shows iSignal displaying a computer-generated 4000-point signal consisting of pure random noise, smoothed with a 19-point Gaussian smooth. The upper window shows a tiny slice of this signal that looks like a Gaussian peak with a calculated SNR over 1000. Only by looking at the entire signal (bottom window) do you see the true picture; that “peak” is just part of the noise, smoothed to look nice. Don't fool yourself.
The second example is a simple series of three Matlab commands that uses the 'randn' function to generate a 10000-point data set containing only normally-distributed white noise. Then it uses 'fastmooth' to smooth that noise, resulting in a 'signal' with a standard deviation of about 0.3 and a maximum value around 1.0. That signal is then submitted to iPeak. If the peak detection criteria (e.g. AmpThreshold and SmoothWidth) are set too low, many peaks will be found. But setting the AmpThreshold to 3 times the standard deviation (3 x 0.3 = 0.9) will greatly reduce the incidence of these false peaks.
The peak identification function, which identifies peaks based on their exact x-axis peak position and a stored table of identified peak positions, is even less likely to be fooled by random noise, because in addition to the peak detection criteria of the findpeaks algorithm, any detected peak must also match closely to a peak position in the table of known peaks.
This experimental signal in this case study was unusual because it did not look like a typical signal when plotted; in fact, it looked a lot like noise at first glance. The figure below compares the raw signal (bottom) with the same number of points of normally-distributed white noise (top) with a mean of zero and a standard deviation of 1.0 (obtained from the Matlab/Octave 'randn' function).
As you can see, the main difference is that the signal has more large 'spikes', especially in the positive direction. This difference is evident when you look at the descriptive statistics of the signal and the randn function:
random noise (randn function)
about 5 - 6
Standard Deviation (STD)
Inter-Quartile Range (IQR)
You can see that the standard deviations of these two are nearly the same, but the other statistics (especially the kurtosis and skewness) indicate that the probability distribution of the signal is far from normal; there are far more positive spikes in the signal than expected for pure noise. Most of these turned out to be the peaks of interest for this signal; they look like spikes only because the length of the signal (over 1,000,000 points) causes the peaks to be compressed into one screen pixel or less when the entire signal is plotted on the screen. In the figures on the left, iSignal is used to "zoom in" on some of the larger of these peaks (using the cursor arrow keys). The peaks are very sparsely separated (by an average of 1000 half-widths between peaks) and are well above the level of background noise (which has a standard deviation of roughly 0.9 throughout the signal).
The researcher who obtained this signal said that a 'good' peak was 'bell shaped', with an amplitude above 5 and a width of 500-1000 x-axis units. So that means that we can expect the signal-to-background-noise ratio to be at least 5/0.9 = 5.5. You can see in the three example peaks on the left that the peak widths do indeed meet those expectations. The interval between adjacent x-axis points is 25, so that means the we can expect the peaks to have about 20 to 40 points in their widths. Based on that, we can expect that the positions, heights and widths of the peaks should be able to be be measured fairly accurately using least-squares methods (which reduce the uncertainty of measured parameters by about the square root of the number of points used - about a factor of 5 in this case). However, the noise appears to be signal-dependent; the noise on the top of the peaks is distinctly greater than the noise on the baseline. The result is that the actual signal-to-noise (S/N) ratio of peak parameter measurement for the larger peaks will not be as good as might be expected based on the ratio of the peak height to the noise on the background. Most likely, the total noise in this signal is the sum of two major components, one with a fixed standard deviation of 0.9 and the other roughly equal to 10% of the peak height.
To automate the detection of large numbers of peaks, we can use the findpeaksG or iPeak functions. Reasonable values of the input arguments AmplitudeThreshold, SlopeThreshold, SmoothWidth, and FitWidth for those functions can be estimated based on the expected peak height (5) and width (20 to 40 data points) of the "good" peaks. For example, using AmplitudeThreshold=5, SlopeThreshold=.001, SmoothWidth=25, and FitWidth=25, these function detect and measure 76 peaks above an amplitude of 5 and with an average peak width of 523. The interactive iPeak function is especially convenient for exploring the effect of these peak detection parameters and for graphically inspecting the peaks that it finds. Ideally the objective is to find a set of peak detection arguments that detect and accurately measure all the peaks that you would consider 'good' and skip all the 'bad' ones. But in reality the criteria for good and bad peaks is at least partly subjective, so it's usually best to err on the side of caution and avoid skipping 'good' peaks at the risk of including a few 'bad' peaks in the mix, which can be weeded out manually based on unusual position, height, width, or appearance.
Of course, it must be expected that the values of the peak position, height, and width given by the findpeaksG or iPeak functions will only be approximate and will vary depending on the exact setting of the peak detection arguments; the noisier the data, the greater the uncertainty in the peak parameters. In this regard the peak-fitting functions peakfit.m and ipf.m usually give more accurate results, because they make use of all the data across the peak, not just the top of the peak as do findpeaksG and iPeak. For example, compare the results of the peak near x=3035200 measured with iPeak (click to view) and with peakfit (click to view). Also, the peak fitting functions are better for dealing with overlapping peaks and for estimating the uncertainty of the measured peak parameters, using the bootstrap options of those functions. For example, the largest peak in this signal has an x-axis position of 2.8683e+007, height of 32, and width of 500; the bootstrap method determines that the standard deviations are 4, 0.92, and 9.3, respectively.
Because the signal in the case study was so large (over 1,000,000 points), the interactive programs such as iPeak, iSignal, and ipf may be sluggish in operation, especially if your computer is not fast computationally or graphically. If this is a serious problem, it may be best to break the signal up into two or more segments and deal with each segment separately, then combine the results. Alternatively, you can use the condense function to average the entire signal into a smaller number of points by a factor of 2 or 3 (at the risk of slightly reducing peak heights and increasing peak widths), but then you should reduce SmoothWidth and FitWidth by the same factor to compensate for the reduced number of data points across the peaks. Run testcondense.m for a demonstration of the condense function.
C. Buried treasure.
The experimental signal in this case study had a number of narrow spikes above a seemingly flat baseline.
Using iSignal to investigate the signal, it was found that the visible positive spikes were single points of very large amplitude (up to 106), whereas the regions between the spikes were not really flat but contained bell-shaped peaks that were so much smaller (below 103) that they were not even visible on this scale. For example, using iSignal to zoom in to the region around x=26300, you can see one of those bell-shapes peaks with a small single-point negative-going spike artifact near its peak.
Very narrow spikes like this are common artifacts in some experimental signals; they are easy to eliminate by using a median filter. The iSignal function has such a filter, activated by the “M” key. The result shows that the single-point spike artifacts have been eliminated, with little effect on the character of the bell-shaped peak.
Other filter types, like most forms of smoothing, would be far less effective than a median filter for this type of artifact and would distort the peaks.
The negative spikes in this signal turned out to be steep steps, which can either be reduced by using iSignal's slew-rate limit function (the ` key) or manually eliminated by using the semicolon key (;) to set the selected region between the dotted red cursor lines to zero. Using the latter approach, the entire cleaned-up signal is shown below. The remaining peaks are all positive, bell shaped and have amplitudes from about 6 to about 750.
iPeak can automate the measurement of peak positions and heights for the entire signal.
If required, individual peaks can be measured more accurately by fitting the whole peak with iPeak's “N” key or with peakfit.m or ipf.m. The peaks are all slightly asymmetrical; they fit an exponentially-broadened Gaussian model to a fitting error less than about 0.5%, as shown on the left. The smooth residual plots suggests that the signal was smoothed before the spikes were introduced.
Note that fitting with an exponentially-broadened Gaussian model gives the peak parameters of the Gaussian before broadening. iSignal and iPeak estimate the peak parameters of the broadened peak. As before, the effect of the broadening is to shift the peak position to larger values, reduce the peak height, and increase the peak width.
Position Height Width Area error S/N Ratio
isignal 16871 788.88 32.881 27612 172
ipeak 16871 785.34 33.525 28029
peakfit(G) 16871 777.9 33.488 27729 1.68%
peakfit(E) 16863 973.72 27.312 28308 0.47%
G = Gaussian model; E = Exponentially-broadened Gaussian model
This case study demonstrates the application of
several techniques described in this paper to the quantitative
measurement of a peak that is buried in an unstable background,
a situation that can occur in the quantitative analysis
applications of various forms of spectroscopy, process
monitoring, and remote sensing. The objective is to derive a
measure of peak amplitude that varies linearly with the actual
peak amplitude but that is not effected by the changes in the
background and the random noise. In this example, the peak to be
measured is located at a fixed location in the center of the
recorded signal, at x=100 and has a fixed shape (Gaussian) and
width (30). The background, on the other hand, is highly
variable, both in amplitude and in shape. The simulation shows
six superimposed recordings of the signal with six different
peak amplitudes and with randomly varying background amplitudes
and shapes (top row left in the following figures). The methods
that are compared here include smoothing,
differentiation, classical least squares
multicomponent method (CLS), and iterative non-linear curve fitting.
CaseStudyC.m is a self-contained Matlab/Octave demo function that demonstrates this case. To run it, download it. place it in the path, and type “CaseStudyC” at the command prompt. Each time you run it, you'll get the same series of true peak amplitudes (set by the vector “SignalAmplitudes”, in line 12) but a different set of background shapes and amplitudes. The background is modeled as a Gaussian peak of randomly varying amplitude, position, and width; you can control the average amplitude of the background by changing the variable “BackgroundAmplitude” and the average change in the background by changing the variable “BackgroundChange”.
The five methods compared in the figures below are:
1: Top row center. A simple zero-to-peak measurement of the smoothed signal, which assumes that the background is zero.
2: Top row right. The difference between the peak signal and the average background on both sides of the peak (both smoothed), which assumes that the background is flat.
3: Bottom row left. A derivative-based method, which assumes that the background is very broad compared to the measured peak.
4: Bottom row center. Classical least squares (CLS), which assumes that the background is a peak of known shape, width, and position (the only unknown being the height).
5: Bottom row right. iterative non-linear curve fitting (INLS), which which assumes that the background is a peak of known shape but unknown width and position. This method can track changes in the background peak position and width (within limits), as long as the measured peak and the background shapes are independent of the concentration of the unknown.
These five methods are listed roughly in the order of increasing mathematical and geometrical complexity. They are compared below by plotting the actual peak heights (set by the vector “SignalAmplitudes”) vs the measure derived from that method, fitting the data to a straight line, and computing the coefficient of determination, R2., which ideally is 1.0000.
For the first test (shown in the figure above), both “BackgroundAmplitude” and “BackgroundChange” are set to zero, so that only the random noise is present. In that case all the methods work well, with R2.values all very close to 0.9999. With a 10x higher noise level (click to view), all methods still work about equally well, but with a lower coefficient of determination R2, as might be expected.
For the second test (shown in the figure immediately above), “BackgroundAmplitude”=1 and “BackgroundChange”=0, so the background has significant amplitude variation but a fixed shape, position, and width. In that case, the first two methods fail, but the derivative, CLS, and INLS methods work well.
For the third test, shown in the figure above, “BackgroundAmplitude”=1 and “BackgroundChange”=100, so the background varies in position, width, and amplitude (but remains broad compared to the signal). In that case, the CLS methods fails as well, because it assumes that the background varies only in amplitude. However, if we go one step further (click to view) and set “BackgroundChange”=1000, the background shape is now so unstable that even the INLS method fails, but still the derivative method remains effective as long as the background is broader than the measured peak, no matter what its shape. On the other hand, if the width and position of the measured peak changes from sample to sample, the derivative method will fail and the INLS method is more effective (click to view), as long as the fundamental shape of both measured peak and the background are both known (e.g. Gaussian, Lorentzian ,etc).
Not surprisingly, the more mathematically complex methods perform better, on average. Fortunately, software can "hide" that complexity, in the same way, for example, that a hand-held calculator hides the complexity of long division.
Ensemble averaging is a powerful method of reducing the effect of random noise in experimental signals, when it can be applied. The idea is that the signal is repeated, preferably a large number of times, and all the repeats are averaged. The signal builds up, and the noise gradually averages towards zero, as the number of repeats increases.
An important requirement is that the repeats be aligned
or synchronized, so that in the absence of random noise, the
repeated signals would line up exactly. There are two ways of
the signal repeats are triggered by some external event and the
data acquisition can use that trigger to synchronize the
(b) the signal itself has some feature that can be used to detect each repeat, whenever it occurs.
The first method (a) has the advantage that the signal-to-noise (S/N) ratio can be arbitrarily low and the average signal will still gradually emerge from the noise if the number of repeats is large enough. However, not every experiment has a reliable external trigger.
The second method (b) can be used to average repeated
patterns in one continuous signal without an external trigger
that corresponds to each repeat, but the signal must then
contain some feature (for example, a peak) with a
signal-to-noise ratio large enough to detect reliably in each
repeat. This method can be used even when the signal patters
occur at random intervals, when the timing of the repetitions is
not of interest. The interactive peak detector iPeak
6 has a built-in
ensemble averaging function (Shift-E) can compute the average of all the repeating
waveforms. It works by detecting a single peak in each
repeat in order to synchronize the repeats.
The Matlab script iPeakEnsembleAverageDemo.m (on http://tinyurl.com/cey8rwh) demonstrates this idea, with a signal that contains a repeated underlying pattern of two overlapping Gaussian peaks, 12 points apart, with a 2:1 height ratio, both of width 12. These patterns occur a random intervals, and the noise level is about 10% of the average peak height. Using iPeak (above left), you adjust the peak detection controls to detect only one peak in each repeat pattern, zoom in to isolate any one of those repeat patterns, and press Shift-E. In this case there are about 60 repeats, so the expected signal-to-noise (S/N) ratio improvement is sqrt(60) = 7.7. You can save the averaged pattern (above right) into the Matlab workspace as “EA” by typing
>> load EnsembleAverage; EA=EnsembleAverage;
then curve-fit this averaged pattern to a 2-Gaussian model using the peakfit.m function (figure on the right):
Position Height Width Area
32.54 13.255 12.003 169.36
44.72 6.7916 12.677 91.648
You'll see a big improvement in the accuracy of the peak separation, height ratio and width, compared to fitting a single pattern in the original x,y signal:
The wav file “horngoby.wav” (Ctrl-click to open) is a 2-second recording of the sound of a passing automobile horn, exhibiting the familiar Doppler effect. The sampling rate is 22000 Hz. Download this file and place it in your Matlab path. You can then load this into the Matlab workspace as the variable “doppler” and display it using iSignal:
Within iSignal, you can switching to frequency spectrum mode (press Shift-S), and zoom in on different portions of the waveform, so you can observe the downward frequency shift and measure it quantitatively. (Actually, it's much easier to hear the frequency shift - press Shift-P to play the sound - than to see it graphically; the shift is rather small on a percentage basis, but human hearing is very sensitive to small pitch (frequency) changes). It helps to re-plot the data to stretch out the frequency region around the fundamental frequency or one of the harmonics. I used iSignal to zoom in on three slices of this waveform and then I plotted the frequency spectrum (Shift-S) near the beginning (plotted in blue), middle (green), and end (red) of the sound. The frequency region between 150 Hz and 550 Hz are plotted in the figure below:
The group of peaks near 200 are the fundamental frequency of the lowest note of the horn and the group of peaks near 400 are its second harmonic. (Pitched sounds usually have a harmonic structure of 1, 2, 3... times a fundamental frequency). The group of peaks near 250 are the fundamental frequency of the next higher note of the horn and the group of peaks near 500 are its second harmonic. (Car and train horns often have two or three harmonious notes sounded together). In each of these groups of harmonics, you can clearly see that the blue peak (the spectrum measured at the beginning of the sound) has a higher frequency than the red peak (the spectrum measured at the end of the sound). The green peak, taken in the middle, has an intermediate frequency. The peaks are ragged because the amplitude and frequency varies over the sampling interval, but you can still get good quantitative measures of the frequency of each component by curve fitting to a Gaussian peak model using peakfit.m or ipf.m:
Peak Position Height Width Area
Beginning 206.69 3.0191e+005 0.81866 2.4636e+005
Middle 202.65 1.5481e+005 2.911 4.797e+005
End 197.42 81906 1.3785 1.1994e+005
The estimated precision of
the peak position (i.e. frequency) measurements is about
0.2% relative, based on the bootstrap method,
good enough to allow accurate calculation of the
frequency shift (about 4.2%) and the
of the vehicle
and to demonstrate that the ratio of the second
harmonic to the fundamental for these data is 2.0023, which is
very close to the theoretical value of 2.
also possible to plot the evolving spectrum of such a signal as
a contour graph, using the PlotSegFreqSpect.m
function, as show in the graphic below. You can see the various
frequency components drifting steadily down in frequency as the
time passes. See HarmonicAnalysis.html#Software.
% 40 segments, 240 harmonics
Spikes, narrow pulses with a width of only one or a few points, are sometimes encountered in signals as a result of an electronic “glitch” or stray pickup from nearby equipment, and they can easily be eliminated by the use of a “median” filter. But it is possible that in some experiments the spikes themselves might be the important part of the signal and that it is required to count or measure them. This situation was recently encountered in a research application, and it opens up some interesting twists on the usual procedures.
As a demonstration, the Matlab/Octave script SpikeDemo1.m creates a waveform (top panel of figure below) in which a series of spikes are randomly distributed in time, contaminated by two types of noise: white noise and a large-amplitude oscillatory interference simulated by a swept-frequency sine wave. The objective is to count the spikes and locate their position on the x (time) axis. Direct application of findpeaks or iPeak to the raw signal does not work well.
A single-point spike, called a delta function in mathematics, has a power spectrum that is flat; that is, it has equal power at all frequencies, just like white noise. But the oscillatory interference in this case is located in a specific range of frequencies, which opens some interesting possibilities. One approach would be to use a Fourier filter, for example a notch or band-reject filter, to remove the troublesome oscillations selectively. But if the objective of the measurement is only to count the spikes and measure their times, a simpler approach would be to (1) compute the second derivative (which greatly amplifies the spikes relative to the oscillations), (2) smooth the result (to limit the white noise amplification caused by differentiation), then (3) take the absolute value (to yield positive-pointing peaks). This can be done in a single line of nested Matlab/Octave code:
The result, shown the lower panel of the figure on the left above, is an almost complete extraction of the spikes, which can then be counted with findpeaksG.m or peakstats.m or iPeak.m:
The second example, SpikeDemo2.m, is similar except that in this case the oscillatory interference is caused by two fixed-frequency sine waves at a higher frequency, which completely obscure the spikes in the raw signal (top panel of the left figure below). In the power spectrum (bottom panel, in red), the oscillatory interference shows as two sharp peaks that dominate the spectrum and reach to y=106, whereas the spikes show as the much lower broad flat plateau at about y=10. In this case, use can be made of an interesting property of sliding-average smooths, such as the boxcar, triangular, and Gaussian smooths; their frequency responses exhibit a series of deep cusps at frequencies that are inversely proportional to their filter widths. So this opens up the possibility of suppressing specific frequencies of oscillatory interference by adjusting the filter widths until the cusps occur at or near the frequency of the oscillations. Since the signal in this cases are spikes that have a flat power spectrum, they are simply smoothed by this operation, which will reduce their heights and increase their widths, but will have little or no effect on their number or x-axis positions. In this case a 9 point pseudo-Gaussian smooth puts the first (lowest frequency) cusp right in between the two oscillatory frequencies.
In the figure on the right, you can see the effect of applying this filter; the spikes, which were not even visible in the original signal, are now cleanly extracted (upper panel), and you can see in the power spectrum (right lower panel, in red) that the two sharp peaks of oscillatory interference is reduced by about a factor of about 1,000,000! This operation can be performed by a single command-line function, adjusting the smooth width (the second input argument, here a 9) by trial and error to minimize the oscillatory interference:
(If the interference varies
substantially in frequency across the signal, you could
use a segmented
smooth rather than the standard fastsmooth).
Alternatively, the segmented
Fourier spectrum could be used to visualize this signal, and a
Fourier filter in “notch” mode
could be employed to specifically eliminate the interfering
The extracted peaks can then be counted with any of the peak finding functions, such as:
The simple script “iSignalDeltaTest” demonstrates the power spectrum of the smoothing and differentiation functions of iSignal by a applying them to a delta function. Change the smooth type, smooth width, and derivative order and other functions in order to see how the power spectrum changes.
Some experiments produce peaks that are distorted by being convoluted by processes that make peaks less distinct and modify peak position, height and width. Exponential broadening is one of the most common of these processes. Fourier deconvolution and iterative curve fitting are two methods that can help to measure the true underlying peak parameters. Those two methods are conceptually distinct, because in Fourier deconvolution, the underlying peak shape is unknown but the nature and width of the broadening function (e.g. exponential) is assumed to be known; whereas in iterative least-squares curve fitting it's just the reverse: the underlying peak shape (i.e. Gaussian, Lorentzian, etc) must be known, but the width of the broadening process is initially unknown.
In the script shown below and the resulting graphic shown on the right (Download this script), the underlying signal (uyy) is a set of four Gaussians with peak heights of 1.2, 1.1, 1, 0.9 located at x=10, 20, 30, 40 and peak widths of 3, 4, 5, 6, but in the observed signal (yy) these peaks are broadened exponentially by the exponential function cc, resulting in shifted, shorter, and wider peaks, and then a little constant white noise is added after the broadening. The deconvolution of cc from yy successfully removes the broadening (yydc), but at the expense of a substantial noise increase. However, the extra noise in the deconvoluted signal is high-frequency weighted ("blue") and so is easily reduced by smoothing and has less effect on least-square fits than does white noise. (For a greater challenge, try adding more noise in line 6 or use a bad estimate of time constant in line 10). To plot the recovered signal overlaid with underlying signal: plot(xx,uyy,xx,yydc). To plot the observed signal overlaid with the underlying signal: plot(xx,uyy,xx,yy). Excellent values for the original underlying peak positions, heights, and widths can be obtained by curve-fitting the recovered signal to four Gaussians: [FitResults,FitError]= peakfit([xx;yydc],26,42,4,1,0,10). With ten times the previous noise level (Noise=.01), the values of peak parameters determined by curve fitting are still quite good, and even with 100x more noise (Noise=.1) the peak parameters are more accurate than you might expect for that amount of noise (because that noise is blue). Visually, the noise is so great that the situation looks hopeless, but the curve-fitting actually works pretty well.
% Underlying Gaussian peaks with unknown heights, positions, and widths.
uyy=modelpeaks2(xx,[1 1 1 1],[1.2 1.1 1 .9],[10 20 30 40],[3 4 5 6],...
[0 0 0 0]);
% Observed signal yy, with noise added AFTER the broadening convolution
Noise=.001; % <---Try more noise to see how this method handles it.
yy=modelpeaks2(xx,[5 5 5 5],[1.2 1.1 1 .9],[10 20 30 40],[3 4 5 6],...
[-40 -40 -40 -40])+Noise.*randn(size(xx));
% Compute transfer function, cc,
cc=exp(-(1:length(yy))./40); % <---Change exponential time constant here
% Attempt to recover original signal uyy by deconvoluting cc from yy
% It's necessary to zero-pad the observed signal
yy as shown here.
% Plot and label everything
subplot(2,2,1);plot(xx,uyy);title('Underlying signal, uyy');
subplot(2,2,2);plot(xx,cc);title('Exponential transfer function, cc')
subplot(2,2,3);plot(xx,yy);title('observed broadened, noisy signal, yy');
subplot(2,2,4);plot(xx,yydc);title('Recovered signal, yydc')
An alternative to the above deconvolution approach is to curve fit the observed signal directly with an exponentially broadened Gaussian (shape number 5): [FitResults,FitError]=peakfit([xx;yy],26,50,4,5,40,10). Both methods give good values of the peak parameters, but the deconvolution method is considerably faster, because curve fitting with a simple Gaussian model is faster than fitting with a exponentially broadened peak model, especially if the number of peaks is large. Also, if the exponential factor is not known, it can be determined by curve fitting one or two of the peaks in the observed signal, using ipf.m, adjusting the exponential factor interactively to get the best fit. Note that you have to give peakfit a reasonably good value for the time constant ('extra'), the input argument right after the peakshape number. If the value is too far off, the fit may fail completely, returning all zeros. A little trial and error suffice. Alternatively, you could try to use peakfit.m version 7 with the independently variable time constant exponentially-broadened Gaussian shape number 31 or 39, to measure the time constant as an iterated variable (to understand the difference, see example 39). If the time constant is expected to be the same for all peaks, better results will be obtained by using shape number 31 or 39 initially to measure the time constant of an isolated peak (preferably one with a good S/N ratio), then apply that fixed time constant in peak shape 5 to all the other groups of overlapping peaks.
Digitization noise, also called quantization noise, is an artifact caused by the rounding or truncation of numbers to a fixed number of figures. It can originate in the analog-to-digital converter that converts an analog signal to a digital one, or in the circuitry or software involved in transmitting the digital signal to a computer, or even in the process of transferring the data from one program to another, as in copying and pasting data to and from a spreadsheet. The result is a series of non-random steps of equal height. The frequency distribution is white, because of the sharpness of the steps, as you can see by observing the power spectrum.
The figure on the left, top panel, shows the effect of integer digitization on a sine wave with an amplitude of +/- 10. Ensemble averaging, which is usually the most effective of noise reduction techniques, does not reduce this type of noise (bottom panel) because it is non-random.
Interestingly, if additional random noise is present in the signal, then ensemble averaging becomes effective in reducing both the random noise and the digitization noise. In essence, the added noise randomizes the digitization, allowing it to be reduced by ensemble averaging. Moreover, if there is insufficient random noise already in the signal, it can be beneficial to add additional noise artificially! The script RoundingError.m illustrates this effect, as shown the animation on the right, which shows the digitized sine wave with gradually increasing amounts of added random noise in line 8 (generated by the randn.m function) followed by ensemble averaging of 100 repeats (in lines 17-20). Look closely at the waveform in this animation as it changes in response to the random noise addition shown in the title. You can clearly see how the noise starts out mostly quantization noise but then quickly decreases as small but increasing amounts of random noise are added before the ensemble averaging step, then eventually increases as too much noise is added. The optimum standard deviation of random noise is about 0.36 times the quantization size, as you can demonstrate by adding lesser or greater amounts via the variable Noise in line 6 of this script. This technique is called "dithering" and it is also used in audio and in image processing.
An audible example of this idea is illustrated by the Matlab/Octave script DigitizedSpeech.m, which starts with an audio recording of the spoken phrase "Testing, one, two, three", previously recorded at 44000 Hz and saved in WAV format (TestingOneTwoThree.wav) and in .mat format (testing123.mat), rounds off the amplitude data progressively to 8 bits (256 steps; sound link), shown on the left, 4 bits (16 steps; sound link), and 1 bit (2 steps; sound link), and then the 2-step case again with random white noise added before the rounding (2 steps + noise; sound link), plots the waveforms and plays the resulting sounds, demonstrating both the degrading effect of rounding and the remarkable improvement caused by adding noise. (Click on these sound links to hear the sounds on your computer). Although the computer program in this case does not perform an explicit ensemble averaging operation as does RoundingError.m, it's likely that the neurons of the hearing center of your brain provide that function by virtue of their response time and memory effect.
This is a simulation of several techniques described in this paper to the quantitative measurement of a peak that is buried in excess of random noise, where the signal-to-noise (S/N) ratio is below 2. (Ordinarily, a S/N ratio of 3 is desired for reliable detection).
The Matlab/Octave script LowSNRdemo.m performs the simulations and calculations and compares the results graphically, focusing on the behavior of each method as the S/N ratio approaches zero. Four methods are compared:
(1) smoothing, followed by the peak-to-peak measure of the smoothed signal and background;
(2) a peak finding method based on the findpeakG function;
(3) unconstrained iterative least-squares fitting (INLS) based on the peakfit.m function; and
(4) constrained classical least squares fitting (CLS) based on the cls2.m function.
The measurements are carried out over a range of peak heights for which the S/N ratio varies from 0 to 2. The noise is random, constant, and white. Each time you run the script, you get the same set of underlying signals but independent samples of the random noise.
Results for the initial values in the script are shown in the plots on the left and in the table printed below, both of which are created by the script LowSNRdemo.m. The graphs on the left show correlation plots of the measured peak height vs the real peak height, which should ideally be a straight line with a slope of 1, an intercept of zero, and an R-squared of 1. As you can see, the simplest smoothed-peak method (upper left) is completely inadequate, with a low slope (because smoothing reduces peak height) and a high intercept (because even smoothed noise has a non-zero peak-to-peak value). The findpeaks function (upper right) works OK for height for higher peak heights but fails completely below a S/N ratio of 0.5 because the peak height falls below the amplitude threshold setting. In comparison, the two least-squares techniques work much better, reporting much better values of slope, intercept of zero, and R-squared. But if you look closely at the low end of the peak height range, near zero, you can see that the values reported by the unconstrained fit (lower left) occasionally stray from the line, whereas the constrained fit (lower right) decrease gracefully all the way to zero every time you run the script. Essentially the reason why it's even possible to make measurements at such low S/N ratios is that the data density is very high: that is, there are many data points in each signal (about 1000 points across the half-width of the peak with the initial script values).
The results are summarized in the table below. The height errors are reported as a percentage of the maximum height (initially 2). (For the first three methods, the peak position is also measured and its relative accuracy is reported. The constrained classical least squares fitting does not measure peak position but rather assumes that it remains fixed at the initial value of 100). You can see that the CLS method has a slight edge in accuracy, but you have to consider also that this method works well only if the peak shape, position, and width are known. The unconstrained iterative method can track changes in peak position and width.
Number of points in half-width of
Method Height Error Position Error
Smoothed peak 21.2359% 120.688%
findpeaksG.m 32.3709% 33.363%
peakfit.m 2.7542% 4.6466%
You can change
several of the factors in this simulation to test the robustness
of these methods. Search for the word 'change' in the comments
for values that can be changed. Reduce MaxPeakHeight (line 8) to
make the problem harder. Change peak position and./or width
(lines 9 and 10) to show how the CLS method fails. As usual, the
more you know, the better your results. Change the increment
(line 4) to change the data density; more data is always better.
(Surprisingly, as it will be shown in Appendix U: Measurement Calibration, it is not even necessary to have an accurate peak shape model in order to get a good correlation between measured and actual height).
LowSNRdemo.m also computes the power spectrum of the signal and the amplitude (square root of the power) of the fundamental, where most of the power of a broad Gaussian peak falls, and plots it in Figure(2). The correlation to peak height is similar to the CLS method, but the intercept is higher because there is a non-zero quantity of noise even in that one frequency slice of the power spectrum.
We are now, in the 21st century, into the era of "big data", where high-speed automated data systems can acquire, store, and process greater quantities of data than ever before. As this little example shows, greater quantities of data allow researchers to probe deeper and measure smaller effects that previously.
The signal detection problems facing those who search the sky for evidence of extraterrestrial civilizations or interesting natural phenomena are enormous. Among those problems are the fact that we don't know much about what to expect. In particular, we don't know exactly where to look in the sky, or what frequencies might be used, or the possible forms of the transmissions. Moreover, the many powerful sources of natural and terrestrial sources of interfering signals must not be confused for extraterrestrial ones. There is also the massive computer power required, which has driven the development of specialized hardware and software as well as distributed computation over thousands of Internet-connected personal computers across the world using the SETI@home computational screen-saver. Although many of the computational techniques used in this search are far more sophisticated than those covered in this web site, they begin with the basic concepts covered here.
One of the reoccurring themes of this site has
been that the more you know about your data, the more
likely you are to obtain a reliable measurement. In the
case of possible extraterrestrial signals, we don't know
much, but we do know a few things.
We know that electromagnetic radiation over a wide range of frequencies is used for long-distance transmission on earth and between earth and satellites and probes far from earth. Astronomers already use radio telescopes to receive natural radiations from vast distances. In order to look at different frequencies at once, Fourier transforms of the raw telescope signals can be computed over multiple time segments. We previously saw a simulation that showed how hard it is to see a periodic component in the presence of on equal amount of random noise and yet how easy it is to pick it out in the frequency spectrum.
Also, transmissions from extraterrestrial civilizations might be in the form of equally-spaced pulses, so their detection and verification is also part of SETI signal processing. Interestingly, triplets and other groups of equally spaced pulses appear in the Fourier transforms of high frequency carrier waves that are amplitude or frequency modulated (like AM or FM radio). Of course, there is no reason to assume, nor to reject, that extraterrestrial civilizations might use the same methods of communication as ourselves.
One thing that we know for sure is that the earth rotates around its axis once a day and that it revolves around the sun once a year. So if we look at a fixed direction out from the earth, the distant stars will seem to move in a predictable pattern, whereas terrestrial sources will remain fixed on earth. The huge Arecibo Observatory dish in Puerto Rico is fixed in position and is often used to look in one selected direction for extended periods of time. The field of view of this telescope is such that a point source at a distance takes 12 seconds to pass. As SETI says: “Radio signals from a distant transmitter should get stronger and then weaker as the telescope's focal point moves across that area of the sky. Specifically, the power should increase and then decrease with a bell shaped curve (a Gaussian curve). Gaussian curve-fitting is an excellent test to determine if a radio wave was generated 'out there' rather than a simple source of interference somewhere here on Earth, since signals originating from Earth will typically show constant power patterns rather than curves”. Also, any observed 12 second peaks can be re-examined with another focal point shifted towards the west to see if it repeats with the expected time and duration.
We also know that there will be a Doppler shift in the frequencies observed if the source is moving relative to the receiver; this is observed with sound waves as well as with electromagnetic waves like radio or light. Because the earth is rotating and revolving at a known and constant speed, we can accurately predict and compensate for the Doppler shift caused by earth's motion (this is called “de-chirping” the data).
For more on the details of SETI signal processing, see SETI@home.
This appendix examines more closely the question of measuring peak area rather than peak height to reduce the effect of peak broadening, which commonly occurs in chromatography, for reasons that are discussed previously, and also in some forms of spectroscopy. Under what conditions the measurement of peak area might be better than peak height?
The Matlab/Octave script “HeightVsArea.m” simulates the measurement
of a series of standard samples whose concentrations
are given by the vector 'standards'. Each standard
produces an isolated peak whose peak height is
directly proportional to the corresponding value in
'standards' and whose underlying shape is a Gaussian with a constant peak
position ('pos') and width ('wid'). To simulate the
measurement of these samples under typical conditions,
the script changes the shape of the peaks (by
exponential broadening) and adds a variable baseline
and random noise. You can control, by means of the
variable definitions in the first few lines of the
script, the peak beginning and end, the sampling rate
'deltaX' (increment between x values), the peak
position and width ('pos' and 'wid'), the sequence of
peak heights ('standards'), the baseline amplitude
('baseline') and its degree of variability ('vba'),
the extent of shape change ('vbr'), and the amount of
random noise added to the final signal ('noise').
The resulting peaks are shown in Figure 1, above. The script prepares a series of “calibration curves” plotting the values of 'standard' against the measured peak heights or areas for each measurement method. The measurement methods include peak height in Figure 2, peak area in Figure 3, and curve fitting height and area in Figures 4 and 5, respectively. These plots should ideally have an intercept of zero and an R2 of 1.000, but the slope is greater for the peak area measurements because area has different units and is numerically greater than peak height. All the measurement methods are baseline corrected; that is, they include code that attempts to compensate for changes in the baseline (controlled by the variable 'baseline').
With the initial values of 'baseline', 'noise', 'vba', and 'vbr', you can clearly see the advantage of peak area measurements (figure 3) compared to peak height (figure 2). This is primarily due to the effect of the variability of peak shape broadening ('vbr') and to the averaging out of random noise in the computation of area.
Figure 2 Figure 3
If you set 'baseline', 'noise', 'vba', and 'vbr' all to zero, you've simulated a perfect world in which all methods work perfectly.
Curve fitting can measure both peak height and area; it is not even absolutely necessary to use an accurate peak shape model. Using a simple Gaussian model in this example works much better for peak area (Figure 5) than for peak height (Figure 4) but is not significantly better than a simple peak area measurement (Figure 3). The best results are obtained if an exponentially-broadened Gaussian model (shape 31 or 39) is used, using the code in line 30, but that computation takes longer. Moreover, if the measured peak overlaps another peak significantly, curve fitting both of those peaks together can give much more accurate results that other peak area measurement methods.
Microsoft Excel and OpenOffice Calc have the ability to automate
repetitive tasks using “macros”, saved sequences of commands or
keystrokes that are stored for later use. Macros can be most
easily created using the built-in “Macro Recorder”, which will
literally watch all your clicks, drags, and keystrokes and
record them for later playback. Or you can write or edit your
macros in the macro language of that spreadsheet (VBA in Excel;
recorder first, then edit the resulting code manually to modify
To enable macros in Excel, click on File, Options, click Customize Ribbon Tab and check 'Developer' and click 'OK'. To access the macro recorder, click Developer, Record Macro, give the macro a name, click Options, assign a Ctrl-key shortcut, and click OK. Then perform your spreadsheet operations, and when finished, click Stop Recording and save the spreadsheet. Thereafter, simply pressing your Ctrl-key shortcut will run the macro and perform all the spreadsheet operations that you recorded.
Here I will demonstrate two applications in Excel using macros with the Solver function. (See http://peltiertech.com/Excel/SolverVBA.html#Solver2 for information about setting up macros and solver on your version of Excel).
A previous section described the use of the Solver function applied to the iterative fitting of overlapping peaks in a spreadsheet. The steps listed in the second paragraph on that page can easily be captured with the macro recorder and saved with the spreadsheet. However, a different macro will needed for each different number of peaks, because the block of cells representing the “Proposed Model” will be different for each number of peaks. For example, the template CurveFitter2Gaussian.xlsm includes a macro named 'fit' for a 2-peak fit, activated by pressing Ctrl-f. Here is the text of that macro:
' fit Macro
' Keyboard Shortcut: Ctrl+f
SolverOk SetCell:="$C$12", MaxMinVal:=2, ValueOf:=0, ByChange:="$C$8:$D$9", _
Engine:=1, EngineDesc:="GRG Nonlinear"
You can see that the text of the macro uses only two macro instructions: "SolverOK" and "SolverSolve". SolverOK specifies all the information in the "Solver Parameters" dialog box in its input arguments: 'SetCell' sets the objective as the percent fitting error in cell C12, 'MaxMinVal' is set to the second choice (Minimum), and 'ByChange' specifies the table of cells representing the proposed model (C8:D9) whose values are to be changed to minimize the objective in cell C12. The last argument sets the solver engine to 'GRC Nonlinear', the best one for iterative peak fitting. Finally, "SolverSolve" starts the Solver engine. You could easily modify this macro for curve fitter templates with other numbers of peaks just by changing the cells referenced in the 'ByChange' argument, e.g. C8:E9 for a 3-peak fit. In this case, though, is probably just as easy to use the macro recorder to record a macro for each curve fitter template.
A more elaborate example of a spreadsheet using a macro is TransmissionFittingCalibrationCurve.xls (screen image) that creates a calibration curve for a series of standard concentrations in the TFit method, which was previously described on TFit.html#spreadsheet. Here's a portion of that macro:
Application.CutCopyMode = False
Selection.PasteSpecial Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks _
Selection.PasteSpecial Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks _
SolverOk SetCell:="$H$6", MaxMinVal:=2, ValueOf:=0, ByChange:="$I$6", Engine:=1 _
, EngineDesc:="GRG Nonlinear"
SolverOk SetCell:="$H$6", MaxMinVal:=2, ValueOf:=0, ByChange:="$I$6", Engine:=1 _
, EngineDesc:="GRG Nonlinear"
Selection.PasteSpecial Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks _
The macro in this spreadsheet repeats this chunk of code several times, once for each concentration in the calibration curve (changing only the "AF10" in the first line to pick up a different concentration from the "Results table" in column AF). This macro uses several additional instructions, to select ranges ("Range...Select"), copy ("Selection.Copy") and paste ("Selection.PasteSpecial Paste:=xlPasteValues") values from one place to another, and re-calculate the spreadsheet ("Calculate"). Each separate click, menu selection, or key press creates one or more lines of macro text. The syntax is wordy but quite explicit and clear; you can learn quite a bit just by recording various spreadsheet actions and looking at the resulting macro text.
The random walk was mentioned
in the section on signals
and noise as a type of low-frequency ("pink") noise. Wikipedia says:
random walk is a mathematical formalization of a path that
consists of a succession of random steps. For example, the path
traced by a molecule as it travels in a liquid or a gas, the
search path of a foraging animal, superstring behavior, the
price of a fluctuating stock and the financial status of a
gambler can all be modeled as random walks, although they may
not be truly random in reality."
Random walks describe and serve as a model for many kinds of unstable behavior. Whereas white, 1/f, and blue noises are anchored to a mean value to which they tend to return, random walks tend to be more aimless and often drift off on one or another direction, possibly never to return. Mathematically, a random walk can be modeled as the cumulative sum of some random process, for example the 'randn' function. The graph on the right compares a 200-point sample of white noise (computed as 'randn' and shown in blue) to a random walk (computed as a cumulative sum, 'cumsum', and shown in red). Both samples are scaled to have exactly the same standard deviation, but their behavior is vastly different. The random walk has much more low frequency behavior, in this case wandering off beyond the amplitude range of the white noise. This type of random behavior is very disruptive to the measurement process, distorting the shapes of peaks and causing baselines to shift and making them hard to define, and it can not be reduced significantly by smoothing (See NoiseColorTest.m). In this particular example, the random walk has an overall positive slope and a "bump" near the middle that could be confused for a real signal peak (it's not; it's just noise). But another sample might have very different behavior. Unfortunately, it is not uncommon to observe this behavior in experimental signals.
To demonstrate the measurement difficulties, the script RandomWalkBaseline.m simulates a Gaussian peak with randomly variable position and width, on a random walk baseline, with a S/N ratio is 15. The peak is measured by least-squares curve fitting methods using peakfit.m with two different methods of baseline correction in an attempt to handle the random walk:
(a) a single-component Gaussian model (shape 1) with autozero set to 1 (meaning a linear baseline is first interpolated from the edges of the data segment and subtracted from the signal): peakfit([x;y],0,0,1,1,0,10,1);
(b) a 2-component model, the first being a Gaussian (shape 1) and the second a linear slope (shape 26), with autozero set to 1: peakfit([x;y],0,0,2,[1 26],[0 0],10,0).
In this particular case the fitting error is lower for the second method, especially if the peak falls near the edges of the data range.
relative percent errors of the peak parameters show that the first method gives a
lower error for position and width, at least in this case. On
average, the peak parameters are about the same.
Height Error Width Error
Method a: 0.2772 3.0306 0.0125
Method b: 0.4938 2.3085 1.5418
You can compare this to WhiteNoiseBaseline.m which has a similar signal and S/N ratio, except that the noise is white. Interestingly, the fitting error with white noise is greater, but the parameter errors (peak position, height, width, and area) are lower, and the residuals are more random and less likely to produce false noise peaks. This is because the random walk noise is very highly concentrated at low frequencies where the signal frequencies usually lie, whereas white noise also has considerable power at higher frequencies, which increases the fitting error but does comparatively little damage to signal measurement accuracy. This may be slightly counter-intuitive, but it's important to realize that fitting error does not always correlate with peak parameter error. Bottom line: random walk is troublesome.
Depending on the type of experiment, an instrumental design based on modulation techniques may help, and ensemble averaging multiple measurements can help with any type of unpredictable random noise, which is discussed in the very next section.
In some experimental designs it may be beneficial to apply the technique of modulation, in which one of the controlled independent variables is oscillated in a periodic fashion, and then to detect the resulting oscillation in the measured signal. With the right instrumental design, some types of noise and drift may thereby be reduced or eliminated.
A simple example applies to optical measurement systems like the one pictured on the right. A light source illuminates a test object (DUT="Device Under Test") and the resulting light from the test object is measured by photo detector. Depending on the objective of the experiment and the arrangement of the parts, the detector might measure the light transmitted by, reflected by, scattered by, or excited by the light beam. A optical chopper rapidly and repeatedly interrupts the light beam falling on the test object, so that the photo detector sees an oscillating signal, and the following electronic system is designed to measure only the oscillating component and to ignore the constant (direct current) component. The advantage of this arrangement is that any interfering signals introduced after the chopper - such as constant light that comes from the test object itself, or ambient light that leaks in from the outside, or any constant background signal generated by the photo detector itself - are not oscillating and are thus rejected. This works best if the electronics is synchronized to the chopper frequency. That's actually the function of the lock-in amplifier, which receives a synchronizing reference signal directly from the chopper to guarantee synchronization even if the chopper frequency were to vary (c.f. the interactive simulation on https://terpconnect.umd.edu/~toh/models/lockin.html and T. C. O'Haver, "Lock-in Amplifiers," J. Chem. Ed. 49, March and April (1972). The lock-in amplifier is sometimes viewed as a "black box" with almost magical abilities, but in fact it is actually performing a rather simple (but very useful) operation, as shown in this simulation.
Matlab/Octave script simulation of modulation and synchronous
detection, in which the signal created when the light beam
scans the test sample is modeled as a Gaussian band ('y'),
whose parameters are defined in the first few lines of the
script (updated for recent version of Matlab). As the spectrum
of the sample is scanned, the light beam is amplitude
modulated by the chopper, represented as a square wave defined
by the bipolar vector 'reference', which switches between +1
and -1, shown in the top panel on the left. The modulation
frequency is many times faster than the rate at which the
sample is scanned. The light emerging from the sample
therefore shows a finely chopped Gaussian ('my'), shown in the
second panel on the left. But the total signal seen by the detector also includes
an unstable background introduced after the modulation ('omy'), such as
lighted emitted by the sample itself or detector background,
which in this simulation this is modeled as a random walk
(Appendix O), which seriously distorts the signal, shown in
the third panel. The detector signal is then sent to a lock-in
amplifier that is synchronized to the reference waveform. The
essential action of the lock-in is to multiply the signal by
the bipolar reference waveform, inverting the signal
when the light of off and passing it
unchanged when the light is on. This causes the unmodulated background signal to be
converted into a bipolar square wave, whereas the modulated
signal is not effected because it is "off" when the reference
signal is negative. The result ('dy') is shown in the 4th
this signal is low-passed
filtered by the last stage in the lock-in amplifier to
remove the modulation frequency, resulting in the recovered
signal peak 'sdy' shown in the bottom panel. In effect, the
modulation transforms the signal to a higher frequency
('frequency' in line 44) where low-frequency weighted noise on
the baseline (line 50) are less intense.
SignalToNoiseRatio = 4
Relative % Error: Position Height Width
Original: 8.07 23.1 13.7
Modulated: 0.11 0.22 1.01
Each time you run it you will get the same signal peak but a very different random walk background. The S/N ratio will vary from about 4 to 9. It's not uncommon to see a 100-fold improvement in peak height accuracy with modulation, as in the example shown here. (If you wish, you can change the signal peak parameters and the noise level in the first few code lines of this simulation. For an even greater challenge, change line 47 to "baseline=10.*noise+cumsum(noise);" to make the noise a mixture of white and random walk drift, which results in a really ugly raw signal; you can see that the white noise makes it through the synchronous detector but is reduced by the smoothing lowpass filter in the last stage). In effect, the low-pass filter determines the frequency bandwidth of the lock-in system, but it also increases the response time to step changes (as in the Morse Code example).
This improvement in measurement accuracy works only because the dominant random error in this case is
(a) introduced after the modulation, and
(b) a mostly low-frequency noise.
If the noise were white, there would be no improvement, because white noise is the same at all frequencies; in fact there would be a slight reduction in precision because of the fact that the chopper blocks half of the light on average. If the sample (device under test) generates an "absorption" peak that starts a some positive value and then dips down to a lower value before returning, the demodulated output will the a negative-goring peak rather than a positive peak (see AmplitudeModulationAbsorption.m).
explores the problem of measuring the height of a small peak (a
"child peak") that is buried in the tail of a much stronger overlapping peak (a "parent peak"), in
the especially challenging case that the smaller peak is not even visible to
the unaided eye. Three different measurement tools will be
explored: iterative least-squares,
classical least-squares regression,
and peak detection,
using the Matlab/Octave tools peakfit.m,
cls.m, or findpeaksG.m,
respectively. (Alternatively, you could use the corresponding spreadsheet templates).
In this example the larger peak is located at x=4 and has a height of 1.0 and a width of 1.66; the smaller measured peak is located at x=5 and has a height of 0.1; both have a width of 1.66. Of course, for the purposes of this simulation, we pretend that we don't necessarily know all of these facts and we will try to find methods that will extract such information as possible from the data, even if the signal is noisy. The measured peak is small enough and close enough to the stronger overlapping peak (separated by less than the width of the peaks) that it never forms a maximum in the total signal. So it looks like there is only one peak, as shown on the figure on the right. For that reason the findpeaks.m function (which automatically finds peak maxima) will not be useful by itself to locate the smaller peak. Simpler methods for detecting the second peak also fail to provide a way to measure the smaller second peak, such as inspecting the derivatives of the signal (the smoothed fourth derivative shows some evidence of asymmetry, but that could just be due to the shape of the larger peak), or Fourier self-deconvolution to narrow the peaks so they are distinguishable, but that is unlikely to be successful with this much noise. Least-squares methods work better when the signal-to-noise ratio is poor, and they can be fine-tuned to make use of available information or constraints, as will be demonstrated below.
The selection of the best method will depend on what is known about the signal and the constraints that can be imposed; this will depend in your knowledge of your experimental signal. In this simulation (performed by the Matlab/Octave script SmallPeak.m), the signal is composed of two Gaussian peaks (although that can be changed if desired in line 26). The first question is: are there more than one peak there? If we perform an unconstrained iterative fit of a single Gaussian to the data, as shown in the figure on the right, it shows little or no evidence of a second peak - the residuals look pretty random. (If you could reduce the noise, or ensemble-average even as few as 10 repeat signals, then the noise would be low enough to see evidence of a second peak). But as it is, there is nothing that pops out at you suggesting a second peak.
But suppose we suspect that there should be another peak of the same Gaussian shape just on the right side of the larger peak. We can try fitting a pair of Gaussians to the data (figure on the left), but in this case the random noise is enough that the fit is not stable. When you run SmallPeak.m, the script performs 20 repeat fits (NumSignals in line 20) with the same underlying peaks but with 20 different random noise samples, revealing the stability (or instability) of each measurement method. The fitted peaks in Figure window 1 bounce around all over the place as the script runs, as you can see in the animation on the left. The fitting error is on average lower that the single-Gaussian fit, but that by itself does not mean that the peak parameters so measured will be reliable; it could just be "fitting the noise". If it were isolated all by itself, the small peak would have a S/N ratio of about 5 and it could be measured to a peak height precision of about 3%, but the presence of the larger interfering peak makes the measurement much more difficult. (Hint: After running SmallPeak.m the first time, spread out all the figure windows so they can all be seen separately and don't overlap. That way you can compared the stability of the different methods more easily.)
But suppose that we have reason to expect that the two peaks will have the same width, but we don't know what that width might be. We could try an equal width Gaussian fit (peak shape #6, shown in Matlab/Octave Figure window 2); the resulting fit is much more stable and shows that a small peak is located at about x=5 on the right of the bigger peak, shown below on the left. On the other hand, if we know the peak positions beforehand, but not the widths, we can use a fixed-position Gaussian fit (shape #16) shown on the right (Figure window 3). In the very common situation where the objective is to measure an unknown concentration of a known component, then it's possible to prepare standard samples where the concentration of the sought component is high enough for its position or width to be determined with certainty.
So far all of these examples have used iterative
peak fitting with at least one peak parameter (position and/or
width) unknown and determined by
measurement. If, on the other hand, all
the peak parameters are known except the peak height,
then the faster and more direct classical
least-squares regression (CLS) can be employed (Figure
window 4). In this case you need to know the peak position and
width of both the measured and the larger interfering peaks (the
computer will calculate their heights). If the positions and the
heights really are constant and known, then this method gives
the best stability and precision of measurement. It's also
computationally faster, which might be important if you have
lots of data to process automatically.
The problem with CLS is that it fails to give accurate measurements if the peak position and/or width changes without warning, whereas two of the iterative methods (unconstrained Gaussian and equal-width Gaussian fits) can adapt to such changes. It some experiments it quite common to have small unexpected shifts in the peak position, especially in chromatography or other flow-based measurements, caused by unexpected changes in temperature, pressure, flow rate or other instrumental factors. In SmallPeaks.m, such x-axis shifts can be simulated using the variable "xshift" in line 18. It's initially zero, but if you set it to something greater (e.g. 0.2) you'll find that the equal-width Gaussian fit (Figure window 2) works better because it can keep up with the changes in x-axis shifts.
But with a greater x-axis shift (xshift=1.0) even the equal-width fit has trouble. Still, if we know the separation between the two peaks, it's possible to use the findpeaksG function to search for and locate the larger peak and to calculate the position of the smaller one. Then the CLS method, with the peak positions so determined for each separate signal, shown in Figure window 5 and labeled "findpeaksP" in the table below, works better. Alternatively, another way to use the findpeaks results is a variation of the equal-width iterative fitting method in which the first guess peak positions (line 82) are derived from the findpeaks results, shown in Figure window 6 and labeled "findpeaksP2" in the table below; that method does not depend on accurate knowledge of the peak widths, only their equality.
Each time you run
SmallPeaks.m, all of these methods are
computed “NumSignals” times (set in
line 20) and compared in a table giving the average peak height
accuracy of all the repeat runs:
Unconstr. EqualW FixedP FixedP&W findpeaksP findpeaksP2
35.607 16.849 5.1375 4.4437 13.384 16.849
Unconstr. EqualW FixedP FixedP&W findpeaksP findpeaksP2
31.263 44.107 22.794 46.18 10.607 10.808
Bottom line. The more you know about your signals, the better you can measure them. A stable signal with known peak positions and widths is the most precisely measurable in the presence of random noise ("FixedP&W"), but if the positions or widths vary from measurement to measurement, different methods must be used, and precision is degraded because more of the available information is used to account for changes other than the ones you want to measure.
signal-to-noise perspective, the stock market is an interesting
example. A national or global stock market is an aggregation of
large numbers of buyers and sellers of shares in publicly traded
companies. They are described by stock market indexes, which are
computed as the weighted average of a large number of selected
stocks. For example, the S&P
500 index is computed from the stock valuations of 500
large US companies. Millions of individuals and organizations
participate in the buying and selling of stocks on a daily
basis, so the S&P 500 index is a prototypical "big data"
conglomerate, reflecting the overall value of 500 of the largest
companies in the largest stock market on earth. Other stock
indices, such as the Russel 2000, include an even larger number
of smaller companies. Individual stocks can fail or fall
drastically in value, but the market indexes average out the
performance of hundreds of companies.
A plot of the daily value, V, of the S&P 500 index vs time, T, from 1950 through September of 2016 is shown in the following graphs.
Each plot contains 16608 data points, one for each business day, shown in red. The graph on the left plots V and the graph on the right plots the natural logarithm of V, ln(V). There are considerable up-and-down fluctuations over time that can be related to historical events: the oil crisis of the 1970s, the tech boom and bust of 2000, the subprime mortgage crisis of 2008. Still, the long-term trend of the value is upwards - the current value is over 100 times greater than its value in 1950. This is basically why people invest in the stock market, because on average, over the long run, stock values usually go up. The most common way to model this overall long-term increase over time is based on the equation for compound interest that predicts the growth of investments that have a constant rate of return, such as savings accounts or certificates of deposit:
V = S*(1 + R)T
where V is the value, S is the starting value, R is the annual rate of return, and T is time. By itself, this expression would yield a smooth curve, without all the peaks and dips. The values of S and R that result in the best fit to the stock market data (shown by the blue lines in the graphs) can be determined in two ways:
(1) directly, using the iterative curve fitting method, shown on the left above, or
(2) by taking the logarithm of the values and fitting those to a straight line, shown on the right above.
FitSandP.m is a
Matlab/Octave script that performs both of these calculations
using the data in SandPfrom1950.mat.
When applied to the S&P 500 index data, the rate of return R is about 0.07 (or 7%), but
interestingly these two methods give slightly different results, even
though the exact same data
are used for both, and even though both methods yield the same 7% rate if applied to noiseless synthetic
data calculated from this
expression. This difference between methods is caused by the
irregularities in the stock data that deviate from a smooth line
- in other words, the noise - and it is exacerbated by the large range of the
value data V over time and by the fact that the average return
from 1950 to 1983 is slightly lower than that from 1983 to 2016.
From the point of view of curve fitting, the deviations from a smooth curve described by the compound interest expression is just noise. But from the point of view of the stock market investor, those deviations can be an opportunity and a warning. Naturally, most investors would like to know how the stock market will behave in the future, but that requires extrapolation beyond the range of the available data, which is always uncertain and dangerous. But still, it's most likely (but not certain) that the long term behavior of the market (say, over a period of 10 years or more) will be similar to the past - that is, growing exponentially at about the same rate as before but with unpredictable fluctuations similar to what has occurred in the past. We can take a closer look at those fluctuations by inspecting the residuals – that is, subtracting the fitted curve from the raw data, as shown in iSignal on the left. There are several notable features of this "noise". First, the deviations are roughly proportional to V and thus relatively equal when plotted on a log scale. Second, the noise has a distinctly low-frequency character; the periodogram (lower panel, in red) shows peaks at 33, 16, 8, and 4 years. There are also, notably, numerous instances over the years when there is a sharp dip followed by a slower recovery close to the previous value. And conversely, every peak is eventually followed by a dip. The conventional advice in investing is to "buy low" (on the dips) and "sell high" (on the peaks). But of course the problem is that you can not reliably determine in advance exactly where the peaks and dips will fall; you have only the past to guide you. Still, if the current market value is much higher than the long-term trend, it will likely fall, and if the market value is much lower than the long-term trend, it will likely rise, eventually. The only thing you can be sure of is that, in the long run, the market will rise. This is why saving for retirement by investing in the stock market, and starting as soon as possible, is so important: over a 30-year working life, the market is almost guaranteed to rise substantially. The most painless way to do this is with your employer's 401k or 403b automatic payroll withdrawal plan. You can not actually invest in the stock market as a whole, but you can invest in index mutual funds or exchange traded funds (ETFs), which are collections of stocks that are constructed to match or track the components of a market index. Such funds typically have very low management fees, an important factor in selecting an investment. Other mutual funds attempt to “beat the market” by carefully buying and selling stocks in an attempt to create a return that is greater than the overall market indexes; some are temporarily successful in doing that, but they charge higher management fees. Mutual finds and ETFs are much less risky investments than individual stocks.
Some companies periodically distribute payouts to investors called “dividends”. Those dividends are independent of the day-to-day variations in stock price, so even if the stock value drops temporarily, you still get the same dividend. For that reason it's important that you set your investment account to “automatically reinvest dividends”, so when the share price drops, the dividends are buying shares at the lower price. The S&P 500 index values used above, called price returns, did not include dividend reinvestment; the total returns with dividends reinvested (https://en.wikipedia.org/wiki/S%26P_500_Index#Versions) would have been substantially higher, closer to 11%. (With an average total annual return of 11%, and starting with an investment of $170 the first month - that's less than $6 a day - and increasing it 5% each year, you could accumulate over $600,000 over a 30 year working life, or $1,000,000 if you continued investing an additional 5 years, as shown by the spreadsheet graphic on the right). And that’s starting at just $6 per day, about the cost of a fancy coffee at Starbucks. Think about that the next time you see a line of young people waiting to order their daily coffee.
To illustrate how much influence stock market volatility fluctuation (“noise”) has on the market gains, the Matlab/Octave script SnPsimulation.m adds proportional noise to the compound interest calculation to mimic the S&P data, performs the two curve fitting methods described above, repeats the allocations over and over with independent samples of proportional noise, and then calculates the mean and the relative standard deviation (RSD) of the rates of return. A typical result is:
Measured Rate RSD
Coordinate transformation: 0.07112 8.9%
Iterative curve fitting: 0.07972 19.9%
can see, the two methods don't agree. In this example, the
return calculated by the iterative method is higher, but it could
just have easily been the other way. The fact is that the
standard deviations are fairly large, and the iterative method
always has a higher standard deviation, because it
weights the higher values more heavily, where deviations from
the line are higher, whereas the log transformation method
weights the data more evenly. Even with this uncertainty,
investing in a stock market index fund almost always performs
better in the long run than more predictable investments
such as saving accounts or CDs, which have much lower rates of
investing in the stock market, it's important to focus on the
long-term trends and not to be frightened by the short-term up
and down fluctuations. It's similar to the difference between weather
the large and dramatic short-term weather variations
tend to disguise the much smaller long term climate warming
that is slowly melting the icecaps and raising
the sea levels (whether it is caused by human activity or
by natural causes alone or by a combination of both).
a spreadsheet template that allows you to calculate the possible
returns on long-term investments in stock market mutual funds,
Note added in June 2020. You might be wondering what effect more recent events have had on the overall stock market performance, in particular the trade wars of 2019 and the Coronavirus pandemic of 2020. Extending the data plotted above to include the S&P results for the dates up to June 2020 has remarkably little effect, as seen in the log plot below.
The average return is still about 7%.
recent changes are more evident if you take a much closer look
at the period from 2016 to 2020, for which the return over that
short period was indeed greater, about 9.5%. These 2019 and the
2020 dips, although they were quite sharp and caused a lot of
anxiety at the time, recovered quickly and as a result had
little effect on the overall long-term performance. When stocks
drop, even for well-known and valid reasons, some investors buy
shares at the reduced prices, and when stocks rise, especially
when they hit all-time highs, some investors sell shares, to
"lock in their gains". This behavior acts as a natural brake on
the fluctuations of the market.
In the section
of Signals and Noise, I
said: “The quality of a signal is often expressed as the
signal-to-noise (S/N) ratio, which is the ratio of the true
signal amplitude ... to the standard deviation of the noise.”
That's a simple enough statement, but automating the measurement
of signal and the noise in real signals is not always
straightforward. Sometimes it's difficult to separate or
distinguish between the signal and the noise, because it depends
not only on the numerical nature of the data, but also on the
objectives of the measurement.
For a simple DC (direct current) signal, for example measuring a fluctuating voltage, the signal is just the average voltage value and the noise is its standard deviation. This is easily calculated in a spreadsheet or in Matlab/Octave:
But usually things are more complicated. For example, if the signal is a rectangular pulse (as in the figure on the left) with constant random noise, then the simple formulation above will not give accurate results. If the signal is stable enough that you can get two successive signal recordings m1 and m2 that are identical except for the noise, then you can simply subtract the signal out: the standard deviation of the noise is then given by sqrt((std(m1-m2)2)/2), where std is the standard deviation function (because random noise adds quadratically). But not every signal source is stable and repeatable enough for that to work perfectly. Alternatively, you can try to measure the average signal just over the top of the pulse and the noise only over the baseline interval before and/or after the pulse. That's not so hard to do by hand, but it's harder to automate with a computer, especially if the position or width of the pulse changes. It's basically the same for smooth peak shapes like the commonly-encountered Gaussian peak (as in the figure on the right). You can estimate the height of the peak by smoothing it and then taking the maximum of the smoothed peak as the signal: max(fastsmooth(y,10,3)), but the accuracy would degrade if you choose too high or two low a smooth width. And clearly all this depends on having a well-defined baseline in the data where there is only noise. It doesn't work if the noise varies with the amplitude of the peak.
In many cases, curve fitting can be helpful. For example, you could use peak fitting or a peak detector to locate multiple peaks and measure their peak heights and their S/N ratios on a peak-to-peak basis, by computing the noise as the standard deviation of difference between the raw data and the best-fit line over the top part of the peak. That's how iSignal measures S/N ratios of peaks. Also, iSignal has baseline correction capabilities that allow the peak to be measured relative to the nearby baseline.
Curve fitting also works for complex signals of indeterminate shape that can be approximated by a high-order polynomial or as the sum of a number of basic functions such as Gaussians, as in the example shown on the left. In this example, five Gaussians are used to fit the data to the point where the residuals are random and unstructured. The residuals (shown in blue below the graph) are then just the noise remaining in the signal, whose standard deviation is easily computed using the built-in standard deviation function in a spreadsheet ("STDEV") or in Matlab/Octave ("std"). In this example, the standard deviation of the residuals is 111 and the maximum signal is 40748, so the percent relative standard deviation of the noise is 0.27% and the S/N ratio is 367. (The positions, heights, and widths of the Gaussian components, usually the main results of the curve fitting, are not used in this case; curve fitting is used only to obtain a measure the noise via the residuals). The advantage of this approach over simply subtracting two successive measurements of the signal is that it adjusts for slight changes in the signal from measurement to measurement; the only assumption is that the signal is a smooth, low-frequency waveform that can be fit with a polynomial or a collection of basic peak shapes and that the noise is random and mostly high-frequency compared the the signal. But don't use too high a polynomial order' otherwise you are just "fitting the noise".
With periodic signal waveforms the situation is a bit more complicated. As an example, consider the audio recording of the spoken phrase "Testing, one, two, three" (click to download in .mat format or in WAV format) that was used previously. The Matlab/Octave script PeriodicSignalSNR.m loads the audio file into a variable “waveform”, then computes the average amplitude of the waveform (the “envelope”) by smoothing the absolute value of the waveform:
fastsmooth(abs(waveform), SmoothWidth, SmoothType);
The result is plotted on the left, where the waveform is in blue and the envelope is in red. The signal is easy to measure as the maximum or perhaps the average of the waveform, but the noise is not so evident. The human voice is not reproducible enough to get a second identical recording to subtract out the signal as above. Still, there will be often be gaps in the sound, during which the background noise will be dominant. In an audio (voice or music) recording, there will typically be such gaps at the beginning, then the recording process has already started but the sound has not yet begun, and possibly at other short periods when there are pauses in the sound. The idea is that, by monitoring the envelope of the sound and noting when it falls below some adjustable threshold value, we can automatically record the noise that occurs in those gaps, whenever they may occur in a recording. In PeriodicSignalSNR.m, this operation is done in lines 26-32, and the threshold is set in line 12. The threshold value has to be optimized for each recording. When the threshold value is set to 0.015 in the "Testing, one, two, three" recording, the resulting noise segments are located and are marked in red in the plot on the right. The program determines the average noise level in this recording simply by computing the standard deviation of those segments (line 46), then computes and prints out the peak-to-peak S/N ratio and the RMS (root mean square) S/N ratio.
PeakToPeak_SignalToNoiseRatio = 143.7296
RMS_SignalToNoiseRatio = 12.7966
The frequency distribution of the noise is also determined (lines 60-61) and shown in the figure on the left, using the PlotFrequencySpectrum function, or you could have used iSignal in the frequency spectrum mode (Shift-S). The spectrum of the noise shows a strong component very near 60 Hz, which is almost certainly due to power line pickup (the recording was made in the USA, where AC power is 60Hz); this suggests that better shielding and grounding of the electronics might help to clean up future recordings. The lack of strong components at 100 Hz and above suggests that the vocal sounds have been effectively suppressed at this threshold setting. The script can be applied to other sound recordings in WAV format simply by changing the file name and time axis in lines 8 and 9.
SegmentedSmooth.m, illustrated on the right, is a segmented variant of fastsmooth.m, which can be useful if the widths of the peaks or the noise level varies substantially across the signal. The syntax is that same as fastsmooth.m, except that the second input argument "smoothwidths" can be a vector: SmoothY = SegmentedSmooth (Y, smoothwidths,type,ends). The function divides Y into a number of equal-length regions defined by the length of the vector 'smoothwidths', then smooths each region with a smooth of type 'type' and width defined by the elements of vector 'smoothwidths'. In the simple example in the figure on the right, smoothwidths=[31 52 91], which divides up the signal into three regions and smooths the first region with smoothwidth 31, the second with 51, and the last with 91. Any number and sequence of smooth widths can be used. If the peak widths increase or decrease regularly across the signal, you can calculate the smoothwidths vector by giving only the number of segments ("NumSegments") , the first value, "startw", and the last value, "endw", like so:
SegmentedSmooth" for other examples examples. DemoSegmentedSmooth.m demonstrates the operation
with different signals consisting of noisy
variable-width peaks that get progressively wider, like
the figure on the right. FindpeaksSL.m is the same thing
for Lorentzian peaks.
SegmentedSmoothTemplate.xlsx is a segmented multiple-width data smoothing spreadsheet template, functionally similar to SegmentedSmooth.m. In this version there are 20 segments. SegmentedSmoothExample.xlsx is an example with data (graphic). A related spreadsheet GradientSmoothTemplate.xlsx (graphic) performs a linearly increasing (or decreasing) smooth width across the entire signal, given only the start and end values, automatically generating as many segments are are necessary. Of course, as is usual with spreadsheets, you'll have to modify these templates for your particular number of data points, usually by inserting rows somewhere in the middle and then drag-copying down from above the insert, plus you may have to change the x-axis range of the graph. (In contrast, the Matlab/Octave functions do that automatically).
Segmented Fourier filter. SegmentedFouFilter.m is a segmented version of FouFilter.m, which applies different center frequencies and widths to different segments of the signal. The syntax is the same as FouFilter.m except that the two input arguments 'centerFrequency' and 'filterWidth' must be vectors with the values of centerFrequency of filterWidth for each segment. The signal is divided equally into a number of segments determined by the length of centerFrequency and filterWidth, which must be equal. Type 'help SegmentedFouFilter' for help and examples.
findpeaksSG.m is a variant of the findpeaksG
function, with the same syntax, except that the
four peak detection parameters can be vectors, dividing up the signal into regions that are
optimized for peaks of different widths. Any number of segments can be
declared, based on the lengths of the SlopeThreshold
input argument. (Note: you need only enter vectors for
those parameters that you want to vary between segments;
to allow any of the other peak detection parameters to
remain unchanged across all segments, simply enter a
single scalar value for that parameter; only the
SlopeThreshold must be a vector). In the example shown
on the left, the
script TestPrecisionFindpeaksSG.m creates a noisy signal with
three peaks of widely different widths, measures the
peak positions, heights and widths of each peak using
findpeaksSG, and prints out the percent relative
standard deviations of parameters of the three peaks in
100 measurements with independent random noise. With
3-segment peak detection parameters, findpeaksSG
reliably detects and accurately measures all three
peaks. In contrast, findpeaksG, tuned to the middle peak
(using line 26 instead of line 25), measures the first
and last peaks poorly. You can also see that the
precision of peak parameter measurements gets
progressively better (smaller relative standard
deviation) the larger the peak widths, simply because
there are more data points in those peaks. (You can
change any of the variables in lines 10-18).
findpeaksSb.m is a segmented variant of the findpeaksb.m function. It has the same syntax as findpeaksSb, except that the arguments "SlopeThreshold", "AmpThreshold", "smoothwidth", "peakgroup", "window", "PeakShape", "extra", "NumTrials", "autozero", and "fixedparameters" can all be optionally scalars or vectors with one entry for each segment. This allows the function to handle widely varying signals with peaks of very different shapes and widths and backgrounds. In the example on the right, the Matlab/Octave script DemoFindPeaksSb.m creates a series of Gaussian peaks whose widths increase by a factor of 25 and that are superimposed in a curved baseline with random white noise that increases gradually across the signal. In this example, four segments are used, changing the peak detection and curve fitting values so that all the peaks are measured accurately.
= [.01 .005 .002 .001];
AmpThreshold = 0.7;
SmoothWidth = [5 15 30 35];
FitWidth = [10 12 15 20];
windowspan = [100 125 150 200];
peakshape = 1;
autozero = 3;
The script also computes the relative percent error of the measurement of peak position, height, width, and area for each peak.
measurepeaks.m is an automatic peak detector peaks of arbitrary shape. It is based on findpeaksSG, with which it shares the first 6 input arguments; the four peak detection arguments can be vectors to accommodate signals with peaks of widely varying widths. It returns a table containing the peak number, peak position, absolute peak height, peak-valley difference, perpendicular drop area, and tangent skim area of each peak it detects. If the last input argument ('plots') is set to 1, it will plot the entire signal with numbered peaks and will also plot the individual peaks with the peak maximum (red), valley points (magenta), and tangent lines (cyan), as shown on the right. The peak height and x-position is indicated by the red circle, the perpendicular drop area is the total area measured between the two magenta vertical lines down to zero, and the tangent skim area is the area between the cyan baseline and the blue peak (which compensates for a flat baseline). Type “help measurepeaks” and try the examples there or run testmeasurepeaks.m (graphic animation). The related functions autopeaks.m and autopeaksplot.m are similar, except that the four peak detection parameters can be omitted and the function will calculate estimated initial values.
The script HeightAndArea.m tests the accuracy of peak height and area measurement with signals that have multiple peaks with variable width, noise, background, and peak overlap. Generally, the values for absolute peak height and perpendicular drop area are best for peaks that have no background, even if they are slightly overlapped, whereas the values for peak-valley difference and for tangential skim area are better for isolated peaks on a straight or slightly curved background. Note: this function uses smoothing (specified by the SmoothWidth input argument) only for peak detection; it performs its measurements on the raw unsmoothed y data. If the raw data are noisy, it may be best to smooth the y data yourself before calling measurepeaks.m, using any smooth function of your choice.
Other segmented functions. The same segmentation code used in SegmentedSmooth.m (lines 53-65) can be applied to other functions simply by editing the first line in the first for/end loop (line 59) to refer to the function that you want to apply in a segmented fashion. For example, segmented peak sharpening can be useful when a signal has multiple peaks that vary in width, and segmented deconvolution can be useful when a signal has multiple peaks that vary in width or tailing vary substantially across the signal: SegExpDeconv(x,y,tc) deconvolutes y with a vector of exponential functions whose time constants are specified by the vector tc. SegExpDeconvPlot.m is the same except that it plots the original and deconvoluted signals and shows the divisions between the segments by vertical magenta lines.
measurements involve the use of an instrument that actually
measures something else and converts it to the desired
measure. Examples are simple weight scales (which actually
measure the compression of a spring), thermometers (which
actually measure thermal expansion), pH meters (which
actually measure a voltage), and devices for measuring
hemoglobin in blood or CO2 in air (which actually measure the
intensity of a light beam). These instruments are
single-purpose, designed to measure one quantity, and
automatically convert what they actually measure into the
the desired quantity and display it directly. But to insure
accuracy, such instruments must be calibrated,
that is, used to measure one or more calibration standards
of known accuracy, such as a standard weight or a sample
that is carefully prepared to a known temperature, pH, or
sugar content. Most are pre-calibrated at the factory for
the measurement of a specific substance in a specific type
Analytical calibration. General-purpose instrumental techniques that are used to measure the quantity of many different chemical components in unknown samples, such as the various kinds of spectroscopy, chromatography, and electrochemistry, or combination techniques like “GC-mass spec”, must also be calibrated, but because those instruments can be used to measure a wide range of compounds or elements, they must be calibrated by the user for each substance and for each type of sample. Usually this is accomplished by carefully preparing (or purchasing) one or more “standard samples” of known concentration, such as solution samples in a suitable solvent. Each standard is inserted or injected into the instrument, and the resulting instrument readings are plotted against the known concentrations of the standards, using least-squares calculations to compute the slope and intercept, as well as the standard deviation of the slope (sds) and intercept (sdi). Then the "unknowns" (that is, the samples whose concentrations are to be determined) are measured by the instrument and their signals are converted into concentrations with the aid of the calibration curve. If the calibration is linear, the sample concentration C of any unknown is given by (A - intercept) / slope, where A is the measured signal (height or area) of that unknown. The predicted standard deviation in the sample concentration is C*SQRT((sdi/(A-intercept))^2+(sds/slope)^2) by the rules for propagation of error. All these calculations can be done in a spreadsheet, such as CalibrationLinear.xls.
In some cases the thing measured can not be detected directly but must undergo a chemical reaction that makes it measurable; in that case the exact same reaction must be carried out on all the standard solutions and unknown sample solutions, as demonstrated in this animation (thanks to Cecilia Yu of Wellesley College).
Various calibration methods are used to compensate for problems such as random errors in standard preparation or instrument readings, interferences, drift, and non-linearity in the relationship between concentration and instrument reading. For example, the standard addition calibration technique can be used to compensate for multiplicative interferences. I have prepared a series of “fill-in-the-blanks” spreadsheet templates for various calibrations methods, with instructions, as well as a series of spreadsheet-based simulations of the error propagation in widely-used analytical calibration methods, including a step-by-step exercise.
Calibration and signal processing. Signal processing often intersects with calibration. For example, if you use smoothing or filtering to reduce noise, or differentiation to reduce the effect of background, or measure peak area to reduce the effect of peak broadening, or use modulation to reduce the effect of low-frequency drift, then you must use the exact same signal processing for both the standard samples and the unknowns, because the choice of signal processing technique can have a big impact on the magnitude and even on the units of the resulting processed signal (as for example in the derivative technique and in choosing between peak height and peak area).
PeakCalibrationCurve.m is an Matlab/Octave example of this. This script simulates the calibration of a flow injection system that produces signal peaks that are related to an underlying concentration or amplitude ('amp'). In this example, six known standards are measured sequentially, resulting in six separate peaks in the observed signal. (We assume that the detector signal is linearly proportional to the concentration at any instant). To simulate a more realistic measurement, the script adds four sources of "disturbance" to the observed signal:
a. noise - random white noise added to all the signal data points, controlled by the variable "Noise";
b. background - broad curved background of random amplitude, tilt, and curvature, controlled by "background";
c. broadening - exponential peak broadening that varies randomly from peak to peak, controlled by "broadening";
d. a final smoothing before the peaks are measured, controlled by "FinalSmooth".
script uses measurepeaks.m as
an internal function to determine the absolute peak height,
peak-valley difference, perpendicular drop area, and tangent
skim area. It plots separate calibration curve for each of
these measures in figure windows 2-5 against the true
underlying amplitudes (in the vector "amp"), fitting the
data to a straight line and computing the slope, intercept,
and R2. (If the detector response were non-linear, a
quadratic or cubic least-square would work better). The
slope and intercept of the best-fit line is different for
the different methods, but if the R2 is close to 1.000, a
successful measurement can be made. (If all the random
disturbances are set to zero in lines 33-36, the R2 values
will all be 1.000. Otherwise the measurements will not be
perfect and some methods will result in better measurements
- R2 closer to 1.000 - than others). Here is a typical
Peak Position PeakMax Peak-val. Perp drop Tan skim
1 101.56 1.7151 0.72679 55.827 11.336
2 202.08 2.1775 1.2555 66.521 21.425
3 300.7 2.9248 2.0999 58.455 29.792
4 400.2 3.5912 2.949 66.291 41.264
5 499.98 4.2366 3.7884 68.925 52.459
6 601.07 4.415 4.0797 75.255 61.762
R2 values: 0.9809 0.98615 0.7156 0.99824
In this case, the tangent skim method works best, giving a linear calibration curve (shown on the left) with the highest R2.
In this type of application, the peak heights and/or area measurements do not actually have to be accurate, but they must be precise. That's because the objective of an analytical method such as flow injection or chromatography is not to measure peak heights and areas, but rather to measure concentrations, which is why calibration curves are used. Figure 6 shows the correlation between the measured tangent skim areas and the actual true areas under the peaks in the signal shown above, right; the slope of this plot shows that the tangent skim areas are actually about 6% lower that the true areas, but that does not make a difference in this case because the standards and the unknown samples are measured the same way. In some other application, you may actually need to measure the peak heights and/or areas accurately, in which case curve fitting is generally the best way to go.
If the peaks partly overlap, the measured peak heights and areas may be effected. To reduce the problem, it may be possible to reduce the overlap by using peak sharpening methods, for example the derivative method, deconvolution or the power transform method, as demonstrated by the self-contained Matlab/Octave function PowerTransformCalibrationCurve.m.
Curve fitting the signal data. Ordinary in curve fitting, such as the classical least squares (CLS) method and in iterative nonlinear least-squares, the selection of a model shape is very important. But in quantitative analysis applications of curve fitting, where the peak height or area measured by curve fitting is used only to determine the concentration of the substance that created the peak by constructing a calibration curve, having the exact model shape is surprisingly uncritical. The Matlab/Octave script PeakShapeAnalyticalCurve.m shows that, for a single isolated peak whose shape is constant and independent of concentration, if the wrong model shape is used, the peak heights measured by curve fitting will be inaccurate, but that error will be exactly the same for the unknown samples and the known calibration standards, so the error will “cancel out” and the measured concentrations will still be accurate, provided you use the same inaccurate model for both the known standards and the unknown samples. In the example shown on the right, the peak shape of the actual peak is Gaussian (blue dots) but the model used to fit the data is Lorentzian (red line). That's an intentionally bad fit to the signal data; the R2 value for the fit to the signal data is only 0.962 (a poor fit by the standards of measurement science). The result of this is that the slope of the calibration curve (shown below on the left) is greater than expected; it should have been 10 (because that's the value of the “sensitivity” in line 18), but it's actually 10.867 in the figure on the left, but nevertheless the calibration curve is still linear and its R2 value is 1.000, meaning that the analysis should be accurate. (Note that curve fitting is actually applied twice in this type of application, once using iterative curve fitting to fit the signal data, and then again using polynomial curve fitting to fit the calibration data).
Despite all this, it's still better to use as accurate a model peak shape as possible for the signal data, because the percent fitting error of the signal fit can be used as a warning that something unexpected is wrong, such as the appearance of an interfering peak from a foreign substance.
out by computer software with non-integer numbers have a
natural limit to the precision with which they can be
represented; for example, the number 1/3 is represented as
0.3333333... with a large but finite number of “3”s, whereas
theoretically there are an infinite string of “3”s
in the decimal representation of 1/3. It's the same with
irrational numbers such as "pi" and the square root of 2;
they can never have a exact decimal representation. In
principle, these tiny errors could accumulate in a very
complex multiple-step calculation and could conceivably become
a significant source of error. In the vast majority of
applications to scientific computation, however, these
limits will be minuscule compared to the errors and random
noise that is already present in most real-world
measurements. But it is best to know what those numerical
limits are, under what circumstances they might occur, and
how to minimize them.
Multicomponent spectroscopy. Probably the most common calculation where numerical precision is an issue is in the matrix methods that are used in multicomponent spectroscopy. In the derivation of the Classical Least Squares (CLS) method, the matrix inverse is used to solve large systems of linear equations. The matrix inverse is a standard function in programming languages such as Matlab, Octave, Wolfram's Mathematica, and in spreadsheets. But if you use that function in Matlab, the function name (“inv”) is automatically flagged by the editor with the following warning:
“For solving a system of linear equations, the inverse of a matrix is primarily of theoretical value. Never use the inverse of a matrix to solve a linear system Ax=b with x=inv(A)*b, because it is slow and inaccurate.... Instead of multiplying by the inverse, use matrix right division (/) or matrix left division (\). That is: Replace inv(A)*b with A\b...[and]...replace b*inv(A) with b/A”
"Slow and inaccurate"?
OK, now I'm scared. But how serious a problem is this really
in actual applications? To answer that question, the
Matlab/Octave script RegressionNumericalPrecisionTest.m
applies the CLS method to a mixture of two very
closely-spaced noiseless overlapping
Gaussian peaks (blue and green lines in the figure on the
left) using three different mathematical formulations of the
least-squares calculation that give different results. The
difficulty of such a measurement depends on the ratio of the
peak separation to the peak half-width; small ratios mean
very highly overlapped peaks which are hard to measure
accurately. In this example the separation-to-width ratio is
0.0033, which is very small (i.e. difficult); this is
equivalent to trying to measure a mixture of two absorption
spectroscopy peaks that are 300 nm wide and separated
by only 1 nm, a
tiny difference that you
wouldn't even notice with the naked eye. The
results of this script shows that the matrix inverse ("inv")
method does indeed have an error
thousands of times larger than the method using matrix
division, but even the matrix division
error is still very small. Practically, the
difference between these methods is unlikely to be
significant when applied to real experimental data, because
even the tiniest bit of signal instability (like that
caused by small changes in the temperature of the sample or
random noise in the signal, which
you can simulate in line 15) produces a far greater error.
So basically that warning message is the voice of a
mathematician or computer programmer, not an experimental
Analog-to-digital resolution. Potentially more significant than the computer's numerical resolution is the resolution of the analog-to-digital converter (ADC) that is used to convert analog signals (e.g. voltage) to a number. The Matlab/Octave script RegressionADCbitsTest.m demonstrates this, with two slightly overlapping Gaussian bands with a large (50-fold) difference in peak height (blue and green lines in the figure on the right); in this case the separation to width ratio is 0.25, much larger (i.e. easier) than the previous example. For this example, the simulation shows that the relative percent errors of peak height measurement is 0.19% for the larger peak and 6.6% for the smaller peak. You can change the resolution of the simulated analog-to-digital converter in number of bits (line 9). The amplitude resolution of an analog-to-digital converter is 2 raised to the power of the number of bits. Common ADC resolutions are 10, 12, and 14 bits, corresponding to resolutions of one part in 1024, 4096 and 16384, respectively. Of course, the effective resolution for the smaller peak in this case is 50 times less, and you can't simply turn up the amplification on the smaller peak without overloading the ADC for the larger one.
Surprisingly, if most of the noise in the signal is this kind of digitization noise, it may actually help to add some additional random noise (specified in line 10 in this script), as was seen in appendix I.
Differentiation. Another application where you can see numerical precision noise is in differentiation, which involves the subtraction of very nearly equal adjacent numbers in a data series. The self-contained Matlab/Octave function DerivativeNumericalPrecisionDemo.m shows how the numerical precision limits of the computer effects the first through fourth derivatives of a Gaussian band that is very finely sampled (over 16,000 points in the half-width in this case) and that has no added noise. The plot on the left shows the four waveforms in Figure 1, left, and their frequency spectra in Figure 2, below. The numerical precision limits of the computer creates random noise at very high frequencies, which are emphasized by differentiation. In the frequency spectra below, the big low-frequency bump near a frequency of 10-2 is the signal and everything above that is numerical noise. The lower-order derivatives are seldom a problem, but by the time you reach the fourth derivative, those noise frequencies approach the strength of the signal frequencies, as you can see in the frequency spectrum of the fourth derivative in the lower right. But this noise is only a very high-frequency noise, so smoothing with as little as a 3-point sliding average smooth removes most of it (click to view).
Smoothing. Finally, there might potentially be a numerical problem with the fastsmooth algorithm, covered in the section on smoothing, because it is a recursive algorithm that uses the results of a previous step in the calculation to calculate the next step. The Matlab/Octave script FastsmoothNumericalPrecisionTest.m demonstrates the numerical precision of fastsmooth.m. Even for 4000-point Gaussian smooth applied to a 100,000 point signal, the numerical noise relative standard deviation is only .00027%, and most of that occurs in the edges of the signal (first 4000 and last 4000 points); the error over 90% of the signal is orders of magnitude less. This would be a negligible problem in most cases.
Appendix W: The Raspberry Pi (added March, 2017)
Signal processing does not necessarily require expensive computer systems. The Raspberry Pi is a remarkably tiny and inexpensive computer board that is about the size of a deck of playing cards and costs $38! Version 3 B+ in early 2018 has a 1.4GHz 64-bit quad-core ARMv8 CPU with 1GB RAM, 4 USB ports, 40 general-purpose input-output pins, HDMI port, 300 mbps Ethernet port, audio jack and composite video, video camera and display interfaces, micro SD card slot for mass storage, VideoCore IV 3D graphics core, 802.11ac Wireless LAN, and Bluetooth 4.2. You can get it with a bunch of installed software, including a version of the Linux operating system, a simple but effective graphical desktop modeled on Windows, a Web browser, the complete LibreOffice suite, Wolfram's Mathematica (screen shot), several programming languages, a bunch of games (including Minecraft), and various utilities. All of these are installed by default on the Raspberry Pi's operating system installer. (There is even a smaller and cheaper model called the Zero that costs just $5 for the card itself or $10 with memory card and power supply; it has less memory and smaller connectors that the other models, but because of its low cost and small size, this model is ideal in situations where it might be damaged or lost, as in rocket or balloon borne experiments).
All you need for a complete computer is a 5 volt, 2 amp power supply, a USB keyboard and mouse (which you can probably pick up at a junk shop), a TV/monitor with an HDMI input, and a mini SD card (8 to 16 Gbytes) for mass storage (you can buy this card with all the software already installed or a blank one to which you can download the software yourself). In fact, if you already have a Wi-Fi network and an Internet-connected computer, tablet, or smartphone, you don't need a separate monitor, keyboard and mouse: you can log onto the Raspberry Pi via your Wi-Fi network or over the Internet, using Putty (for command-line UNIX-style access) or a graphical desktop sharing system such as RealVNC (free for Windows, Mac, IOS, and Android), which reproduces the entire graphical desktop on your local device, complete with a pop-up virtual keyboard. It can also share files with Windows. The Pi has been used as a low-cost alternative for school computer labs, using its included software for both Office-type applications (Writer word processor, Calc spreadsheet, etc), and for programming instruction (Python, C, C++, Java, Scratch, and Ruby). It's also ideal for “headless” applications where it is only accessed remotely via WiFi or Bluetooth, for example as a network file server, weather station, media center or as a networked security camera.
For scientific data acquisition and signal processing applications, the Pi version of Linux has all the "usual" UNIX terminal commands for data gathering, searching, cleaning and summarizing. In addition, there are many add-on libraries for Python, including SciPi, NumPy, and Matplotlib, all of which are free downloads. Allen B. Downey's 164-page PDF book "Think DSP" has many examples of Python code in traditional engineering applications. There are many add-on hardware devices available at low cost, including video cameras and a piggyback sensor board that reads and displays sensor data from several built-in sensors: gyroscope, accelerometer, magnetometer, barometer, temperature, relative humidity. (It's based on the same hardware that is currently in orbit on the International Space Station).
My signal processing spreadsheets run just fine on the version of Calc that comes with the Pi, as do the Calibration worksheets and my analytical method models (screen shot).
For school applications, Element14 markets a Learn to Program Pack Starter Kit ($177) that includes a license for student version of Matlab for Windows or Macintosh and a Raspberry Pi 3 with MicroSD card, power supply, and enclosure (Matlab does not currently run directly on the Pi but can communicate with it). Even cheaper, Octave 3.6 can run directly on a Raspberry Pi; the screen below shows Octave 3.6 running within the Pi's built-in graphical user interface (showing off the 3D graphic functions "mesh" and "surf").
There are many laboratory and field applications, especially in combination with an Arduino micro-controller. However, the slowness of Octave (compared to Matlab), combined with the modest speed of the Raspberry Pi 3 may be limiting in some applications. (Altogether it's about 50 - 1000 times slower than Matlab on a contemporary desktop computer). But it's still fast enough for many applications. Alternatively, it's possible to communicate with Raspberry Pi hardware remotely from a faster computer running MATLAB using the MATLAB Support Package for Raspberry Pi Hardware for Matlab R2016b or later, using one or more remotely accessed Raspberry Pi's for experiment control and data acquisition and local storage and doing the heavy-duty number crunching on the main computer. Or you could simply have the Pi save data or results in a shared folder that is accessed via WiFi from another computer.
Python is the primary programming language that comes with the Raspberry Pi. (This language is quite different from the older languages traditionally used by scientists, such as Fortran or Pascal, and it can be confusing for people without a computer-science background, like me). Here's a simple example of real-time data acquisition and plotting on a Raspberry Pi, using the commercially available add-on Sense Hat board with a program written in Python, measuring temperature as a function of time (real-time animation). If you don't have a Sense Hat, here's a version of the same Python program that plots the running average of a random number between 0 and 200 using the same autoscaling graphic technique, displaying a result that gradually settles down closer and closer to the average of 100 the longer you let it run. (This one will also run in WinPyton on a plain Windows PC). Compare this to the Matlab/Octave script that does the same thing. The execution time of the WinPython and Matlab versions is essentially the same, but the Matlab/Octave script is substantially shorter. These two programs constitute a nice comparison of Matlab vs Python, illustrating the advantages and disadvantages of each.
Other competing small-scale systems include the, a tiny $130 Windows-10 computer board with 2 Gbytes RAM and 32 Gbytes flash storage. . See also this review of single-board computers.
Batch processing (added June, 2017)
In situations where you have a large volume of similar types of data to process, it's useful to automate the process. Let's assume that you have already acquired data in the form of a large number of text or numerical data files of some standardized format that are stored in a known directory (folder) somewhere on your computer. For example they might be ASCII .txt or .csv files with the independent variable ('x') in the first column and one or more dependent variables ('y') in the other columns. There may be a variable number of data files, and their file names and length may be unknown and variable, but the data format is consistent from file to file. You could write a Matlab script or function that will process those files one-by-one, but you want the computer to go through all the data files in that directory automatically, determine their file names, load each into the variable workspace, apply the desired processing operations (peak detection, deconvolution, curve fitting, whatever), and then collect all the resulting terminal window output, each labeled with the file name, add the results to a growing "diary" file, and then go on to the next data file. Ideally, the program should not stop if it encounters any kind of fatal error with one of the data files; rather, it should just skip that one and go on to the next.
BatchProcess.m is a
Matlab/Octave example of just such an automated process that you
can use as a framework for your applications. You need only
change three things to make this work for you:
(a) the directory name where the data are stored on your computer - (DataDirectory) in line 11;
(b) the directory name where the Matlab signal processing functions are stored on your computer
- (FunctionsDirectory) in line 12; and
(c) the actual processing functions that you wish to apply to each file (which in this example perform
peak fitting using the “peakfit.m” function in lines 34 – 41, but could be anything).
When it starts, the routine opens a “diary” file in line 21, located in the FunctionsDirectory, with the file name “BatchProcess<date>.txt” (where <date> is the current date, e.g. 12-Jun-2017). This file captures all the terminal window output during processing - in this example, I am using the peakfit.m function that generates a FitResults matrix (with Peak#, Position, Height, Width, and Area of the best-fit model), and the percent fitting error and R2 value ,for each data file in that directory. (Subsequent runs of the program on the same date are appended to this dairy file. On each subsequent day, a new file is begun for that day). You can also optionally save some of the variables in the workspace to data files; add a “save” function after the processing and before the “catch me” statement (type “help save” at the command prompt for options).
This program uses a couple of coding techniques
that are especially useful in automated file processing. It uses
the “function forms” of the commands “ls” (line 13), “diary”
(line 21), and “load” (line 29) to allow then to accept
variables computed within the program. It also uses the “try/catch/end” structure (lines 28, 47,
49), which prevents the program from stopping if it encounters
an error on one of the data files; rather, it reports the error
for that file and skips to the next one. Here's the general form
of that structure:
% The statements between 'TRY' and 'CATCH ME' will be attempted
% and skipped if they generate any sort of error.
% If an error is generated, the statements between 'CATCH ME'
% and 'END' will be executed instead.
After running this script, the
“BatchProcess...” diary file will contain all the terminal
output. Here's an except from a typical diary file, in which the
first two data files in the directory yielded errors that
were caught and skipped by the “try/catch/end”
structure , but the third one ("2016-08-05-RSCT-2144.txt") and all the following ones
worked normally and reported the results of the peak fitting
Error with file number 1.
Error with file number 2.
Peak# Position Height Width Area
1 6594.2 0.1711 0.74403 0.13551
2 6595.1 0.16178 0.60463 0.1041
% fitting error R2
Peak# Position Height Width Area
1 6594.7 0.11078 1.4432 0.17017
2 6595.6 0.04243 0.38252 0.01727
% fitting error R2
Peak# Position Height Width Area
2 6594 0.05366 0.5515 0.0315
1 6594.9 0.1068 1.2622 0.1435
% fitting error R2
6: .... etc....
You could also optionally import the dairy file
into Excel by opening an Excel worksheet, click on a cell, click
Data > From Text, select the diary file, click to specify that spaces
are to be used as column separators, and click Import. This will put all
the collected terminal output into that spreadsheet.
Additionally you might want to save the workspace variables
(e.g. as a .mat file).
(a) by using mouse-clicks to generate each data point, using Matlab's "ginput" function, orThe first method is illustrated by the simple script realtime.m. When you run this script, it displays a graphical coordinate system; position your mouse pointer along the y (vertical) axis and click to enter data points as you move the mouse pointer up and down. The "ginput" function waits for each click of the mouse button, then the program records the y coordinate position and counts the number of clicks. Data points are assigned to the vector y (line 17), plotted on the graph as black points (line 18), and print out in the command window (line 19). The script realtimeplotautoscale.m is an expanded version that changes the graph scale as the data come in. If the number of data points exceeds 20 ('maxdisplay'), the x axis maximum is re-scaled to twice that (line 32). If the data amplitude equals or exceeds ('maxy'), the y axis is re-scaled to 1.1 times the data amplitude (line 36).
(b) by pre-calculating simulated data and then accessing it point-by-point in a loop.
(1) Define the range of values, either directly or using the indirect function, e.g. INDIRECT("B"&F2&":B"&F3)Each step references the results of the one before it. These steps are illustrated in the same “ ” spreadsheet in column H, rows 20-23. The result is that the maximum y (21.5) occurs at x=28. These steps can even be combined into one long formula (cell H23), although this is harder to read, and harder to document, than the formulas for the separate steps. The peak finder spreadsheet uses this technique.
(2) Determine the maximum y value in that range with the MAX function, e.g.: MAX(INDIRECT("B"&F2&":B"&F3))
(3) Determine the row number in which that number appears with the MATCH function, e.g.: MATCH(H20,B7:B1000,0)+ROW(A6)
(4) Determine the value of x in that row with the INDIRECT function, e.g.: INDIRECT("A"&H21)
is a theorem that describes large collections of numbers or observations that are subject to independent and identically distributed random variation, such as the result of performing the same experiment a large number of times. The average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed. It is an important idea because it guarantees stable long-term results for the averages of some random events. This is why gambling casinos are able to make money; their games are designed to give the casino a small advantage in the long run but highly variable results in the short term, guaranteeing plenty of (noisy) winners, which encourages the gamblers, but even a greater number of (quiet) losers. And that’s why investors in the stock market often make money in the long run, despite the unpredictable day-to-day variation, up one day and down the next, and why it’s so hard to see climate change in the much wilder short-term hot and cold day-to-day and year-to-year swings in the weather. Short term is closer and easier; long term is harder to see, but it's where it's going eventually.
The uncertainty of uncertainty. The situation is even worse if you wish to estimate the standard deviation of a population from small samples. The Matlab script simulates this for the same population in the
There is a
well-documented tendency for people to overestimate the
quality of small numbers of observations, sometimes referred
to as , or , or the y. This is
related to the field of study of a famous pair of
psychologists named Amos Tversky and Daniel Kahneman, who
collaborated in a long-running study of human cognitive biases
in the 1970s. They formulated a hypothesis that people tend to
believe in a false “”, the
name they coined for the mistaken belief that a small sample
drawn from a large population is representative of that large
population. We’d like to believe that scientists are immune to
these foibles and that they always think logically and
correctly. But scientists are only human, so it's important to
be aware of this tendency, particularly when a small sample of
data supports your favorite hypothesis. It’s tempting to stop
there, “while you’re ahead”. This is called "confirmation
bias". Don't do it.
Of course in many
practical experimental measurements, you may really be
constrained to a rather small number of repeated measurements.
There may be a fixed number of data points and no possibility
of gathering more. Or the cost, in money or in time, of
gathering more data may be excessive, even in a laboratory
environment. For example, the process of calibrating
an analytical instrument for quantitative measurement
may involve the preparation and measurement of several
standard samples or solutions of known composition. If the
calibration curve (the relationship between instrument reading
and sample composition) is non-linear, it takes several
different standards to define the curve. You have to consider
not only cost of preparing many standards but also the cost of
cleaning up and safely storing or disposing of the
(potentially hazardous) chemicals afterwards. The bottom line
is, if you are limited to a small number of data points, do
not over-represent the precision of your results. To use the 3-sigma
rule to determine uncertainty ranges for a set of data,
the distribution must be normal (Gaussian) and you need to
know the standard deviation. The problem is that, for small
sets of data, both are uncertain.
AD. Spectroscopy and chromatography combined: time-resolved Classical Least Squares
The introduction of high-speed UV-Visible array detectors into high performance liquid chromatography (HPLC) instruments has significantly increased the power of that method. The speed of such detectors is such that they can acquire a complete spectrum multiple times per second over the entire chromatogram. An example of this is described in a technical report from Shimadazu Scientific Instruments (https://solutions.shimadzu.co.jp/an/n/en/hplc/jpl217011.pdf) which considers the separation of three positional isomers of methyl acetophenone: o-methyl (o-MAP), m-methyl a(m-MAP), and p-methyl (p-MAP). The ultraviolet absorption spectra of these three isomers at a concentration of 400 μg/mL each is shown below on the left, and the chromatographic separation, using the column and conditions specified in their report, are shown in the middle. The report goes on to describe their commercial software, which uses a complex iterative approach to extract the spectra and the chromatographic characteristics from the raw data.
Here I present a comparatively
simple non-iterative technique based on the same chemical
system, in which we consider each spectrum acquired by the
detector as a separate sample mixture and apply the Classic Least Squares method previous discussed, in which
the spectra of the components are known beforehand and
where adherence to the Beer-Lambert Law
is assumed. The spectra and
chromatographic peaks are simulated digitally in the
Matlab/Octave script TimeResolvedCLS.m,
shown in the figure above,
by modeling the spectrum of each component as a the sum
of three Gaussian peaks and the chromatographic peaks as
exponentially modified Gaussians. To make this simulation as
realistic as possible, the parameters were adjusted to match the
graphics in the Shimadazu technical report closely,
and the spectral resolution, sampling rate, and detector noise,
were based on that report also. You can see that the
chromatographic peaks (middle figure) are nowhere near baseline
resolved. Therefore it is to be expected that quantitative calibration based on the
measurement of peak areas in this chromatogram (for
example by the perpendicular drop method), might be inaccurate,
especially if the peak heights are very different. In fact in
this particular case, where the concentrations of the three
components are equal (0.05 μg/mL for each component), the peak
areas measured by perpendicular drop are about 2% from the true
values, mainly due to the slight asymmetry and unequal height of
the three peaks. The spectra (left-hand
figure) are even more highly overlapped than the chromatographic
peaks, but they are distinct in shape, and that is the key.
Basically, we treat this as a series of
3-component CLS calculations, one for each time slice of the
detector. The actual
calculations can be done in two ways, depending on whether the
spectra are processed one by one or are collected for the entire
chromatogram and then processed all at once, using either
"Alternative calculation #1", lines 113-146, or "Alternative
calculation #2", lines 150-170. The first method, shown on the
left, looks like chromatography as it executes; it computes the
chromatographic peaks of the three components point by point as
they evolve in time and plots them in the first three quadrants
of figure window 3 (on the right). The second method calculates
the entire chromatogram in one step at the end and makes the
same final plots. (The
second method is faster computationally, but usually
it's the chromatography that's the rate determining step, not
the calculations). Either way, the result is the same; the
chromatographic peaks of
the three components are completely separated
mathematically, so their areas are easily measured, no matter how much they overlap. Note that,
although the three spectra must be known, no knowledge of the
chromatography peaks is required; they emerge separate and
intact from the data, purely computationally.
In order to test the abilities and limitations of
this method, I have prepared a series of increasingly
challenging scenarios, starting with the one pictured above and
becoming progressively more difficult by making the
chromatographic peaks more closely spaced, making the peak more
asymmetrical, making the spectra more similar, and making the
concentrations unequal. These scenarios are listed in the table below,
along with the typical percent errors in peak area measurement
by the CLS method and links to the corresponding graphics and
Matlab/Octave m-files. Each is a more challenging variation on the first one; #2
has much more chromatographic peak overlap; #3 has much more asymmetrical peaks (higher tau);
#4 has much more similar spectra - in fact,
the peak wavelengths differ by only 0.1 nm, making them look
identical; in #5, component 2 (the middle peak) has a
concentration 100 times lower; and #6 is the same as #5 except
that the peaks are highly asymmetrical. In all of these cases,
the normal perpendicular drop area measurement technique is
either impossible (because there are no distinct peaks for each
component) or are very much in error, but the CLS techniques
works well, giving very low errors except when the middle peak
concentration is 0.0001, which approaches the random noise limit
of the detector. (Another variation, TimeResolvedCLSbaseline.m,
includes the correction for baseline shifts.)
||% errors in area
|| .05 .05 .05
|| 0.0022% 0.002%
||Normal||Slight: tau=10|| .01 .01 .01
|3. Partly resolved
|| .05 .05 .05
|| -0.0004% -0.013%
||Graphic m file|
|4. Unresolved||Almost complete
||Slight: tau=10|| .01 .01 .01
||Graphic m file|
|5. Unresolved||Almost complete||Slight: tau=10
|| .01 .0001 .01
||Graphic m file|
|6. Unresolved||Almost complete||Great: tau=40||.01 .0001 .01|| -0.04%
||Graphic m file|
Even when the peaks are resolved well enough for the perpendicular drop method to work, it can suffer from interaction between adjacent peak heights; that is, a change in the peak height of one peak can affect the measurement of the area of adjacent overlapped peaks, because of shifts in the valley point between them. This is illustrated by TimeResolvedCLScalibration.m, which simulates the measurement of 10 different three-component mixtures similar to the above (but modified so perpendicular drop measurement is possible), where the concentrations vary independently and randomly over a 1 x 10-4 to 9.5 x 10-4 μg/mL range, and then plots measured peak area vs concentration for each component. (Each time you run this, you'll get a different mix of concentrations). Linear least-squares fits of peak area vs concentration are calculated, as shown below. In this typical example, the average absolute percentage error in area measurement for the perpendicular drop method is about 5%, with an R2 of 0.995, and for the CLS measurement is less than 1%, with an R2 of 0.9995. (Even if the detector noise (line 22) is set to zero in this simulation, the errors in the perpendicular drop method remain, because they are caused by overlap between adjacent peaks, rather than by noise).
Though clearly the CLS method is
very effective, all of this really only proves that
the mathematics works well; the method still has the
serious limitation that it requires that the spectra of all the
components be known accurately. This requirement can be met in
some applications, but in liquid chromatography there is a
potential pitfall. If gradient elution and/or temperature
programming are used, and if the spectra of those
chemical compounds are sensitive to the solvent and/or to
temperature, for example shifting their peaks slightly, then
there will likely be additional errors in the CLS procedure.
Obviously this depends on the particular chemical system and
will have to be evaluated on a case-by-case basis. But this
suggests another use for this method: speeding up a
chromatographic method that normally would achieve complete
baseline separation (from which accurate spectra of each
component could be obtained in situ at the right solvent
composition and temperature, and then adjusting the column
and/or flow rate to achieve a faster but incompletely resolved
chromatogram, to which the CLS method could be applied
In other applications, some or all of the
components may simply be unknown and you may want to obtain
their spectra. This would be easy if the chromatography were
able to separate each component cleanly, but what if the peaks
are overlapped so that pure component spectra are never
achieved? In that case are more sophisticated method must be
used, such as the one described in the Shimadzu technical
report. This involves making initial estimates of
spectral and chromatographic peaks, followed by an iterative
search for the best fit to the experimental data, subject
to the imposition of some important constraints based on
physical reality, such as non-negativity of spectra and
chromatography peaks (the peaks are always positive, except for
random noise on the baseline), and the unimodality of
the chromatography peaks, (that is, each
component gives one and only one chromatography peak). Methods
of this type will be left to a future expansion of this essay.
[FitResults,GOF]=peakfit([x y],0,0,1,3) for a single Logistic model (shape 3)
[FitResults,GOF]=peakfit([x y],0,0,4,39,1) for a 4 exponentially broadened Gaussian model (shape 39)
This page is part of "A
Pragmatic Introduction to Signal Processing", created
and maintained by Prof.
Tom O'Haver , Department of Chemistry and Biochemistry, The
University of Maryland at College Park. Comments, suggestions and
questions should be directed to Prof. O'Haver at email@example.com.
Updated November, 2019.
Unique visits to this site since June 22, 2008: