[Introduction] [Signal arithmetic] [Signals and noise] [Smoothing] [Differentiation] [Peak Sharpening] [Harmonic analysis] [Fourier convolution] [Fourier deconvolution] [Fourier filter] [Wavelets] [Peak area measurement] [Linear Least Squares] [Multicomponent Spectroscopy] [Iterative Curve Fitting] [Hyperlinear quantitative absorption spectrophotometry] [Appendix and Case Studies] [Peak Finding and Measurement] [iPeak] [iSignal] [Peak Fitters] [iFilter] [iPower] [List of downloadable software] [Interactive tools]
Here are two examples that show that the answer to this question is yes. The first example is shown on the left. This shows iSignal displaying a computer-generated 4000-point signal consisting of pure random noise, smoothed with a 19-point Gaussian smooth. The upper window shows a tiny slice of this signal that looks like a Gaussian peak with a calculated SNR over 1000. Only by looking at the entire signal (bottom window) do you see the true picture; that “peak” is just part of the noise, smoothed to look nice. Don't fool yourself.
The second example is a simple series of three Matlab commands that uses the 'randn' function to generate a 10000-point data set containing only normally-distributed white noise. Then it uses 'fastmooth' to smooth that noise, resulting in a 'signal' with a standard deviation of about 0.3 and a maximum value around 1.0. That signal is then submitted to iPeak. If the peak detection criteria (e.g. AmpThreshold and SmoothWidth) are set too low, many peaks will be found. But setting the AmpThreshold to 3 times the standard deviation (3 x 0.3 = 0.9) will greatly reduce the incidence of these false peaks.
>> noise=randn(1,10000);
>> signal=fastsmooth(noise,13);
>> ipeak([1:10000;signal],0,0.6,1e-006,17,17)
The peak identification function, which identifies peaks based
on their exact x-axis peak position and a stored table of
identified peak positions, is even less likely
to be fooled by random noise, because in addition to the
peak detection criteria of the findpeaks algorithm, any
detected peak must also match closely to a peak position in
the table of known peaks.
This experimental signal in this case study was unusual because it did not look like a typical signal when plotted; in fact, it looked a lot like noise at first glance. The figure below compares the raw signal (bottom) with the same number of points of normally-distributed white noise (top) with a mean of zero and a standard deviation of 1.0 (obtained from the Matlab/Octave 'randn' function).
As you can
see, the main difference is that the signal has more large
'spikes', especially in the positive direction. This
difference is evident when you look at the descriptive
statistics of the signal and the randn function:
DESCRIPTIVE STATISTICS |
Raw signal |
random noise (randn function) |
Mean |
0.4 |
0 |
Maximum |
38 |
about 5 - 6 |
Standard Deviation (STD) |
1.05 |
1.0 |
Inter-Quartile Range (IQR) |
1.04 |
1.3489 |
Kurtosis |
38 |
3 |
Skewness |
1.64 |
0 |
You can
see that the standard deviations of these two
are nearly the same, but the other statistics
(especially the kurtosis
and
skewness) indicate that the probability distribution of
the signal is far from normal; there are far more
positive spikes in the signal than expected for pure noise.
Most of these turned out to be the peaks of interest for this
signal; they look like spikes only because the length of the signal (over
1,000,000 points) causes
the peaks to be compressed into one screen pixel or less when
the entire signal is plotted on the screen. In the figures on
the left, iSignal is used to "zoom in"
on some of the larger of these peaks (using the cursor arrow
keys). The peaks are very sparsely separated (by an average of
1000 half-widths between peaks) and are well above the level
of background noise (which has a standard deviation of roughly
0.9 throughout the signal).
The researcher
who
obtained this signal said that a 'good' peak was 'bell
shaped', with an amplitude above 5 and a width of 500-1000
x-axis units. So that means that we can expect the signal-to-background-noise ratio to be at
least 5/0.9 = 5.5. You can see in the three example peaks on
the left that the peak widths do indeed meet those
expectations. The interval between adjacent x-axis points is
25, so that means the we can expect the peaks to have about 20
to 40 points in their widths. Based on that, we can expect
that the positions, heights and widths of the peaks should be
able to be be measured fairly accurately using least-squares
methods (which reduce the uncertainty of measured parameters
by about the square root of the number of points used - about
a factor of 5 in this case). However, the noise appears
to be signal-dependent; the noise on the top of the
peaks is distinctly greater than the noise on the baseline.
The result is that the actual signal-to-noise (S/N) ratio of
peak parameter measurement for the larger peaks will not be as
good as might be expected based on the ratio of the peak
height to the noise on the background. Most likely, the
total noise in this signal is the sum of two major components,
one with a fixed standard deviation of 0.9 and the other
roughly equal to 10% of the peak height.
To automate the detection of large numbers of peaks, we can
use the findpeaksG
or iPeak
functions. Reasonable values of the input arguments AmplitudeThreshold, SlopeThreshold, SmoothWidth, and FitWidth for those functions
can be estimated based on the expected peak height (5) and
width (20 to 40 data points) of the "good" peaks. For example,
using AmplitudeThreshold=5, SlopeThreshold=.001, SmoothWidth=25, and FitWidth=25, these function
detect and measure 76 peaks above an amplitude of 5 and with
an average peak width of 523. The interactive iPeak
function is especially convenient for exploring the effect of
these peak detection parameters and for graphically inspecting
the peaks that it finds. Ideally the objective is to find a
set of peak detection arguments that detect and accurately
measure all the peaks that you would consider 'good' and skip
all the 'bad' ones. But in reality the criteria for good and
bad peaks is at least partly subjective, so it's usually best
to err on the side of caution and avoid skipping 'good' peaks
at the risk of including a few 'bad' peaks in the mix, which
can be weeded out manually based on unusual position, height,
width, or appearance.
Of course,
it must be expected that the values of the peak position,
height, and width given by the findpeaksG
or iPeak
functions will only be approximate and will vary depending on
the exact setting of the peak detection arguments; the noisier
the data, the greater the uncertainty in the peak parameters.
In this regard the peak-fitting functions peakfit.m and ipf.m
usually give more accurate results, because they make use of all the data across the peak,
not just the top of the peak as do findpeaksG and
iPeak. For example, compare the results of the peak near
x=3035200 measured with iPeak (click to view) and with peakfit (click to view). Also, the peak
fitting functions are better for dealing with overlapping
peaks and for estimating the uncertainty of the measured
peak parameters, using the bootstrap options of
those functions. For example, the largest peak in this signal
has an x-axis position of 2.8683e+007, height of 32, and width
of 500; the bootstrap method determines that the
standard deviations are 4, 0.92, and 9.3,
respectively.
Because the signal in the case study was so large (over
1,000,000 points), the interactive programs such as iPeak, iSignal, and ipf
may be sluggish in operation, especially if your computer is
not fast computationally or graphically. If this is a serious
problem, it may be best to break the signal up into two or
more segments and deal with each segment separately, then
combine the results. Alternatively, you can use the condense function to average the
entire signal into a smaller number of points by a factor of 2
or 3 (at the risk of slightly reducing peak heights and
increasing peak widths), but then you should reduce SmoothWidth and FitWidth by the same factor to
compensate for the reduced number of data points across the
peaks. Run testcondense.m
for a demonstration of the condense function.
The experimental signal in this case study had a number of narrow spikes above a seemingly flat baseline.
Using iSignal to
investigate the signal, it was found that the visible
positive spikes were single points of very large amplitude (up to
10^{6}), whereas the regions between the spikes were not really flat but contained
bell-shaped peaks that were so much smaller (below 10^{3}) that they were not even visible on this scale. For example, using iSignal to zoom in to the region
around x=26300, you can see one of those bell-shapes peaks
with a small single-point negative-going spike artifact near
its peak.
Very narrow spikes like this are common artifacts in some experimental signals; they are easy to eliminate by using a median filter. The iSignal function has such a filter, activated by the “M” key. The result shows that the single-point spike artifacts have been eliminated, with little effect on the character of the bell-shaped peak.
Other filter types, like most forms of smoothing, would be far less
effective than a median filter for this type of artifact
and would distort the peaks.
The negative spikes in this signal turned out to
be steep steps, which can either be reduced
by using iSignal's slew-rate
limit function (the ` key) or manually eliminated by
using the semicolon key
(;) to set the
selected region between the dotted red cursor lines to
zero. Using the latter approach, the entire cleaned-up
signal is shown below. The remaining peaks are all
positive, bell shaped and have amplitudes from about 6
to about 750.
iPeak can automate the measurement of peak positions and heights for the entire signal.
If required, individual peaks can be measured more accurately by fitting the whole peak with iPeak's “N” key or with peakfit.m or ipf.m. The peaks are all slightly asymmetrical; they fit an exponentially-broadened Gaussian model to a fitting error less than about 0.5%, as shown on the left. The smooth residual plots suggests that the signal was smoothed before the spikes were introduced.
Note that fitting with an
exponentially-broadened Gaussian model gives the peak
parameters of the Gaussian before broadening. iSignal and iPeak estimate
the peak parameters of the broadened peak. As before, the
effect of the broadening is to shift the peak position to
larger values, reduce the peak height, and increase the
peak width.
Position
Height Width Area error S/N
Ratio
isignal
16871
788.88 32.881 27612
172
ipeak
16871
785.34 33.525 28029
peakfit(G)
16871 777.9 33.488 27729
1.68%
peakfit(E) 16863 973.72 27.312
28308 0.47%
G = Gaussian model; E = Exponentially-broadened Gaussian model
This case study demonstrates the application of
several techniques described in this paper to the quantitative
measurement of a peak that is buried in an unstable background,
a situation that can occur in the quantitative analysis
applications of various forms of spectroscopy, process
monitoring, and remote sensing. The objective is to derive a
measure of peak amplitude that varies linearly with the actual
peak amplitude but that is not effected by the changes in the
background and the random noise. In this example, the peak to be
measured is located at a fixed location in the center of the
recorded signal, at x=100 and has a fixed shape (Gaussian) and
width (30). The background, on the other hand, is highly
variable, both in amplitude and in shape. The simulation shows
six superimposed recordings of the signal with six different
peak amplitudes and with randomly varying background amplitudes
and shapes (top row left in the following figures). The methods
that are compared here include smoothing,
differentiation, classical least squares
multicomponent method (CLS), and iterative non-linear curve fitting.
CaseStudyC.m
is a self-contained Matlab/Octave demo function
that demonstrates this case. To run it, download it. place it in the path, and type
“CaseStudyC” at the command prompt. Each time you run it, you'll
get the same series of true peak amplitudes (set by the vector
“SignalAmplitudes”, in line 12) but a different set of
background shapes and amplitudes. The background is modeled as a
Gaussian peak of randomly varying amplitude, position, and
width; you can control the average amplitude
of the background by changing the variable
“BackgroundAmplitude” and the average change
in the background by changing the variable
“BackgroundChange”.
The five methods compared in the figures below are:
1: Top row center. A simple zero-to-peak measurement of the smoothed signal, which assumes that the background is zero.
2: Top row right. The difference between the peak signal and the average background on both sides of the peak (both smoothed), which assumes that the background is flat.
3: Bottom row left. A derivative-based method, which assumes that the background is very broad compared to the measured peak.
4: Bottom row center. Classical least squares (CLS), which assumes that the background is a peak of known shape, width, and position (the only unknown being the height).
5: Bottom row right. iterative non-linear curve fitting (INLS), which which assumes that the background is a peak of known shape but unknown width and position. This method can track changes in the background peak position and width (within limits), as long as the measured peak and the background shapes are independent of the concentration of the unknown.
These five methods are listed roughly in the order of increasing mathematical and geometrical complexity. They are compared below by plotting the actual peak heights (set by the vector “SignalAmplitudes”) vs the measure derived from that method, fitting the data to a straight line, and computing the coefficient of determination, R^{2.}, which ideally is 1.0000.
For the first test (shown in the figure above), both “BackgroundAmplitude” and “BackgroundChange” are set to zero, so that only the random noise is present. In that case all the methods work well, with R^{2.}values all very close to 0.9999. With a 10x higher noise level (click to view), all methods still work about equally well, but with a lower coefficient of determination R^{2}, as might be expected.
For the second test (shown in the figure immediately above), “BackgroundAmplitude”=1 and “BackgroundChange”=0, so the background has significant amplitude variation but a fixed shape, position, and width. In that case, the first two methods fail, but the derivative, CLS, and INLS methods work well.
For the third test, shown in the figure above, “BackgroundAmplitude”=1 and “BackgroundChange”=100, so the background varies in position, width, and amplitude (but remains broad compared to the signal). In that case, the CLS methods fails as well, because it assumes that the background varies only in amplitude. However, if we go one step further (click to view) and set “BackgroundChange”=1000, the background shape is now so unstable that even the INLS method fails, but still the derivative method remains effective as long as the background is broader than the measured peak, no matter what its shape. On the other hand, if the width and position of the measured peak changes from sample to sample, the derivative method will fail and the INLS method is more effective (click to view), as long as the fundamental shape of both measured peak and the background are both known (e.g. Gaussian, Lorentzian ,etc).
Not surprisingly, the more mathematically complex methods perform better, on average. Fortunately, software can "hide" that complexity, in the same way, for example, that a hand-held calculator hides the complexity of long division.
Ensemble
averaging is a powerful method of reducing the effect of
random noise in experimental signals, when it can be applied.
The idea is that the signal is repeated, preferably a large
number of times, and all the repeats are averaged. The signal
builds up, and the noise gradually averages towards zero, as the
number of repeats increases.
An important requirement is that the repeats be aligned
or synchronized, so that in the absence of random noise, the
repeated signals would line up exactly. There are two ways of
managing this:
(a)
the signal repeats are triggered by some external event and the
data acquisition can use that trigger to synchronize the
signals, or
(b) the signal itself has some feature that can be used to
detect each repeat, whenever it occurs.
The first method (a) has the advantage that the signal-to-noise (S/N) ratio can be arbitrarily low and the average signal will still gradually emerge from the noise if the number of repeats is large enough. However, not every experiment has a reliable external trigger.
The second method (b) can be used to average repeated
patterns in one continuous signal without an external trigger
that corresponds to each repeat, but the signal must then
contain some feature (for example, a peak) with a
signal-to-noise ratio large enough to detect reliably in each
repeat. This method can be used even when the signal patters
occur at random intervals, when the timing of the repetitions is
not of interest. The interactive peak detector iPeak
6 has a built-in
ensemble averaging function (Shift-E) can compute the average of all the repeating
waveforms. It works by detecting a single peak in each
repeat in order to synchronize the repeats.
The Matlab script iPeakEnsembleAverageDemo.m (on http://tinyurl.com/cey8rwh) demonstrates this idea, with a signal that contains a repeated underlying pattern of two overlapping Gaussian peaks, 12 points apart, with a 2:1 height ratio, both of width 12. These patterns occur a random intervals, and the noise level is about 10% of the average peak height. Using iPeak (above left), you adjust the peak detection controls to detect only one peak in each repeat pattern, zoom in to isolate any one of those repeat patterns, and press Shift-E. In this case there are about 60 repeats, so the expected signal-to-noise (S/N) ratio improvement is sqrt(60) = 7.7. You can save the averaged pattern (above right) into the Matlab workspace as “EA” by typing
>> load EnsembleAverage; EA=EnsembleAverage;
then curve-fit this averaged pattern to a 2-Gaussian model using the peakfit.m function (figure on the right):
>> peakfit([1:length(EA);EA],40,60,2,1,0,10)
Position
Height Width Area
32.54
13.255 12.003 169.36
44.72
6.7916 12.677 91.648
You'll see a big improvement in the accuracy of the peak separation, height ratio and width, compared to fitting a single pattern in the original x,y signal:
>> peakfit([x;y],16352,60,2,1,0,10)
The wav file “horngoby.wav” (Ctrl-click to open) is a 2-second recording of the sound of a passing automobile horn, exhibiting the familiar Doppler effect. The sampling rate is 22000 Hz. Download this file and place it in your Matlab path. You can then load this into the Matlab workspace as the variable “doppler” and display it using iSignal:
t=0:1/21920:2;
load horngoby.mat
isignal(t,doppler);
Within iSignal, you can switching to frequency spectrum mode (press Shift-S), and zoom in on different portions of the waveform, so you can observe the downward frequency shift and measure it quantitatively. (Actually, it's much easier to hear the frequency shift - press Shift-P to play the sound - than to see it graphically; the shift is rather small on a percentage basis, but human hearing is very sensitive to small pitch (frequency) changes). It helps to re-plot the data to stretch out the frequency region around the fundamental frequency or one of the harmonics. I used iSignal to zoom in on three slices of this waveform and then I plotted the frequency spectrum (Shift-S) near the beginning (plotted in blue), middle (green), and end (red) of the sound. The frequency region between 150 Hz and 550 Hz are plotted in the figure below:
The
group of peaks near 200 are the fundamental
frequency of the lowest note of the horn and the
group of peaks near 400 are its second
harmonic. (Pitched sounds usually have a harmonic
structure of 1, 2, 3... times a fundamental frequency). The group of peaks near 250
are the fundamental frequency of the next higher note of
the horn and the group of peaks near 500 are its second
harmonic. (Car and train horns often have two or three harmonious
notes sounded together). In each of these groups
of harmonics, you can clearly see that the blue peak
(the spectrum measured at the beginning of the sound)
has a higher frequency than
the red peak (the spectrum measured at the end of the sound).
The green peak, taken in the middle, has an intermediate
frequency. The peaks are ragged because
the amplitude and frequency varies over the sampling
interval, but you can still get good quantitative
measures of the frequency of each component by curve fitting to
a Gaussian peak model using peakfit.m or ipf.m:
Peak
Position Height
Width Area
Beginning
206.69 3.0191e+005 0.81866
2.4636e+005
Middle
202.65 1.5481e+005
2.911 4.797e+005
End
197.42
81906 1.3785
1.1994e+005
The estimated precision of
the peak position (i.e. frequency) measurements is about
0.2% relative, based on the bootstrap method,
good enough to allow accurate calculation of the
frequency shift (about 4.2%) and the
speed
of the vehicle
and to demonstrate that the ratio of the second
harmonic to the fundamental for these data is 2.0023, which is
very close to the theoretical value of 2.
It's
also possible to plot the evolving spectrum of such a signal as
a contour graph, using the PlotSegFreqSpect.m
function, as show in the graphic below. You can see the various
frequency components drifting steadily down in frequency as the
time passes. See HarmonicAnalysis.html#Software.
PSM=PlotSegFreqSpect(t,doppler,40,350,0);
% 40 segments, 240 harmonics
Spikes, narrow pulses with a width
of only one or a few points, are sometimes encountered
in signals as a result of an electronic “glitch” or
stray pickup from nearby equipment, and they can easily
be eliminated by the use of a “median” filter. But
it is possible that in some experiments the spikes themselves might be the
important part of the signal and that it is required to
count or measure them. This situation was recently
encountered in a research application, and it opens up
some interesting twists on the usual procedures.
As a demonstration, the Matlab/Octave script SpikeDemo1.m creates a waveform (top panel of figure below) in which a series of spikes are randomly distributed in time, contaminated by two types of noise: white noise and a large-amplitude oscillatory interference simulated by a swept-frequency sine wave. The objective is to count the spikes and locate their position on the x (time) axis. Direct application of findpeaks or iPeak to the raw signal does not work well.
A single-point spike, called a delta function in mathematics, has a power spectrum that is flat; that is, it has equal power at all frequencies, just like white noise. But the oscillatory interference in this case is located in a specific range of frequencies, which opens some interesting possibilities. One approach would be to use a Fourier filter, for example a notch or band-reject filter, to remove the troublesome oscillations selectively. But if the objective of the measurement is only to count the spikes and measure their times, a simpler approach would be to (1) compute the second derivative (which greatly amplifies the spikes relative to the oscillations), (2) smooth the result (to limit the white noise amplification caused by differentiation), then (3) take the absolute value (to yield positive-pointing peaks). This can be done in a single line of nested Matlab/Octave code:
y1=abs(fastsmooth((deriv2(y)).^2,3,2));
The result, shown the lower panel of the figure on the left above, is an almost complete extraction of the spikes, which can then be counted with findpeaksG.m or peakstats.m or iPeak.m:
P=ipeak([x;y1],0,0.1,2e-005,1,3,3,0.2,0);
The second example, SpikeDemo2.m, is similar except that in
this case the oscillatory interference is caused by two fixed-frequency
sine waves at a higher frequency, which completely
obscure the spikes in the raw signal (top panel of the
left figure below). In the power spectrum
(bottom panel, in red), the oscillatory interference
shows as two sharp peaks that dominate the spectrum and
reach to y=10^{6}, whereas the spikes show as
the much lower broad flat plateau at about y=10. In this
case, use can be made of an interesting property of
sliding-average smooths, such as the boxcar, triangular,
and Gaussian smooths;
their frequency responses exhibit a series of deep cusps at frequencies
that are inversely proportional to their filter widths.
So this opens up the possibility of suppressing specific
frequencies of oscillatory interference by adjusting the
filter widths until the cusps occur at or near the
frequency of the oscillations. Since the signal in this
cases are spikes that have a flat power spectrum, they
are simply smoothed by this operation, which will reduce
their heights and increase their widths, but will have
little or no effect on their number or x-axis positions.
In this case a 9 point pseudo-Gaussian smooth puts the
first (lowest frequency) cusp right in between the two oscillatory frequencies.
In the figure on the right, you can see the effect of applying this filter; the spikes, which were not even visible in the original signal, are now cleanly extracted (upper panel), and you can see in the power spectrum (right lower panel, in red) that the two sharp peaks of oscillatory interference is reduced by about a factor of about 1,000,000! This operation can be performed by a single command-line function, adjusting the smooth width (the second input argument, here a 9) by trial and error to minimize the oscillatory interference:
y1=fastsmooth(y,9,3);
(If the interference varies
substantially in frequency across the signal, you could
use a segmented
smooth rather than the standard fastsmooth).
Alternatively, the segmented
Fourier spectrum could be used to visualize this signal, and a
Fourier filter in “notch” mode
could be employed to specifically eliminate the interfering
frequencies.
The extracted peaks can then be counted with any of the peak finding functions, such as:
P=findpeaksG(x,y1,2e-005,0.01,2,5,3);
or
P=findpeaksplot(x,y1,2e-005,0.01,2,5,3);
or
PS=peakstats(x,y1,2e-005,0.01,2,5,3,1);
The simple script “iSignalDeltaTest” demonstrates the power spectrum of the smoothing and differentiation functions of iSignal by a applying them to a delta function. Change the smooth type, smooth width, and derivative order and other functions in order to see how the power spectrum changes.
Some experiments produce peaks that are distorted by being convoluted by processes that make peaks less distinct and modify peak position, height and width. Exponential broadening is one of the most common of these processes. Fourier deconvolution and iterative curve fitting are two methods that can help to measure the true underlying peak parameters. Those two methods are conceptually distinct, because in Fourier deconvolution, the underlying peak shape is unknown but the nature and width of the broadening function (e.g. exponential) is assumed to be known; whereas in iterative least-squares curve fitting it's just the reverse: the underlying peak shape (i.e. Gaussian, Lorentzian, etc) must be known, but the width of the broadening process is initially unknown.
In the script shown below and the resulting graphic shown on the right (Download this script), the underlying signal (uyy) is a set of four Gaussians with peak heights of 1.2, 1.1, 1, 0.9 located at x=10, 20, 30, 40 and peak widths of 3, 4, 5, 6, but in the observed signal (yy) these peaks are broadened exponentially by the exponential function cc, resulting in shifted, shorter, and wider peaks, and then a little constant white noise is added after the broadening. The deconvolution of cc from yy successfully removes the broadening (yydc), but at the expense of a substantial noise increase. However, the extra noise in the deconvoluted signal is high-frequency weighted ("blue") and so is easily reduced by smoothing and has less effect on least-square fits than does white noise. (For a greater challenge, try adding more noise in line 6 or use a bad estimate of time constant in line 10). To plot the recovered signal overlaid with underlying signal: plot(xx,uyy,xx,yydc). To plot the observed signal overlaid with the underlying signal: plot(xx,uyy,xx,yy). Excellent values for the original underlying peak positions, heights, and widths can be obtained by curve-fitting the recovered signal to four Gaussians: [FitResults,FitError]= peakfit([xx;yydc],26,42,4,1,0,10). With ten times the previous noise level (Noise=.01), the values of peak parameters determined by curve fitting are still quite good, and even with 100x more noise (Noise=.1) the peak parameters are more accurate than you might expect for that amount of noise (because that noise is blue). Visually, the noise is so great that the situation looks hopeless, but the curve-fitting actually works pretty well.
xx=5:.1:65;
% Underlying Gaussian peaks with unknown heights,
positions, and widths.
uyy=modelpeaks2(xx,[1 1 1
1],[1.2 1.1 1 .9],[10 20 30 40],[3 4 5 6],...
[0 0 0 0]);
% Observed signal yy, with noise added AFTER the broadening convolution
Noise=.001; % <---Try more noise to see how this method
handles it.
yy=modelpeaks2(xx,[5 5 5 5],[1.2 1.1 1 .9],[10 20 30
40],[3 4 5 6],...
[-40 -40 -40 -40])+Noise.*randn(size(xx));
% Compute transfer function, cc,
cc=exp(-(1:length(yy))./40); % <---Change exponential
time constant here
% Attempt to recover original signal uyy by deconvoluting
cc from yy
% It's necessary to zero-pad the observed signal
yy as shown here.
yydc=deconv([yy zeros(1,length(yy)-1)],cc).*sum(cc);
% Plot and label everything
subplot(2,2,1);plot(xx,uyy);title('Underlying signal,
uyy');
subplot(2,2,2);plot(xx,cc);title('Exponential transfer
function, cc')
subplot(2,2,3);plot(xx,yy);title('observed broadened,
noisy signal, yy');
subplot(2,2,4);plot(xx,yydc);title('Recovered signal,
yydc')
An alternative to the above deconvolution approach is to curve fit the observed signal directly with an exponentially broadened Gaussian (shape number 5): [FitResults,FitError]=peakfit([xx;yy],26,50,4,5,40,10). Both methods give good values of the peak parameters, but the deconvolution method is considerably faster, because curve fitting with a simple Gaussian model is faster than fitting with a exponentially broadened peak model, especially if the number of peaks is large. Also, if the exponential factor is not known, it can be determined by curve fitting one or two of the peaks in the observed signal, using ipf.m, adjusting the exponential factor interactively to get the best fit. Note that you have to give peakfit a reasonably good value for the time constant ('extra'), the input argument right after the peakshape number. If the value is too far off, the fit may fail completely, returning all zeros. A little trial and error suffice. Alternatively, you could try to use peakfit.m version 7 with the independently variable time constant exponentially-broadened Gaussian shape number 31 or 39, to measure the time constant as an iterated variable (to understand the difference, see example 39). If the time constant is expected to be the same for all peaks, better results will be obtained by using shape number 31 or 39 initially to measure the time constant of an isolated peak (preferably one with a good S/N ratio), then apply that fixed time constant in peak shape 5 to all the other groups of overlapping peaks.
Digitization noise, also called quantization
noise, is an artifact caused by the rounding or
truncation of numbers to a fixed number of figures. It
can originate in the analog-to-digital
converter that converts an analog signal to a
digital one, or in the circuitry or software involved in
transmitting the digital signal to a computer, or even
in the process of transferring the data from one program
to another, as in copying and pasting data to and from a
spreadsheet. The result is a series of non-random
steps of equal height. The frequency distribution is
white, because of the sharpness of the steps, as you can
see by observing the power spectrum.
The
figure on the left, top panel, shows the effect of integer
digitization on a sine wave with an amplitude of +/- 10.
Ensemble averaging, which is usually the most effective of
noise reduction techniques, does not reduce this type of noise
(bottom panel) because it is non-random.
Interestingly, if additional
random noise is present in the signal, then ensemble averaging
becomes effective in reducing both the random noise and the digitization
noise. In essence, the added noise randomizes the
digitization, allowing it to be reduced by ensemble averaging.
Moreover, if there is insufficient random noise already in the
signal, it can be beneficial to add additional noise
artificially! The script RoundingError.m
illustrates this effect, as shown the animation on the right,
which shows the digitized sine wave with gradually increasing
amounts of added random noise in line 8 (generated
by the randn.m function) followed by ensemble averaging of 100
repeats
(in lines 17-20). Look closely at the
waveform in this animation as it changes in response to the
random noise addition shown in the title. You can clearly see
how the noise starts out mostly quantization noise but then
quickly decreases as small but increasing amounts of random
noise are added before the ensemble averaging step, then
eventually increases as too much noise is added. The optimum
standard deviation of random noise is about 0.36 times the
quantization size, as you can demonstrate by adding lesser or
greater amounts via the variable Noise in line 6 of this
script. This technique is called "dithering"
and it is also used in audio and in image processing.
An audible example of this
idea is illustrated by the Matlab/Octave script DigitizedSpeech.m, which starts
with an audio recording of the spoken phrase "Testing, one,
two, three", previously recorded at 44000 Hz and saved in WAV
format (TestingOneTwoThree.wav)
and in .mat format (testing123.mat),
rounds off the amplitude data progressively to 8 bits (256
steps; sound link), shown on the left,
4 bits (16 steps; sound link), and 1 bit
(2 steps; sound link), and then the
2-step case again with random
white noise added before the rounding (2 steps + noise; sound link),
plots the waveforms and plays the resulting sounds,
demonstrating both the degrading effect of rounding and the
remarkable improvement caused by adding noise. (Click on these
sound links to hear the sounds on your computer). Although the
computer program in this case does not perform an explicit ensemble averaging operation as does RoundingError.m,
it's likely that the neurons of the hearing center of your
brain provide that function by virtue of their response time
and memory effect.
This is
a simulation of several techniques described in this paper to
the quantitative measurement of a peak that is buried in excess of random noise, where the signal-to-noise (S/N) ratio is below 2. (Ordinarily, a
S/N ratio of 3 is desired for reliable detection).
The Matlab/Octave script LowSNRdemo.m
performs the simulations and calculations and compares the
results graphically, focusing on the behavior of each method as
the S/N ratio approaches zero. Four methods are compared:
(1) smoothing, followed by the peak-to-peak measure of the smoothed signal and background;
(2) a peak finding method based on the findpeakG function;
(3) unconstrained iterative least-squares fitting (INLS) based on the peakfit.m function; and
(4) constrained classical least squares fitting (CLS) based on the cls2.m function.
The measurements are carried out over a range of peak heights for which the S/N ratio varies from 0 to 2. The noise is random, constant, and white. Each time you run the script, you get the same set of underlying signals but independent samples of the random noise.
Results for the initial values in the script are shown in the plots on the left and in the table printed below, both of which are created by the script LowSNRdemo.m. The graphs on the left show correlation plots of the measured peak height vs the real peak height, which should ideally be a straight line with a slope of 1, an intercept of zero, and an R-squared of 1. As you can see, the simplest smoothed-peak method (upper left) is completely inadequate, with a low slope (because smoothing reduces peak height) and a high intercept (because even smoothed noise has a non-zero peak-to-peak value). The findpeaks function (upper right) works OK for height for higher peak heights but fails completely below a S/N ratio of 0.5 because the peak height falls below the amplitude threshold setting. In comparison, the two least-squares techniques work much better, reporting much better values of slope, intercept of zero, and R-squared. But if you look closely at the low end of the peak height range, near zero, you can see that the values reported by the unconstrained fit (lower left) occasionally stray from the line, whereas the constrained fit (lower right) decrease gracefully all the way to zero every time you run the script. Essentially the reason why it's even possible to make measurements at such low S/N ratios is that the data density is very high: that is, there are many data points in each signal (about 1000 points across the half-width of the peak with the initial script values).
The results are summarized in the table below. The height errors are reported as a percentage of the maximum height (initially 2). (For the first three methods, the peak position is also measured and its relative accuracy is reported. The constrained classical least squares fitting does not measure peak position but rather assumes that it remains fixed at the initial value of 100). You can see that the CLS method has a slight edge in accuracy, but you have to consider also that this method works well only if the peak shape, position, and width are known. The unconstrained iterative method can track changes in peak position and width.
Number of points in half-width of
peak: 1000
Method
Height
Error Position Error
Smoothed peak
21.2359% 120.688%
findpeaksG.m
32.3709% 33.363%
peakfit.m
2.7542%
4.6466%
cls2.m
1.6565%
You can change
several of the factors in this simulation to test the robustness
of these methods. Search for the word 'change' in the comments
for values that can be changed. Reduce MaxPeakHeight (line 8) to
make the problem harder. Change peak position and./or width
(lines 9 and 10) to show how the CLS method fails. As usual, the
more you know, the better your results. Change the increment
(line 4) to change the data density; more data is always better.
(Surprisingly, as it will be shown in Appendix U: Measurement Calibration, it is not even necessary to have an accurate peak
shape model in order to get a good
correlation between measured and actual height).
LowSNRdemo.m also computes the power
spectrum of the signal and the amplitude (square root of
the power) of the fundamental, where most of the power of a
broad Gaussian peak falls, and plots it in Figure(2). The correlation to
peak height is similar to the CLS method, but the intercept is higher because
there is a non-zero quantity of noise even in that one frequency
slice of the power spectrum.
We are now, in the 21st century, into the era of "big
data", where high-speed automated data systems can acquire,
store, and process greater quantities of data than ever before.
As this little example shows, greater quantities of data allow
researchers to probe deeper and measure smaller effects that
previously.
The signal detection problems facing those who search the sky for evidence of extraterrestrial civilizations or interesting natural phenomena are enormous. Among those problems are the fact that we don't know much about what to expect. In particular, we don't know exactly where to look in the sky, or what frequencies might be used, or the possible forms of the transmissions. Moreover, the many powerful sources of natural and terrestrial sources of interfering signals must not be confused for extraterrestrial ones. There is also the massive computer power required, which has driven the development of specialized hardware and software as well as distributed computation over thousands of Internet-connected personal computers across the world using the SETI@home computational screen-saver. Although many of the computational techniques used in this search are far more sophisticated than those covered in this web site, they begin with the basic concepts covered here.
One of the reoccurring themes of this site has
been that the more you know about your data, the more
likely you are to obtain a reliable measurement. In the
case of possible extraterrestrial signals, we don't know
much, but we do know a few things.
We know that electromagnetic radiation over a wide range
of frequencies is used for long-distance transmission on
earth and between earth and satellites and probes far
from earth. Astronomers already use radio telescopes to
receive natural radiations from vast distances. In order
to look at different frequencies at once, Fourier transforms of the raw telescope signals
can be computed over multiple time segments. We
previously saw a simulation
that showed how hard it is to see a periodic component
in the presence of on equal amount of random noise and
yet how easy it is to pick it out in the frequency
spectrum.
Also, transmissions from
extraterrestrial civilizations might be in the form
of equally-spaced pulses, so their detection and
verification is also part of SETI signal processing.
Interestingly, triplets and other groups of equally
spaced pulses appear in the Fourier transforms of high
frequency carrier waves that are amplitude or frequency
modulated (like AM
or FM radio). Of course, there is no reason to
assume, nor to reject, that extraterrestrial
civilizations might use the same methods of
communication as ourselves.
One thing that we know for
sure is that the earth rotates around its axis once a
day and that it revolves around the sun once a year. So
if we look at a fixed direction out from the earth, the
distant stars will seem to move in a predictable
pattern, whereas terrestrial sources will remain fixed
on earth. The huge Arecibo
Observatory dish in Puerto Rico is fixed in
position and is often used to look in one selected
direction for extended periods of time. The field of
view of this telescope is such that a point source at
a distance takes 12 seconds to pass. As
SETI
says: “Radio signals from a distant transmitter
should get stronger and then weaker as the telescope's
focal point moves across that area of the sky.
Specifically, the power should increase and then
decrease with a bell shaped curve (a
Gaussian curve). Gaussian curve-fitting is an excellent test to
determine if a radio wave was generated 'out there'
rather than a simple source of interference somewhere
here on Earth, since signals originating from Earth
will typically show constant power patterns rather
than curves”. Also, any observed 12 second peaks can
be re-examined with another focal point shifted
towards the west to see if it repeats with the
expected time and duration.
We also know
that there will be a Doppler
shift in the frequencies
observed if the source is moving relative to the
receiver; this is observed with sound
waves as well as with electromagnetic
waves
like radio or light. Because the earth is
rotating and revolving at a known and constant speed,
we can accurately predict and compensate for the
Doppler shift caused by earth's motion (this is called
“de-chirping”
the data).
For more on the details of SETI signal processing, see
SETI@home.
This appendix examines more closely the question of measuring peak area rather than peak height to reduce the effect of peak broadening, which commonly occurs in chromatography, for reasons that are discussed previously, and also in some forms of spectroscopy. Under what conditions the measurement of peak area might be better than peak height?
The Matlab/Octave script “HeightVsArea.m” simulates the measurement
of a series of standard samples whose concentrations
are given by the vector 'standards'. Each standard
produces an isolated peak whose peak height is
directly proportional to the corresponding value in
'standards' and whose underlying shape is a Gaussian with a constant peak
position ('pos') and width ('wid'). To simulate the
measurement of these samples under typical conditions,
the script changes the shape of the peaks (by
exponential broadening) and adds a variable baseline
and random noise. You can control, by means of the
variable definitions in the first few lines of the
script, the peak beginning and end, the sampling rate
'deltaX' (increment between x values), the peak
position and width ('pos' and 'wid'), the sequence of
peak heights ('standards'), the baseline amplitude
('baseline') and its degree of variability ('vba'),
the extent of shape change ('vbr'), and the amount of
random noise added to the final signal ('noise').
The resulting peaks are shown in Figure 1, above. The script prepares a series of “calibration curves” plotting the values of 'standard' against the measured peak heights or areas for each measurement method. The measurement methods include peak height in Figure 2, peak area in Figure 3, and curve fitting height and area in Figures 4 and 5, respectively. These plots should ideally have an intercept of zero and an R^{2} of 1.000, but the slope is greater for the peak area measurements because area has different units and is numerically greater than peak height. All the measurement methods are baseline corrected; that is, they include code that attempts to compensate for changes in the baseline (controlled by the variable 'baseline').
With the initial values of 'baseline', 'noise', 'vba', and 'vbr', you can clearly see the advantage of peak area measurements (figure 3) compared to peak height (figure 2). This is primarily due to the effect of the variability of peak shape broadening ('vbr') and to the averaging out of random noise in the computation of area.
If you set 'baseline', 'noise', 'vba', and 'vbr' all to zero, you've simulated a perfect world in which all methods work perfectly.
Curve
fitting can measure both peak
height and area; it
is not even absolutely necessary to use an accurate peak shape
model. Using a simple Gaussian
model in this example works much better for peak area
(Figure 5) than for peak height (Figure 4) but is not
significantly better than a simple peak area
measurement (Figure 3). The best results are obtained
if an exponentially-broadened Gaussian model (shape 31 or 39) is used,
using the code in line 30, but that computation takes
longer. Moreover, if the measured peak overlaps another peak significantly, curve fitting both
of those peaks together can give much more accurate results
that other peak area measurement methods.
Both
Microsoft Excel and OpenOffice Calc have the ability to automate
repetitive tasks using “macros”, saved sequences of commands or
keystrokes that are stored for later use. Macros can be most
easily created using the built-in “Macro Recorder”, which will
literally watch all your clicks, drags, and keystrokes and
record them for later playback. Or you can write or edit your
macros in the macro language of that spreadsheet (VBA in Excel;
Python or JavaScript in Calc). Or you can do both: use the macro
recorder first, then edit the resulting code manually to modify
it.
To enable macros in Excel, click on File, Options, click
Customize Ribbon Tab and check 'Developer' and click 'OK'. To
access the macro recorder, click Developer, Record Macro, give
the macro a name, click Options, assign a Ctrl-key shortcut, and
click OK. Then perform your spreadsheet operations, and when
finished, click Stop Recording and save the spreadsheet.
Thereafter, simply pressing your Ctrl-key shortcut will run the
macro and perform all the spreadsheet operations that you
recorded.
Here I will demonstrate two applications in Excel using macros
with the Solver function. (See http://peltiertech.com/Excel/SolverVBA.html#Solver2
for information about setting up macros and solver on your
version of Excel).
A previous section
described the use of the Solver function applied to the
iterative fitting of overlapping peaks in a spreadsheet. The
steps listed in the second paragraph on that page can easily be
captured with the macro recorder and saved with the spreadsheet.
However, a different macro will needed for each different number
of peaks, because the block of cells representing the “Proposed
Model” will be different for each number of peaks. For example,
the template CurveFitter2Gaussian.xlsm
includes a macro named 'fit' for a 2-peak fit, activated by
pressing Ctrl-f. Here is the text of that macro:
Sub fit()
'
' fit Macro
'
' Keyboard Shortcut: Ctrl+f
'
SolverOk SetCell:="$C$12",
MaxMinVal:=2, ValueOf:=0, ByChange:="$C$8:$D$9", _
Engine:=1,
EngineDesc:="GRG Nonlinear"
SolverSolve
End Sub
You
can see that the text of the macro uses only two macro
instructions: "SolverOK" and "SolverSolve". SolverOK specifies
all the information in the "Solver Parameters" dialog box in its
input arguments: 'SetCell' sets the objective as the percent
fitting error in cell C12, 'MaxMinVal' is set to the second
choice (Minimum), and 'ByChange' specifies the table of cells
representing the proposed model (C8:D9) whose values are to be
changed to minimize the objective in cell C12. The last argument
sets the solver engine to 'GRC Nonlinear', the best one for
iterative peak fitting. Finally, "SolverSolve" starts the Solver
engine. You could easily modify this macro for curve fitter
templates with other numbers of peaks just by changing the cells
referenced in the 'ByChange' argument, e.g. C8:E9 for a 3-peak
fit. In this case, though, is probably just as easy to use the
macro recorder to record a macro for each curve fitter template.
A more elaborate
example of a spreadsheet using a macro is TransmissionFittingCalibrationCurve.xls
(screen
image) that creates a calibration curve for a series of
standard concentrations in the TFit method, which was previously
described on TFit.html#spreadsheet.
Here's a portion of that macro:
Range("AF10").Select
Application.CutCopyMode = False
Selection.Copy
Range("A6").Select
Selection.PasteSpecial
Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks _
:=False,
Transpose:=False
Calculate
Range("J6").Select
Selection.Copy
Range("I6").Select
Selection.PasteSpecial
Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks _
:=False,
Transpose:=False
Calculate
SolverOk SetCell:="$H$6", MaxMinVal:=2,
ValueOf:=0, ByChange:="$I$6", Engine:=1 _
,
EngineDesc:="GRG Nonlinear"
SolverOk SetCell:="$H$6", MaxMinVal:=2,
ValueOf:=0, ByChange:="$I$6", Engine:=1 _
,
EngineDesc:="GRG Nonlinear"
SolverSolve userFinish:=True
SolverSolve userFinish:=True
SolverSolve userFinish:=True
Range("I6:J6").Select
Selection.Copy
Range("AG10").Select
Selection.PasteSpecial
Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks _
:=False,
Transpose:=False
The macro in this spreadsheet repeats this chunk of code several times, once for each concentration in the calibration curve
(changing only the "AF10" in the first line to pick up a
different concentration from the "Results table" in column
AF). This macro
uses several additional instructions, to select ranges
("Range...Select"),
copy ("Selection.Copy")
and paste ("Selection.PasteSpecial Paste:=xlPasteValues") values
from one place to another, and re-calculate the spreadsheet
("Calculate"). Each separate click, menu selection, or key
press creates one or more lines of macro text. The syntax is
wordy but quite explicit and clear; you can learn quite a
bit just by recording various spreadsheet actions and
looking at the resulting macro text.
The random walk was mentioned
in the section on signals
and noise as a type of low-frequency ("pink") noise. Wikipedia says:
"A
random walk is a mathematical formalization of a path that
consists of a succession of random steps. For example, the path
traced by a molecule as it travels in a liquid or a gas, the
search path of a foraging animal, superstring behavior, the
price of a fluctuating stock and the financial status of a
gambler can all be modeled as random walks, although they may
not be truly random in reality."
Random walks describe and serve as a model for many kinds
of unstable behavior. Whereas white, 1/f, and blue noises are
anchored to a mean value to which they tend to return, random
walks tend to be more aimless and often drift off on one or
another direction, possibly never to return. Mathematically,
a
random walk can be modeled as the cumulative sum of some random
process, for example the 'randn' function. The graph on the
right compares a 200-point sample of white noise (computed as
'randn' and shown in blue) to a random
walk (computed as a cumulative sum, 'cumsum', and shown in red). Both samples are scaled to
have exactly the same standard deviation, but their behavior is vastly different. The random
walk has much more low frequency behavior, in this case
wandering off beyond the amplitude range of the white noise.
This type of random behavior is very disruptive to the
measurement process, distorting the shapes of peaks and causing
baselines to shift and making them hard to define, and it can
not be reduced significantly by smoothing (See NoiseColorTest.m). In this
particular example, the random walk has an overall positive
slope and a "bump" near the middle that could be confused for a
real signal peak (it's not; it's just noise). But another sample
might have very different behavior. Unfortunately, it is not uncommon to observe
this behavior in experimental signals.
To demonstrate the measurement difficulties, the script RandomWalkBaseline.m simulates
a Gaussian peak with randomly variable position and width, on a
random walk baseline, with a S/N ratio is 15. The peak is
measured by least-squares curve fitting methods using peakfit.m with two different methods of
baseline correction in an attempt to handle the random walk:
(a) a single-component Gaussian model (shape 1) with autozero set to 1 (meaning a linear baseline is first interpolated from the edges of the data segment and subtracted from the signal): peakfit([x;y],0,0,1,1,0,10,1);
(b) a 2-component model, the first being a Gaussian (shape 1) and the second a linear slope (shape 26), with autozero set to 1: peakfit([x;y],0,0,2,[1 26],[0 0],10,0).
In this particular case the fitting error is lower for the second method, especially if the peak falls near the edges of the data range.
But the
relative percent errors of the peak parameters show that the first method gives a
lower error for position and width, at least in this case. On
average, the peak parameters are about the same.
Position Error
Height Error Width Error
Method a: 0.2772
3.0306 0.0125
Method b: 0.4938
2.3085 1.5418
You can compare this to WhiteNoiseBaseline.m which has
a similar signal and S/N ratio, except that the noise is white. Interestingly, the
fitting error with
white noise is greater, but the parameter errors (peak position,
height, width, and area) are lower, and the residuals are more random and less likely to
produce false noise peaks. This is because the random walk noise
is very highly
concentrated at low
frequencies where the signal
frequencies usually lie, whereas white noise also has
considerable power at higher
frequencies, which increases the fitting error but does comparatively little damage to signal
measurement accuracy. This may be slightly counter-intuitive,
but it's important to realize that fitting error does not always
correlate with peak parameter error. Bottom line: random
walk is troublesome.
Depending on the type of experiment, an instrumental design
based on modulation
techniques may help, and ensemble
averaging multiple measurements can help with any type of
unpredictable random noise, which is discussed in the very next
section.
In some experimental designs it may be beneficial to apply the technique of modulation, in which one of the controlled independent variables is oscillated in a periodic fashion, and then to detect the resulting oscillation in the measured signal. With the right instrumental design, some types of noise and drift may thereby be reduced or eliminated.
A simple example applies to optical measurement systems like the one pictured on the right. A light source illuminates a test object (DUT="Device Under Test") and the resulting light from the test object is measured by photo detector. Depending on the objective of the experiment and the arrangement of the parts, the detector might measure the light transmitted by, reflected by, scattered by, or excited by the light beam. A optical chopper rapidly and repeatedly interrupts the light beam falling on the test object, so that the photo detector sees an oscillating signal, and the following electronic system is designed to measure only the oscillating component and to ignore the constant (direct current) component. The advantage of this arrangement is that any interfering signals introduced after the chopper - such as constant light that comes from the test object itself, or ambient light that leaks in from the outside, or any constant background signal generated by the photo detector itself - are not oscillating and are thus rejected. This works best if the electronics is synchronized to the chopper frequency. That's actually the function of the lock-in amplifier, which receives a synchronizing reference signal directly from the chopper to guarantee synchronization even if the chopper frequency were to vary (c.f. the interactive simulation on https://terpconnect.umd.edu/~toh/models/lockin.html and T. C. O'Haver, "Lock-in Amplifiers," J. Chem. Ed. 49, March and April (1972). The lock-in amplifier is sometimes viewed as a "black box" with almost magical abilities, but in fact it is actually performing a rather simple (but very useful) operation, as shown in this simulation.
AmplitudeModulation.m
is a
Matlab/Octave script simulation of modulation and synchronous
detection, in which the signal created when the light beam
scans the test sample is modeled as a Gaussian band ('y'),
whose parameters are defined in the first few lines of the
script (updated for recent version of Matlab). As the spectrum
of the sample is scanned, the light beam is amplitude
modulated by the chopper, represented as a square wave defined
by the bipolar vector 'reference', which switches between +1
and -1, shown in the top panel on the left. The modulation
frequency is many times faster than the rate at which the
sample is scanned. The light emerging from the sample
therefore shows a finely chopped Gaussian ('my'), shown in the
second panel on the left. But the total signal seen by the detector also includes
an unstable background introduced after the modulation ('omy'), such as
lighted emitted by the sample itself or detector background,
which in this simulation this is modeled as a random walk
(Appendix O), which seriously distorts the signal, shown in
the third panel. The detector signal is then sent to a lock-in
amplifier that is synchronized to the reference waveform. The
essential action of the lock-in is to multiply the signal by
the bipolar reference waveform, inverting the signal
when the light of off and passing it
unchanged when the light is on. This causes the unmodulated background signal to be
converted into a bipolar square wave, whereas the modulated
signal is not effected because it is "off" when the reference
signal is negative. The result ('dy') is shown in the 4^{th}
panel. Finally,
this signal is low-passed
filtered by the last stage in the lock-in amplifier to
remove the modulation frequency, resulting in the recovered
signal peak 'sdy' shown in the bottom panel. In effect, the
modulation transforms the signal to a higher frequency
('frequency' in line 44) where low-frequency weighted noise on
the baseline (line 50) are less intense.
SignalToNoiseRatio = 4
Relative % Error:
Position
Height Width
Original: 8.07
23.1
13.7
Modulated: 0.11
0.22
1.01
Each time you run it you will get the same signal peak but a very different random walk background. The S/N ratio will vary from about 4 to 9. It's not uncommon to see a 100-fold improvement in peak height accuracy with modulation, as in the example shown here. (If you wish, you can change the signal peak parameters and the noise level in the first few code lines of this simulation. For an even greater challenge, change line 47 to "baseline=10.*noise+cumsum(noise);" to make the noise a mixture of white and random walk drift, which results in a really ugly raw signal; you can see that the white noise makes it through the synchronous detector but is reduced by the smoothing lowpass filter in the last stage). In effect, the low-pass filter determines the frequency bandwidth of the lock-in system, but it also increases the response time to step changes (as in the Morse Code example).
This improvement in measurement accuracy works only because the dominant random error in this case is
(a) introduced after the modulation, and
(b) a mostly low-frequency noise.
If the noise were white, there would be no improvement, because white noise is the same at all frequencies; in fact there would be a slight reduction in precision because of the fact that the chopper blocks half of the light on average. If the sample (device under test) generates an "absorption" peak that starts a some positive value and then dips down to a lower value before returning, the demodulated output will the a negative-goring peak rather than a positive peak (see AmplitudeModulationAbsorption.m).
This simulation
explores the problem of measuring the height of a small peak (a
"child peak") that is buried in the tail of a much stronger overlapping peak (a "parent peak"), in
the especially challenging case that the smaller peak is not even visible to
the unaided eye. Three different measurement tools will be
explored: iterative least-squares,
classical least-squares regression,
and peak detection,
using the Matlab/Octave tools peakfit.m,
cls.m, or findpeaksG.m,
respectively. (Alternatively, you could use the corresponding spreadsheet templates).
In this example the larger peak is located at x=4 and has a
height of 1.0 and a width of 1.66; the smaller measured peak is
located at x=5 and has a height of 0.1; both have a width of
1.66. Of course, for the purposes of this simulation, we pretend
that we don't necessarily know all of these facts and we will
try to find methods that will extract such information as
possible from the data, even if the signal is noisy. The
measured peak is small enough and close enough to the stronger
overlapping peak (separated by less than the width of the peaks)
that it never forms a maximum
in the total signal. So it looks like there is only
one peak, as
shown on the figure on the right. For that reason the
findpeaks.m function (which automatically finds peak maxima)
will not be useful by itself to locate the smaller peak. Simpler
methods for detecting the second peak also fail to provide a way
to measure the smaller second peak, such as inspecting the
derivatives of the signal (the smoothed fourth derivative shows
some evidence of
asymmetry, but that could just be due to the shape of the
larger peak), or Fourier
self-deconvolution to narrow the peaks so they are
distinguishable, but that is unlikely to be successful with this
much noise. Least-squares methods work better when the
signal-to-noise ratio is poor, and they
can be fine-tuned to make use of available information or
constraints, as will be demonstrated below.
The selection of the best method will depend on what is known
about the signal and the constraints that can be imposed; this
will depend in your knowledge of your experimental signal. In
this simulation (performed by the Matlab/Octave script SmallPeak.m), the signal is composed of
two Gaussian peaks (although that can be changed if desired in
line 26). The first question is: are there more than one peak
there? If we perform an unconstrained iterative fit of a single Gaussian to the
data, as shown in the figure on the right, it shows little or no
evidence of a second peak - the residuals look pretty random.
(If you could reduce the noise, or ensemble-average
even as few as 10 repeat signals, then the noise would be low
enough to see evidence
of a second peak). But as it is, there is nothing
that pops out at you suggesting a second peak.
But suppose we suspect that there should
be another peak of the same Gaussian
shape just on the right side of the larger
peak. We can try fitting a pair of Gaussians to the data (figure on the left), but in
this case the random noise is enough that the fit is not stable.
When you run SmallPeak.m,
the script performs 20 repeat fits (NumSignals in line 20) with
the same underlying peaks but with 20 different random noise
samples, revealing the stability (or instability) of each
measurement method. The fitted peaks in Figure window 1 bounce
around all over the place as the script runs, as you can see in
the animation on the left. The fitting error is on average lower that the
single-Gaussian fit, but that by itself does not mean that the
peak parameters so measured will be reliable; it could just be
"fitting the noise". If it were
isolated all by itself, the small
peak would have a S/N ratio of
about 5 and it could be measured to a peak height
precision of about 3%, but the presence of the larger
interfering peak makes the measurement much more difficult.
(Hint: After running SmallPeak.m the first time, spread out all
the figure windows so they can all be seen separately and don't
overlap. That way you can compared the stability of the
different methods more easily.)
But suppose that we have reason to expect that the two peaks will have the same width, but we don't know what that width might be. We could
try an equal width Gaussian fit (peak shape #6, shown in Matlab/Octave
Figure window 2); the resulting fit is much more stable and
shows that a small peak is located at about x=5 on the right of
the bigger peak, shown below on the left. On the other
hand, if we know the peak positions beforehand, but not the widths, we can use a fixed-position Gaussian
fit (shape #16) shown on the right (Figure window 3). In
the very common situation where the objective is to measure an unknown
concentration of a known component,
then it's possible to prepare standard samples where the
concentration of the sought component is high enough for its
position or width to be determined with certainty.
So far all of these examples have used iterative
peak fitting with at least one peak parameter (position and/or
width) unknown and determined by
measurement. If, on the other hand, all
the peak parameters are known except the peak height,
then the faster and more direct classical
least-squares regression (CLS) can be employed (Figure
window 4). In this case you need to know the peak position and
width of both the measured and the larger interfering peaks (the
computer will calculate their heights). If the positions and the
heights really are constant and known, then this method gives
the best stability and precision of measurement. It's also
computationally faster, which might be important if you have
lots of data to process automatically.
The problem with CLS is that it fails to give accurate
measurements if the peak position and/or width changes without
warning, whereas two of the iterative methods (unconstrained
Gaussian and equal-width Gaussian fits) can adapt to such
changes. It some experiments it quite common to have small
unexpected shifts in the peak position, especially in
chromatography or other flow-based measurements, caused by
unexpected changes in temperature, pressure, flow rate or other
instrumental factors. In SmallPeaks.m, such x-axis shifts can be
simulated using the variable "xshift" in line 18. It's initially
zero, but if you set it to something greater (e.g. 0.2) you'll
find that the equal-width Gaussian fit (Figure window 2) works
better because it can keep up with the changes in x-axis shifts.
But with a greater x-axis shift (xshift=1.0) even the equal-width fit has trouble. Still, if we know the separation between the two peaks, it's possible to use the findpeaksG function to search for and locate the larger peak and to calculate the position of the smaller one. Then the CLS method, with the peak positions so determined for each separate signal, shown in Figure window 5 and labeled "findpeaksP" in the table below, works better. Alternatively, another way to use the findpeaks results is a variation of the equal-width iterative fitting method in which the first guess peak positions (line 82) are derived from the findpeaks results, shown in Figure window 6 and labeled "findpeaksP2" in the table below; that method does not depend on accurate knowledge of the peak widths, only their equality.
Each time you run
SmallPeaks.m, all of these methods are
computed “NumSignals” times (set in
line 20) and compared in a table giving the average peak height
accuracy of all the repeat runs:
xshift=0
Unconstr. EqualW FixedP FixedP&W findpeaksP
findpeaksP2
35.607 16.849 5.1375
4.4437 13.384 16.849
xshift=1
Unconstr. EqualW FixedP FixedP&W findpeaksP
findpeaksP2
31.263 44.107 22.794
46.18 10.607 10.808
Bottom line. The more
you know about your signals, the better you can measure them. A
stable signal with known peak positions and widths is the most precisely
measurable in the presence of random noise ("FixedP&W"), but
if the positions or widths vary from measurement to measurement,
different methods must be used, and precision is degraded
because more of the available information is used to account for
changes other than the ones you want to measure.
From a
signal-to-noise perspective, the stock market is an interesting
example. A national or global stock market is an aggregation of
large numbers of buyers and sellers of shares in publicly traded
companies. They are described by stock market indexes, which are
computed as the weighted average of a large number of selected
stocks. For example, the S&P
500 index is computed from the stock valuations of 500
large US companies. Millions of individuals and organizations
participate in the buying and selling of stocks on a daily
basis, so the S&P 500 index is a prototypical "big data"
conglomerate, reflecting the overall value of 500 of the largest
companies in the largest stock market on earth. Other stock
indices, such as the Russel 2000, include an even larger number
of smaller companies. Individual stocks can fail or fall
drastically in value, but the market indexes average out the
performance of hundreds of companies.
A plot of the daily value, V, of the S&P 500 index vs time,
T, from 1950 through September of 2016 is shown in the following
graphs.
Each plot contains 16608 data points, one for each business day, shown in red. The graph on the left plots V and the graph on the right plots the natural logarithm of V, ln(V). There are considerable up-and-down fluctuations over time that can be related to historical events: the oil crisis of the 1970s, the tech boom and bust of 2000, the subprime mortgage crisis of 2008. Still, the long-term trend of the value is upwards - the current value is over 100 times greater than its value in 1950. This is basically why people invest in the stock market, because on average, over the long run, stock values usually go up. The most common way to model this overall long-term increase over time is based on the equation for compound interest that predicts the growth of investments that have a constant rate of return, such as savings accounts or certificates of deposit:
V = S*(1 + R)^{T}
where V is the value, S is the starting value, R is the annual rate of return, and T is time. By itself, this expression would yield a smooth curve, without all the peaks and dips. The values of S and R that result in the best fit to the stock market data (shown by the blue lines in the graphs) can be determined in two ways:
(1) directly, using the iterative curve fitting method, shown on the left above, or
(2) by taking the logarithm of the values and fitting those to a straight line, shown on the right above.
FitSandP.m is a
Matlab/Octave script that performs both of these calculations
using the data in SandPfrom1950.mat.
When applied to the S&P 500 index data, the rate of return R is about 0.07 (or 7%), but
interestingly these two methods give slightly different results, even
though the exact same data
are used for both, and even though both methods yield the same 7% rate if applied to noiseless synthetic
data calculated from this
expression. This difference between methods is caused by the
irregularities in the stock data that deviate from a smooth line
- in other words, the noise - and it is exacerbated by the large range of the
value data V over time and by the fact that the average return
from 1950 to 1983 is slightly lower than that from 1983 to 2016.
From the point of view of curve fitting, the deviations from a
smooth curve described by the compound interest expression is
just noise.
But from the point of view of the stock market investor, those
deviations can be an opportunity and a warning. Naturally, most investors would like to know how the
stock market will behave in the future, but that requires
extrapolation beyond the range of the available data, which is
always uncertain and dangerous. But still, it's most likely (but not
certain) that the long term behavior of the market (say, over a period of 10 years
or more) will be similar to the past - that is, growing
exponentially at about the same rate as before but with
unpredictable fluctuations similar to what has occurred in the
past. We can take a closer look at those fluctuations by
inspecting the residuals
– that is, subtracting the fitted
curve from the raw data, as shown in iSignal
on the left. There are several notable
features of this "noise". First, the deviations are roughly
proportional to V and thus relatively equal when plotted
on a log scale. Second, the noise has a distinctly low-frequency
character; the periodogram (lower panel, in
red) shows peaks at 33, 16, 8, and 4 years. There are also,
notably, numerous instances over the years when there is a sharp
dip followed by a slower recovery close to the previous value.
And conversely, every peak is eventually followed by a dip. The
conventional advice in investing is to "buy low" (on the dips)
and "sell high" (on the peaks). But of course the problem is
that you can not reliably determine in advance exactly
where the peaks and dips will fall; you have only the past to
guide you. Still, if the current market value is much higher than the long-term
trend, it will likely fall, and if the market value is much lower than the long-term
trend, it will likely rise, eventually. The only thing you can
be sure of is that, in the long run, the market will rise. This
is why saving for retirement by investing in the stock market,
and starting as soon as possible, is so important: over a 30-year working life, the
market is almost guaranteed to rise substantially. The most
painless way to do this is with your employer's 401k or 403b
automatic payroll withdrawal plan. You can not actually invest
in the stock market as a whole, but you can invest in index mutual funds or exchange
traded funds (ETFs), which are collections of stocks that
are constructed to match or track the components of a market index. Such funds
typically have very low management fees, an important factor
in selecting an investment. Other mutual funds attempt to
“beat the market” by carefully buying and selling stocks in an
attempt to create a return that is greater than the overall
market indexes; some are temporarily successful in
doing that, but they charge higher management fees. Mutual
finds and ETFs are much less risky investments than individual
stocks.
Some companies periodically distribute
payouts to investors called “dividends”. Those dividends are
independent of the day-to-day variations in stock price, so
even if the stock value drops temporarily, you still get the
same dividend. For that reason it's important that you set
your investment account to “automatically reinvest dividends”,
so when the share price drops, the dividends are buying shares
at the lower price. The S&P 500
index values used above, called price returns, did not
include dividend
reinvestment; the total returns with dividends reinvested
(https://en.wikipedia.org/wiki/S%26P_500_Index#Versions)
would have been substantially
higher, closer to 11%. (With an average total annual
return of 11%, and starting with an investment of $170 the
first month - that's less than $6 a day - and increasing it 5%
each year, you could accumulate over $600,000 over a
30 year working life, or $1,000,000 if you continued investing
an additional 5 years, as shown by the spreadsheet graphic on the
right). And that’s starting at just $6 per day, about the cost
of a fancy coffee at Starbucks. Think about that the next time
you see a line of young people waiting to order their daily
coffee.
To illustrate how much influence stock market volatility fluctuation (“noise”) has on the market gains, the Matlab/Octave script SnPsimulation.m adds proportional noise to the compound interest calculation to mimic the S&P data, performs the two curve fitting methods described above, repeats the allocations over and over with independent samples of proportional noise, and then calculates the mean and the relative standard deviation (RSD) of the rates of return. A typical result is:
TrueRateOfReturn =
0.07
Measured Rate RSD
Coordinate transformation: 0.07112
8.9%
Iterative curve fitting:
0.07972 19.9%
As
you
can see, the two methods don't agree. In this example, the
return calculated by the iterative method is higher, but it could
just have easily been the other way. The fact is that the
standard deviations are fairly large, and the iterative method
always has a higher standard deviation, because it
weights the higher values more heavily, where deviations from
the line are higher, whereas the log transformation method
weights the data more evenly. Even with this uncertainty,
investing in a stock market index fund almost always performs
better in the long run than more predictable investments
such as saving accounts or CDs, which have much lower rates of
return.
In
investing in the stock market, it's important to focus on the
long-term trends and not to be frightened by the short-term up
and down fluctuations. It's similar to the difference between weather
and climate;
the large and dramatic short-term weather variations
tend to disguise the much smaller long term climate warming
that is slowly melting the icecaps and raising
the sea levels (whether it is caused by human activity or
by natural causes alone or by a combination of both).
For
a spreadsheet template that allows you to calculate the possible
returns on long-term investments in stock market mutual funds,
see https://terpconnect.umd.edu/~toh/simulations/Investment.html.
Note added in June 2020. You might be wondering what effect more recent events have had on the overall stock market performance, in particular the trade wars of 2019 and the Coronavirus pandemic of 2020. Extending the data plotted above to include the S&P results for the dates up to June 2020 has remarkably little effect, as seen in the log plot below.
The average return is still about 7%.
The
recent changes are more evident if you take a much closer look
at the period from 2016 to 2020, for which the return over that
short period was indeed greater, about 9.5%. These 2019 and the
2020 dips, although they were quite sharp and caused a lot of
anxiety at the time, recovered quickly and as a result had
little effect on the overall long-term performance. When stocks
drop, even for well-known and valid reasons, some investors buy
shares at the reduced prices, and when stocks rise, especially
when they hit all-time highs, some investors sell shares, to
"lock in their gains". This behavior acts as a natural brake on
the fluctuations of the market.
In the section
of Signals and Noise, I
said: “The quality of a signal is often expressed as the
signal-to-noise (S/N) ratio, which is the ratio of the true
signal amplitude ... to the standard deviation of the noise.”
That's a simple enough statement, but automating the measurement
of signal and the noise in real signals is not always
straightforward. Sometimes it's difficult to separate or
distinguish between the signal and the noise, because it depends
not only on the numerical nature of the data, but also on the
objectives of the measurement.
For a simple DC (direct current) signal, for example measuring a
fluctuating voltage, the signal is just the average voltage
value and the noise is its standard deviation. This is easily
calculated in a spreadsheet or in Matlab/Octave:
>>
signal=mean(NoisyVoltage);
>> noise=std(NoisyVoltage);
>> SignalToNoiseRatio=signal/noise
But usually things are more complicated. For example,
if the signal is a rectangular pulse (as in the figure on the
left) with constant random noise, then the simple formulation
above will not give accurate results. If the signal is stable
enough that you can get two successive signal recordings m1 and m2 that are identical except for the noise, then you can simply
subtract the signal out: the
standard deviation of the noise is then given by sqrt((std(m1-m2)^{2})/2), where std is the standard
deviation function (because random noise
adds quadratically). But not every signal source is stable
and repeatable enough for that to work perfectly. Alternatively,
you can try to measure the average signal just over the top of
the pulse and the noise only over the baseline interval before
and/or after the pulse. That's not so hard to do by hand, but
it's harder to automate with a computer, especially if the
position or width of the pulse changes. It's basically the same
for smooth peak shapes like the commonly-encountered Gaussian
peak (as in the figure on the right). You can
estimate the height of the peak by smoothing
it and then taking the maximum of the smoothed peak as the
signal: max(fastsmooth(y,10,3)), but the accuracy would degrade if you choose too high
or two low a smooth width. And clearly all this depends on
having a well-defined baseline in the data where there is only
noise. It doesn't work if the noise varies with the amplitude of
the peak.
In many cases, curve fitting can be helpful. For example, you could use peak fitting or a peak detector to locate multiple peaks and measure their peak heights and their S/N ratios on a peak-to-peak basis, by computing the noise as the standard deviation of difference between the raw data and the best-fit line over the top part of the peak. That's how iSignal measures S/N ratios of peaks. Also, iSignal has baseline correction capabilities that allow the peak to be measured relative to the nearby baseline.
Curve fitting also works for complex signals of indeterminate shape that can be approximated by a high-order polynomial or as the sum of a number of basic functions such as Gaussians, as in the example shown on the left. In this example, five Gaussians are used to fit the data to the point where the residuals are random and unstructured. The residuals (shown in blue below the graph) are then just the noise remaining in the signal, whose standard deviation is easily computed using the built-in standard deviation function in a spreadsheet ("STDEV") or in Matlab/Octave ("std"). In this example, the standard deviation of the residuals is 111 and the maximum signal is 40748, so the percent relative standard deviation of the noise is 0.27% and the S/N ratio is 367. (The positions, heights, and widths of the Gaussian components, usually the main results of the curve fitting, are not used in this case; curve fitting is used only to obtain a measure the noise via the residuals). The advantage of this approach over simply subtracting two successive measurements of the signal is that it adjusts for slight changes in the signal from measurement to measurement; the only assumption is that the signal is a smooth, low-frequency waveform that can be fit with a polynomial or a collection of basic peak shapes and that the noise is random and mostly high-frequency compared the the signal. But don't use too high a polynomial order' otherwise you are just "fitting the noise".
With periodic signal waveforms the situation is a bit more complicated. As an example, consider the audio recording of the spoken phrase "Testing, one, two, three" (click to download in .mat format or in WAV format) that was used previously. The Matlab/Octave script PeriodicSignalSNR.m loads the audio file into a variable “waveform”, then computes the average amplitude of the waveform (the “envelope”) by smoothing the absolute value of the waveform:
envelope =
fastsmooth(abs(waveform), SmoothWidth, SmoothType);
The
result
is plotted on the left, where the waveform is in blue and the
envelope is in red. The signal is easy to measure as the maximum
or perhaps the average of the waveform, but the noise is not so
evident. The human voice is not reproducible enough to get a
second identical recording to subtract out the signal as above.
Still, there will be often be gaps in the sound, during which
the background noise will be dominant. In an audio (voice or
music) recording, there will typically be such gaps at the
beginning, then the recording process has already started but
the sound has not yet begun, and possibly at other short periods
when there are pauses in the sound. The idea is that, by
monitoring the envelope of the sound and noting when it falls
below some adjustable threshold value, we can automatically
record the noise that occurs in those gaps, whenever they may
occur in a recording. In PeriodicSignalSNR.m,
this operation is done in lines 26-32, and the threshold is set
in line 12. The threshold value has to be optimized for each
recording. When the threshold value is set to 0.015 in the
"Testing, one, two, three" recording, the resulting noise
segments are located and are marked in red in the plot on the
right. The program determines the
average noise level in this recording simply by computing the
standard deviation of those segments (line 46), then computes
and prints out the peak-to-peak S/N ratio and the RMS (root mean
square) S/N ratio.
PeakToPeak_SignalToNoiseRatio =
143.7296
RMS_SignalToNoiseRatio = 12.7966
The frequency distribution of the noise is also
determined (lines 60-61) and shown in the figure on the left, using
the PlotFrequencySpectrum function, or you could have used
iSignal in the frequency spectrum mode (Shift-S). The spectrum
of the noise shows a strong component very near 60 Hz, which is
almost certainly due to power line
pickup (the recording was made in
the USA, where AC power is 60Hz); this suggests that better
shielding and grounding of the electronics might help to clean
up future recordings. The lack of strong components at 100 Hz
and above suggests that the vocal sounds have been effectively
suppressed at this threshold setting. The script can be applied
to other sound recordings in WAV format simply by changing the
file name and time axis in lines 8 and 9.
SegmentedSmooth.m, illustrated on the right, is a segmented variant of fastsmooth.m, which can be useful if the widths of the peaks or the noise level varies substantially across the signal. The syntax is that same as fastsmooth.m, except that the second input argument "smoothwidths" can be a vector: SmoothY = SegmentedSmooth (Y, smoothwidths,type,ends). The function divides Y into a number of equal-length regions defined by the length of the vector 'smoothwidths', then smooths each region with a smooth of type 'type' and width defined by the elements of vector 'smoothwidths'. In the simple example in the figure on the right, smoothwidths=[31 52 91], which divides up the signal into three regions and smooths the first region with smoothwidth 31, the second with 51, and the last with 91. Any number and sequence of smooth widths can be used. If the peak widths increase or decrease regularly across the signal, you can calculate the smoothwidths vector by giving only the number of segments ("NumSegments") , the first value, "startw", and the last value, "endw", like so:
wstep=(endw-startw)/NumSegments;
smoothwidths=startw:wstep:endw;
Type "help
SegmentedSmooth" for other examples examples. DemoSegmentedSmooth.m demonstrates the operation
with different signals consisting of noisy
variable-width peaks that get progressively wider, like
the figure on the right. FindpeaksSL.m is the same thing
for Lorentzian peaks.
SegmentedSmoothTemplate.xlsx is
a segmented multiple-width data
smoothing spreadsheet template, functionally similar to
SegmentedSmooth.m. In this version there are 20 segments. SegmentedSmoothExample.xlsx is
an example with data (graphic). A related
spreadsheet GradientSmoothTemplate.xlsx (graphic) performs a linearly increasing (or decreasing) smooth
width across the entire signal, given only the start and end
values, automatically generating as many segments are are
necessary. Of course, as is usual with spreadsheets, you'll have
to modify these templates for your particular number of data
points, usually by inserting rows somewhere in the middle and
then drag-copying down from above the insert, plus you may have
to change the x-axis range of the graph. (In contrast, the
Matlab/Octave functions do that automatically).
Segmented Fourier filter. SegmentedFouFilter.m
is a segmented version of FouFilter.m,
which applies different center frequencies and widths to
different segments of the signal. The syntax is the same as
FouFilter.m except that the two input arguments
'centerFrequency' and 'filterWidth' must be vectors with the
values of centerFrequency of filterWidth for each segment. The
signal is divided equally into a number of segments determined
by the length of centerFrequency and filterWidth, which must be
equal. Type 'help SegmentedFouFilter' for help and examples.
findpeaksSG.m is a variant of the findpeaksG
function, with the same syntax, except that the
four peak detection parameters can be vectors, dividing up the signal into regions that are
optimized for peaks of different widths. Any number of segments can be
declared, based on the lengths of the SlopeThreshold
input argument. (Note: you need only enter vectors for
those parameters that you want to vary between segments;
to allow any of the other peak detection parameters to
remain unchanged across all segments, simply enter a
single scalar value for that parameter; only the
SlopeThreshold must be a vector). In the example shown
on the left, the
script TestPrecisionFindpeaksSG.m creates a noisy signal with
three peaks of widely different widths, measures the
peak positions, heights and widths of each peak using
findpeaksSG, and prints out the percent relative
standard deviations of parameters of the three peaks in
100 measurements with independent random noise. With
3-segment peak detection parameters, findpeaksSG
reliably detects and accurately measures all three
peaks. In contrast, findpeaksG, tuned to the middle peak
(using line 26 instead of line 25), measures the first
and last peaks poorly. You can also see that the
precision of peak parameter measurements gets
progressively better (smaller relative standard
deviation) the larger the peak widths, simply because
there are more data points in those peaks. (You can
change any of the variables in lines 10-18).
findpeaksSb.m is a segmented variant of
the findpeaksb.m
function. It has the
same syntax as findpeaksSb, except that the arguments
"SlopeThreshold", "AmpThreshold", "smoothwidth",
"peakgroup", "window", "PeakShape", "extra",
"NumTrials", "autozero", and "fixedparameters" can all
be optionally scalars or vectors with one entry for
each segment. This allows the function to handle widely
varying signals with peaks of very different shapes and
widths and backgrounds. In the example on the right, the
Matlab/Octave script DemoFindPeaksSb.m creates a series of Gaussian
peaks whose widths increase by a factor of 25 and that are superimposed in
a curved baseline with random white noise that increases
gradually across the signal. In this example, four segments are used, changing the peak
detection and curve fitting values so that all the peaks
are measured accurately.
SlopeThreshold
= [.01 .005 .002 .001];
AmpThreshold = 0.7;
SmoothWidth = [5 15 30 35];
FitWidth = [10 12 15 20];
windowspan = [100 125 150 200];
peakshape = 1;
autozero = 3;
The script also computes the relative percent error of the measurement of peak position, height, width, and area for each peak.
measurepeaks.m is an automatic peak detector peaks of arbitrary shape.
It is based on findpeaksSG, with which it shares the first 6
input arguments; the four peak detection arguments can be vectors to accommodate
signals with peaks of widely varying widths. It returns a table containing the peak
number, peak position, absolute peak height, peak-valley
difference, perpendicular drop area, and tangent skim area of
each peak it detects. If the last input argument ('plots') is
set to 1, it will plot the
entire signal with numbered peaks and will also plot the individual
peaks with the peak
maximum (red), valley points (magenta), and tangent lines
(cyan), as shown on the right. The peak height and x-position is
indicated by the red circle, the perpendicular
drop area is the total area measured between the two magenta
vertical lines down to zero, and the tangent skim area is the
area between the cyan baseline and the blue peak (which
compensates for a flat baseline). Type “help measurepeaks” and
try the examples there or run testmeasurepeaks.m
(graphic animation). The
related functions autopeaks.m and autopeaksplot.m are similar, except
that the four peak detection parameters can
be omitted and the function will calculate estimated initial
values.
The script HeightAndArea.m tests
the accuracy of peak height and area measurement with signals
that have multiple peaks with variable width, noise, background,
and peak overlap. Generally, the values for absolute peak height
and perpendicular drop area are best for peaks that have no
background, even if they are slightly overlapped, whereas the
values for peak-valley difference and for tangential skim area
are better for isolated peaks on a straight or slightly curved
background. Note: this function uses smoothing (specified by the
SmoothWidth input argument) only for peak detection; it performs
its measurements on the raw unsmoothed y data. If the raw data
are noisy, it may be best to smooth the y data yourself before
calling measurepeaks.m, using any smooth function of your
choice.
Other segmented functions. The same segmentation code used in SegmentedSmooth.m (lines 53-65)
can be applied to other functions simply by editing the first
line in the first for/end loop (line 59) to refer to the
function that you want to apply in a segmented fashion. For
example, segmented peak sharpening
can be useful when a signal has multiple peaks that vary in
width, and segmented deconvolution
can be useful when a signal has multiple peaks that vary in
width or tailing vary substantially across the signal: SegExpDeconv(x,y,tc) deconvolutes y
with a vector of exponential functions whose time constants are
specified by the vector tc. SegExpDeconvPlot.m
is the same except that it plots the original and deconvoluted
signals and shows the divisions
between the segments by vertical magenta lines.
Most scientific
measurements involve the use of an instrument that actually
measures something else and converts it to the desired
measure. Examples are simple weight scales (which actually
measure the compression of a spring), thermometers (which
actually measure thermal expansion), pH meters (which
actually measure a voltage), and devices for measuring
hemoglobin in blood or CO_{2} in air (which actually measure the
intensity of a light beam). These instruments are
single-purpose, designed to measure one quantity, and
automatically convert what they actually measure into the
the desired quantity and display it directly. But to insure
accuracy, such instruments must be calibrated,
that is, used to measure one or more calibration standards
of known accuracy, such as a standard weight or a sample
that is carefully prepared to a known temperature, pH, or
sugar content. Most are pre-calibrated at the factory for
the measurement of a specific substance in a specific type
of sample.
Analytical
calibration. General-purpose
instrumental techniques that are used to measure the
quantity of many different chemical components in unknown
samples, such as the various kinds of spectroscopy,
chromatography, and electrochemistry, or combination
techniques like “GC-mass
spec”, must also be calibrated, but because those
instruments can be used to measure a wide range of compounds
or elements, they must be calibrated by
the
user for
each substance and for each type of sample. Usually this is
accomplished by carefully preparing (or purchasing) one or
more “standard samples” of known concentration, such as
solution samples in a suitable solvent. Each standard is
inserted or injected into the instrument, and the resulting
instrument readings are plotted against the known
concentrations of the standards, using least-squares
calculations to compute the slope
and intercept, as
well as the standard deviation of the slope (sds)
and intercept (sdi).
Then the "unknowns" (that is, the samples whose
concentrations are to be determined) are measured by the
instrument and their signals are converted into
concentrations with the aid of the calibration curve. If the
calibration is linear, the sample concentration C of any
unknown is given by (A - intercept) /
slope,
where A is the measured signal (height or area) of that
unknown. The predicted standard deviation in the sample
concentration is C*SQRT((sdi/(A-intercept))^2+(sds/slope)^2)
by the rules for propagation
of error. All these calculations can be done in a
spreadsheet, such as CalibrationLinear.xls.
In some cases the thing measured can not be detected
directly but must undergo a chemical reaction that makes it
measurable; in that case the exact same reaction must be
carried out on all the standard solutions and unknown sample
solutions, as demonstrated
in
this animation (thanks to Cecilia Yu of Wellesley
College).
Various calibration methods are used to compensate for
problems such as random errors in standard preparation or
instrument readings, interferences,
drift,
and non-linearity
in the relationship between concentration and instrument
reading. For example, the standard
addition
calibration technique can be used to compensate for multiplicative
interferences. I have prepared a series of
“fill-in-the-blanks” spreadsheet
templates for various calibrations methods, with instructions,
as well as a series of spreadsheet-based
simulations of the error
propagation in widely-used analytical calibration
methods, including a step-by-step
exercise.
Calibration
and
signal processing.
Signal processing often intersects with calibration. For
example, if you use smoothing or filtering to reduce noise, or differentiation to
reduce the effect of background, or measure peak area to reduce the effect of peak
broadening, or use modulation to reduce the effect of
low-frequency drift, then you must
use the exact same
signal processing for both the standard samples and the
unknowns, because the choice of signal processing technique
can have a big impact on the magnitude and even on the units
of the resulting
processed signal (as for example in the derivative
technique and in choosing between peak height and peak area).
PeakCalibrationCurve.m
is an Matlab/Octave example of this. This script simulates
the calibration of a flow
injection system that produces signal peaks that are
related to an underlying concentration or amplitude ('amp').
In this example, six known standards are measured
sequentially, resulting in six separate peaks in the
observed signal. (We assume that the detector signal is
linearly proportional to the concentration at any instant).
To simulate a more realistic measurement, the script adds
four sources of "disturbance" to the observed signal:
a. noise - random white noise added to all the signal data points, controlled by the variable "Noise";
b. background - broad curved background of random amplitude, tilt, and curvature, controlled by "background";
c. broadening - exponential peak broadening that varies randomly from peak to peak, controlled by "broadening";
d. a final smoothing before the peaks are measured, controlled by "FinalSmooth".
The
script uses measurepeaks.m as
an internal function to determine the absolute peak height,
peak-valley difference, perpendicular drop area, and tangent
skim area. It plots separate calibration curve for each of
these measures in figure windows 2-5 against the true
underlying amplitudes (in the vector "amp"), fitting the
data to a straight line and computing the slope, intercept,
and R2. (If the detector response were non-linear, a
quadratic or cubic least-square would work better). The
slope and intercept of the best-fit line is different for
the different methods, but if the R2 is close to 1.000, a
successful measurement can be made. (If all the random
disturbances are set to zero in lines 33-36, the R2 values
will all be 1.000. Otherwise the measurements will not be
perfect and some methods will result in better measurements
- R2 closer to 1.000 - than others). Here is a typical
result:
Peak Position PeakMax Peak-val. Perp drop Tan
skim
1 101.56
1.7151 0.72679 55.827
11.336
2 202.08
2.1775 1.2555
66.521 21.425
3 300.7
2.9248 2.0999
58.455 29.792
4 400.2
3.5912 2.949
66.291 41.264
5 499.98
4.2366 3.7884
68.925 52.459
6 601.07
4.415 4.0797
75.255 61.762
R2 values: 0.9809 0.98615 0.7156
0.99824
In this case, the tangent skim method works best, giving a
linear calibration curve (shown on the left) with the
highest R2.
In this type of application, the peak heights and/or area
measurements do not actually have to be accurate,
but they must be precise.
That's because the objective of an analytical method such as
flow injection or chromatography is not
to measure peak heights
and areas,
but rather to measure concentrations,
which is why calibration curves are used. Figure 6 shows the
correlation between the measured tangent skim areas and the
actual true areas under the peaks in the signal shown above,
right; the slope of this plot shows that the tangent skim
areas are actually about 6% lower that the true areas, but
that does not make a difference in this case because the
standards and the unknown samples are measured the same way.
In some other application,
you may actually need to measure the peak heights and/or
areas accurately, in which case curve fitting is generally
the best way to go.
If the peaks partly overlap, the measured peak heights and
areas may be effected. To reduce the problem, it may be
possible to reduce the overlap by using peak sharpening methods,
for example the derivative
method, deconvolution or
the power
transform method, as demonstrated by the
self-contained Matlab/Octave function PowerTransformCalibrationCurve.m.
Curve fitting the signal
data. Ordinary in curve fitting, such as the classical least squares
(CLS) method and in iterative
nonlinear least-squares, the selection of a model
shape is very important. But in quantitative analysis applications of curve
fitting, where the peak height or area measured by curve
fitting is used only to determine the concentration of
the substance that created the peak by constructing a calibration
curve, having the exact model shape is surprisingly uncritical. The Matlab/Octave script PeakShapeAnalyticalCurve.m shows that, for a single isolated peak whose shape is
constant and independent of concentration, if the wrong
model shape is used, the peak heights measured by curve fitting
will be inaccurate, but that error will be exactly the same for the unknown samples and
the known calibration standards, so the error will
“cancel out” and the measured concentrations will still
be accurate, provided you use the same inaccurate model for both
the known standards and the unknown samples. In
the
example shown on the right, the peak shape of the actual
peak is Gaussian (blue dots) but the model used to fit
the data is Lorentzian (red line). That's an intentionally
bad fit to the signal data; the R2 value for the
fit to the signal data is only 0.962 (a poor fit by the
standards of measurement science). The result of this is
that the slope of the calibration curve
(shown below on the left) is greater than expected; it should have been 10 (because that's the
value of the “sensitivity” in line 18), but it's
actually 10.867 in the figure on the left, but
nevertheless the calibration curve is still
linear and its R2 value is 1.000,
meaning that the analysis should be accurate. (Note that
curve fitting is actually applied twice
in this type of
application, once using iterative curve fitting to fit
the signal data, and then again using polynomial curve fitting
to fit the calibration data).
Despite all this, it's still better to use as
accurate a model peak shape as possible for the signal
data, because the percent fitting error of the signal fit can be used as a warning that something
unexpected is wrong, such as the appearance of an
interfering peak from a foreign substance.
Computations
carried
out by computer software with non-integer numbers have a
natural limit to the precision with which they can be
represented; for example, the number 1/3 is represented as
0.3333333... with a large but finite number of “3”s, whereas
theoretically there are an infinite string of “3”s
in the decimal representation of 1/3. It's the same with
irrational numbers such as "pi" and the square root of 2;
they can never have a exact decimal representation. In
principle, these tiny errors could accumulate in a very
complex multiple-step calculation and could conceivably become
a significant source of error. In the vast majority of
applications to scientific computation, however, these
limits will be minuscule compared to the errors and random
noise that is already present in most real-world
measurements. But it is best to know what those numerical
limits are, under what circumstances they might occur, and
how to minimize them.
Multicomponent
spectroscopy. Probably the most common calculation
where numerical precision is an issue is in the matrix
methods that are used in multicomponent
spectroscopy. In the derivation of the Classical Least Squares
(CLS) method, the matrix
inverse is used to solve large systems of
linear equations. The matrix inverse is a standard function
in programming languages such as Matlab, Octave,
Wolfram's Mathematica,
and in spreadsheets. But if you use that function in Matlab,
the function name (“inv”) is automatically
flagged
by the editor with the following warning:
“For solving a system of linear equations, the inverse of a matrix is primarily of theoretical value. Never use the inverse of a matrix to solve a linear system Ax=b with x=inv(A)*b, because it is slow and inaccurate.... Instead of multiplying by the inverse, use matrix right division (/) or matrix left division (\). That is: Replace inv(A)*b with A\b...[and]...replace b*inv(A) with b/A”
"Slow and inaccurate"?
OK, now I'm scared. But how serious a problem is this really
in actual applications? To answer that question, the
Matlab/Octave script RegressionNumericalPrecisionTest.m
applies the CLS method to a mixture of two very
closely-spaced noiseless overlapping
Gaussian peaks (blue and green lines in the figure on the
left) using three different mathematical formulations of the
least-squares calculation that give different results. The
difficulty of such a measurement depends on the ratio of the
peak separation to the peak half-width; small ratios mean
very highly overlapped peaks which are hard to measure
accurately. In this example the separation-to-width ratio is
0.0033, which is very small (i.e. difficult); this is
equivalent to trying to measure a mixture of two absorption
spectroscopy peaks that are 300 nm wide and separated
by only 1 nm, a
tiny difference that you
wouldn't even notice with the naked eye. The
results of this script shows that the matrix inverse ("inv")
method does indeed have an error
thousands of times larger than the method using matrix
division, but even the matrix division
error is still very small. Practically, the
difference between these methods is unlikely to be
significant when applied to real experimental data, because
even the tiniest bit of signal instability (like that
caused by small changes in the temperature of the sample or
random noise in the signal, which
you can simulate in line 15) produces a far greater error.
So basically that warning message is the voice of a
mathematician or computer programmer, not an experimental
scientist.
Analog-to-digital
resolution. Potentially
more
significant than the computer's numerical resolution is the
resolution of the analog-to-digital
converter (ADC)
that is used to convert analog signals (e.g. voltage) to a
number. The Matlab/Octave script RegressionADCbitsTest.m
demonstrates this, with two slightly overlapping Gaussian
bands with a large (50-fold)
difference
in peak height (blue
and
green lines in the figure on the right); in this case the
separation to width ratio is 0.25, much larger (i.e. easier)
than the previous example. For this example, the simulation
shows that the relative percent errors of peak height
measurement is 0.19% for the larger peak and 6.6% for the
smaller peak. You can change the resolution of the simulated
analog-to-digital converter in number
of
bits (line 9). The amplitude
resolution of an analog-to-digital converter is 2 raised to
the power of the number of bits. Common
ADC resolutions are 10, 12, and 14 bits, corresponding to
resolutions of one part in 1024, 4096 and 16384,
respectively. Of course, the effective resolution for the smaller
peak in this case is 50
times less, and you can't simply turn up the amplification
on the smaller peak without overloading the ADC for the
larger one.
Surprisingly, if most of the noise in the signal is
this kind of digitization noise, it may actually help to add
some additional random noise (specified in line 10 in this
script), as was seen in appendix I.
Differentiation. Another application where you
can see numerical precision noise is in differentiation,
which involves the subtraction of very nearly equal adjacent
numbers in a data series. The self-contained Matlab/Octave
function DerivativeNumericalPrecisionDemo.m
shows how the numerical precision limits of the computer
effects the first through fourth derivatives of a Gaussian
band that is very finely
sampled (over 16,000 points in the half-width in this
case) and that has no added
noise. The plot on the left shows the four waveforms in
Figure 1, left, and their frequency spectra in Figure 2,
below. The numerical precision limits of the computer
creates random noise at very high frequencies, which are
emphasized by differentiation. In the frequency spectra
below, the big
low-frequency bump near a frequency of 10^{-2} is the signal and everything above
that is numerical noise. The
lower-order derivatives are seldom a problem, but by the
time you reach the fourth derivative, those noise
frequencies approach the strength of the signal frequencies,
as you can see in the frequency spectrum of the fourth
derivative in the lower right. But
this noise is only a very high-frequency noise, so smoothing
with as little as a 3-point sliding average smooth removes
most of it (click to
view).
Smoothing. Finally, there might
potentially be a numerical problem with the fastsmooth
algorithm, covered in the section on smoothing, because it is a recursive
algorithm that uses the results of a previous step in
the calculation to calculate the next step. The
Matlab/Octave script FastsmoothNumericalPrecisionTest.m
demonstrates the numerical precision of fastsmooth.m. Even
for 4000-point Gaussian smooth applied to a 100,000 point
signal, the numerical noise relative standard deviation is
only .00027%, and most of that occurs in the edges of the
signal (first 4000 and last 4000 points); the error over 90%
of the signal is orders of magnitude less. This
would be a negligible problem in most cases.
Appendix W: The Raspberry Pi (added March, 2017)
Signal processing does not necessarily require expensive
computer systems. The Raspberry Pi is a remarkably tiny and
inexpensive computer board that is about
the size of a deck of playing cards and costs
$38! Version
3 B+ in early 2018 has a 1.4GHz 64-bit quad-core ARMv8 CPU with 1GB
RAM, 4 USB ports, 40 general-purpose input-output pins, HDMI
port, 300 mbps Ethernet port, audio jack and composite video,
video camera and display interfaces, micro SD card slot for mass
storage, VideoCore IV 3D graphics core, 802.11ac Wireless LAN,
and Bluetooth 4.2. You can get it with a bunch of installed
software, including a version of the Linux operating system, a
simple but effective graphical desktop modeled on Windows, a Web
browser, the complete LibreOffice
suite, Wolfram's Mathematica (screen shot), several
programming languages, a bunch of games (including Minecraft), and
various utilities. All
of these are installed by default on the Raspberry Pi's operating
system installer. (There is even a smaller and cheaper
model called the Zero
that costs just $5 for the card itself or $10 with memory
card and power supply; it has less memory and smaller connectors
that the other models, but because of its
low cost and small size, this model is ideal in situations where
it might be damaged or lost, as in rocket or balloon borne
experiments).
All you need for a complete computer is a 5 volt, 2
amp power supply, a USB keyboard and mouse (which you can
probably pick up at a junk shop), a TV/monitor with an HDMI
input, and a mini SD card (8 to 16 Gbytes) for mass storage (you
can buy this card with all the software already installed or a
blank one to which you can download the software yourself). In
fact, if you already have a Wi-Fi network and an
Internet-connected computer, tablet, or smartphone, you don't
need a separate monitor, keyboard and mouse: you can log onto
the Raspberry Pi via your Wi-Fi network or over the Internet,
using Putty (for
command-line UNIX-style access) or a graphical desktop sharing
system such as RealVNC
(free for Windows, Mac, IOS, and Android), which reproduces the
entire graphical desktop on your local device, complete with a
pop-up virtual keyboard. It can also share
files
with Windows. The Pi has been used as a low-cost
alternative for school computer labs, using its included
software for both Office-type applications (Writer word processor, Calc
spreadsheet, etc), and for programming
instruction (Python, C,
C++, Java, Scratch, and Ruby). It's also ideal for “headless”
applications where it is only accessed remotely via WiFi or
Bluetooth, for example as a network
file
server, weather
station, media
center or as a networked security
camera.
For scientific data acquisition and signal
processing
applications, the Pi version of Linux has all the "usual"
UNIX
terminal commands for data gathering, searching, cleaning
and summarizing. In addition, there are many add-on libraries
for Python,
including SciPi, NumPy, and Matplotlib, all of which are
free downloads. Allen B. Downey's 164-page PDF book "Think
DSP" has many examples of Python code in traditional
engineering applications. There are many add-on hardware devices
available at low cost, including video
cameras and a piggyback sensor board that reads and
displays sensor data from several built-in sensors:
gyroscope, accelerometer, magnetometer, barometer, temperature,
relative humidity. (It's based on the same
hardware
that is currently in orbit on the International Space Station).
My signal
processing
spreadsheets run just fine on the version of Calc that comes with the
Pi, as do the Calibration
worksheets and my analytical
method
models (screen
shot).
For school applications, Element14 markets a
Learn
to Program Pack Starter Kit ($177) that includes a license
for student
version of Matlab for Windows or Macintosh and a Raspberry
Pi 3 with MicroSD card, power supply, and enclosure (Matlab does
not currently run directly on the Pi but can communicate with
it). Even cheaper, Octave 3.6 can
run directly on a Raspberry Pi; the screen below shows
Octave 3.6 running within the Pi's built-in graphical user
interface (showing off the 3D graphic functions "mesh" and
"surf").
There are many laboratory
and
field applications, especially in combination with an Arduino
micro-controller. However, the slowness
of
Octave (compared to Matlab), combined with the modest
speed of the Raspberry Pi 3 may be limiting in some applications. (Altogether
it's about 50 - 1000 times
slower than Matlab on a contemporary desktop computer).
But it's still fast enough for many applications. Alternatively,
it's possible to communicate with Raspberry Pi hardware remotely
from a faster computer running MATLAB using the MATLAB
Support
Package for Raspberry Pi Hardware for Matlab R2016b or
later, using one or more remotely accessed Raspberry Pi's for
experiment control and data acquisition and local storage and
doing the heavy-duty number crunching on the main computer. Or
you could simply have the Pi save data or results in a shared
folder that is accessed via WiFi from another computer.
Python is the primary programming language that comes with the
Raspberry Pi. (This language is quite different from the older
languages traditionally used by scientists, such as Fortran or
Pascal, and it can be confusing for people without a
computer-science background, like me). Here's a simple example of real-time data acquisition and
plotting on a Raspberry Pi, using the commercially
available add-on Sense
Hat board with a program written in
Python, measuring temperature as a function of time (real-time animation). If you don't
have a Sense Hat, here's a version
of the same Python program that plots the running
average of a random number between 0 and 200 using the
same autoscaling graphic technique, displaying a result that
gradually settles down closer and closer to the average of 100
the longer you let it run. (This one will also run in WinPyton on
a plain Windows PC). Compare this to the Matlab/Octave script that does the
same thing. The execution time of the WinPython and Matlab
versions is essentially the same, but the Matlab/Octave script
is substantially shorter. These two programs constitute a nice
comparison of Matlab vs Python, illustrating the advantages and
disadvantages of each.
Other
competing small-scale systems include the LattePanda,
a tiny $130 Windows-10 computer board with 2 Gbytes RAM and 32
Gbytes flash storage. Many
other
similar products are available. See also this review
of single-board computers.
Appendix
X:
Batch processing (added June, 2017)
In situations where you have a large
volume of similar types of data to process, it's useful to
automate the process. Let's assume that you have already
acquired data in the form of a large number of text or numerical
data files of some standardized format that are stored in a
known directory (folder) somewhere on your computer. For example
they might be ASCII .txt or .csv files with the independent
variable ('x') in the first column and one or more dependent
variables ('y') in the other columns. There may be a variable
number of data files, and their file names and length may be
unknown and variable, but the data format is consistent from
file to file. You could write a Matlab script or function that
will process those files one-by-one, but you want the
computer to go through all the data files in that
directory automatically,
determine their file names, load each into the variable
workspace, apply the desired processing operations (peak
detection, deconvolution, curve fitting, whatever), and then
collect all the resulting terminal window output, each labeled
with the file name, add the results to a growing "diary" file,
and then go on to the next data file. Ideally, the program
should not stop if it encounters any kind of fatal error with
one of the data files; rather, it should just skip that one
and go on to the next.
BatchProcess.m is a
Matlab/Octave example of just such an automated process that you
can use as a framework for your applications. You need only
change three things to make this work for you:
(a) the directory name where the data are stored on your computer - (DataDirectory) in line 11;
(b) the directory name where the Matlab signal processing functions are stored on your computer
- (FunctionsDirectory) in line 12; and
(c) the actual processing functions that you wish to apply to each file (which in this example perform
peak fitting using the “peakfit.m” function in lines 34 – 41, but could be anything).
When it starts, the routine opens a “diary” file in line 21, located in the FunctionsDirectory, with the file name “BatchProcess<date>.txt” (where <date> is the current date, e.g. 12-Jun-2017). This file captures all the terminal window output during processing - in this example, I am using the peakfit.m function that generates a FitResults matrix (with Peak#, Position, Height, Width, and Area of the best-fit model), and the percent fitting error and R2 value ,for each data file in that directory. (Subsequent runs of the program on the same date are appended to this dairy file. On each subsequent day, a new file is begun for that day). You can also optionally save some of the variables in the workspace to data files; add a “save” function after the processing and before the “catch me” statement (type “help save” at the command prompt for options).
This program uses a couple of coding techniques
that are especially useful in automated file processing. It uses
the “function forms” of the commands “ls” (line 13), “diary”
(line 21), and “load” (line 29) to allow then to accept
variables computed within the program. It also uses the “try/catch/end” structure (lines 28, 47,
49), which prevents the program from stopping if it encounters
an error on one of the data files; rather, it reports the error
for that file and skips to the next one. Here's the general form
of that structure:
TRY
% The statements between 'TRY' and 'CATCH ME' will be attempted
% and skipped if they generate any sort of error.
CATCH ME
% If an error is generated, the statements between 'CATCH ME'
% and 'END' will be executed instead.
END
After running this script, the
“BatchProcess...” diary file will contain all the terminal
output. Here's an except from a typical diary file, in which the
first two data files in the directory yielded errors that
were caught and skipped by the “try/catch/end”
structure , but the third one ("2016-08-05-RSCT-2144.txt") and all the following ones
worked normally and reported the results of the peak fitting
operations:
Error with file number 1.
Error with file number 2.
3: 2016-08-05-RSCT-2144.txt
Peak#
Position Height
Width Area
1 6594.2
0.1711 0.74403 0.13551
2 6595.1
0.16178 0.60463 0.1041
% fitting error R2
2.5735 0.99483
4: 2016-09-05-RSCT-2146.txt
Peak# Position Height
Width Area
1 6594.7
0.11078 1.4432 0.17017
2 6595.6 0.04243
0.38252 0.01727
% fitting error R2
4.5342 0.98182
5: 2016-09-09-RSCT-2146.txt
Peak# Position Height
Width Area
2 6594 0.05366
0.5515 0.0315
1 6594.9 0.1068
1.2622 0.1435
% fitting error R2
3.709 0.98743
6: .... etc....
You could also optionally import the dairy file
into Excel by opening an Excel worksheet, click on a cell, click
Data > From Text, select the diary file, click to specify that spaces
are to be used as column separators, and click Import. This will put all
the collected terminal output into that spreadsheet.
Additionally you might want to save the workspace variables
(e.g. as a .mat file).
(a) by using mouse-clicks to generate each data point, using Matlab's "ginput" function, orThe first method is illustrated by the simple script realtime.m. When you run this script, it displays a graphical coordinate system; position your mouse pointer along the y (vertical) axis and click to enter data points as you move the mouse pointer up and down. The "ginput" function waits for each click of the mouse button, then the program records the y coordinate position and counts the number of clicks. Data points are assigned to the vector y (line 17), plotted on the graph as black points (line 18), and print out in the command window (line 19). The script realtimeplotautoscale.m is an expanded version that changes the graph scale as the data come in. If the number of data points exceeds 20 ('maxdisplay'), the x axis maximum is re-scaled to twice that (line 32). If the data amplitude equals or exceeds ('maxy'), the y axis is re-scaled to 1.1 times the data amplitude (line 36).
(b) by pre-calculating simulated data and then accessing it point-by-point in a loop.
(1) Define the range of values, either directly or using the indirect function, e.g. INDIRECT("B"&F2&":B"&F3)Each step references the results of the one before it. These steps are illustrated in the same “SpecialFunctions.xlsx” spreadsheet in column H, rows 20-23. The result is that the maximum y (21.5) occurs at x=28. These steps can even be combined into one long formula (cell H23), although this is harder to read, and harder to document, than the formulas for the separate steps. The peak finder spreadsheet uses this technique.
(2) Determine the maximum y value in that range with the MAX function, e.g.: MAX(INDIRECT("B"&F2&":B"&F3))
(3) Determine the row number in which that number appears with the MATCH function, e.g.: MATCH(H20,B7:B1000,0)+ROW(A6)
(4) Determine the value of x in that row with the INDIRECT function, e.g.: INDIRECT("A"&H21)
The
Law of Large Numbers is
a theorem that describes large collections of numbers or
observations that are subject to independent and identically
distributed random variation, such as the result of performing
the same experiment a large number of times. The average of
the results obtained from a large number of trials should be
close to the expected value, and will tend to become closer as
more trials are performed. It is an important idea because it
guarantees stable long-term results for the averages of some
random events. This is why gambling casinos are able to make
money; their games are designed to give the casino a small
advantage in the long run but highly variable results in the
short term, guaranteeing plenty of (noisy) winners, which
encourages the gamblers, but even a greater number of (quiet)
losers. And that’s why investors in the stock market often
make money in the long run, despite the unpredictable
day-to-day variation, up one day and down the next, and why
it’s so hard to see climate change in the much wilder
short-term hot and cold day-to-day and year-to-year swings in
the weather. Short term is closer and easier;
long term is harder to see, but it's where it's going
eventually.
The uncertainty of uncertainty. The situation
is even worse if you wish to estimate the standard
deviation of a population from small samples. The Matlab
script RunningStandardDeviation.m simulates this for the same population in
the
There is a
well-documented tendency for people to overestimate the
quality of small numbers of observations, sometimes referred
to as hasty
generalization, or insensitivity
to
sample size, or the gambler’s
fallacy. This is
related to the field of study of a famous pair of
psychologists named Amos Tversky and Daniel Kahneman, who
collaborated in a long-running study of human cognitive biases
in the 1970s. They formulated a hypothesis that people tend to
believe in a false “Law
of
Small Numbers”, the
name they coined for the mistaken belief that a small sample
drawn from a large population is representative of that large
population. We’d like to believe that scientists are immune to
these foibles and that they always think logically and
correctly. But scientists are only human, so it's important to
be aware of this tendency, particularly when a small sample of
data supports your favorite hypothesis. It’s tempting to stop
there, “while you’re ahead”. This is called "confirmation
bias". Don't do it.
Of course in many
practical experimental measurements, you may really be
constrained to a rather small number of repeated measurements.
There may be a fixed number of data points and no possibility
of gathering more. Or the cost, in money or in time, of
gathering more data may be excessive, even in a laboratory
environment. For example, the process of calibrating
an analytical instrument for quantitative measurement
may involve the preparation and measurement of several
standard samples or solutions of known composition. If the
calibration curve (the relationship between instrument reading
and sample composition) is non-linear, it takes several
different standards to define the curve. You have to consider
not only cost of preparing many standards but also the cost of
cleaning up and safely storing or disposing of the
(potentially hazardous) chemicals afterwards. The bottom line
is, if you are limited to a small number of data points, do
not over-represent the precision of your results. To use the 3-sigma
rule to determine uncertainty ranges for a set of data,
the distribution must be normal (Gaussian) and you need to
know the standard deviation. The problem is that, for small
sets of data, both are uncertain.
AD.
Spectroscopy and chromatography combined: time-resolved
Classical Least Squares
The introduction of high-speed UV-Visible array detectors into high performance liquid chromatography (HPLC) instruments has significantly increased the power of that method. The speed of such detectors is such that they can acquire a complete spectrum multiple times per second over the entire chromatogram. An example of this is described in a technical report from Shimadazu Scientific Instruments (https://solutions.shimadzu.co.jp/an/n/en/hplc/jpl217011.pdf) which considers the separation of three positional isomers of methyl acetophenone: o-methyl (o-MAP), m-methyl a(m-MAP), and p-methyl (p-MAP). The ultraviolet absorption spectra of these three isomers at a concentration of 400 μg/mL each is shown below on the left, and the chromatographic separation, using the column and conditions specified in their report, are shown in the middle. The report goes on to describe their commercial software, which uses a complex iterative approach to extract the spectra and the chromatographic characteristics from the raw data.
CLSPercentErrors = 0.0021993 0.0020162 0.0015607Here I present a comparatively
simple non-iterative technique based on the same chemical
system, in which we consider each spectrum acquired by the
detector as a separate sample mixture and apply the Classic Least Squares method previous discussed, in which
the spectra of the components are known beforehand and
where adherence to the Beer-Lambert Law
is assumed. The spectra and
chromatographic peaks are simulated digitally in the
Matlab/Octave script TimeResolvedCLS.m,
shown in the figure above,
by modeling the spectrum of each component as a the sum
of three Gaussian peaks and the chromatographic peaks as
exponentially modified Gaussians. To make this simulation as
realistic as possible, the parameters were adjusted to match the
graphics in the Shimadazu technical report closely,
and the spectral resolution, sampling rate, and detector noise,
were based on that report also. You can see that the
chromatographic peaks (middle figure) are nowhere near baseline
resolved. Therefore it is to be expected that quantitative calibration based on the
measurement of peak areas in this chromatogram (for
example by the perpendicular drop method), might be inaccurate,
especially if the peak heights are very different. In fact in
this particular case, where the concentrations of the three
components are equal (0.05 μg/mL for each component), the peak
areas measured by perpendicular drop are about 2% from the true
values, mainly due to the slight asymmetry and unequal height of
the three peaks. The spectra (left-hand
figure) are even more highly overlapped than the chromatographic
peaks, but they are distinct in shape, and that is the key.
Basically, we treat this as a series of
3-component CLS calculations, one for each time slice of the
detector. The actual
calculations can be done in two ways, depending on whether the
spectra are processed one by one or are collected for the entire
chromatogram and then processed all at once, using either
"Alternative calculation #1", lines 113-146, or "Alternative
calculation #2", lines 150-170. The first method, shown on the
left, looks like chromatography as it executes; it computes the
chromatographic peaks of the three components point by point as
they evolve in time and plots them in the first three quadrants
of figure window 3 (on the right). The second method calculates
the entire chromatogram in one step at the end and makes the
same final plots. (The
second method is faster computationally, but usually
it's the chromatography that's the rate determining step, not
the calculations). Either way, the result is the same; the
chromatographic peaks of
the three components are completely separated
mathematically, so their areas are easily measured, no matter how much they overlap. Note that,
although the three spectra must be known, no knowledge of the
chromatography peaks is required; they emerge separate and
intact from the data, purely computationally.
In order to test the abilities and limitations of
this method, I have prepared a series of increasingly
challenging scenarios, starting with the one pictured above and
becoming progressively more difficult by making the
chromatographic peaks more closely spaced, making the peak more
asymmetrical, making the spectra more similar, and making the
concentrations unequal. These scenarios are listed in the table below,
along with the typical percent errors in peak area measurement
by the CLS method and links to the corresponding graphics and
Matlab/Octave m-files. Each is a more challenging variation on the first one; #2
has much more chromatographic peak overlap; #3 has much more asymmetrical peaks (higher tau);
#4 has much more similar spectra - in fact,
the peak wavelengths differ by only 0.1 nm, making them look
identical; in #5, component 2 (the middle peak) has a
concentration 100 times lower; and #6 is the same as #5 except
that the peaks are highly asymmetrical. In all of these cases,
the normal perpendicular drop area measurement technique is
either impossible (because there are no distinct peaks for each
component) or are very much in error, but the CLS techniques
works well, giving very low errors except when the middle peak
concentration is 0.0001, which approaches the random noise limit
of the detector. (Another variation, TimeResolvedCLSbaseline.m,
includes the correction for baseline shifts.)
Peak resolution |
Spectral similarity |
Peak asymmetry |
Concentration ratios |
% errors in area
measurement |
Links |
1. Normal |
Normal |
Slight: tau=10 |
.05 .05 .05 |
0.0022% 0.002%
0.0016% |
Graphic
m file |
2. Unresolved |
Normal | Slight: tau=10 | .01 .01 .01 |
-0.06%
-0.053% -0.041% |
Graphic
m file |
3. Partly resolved |
Normal | Great: tau=40 |
.05 .05 .05 |
-0.0004% -0.013%
-0.066% |
Graphic m file |
4. Unresolved | Almost complete |
Slight: tau=10 | .01 .01 .01 |
0.054%
0.049% 0.04% |
Graphic m file |
5. Unresolved | Almost complete | Slight: tau=10 |
.01 .0001 .01 |
0.026%
2.4% 0.019% |
Graphic m file |
6. Unresolved | Almost complete | Great: tau=40 | .01 .0001 .01 | -0.04%
-3.8% -0.03% |
Graphic m file |
Even when the peaks are resolved well enough for the perpendicular drop method to work, it can suffer from interaction between adjacent peak heights; that is, a change in the peak height of one peak can affect the measurement of the area of adjacent overlapped peaks, because of shifts in the valley point between them. This is illustrated by TimeResolvedCLScalibration.m, which simulates the measurement of 10 different three-component mixtures similar to the above (but modified so perpendicular drop measurement is possible), where the concentrations vary independently and randomly over a 1 x 10^{-4 }to 9.5 x 10^{-4} μg/mL range, and then plots measured peak area vs concentration for each component. (Each time you run this, you'll get a different mix of concentrations). Linear least-squares fits of peak area vs concentration are calculated, as shown below. In this typical example, the average absolute percentage error in area measurement for the perpendicular drop method is about 5%, with an R2 of 0.995, and for the CLS measurement is less than 1%, with an R2 of 0.9995. (Even if the detector noise (line 22) is set to zero in this simulation, the errors in the perpendicular drop method remain, because they are caused by overlap between adjacent peaks, rather than by noise).
Though clearly the CLS method is
very effective, all of this really only proves that
the mathematics works well; the method still has the
serious limitation that it requires that the spectra of all the
components be known accurately. This requirement can be met in
some applications, but in liquid chromatography there is a
potential pitfall. If gradient elution and/or temperature
programming are used, and if the spectra of those
chemical compounds are sensitive to the solvent and/or to
temperature, for example shifting their peaks slightly, then
there will likely be additional errors in the CLS procedure.
Obviously this depends on the particular chemical system and
will have to be evaluated on a case-by-case basis. But this
suggests another use for this method: speeding up a
chromatographic method that normally would achieve complete
baseline separation (from which accurate spectra of each
component could be obtained in situ at the right solvent
composition and temperature, and then adjusting the column
and/or flow rate to achieve a faster but incompletely resolved
chromatogram, to which the CLS method could be applied
accurately.
In other applications, some or all of the
components may simply be unknown and you may want to obtain
their spectra. This would be easy if the chromatography were
able to separate each component cleanly, but what if the peaks
are overlapped so that pure component spectra are never
achieved? In that case are more sophisticated method must be
used, such as the one described in the Shimadzu technical
report. This involves making initial estimates of
spectral and chromatographic peaks, followed by an iterative
search for the best fit to the experimental data, subject
to the imposition of some important constraints based on
physical reality, such as non-negativity of spectra and
chromatography peaks (the peaks are always positive, except for
random noise on the baseline), and the unimodality of
the chromatography peaks, (that is, each
component gives one and only one chromatography peak). Methods
of this type will be left to a future expansion of this essay.
[FitResults,GOF]=peakfit([x y],0,0,1,3) for a single Logistic model (shape 3)
[FitResults,GOF]=peakfit([x y],0,0,4,39,1) for a 4 exponentially broadened Gaussian model
(shape 39)
This page is part of "A
Pragmatic Introduction to Signal Processing", created
and maintained by Prof.
Tom O'Haver , Department of Chemistry and Biochemistry, The
University of Maryland at College Park. Comments, suggestions and
questions should be directed to Prof. O'Haver at toh@umd.edu.
Updated November, 2019.
Unique visits to this site since
June 22, 2008: