[Introduction]  [Signal arithmetic]  [Signals and noise]   [Smoothing]   [Differentiation]  [Peak Sharpening]  [Harmonic analysis]   [Fourier convolution]  [Fourier deconvolution]  [Fourier filter]  [Wavelets]   [Peak area measurement]  [Linear Least Squares]  [Multicomponent Spectroscopy]  [Iterative Curve Fitting]  [Hyperlinear quantitative absorption spectrophotometry] [Appendix and Case Studies]  [Peak Finding and Measurement]  [iPeak]   [iSignal]  [Peak Fitters]   [iFilter]  [iPower]  [List of downloadable software]  [Interactive tools]

### Appendix: Case Studies, Simulations, and Frequently Asked Questions.

[S. Measuring the Signal-to-Noise Ratio of Complex Signals
[T. Dealing with wide ranging signals]
[U. Measurement Calibration]
[V. Numerical precision limits of computer hardware and software]
[W. Low-cost computer hardware: The Raspberry Pi
[Y: Real-time signal processing]
[AA: Computer simulation of signals and instruments] (added December 2018)
[AB: Who uses this web site and associated software?] (added December 2018)
[AC. The Law of Large Numbers(added March 2019)

[AE. The Mystery Peak(added  November 2019)

### A. Can smoothed noise may be mistaken for an actual signal?

Here are two examples that show that the answer to this question is yes. The first example is shown on the left. This shows iSignal displaying a computer-generated 4000-point signal consisting of pure random noise, smoothed with a 19-point Gaussian smooth. The upper window shows a tiny slice of this signal that looks like a Gaussian peak with a calculated SNR over 1000. Only by looking at the entire signal (bottom window) do you see the true picture; that "peak" is just part of the noise, smoothed to look nice. Don't fool yourself.

The second example is a simple series of three Matlab commands that uses the 'randn' function to generate a 10000-point data set containing only normally-distributed white noise. Then it uses 'fastmooth' to smooth that noise, resulting in a 'signal' with a standard deviation of about 0.3 and a maximum value around 1.0. That signal is then submitted to iPeak. If the peak detection criteria (e.g. AmpThreshold and SmoothWidth) are set too low, many peaks will be found. But setting the AmpThreshold to 3 times the standard deviation (3 x 0.3 = 0.9) will greatly reduce the incidence of these false peaks.

>> noise=randn(1,10000);
>> signal=fastsmooth(noise,13);
>> ipeak([1:10000;signal],0,0.6,1e-006,17,17)

The peak identification function, which identifies peaks based on their exact x-axis peak position and a stored table of identified peak positions, is even less likely to be fooled by random noise, because in addition to the peak detection criteria of the findpeaks algorithm, any detected peak must also match closely to a peak position in the table of known peaks.

### B. Signal or Noise?

This experimental signal in this case study was unusual because it did not look like a typical signal when plotted; in fact, it looked a lot like noise at first glance. The figure below compares the raw signal (bottom) with the same number of points of normally-distributed white noise (top) with a mean of zero and a standard deviation of 1.0 (obtained from the Matlab/Octave 'randn' function).

As you can see, the main difference is that the signal has more large 'spikes', especially in the positive direction.  This difference is evident when you look at the descriptive statistics of the signal and the randn function:

 DESCRIPTIVE STATISTICS Raw signal random noise (randn function) Mean 0.4 0 Maximum 38 about 5 - 6 Standard Deviation (STD) 1.05 1.0 Inter-Quartile Range (IQR) 1.04 1.3489 Kurtosis 38 3 Skewness 1.64 0

You can see that the standard deviations of these two are nearly the same, but the other statistics (especially the kurtosis and skewness) indicate that the probability distribution of the signal is far from normal; there are far more positive spikes in the signal than expected for pure noise. Most of these turned out to be the peaks of interest for this signal; they look like spikes only because the length of the signal (over 1,000,000 points) causes the peaks to be compressed into one screen pixel or less when the entire signal is plotted on the screen. In the figures on the left, iSignal is used to "zoom in" on some of the larger of these peaks (using the cursor arrow keys). The peaks are very sparsely separated (by an average of 1000 half-widths between peaks) and are well above the level of background noise (which has a standard deviation of roughly 0.9 throughout the signal).

The researcher who obtained this signal said that a 'good' peak was 'bell shaped', with an amplitude above 5 and a width of 500-1000 x-axis units. So that means that we can expect the signal-to-background-noise ratio to be at least 5/0.9 = 5.5. You can see in the three example peaks on the left that the peak widths do indeed meet those expectations. The interval between adjacent x-axis points is 25, so that means the we can expect the peaks to have about 20 to 40 points in their widths. Based on that, we can expect that the positions, heights and widths of the peaks should be able to be be measured fairly accurately using least-squares methods (which reduce the uncertainty of measured parameters by about the square root of the number of points used - about a factor of 5 in this case). However, the noise appears to be signal-dependent; the noise on the top of the peaks is distinctly greater than the noise on the baseline. The result is that the actual signal-to-noise (S/N) ratio of peak parameter measurement for the larger peaks will not be as good as might be expected based on the ratio of the peak height to the noise on the background.  Most likely, the total noise in this signal is the sum of two major components, one with a fixed standard deviation of 0.9 and the other roughly equal to 10% of the peak height.

To automate the detection of large numbers of peaks, we can use the findpeaksG or iPeak functions. Reasonable values of the input arguments
AmplitudeThreshold, SlopeThreshold, SmoothWidth, and FitWidth for those functions can be estimated based on the expected peak height (5) and width (20 to 40 data points) of the "good" peaks. For example, using AmplitudeThreshold=5, SlopeThreshold=.001,  SmoothWidth=25, and FitWidth=25, these function detect and measure 76 peaks above an amplitude of 5 and with an average peak width of 523. The interactive iPeak function is especially convenient for exploring the effect of these peak detection parameters and for graphically inspecting the peaks that it finds. Ideally the objective is to find a set of peak detection arguments that detect and accurately measure all the peaks that you would consider 'good' and skip all the 'bad' ones. But in reality the criteria for good and bad peaks is at least partly subjective, so it's usually best to err on the side of caution and avoid skipping 'good' peaks at the risk of including a few 'bad' peaks in the mix, which can be weeded out manually based on unusual position, height, width, or appearance.

Of course, it must be expected that the values of the peak position, height, and width given by the findpeaksG or iPeak functions will only be approximate and will vary depending on the exact setting of the peak detection arguments; the noisier the data, the greater the uncertainty in the peak parameters. In this regard the peak-fitting functions peakfit.m and ipf.m usually give more accurate results, because they make use of all the data across the peak, not just the top of the peak as do findpeaksG and iPeak. For example, compare the results of the peak near x=3035200 measured with iPeak (click to view) and with peakfit (click to view). Also, the peak fitting functions are better for dealing with overlapping peaks and for estimating the uncertainty of the measured peak parameters, using the bootstrap options of those functions. For example, the largest peak in this signal has an x-axis position of 2.8683e+007, height of 32, and width of 500; the bootstrap method determines that the standard deviations are 4, 0.92,  and 9.3, respectively.

Because the signal in the case study was so large (over 1,000,000 points), the interactive programs such as iPeak, iSignal, and ipf may be sluggish in operation, especially if your computer is not fast computationally or graphically. If this is a serious problem, it may be best to break the signal up into two or more segments and deal with each segment separately, then combine the results. Alternatively, you can use the condense function to average the entire signal into a smaller number of points by a factor of 2 or 3 (at the risk of slightly reducing peak heights and increasing peak widths), but then you should reduce
SmoothWidth and FitWidth by the same factor to compensate for the reduced number of data points across the peaks.  Run testcondense.m for a demonstration of the condense function.

C. Buried treasure.

The experimental signal in this case study had a number of narrow spikes above a seemingly flat baseline.

Using iSignal to investigate the signal, it was found that the visible positive spikes were single points of very large amplitude (up to 106), whereas the regions between the spikes were not really flat but contained bell-shaped peaks that were so much smaller (below 103) that they were not even visible on this scale. For example, using iSignal to zoom in to the region around x=26300, you can see one of those bell-shapes peaks with a small single-point negative-going spike artifact near its peak.

Very narrow spikes like this are common artifacts in some experimental signals; they are easy to eliminate by using a median filter. The iSignal function has such a filter, activated by the "M" key. The result shows that the single-point spike artifacts have been eliminated, with little effect on the character of the bell-shaped peak.

Other filter types, like most forms of smoothing, would be far less effective than a median filter for this type of artifact and would distort the peaks.

The negative spikes in this signal turned out to be steep steps, which can either be reduced by using iSignal's slew-rate limit function (the ` key) or manually eliminated by using the semicolon key (;) to set the selected region between the dotted red cursor lines to zero. Using the latter approach, the entire cleaned-up signal is shown below. The remaining peaks are all positive, bell shaped and have amplitudes from about 6 to about 750.

iPeak can automate the measurement of peak positions and heights for the entire signal, using the peak detection settings shown at the bottom of the screen shot below.

If required, individual peaks can be measured more accurately by fitting the whole peak with iPeak's "N" key or with peakfit.m or ipf.m. The peaks are all slightly asymmetrical; they fit an exponentially-broadened Gaussian model to a fitting error less than about 0.5%, as shown on the left. The smooth residual plots suggests that the signal was smoothed before the spikes were introduced and that the noise increases with the signal amplitude (because these is little or noise on the base-line). .

Note that fitting with an exponentially-broadened Gaussian model gives the peak parameters of the Gaussian before broadening. iSignal and iPeak estimate the peak parameters of the broadened peak. As before, the effect of the broadening is to shift the peak position to larger values, reduce the peak height, and increase the peak width.

Position  Height  Width   Area  error  S/N Ratio
isignal     16871   788.88  32.881  27612         172
ipeak       16871   785.34  33.525  28029
peakfit(G)  16871   777.9   33.488  27729  1.68%
peakfit(E)  16863   973.72  27.312  28308  0.47%

G = Gaussian model; E = Exponentially-broadened Gaussian model

### D. The Battle Rounds: a comparison of methods.

This case study demonstrates the application of several techniques described in this paper to the quantitative measurement of a peak that is buried in an unstable background, a situation that can occur in the quantitative analysis applications of various forms of spectroscopy, process monitoring, and remote sensing. The objective is to derive a measure of peak amplitude that varies linearly with the actual peak amplitude but that is not effected by the changes in the background and the random noise. In this example, the peak to be measured is located at a fixed location in the center of the recorded signal, at x=100 and has a fixed shape (Gaussian) and width (30). The background, on the other hand, is highly variable, both in amplitude and in shape. The simulation shows six superimposed recordings of the signal with six different peak amplitudes and with randomly varying background amplitudes and shapes (top row left in the following figures). The methods that are compared here include smoothing, differentiation, classical least squares multicomponent method (CLS), and iterative non-linear curve fitting.

CaseStudyC.m is a self-contained Matlab/Octave demo function that demonstrates this case. To run it, download it. place it in the path, and type "CaseStudyC" at the command prompt. Each time you run it, you'll get the same series of true peak amplitudes (set by the" vector SignalAmplitudes", in line 12) but a different set of background shapes and amplitudes. The background is modeled as a Gaussian peak of randomly varying amplitude, position, and width; you can control the average amplitude of the background by changing the variable "BackgroundAmplitude" and the average change in the background by changing the variable "BackgroundChange".

The five methods compared in the figures below are:

1: Top row center. A simple zero-to-peak measurement of the smoothed signal, which assumes that the background is zero.

2: Top row right. The difference between the peak signal and the average background on both sides of the peak (both smoothed), which assumes that the background is flat.

3: Bottom row left. A derivative-based method, which assumes that the background is very broad compared to the measured peak.

4: Bottom row center. Classical least squares (CLS), which assumes that the background is a peak of known shape, width, and position (the only unknown being the height).

5: Bottom row right. iterative non-linear curve fitting (INLS), which which assumes that the background is a peak of known shape but unknown width and position. This method can track changes in the background peak position and width (within limits), as long as the measured peak and the background shapes are independent of the concentration of the unknown.

These five methods are listed roughly in the order of increasing mathematical and geometrical complexity. They are compared below by plotting the actual peak heights (set by the vector "SignalAmplitudes") vs the measure derived from that method, fitting the data to a straight line, and computing the coefficient of determination, R2., which ideally is 1.0000.

For the first test (shown in the figure above), both "BackgroundAmplitude" and "BackgroundChange" are set to zero, so that only the random noise is present. In that case all the methods work well, with R2.values all very close to 0.9999. With a 10x higher noise level (click to view), all methods still work about equally well, but with a lower coefficient of determination R2, as might be expected.

For the second test (shown in the figure immediately above), "BackgroundAmplitude"=1 and "BackgroundChange"=0, so the background has significant amplitude variation but a fixed shape, position, and width. In that case, the first two methods fail, but the derivative, CLS, and INLS methods work well.

For the third test, shown in the figure above, "BackgroundAmplitude"=1 and "BackgroundChange"=100, so the background varies in position, width, and amplitude (but remains broad compared to the signal). In that case, the CLS methods fails as well, because it assumes that the background varies only in amplitude. However, if we go one step further (click to view) and set "BackgroundChange"=1000, the background shape is now so unstable that even the INLS method fails, but still the derivative method remains effective as long as the background is broader than the measured peak, no matter what its shape. On the other hand, if the width and position of the measured peak changes from sample to sample, the derivative method will fail and the INLS method is more effective (click to view), as long as the fundamental shape of both measured peak and the background are both known (e.g. Gaussian, Lorentzian ,etc).

Not surprisingly, the more mathematically complex methods perform better, on average. Fortunately, software can "hide" that complexity, in the same way, for example, that a hand-held calculator hides the complexity of long division.

### Appendix E: Ensemble averaging patterns in a continuous signal.

Ensemble averaging is a powerful method of reducing the effect of random noise in experimental signals, when it can be applied. The idea is that the signal is repeated, preferably a large number of times, and all the repeats are averaged. The signal builds up, and the noise gradually averages towards zero, as the number of repeats increases.

An important requirement is that the repeats be aligned or synchronized, so that in the absence of random noise, the repeated signals would line up exactly. There are two ways of managing this:

(a) the signal repeats are triggered by some external event and the data acquisition can use that trigger to synchronize the signals, or
(b) the signal itself has some feature that can be used to detect each repeat, whenever it occurs.

The first method (a) has the advantage that the signal-to-noise (S/N) ratio can be arbitrarily low and the average signal will still gradually emerge from the noise if the number of repeats is large enough. However, not every experiment has a reliable external trigger.

The second method (b) can be used to average repeated patterns in one continuous signal without an external trigger that corresponds to each repeat, but the signal must then contain some feature (for example, a peak) with a signal-to-noise ratio large enough to detect reliably in each repeat. This method can be used even when the signal patters occur at random intervals, when the timing of the repetitions is not of interest. The interactive peak detector iPeak 6 has a built-in ensemble averaging function (Shift-E) can compute the average of all the repeating waveforms. It works by detecting a single peak in each repeat in order to synchronize the repeats.

The Matlab script iPeakEnsembleAverageDemo.m (on http://tinyurl.com/cey8rwh) demonstrates this idea, with a signal that contains a repeated underlying pattern of two overlapping Gaussian peaks, 12 points apart, with a 2:1 height ratio, both of width 12. These patterns occur a random intervals, and the noise level is about 10% of the average peak height. Using iPeak (above left), you adjust the peak detection controls to detect only one peak in each repeat pattern, zoom in to isolate any one of those repeat patterns, and press Shift-E. In this case there are about 60 repeats, so the expected signal-to-noise (S/N) ratio improvement is sqrt(60) = 7.7. You can save the averaged pattern (above right) into the Matlab workspace as "EA" by typing

then curve-fit this averaged pattern to a 2-Gaussian model using the peakfit.m function (figure on the right):

>> peakfit([1:length(EA);EA],40,60,2,1,0,10)

Position     Height     Width    Area
32.54       13.255     12.003   169.36
44.72        6.7916    12.677    91.648

You will see a big improvement in the accuracy of the peak separation, height ratio and width, compared to fitting a single pattern in the original x,y signal:

>> peakfit([x;y],16352,60,2,1,0,10)

### Appendix F: Harmonic Analysis of the Doppler Effect

The wav file "horngoby.wav" (Ctrl-click to open) is a 2-second recording of the sound of a passing automobile horn, exhibiting the familiar Doppler effect. The sampling rate is 22000 Hz. Download this file and place it in your Matlab path. You can then load this into the Matlab workspace as the variable "doppler" and display it using iSignal:

t=0:1/21920:2;
isignal(t,doppler);

Within iSignal, you can switching to frequency spectrum mode (press Shift-S), and zoom in on different portions of the waveform, so you can observe the downward frequency shift and measure it quantitatively. (Actually, it's much easier to hear the frequency shift - press Shift-P to play the sound - than to see it graphically; the shift is rather small on a percentage basis, but human hearing is very sensitive to small pitch (frequency) changes). It helps to re-plot the data to stretch out the frequency region around the fundamental frequency or one of the harmonics. I used iSignal to zoom in on three slices of this waveform and then I plotted the frequency spectrum (Shift-S) near the beginning (plotted in blue), middle (green), and end (red) of the sound. The frequency region between 150 Hz and 550 Hz are plotted in the figure below:

The group of peaks near 200 are the fundamental frequency of the lowest note of the horn and the group of peaks near 400 are its second harmonic. (Pitched sounds usually have a harmonic structure of 1, 2, 3... times a fundamental frequency). The group of peaks near 250 are the fundamental frequency of the next higher note of the horn and the group of peaks near 500 are its second harmonic. (Car and train horns often have two or three harmonious notes sounded together). In each of these groups of harmonics, you can clearly see that the blue peak (the spectrum measured at the beginning of the sound) has a higher frequency than the red peak (the spectrum measured at the end of the sound). The green peak, taken in the middle, has an intermediate frequency. The peaks are ragged because the amplitude and frequency varies over the sampling interval, but you can still get good quantitative measures of the frequency of each component by curve fitting to a Gaussian peak model using peakfit.m or ipf.m:

Peak        Position    Height     Width    Area
Beginning   206.69    3.0191e+005 0.81866   2.4636e+005
Middle      202.65    1.5481e+005 2.911     4.797e+005
End         197.42    81906       1.3785    1.1994e+005

The estimated precision of the peak position (i.e. frequency) measurements is about 0.2% relative, based on the bootstrap method, good enough to allow accurate calculation of the frequency shift (about 4.2%) and the speed of the vehicle and to demonstrate that the ratio of the second harmonic to the fundamental for these data is 2.0023, which is very close to the theoretical value of 2.

It's also possible to plot the evolving spectrum of such a signal as a contour graph, using the PlotSegFreqSpect.m function, as show in the graphic below. You can see the various frequency components drifting steadily down in frequency as the time passes. See HarmonicAnalysis.html#Software.

PSM=PlotSegFreqSpect(t,doppler,40,350,0); % 40 segments, 240 harmonics

### Appendix G: Measuring spikes

Spikes, narrow pulses with a width of only one or a few points, are sometimes encountered in signals as a result of an electronic "glitch" or stray pickup from nearby equipment, and they can easily be eliminated by the use of a "median" filter. But it is possible that in some experiments the spikes themselves might be the important part of the signal and that it is required to count or measure them. This situation was recently encountered in a research application, and it opens up some interesting twists on the usual procedures.

As a demonstration, the Matlab/Octave script SpikeDemo1.m creates a waveform (top panel of figure below) in which a series of spikes are randomly distributed in time, contaminated by two types of noise: white noise and a large-amplitude oscillatory interference simulated by a swept-frequency sine wave. The objective is to count the spikes and locate their position on the x (time) axis. Direct application of findpeaks or iPeak to the raw signal does not work well.

A single-point spike, called a delta function in mathematics, has a power spectrum that is flat; that is, it has equal power at all frequencies, just like white noise. But the oscillatory interference in this case is located in a specific range of frequencies, which opens some interesting possibilities. One approach would be to use a Fourier filter, for example a notch or band-reject filter, to remove the troublesome oscillations selectively. But if the objective of the measurement is only to count the spikes and measure their times, a simpler approach would be to (1) compute the second derivative (which greatly amplifies the spikes relative to the oscillations), (2) smooth the result (to limit the white noise amplification caused by differentiation), then (3) take the absolute value (to yield positive-pointing peaks). This can be done in a single line of nested Matlab/Octave code:

y1=abs(fastsmooth((deriv2(y)).^2,3,2));

The result, shown the lower panel of the figure on the left above, is an almost complete extraction of the spikes, which can then be counted with findpeaksG.m or peakstats.m or iPeak.m:

P=ipeak([x;y1],0,0.1,2e-005,1,3,3,0.2,0);

The second example, SpikeDemo2.m, is similar except that in this case the oscillatory interference is caused by two fixed-frequency sine waves at a higher frequency, which completely obscure the spikes in the raw signal (top panel of the left figure below). In the power spectrum (bottom panel, in red), the oscillatory interference shows as two sharp peaks that dominate the spectrum and reach to y=106, whereas the spikes show as the much lower broad flat plateau at about y=10. In this case, use can be made of an interesting property of sliding-average smooths, such as the boxcar, triangular, and Gaussian smooths; their frequency responses exhibit a series of deep cusps at frequencies that are inversely proportional to their filter widths. So this opens up the possibility of suppressing specific frequencies of oscillatory interference by adjusting the filter widths until the cusps occur at or near the frequency of the oscillations. Since the signal in this cases are spikes that have a flat power spectrum, they are simply smoothed by this operation, which will reduce their heights and increase their widths, but will have little or no effect on their number or x-axis positions. In this case a 9 point pseudo-Gaussian smooth puts the first (lowest frequency) cusp right in between the two oscillatory frequencies.

In the figure on the right, you can see the effect of applying this filter; the spikes, which were not even visible in the original signal, are now cleanly extracted (upper panel), and you can see in the power spectrum (right lower panel, in red) that the two sharp peaks of oscillatory interference is reduced by about a factor of about 1,000,000! This operation can be performed by a single command-line function, adjusting the smooth width (the second input argument, here a 9) by trial and error to minimize the oscillatory interference:

y1=fastsmooth(y,9,3);

(If the interference varies substantially in frequency across the signal, you could use a segmented smooth rather than the standard fastsmooth).  Alternatively, the segmented Fourier spectrum could be used to visualize this signal, and a Fourier filter in "notch" mode could be employed to specifically eliminate the interfering frequencies.

The extracted peaks can then be counted with any of the peak finding functions, such as:

P=findpeaksG(x,y1,2e-005,0.01,2,5,3);

or

P=findpeaksplot(x,y1,2e-005,0.01,2,5,3);

or

PS=peakstats(x,y1,2e-005,0.01,2,5,3,1);

The simple script "iSignalDeltaTest" demonstrates the power spectrum of the smoothing and differentiation functions of iSignal by a applying them to a delta function. Change the smooth type, smooth width, and derivative order and other functions in order to see how the power spectrum changes.

### Appendix H: Fourier deconvolution vs curve fitting (they are not the same).

Some experiments produce peaks that are distorted by being convoluted by processes that make peaks less distinct and modify peak position, height and width. Exponential broadening is one of the most common of these processes. Fourier deconvolution and iterative curve fitting are two methods that can help to measure the true underlying peak parameters.  Those two methods are  conceptually distinct, because in Fourier deconvolution, the underlying peak shape is unknown but the nature and width of the broadening function (e.g. exponential) is assumed to be known; whereas in iterative least-squares curve fitting it's just the reverse: the underlying peak shape (i.e. Gaussian, Lorentzian, etc) must be known, but the width of the broadening process is initially unknown.

In the script shown below and the resulting graphic shown on the right (Download this script), the underlying signal (uyy) is a set of four Gaussians with peak heights of 1.2, 1.1, 1, 0.9 located at x=10, 20, 30, 40 and peak widths of 3, 4, 5, 6, but in the observed signal (yy) these peaks are broadened exponentially by the exponential function cc, resulting in shifted, shorter, and wider peaks, and then a little constant white noise is added after the broadening. The deconvolution of cc from yy successfully removes the broadening (yydc), but at the expense of a substantial noise increase. However, the extra noise in the deconvoluted signal is high-frequency weighted ("blue") and so is easily reduced by smoothing and has less effect on least-square fits than does white noise. (For a greater challenge, try adding more noise in line 6 or use a bad estimate of time constant in line 10). To plot the recovered signal overlaid with underlying signal: plot(xx,uyy,xx,yydc). To plot the observed signal overlaid with the underlying signal: plot(xx,uyy,xx,yy). Excellent values for the original underlying peak positions, heights, and widths can be obtained by curve-fitting the recovered signal to four Gaussians: [FitResults,FitError]= peakfit([xx;yydc],26,42,4,1,0,10). With ten times the previous noise level (Noise=.01), the values of peak parameters determined by curve fitting are still quite good, and even with 100x more noise (Noise=.1) the peak parameters are more accurate than you might expect for that amount of noise (because that noise is blue). Visually, the noise is so great that the situation looks hopeless, but the curve-fitting actually works pretty well.

xx=5:.1:65;

% Underlying Gaussian peaks with unknown heights, positions, and widths.
uyy=modelpeaks2(xx,[1 1 1 1],[1.2 1.1 1 .9],[10 20 30 40],[3 4 5 6],...

[0 0 0 0]);

Noise=.001; % <---Try more noise to see how this method handles it.
yy=modelpeaks2(xx,[5 5 5 5],[1.2 1.1 1 .9],[10 20 30 40],[3 4 5 6],...

[-40 -40 -40 -40])+Noise.*randn(size(xx));

% Compute transfer function, cc,
cc=exp(-(1:length(yy))./40); % <---Change exponential time constant here

% Attempt to recover original signal uyy by deconvoluting cc from yy

% It's necessary to zero-pad the observed signal yy as shown here.
yydc=deconv([yy zeros(1,length(yy)-1)],cc).*sum(cc);

% Plot and label everything
subplot(2,2,1);plot(xx,uyy);title('Underlying signal, uyy');
subplot(2,2,2);plot(xx,cc);title('Exponential transfer function, cc')
subplot(2,2,4);plot(xx,yydc);title('Recovered signal, yydc')

An alternative to the above deconvolution approach is to curve fit the observed signal directly with an exponentially broadened Gaussian (shape number 5): [FitResults,FitError]=peakfit([xx;yy],26,50,4,5,40,10). Both methods give good values of the peak parameters, but the deconvolution method is considerably faster, because curve fitting with a simple Gaussian model is faster than fitting with a exponentially broadened peak model, especially if the number of peaks is large. Also, if the exponential factor is not known, it can be determined by curve fitting one or two of the peaks in the observed signal, using ipf.m, adjusting the exponential factor interactively to get the best fit. Note that you have to give peakfit a reasonably good value for the time constant ('extra'), the input argument right after the peakshape number.  If the value is too far off, the fit may fail completely, returning all zeros. A little trial and error suffice. Alternatively, you could try to use peakfit.m version 7 with the independently variable time constant exponentially-broadened Gaussian shape number 31 or 39, to measure the time constant as an iterated variable (to understand the difference, see example 39). If the time constant is expected to be the same for all peaks, better results will be obtained by using shape number 31 or 39 initially to measure the time constant of an isolated peak (preferably one with a good S/N ratio), then apply that fixed time constant in peak shape 5 to all the other groups of overlapping peaks.

### Appendix I: Digitization noise - can adding noise really help?

Digitization noise, also called quantization noise, is an artifact caused by the rounding or truncation of numbers to a fixed number of figures. It can originate in the analog-to-digital converter that converts an analog signal to a digital one, or in the circuitry or software involved in transmitting the digital signal to a computer, or even in the process of transferring the data from one program to another, as in copying and pasting data to and from a spreadsheet.  The result is a series of non-random steps of equal height. The frequency distribution is white, because of the sharpness of the steps, as you can see by observing the power spectrum.

The figure on the left, top panel, shows the effect of integer digitization on a sine wave with an amplitude of +/- 10. Ensemble averaging, which is usually the most effective of noise reduction techniques, does not reduce this type of noise (bottom panel) because it is non-random.

Interestingly, if additional random noise is present in the signal, then ensemble averaging becomes effective in reducing

An
audible example of this idea is illustrated by the Matlab/Octave script DigitizedSpeech.m, which starts with an audio recording of the spoken phrase "Testing, one, two, three", previously recorded at 44000 Hz and saved in WAV format (TestingOneTwoThree.wav) and in .mat format (testing123.mat), rounds off the amplitude data progressively to 8 bits (256 steps; sound link), shown on the left, 4 bits (16 steps; sound link), and 1 bit (2 steps; sound link), and then the 2-step case again with random white noise added before the rounding (2 steps + noise; sound link), plots the waveforms and plays the resulting sounds, demonstrating both the degrading effect of rounding and the remarkable improvement caused by adding noise. (Click on these sound links to hear the sounds on your computer). Although the computer program in this case does not perform an explicit ensemble averaging operation as does RoundingError.m, it's likely that the neurons of the hearing center of your brain provide that function by virtue of their response time and memory effect.

### Appendix J: How Low can you Go? Performance with very low signal-to-noise ratios.

This is a simulation of several techniques described in this paper to the quantitative measurement of a peak that is buried in excess of random noise, where the signal-to-noise (S/N) ratio is below 2. (Ordinarily, a S/N ratio of 3 is desired for reliable detection).

The Matlab/Octave script LowSNRdemo.m performs the simulations and calculations and compares the results graphically, focusing on the behavior of each method as the S/N ratio approaches zero.  Four methods are compared:

(1) smoothing, followed by the peak-to-peak measure of the smoothed signal and background;
(2) a peak finding method based on the  findpeakG function;
(3) unconstrained iterative least-squares fitting (INLS) based on the peakfit.m function; and
(4) constrained classical least squares fitting (CLS) based on the cls2.m function.

The measurements are carried out over a range of peak heights for which the S/N ratio varies from 0 to 2. The noise is random, constant, and white. Each time you run the script, you get the same set of underlying signals but independent samples of the random noise.

Results for the initial values in the script are shown in the plots on the left and in the table printed below, both of which are created by the script  LowSNRdemo.m. The graphs on the left show correlation plots of the measured peak height vs the real peak height, which should ideally be a straight line with a slope of 1, an intercept of zero, and an R-squared of 1.  As you can see, the simplest smoothed-peak method (upper left) is completely inadequate, with a low slope (because smoothing reduces peak height) and a high intercept (because even smoothed noise has a non-zero peak-to-peak value). The findpeaks function (upper right) works OK for height for higher peak heights but fails completely below a S/N ratio of 0.5 because the peak height falls below the amplitude threshold setting. In comparison, the two least-squares techniques work much better, reporting much better values of slope, intercept of zero, and R-squared. But if you look closely at the low end of the peak height range, near zero, you can see that the values reported by the unconstrained fit (lower left) occasionally stray from the line, whereas the constrained fit (lower right) decrease gracefully all the way to zero every time you run the script. Essentially the reason why it's even possible to make measurements at such low S/N ratios is that the data density is very high: that is, there are many data points in each signal (about 1000 points across the half-width of the peak with the initial script values).

The results are summarized in the table below. The height errors are reported as a percentage of the maximum height (initially 2). (For the first three methods, the peak position is also measured and its relative accuracy is reported. The constrained classical least squares fitting does not measure peak position but rather assumes that it remains fixed at the initial value of 100). You can see that the CLS method has a slight edge in accuracy, but you have to consider also that this method works well only if the peak shape, position, and width are known. The unconstrained iterative method can track changes in peak position and width.

Number of points in half-width of peak: 1000
Method         Height Error     Position Error
Smoothed peak    21.2359%        120.688%
findpeaksG.m     32.3709%        33.363%
peakfit.m        2.7542%         4.6466%
cls2.m           1.6565%

You can change several of the factors in this simulation to test the robustness of these methods. Search for the word 'change' in the comments for values that can be changed. Reduce MaxPeakHeight (line 8) to make the problem harder. Change peak position and./or width (lines 9 and 10) to show how the CLS method fails. As usual, the more you know, the better your results. Change the increment (line 4) to change the data density; more data is always better.

(Surprisingly, as it will be shown in Appendix U: Measurement Calibration, it is not even necessary to have an accurate peak shape model in order to get a good correlation between measured and actual height).

LowSNRdemo.m also computes the power spectrum of the signal and the amplitude (square root of the power) of the fundamental, where most of the power of a broad Gaussian peak falls, and plots it in Figure(2). The correlation to peak height is similar to the CLS method, but
the intercept is higher because there is a non-zero quantity of noise even in that one frequency slice of the power spectrum.

We are now, in the 21st century, into the era of "big data", where high-speed automated data systems can acquire, store, and process greater quantities of data than ever before. As this little example shows, greater quantities of data allow researchers to probe deeper and measure smaller effects that previously.

### Appendix K: Signal processing and the search for extraterrestrial intelligence.

The signal detection problems facing those who search the sky for evidence of extraterrestrial civilizations or interesting natural phenomena are enormous. Among those problems are the fact that we don't know much about what to expect. In particular, we don't know exactly where to look in the sky, or what frequencies might be used, or the possible forms of the transmissions. Moreover, the many powerful sources of natural and terrestrial sources of interfering signals must not be confused for extraterrestrial ones. There is also the massive computer power required, which has driven the development of specialized hardware and software as well as distributed computation over thousands of Internet-connected personal computers across the world using the SETI@home computational screen-saver. Although many of the computational techniques used in this search are far more sophisticated than those covered in this web site, they begin with the basic concepts covered here.

One of the reoccurring themes of this site has been that the more you know about your data, the more likely you are to obtain a reliable measurement. In the case of possible extraterrestrial signals, we don't know much, but we do know a few things.

We know that electromagnetic radiation over a wide range of frequencies is used for long-distance transmission on earth and between earth and satellites and probes far from earth. Astronomers already use radio telescopes to receive natural radiations from vast distances. In order to look at different frequencies at once,
of the raw telescope signals can be computed over multiple time segments. We previously saw a simulation that showed how hard it is to see a periodic component in the presence of on equal amount of random noise and yet how easy it is to pick it out in the frequency spectrum.

Also, t
ransmissions from extraterrestrial civilizations might be in the form of  equally-spaced pulses, so their detection and verification is also part of SETI signal processing. Interestingly, triplets and other groups of equally spaced pulses appear in the Fourier transforms of high frequency carrier waves that are amplitude or frequency modulated (like AM or FM radio). Of course, there is no reason to assume, nor to reject, that extraterrestrial civilizations might use the same methods of communication as ourselves.

One thing that we know for sure is that the earth rotates around its axis once a day and that it revolves around the sun once a year. So if we look at a fixed direction out from the earth, the distant stars will seem to move in a predictable pattern, whereas terrestrial sources will remain fixed on earth. The huge Arecibo Observatory dish in Puerto Rico is fixed in position and is often used to look in one selected direction for extended periods of time. The field of view of this telescope is such that a point source at a distance takes 12 seconds to pass. As SETI says: "Radio signals from a distant transmitter should get stronger and then weaker as the telescope's focal point moves across that area of the sky. Specifically, the power should increase and then decrease with a bell shaped curve (a Gaussian curve). Gaussian curve-fitting is an excellent test to determine if a radio wave was generated 'out there' rather than a simple source of interference somewhere here on Earth, since signals originating from Earth will typically show constant power patterns rather than curves". Also, any observed 12 second peaks can be re-examined with another focal point shifted towards the west to see if it repeats with the expected time and duration.

We also know that there will be a in the frequencies observed if the source is moving relative to the receiver; this is observed with sound waves as well as with electromagnetic waves like radio or light. Because the earth is rotating and revolving at a known and constant speed, we can accurately predict and compensate for the Doppler shift caused by earth's motion (this is called "de-chirping" the data).

For more on the details of SETI signal processing, see SETI@home.

### AppendixL: Why measure peak area rather than peak height?

This appendix examines more closely the question of measuring peak area rather than peak height to reduce the effect of peak broadening, which commonly occurs in chromatography, for reasons that are discussed previously, and also in some forms of spectroscopy. Under what conditions the measurement of peak area might be better than peak height?

The Matlab/Octave script "HeightVsArea.m" simulates the measurement of a series of standard samples whose concentrations are given by the vector 'standards'. Each standard produces an isolated peak whose peak height is directly proportional to the corresponding value in 'standards' and whose underlying shape is a Gaussian with a constant peak position ('pos') and width ('wid'). To simulate the measurement of these samples under typical conditions, the script changes the shape of the peaks (by exponential broadening) and adds a variable baseline and random noise. You can control, by means of the variable definitions in the first few lines of the script, the peak beginning and end, the sampling rate 'deltaX' (increment between x values), the peak position and width ('pos' and 'wid'), the sequence of peak heights ('standards'), the baseline amplitude ('baseline') and its degree of variability ('vba'), the extent of shape change ('vbr'), and the amount of random noise added to the final signal ('noise').

The resulting peaks are shown in Figure 1, above. The script prepares a series of "calibration curves" plotting the values of 'standard' against the measured peak heights or areas for each measurement method. The measurement methods include peak height in Figure 2, peak area in Figure 3, and curve fitting height and area in Figures 4 and 5, respectively. These plots should ideally have an intercept of zero and an R2 of 1.000, but the slope is greater for the peak area measurements because area has different units and is numerically greater than peak height. All the measurement methods are baseline corrected; that is, they include code that attempts to compensate for changes in the baseline (controlled by the variable 'baseline').

With the initial values of 'baseline', 'noise', 'vba', and 'vbr', you can clearly see the advantage of peak area measurements (figure 3) compared to peak height (figure 2). This is primarily due to the effect of the variability of peak shape broadening ('vbr') and to the averaging out of random noise in the computation of area.

Figure 2                                                                                  Figure 3

If you set 'baseline', 'noise', 'vba', and 'vbr' all to zero, you've simulated a perfect world in which all methods work perfectly.

Curve fitting can measure both peak height and area; it is not even absolutely necessary to use an accurate peak shape model. Using a simple Gaussian model in this example works much better for peak area (Figure 5) than for peak height (Figure 4) but is not significantly better than a simple peak area measurement (Figure 3). The best results are obtained if an exponentially-broadened Gaussian model (shape 31 or  39) is used, using the code in line 30, but that computation takes longer. Moreover, if the measured peak overlaps another peak significantly, curve fitting both of those peaks together can give much more accurate results that other peak area measurement methods.

### Appendix N: Using macros to extend the capability of spreadsheets

Both Microsoft Excel and OpenOffice Calc have the ability to automate repetitive tasks using "macros", saved sequences of commands or keystrokes that are stored for later use. Macros can be most easily created using the built-in "Macro Recorder", which will literally watch all your clicks, drags, and keystrokes and record them for later playback. Or you can write or edit your macros in the macro language of that spreadsheet (VBA in Excel; Python or JavaScript in Calc). Or you can do both: use the macro recorder first, then edit the resulting code manually to modify it.

To enable macros in Excel, click on File, Options, click Customize Ribbon Tab and check 'Developer' and click 'OK'. To access the macro recorder, click Developer, Record Macro, give the macro a name, click Options, assign a Ctrl-key shortcut, and click OK. Then perform your spreadsheet operations, and when finished, click Stop Recording and save the spreadsheet. Thereafter, simply pressing your Ctrl-key shortcut will run the macro and perform all the spreadsheet operations that you recorded.

Here I will demonstrate two applications in Excel using macros with the Solver function.  (See http://peltiertech.com/Excel/SolverVBA.html#Solver2 for information about setting up macros and solver on your version of Excel).

A previous section described the use of the Solver function applied to the iterative fitting of overlapping peaks in a spreadsheet. The steps listed in the second paragraph on that page can easily be captured with the macro recorder and saved with the spreadsheet. However, a different macro will needed for each different number of peaks, because the block of cells representing the "Proposed Model" will be different for each number of peaks. For example, the template CurveFitter2Gaussian.xlsm includes a macro named 'fit' for a 2-peak fit, activated by pressing Ctrl-f.  Here is the text of that macro:

Sub fit()
'
' fit Macro
'
' Keyboard Shortcut: Ctrl+f
'
SolverOk SetCell:="\$C\$12", MaxMinVal:=2, ValueOf:=0,
ByChange:="\$C\$8:\$D\$9", _
Engine:=1, EngineDesc:=
"GRG Nonlinear"
SolverSolve
End Sub

You can see that the text of the macro uses only two macro instructions: "SolverOK" and "SolverSolve". SolverOK specifies all the information in the "Solver Parameters" dialog box in its input arguments: 'SetCell' sets the objective as the percent fitting error in cell C12, 'MaxMinVal' is set to the second choice (Minimum), and 'ByChange' specifies the table of cells representing the proposed model (C8:D9) whose values are to be changed to minimize the objective in cell C12. The last argument sets the solver engine to 'GRC Nonlinear', the best one for iterative peak fitting. Finally, "SolverSolve" starts the Solver engine. You could easily modify this macro for curve fitter templates with other numbers of peaks just by changing the cells referenced in the 'ByChange' argument, e.g. C8:E9 for a 3-peak fit. In this case, though, is probably just as easy to use the macro recorder to record a macro for each curve fitter template.

A more elaborate example of a spreadsheet using a macro is TransmissionFittingCalibrationCurve.xls (screen image) that creates a calibration curve for a series of standard concentrations in the TFit method, which was previously described on TFit.html#spreadsheet. Here's a portion of that macro:

Range("AF10").Select
Application.CutCopyMode = False
Selection.Copy
Range("A6").Select
Selection.PasteSpecial Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks _
:=False, Transpose:=False
Calculate
Range("J6").Select
Selection.Copy
Range("I6").Select
Selection.PasteSpecial Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks _
:=False, Transpose:=False
Calculate
SolverOk SetCell:="\$H\$6", MaxMinVal:=2, ValueOf:=0, ByChange:="\$I\$6", Engine:=1 _
, EngineDesc:="GRG Nonlinear"
SolverOk SetCell:="\$H\$6", MaxMinVal:=2, ValueOf:=0, ByChange:="\$I\$6", Engine:=1 _
, EngineDesc:="GRG Nonlinear"
SolverSolve userFinish:=True
SolverSolve userFinish:=True
SolverSolve userFinish:=True
Range("I6:J6").Select
Selection.Copy
Range("AG10").Select
Selection.PasteSpecial Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks _
:=False, Transpose:=False

The macro in this spreadsheet repeats this ch
unk of code several times, once for each concentration in the calibration curve (changing only the "AF10" in the first line to pick up a different concentration from the "Results table" in column AF). This macro uses several additional instructions, to select ranges ("Range...Select"), copy ("Selection.Copy") and paste ("Selection.PasteSpecial Paste:=xlPasteValues") values from one place to another, and re-calculate the spreadsheet ("Calculate"). Each separate click, menu selection, or key press creates one or more lines of macro text. The syntax is wordy but quite explicit and clear; you can learn quite a bit just by recording various spreadsheet actions and looking at the resulting macro text.

### Appendix O: Random walks and baseline correction

The random walk was mentioned in the section on signals and noise as a type of low-frequency ("pink") noise. Wikipedia says: "A random walk is a mathematical formalization of a path that consists of a succession of random steps. For example, the path traced by a molecule as it travels in a liquid or a gas, the search path of a foraging animal, "superstring" behavior, the price of a fluctuating stock and the financial status of a gambler can all be modeled as random walks, although they may not be truly random in reality."

Random walks describe and serve as a model for many kinds of unstable behavior. Whereas white, 1/f, and blue noises are anchored to a mean value to which they tend to return, random walks tend to be more aimless and often drift off on one or another direction, possibly never to return. Mathematically, a random walk can be modeled as the cumulative sum of some random process, for example the 'randn' function. The graph on the right compares a 200-point sample of white noise (computed as 'randn' and shown in
blue) to a random walk (computed as a cumulative sum, 'cumsum', and shown in red). Both samples are scaled to have exactly the same standard deviation, but even so their behavior is vastly different. The random walk has much more low frequency behavior, in this case wandering off beyond the amplitude range of the white noise. This type of random behavior is very disruptive to the measurement process, distorting the shapes of peaks and causing baselines to shift and making them hard to define, and it can not be reduced significantly by smoothing (See NoiseColorTest.m). In this particular example, the random walk has an overall positive slope and a "bump" near the middle that could be confused for a real signal peak (it's not; it's just noise). But another sample might have very different behavior. Unfortunately, it is not uncommon to observe this behavior in experimental signals.

To demonstrate the measurement difficulties, the script RandomWalkBaseline.m simulates a Gaussian peak with randomly variable position and width, on a random walk baseline, with a S/N ratio is 15. The peak is measured by least-squares curve fitting methods using peakfit.m with two different methods of baseline correction in an attempt to handle the random walk:

(a) a single-component Gaussian model (shape 1) with autozero set to 1 (meaning a linear baseline is first interpolated from the edges of the data segment and subtracted from the signal): peakfit([x;y],0,0,1,1,0,10,1)

(
b) a 2-component model, the first being a Gaussian (shape 1) and the second a linear slope (shape 26), with autozero set to 1: peakfit([x;y],0,0,2,[1 26],[0 0],10,0)

In this particular case the fitting error is lower for the second method, especially if the peak falls near the edges of the data range.

But the relative percent errors of the peak parameters show that the first method gives a lower error for position and width, at least in this case. On average, the peak parameters are about the same.

Position Error  Height Error  Width Error
Method a:  0.2772       3.0306        0.0125
Method b:  0.4938       2.3085        1.5418

You can compare this to WhiteNoiseBaseline.m which has a similar signal and S/N ratio, except that the noise is white. Interestingly, the fitting error with white noise is greater, but the parameter errors (peak position, height, width, and area) are lower, and the residuals are more random and less likely to produce false noise peaks. This is because the random walk noise is very highly concentrated at low frequencies where the signal frequencies usually lie, whereas white noise also has considerable power at higher frequencies, which increases the fitting error but does comparatively little damage to signal measurement accuracy. This may be slightly counter-intuitive, but it's important to realize that fitting error does not always correlate with peak parameter error. Bottom line: random walk is troublesome.

Depending on the type of experiment, an instrumental design based on modulation techniques may help, and ensemble averaging multiple measurements can help with any type of unpredictable random noise, which is discussed in the very next section.

### Appendix P. Modulation and synchronous detection: It's not magic, but it's close.

In some experimental designs it may be beneficial to apply the technique of modulation, in which one of the controlled independent variables is oscillated in a periodic fashion, and then to detect the resulting oscillation in the measured signal. With the right instrumental design, some types of noise and drift may thereby be reduced or eliminated.

A simple example applies to optical measurement systems like the one pictured on the right. A light source illuminates a test object (DUT="Device Under Test") and the resulting light from the test object is measured by photo detector. Depending on the objective of the experiment and the arrangement of the parts, the detector might measure the light transmitted by, reflected by, scattered by, or excited by the light beam. A optical chopper rapidly and repeatedly interrupts the light beam falling on the test object, so that the photo detector sees an oscillating signal, and the following electronic system is designed to measure only the oscillating component and to ignore the constant (direct current) component. The advantage of this arrangement is that any interfering signals introduced after the chopper - such as constant light that comes from the test object itself, or ambient light that leaks in from the outside, or any constant background signal generated by the photo detector itself - are not oscillating and are thus rejected. This works best if the electronics is synchronized to the chopper frequency. That's actually the function of the lock-in amplifier, which receives a synchronizing reference signal directly from the chopper to guarantee synchronization even if the chopper frequency were to vary (c.f. the interactive simulation on https://terpconnect.umd.edu/~toh/models/lockin.html and T. C. O'Haver, "Lock-in Amplifiers," J. Chem. Ed. 49, March and April (1972). The lock-in amplifier is sometimes viewed as a "black box" with almost magical abilities, but in fact it is actually performing a rather simple (but very useful) operation, as shown in this simulation.

AmplitudeModulation.m is a Matlab/Octave script simulation of modulation and synchronous detection, in which the signal created when the light beam scans the test sample is modeled as a Gaussian band ('y'), whose parameters are defined in the first few lines of the script (updated for recent version of Matlab). As the spectrum of the sample is scanned, the light beam is amplitude modulated by the chopper, represented as a square wave defined by the bipolar vector 'reference', which switches between +1 and -1, shown in the top panel on the left. The modulation frequency is many times faster than the rate at which the sample is scanned. The light emerging from the sample therefore shows a finely chopped Gaussian ('my'), shown in the second panel on the left. But the total signal seen by the detector also includes an unstable background introduced after the modulation ('omy'), such as lighted emitted by the sample itself or detector background, which in this simulation this is modeled as a random walk (Appendix O), which seriously distorts the signal, shown in the third panel. The detector signal is then sent to a lock-in amplifier that is synchronized to the reference waveform. The essential action of the lock-in is to multiply the signal by the bipolar reference waveform, inverting the signal when the light of off and passing it unchanged when the light is on. This causes the unmodulated background signal to be converted into a bipolar square wave, whereas the modulated signal is not effected because it is "off" when the reference signal is negative. The result ('dy') is shown in the 4th panel. Finally, this signal is low-passed filtered by the last stage in the lock-in amplifier to remove the modulation frequency, resulting in the recovered signal peak 'sdy' shown in the bottom panel. In effect, the modulation transforms the signal to a higher frequency ('frequency' in line 44) where low-frequency weighted noise on the baseline (line 50) are less intense.

These various signals are compared in the figure on the right. The original Gaussian signal peak ('y') is shown as the blue line, and the contaminating background {'baseline') is shown in black, in this case modeled as a random walk. The total signal ('oy') that would have been seen by the detector if modulation was not used is shown in green. The signal distortion is evident, and any attempt to measure the signal peak in that signal would be greatly in error. The signal 'sdy' recovered by the modulation and lock-in system is shown in red and overlayed with the original signal peak 'y' in blue for comparison. The fact that the blue and red line are so close to each other indicates the extent to which this method is successful. To make a more quantitative comparison, this script also uses the peakfit.m function, which employs a least-squares method to measure the peak parameters in the original unmodulated total signal (green line) and in the modulated recovered signal (green) and to compute the relative percent error in peak position, width, and width by both methods:

SignalToNoiseRatio = 4
Relative % Error:   Position         Height         Width
Original:
8.07                23.1              13.7
Modulated:
0.11                0.22              1.01

Each time you run it you will get the same signal peak but a very different random walk background. The S/N ratio will vary from about 4 to 9. It's not uncommon to see a 100-fold improvement in peak height accuracy with modulation, as in the example shown here. (If you wish, you can change the signal peak parameters and the noise level in the first few code lines of this simulation. For an even greater challenge, change line 47 to "baseline=10.*noise+cumsum(noise);" to make the noise a mixture of white and random walk drift, which results in a really ugly raw signal; you can see that the white noise makes it through the synchronous detector but is reduced by the smoothing lowpass filter in the last stage). In effect, the low-pass filter determines the frequency bandwidth of the lock-in system, but it also increases the response time to step changes (as in the Morse Code example).

This improvement in measurement accuracy works only because the dominant random error in this case is

(a) introduced after the modulation, and
(b) a mostly low-frequency noise.

If the noise were white, there would be no improvement, because white noise is the same at all frequencies; in fact there would be a slight reduction in precision because of the fact that the chopper blocks half of the light on average. If the sample (device under test) generates an "absorption" peak that starts a some positive value and then dips down to a lower value before returning, the demodulated output will the a negative-goring  peak rather than a positive peak (see AmplitudeModulationAbsorption.m).

In a computer-interfaced experimental system, you may not actually need a physical lock-in amplifier. It's possible to simulate the effect in software, as is done in this simulation. You need only digitize both the modulated sample signal and modulation reference signal, then invert the total signal whenever the reference signal is "off".

In some spectroscopic applications another useful type of modulation is "wavelength modulation", in which the wavelength of the light source is oscillated over the wavelength region of an emission or absorption peak in the spectrum (reference 32); this is often used in tunable diode laser spectroscopy and applied to the measurement of gases such as methane, water vapor, and carbon dioxide, especially in remote sensing, where the sample may be far from the detector. Less commonly modulation techniques are also applied in "AC" (alternating current) electrochemistry and in spectroelectrochemistry.

### Appendix Q: Measuring a buried peak

This simulation explores the problem of measuring the height of a small peak (a aso-called "child peak") that is buried in the tail of a much stronger overlapping peak (a "parent peak"), in the especially challenging case that the smaller peak is not even visible to the unaided eye. Three different measurement tools will be explored: iterative least-squares, classical least-squares regression, and peak detection, using the Matlab/Octave tools peakfit.m, cls.m, or findpeaksG.m, respectively. (Alternatively, you could use the corresponding spreadsheet templates).

In this example the larger peak is located at x=4 and has a height of 1.0 and a width of 1.66; the smaller measured peak is located at x=5 and has a height of 0.1 and also with a width of 1.66. Of course, for the purposes of this simulation, we pretend that we don't necessarily know all of these facts and we will try to find methods that will extract such information as possible from the data, even if the signal is noisy. The measured peak is small enough and close enough to the stronger overlapping peak (separated by less than the width of the peaks) that it
never forms a maximum in the total signal. So it looks like there is only one peak, as shown on the figure on the right. For that reason the findpeaks.m function (which automatically finds peak maxima) will not be useful by itself to locate the smaller peak. Simpler methods for detecting the second peak also fail to provide a way to measure the smaller second peak, such as inspecting the derivatives of the signal (the smoothed fourth derivative shows some evidence of asymmetry, but that could just be due to the shape of the larger peak), or Fourier self-deconvolution to narrow the peaks so they are distinguishable, but that is unlikely to be successful with this much noise. Least-squares methods work better when the signal-to-noise ratio is poor, and they can be fine-tuned to make use of available information or constraints, as will be demonstrated below.

The selection of the best method will depend on what is known about the signal and the constraints that can be imposed; this will depend in your knowledge of your experimental signal. In this simulation (performed by the Matlab/Octave script SmallPeak.m), the signal is composed of two Gaussian peaks (although that can be changed if desired in line 26). The first question is: are there more than one peak there? If we perform an unconstrained iterative fit of a
single Gaussian to the data, as shown in the figure on the right, it shows little or no evidence of a second peak - the residuals look pretty random. (If you could reduce the noise, or ensemble-average even as few as 10 repeat signals, then the noise would be low enough to see evidence of a second peak).  But as it is, there is nothing that pops out at you suggesting a second peak.

But suppose we suspect that there
should be another peak of the same Gaussian shape just on the right side of the larger peak. We can try fitting a pair of Gaussians to the data (figure on the left), but in this case the random noise is enough that the fit is not stable. When you run SmallPeak.m, the script performs 20 repeat fits (NumSignals in line 20) with the same underlying peaks but with 20 different random noise samples, revealing the stability (or instability) of each measurement method. The fitted peaks in Figure window 1 bounce around all over the place as the script runs, as you can see in the animation on the left. The fitting error is on average lower that the single-Gaussian fit, but that by itself does not mean that the peak parameters so measured will be reliable; it could just be "fitting the noise". If it were isolated all by itself, the small peak would have a S/N ratio of about 5 and it could be measured to a peak height precision of about 3%, but the presence of the larger interfering peak makes the measurement much more difficult. (Hint: After running SmallPeak.m the first time, spread out all the figure windows so they can all be seen separately and don't overlap. That way you can compared the stability of the different methods more easily.)

But suppose that we have reason to expect that the
two peaks will have the same width, but we don't know what that width might be. We could try an equal width Gaussian fit (peak shape #6, shown in Matlab/Octave Figure window 2); the resulting fit is much more stable and shows that a small peak is located at about x=5 on the right of the bigger peak, shown below on the left.  On the other hand, if we know the peak positions beforehand, but not the widths, we can use a fixed-position Gaussian fit (shape #16) shown on the right  (Figure window 3). In the very common situation where the objective is to measure an unknown concentration of a known component, then it's possible to prepare standard samples where the concentration of the sought component is high enough for its position or width to be determined with certainty.

So far all of these examples have used iterative peak fitting with at least one peak parameter (position and/or width) unknown and determined by measurement. If, on the other hand, all the peak parameters are known except the peak height, then the faster and more direct classical least-squares regression (CLS) can be employed (Figure window 4). In this case you need to know the peak position and width of both the measured and the larger interfering peaks (the computer will calculate their heights). If the positions and the heights really are constant and known, then this method gives the best stability and  precision of measurement. It's also computationally faster, which might be important if you have lots of data to process automatically.

The problem with CLS is that it fails to give accurate measurements if the peak position and/or width changes without warning, whereas two of the iterative methods (unconstrained Gaussian and equal-width Gaussian fits) can adapt to such changes. It some experiments it quite common to have small unexpected shifts in the peak position, especially in chromatography or other flow-based measurements, caused by unexpected changes in temperature, pressure, flow rate or other instrumental factors. In SmallPeaks.m, such x-axis shifts can be simulated using the variable "xshift" in line 18. It's initially zero, but if you set it to something greater (e.g. 0.2) you'll find that the equal-width Gaussian fit (Figure window 2) works better because it can keep up with the changes in x-axis shifts.

But with a greater x-axis shift (xshift=1.0) even the equal-width fit has trouble. Still, if we know the separation between the two peaks, it's possible to use the findpeaksG function to search for and locate the larger peak and to calculate the position of the smaller one. Then the CLS method, with the peak positions so determined for each separate signal, shown in Figure window 5 and labeled "findpeaksP" in the table below, works better. Alternatively, another way to use the findpeaks results is a variation of the equal-width iterative fitting method in which the first guess peak positions (line 82) are derived from the findpeaks results, shown in Figure window 6 and labeled "findpeaksP2" in the table below; that method does not depend on accurate knowledge of the peak widths, only their equality.

Each time you run SmallPeaks.m, all of these methods are computed "NumSignals" times (set in line 20) and compared in a table giving the average peak height accuracy of all the repeat runs:

xshift=0
Unconstr. EqualW  FixedP FixedP&W findpeaksP findpeaksP2
35.607    16.849  5.1375  4.4437   13.384     16.849

xshift=1
Unconstr. EqualW  FixedP FixedP&W findpeaksP findpeaksP2
31.263    44.107  22.794  46.18    10.607     10.808

Bottom line. The more you know about your signals, the better you can measure them. A stable signal with known peak positions and widths is the most precisely measurable in the presence of random noise ("FixedP&W"), but if the positions or widths vary from measurement to measurement, different methods must be used, and precision is degraded because more of the available information is used to account for changes other than the ones you want to measure.

Appendix R. Signal and Noise in the Stock Market?  Added September 2016, updated June 2020.

From a signal-to-noise perspective, the stock market is an interesting example. A national or global stock market is an aggregation of large numbers of buyers and sellers of shares in publicly traded companies. They are described by stock market indexes, which are computed as the weighted average of a large number of selected stocks. For example, the S&P 500 index is computed from the stock valuations of 500 large US companies. Millions of individuals and organizations participate in the buying and selling of stocks on a daily basis, so the S&P 500 index is a prototypical "big data" conglomerate, reflecting the overall value of 500 of the largest companies in the largest stock market on earth. Other stock indices, such as the Russel 2000, include an even larger number of smaller companies. Individual stocks can fail or fall drastically in value, but the market indexes average out the performance of hundreds of companies.

A plot of the daily value, V, of the S&P 500 index vs time, T, from 1950 through September of 2016 is shown in the following graphs.

Each plot contains 16608 data points, one for each business day, shown in red. The graph on the left plots V and the graph on the right plots the natural logarithm of V, ln(V). There are considerable up-and-down fluctuations over time that can be related to historical events: the oil crisis of the 1970s, the tech boom and bust of 2000, the subprime mortgage crisis of 2008. Still, the long-term trend of the value is upwards - the current value is over 100 times greater than its value in 1950. This is basically why people invest in the stock market, because on average, over the long run, stock values usually go up. The most common way to model this overall long-term increase over time is based on the equation for compound interest that predicts the growth of investments that have a constant rate of return, such as savings accounts or certificates of deposit:

V = S*(1 + R)T

where V is the value, S is the starting value, R is the annual rate of return, and T is time. By itself, this expression would yield a smooth curve, without all the peaks and dips. The values of S and R that result in the best fit to the stock market data (shown by the blue lines in the graphs) can be determined in two ways:

(1) directly, using the iterative curve fitting method, shown on the left above, or
(2) by taking the logarithm of the values and fitting those to a straight line, shown on the right above.

FitSandP.m is a Matlab/Octave script that performs both of these calculations using the data in SandPfrom1950.mat. When applied to the S&P 500 index data, the rate of return R is about 0.07 (or 7%), but interestingly these two methods give slightly different results, even though the exact same data are used for both, and even though both methods yield the same 7% rate if applied to noiseless synthetic data calculated from this expression. This difference between methods is caused by the irregularities in the stock data that deviate from a smooth line - in other words, the noise - and it is exacerbated by the large range of the value data V over time and by the fact that the average return from 1950 to 1983 is slightly lower than that from 1983 to 2016.

From the point of view of curve fitting, the deviations from a smooth curve described by the compound interest expression is just
noise. But from the point of view of the stock market investor, those deviations can be an opportunity and a warning. Naturally, most investors would like to know how the stock market will behave in the future, but that requires extrapolation beyond the range of the available data, which is always uncertain and dangerous. But still, it's most likely (but not certain) that the long term behavior of the market (say, over a period of 10 years or more) will be similar to the past - that is, growing exponentially at about the same rate as before but with unpredictable fluctuations similar to what has occurred in the past. We can take a closer look at those fluctuations by inspecting the residuals - that is, subtracting the fitted curve from the raw data, as shown in iSignal on the left. There are several notable features of this "noise". First, the deviations are roughly proportional to V and thus relatively equal when plotted on a log scale. Second, the noise has a distinctly low-frequency character; the periodogram (lower panel, in red) shows peaks at 33, 16, 8, and 4 years. There are also, notably, numerous instances over the years when there is a sharp dip followed by a slower recovery close to the previous value. And conversely, every peak is eventually followed by a dip. The conventional advice in investing is to "buy low" (on the dips) and "sell high" (on the peaks). But of course the problem is that you can not reliably determine in advance exactly where the peaks and dips will fall; you have only the past to guide you. Still, if the current market value is much higher than the long-term trend, it will likely fall, and if the market value is much lower than the long-term trend, it will likely rise, eventually. The only thing you can be sure of is that, in the long run, the market will rise. This is why saving for retirement by investing in the stock market, and starting as soon as possible, is so important: over a 30-year working life, the market is almost guaranteed to rise substantially. The most painless way to do this is with your employer's 401k or 403b automatic payroll withdrawal plan. You can not actually invest in the stock market as a whole, but you can invest in index mutual funds or exchange traded funds (ETFs), which are collections of stocks that are constructed to match or track the components of a market index. Such funds typically have very low management fees, an important factor in selecting an investment. Other mutual funds attempt to "beat the market" by carefully buying and selling stocks in an attempt to create a return that is greater than the overall market indexes; some are temporarily successful in doing that, but they charge higher management fees. Mutual finds and ETFs are much less risky investments than individual stocks.

Some companies periodically distribute payouts to investors called "dividends". Those dividends are independent of the day-to-day variations in stock price, so even if the stock value drops temporarily, you still get the same dividend. For that reason it's important that you set your investment account to "automatically reinvest dividends", so when the share price drops, the dividends are buying shares at the lower price. The S&P 500 index values used above, called price returns, did not include dividend reinvestment; the total returns with dividends reinvested (https://en.wikipedia.org/wiki/S%26P_500_Index#Versions) would have been substantially higher, closer to 11%. (With an average total annual return of 11%, and starting with an investment of \$170 the first month - that's less than \$6 a day - and increasing it 5% each year, you could accumulate over \$600,000 over a 30 year working life, or \$1,000,000 if you continued investing an additional 5 years, as shown by the spreadsheet graphic on the right). And that's starting at just \$6 per day, about the cost of a fancy coffee at Starbucks. Think about that the next time you see a line of young people waiting to order their daily coffee.

To illustrate how much influence stock market volatility fluctuation ("noise") has on the market gains, the Matlab/Octave script SnPsimulation.m adds proportional noise to the compound interest calculation to mimic the S&P data, performs the two curve fitting methods described above, repeats the allocations over and over with independent samples of proportional noise, and then calculates the mean and the relative standard deviation (RSD) of the rates of return. A typical result is:

TrueRateOfReturn = 0.07
Measured Rate  RSD
Coordinate transformation:   0.07112     8.9%
Iterative curve fitting:     0.07972    19.9%

As you can see, the two methods don't agree. In this example, the return calculated by the iterative method is higher, but it could just have easily been the other way. The fact is that the standard deviations are fairly large, and the iterative method always has a higher standard deviation, because it weights the higher values more heavily, where deviations from the line are higher, whereas the log transformation method weights the data more evenly. Even with this uncertainty, investing in a stock market index fund almost always performs better in the long run than more predictable investments such as saving accounts or CDs, which have much lower rates of return.

In investing in the stock market, it's important to focus on the long-term trends and not to be frightened by the short-term up and down fluctuations. It's similar to the difference between weather and climate; the large and dramatic short-term weather variations tend to disguise the much smaller long term climate warming that is slowly melting the icecaps and raising the sea levels (whether it is caused by human activity or by natural causes alone or by a combination of both).

For a spreadsheet template that allows you to calculate the possible returns on long-term investments in stock market mutual funds, see https://terpconnect.umd.edu/~toh/simulations/Investment.html.

Note added in June 2020. You might be wondering what effect more recent events have had on the overall stock market performance, in particular the trade wars of 2019 and the Coronavirus pandemic of 2020. Extending the data plotted above to include the S&P results for the dates up to June 2020 has remarkably little effect, as seen in the log plot below.

The average return is still about 7%.

The recent changes are more evident if you take a much closer look at the period from 2016 to 2020, for which the return over that short period was indeed greater, about 9.5%. These 2019 and the 2020 dips, although they were quite sharp and caused a lot of anxiety at the time, recovered quickly and as a result had little effect on the overall long-term performance. When stocks drop, even for well-known and valid reasons, some investors buy shares at the reduced prices, and when stocks rise, especially when they hit all-time highs, some investors sell shares, to "lock in their gains". This behavior acts as a natural brake on the fluctuations of the market.

Appendix S. Measuring signal-to-noise ratio in complex signals. (Added October, 2016)

In the section of Signals and Noise, I said: "The quality of a signal is often expressed as the signal-to-noise (S/N) ratio, which is the ratio of the true signal amplitude ... to the standard deviation of the noise." That's a simple enough statement, but automating the measurement of signal and the noise in real signals is not always straightforward. Sometimes it's difficult to separate or distinguish between the signal and the noise, because it depends not only on the numerical nature of the data, but also on the objectives of the measurement.

For a simple DC (direct current) signal, for example measuring a fluctuating voltage, the signal is just the average voltage value and the noise is its standard deviation. This is easily calculated in a spreadsheet or in Matlab/Octave:

>> signal=mean(NoisyVoltage);
>> noise=std(NoisyVoltage);
>> SignalToNoiseRatio=signal/noise

But usually things are more complicated. For example, if the signal is a rectangular pulse (as in the figure on the left) with constant random noise, then the simple formulation above will not give accurate results. If the signal is stable enough that you can get two successive signal recordings m1 and m2 that are identical except for the noise, then you can simply subtract the signal out: the standard deviation of the noise is then given by sqrt((std(m1-m2)2)/2), where std is the standard deviation function (because random noise adds quadratically). But not every signal source is stable and repeatable enough for that to work perfectly. Alternatively, you can try to measure the average signal just over the top of the pulse and the noise only over the baseline interval before and/or after the pulse. That's not so hard to do by hand, but it's harder to automate with a computer, especially if the position or width of the pulse changes. It's basically the same for smooth peak shapes like the commonly-encountered Gaussian peak (as in the figure on the right). You can estimate the height of the peak by smoothing it and then taking the maximum of the smoothed peak as the signal: max(fastsmooth(y,10,3)), but the accuracy would degrade if you choose too high or two low a smooth width. And clearly all this depends on having a well-defined baseline in the data where there is only noise. It doesn't work if the noise varies with the amplitude of the peak.

In many cases, curve fitting can be helpful. For example, you could use peak fitting or a peak detector to locate multiple peaks and measure their peak heights and their S/N ratios on a peak-to-peak basis, by computing the noise as the standard deviation of difference between the raw data and the best-fit line over the top part of the peak. That's how iSignal measures S/N ratios of peaks. Also, iSignal has baseline correction capabilities that allow the peak to be measured relative to the nearby baseline.

Curve fitting also works for complex signals of indeterminate shape that can be approximated by a high-order polynomial or as the sum of a number of basic functions such as Gaussians, as in the example shown on the left. In this example, five Gaussians are used to fit the data to the point where the residuals are random and unstructured. The residuals (shown in blue below the graph) are then just the noise remaining in the signal, whose standard deviation is easily computed using the built-in standard deviation function in a spreadsheet ("STDEV") or in Matlab/Octave ("std"). In this example, the standard deviation of the residuals is 111 and the maximum signal is 40748, so the percent relative standard deviation of the noise is 0.27% and the S/N ratio is 367. (The positions, heights, and widths of the Gaussian components, usually the main results of the curve fitting, are not used in this case; curve fitting is used only to obtain a measure the noise via the residuals). The advantage of this approach over simply subtracting two successive measurements of the signal is that it adjusts for slight changes in the signal from measurement to measurement; the only assumption is that the signal is a smooth, low-frequency waveform that can be fit with a polynomial or a collection of basic peak shapes and that the noise is random and mostly high-frequency compared the the signal.  But don't use too high a polynomial order' otherwise you are just "fitting the noise".

With periodic signal waveforms the situation is a bit more complicated. As an example, consider the audio recording of the spoken phrase "Testing, one, two, three" (click to download in .mat format or in WAV format) that was used previously. The Matlab/Octave script PeriodicSignalSNR.m loads the audio file into a variable "waveform", then computes the average amplitude of the waveform (the "envelope") by smoothing the absolute value of the waveform:

envelope = fastsmooth(abs(waveform), SmoothWidth, SmoothType);

The result is plotted on the left, where the waveform is in blue and the envelope is in red. The signal is easy to measure as the maximum or perhaps the average of the waveform, but the noise is not so evident. The human voice is not reproducible enough to get a second identical recording to subtract out the signal as above. Still, there will be often be gaps in the sound, during which the background noise will be dominant. In an audio (voice or music) recording, there will typically be such gaps at the beginning, then the recording process has already started but the sound has not yet begun, and possibly at other short periods when there are pauses in the sound. The idea is that, by monitoring the envelope of the sound and noting when it falls below some adjustable threshold value, we can automatically record the noise that occurs in those gaps, whenever they may occur in a recording. In PeriodicSignalSNR.m, this operation is done in lines 26-32, and the threshold is set in line 12. The threshold value has to be optimized for each recording. When the threshold value is set to 0.015 in the "Testing, one, two, three" recording, the resulting noise segments are located and are marked in red in the plot on the right. The program determines the average noise level in this recording simply by computing the standard deviation of those segments (line 46), then computes and prints out the peak-to-peak S/N ratio and the RMS (root mean square) S/N ratio.

PeakToPeak_SignalToNoiseRatio = 143.7296

RMS_SignalToNoiseRatio = 12.7966

The frequency distribution of the noise is also determined (lines 60-61) and shown in the figure on the left, using the PlotFrequencySpectrum function, or you could have used iSignal in the frequency spectrum mode (Shift-S). The spectrum of the noise shows a strong component very near 60 Hz, which is almost certainly due to power line pickup (the recording was made in the USA, where AC power is 60Hz); this suggests that better shielding and grounding of the electronics might help to clean up future recordings. The lack of strong components at 100 Hz and above suggests that the vocal sounds have been effectively suppressed at this threshold setting. The script can be applied to other sound recordings in WAV format simply by changing the file name and time axis in lines 8 and 9.

Appendix T. Dealing with wide ranging signals: segmented processing functions (added November 2016)

To facilitate the inspection of very large and complex signals, it's useful to be able to "zoom in" to different parts of the x-axis range, which can be done with the interactive Matlab tools iSignal, iPeak, and ipf.m or by the Matlab/Octave function segplot.m. Sometimes an experimental signal will vary so much across its x-axis range that it's impossible to find a single setting for operations like smoothing or peak detection that is optimized for all regions of the signal. It's always possible to break up the signal into pieces and treat each separately, for example using segplot, but in some cases it's easier to use a single segmented application over the entire signal. That's the idea behind the Matlab/Octave functions and the Excel spreadsheet templates in this section.

SegmentedSmooth.m, illustrated on the right, is a segmented variant of fastsmooth.m, which can be useful if the widths of the peaks or the noise level varies substantially across the signal. The syntax is that same as fastsmooth.m, except that the second input argument "smoothwidths" can be a vector: SmoothY = SegmentedSmooth (Y, smoothwidths,type,ends). The function divides Y into a number of equal-length regions defined by the length of the vector 'smoothwidths', then smooths each region with a smooth of type 'type' and width defined by the elements of vector 'smoothwidths'. In the simple example in the figure on the right, smoothwidths=[31 52 91], which divides up the signal into three regions and smooths the first region with smoothwidth 31, the second with 51, and the last with 91. Any number and sequence of smooth widths can be used. If the peak widths increase or decrease regularly across the signal, you can calculate the smoothwidths vector by giving only the number of segments ("NumSegments") , the first value, "startw", and the last value, "endw", like so:

wstep=(endw-startw)/NumSegments;
smoothwidths=startw:wstep:endw;

Type "help SegmentedSmooth" for other examples examples. DemoSegmentedSmooth.m demonstrates the operation with different signals consisting of noisy variable-width peaks that get progressively wider, like the figure on the right. FindpeaksSL.m is the same thing for Lorentzian peaks.

SegmentedSmoothTemplate.xlsx is a segmented multiple-width data smoothing spreadsheet template, functionally similar to SegmentedSmooth.m. In this version there are 20 segments. SegmentedSmoothExample.xlsx is an example with data (graphic). A related spreadsheet GradientSmoothTemplate.xlsx (graphic) performs a linearly increasing (or decreasing) smooth width across the entire signal, given only the start and end values, automatically generating as many segments are are necessary. Of course, as is usual with spreadsheets, you'll have to modify these templates for your particular number of data points, usually by inserting rows somewhere in the middle and then drag-copying down from above the insert, plus you may have to change the x-axis range of the graph. (In contrast, the Matlab/Octave functions do that automatically).

Segmented Fourier filter. SegmentedFouFilter.m is a segmented version of  FouFilter.m, which applies different center frequencies and widths to different segments of the signal. The syntax is the same as FouFilter.m except that the two input arguments 'centerFrequency' and 'filterWidth' must be vectors with the values of centerFrequency of filterWidth for each segment. The signal is divided equally into a number of segments determined by the length of centerFrequency and filterWidth, which must be equal. Type 'help SegmentedFouFilter' for help and examples.

findpeaksSG.m is a variant of the findpeaksG function, with the same syntax, except that the four peak detection parameters can be vectors, dividing up the signal into regions that are optimized for peaks of different widths. Any number of segments can be declared, based on the lengths of the SlopeThreshold input argument. (Note: you need only enter vectors for those parameters that you want to vary between segments; to allow any of the other peak detection parameters to remain unchanged across all segments, simply enter a single scalar value for that parameter; only the SlopeThreshold must be a vector). In the example shown on the left, the script TestPrecisionFindpeaksSG.m creates a noisy signal with three peaks of widely different widths, measures the peak positions, heights and widths of each peak using findpeaksSG, and prints out the percent relative standard deviations of parameters of the three peaks in 100 measurements with independent random noise. With 3-segment peak detection parameters, findpeaksSG reliably detects and accurately measures all three peaks. In contrast, findpeaksG, tuned to the middle peak (using line 26 instead of line 25), measures the first and last peaks poorly. You can also see that the precision of peak parameter measurements gets progressively better (smaller relative standard deviation) the larger the peak widths, simply because there are more data points in those peaks. (You can change any of the variables in lines 10-18).

findpeaksSb.m is a segmented variant of the findpeaksb.m function. It has the same syntax as findpeaksSb, except that the arguments "SlopeThreshold", "AmpThreshold", "smoothwidth", "peakgroup", "window", "PeakShape", "extra", "NumTrials", "autozero", and "fixedparameters" can all be optionally scalars or vectors with one entry for each segment. This allows the function to handle widely varying signals with peaks of very different shapes and widths and backgrounds. In the example on the right, the Matlab/Octave script DemoFindPeaksSb.m creates a series of Gaussian peaks whose widths increase by a factor of 25 and that are superimposed in a curved baseline with random white noise that increases gradually across the signal. In this example, four segments are used, changing the peak detection and curve fitting values so that all the peaks are measured accurately.

SlopeThreshold = [.01 .005 .002 .001];
AmpThreshold = 0.7;
SmoothWidth = [5 15 30 35];
FitWidth = [10 12 15 20];
windowspan = [100 125 150 200];
peakshape = 1;
autozero = 3;

The script also computes the relative percent error of the measurement of peak position, height, width, and area for each peak.

measurepeaks.m is an automatic peak detector peaks of arbitrary shape. It is based on findpeaksSG, with which it shares the first 6 input arguments; the four peak detection arguments can be vectors to accommodate signals with peaks of widely varying widths. It returns a table containing the peak number, peak position, absolute peak height, peak-valley difference, perpendicular drop area, and tangent skim area of each peak it detects. If the last input argument ('plots') is set to 1, it will plot the entire signal with numbered peaks and will also plot the individual peaks with the peak maximum (red), valley points (magenta), and tangent lines (cyan), as shown on the right. The peak height and x-position is indicated by the red circle, the perpendicular drop area is the total area measured between the two magenta vertical lines down to zero, and the tangent skim area is the area between the cyan baseline and the blue peak (which compensates for a flat baseline). Type "help measurepeaks" and try the examples there or run testmeasurepeaks.m (graphic animation). The related functions autopeaks.m and autopeaksplot.m are similar, except that the four peak detection parameters can be omitted and the function will calculate estimated initial values.

The script HeightAndArea.m tests the accuracy of peak height and area measurement with signals that have multiple peaks with variable width, noise, background, and peak overlap. Generally, the values for absolute peak height and perpendicular drop area are best for peaks that have no background, even if they are slightly overlapped, whereas the values for peak-valley difference and for tangential skim area are better for isolated peaks on a straight or slightly curved background. Note: this function uses smoothing (specified by the SmoothWidth input argument) only for peak detection; it performs its measurements on the raw unsmoothed y data. If the raw data are noisy, it may be best to smooth the y data yourself before calling measurepeaks.m, using any smooth function of your choice.

Other segmented functions. The same segmentation code used in SegmentedSmooth.m (lines 53-65) can be applied to other functions simply by editing the first line in the first for/end loop (line 59) to refer to the function that you want to apply in a segmented fashion. For example, segmented peak sharpening can be useful when a signal has multiple peaks that vary in width, and segmented deconvolution can be useful when a signal has multiple peaks that vary in width or tailing vary substantially across the signal: SegExpDeconv(x,y,tc) deconvolutes y with a vector of exponential functions whose time constants are specified by the vector tc. SegExpDeconvPlot.m is the same except that it plots the original and deconvoluted signals and shows the divisions between the segments by vertical magenta lines.

Appendix U. Measurement Calibration (Added December, 2016)

Most scientific measurements involve the use of an instrument that actually measures something else and converts it to the desired measure. Examples are simple weight scales (which actually measure the compression of a spring), thermometers (which actually measure thermal expansion), pH meters (which actually measure a voltage), and devices for measuring hemoglobin in blood or CO2 in air (which actually measure the intensity of a light beam). These instruments are single-purpose, designed to measure one quantity, and automatically convert what they actually measure into the the desired quantity and display it directly. But to insure accuracy, such instruments must be calibrated, that is, used to measure one or more calibration standards of known accuracy, such as a standard weight or a sample that is carefully prepared to a known temperature, pH, or sugar content. Most are pre-calibrated at the factory for the measurement of a specific substance in a specific type of sample.

Analytical calibration. General-purpose instrumental techniques that are used to measure the quantity of many different chemical components in unknown samples, such as the various kinds of spectroscopy, chromatography, and electrochemistry, or combination techniques like "GC-mass spec", must also be calibrated, but because those instruments can be used to measure a wide range of compounds or elements, they must be calibrated by the user for each substance and for each type of sample. Usually this is accomplished by carefully preparing (or purchasing) one or more "standard samples" of known concentration, such as solution samples in a suitable solvent. Each standard is inserted or injected into the instrument, and the resulting instrument readings are plotted against the known concentrations of the standards, using least-squares calculations to compute the slope and intercept, as well as the standard deviation of the slope (sds) and intercept (sdi). Then the "unknowns" (that is, the samples whose concentrations are to be determined) are measured by the instrument and their signals are converted into concentrations with the aid of the calibration curve. If the calibration is linear, the sample concentration C of any unknown is given by (A - intercept) / slope, where A is the measured signal (height or area) of that unknown. The predicted standard deviation in the sample concentration is C*SQRT((sdi/(A-intercept))^2+(sds/slope)^2) by the rules for propagation of error. All these calculations can be done in a spreadsheet, such as CalibrationLinear.xls.

In some cases the thing measured can not be detected directly but must undergo a chemical reaction that makes it measurable; in that case the exact same reaction must be carried out on all the standard solutions and unknown sample solutions, as demonstrated in this animation (thanks to Cecilia Yu of Wellesley College).

Various calibration methods are used to compensate for problems such as random errors in standard preparation or instrument readings, interferences, drift, and non-linearity in the relationship between concentration and instrument reading. For example, the standard addition calibration technique can be used to compensate for multiplicative interferences. I have prepared a series of "fill-in-the-blanks" spreadsheet templates for various calibrations methods, with instructions, as well as a series of spreadsheet-based simulations of the error propagation in widely-used analytical calibration methods, including a step-by-step exercise.

Calibration and signal processing. Signal processing often intersects with calibration. For example, if you use smoothing or filtering to reduce noise, or differentiation to reduce the effect of background, or measure peak area to reduce the effect of peak broadening, or use modulation to reduce the effect of low-frequency drift, then you must use the exact same signal processing for both the standard samples and the unknowns, because the choice of signal processing technique can have a big impact on the magnitude and even on the units of the resulting processed signal (as for example in the derivative technique and in choosing between peak height and peak area).

PeakCalibrationCurve.m is an Matlab/Octave example of this. This script simulates the calibration of a flow injection system that produces signal peaks that are related to an underlying concentration or amplitude ('amp'). In this example, six known standards are measured sequentially, resulting in six separate peaks in the observed signal. (We assume that the detector signal is linearly proportional to the concentration at any instant). To simulate a more realistic measurement, the script adds four sources of "disturbance" to the observed signal:

a. noise - random white noise added to all the signal data points, controlled by the variable "Noise";
b. background - broad curved background of random amplitude, tilt, and curvature, controlled by "background";
d. a final smoothing before the peaks are measured, controlled by "FinalSmooth".

The script uses measurepeaks.m as an internal function to determine the absolute peak height, peak-valley difference, perpendicular drop area, and tangent skim area. It plots separate calibration curve for each of these measures in figure windows 2-5 against the true underlying amplitudes (in the vector "amp"), fitting the data to a straight line and computing the slope, intercept, and R2. (If the detector response were non-linear, a quadratic or cubic least-square would work better). The slope and intercept of the best-fit line is different for the different methods, but if the R2 is close to 1.000, a successful measurement can be made. (If all the random disturbances are set to zero in lines 33-36, the R2 values will all be 1.000. Otherwise the measurements will not be perfect and some methods will result in better measurements - R2 closer to 1.000 - than others). Here is a typical result:

Peak Position PeakMax Peak-val. Perp drop Tan skim
1    101.56   1.7151  0.72679   55.827    11.336
2    202.08   2.1775  1.2555    66.521    21.425
3    300.7    2.9248  2.0999    58.455    29.792
4    400.2    3.5912  2.949     66.291    41.264
5    499.98   4.2366  3.7884    68.925    52.459
6    601.07   4.415   4.0797    75.255    61.762
R2 values:        0.9809  0.98615   0.7156    0.99824

In this case, the tangent skim method works best, giving a linear calibration curve (shown on the left) with the highest R2.

In this type of application, the peak heights and/or area measurements do not actually have to be
accurate, but they must be precise. That's because the objective of an analytical method such as flow injection or chromatography is not to measure peak heights and areas, but rather to measure concentrations, which is why calibration curves are used. Figure 6 shows the correlation between the measured tangent skim areas and the actual true areas under the peaks in the signal shown above, right; the slope of this plot shows that the tangent skim areas are actually about 6% lower that the true areas, but that does not make a difference in this case because the standards and the unknown samples are measured the same way. In some other application, you may actually need to measure the peak heights and/or areas accurately, in which case curve fitting is generally the best way to go.

If the peaks partly overlap, the measured peak heights and areas may be effected. To reduce the problem, it may be possible to reduce the overlap by using peak sharpening methods, for example the derivative method, deconvolution or the power transform method, as demonstrated by the self-contained Matlab/Octave function PowerTransformCalibrationCurve.m.

Curve fitting the signal data. Ordinary in curve fitting, such as the classical least squares (CLS) method and in iterative nonlinear least-squares, the selection of a model shape is very important. But in quantitative analysis applications of curve fitting, where the peak height or area measured by curve fitting is used only to determine the concentration of the substance that created the peak by constructing a calibration curve, having the exact model shape is surprisingly uncritical. The Matlab/Octave script PeakShapeAnalyticalCurve.m shows that, for a single isolated peak whose shape is constant and independent of concentration, if the wrong model shape is used, the peak heights measured by curve fitting will be inaccurate, but that error will be exactly the same for the unknown samples and the known calibration standards, so the error will "cancel out" and the measured concentrations will still be accurate, provided you use the same inaccurate model for both the known standards and the unknown samples. In the example shown on the right, the peak shape of the actual peak is Gaussian (blue dots) but the model used to fit the data is Lorentzian (red line). That's an intentionally bad fit to the signal data; the R2 value for the fit to the signal data is only 0.962 (a poor fit by the standards of measurement science). The result of this is that the slope of the calibration curve (shown below on the left) is greater than expected; it should have been 10 (because that's the value of the "sensitivity" in line 18), but it's actually 10.867 in the figure on the left, but nevertheless the calibration curve is still linear and its R2 value is 1.000, meaning that the analysis should be accurate. (Note that curve fitting is actually applied twice in this type of application, once using iterative curve fitting to fit the signal data, and then again using polynomial curve fitting to fit the calibration data).

Despite all this, it's still better to use as accurate a model peak shape as possible for the signal data, because the percent fitting error of the signal fit can be used as a warning that something unexpected is wrong, such as the appearance of an interfering peak from a foreign substance.

Appendix V. Numerical precision of computer software (added January, 2017)

Computations carried out by computer software with non-integer numbers have a natural limit to the precision with which they can be represented; for example, the number 1/3 is represented as 0.3333333... with a large but finite number of "3"s, whereas theoretically there are an infinite string of "3"s in the decimal representation of 1/3. It's the same with irrational numbers such as "pi" and the square root of 2; they can never have a exact decimal representation. In principle, these tiny errors could accumulate in a very complex multiple-step calculation and could conceivably become a significant source of error. In the vast majority of applications to scientific computation, however, these limits will be minuscule compared to the errors and random noise that is already present in most real-world measurements. But it is best to know what those numerical limits are, under what circumstances they might occur, and how to minimize them.

Multicomponent spectroscopy. Probably the most common calculation where numerical precision is an issue is in the matrix methods that are used in multicomponent spectroscopy. In the derivation of the Classical Least Squares (CLS) method, the matrix inverse is used to solve large systems of linear equations. The matrix inverse is a standard function in programming languages such as Matlab, Octave, Wolfram's Mathematica, and in spreadsheets. But if you use that function in Matlab, the function name ("inv") is automatically flagged by the editor with the following warning:

"For solving a system of linear equations, the inverse of a matrix is primarily of theoretical value. Never use the inverse of a matrix to solve a linear system Ax=b with x=inv(A)*b, because it is slow and inaccurate.... Instead of multiplying by the inverse, use matrix right division (/) or matrix left division (\). That is: Replace inv(A)*b with A\b...[and]...replace b*inv(A) with b/A"

"Slow and inaccurate"? OK, now I'm scared. But how serious a problem is this really in actual applications? To answer that question, the Matlab/Octave script RegressionNumericalPrecisionTest.m applies the CLS method to a mixture of two very closely-spaced noiseless overlapping Gaussian peaks (blue and green lines in the figure on the left) using three different mathematical formulations of the least-squares calculation that give different results. (The peaks must not have any added random noise, because here we are focusing on the errors caused by the computer itself and not, as in everywhere else in this essay, on external physical noise). The difficulty of such a measurement depends on the ratio of the peak separation to the peak half-width; small ratios mean very highly overlapped peaks which are hard to measure accurately. In this example the separation-to-width ratio is 0.0033, which is very small (i.e. difficult); this is equivalent to trying to measure a mixture of two absorption spectroscopy peaks that are 300 nm wide and separated by only 1 nm, a tiny difference that you wouldn't even notice with the naked eye. The results of this script shows that the matrix inverse ("inv") method does indeed have an error thousands of times larger than the method using matrix division, but even the matrix division error is still very small. Practically, the difference between these methods is unlikely to be significant when applied to real experimental data, because even the tiniest bit of signal instability (like that caused by small changes in the temperature of the sample or random noise in the signal, which you can simulate in line 15) produces a far greater error. So basically that warning message is the voice of a mathematician or computer programmer, not an experimental scientist.

Analog-to-digital resolution. Potentially more significant than the computer's numerical resolution is the resolution of the analog-to-digital converter (ADC) that is used to convert analog signals (e.g. voltage) to a number. The Matlab/Octave script RegressionADCbitsTest.m demonstrates this, with two slightly overlapping Gaussian bands with a large (50-fold) difference in peak height (blue and green lines in the figure on the right); in this case the separation to width ratio is 0.25, much larger (i.e. easier) than the previous example. For this example, the simulation shows that the relative percent errors of peak height measurement is 0.19% for the larger peak and 6.6% for the smaller peak. You can change the resolution of the simulated analog-to-digital converter in number of bits (line 9). The amplitude resolution of an  analog-to-digital converter is 2 raised to the power of the number of bits. Common ADC resolutions are 10, 12, and 14 bits, corresponding to resolutions of one part in 1024, 4096 and 16384, respectively. Of course, the effective resolution for the smaller peak in this case is 50 times less, and you can't simply turn up the amplification on the smaller peak without overloading the ADC for the larger one.

Surprisingly, if most of the noise in the signal is this kind of digitization noise, it may actually help to
add some additional random noise (specified in line 10 in this script), as was seen in appendix I.

Differentiation. Another application where you can see numerical precision noise is in differentiation, which involves the subtraction of very nearly equal adjacent numbers in a data series. The self-contained Matlab/Octave function DerivativeNumericalPrecisionDemo.m shows how the numerical precision limits of the computer effects the first through fourth derivatives of a Gaussian band that is very finely sampled (over 16,000 points in the half-width in this case) and that has no added noise. The plot on the left shows the four waveforms in Figure 1, left, and their frequency spectra in Figure 2, below. The numerical precision limits of the computer creates random noise at very high frequencies, which are emphasized by differentiation. In the frequency spectra below, the big low-frequency bump near a frequency of 10-2 is the signal and everything above that is numerical noise. The lower-order derivatives are seldom a problem, but by the time you reach the fourth derivative, those noise frequencies approach the strength of the signal frequencies, as you can see in the frequency spectrum of the fourth derivative in the lower right. But this noise is only a very high-frequency noise, so smoothing with as little as a 3-point sliding average smooth removes most of it (click to view).

The numerical errors of derivative calculations can be reduced somewhat by using the Fourier Transform method (reference 88), but the errors are spread over the entire signal period.

An alternative derivative method based on the Fourier Transform has slightly lower numerical errors but is seldom used in practice (reference 88).) The fundamental difference between the two is that the finite difference method works locally, on one small segment of the data at a time, whereas the FT method works globally, because each frequency in the Fourier representation extends throughout the entire time domain. The Matlab/Octave script FDvsFTderivative.m compares the numerical errors of finite difference (FD) and Fourier transform (FT) methods of differentiation. It creates a very broad, finely sampled Gaussian peak and then computes its fourth derivative both ways. (The noise is caused only by software numerical resolution limitations). The result is that the numerical errors are lower for the FT method near the peak but are much greater far from the peak center.

Smoothing
. Finally, there might potentially be a numerical problem with the fastsmooth algorithm, covered in the section on smoothing, because it is a recursive algorithm that uses the results of a previous step in the calculation to calculate the next step. The Matlab/Octave script FastsmoothNumericalPrecisionTest.m demonstrates the numerical precision of fastsmooth.m. Even for 4000-point Gaussian smooth applied to a 100,000 point signal, the numerical noise relative standard deviation is only .00027%, and most of that occurs in the edges of the signal (first 4000 and last 4000 points); the error over 90% of the signal is orders of magnitude less. This would be a negligible problem in most cases.

Appendix W: The Raspberry Pi (added March, 2017)
Signal processing does not necessarily require expensive computer systems. The Raspberry Pi is a remarkably tiny and inexpensive
computer board that is about the size of a deck of playing cards and costs as little as \$38! Version 3 B+ in early 2018 has a 1.4GHz 64-bit quad-core ARMv8 CPU with 1GB RAM, 4 USB ports, 40 general-purpose input-output pins, HDMI port, 300 mbps Ethernet port, audio jack and composite video, video camera and display interfaces, micro SD card slot for mass storage, VideoCore IV 3D graphics core, 802.11ac Wireless LAN, and Bluetooth 4.2. You can get it with a bunch of installed software, including a version of the Linux operating system, a simple but effective graphical desktop modeled on Windows, a Web browser, the complete LibreOffice suite, Wolfram's Mathematica (screen shot), several programming languages, a bunch of games (including Minecraft), and various utilities. All of these are installed by default on the Raspberry Pi's operating system installer. (There is even a smaller and cheaper model called the Zero that costs just \$5 for the card itself or \$10 with memory card and power supply; it has less memory and smaller connectors that the other models, but because of its low cost and small size, this model is ideal in situations where it might be damaged or lost, as in rocket or balloon borne experiments).

For scientific data acquisition and signal processing applications, the Pi version of Linux has all the "usual" UNIX terminal commands for data gathering, searching, cleaning and summarizing. In addition, there are many add-on libraries for Python, including SciPi, NumPy, and Matplotlib, all of which are free downloads. Allen B. Downey's 164-page PDF book "Think DSP" has many examples of Python code in traditional engineering applications. There are many add-on hardware devices available at low cost, including video cameras and a piggyback sensor board that reads and displays sensor data from several built-in sensors: gyroscope, accelerometer, magnetometer, barometer, temperature, relative humidity. (It's based on the same hardware that is currently in orbit on the International Space Station).

My signal processing spreadsheets run just fine on the version of
Calc that comes with the Pi, as do the Calibration worksheets and my analytical method models (screen shot).

For school applications, Element14 markets a Learn to Program Pack Starter Kit that includes a license for student version of Matlab for Windows or Macintosh and a Raspberry Pi 3 with MicroSD card, power supply, and enclosure (Matlab does not currently run directly on the Pi but can communicate with it). Even cheaper, Octave 3.6 can run directly on a Raspberry Pi; the screen below shows Octave 3.6 running within the Pi's built-in graphical user interface (showing off the 3D graphic functions "mesh" and "surf").

There are many laboratory and field applications, especially in combination with an Arduino micro-controller. However, the slowness of Octave (compared to Matlab), combined with the modest speed of the Raspberry Pi 3 may be limiting in some applications. (Altogether it's about 50 - 1000 times slower than Matlab on a contemporary desktop computer). Python, with comes with the Pi, is faster than Octave. Alternatively, it's possible to communicate with Raspberry Pi hardware remotely from a faster computer running MATLAB using the MATLAB Support Package for Raspberry Pi Hardware for Matlab R2016b or later, using one or more remotely accessed Raspberry Pi's for experiment control and data acquisition and local storage and doing the heavy-duty number crunching on the main computer. Or you could simply have the Pi save data or results in a shared folder that is accessed via WiFi from another computer.

Python is the primary programming language that comes with the Raspberry Pi. (This language is quite different than the older languages traditionally used by scientists, such as Fortran or Pascal, and it might be confusing for people without a computer-science background). See this comparison of Matlab and Python for signal processing applications. Here's a simple example of real-time data acquisition and plotting on a Raspberry Pi, using the commercially available add-on Sense Hat board with a program written in Python, measuring temperature as a function of time (real-time animation). If you don't have a Sense Hat, here's a version of the same Python program that plots the running average of a random number between 0 and 200 using the same autoscaling graphic technique, displaying a result that gradually settles down closer and closer to the average of 100 the longer you let it run. (This one will also run in WinPyton on a plain Windows PC). Compare this to the Matlab/Octave script that does the same thing. The execution time of the WinPython and Matlab versions is essentially the same, but in this particular case the Matlab/Octave script is shorter.

Other competing systems include the BeagleBoard and the LattePanda, a tiny Windows-10 computer board with 2 Gbytes RAM and 32 Gbytes flash storage. Many similar products are available.

Appendix X: Batch processing (added June, 2017)
In situations where you have a large volume of similar types of data to process, it's useful to automate the process. Let's assume that you have already acquired data in the form of a large number of text or numerical data files of some standardized format that are stored in a known directory (folder) somewhere on your computer. For example they might be ASCII .txt or .csv files with the independent variable ('x') in the first column and one or more dependent variables ('y') in the other columns. There may be a variable number of data files, and their file names and length may be unknown and variable, but the data format is consistent from file to file. You could write a Matlab script or function that will process those files one-by-one, but you want the computer to go through all the data files in that directory automatically, determine their file names, load each into the variable workspace, apply the desired processing operations (peak detection, deconvolution, curve fitting, whatever), and then collect all the resulting terminal window output, each labeled with the file name, add the results to a growing "diary" file, and then go on to the next data file. Ideally, the program should not stop if it encounters any kind of fatal error with one of the data files; rather, it should just skip that one and go on to the next

BatchProcess.m is a Matlab/Octave example of just such an automated process that you can use as a framework for your applications. You need only change three things to make this work for you:

(a) the directory name where the data are stored on your computer - (DataDirectory) in line 11;
(b) the directory name where the Matlab signal processing functions are stored on your computer
- (FunctionsDirectory) in line 12; and

(c) the actual processing functions that you wish to apply to each file (which in this example perform
peak fitting using the "peakfit.m" function
in lines 34 - 41, but could be anything).

When it starts, the routine opens a "diary" file in line 21, located in the FunctionsDirectory, with the file name "BatchProcess<date>.txt" (where <date> is the current date, e.g. 12-Jun-2017). This file captures all the terminal window output during processing - in this example, I am using the peakfit.m function that generates a FitResults matrix (with  Peak#, Position, Height, Width, and Area of the best-fit model), and the percent fitting error and R2 value ,for each data file in that directory. (Subsequent runs of the program on the same date are appended to this dairy file. On each subsequent day, a new file is begun for that day). You can also optionally save some of the variables in the workspace to data files; add a "save" function after the processing and before the "catch me" statement (type "help save" at the command prompt for options).

This program uses a couple of coding techniques that are especially useful in automated file processing. It uses the "function forms" of the commands "ls" (line 13), "diary" (line 21), and "load" (line 29) to allow then to accept variables computed within the program. It also uses the "try/catch/end" structure (lines 28, 47, 49), which prevents the program from stopping if it encounters an error on one of the data files; rather, it reports the error for that file and skips to the next one. Here's the general form of that structure:

TRY
% The statements between 'TRY' and '
CATCH ME' will be
%
attempted and skipped if they generate an error.
CATCH ME
% If an error is generated,
the statements between
%
'CATCH ME' and 'END' will be executed instead.
END

After running this script, the "BatchProcess..." diary file will contain all the terminal output. Here's an except from a typical diary file, in which the first two data files in the directory yielded errors that were caught and skipped by the "try/catch/end" structure , but the third one ("2016-08-05-RSCT-2144.txt") and all the following ones worked normally and reported the results of the peak fitting operations:

Error with file number 1.
Error with file number 2.

3: 2016-08-05-RSCT-2144.txt
Peak#  Position  Height     Width       Area
1   6594.2    0.1711    0.74403    0.13551
2   6595.1    0.16178    0.60463    0.1041
% fitting error    R2
2.5735      0.99483

4: 2016-09-05-RSCT-2146.txt
Peak#   Position  Height    Width      Area
1    6594.7    0.11078   1.4432    0.17017
2    6595.6    0.04243   0.38252   0.01727
% fitting error    R2
4.5342      0.98182

5: 2016-09-09-RSCT-2146.txt
Peak#   Position  Height     Width      Area
2    6594     0.05366    0.5515     0.0315
1    6594.9   0.1068     1.2622     0.1435
% fitting error    R2
3.709      0.98743

6: .... etc....

You could also optionally import the dairy file into Excel by opening an Excel worksheet, click on a cell, click Data > From Text, select the diary file, click to specify that spaces are to be used as column separators, and click Import. This will put all the collected terminal output into that spreadsheet. Additionally you might want to save the workspace variables (e.g. as a .mat file).

Appendix Y: Real-time signal processing (added February 2018)
All of the signal processing techniques covered so far make the assumption that you have acquired and stored the data in computer memory before beginning processing. In some cases, however, it is necessary to do the signal processing in "real time", that is, point-by-point as the data are acquired from the sensor or instrument. That requires some modification of the software, but the main ideas still apply. In this section we will look at ways to perform real-time data plotting, smoothing, differentiation, peak detection, harmonic analysis (frequency spectra), and Fourier filtering. Because the details of the data acquisition itself varies with each individual experimenter and instrumental setup, these demonstration scripts will simulate real-time data so that you can run them immediately on your computer without additional hardware. I'll do this in either of two ways:
(a) by using mouse-clicks to generate each data point, using Matlab's "ginput" function, or
(b) by pre-calculating simulated data and then accessing it point-by-point in a loop.
The first method is illustrated by the simple script realtime.m. When you run this script, it displays a graphical coordinate system; position your mouse pointer along the y (vertical) axis and click to enter data points as you move the mouse pointer up and down. The "ginput" function waits for each click of the mouse button, then the program records the y coordinate position and counts the number of clicks. Data points are assigned to the vector y (line 17), plotted on the graph as black points (line 18), and print out in the command window (line 19). The script realtimeplotautoscale.m is an expanded version that changes the graph scale as the data come in. If the number of data points exceeds 20 ('maxdisplay'), the x axis maximum is  re-scaled to twice that (line 32). If the data amplitude equals or exceeds ('maxy'), the y axis is re-scaled to 1.1 times the data amplitude (line 36).

The script realtimeplotautoscale2.m uses the second method to simulate real-time data, using pre-calculated data (loaded from disk in line 13) that is accessed point-by-point in lines 25 and 26 (animation shown on the right).

The script realtimeplotdatedtime.m demonstrates one way to use Matlab's 'clock' function to record the data and time of each data point that is acquired by clicking. You could also have the computer control the time of data acquisition by reading the clock in a loop until the desired time and date arrives, then take a data point. Of course, a Windows machine is not ideal for high-speed, precisely-timed data acquisition, because there are typically so many interrupts and other processes going on in the background, but it's adequate for low-speed applications. For higher speeds, specialized hardware and software is available.

Smoothing. The script RealTimeSmoothTest.m demonstrates real-time smoothing, plotting the raw unsmoothed data as a black line and the smoothed data in red. In this case the script pre-calculates simulated data in line 28 and then accesses the data point-by-point in the processing 'for' loop (lines 30-51). The total number of data points is controlled by 'maxx' in line 17 (initially set to 1000) and the smooth width (in points) is controlled by 'SmoothWidth' in line 20. (To do this with real time data from your sensor, comment out line 29 and replace line 32 with the code that acquires one data point from your sensor).

As you can see in the screen animation on the left, the smoothed data (in red) is delayed compared to the raw data, because a smoothed data point can not be computed until a number of data points equal to the smooth width has been acquired - 21 points, or about 0.5 seconds in this example. (However, knowing the smooth width, you can correct the recorded y-axis positions of signal features, such as maxima, minima, peaks, or inflection points). This particular example implements a sliding average smooth, but other smooth shapes can be implemented simply by uncommenting line 24 (rectangular smooth), 25 (triangular smooth), or 26 (Gaussian smooth), which requires that the functions 'triangle' and 'gaussian' be in the Matlab/Octave path.

A practical application of a sliding average smooth like this is in a control system where a noisy signal turns on a valve, switch or alarm signal whenever the signal exceeds a certain value; an example is shown in the figure on the right,
where the threshold value is 0.5 and the signal is so noisy that smoothing is required to prevent the signal from prematurely triggering the control. Too much smoothing, however, will cause an unacceptable delay in operation.

On a standard desktop PC (Intel Core i5 3 Ghz) running Windows 10 home, the smooth operation adds about
2 microseconds per data point to the data acquisition time (without plotting, PlottingOn=0) and 20 milliseconds per point (50 Hz) with point-by-point plotting (PlottingOn=1). With plotting off, the script acquires, smooths, and stores the smoothed data in the variable "sy" in real time, then plots the data only after data acquisition is complete, which is much faster than plotting in real time.

Differentiation. The script RealTimeSmoothFirstDerivative.m demonstrates real-time smoothed differentiation, using a simple adjacent-difference algorithm (line 47) and plotting the raw data as a black line and the first derivative data in red. The demonstration script RealTimeSmoothSecondDerivative.m
computes the smoothed second derivative by using a central difference algorithm (line 47). Both of these scripts pre-calculate the simulated data in line 28 and then accesses the data point-by-point in the processing loop (lines 31-52).  In both cases the maximum number of points is set in line 17 and the smooth width is set in line 20. Again, the derivatives are delayed compared to the original signal. Any derivative order can be calculated this way using the derivative coefficients in the Matlab/Octave derivative functions listed on Differentiation.html#Matlab.

Peak detection. The little script realtimepeak.m demonstrates simple real-time peak detection based on derivative zero-crossing, using mouse clicks to simulate data. Each time your mouse clicks form a peak (that is, go up and then down again), the program will register and label the peak on the graph and print out its n and y values.

Peak detected at n=13 and y=7.836

Peak detected at n=26 and y=1.707

I
n this case, a peak is defined as any data point that has lower amplitude points adjacent to it on both sides, which is determined by the nested 'for' loops in lines 31-36. Of course, a peak can not be registered until the point following the peak is recorded, because there is no way to predict ahead of time whether that point will be lower or higher than the previous point.

If the data are noisy, it's better to smooth the data stream before detecting the peaks, which is exactly what RealTimeSmoothedPeakDetection.m does, which
reduces the chance of false peaks due to random noise but has the disadvantage of delaying the peak detection further. Even better, the script RealTimeSmoothedPeakDetectionGauss.m uses the technique described on PeakFindingandMeasurement.htm#findpeaks; it locates the positive peaks in a noisy data set that rise above a set amplitude threshold ("AmpThreshold" in line 55), performs a least-squares curve-fit of a Gaussian function to the top part of the raw data peak (in line 58), identifies each peak (line 59), computes the position, height, and width (FWHM) of each peak from that least-squares fit, and prints out each peak found in the command window. The peak parameters are measured on the raw data, so they are not distorted by smoothing. If you look at the animation on the right, you can see that the "peak" label appears next to each detected peak just a fraction of a second after the top of the peak, but the actual peak times listed are not delayed. In this example, the actual peak times are x=500, 1000, 1100, 1200, 1400. (Also note that the first visible peak, at x=300, is not detected because it falls below the amplitude threshold, which is 0.1 in this case).

Peak detected at x=500.1705, y=0.42004, width= 61.7559
Peak detected at x=1000.0749, y=0.18477, width= 61.8195
Peak detected at x=1100.033, y=1.2817, width= 60.1692
Peak detected at x=1199.8493, y=0.36407, width= 63.8316
Peak detected at x=1400.1473, y=0.26134, width= 58.9345

The script additionally numbers the peaks and saves the peak parameters of all the peaks in a matrix called PeakTable, which you can interrogate as each peak is encountered if you are looking for particular peak patterns.  See PeakFindingandMeasurement.htm#UsingP for some ideas on using Matlab/Octave notation and functions to do this.

Peak sharpening. The script RealTimePeakSharpening.m demonstrates real-time peak sharpening using the second derivative technique. It uses pre-calculated simulated data in line 30 and then accesses the data point-by-point in the processing loop (lines 33-55). In both cases the maximum number of points is set in line 17, the smooth width is set in line 20, and the weighting factor (K1) is set in line 21. In this example on the left, the smooth width is 101 points, which accounts for the the delay in the sharpened peak compared to the original (about 1.12 seconds without real-time plotting and 3.8 second with real-time plotting).

Real-Time Frequency Spectrum. The script RealTimeFrequencySpectrumWindow.m computes and plots the Fourier frequency spectrum of a signal. Like the scripts above, it loads the simulated real-time data from a ".mat file" and then accesses that data point-by-point in the processing loop. A critical variable in this case is "WindowWidth", the number of data points taken to compute each frequency spectrum. The larger this number, the fewer the number of spectra that will be generated, but the higher will be the frequency resolution. On a standard desktop PC (Intel Core i5 3 Ghz running Windows 10 home), this script generates about 50 spectra per second with an average data rate (points per seconds) of about 50,000 Hz. Smaller spectra (i.e. lower values of WindowWidth) generate proportionally lower average data rates (because the signal stream is interrupted more often to calculate and graph a spectrum).

If the data stream is an audio signal, it's also possible to play the sound through the computer's sound system synchronized with the display of the frequency spectra; to do this, set PlaySound=1. Each segment of the signal is played, just before the spectrum of that segment is displayed, as shown on the right. The sound reproduction will not be not perfect, because of the slight delay while the computer computes and displays the spectrum before going on to the next segment. In this demonstration script, the data file is in fact an audio recording of an 8-second excerpt of the 'Hallelujah Chorus' from Handel's Messiah with a sampling rate of 8192 Hz, which is included in the Matlab distribution. The figure above shows one of the 70 spectra generated with a WindowWidth of 1024. You can adjust the argument of the 'pause' function for your computer to minimize this problem and to make the sound play at the correct pitch.

Real-Time Fourier Filter. The script RealTimeFourierFilter.m is a demonstration of a real-time Fourier filter. It pre-computes a simulated signal starting in line 38, then access the data point-by-point (line 56, 57), and divides up the data stream into segments to compute each filtered section. "WindowWidth" (line 55) is the number of data points in each segment. The larger this number, the fewer the number of segments that will be generated, but the higher will be the frequency resolution within each segment. On a standard desktop PC (Intel Core i5 3 Ghz running Windows 10 home), with a window width of 1000 points, this script generates about 35 filtered segments per second with an average data rate (points per second) of about 34,000 Hz. Smaller segments (i.e. lower values of WindowWidth) generate proportionally lower average data rate (because the signal stream is interrupted more often to calculate and graph the filtered spectrum). The result of applying the filter to each segment is displayed in real time during the data acquisition, and then at the end the script compares the entire raw signal input to the reassembled filtered output, shown the the figure on the left.

In this demonstration, a bandpass filter is used to detect a 500 Hz
sine wave (the frequency is controlled by 'f' in line 28) that occurs in the middle third of a very noisy signal (line 32), from about 0.7 sec to 1.3 sec. The 500 Hz sine wave is so weak it can not be seen at all in the raw signal (upper panel of the figure on the left). The filter center frequency (CenterFrequency) and width (FilterWidth) are set in lines 46 and 47.

Real-time classical least squares. The classical least squares technique can be applied in real time to 2D chromatography with array detectors that can acquire a complete spectrum multiple times per second over the entire chromatogram. This is explored in "Spectroscopy and chromatography combined: time-resolved Classical Least Squares".

To apply any of these examples to real-time data from your sensor or instrument, you need only the main processing 'for' loop, replacing the first lines after the 'for' statement with a call to a function that acquires a single point of raw data and assigns it to y(n). If you don't need the data plotted out point-by-point in real time, you can speed things up greatly by removing the "drawnow" statement at the end of the 'for' loop or by removing all the plotting code.

In the examples here, the output of the processing operation is used to plot or to print out the processed data point-by-point, but of course it could also be used as the input to another processing function or to a digital-to-analog converter or simply to trigger an alarm if certain specified results are obtained (e.g. if the signal exceeds a certain value for a specified length of time, or if a peak is detected at a specified position or height, etc).

## Z. Dealing with variable data arrays in spreadsheets

When applying spreadsheet templates of the type described in this book to your own data, it's often necessary to modify the templates to accommodate different numbers of data points or of components. This can be tedious to do, especially because you need to remember the syntax of each of the spreadsheet functions that you want to modify. This section describes ways to construct spreadsheets that automatically adapt to different data sets, without your taking the time and effort to modify the spreadsheet formulas for each case. This involves employing some less-commonly used built-in functions in Excel or OpenOffice Calc, such as MATCH, INDIRECT, COUNT, IF, and AND.

The MATCH function. In signal processing using spreadsheets, it's common to have x-y arrays of data of variable length, such as spectra (x=wavelength, y=absorbance or intensity) or chromatograms (x=time, y=detector response). For example, consider this small array of x and y values pictured in the spreadsheet fragment on the left. Spreadsheet formulas normally refer to cells by their row and column address, but for an x-y data set like this, it's more natural to refer to a data point by its independent variable x, rather than its row and column address. For example, suppose you want to select the data point where x=2, irrespective of what cells they inhabit. You can do that with the MATCH function. For example, if you set cell B2 to the desired x value (2), then the cell formula MATCH(B2,A5:A11) + ROW(A5) will return the row number of that point, which is 6 in this case. Later, if you were to move or expand this table, by dragging it or by inserting or deleting rows or columns, the spreadsheet will automatically adjust the MATCH function to compensate, returning the new row number of the requested point.

The INDIRECT function. The usual way to reference the value in a cell is to specify its row and column address. For example, the small array of x and y values pictured above, to refer to the contents of column B, row 6, you could write "=B6", which in this case will evaluate to 5.9. This is referred to as "direct" addressing. In contrast, to use "indirect" addressing you can write "=INDIRECT("B"&A1)", then put the number "6" in cell A1. The "&" character is simply "glue" that joins "B" to the contents of A1, so in that case "B"&A1 evaluates to "B6" and the result is the same as before: the contents of cell B6, which is 5.9. However, if you change cell A1 to 9, then "B"&A1 would evaluate to "B9", and the result would be the contents of cell B9, which is 9.1. In other words, the indirect function allows the addresses of cells to be calculated within the spreadsheet rather than being typed in as a fixed number. This makes it possible for spreadsheets to adjust their own addresses based on a calculated result, for example to adapt their calculations to fit the number of data points in that particular data set.

These examples were done in what is called the "A1" reference style, where the columns are referred to by letters; it's also possible to use the "R1C1" reference style, where both the and the rows columns are referred to by numbers. For ex"ample =INDIRECT("R"&A2&"C"&A1,FALSE)", with the row number in A2 and the columns number in A1. (The "FALSE" just means that the "R1C1" reference style is used).
The "R1C1" reference style allows both the row and column address to be expressed as numbers that can be calculated within the spreadsheet,.

You can use the same technique to compute ranges of cell addresses. For example, you could compute the sum of the y values in column B, rows 5 to 11, by writing SUM(B5 : B11) by direct addressing. But suppose you wanted to compute the sum of all the numbers in column B between a variable first row and variable last row. If you put the first row number in A1 and the last row number in row A2, the address of the first cell would be "B"&A1 and the address of the last cell would be "B"&A2. So you would form the range of cell addresses by using "&" to glue together those two addresses, with a colon in-between ("B"&A1&" : B"&A2). The sum would be SUM(INDIRECT("B"&A1&" : B"&A2)), which is 56. Yes, it's longer, but the advantage over direct addressing is that you can adjust the range by changing just two cells rather that retyping the formula. It's the same for other functions that need a range of cells, such as AVERAGE, MAX, MIN, STDEV, etc. For examples of its use in signal processing, see
VariableSmooth.xlsx (screen image).

For functions that require two ranges, separated by a comma, you can use the same technique. For example. suppose you want to compute the slope of the linear regression line between the x values in column A and the y values in column B in the spreadsheet except in the previous figure, using the built-in SLOPE function. SLOPE requires two ranges, first the dependent (y) values and them the independent x values. By direct addressing, the slope is SLOPE(B5 : B11,A5 : A11) . By indirect addressing, you need two separate "indirect" functions, one for each range, separated by a comma. Here's what it looks like all together: SLOPE(INDIRECT("B"&A1&" : B"&A2), INDIRECT("A"&A1&" : A"&A2)), where the x values are in column A, the y values in column B, and the first and last row numbers are in cells A1 and A2 respectively. It works exactly the same for the two related functions that calculate the INTERCEPT and RSQ (the R2 value) of the regression line. I admit that the formula is confusing to read at first, but it works. Just break it down into its parts.

A working example. An example of the use of the MATCH and INDIRECT functions working together is demonstrated in the spreadsheet "SpecialFunctions.xlsx" (Graphic), which has a larger table of x-y data stored in columns A and B, starting in row 7. The idea here is that you can select a limited range of x values to work with by typing in the lowest x and the highest x value in cells B2 and B3, the two cells with a yellow background, shown on the left. The spreadsheet uses the MATCH functions in cells F2 and F3 to compute the corresponding row numbers, which are then used in the INDIRECT functions in the "Properties of selected data range" section to compute the maximum, average, and average of x and of y, and also the slope, intercept, and R2 values of the y vs x linear regression line over that selected x interval. The regression line, fitting only the data from x=20 to 29, is shown in red in the graph on the right, superimposed on the complete data set (blue dots). By simply changing the x-axis limits in cells B2 and B3, the spreadsheet and the graph re-calculates, without your having to edit any of the cell formulas. Try it yourself. (Hint: Cells with red mark in upper right corner contain helpful pop-up notes:
you can float the mouse pointer over any such cell to reveal its cell formula and/or an explanation).

Columns J and K of this sheet also show how to use the "IF" and "AND" functions to copy data from columns A and B into columns J and K only those data points that fall between the two specific x limits.

If desired, you can add more data to the end of columns A and B, limited only be the range of the match functions in cells F2 and F3 (which are initially set to 1000, but that could be as large as you need). The total number of numerical values in the data set is computed in cell I15, using the "COUNT" function (which, as the name suggests, counts the number of cells in a range that contains numbers and does not count empty cells or cells with letters).

Measuring peak location. A common signal processing operation is finding the x-axis value where the y-axis value is maximum. This can be broken down into four steps:
(1) Define the range of values, either directly or using the indirect function, e.g. INDIRECT("B"&F2&":B"&F3)
(2) Determine the maximum y value in that range with the MAX function, e.g.:
MAX(INDIRECT("B"&F2&":B"&F3))
(3) Determine the row number in which that number appears with the MATCH function, e.g.: MATCH(H20,B7:B1000,0)+ROW(A6)
(4) Determine the value of x in that row with the INDIRECT function, e.g.: INDIRECT("A"&H21)
Each step references the results of the one before it. These steps are illustrated in the same "" spreadsheet in column H, rows 20-23. The result is that the maximum y (21.5) occurs at x=28. These steps can even be combined into one long formula (cell H23), although this is harder to read, and harder to document, than the formulas for the separate steps. The peak finder spreadsheet uses this technique.

The LINEST function. Indirect addressing is particularly useful when using array functions such as LINEST or the matrix algebra functions. The demonstration spreadsheet "IndirectLINEST.xls" (graphic link) shows how this works for the multiwavelength spectroscopy analysis of a mixture of three overlapping components by the CLS method. The measured mixture spectrum is in column C, rows 29-99 and the spectra of the three pure components are in columns D, E, and F. Cell C12 "=COUNT(C29:C1032)" counts the number of rows of data (i.e. number of wavelengths) in column C starting at row 29, and cell G3 counts the number of components (in this case 3). These are used to determine the first and last row and column for the indirect addresses in LINEST in cell C17. The measured peaks heights calculated by LINEST for the three peaks are given in row 17, columns C, D, can E, and the predicted standard deviations are in the row below. In this spreadsheet the data are actually simulated (over in columns O - U), so the true peaks heights are known and therefore the absolute accuracy can be calculated (row 26, C, D, and E) and compared to the predicted standard deviations. Press the F9 key to recalculate with an independent noise sample, which is equivalent to taking another measurement of the same sample. Because of the use of INDIRECT addressing, you can add or subtract data points at the end of columns C - E and the calculations work with no other changes. Examples of its use in signal processing are on CurveFittingB.html#spreadsheets.

## AA. Computer simulation of signals and instruments.

Throughout this book, I have often used computer simulations to test, demonstrate, and determine the range of applicability and the accuracy of various signal processing techniques. The aim is to generate realistic computer-simulated signal by adding together three components:

(a) known signal component, such as one or more peaks, pulses, or sigmoidal steps,
(b) a baseline, which may be fixed, variable, flat, sloped, curved or stepped, and
(c) random noise, which may have various frequency distributions (colors) and amplitude dependencies.

Such simulations can be done either in Matlab/Octave, using the built-in and downloadable functions for various peak shapes and types of random noise, or in spreadsheets, which can also be used to create attractive and intuitive user interfaces; some spreadsheet examples include SimulatedSignal6Gaussian.xlsx, PeakSharpeningDemo.xlsx, PeakDetectionDemo2.xls, TransmissionFittingDemoGaussian.xls, BeersLawCurveFit2.xls, and RegressionDemo.xls (above).

It's possible to make any aspect of a computer generated signal randomly variable from measurement to measurement, with the aim of making  the simulation as close as possible to the real signal behavior that you may have to measure. For example, in section E, "The Battle Rounds: a comparison of methods", the signal to be measured is a Gaussian peak located near the center of the recorded signal, with a fixed shape and width. The baseline, on the other hand, is highly variable, both in amplitude and in shape, and there is also added white noise. In another simulation, "Why measure peak area rather than peak height?", the signal peak itself is subject to a variable broadening process that causes the measured peak to be shorter and wider but which has no effect of the total area. In the section "Measuring a buried peak", the signal is a small "child" peak that is buried under the tail of a much stronger "parent" peak. In all these cases, the true underlying signal is known to the software, so that, after the software measures the simulated observed signal with all its baseline and noise variability, it can calculate the error of measurement, allowing you to compare different methods ’
or to optimize the method's variables to obtain the best accuracy.

In some cases it may be possible to simulate important aspects of an entire measurement instrument system. Several examples are shown in https://terpconnect.umd.edu/~toh/models/. This is most useful if both the signal magnitude and the noise can be predicted from first principles. For example, in optical spectroscopy, the principles of physics and of geometrical optics can be used to predict the intensity of an incandescent light source, the transmission of a monochromator, and the signal generated by a photomultiplier, including the photon noise.

When these are combined, it's possible to simulate the fundamental aspects of such instruments as a scanning fluorescence spectrometer (above) or an atomic absorption instrument (below),

to predict the analytical calibration curves of absorption spectroscopy, to compare the theoretical signal-to-noise ratios of absorption and fluorescence measurement, and to predict the detection limits of atomic emission measurement of various elements, and the effect of slit width on signal-to-noise ratio in absorption spectroscopy (below).

You can also simulate the operation of a lock-in amplifier, a wavelength modulation spectroscopy system, and even basic analog electronic and operational amplifier circuits. Note that these are not simulations of particular commercial instruments. They are not training tools for instrument operators. Rather, they are interactively manipulated mathematical models that describe various parts of or aspects of each system, for the purpose of illuminating hidden aspects of the instrument's internal operation.

## AB. Who uses this book and its associated web site, documents and software?

In the last few years, this site (http://terpconnect.umd.edu/~toh/spectrum/) has been accessed from Internet Service Providers in over 162 countries and 6 non-region-specific categories (e.g. satellite providers), including many countries in the developing world, some very small countries (e.g. Liechtenstein, the Faroe Islands), relatively isolated countries (Cuba, North Korea, Myanmar/Burma), and even some war-torn regions (Afghanistan, Syria, Iraq). Breath of Internet access is often an issue. For example, I've got fewer views from Cuba that from other Spanish-speaking countries with smaller populations, such as Bolivia, Dominican Republic, Costa Rica, Puerto Rico, Panama, and Uruguay, even though Cuba has many active scientists, especially in the medical and pharmaceutical fields.

The first Web version went up in 1996, but I didn't start keeping track of views until 2008; since then there have been over 2 million page views. The distribution of page view counts among countries is very long-tailed, with one-third of the views coming from the USA (except during major US holidays), half of the views coming from only 5 countries (USA, India, Germany, United Kingdom, and China) and 99% of the views coming from only 39 countries. Among the countries that have a relatively large number of page views relative to their populations are the USA, Germany, UK, Canada, Australia, Netherlands, Switzerland, Singapore, Israel, Belgium, Taiwan, South Korea, and Scandinavia. (Another web site of mine on a related subject, Interactive Computer Models for Analytical Chemistry Instruction, had got an additional 825,000 views).

The Internet Service Providers with the largest number of views are Comcast, Verizon FIOS, Time Warner, Cloudflare, At&t U-verse, Deutsche Telekom (Germany), BSNL (India), and Cox Communication. Most views worldwide come from Windows machines, about 20% from Linux and Macintosh, and 10% from mobile devices. I've made efforts to make my pages more usable from mobile devices like smartphones.

About one quarter of the views come directly from educational institution ISPs that have "School", "Ecole", "College", "Hochschule", "Univ...", "Academic", or "Institute of Technology" in their names. (The number of educational users is certainly larger than that because some users are no doubt accessing from other ISPs in homes or businesses). An analysis of 200,000 views in 2015 showed that the biggest educational users have been the University of California System (UCLA, Berkley, etc.), Indian Institute Of Technology system, the University of Texas system, Massachusetts Institute Of Technology, the University of Michigan, the University of Maryland (my home institution), Delft University of Technology (Netherlands), Stanford University, China Education And Research Network Center, the University Of Wisconsin System, and the University of Illinois.

Many of the large national laboratories are users, including Bell Canada, Oak Ridge, Pacific Northwest, Lawrence Livermore, Sandia, Brookhaven, National Renewable Energy Laboratory, SLAC, FermiLab, Lawrence Berkeley, NRC Canada, CERN, NIST, NASA, JPL, and NIH.

The most popular pages on the site recently have been Peak Finding and Measurement, Smoothing, Integration, Deconvolution, InteractivePeakFitter, and Signal Processing Tools. About 50% of the page views originate from search engines (80% of those using Google). The most common search keywords used are: "peak area", "convolution", "deconvolution", "peak detection", "signal processing pdf", "findpeaks matlab", "Fourier filter", and "smoothing". About 40% of the traffic comes from direct links (bookmarks or typed URLs) and about 10% comes from referring websites, usually from Wikipedia or from MathWorks. Unfortunately, page loads and search terms have become almost completely encrypted in recent years, so I can no longer tell which pages are being viewed and what is being download. (Interestingly, that is not the case with Interactive Computer Models for Analytical Chemistry Instruction, which has only 75% encryption).

There have been over 100,000 downloads of my software and documentation files, currently averaging several hundred file downloads per month, from both my web site and from my files on the Matlab File Exchange. The most commonly downloaded files are IntroToSignalProcessing.pdf, PeakFinder.zip, ipf12.zip, CurveFitter....xlsx, iSignal6.zipipeak7.zip, PeakDetection.xlsx, and the complete site archive SPECTRUM.zip

What factors influence the number of page views from different countries?  The tools of data analysis, specifically regression via LINEST, can help answer this question. Obviously, one would expect that a country's population would be a factor, but it turns out that the correlation between log(page loads) and log(population) is very poor, with a coefficient of determination (log-log correlation coefficient or R
2 value) of only 0.36 (n=163 countries; over 160,000 total page loads over the period from 2008 to 2017; graphic link). Note that because of the very large range of population sizes, I did a log-log correlation in order to prevent the results from being totally dominated by the top few countries.

I also investigated the effect of other factors that might be more specific to the language and subject matter of my particular site, including
• the number of English speakers in each country,
• the number of Internet users,
• the number of universities, and
• the total research and development budget of each country.
All of that information is freely available on the internet (graphic link). By a good margin, the most influential factor was the research and development budget ,for which the R2 value was 0.76. This is perhaps not surprising given that my site concerns a very narrow and specialized topic: the technical aspects of computerized scientific data processing.

A log-log multilinear regression on all 5 of these factors together yielded a R
2 value of 0.84 (n=53 countries for which all 5 factors were reported), which is a modest improvement over the research and development budget alone.

For an Excel spreadsheet with all these data and calculations (between 2008 and 2015), see FinalCountriesSummary.xlsx

AC. The Law of Large Numbers

The Law of Large Numbers is a theorem that describes large collections of numbers or observations that are subject to independent and identically distributed random variation, such as the result of performing the same experiment a large number of times. The average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed. It is an important idea because it guarantees stable long-term results for the averages of some random events. This is why gambling casinos are able to make money; their games are designed to give the casino a small advantage in the long run but highly variable results in the short term, guaranteeing plenty of (noisy) winners, which encourages the gamblers, but even a greater number of (quiet) losers. And that is why investors in the stock market often make money in the long run, despite the unpredictable day-to-day variation, up one day and down the next, and why it is so hard to see climate change in the much wilder short-term hot and cold day-to-day and year-to-year swings in the weather. Short term is closer and easier; long term is harder to see.

But "The average ... will tend to become closer as more trials are performed" does not mean that the average becomes steadily and irreversibly closer. In fact, the average can wander around quite a bit. Take the example on the left, which shows the running average of a set of normally distributed independent random numbers with a population mean of 1.000 and a standard deviation of 1.000, as more and more numbers from that population are averaged, up to 1000. (This is generated by the Matlab script RunningAverage.m, shown on the left). Note that the average wanders around, reaching and crossing over the true population average twice in this case before ending up near 1.0 after 1000 points are accumulated. But if you ran this script again, the final average may not be so close to 1.0. In fact, the predicted standard deviation of the average of 1000 random numbers is reduced by a factor of 1/sqrt(1000), which is about 0.031, or 3% relative, meaning that most results will fall within 6% of the true average of 1.000, that is, between 0.94 and 1.06.

The uncertainty of uncertainty. The situation is even worse if you wish to estimate the standard deviation of a population from small samples. The Matlab script
RunningStandardDeviation.m simulates this for the same population in the previous example. As shown in the graph on the left, the sample standard deviation wanders around alarmingly for small samples and only settles down slowly. Even worse, the standard deviation for very small samples is biased down, often returning values far lower than the population standard deviation.

There is a well-documented tendency for people to overestimate the quality of small numbers of observations, sometimes referred to as hasty generalization, or insensitivity to sample size, or the gambler's fallacy. This is related to the field of study of a famous pair of psychologists named Amos Tversky and Daniel Kahneman, who collaborated in a long-running study of human cognitive biases in the 1970s. They formulated a hypothesis that people tend to believe in a false "Law of Small Numbers", the name they coined for the mistaken belief that a small sample drawn from a large population is representative of that large population. We would like to believe that scientists are immune to these foibles and that they always think logically and correctly. But scientists are only human, so it is important to be aware of this tendency, particularly when a small sample of data supports your favorite hypothesis. It is tempting to stop there, "while you are ahead". This is called "confirmation bias". Don't do it.

Of course in many practical experimental measurements, you may really be constrained to a rather small number of repeated measurements. There may be a fixed number of data points and no possibility of gathering more. Or the cost, in money or in time, of gathering more data may be excessive, even in a laboratory environment. For example, the process of calibrating an analytical instrument for quantitative measurement may involve the preparation and measurement of several standard samples or solutions of known composition. If the calibration curve (the relationship between instrument reading and sample composition) is non-linear, it takes several different standards to define the curve. You have to consider not only cost of preparing many standards but also the cost of cleaning up and safely storing or disposing of the (potentially hazardous) chemicals afterwards. The bottom line is, if you are limited to a small number of data points, do not over-represent the precision of your results. To use the 3-sigma rule to determine uncertainty ranges for a set of data, the distribution must be normal (Gaussian) and you need to know the standard deviation. The problem is that, for small sets of data, both are uncertain.

AD. Spectroscopy and chromatography combined: time-resolved Classical Least Squares

The introduction of high-speed UV-Visible array detectors into high performance liquid chromatography (HPLC) instruments has significantly increased the power of that method. The speed of such detectors is such that they can acquire a complete spectrum multiple times per second over the entire chromatogram. An example of this is described in a technical report from Shimadazu Scientific Instruments (https://solutions.shimadzu.co.jp/an/n/en/hplc/jpl217011.pdf) which considers the separation of three positional isomers of methyl acetophenone: o-methyl (o-MAP), m-methyl a(m-MAP), and p-methyl (p-MAP). The ultraviolet absorption spectra of these three isomers at a concentration of 400 μg/mL each is shown below on the left, and the chromatographic separation, using the column and conditions specified in their report, are shown in the middle. The report goes on to describe their commercial software, which uses a complex iterative approach to extract the spectra and the chromatographic characteristics from the raw data.

CLSPercentErrors =        0.0021993   0.0020162    0.0015607
PerpDropPercentErrors =  -1.6315     -0.78697      3.272

Here I present a comparatively simple non-iterative technique based on the same chemical system, in which we consider each spectrum acquired by the detector as a separate sample mixture and apply the Classic Least-squares method previously introduced, in which the spectra of the components are known beforehand and where adherence to the Beer-Lambert Law is expected. The spectra and chromatographic peaks are simulated digitally in the Matlab/Octave script TimeResolvedCLS.m, shown in the figure below, by modeling the spectrum of each component as the sum of three Gaussian peaks and the chromatographic peaks as exponentially modified Gaussians. To make this simulation as realistic as possible, the parameters were carefully adjusted to match the graphics in the technical report closely, and the other parameters, such as the spectral resolution, sampling rate, and detector noise (2 milliabsorbance units, mAU), were based on that report also. Note that the chromatographic peaks (middle figure) are nowhere near baseline resolved. Therefore, it is to be expected that quantitative calibration based on the measurement of peak areas in this chromatogram (for example by the perpendicular drop method might be inaccurate, especially if the peak heights are very different. In fact, in this case, even though the concentrations of the three components are much lower (0.05 μg/mL for each), the peak areas measured by perpendicular drop are only about 2% from the true values, mainly due to the slight asymmetry and nearly equal height of the three peaks. The spectra (left-hand figure) are even more highly overlapped than the chromatographic peaks, but they are distinct in shape, and that is the key.

Basically, we treat this as a series of 3-component CLS calculations, one for each time slice of the detector. The actual calculations can be done in two ways, depending on whether the spectra are processed one by one or are collected for the entire chromatogram and then processed all at once, using either "Alternative calculation #1", lines 113-146, or "Alternative calculation #2", lines 150-170. The first method, shown on the left, looks like chromatography as it executes; it computes the chromatographic peaks of the three components point by point as they evolve in time and plots them in the first three quadrants of figure window 3 (on the right). The second method calculates the entire chromatogram in one step at the end and makes the same final plots. (The second method is faster computationally, but that's not significant because the chromatography takes much longer than the calculations). Either way, the result is the same; the chromatographic peaks of the three components are completely separated mathematically, so their areas are easily measured, no matter how much they overlap! Note that, although the three spectra must be known, no knowledge of the chromatography peaks is required; they emerge separate and intact from the data, purely computationally.

Stress test. In order to test the abilities and limitations of this method, I have prepared a series of increasingly challenging scenarios, starting with the one pictured above and becoming progressively more difficult by making the chromatographic peaks more closely spaced, making the peak more asymmetrical, making the spectra more similar, and making the concentrations unequal. These scenarios are listed in the table below, along with the typical percent errors in peak area measurement by the CLS method and links to the corresponding graphics and Matlab/Octave m-files. Each is a more challenging variation on the first one; #2 has much more chromatographic peak overlap; #3 has much more asymmetrical chromatographic peaks (higher tau); #4 has much more similar spectra - in fact, the peak wavelengths differ by only 0.1 nm, making them look identical; in #5, component 2 (the middle peak) has a concentration 100 times lower; and #6 is the same as #5 except that the peaks are highly asymmetrical. In all of these cases, the normal perpendicular drop area measurement technique is either impossible (because there are no distinct peaks for each component) or are very much in error, but the CLS techniques works well, giving very low errors except when the middle peak concentration is 0.0001, which approaches the random noise limit of the detector. (Another variation, TimeResolvedCLSbaseline.m, includes the correction for baseline shifts.)

 Peak resolution Spectral similarity Peak asymmetry Concentration ratios % errors in area measurement Links 1. Normal Normal Slight: tau=10 .05   .05   .05 0.0022%      0.002%    0.0016% Graphic   m file 2. Unresolved Normal Slight: tau=10 .01   .01   .01 -0.06%       -0.053%    -0.041% Graphic   m file 3. Partly resolved Normal Great: tau=40 .05   .05   .05 -0.0004%    -0.013%    -0.066% Graphic   m file 4. Unresolved Almost complete Slight: tau=10 .01   .01   .01 0.054%         0.049%     0.04% Graphic   m file 5. Unresolved Almost complete Slight: tau=10 .01  .0001  .01 0.026%         2.4%       0.019% Graphic   m file 6. Unresolved Almost complete Great: tau=40 .01  .0001  .01 -0.04%         -3.8%       -0.03% Graphic   m file

Even when the peaks are resolved well enough for the perpendicular drop method to work, it can suffer from interaction between adjacent peak heights; that is, a change in the peak height of one peak can affect the measurement of the area of adjacent overlapped peaks, because of shifts in the valley point between them. This is illustrated by TimeResolvedCLScalibration.m, which simulates the measurement of 10 different three-component mixtures similar to the above (but modified so perpendicular drop measurement is possible), where the concentrations vary independently and randomly over a 1 x 10-4 to 9.5 x 10-4 microgram/mL range, and then plots measured peak area vs concentration for each component. (Each time you run this, you will get a different mix of concentrations). Linear least-squares fits of peak area vs concentration are calculated, as shown below. In this typical example, the average absolute percentage error in area measurement for the perpendicular drop method is about 5%, with an R2 of 0.995, and for the CLS measurement is less than 1%, with an R2 of 0.9995. (Even if the detector noise (line 22) is set to zero in this simulation, the errors in the perpendicular drop method remain, because they are caused by overlap between adjacent peaks, rather than by noise).

Though clearly the CLS method is very effective, all of this really only proves that the mathematics works well; the method still has the serious limitation that it requires that the spectra of all the components be known accurately. This requirement can be met in some applications, but in liquid chromatography there is a potential pitfall. If gradient elution and/or temperature programming are used, and if the spectra of those chemical compounds are sensitive to the solvent and/or to temperature, for example shifting their peaks slightly, then there will likely be additional errors in the CLS procedure. Obviously this depends on the particular chemical system and will have to be evaluated on a case-by-case basis.

But this suggests another interesting use for this method: speeding up a chromatographic method that normally would achieve complete baseline separation (from which accurate spectra of each component could be obtained in situ), and then adjusting the column and/or flow rate to achieve faster but incompletely resolved chromatograms to which the CLS method could be applied quickly and accurately to multiple samples.

In other applications, some or all the components may simply be unknown, and you may want to obtain their spectra. This can be done in situ if the peak separation is at least as good as that depicted in the first figure above, because you can see that at each peak maximum there is virtually no contribution from the adjacent peaks. Therefore, measuring the spectrum of each component at its chromatography peak maximum should give good spectra for each component in this case.

But what if the peaks are even more overlapped than this, so that pure component spectra are never achieved? In that case, more sophisticated methods must be used, such as the one described in the Shimadzu technical report. This involves making initial estimates of spectral and chromatographic peaks, followed by an iterative search for the best fit to the experimental data, subject to the imposition of some important known prior constraints, such as non-negativity of spectra and of the chromatography peaks (those peaks are always positive, except for random noise on the baseline), and the unimodality of the chromatography peaks (that is, each component gives one and only one chromatography peak). Methods of this type will be left to a future expansion of this book.
.

## AE. The mystery peak challenge

The objective of this exercise is to learn as much as we can about the underlying properties of a digitized signal using the signal processing tools in this book and, if possible, to obtain a mathematical description of the signal (script). At first glance, the signal (MysteryPeak.mat) appears to be a single, asymmetrical peak with a maximum at x=55.5. The signal-to-noise ratio seems to be very good - there is little visible noise unless you look very closely - and the signal begins and ends near zero, so baseline correction is likely not an issue. The bad news is that we do not know anything else. The asymmetry might be due to some asymmetrical process applied to an originally symmetrical peak shape, but it could be a group of closely spaced overlapping peaks, which is suggested by the faint bumps in the shape. Some quick preliminary curve fitting can be done using command line peakfit.m:

for a two-Gaussian model (shape 1)
[FitResults,GOF]=peakfit([x y],0,0,2,1)

for a single Logistic model (shape 3)

[FitResults,GOF]=peakfit([x y],0,0,1,3)

for a 4 exponentially broadened Gaussian model (shape 39)
[FitResults,GOF]=peakfit([x y],0,0,4,39,1)

Alternatively, it is even easier to use the interactive peak fitter
ipf.m for this purpose; you can quickly change the model shape, number of peaks, starting guesses, data region to be fitted, etc., with single keystrokes and mouse clicks. Either way, the three initial fits in the figures show that the signal contains a small amount of random noise, which appears to be white (so the signal has probably not been smoothed, which is fortunate) and which has a relative standard deviation of about 0.2%, based on 1/5th of the visual peak-to-peak value in the residual plot. But unfortunately these fits are not successful because their fitting errors (0.5 to 0.8%) are all significantly larger than the 0.2% random noise. Trying different shapes and greater numbers of peaks does not help either, resulting in either higher fitting errors, unstable fits, or zero peak heights for additional peaks (as in the four-Gaussian fit).

Another approach to the problem of asymmetrical peaks is to use the technique of first-derivative symmetrization described previously. This applies specifically to exponential broadening, a commonly- encountered peak broadening mechanism. The idea is that if you compute the first derivative of an exponentially broadened peak, multiply it by a weighting factor equal to the time constant of the exponential, and add it to the original broadened signal, the result will be the original peak before broadening, which makes the peak overlap less severe. This works for any original peak shape. Even if you do not know the time constant beforehand, you can try different values until the baseline after the peak is as low as possible but not negative, as shown in this GIF animation. This is easily done interactively in iSignal, version 7, which has smoothing, derivatives, symmetrization, and curve fitting, or by using the symmetrize.m function. The derivative of y with respect to x, by the "derivxy.m" function, shown by the green line in the figure above, is quite noisy. As usual, we must smooth the derivatives of noisy signals to make them useful, but we must not over-smooth and distort the signals. As a rule of thumb, a smooth width equal to 1/10th of the number of data points in the halfwidth does not distort the signal visibly. Our peak has about 530 points (
=halfwidth(1:length(y),y)), so a smooth width near 53 will not distort the signal peak, but it does eliminate most of the noise from the derivative (red line, above right). Also, we can see that the derivative, in y/x units, is comparable in numerical magnitude to the original signal, so the weighting factor (in x units) is probably somewhere near 1.0. Next, we add the first derivative times that factor to the original signal, looking at the trailing edge as we try six different factor values near 1.0. The graph on the left shows that the optimum value is about 1.25.

When this symmetrization is applied to the entire signal, the result, shown below on the right, has more distinct bumps. When that modified signal is used for curve fitting, it is found that a 3-Gaussian model works quite well, with a fitting error of only 0.25%. This is evidence that the signal actually consists of three exponentially broadened Gaussians. Ordinarily there is no independent way to check the accuracy of the peak parameters so measured, but - full disclosure - the signal in the case was not actually unknown but rather was generated by the file MysteryPeaks.m. The true time constant is 1.3, the peak positions are 53, 55, and 57.5 and the peak areas are each equal to 1.0. The curve-fit results after symmetrization are within 0.1% for the peak positions and within 2% of the areas. In contrast, direct fitting of three equal-alpha exponential Gaussians to the original data, which is now possible because we have determined a value for the time constant
(expressed as the number of data points: symmfactor./(x(2)-x(1)), gives less accurate results in this case (graphic), probably because the overlap of the asymmetrical peak is greater. Moreover, it takes three times longer to compute.

## AF. Developing Matlab Apps

The usual way of developing programs in Matlab is to write scripts or functions, developed and used in a screen environment similar to the screen images of Matlab that I have shown before. In these environments, you can change the values of parameters by typing them into the editor window or the command line. Then, you save the modified script or function and re-run it. But there is another development path that results in programs that use a much more contemporary graphical user interface (GUI) employing drop-down menus, number wheels, tap-and-drag sliders, and such to support more intuitive user interactivity. There are examples of such apps in the toolboxes that are included in your version of Matlab (or can be optionally purchased from Matlab); type "ver" at the command line to see which ones are included in yours. The process of development your own apps is rather more complex than coding the mathematically equivalent script or function. But fortunately, Matlab has a built-in drag-and-drop environment to build user interfaces; just click on the APPS button at the top left. This brings up several app-related buttons as well as a list of apps already installed.

Rather than detail all the required steps here, I will defer to the many excellent tutorials and YouTube videos already available. For example, there is a video tutorial titled "Getting Started and Hello World app" at https://www.youtube.com/watch?v=iga-YS6VbyE. "Hello World" refers to very first simple example often used for learning a new programming language, which simply writes that phrase on the display.

A more pertinent example is shown on the right. "Create and Run a Simple App Using App Designer" displays a waveform whose amplitude you can interactively control with a slider. There are many such examples that are built into the App Designer. When you click "New" in the Designer mode, you'll get a display (shown below) of several examples that are already constructed. You can learn a lot by studying these.

The advantage of apps, compared to scripts and functions, is that they are easier to use, especially for users who are not programmers. The disadvantage is that they are more complex for the programmer. In fact, the amount of code and of coding time and effort that goes into the user interface design and interactivity usually far exceeds the code that is required for the actual mathematical computations.

I will give one final example that compares the operation and coding of a Matlab app to that of a Matlab script that does the same thing mathematically. The Matlab script is "GaussianSelfDeconvDemo3.m" and the app is called "SelfDeconvolutionDemo" which you can run by downloading and double-clicking. Both demonstrate the Fourier deconvolution of a pair overlapping peaks with a 1:2 height ratio, with the aim of increasing the resolution of the peaks. The sequence of operations is:
(1) create the simulated signal with two peaks and random noise,
(2) create a zero-centered convolution function of the same shape and with variable width,
(3) add an adjustable constant to the Fourier transform of the convolution function,
(4) divide the Fourier transform of the simulated signal by the modified Fourier transform of the convolution function,
(5) inverse transform that result, and
(6) apply Fourier filtering to reduce noise.
The script itself takes only about 40 lines of code, most of which deals with plotting and labeling, plus two external user functions. To experiment with different peak shapes, separations, etc., you would have to edit the script, save it, and re-run it. The Matlab app, on the other hand, has sliders for each of these variables, which allows those variables (and only those) to be adjusted simply by sliding the pointers. Each slider has a numerical range that you can set to be in a "reasonable range" for that variable. Moreover, the app has a pull-down menu to choose the peak shape (in this example, just two shapes: Gaussian and Lorentzian). Each time one any of these controls is changed, the app recalculates and updates the plots. You must add the specific programming for the mathematical calculations, which you type into the white space in the Code View, but you can call any functions that you are previously written and have saved in the Matlab path. You can also package any Matlab app you create into a single file, including any functions that you have called, so it can be easily shared with others.

The coding required to get all this to work is more complex and a simple script, but the app is easier and more fool-proof for a non-Matlab programmer to operate and it is a quicker to explore the several interacting variables in an app such as this, compared to editing, saving, and re-running a script.

Unfortunately, Octave and Python do not have built-in mobile development capabilities, but there are packages you can use to create mobile applications in Python, like Kivy, PyQt, or Beeware's Toga library. These libraries are all major players in the Python mobile space.

This page is part of "A Pragmatic Introduction to Signal Processing", created and maintained by Prof. Tom O'Haver , Department of Chemistry and Biochemistry, The University of Maryland at College Park. Comments, suggestions and questions should be directed to Prof. O'Haver at toh@umd.edu. Updated March, 2022.
Unique visits to this site since June 22, 2008: