Audio sampling rates and the Fourier transform
Christopher Montgomery (“Monty”) recently posted an excellent argument against distributing music in 192 kHz, 24-bit form, as opposed to the usual 44.1 kHz (or 48 kHz), 16-bit form. I think, however, that many of the people who are inclined to doubt this sort of thing are going to doubt it at a much more fundamental level than the level he’s addressed it at. And I don’t just mean the math-phobic; I know I would have doubted it, once. For years, and even after finishing an undergraduate degree in electrical engineering, I wondered whether speaking of signals in terms of their frequency content was really something that could be done as glibly and freely as everyone seemed to assume it could be. It’s an assumption that pervades Monty’s argument – for instance, when he states that “all signals with content entirely below the Nyquist frequency (half the sampling rate) are captured perfectly and completely by sampling”. If you don’t believe in speaking of signals in terms of their frequency content, you won’t know what to make of that sentence.
As it happens, the assumption is completely correct, and the glibness and freeness with which people talk of the frequency domain is completely justified; but it originally took some serious proving by mathematicians. To summarize the main results, first of all, the Fourier transform of a signal is unique. When you’ve found one series of sine waves and cosine waves that when added together are equal to your signal, there is no other; you’ve found the only one. (Fourier transforms are usually done in terms of complex exponentials, but when one is dealing with real signals, they all boil down to sines and cosines; the imaginary numbers disappear in the final results.) If you construct a signal from sinusoids of frequencies below 20 kHz, there’s no possibility of someone else analyzing it some other way and finding frequencies higher than that in it – unless, of course, he does it wrong (an ever-present danger).
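A small numerical sketch of the uniqueness claim, using the discrete Fourier transform as a stand-in for the continuous one: build a signal from three sinusoids below 20 kHz, analyze it, and see which frequencies come back. The sample rate, length, tone frequencies and amplitudes are all arbitrary choices made for illustration, and the tones are deliberately picked to be periodic on the analysis interval.

```python
import cmath
import math

SAMPLE_RATE = 48_000   # Hz; illustrative choice
N = 480                # 10 ms of signal, so DFT bins fall every 100 Hz
TONES = {1_000: 1.0, 5_000: 0.5, 19_000: 0.25}   # Hz -> amplitude (made-up values)

def dft(x):
    """Naive O(N^2) discrete Fourier transform; slow but fine for small N."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

# Build the signal by adding the sinusoids together.
signal = [sum(a * math.sin(2 * math.pi * f * t / SAMPLE_RATE) for f, a in TONES.items())
          for t in range(N)]

spectrum = dft(signal)

# Amplitude found at each frequency from 0 Hz up to (not including) Nyquist.
# The factor 2/N undoes the scaling of this unnormalized DFT for nonzero frequencies.
amplitudes = {round(k * SAMPLE_RATE / N): 2 * abs(spectrum[k]) / N for k in range(N // 2)}

# Keep only the frequencies with a non-negligible amplitude.
found = {f: round(a, 6) for f, a in amplitudes.items() if a > 1e-6}
print(found)   # only the three frequencies that were put in, at their amplitudes
```

Nothing shows up above 19 kHz except rounding noise, which is the point: the analysis recovers the recipe that was used, not some alternative one.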
Also, the Fourier representation is complete: any signal can be exactly represented as a sum of sinusoids (generally an infinite sum of them, or an integral which is the limit of an infinite sum of them). There are no signals out there which defy Fourier analysis, and which might be left out entirely when one speaks of the “frequency content” of a signal. Even signals that look nothing like sine waves can be constructed from sine waves, though in that case it takes more of them to approximate the signal well.
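To illustrate the last point, here is a rough sketch using the textbook Fourier series of a square wave, a signal that looks nothing like a sine wave. The function names and the number of evaluation points are my own choices for the example. It measures how far the sum of the first few sine waves is from the true square wave, in the root-mean-square sense.

```python
import math

def square_wave(t):
    """Ideal square wave with period 1: +1 on the first half, -1 on the second."""
    return 1.0 if (t % 1.0) < 0.5 else -1.0

def partial_sum(t, n_terms):
    """Sum of the first n_terms sine waves (odd harmonics) of the Fourier series."""
    return (4 / math.pi) * sum(
        math.sin(2 * math.pi * k * t) / k for k in range(1, 2 * n_terms, 2))

def rms_error(n_terms, n_points=2000):
    """RMS difference between the square wave and its partial sum over one period."""
    # Evaluate at midpoints so that no point lands exactly on a jump.
    ts = [(i + 0.5) / n_points for i in range(n_points)]
    total = sum((square_wave(t) - partial_sum(t, n_terms)) ** 2 for t in ts)
    return math.sqrt(total / n_points)

errors = {n: rms_error(n) for n in (1, 3, 10, 30, 100)}
for n, err in errors.items():
    print(f"{n:4d} sine waves: RMS error {err:.4f}")
```

The error keeps shrinking as sine waves are added, but slowly; a signal with sharp edges needs many more terms than a smooth one to reach the same accuracy.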
But the main thing that makes it possible to be so glib about the frequency domain is that the Fourier transform is orthogonal. (Or in its complex-exponential variants, unitary, which is the corresponding concept for complex numbers.) What it means for a transform to be orthogonal can be illustrated by the example of coordinate transforms in three-dimensional space. In general, a coordinate transform of a three-dimensional object may twist it, bend it, or stretch it, but an orthogonal transform can only rotate it and possibly flip it over to its mirror image. When viewing 3D objects on a computer screen, applying an orthogonal transform just results in looking at the same object from a different angle; it doesn’t fundamentally change the object. At most it might flip the ‘handedness’, changing a right hand into a left hand or vice versa. In the Fourier transform there are not just three numbers (the three coordinates) being transformed but an infinite number of them: one continuous function (the signal) is being transformed into another continuous function (its spectrum); but again, orthogonality means that sizes are preserved. The “size”, in this case, is the total energy of the signal (or its square root, which mathematicians call the L2 norm; the engineers’ root-mean-square is the same quantity divided by the square root of the signal’s duration). Applying that measure to the signal yields the same result as does applying the same measure to its spectrum. This means that one can speak of the energy in different frequency bands as being something that adds together to give the total energy, just as one speaks of the energy in different time intervals as being something that adds up to give the total energy – which of course is the same whether one adds it up in the time domain or the frequency domain. This also applies, of course, to differences between signals: if you make a change to a signal, the size of the change is the same in the frequency domain as in the time domain.
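The energy-preservation property (Parseval's theorem) is easy to check numerically. The sketch below uses the discrete Fourier transform with a 1/sqrt(N) scaling, which is the unitary discrete counterpart of the continuous transform discussed here. The test signal is random noise and the "change" is a small random perturbation, both invented for the example.

```python
import cmath
import math
import random

def unitary_dft(x):
    """DFT scaled by 1/sqrt(N), which makes the transform unitary."""
    n = len(x)
    scale = 1 / math.sqrt(n)
    return [scale * sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def energy(values):
    """Sum of squared magnitudes (the squared L2 norm)."""
    return sum(abs(v) ** 2 for v in values)

random.seed(0)
N = 256
signal = [random.uniform(-1, 1) for _ in range(N)]
changed = [s + random.gauss(0, 0.01) for s in signal]   # a small change to the signal

spec_signal = unitary_dft(signal)
spec_changed = unitary_dft(changed)

# The total energy is the same in both domains.
time_energy = energy(signal)
freq_energy = energy(spec_signal)

# So is the size of the change.
time_diff = energy([a - b for a, b in zip(signal, changed)])
freq_diff = energy([a - b for a, b in zip(spec_signal, spec_changed)])

# Energy in frequency bands adds up to the total. Bin k and bin N - k hold
# the same frequency (positive and negative), so both go in the same band.
low = energy(X for k, X in enumerate(spec_signal) if min(k, N - k) < N // 4)
high = energy(X for k, X in enumerate(spec_signal) if min(k, N - k) >= N // 4)

print(time_energy, freq_energy)
print(time_diff, freq_diff)
print(low + high)
```

Each pair of printed numbers agrees to within rounding error. With the more common unscaled DFT the energies differ by a constant factor of N, which is a matter of convention rather than a failure of orthogonality.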
With a transform that was not orthogonal, a small change to the signal might mean a large change in its transform, or vice versa. This would make it much harder to work with the transform; you would constantly have to be looking over your shoulder to make sure that the math was not about to stab you in the back. As it is, it’s a reliable servant that can be taken for granted. As in the case of 3D coordinate transforms, but in a vaguer sense, the Fourier transform is just a different way of looking at the same signal (“looking at it in the frequency domain”), not something that warps or distorts it.
Engineers these days seem to go mostly by shared experience, in feeling comfortable with the Fourier transform: it hasn’t stabbed any of their fellow-professionals in the back, so it probably won’t do so for them, either. But as a student, I didn’t feel comfortable until I’d seen proofs of the results described above. In general, learning from experience means learning a lot of things the hard way; that just happens not to be so in this particular case: there are no unpleasant surprises lurking.
Now, when trying to use the Fourier transform on a computer, things do get somewhat more complicated, and there can be unpleasant surprises. Computers don’t naturally do the Fourier transform in its continuous-function version; instead they do discrete variants of it. When it comes to those discrete variants, it is possible to feed them a sine wave of a single frequency and get back an analysis saying that it contains not that frequency but all sorts of other frequencies: all you have to do is to make the original sine wave not be periodic on the interval you’re analyzing it on. But that is a practical problem for numerical programmers who want to use the Fourier transform in their algorithms; it’s not a problem with the continuous version of the Fourier transform, in which one always considers the entire signal, rather than chopping it at the beginning and end of some interval. It is that chopping which introduces the spurious frequencies; and in contexts where this results in a practical problem, there are usually ways to solve it, or at least greatly mitigate it; these commonly involve phasing the signal in and out slowly, rather than abruptly chopping it. In any case, it’s a limitation of computers doing Fourier transforms, not a limitation of computers playing audio from digital samples – a process which need not involve the computation of any Fourier transforms.
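The chopping effect, and the fading-in-and-out remedy, can be sketched as follows. The 64-sample interval, the 8 and 8.5 cycle counts, and the "more than three bins from the peak" measure of spurious energy are arbitrary choices for the example; the fade used is the Hann window, which is one common choice among many.

```python
import cmath
import math

N = 64  # samples in the analysis interval

def half_spectrum_energy(x):
    """Energy |X[k]|^2 in each DFT bin from 0 up to the Nyquist bin."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2
            for k in range(n // 2 + 1)]

def leaked_fraction(x, keep=3):
    """Fraction of energy lying more than `keep` bins away from the strongest bin."""
    e = half_spectrum_energy(x)
    peak = max(range(len(e)), key=e.__getitem__)
    far = sum(v for k, v in enumerate(e) if abs(k - peak) > keep)
    return far / sum(e)

def sine(cycles):
    """A sine wave completing `cycles` cycles over the N-sample interval."""
    return [math.sin(2 * math.pi * cycles * t / N) for t in range(N)]

# Hann window: rises smoothly from zero and falls back to zero.
hann = [0.5 * (1 - math.cos(2 * math.pi * t / N)) for t in range(N)]

periodic = leaked_fraction(sine(8.0))   # whole number of cycles: periodic on the interval
chopped = leaked_fraction(sine(8.5))    # cut off abruptly mid-cycle
faded = leaked_fraction([w * s for w, s in zip(hann, sine(8.5))])

print(f"periodic: {periodic:.2e}  chopped: {chopped:.2e}  faded: {faded:.2e}")
```

The periodic case puts essentially all of its energy in one bin. The chopped sine wave spreads a noticeable fraction of its energy across distant frequencies, and fading it in and out brings that fraction down by a large factor, at the cost of a somewhat wider main peak.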
Much more could be said about the Fourier transform, of course, but the above are some of the main reasons why it is so useful in such a wide variety of applications (of which audio is just one).
Having explained why sentences like
“All signals with content entirely below the Nyquist frequency (half the sampling rate) are captured perfectly and completely by sampling”
are meaningful, and not merely some sort of mathematical shell game, a few words about Monty’s essay itself. As regards the ability of modern computer audio systems to reproduce everything up to the Nyquist limit, I happen to have been sending sine waves through an audio card recently – and not any kind of fancy audio device, just five-year-old motherboard audio, albeit motherboard audio for which I’d paid a premium of something like $4 over a nearly-equivalent motherboard of the same brand with lesser audio. This particular motherboard audio does 192 kHz sample rates, and I was testing it with sine waves of up to the Nyquist frequency (96 kHz). Graphed in Audacity, which shows signals by drawing straight lines between the sample points, the signals looked very little like a sine wave. But when I looked at the output on an oscilloscope with a much higher sample rate, it was a perfect sine wave. Above 75 kHz, the signal’s amplitude started decreasing, until at 90 kHz it was only about a third of normal; but it still looked like a perfect sine wave. Reproducing a sine wave given fewer than three points per wavelength is something of a trick, but it’s a trick my system can and does pull off, exactly as per Monty’s claims. Accurate reproduction of things only dogs can hear, in case one wants to torture the neighborhood pooch with extremely precise torturing sounds! (Or in my case, in case one wants to do some capacitor ESR testing.)
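The mathematical version of that trick is sinc (Whittaker-Shannon) interpolation, sketched below for a 90 kHz tone sampled at 192 kHz. This is an idealized illustration, not a model of what any particular audio hardware does: real converters use practical filters rather than an explicit sinc sum. The number of samples kept on each side (`M`) and the evaluation grid are assumptions of the example, and truncating the sum to a finite number of samples leaves a small residual error.

```python
import math

SAMPLE_RATE = 192_000   # Hz
TONE = 90_000           # Hz, close to the 96 kHz Nyquist frequency
M = 1000                # samples kept on each side of t = 0 (assumed)

def tone(t):
    """The original continuous sine wave; t is measured in sample periods."""
    return math.sin(2 * math.pi * TONE / SAMPLE_RATE * t)

# The only information the reconstruction is allowed to use.
samples = {n: tone(n) for n in range(-M, M + 1)}

def sinc(u):
    """Normalized sinc function, sin(pi u) / (pi u)."""
    return 1.0 if u == 0 else math.sin(math.pi * u) / (math.pi * u)

def sinc_interp(t):
    """Whittaker-Shannon interpolation, truncated to the samples we have."""
    return sum(x * sinc(t - n) for n, x in samples.items())

def linear_interp(t):
    """Straight lines between neighbouring samples, as an audio editor might draw."""
    n = math.floor(t)
    frac = t - n
    return (1 - frac) * samples[n] + frac * samples[n + 1]

# Compare both reconstructions with the true signal on a fine grid
# (10 points per sample period) near the middle of the sampled stretch.
times = [i / 10 for i in range(-20, 21)]
sinc_error = max(abs(sinc_interp(t) - tone(t)) for t in times)
linear_error = max(abs(linear_interp(t) - tone(t)) for t in times)

print(f"worst error, straight lines: {linear_error:.3f}")
print(f"worst error, sinc interpolation: {sinc_error:.5f}")
```

The straight-line picture is badly wrong between the sample points, which is why the graph looks so little like a sine wave, while the sinc reconstruction follows the original closely from the very same samples.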
I haven’t looked much into the literature on the limits of audio
perception, but I have no reason to doubt what Monty says about
them. Something I did wonder, after reading his essay, though, was: what
about intermodulation distortion in the ear itself? That is, distortion
of the same sort that he describes in amplifiers and speakers. Being
made of meat, the human ear is not
perfectly linear; and pretty much any nonlinearity gives some amount of
intermodulation distortion. Unlike in the case of intermodulation
distortion in audio equipment, though, this would be natural
intermodulation distortion: if, for instance, one heard a violin being
played in the same room, one would be hearing whatever intermodulation
distortion resulted in the ear from its ultrasonic frequencies; those
would thus comprise part of the natural sound of a violin, and
reproducing them thus could be useful. Also, nonlinearities can be
complicated: any given audio sample might not excite some particular
nonlinearities that might nevertheless be excited by a different sort of
music. But as the hypothetical language (“could”, “would”) indicates,
these are theoretical possibilities, which can be put to rest by
appropriate experiments. Indeed, a test Monty links to was
“constructed to maximize the possibility of detection by placing the
intermodulation products where they’d be most audible”, and it
nevertheless found that ultrasonics made no audible difference. I only
took note of that sentence on re-reading; but this
nonlinearity-in-the-ear idea is what that test was designed to check for.
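For readers unfamiliar with the mechanism, here is a minimal sketch of how a nonlinearity turns two ultrasonic tones into an audible one. The tone frequencies and the strength of the quadratic term (`K2`) are made-up figures chosen to make the effect easy to see; they are not measurements of the ear or of any equipment.

```python
import cmath
import math

SAMPLE_RATE = 192_000      # Hz; high enough that none of the products alias
N = 192                    # 1 ms of signal, so 1 kHz frequency resolution
F1, F2 = 24_000, 27_000    # two ultrasonic tones (illustrative)
K2 = 0.1                   # strength of the quadratic term; a made-up figure

def amplitude_at(x, freq):
    """Amplitude of the sinusoidal component of x at freq (a single DFT bin)."""
    acc = sum(v * cmath.exp(-2j * math.pi * freq * t / SAMPLE_RATE)
              for t, v in enumerate(x))
    return 2 * abs(acc) / len(x)

# Two unit-amplitude ultrasonic tones, with nothing in the audible range.
clean = [math.sin(2 * math.pi * F1 * t / SAMPLE_RATE)
         + math.sin(2 * math.pi * F2 * t / SAMPLE_RATE) for t in range(N)]

# Pass them through a slightly nonlinear system: y = x + K2 * x^2.
distorted = [v + K2 * v * v for v in clean]

difference_tone = F2 - F1   # 3 kHz, well inside the audible range
before = amplitude_at(clean, difference_tone)
after = amplitude_at(distorted, difference_tone)

# Level of the difference tone relative to one of the original tones.
level_db = 20 * math.log10(after / amplitude_at(distorted, F1))

print(f"3 kHz amplitude before: {before:.2e}, after: {after:.4f}")
print(f"difference tone level: {level_db:.1f} dB relative to the 24 kHz tone")
```

With these figures the 3 kHz product comes out at a tenth of the amplitude of the original tones. A weaker nonlinearity or weaker ultrasonic tones would push it down further, since the product scales with `K2` and with the amplitudes of both tones.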
Poking around at the Hydrogen Audio forums, the explanation for why nonlinearity in the ear doesn’t produce audible lower frequencies seems to be that:
- Ultrasonics get highly attenuated in the outer parts of the ear, before they could do much in the way of intermodulation distortion. (It’s quite common for higher frequencies to get attenuated more, even in air; this is why a nearby explosion is heard as a “crack”, but a far-off one is more of a boom.)
- Intermodulation distortion then imposes a further attenuation, the spurious frequencies introduced by distortion having much less energy than the original frequencies.
- Generally in music the ultrasonic parts are at a lower volume than the audible parts to begin with.
Multiply these three effects together, or even just the first two of them, and perhaps one always gets something too small to be heard. In any case, as Monty states, it’s impossible to absolutely prove that nobody can hear ultrasonics even in the most specially-constructed audio tracks. But when one is considering this sort of thing as a commercial proposition, the question is not whether exceptional freaks might exist, but what the averages are.
(Update: Monty tells me that contrary to what I’d originally stated above, “by most measures the ear is quite linear”, and “exhibits low harmonic distortion figures and so virtually no intermodulation.” The text above has been corrected accordingly. I’d seen references to nonlinearity in the hair cells; and it’d be hard to avoid it in neurons; but those are after the frequencies have been sorted out.)