On Recording and Encoding Parameters (DRAFT!)

by Allan Peace

[Editor: The GBCC mailing list has seen questions of the "best" or "recommended" recording and MP3 encoding practices. While the GBCC is not a standard setting organization, we should encourage good recording practices across our hobby. Allan's monograph is a response to these initial discussions. At this time, further details are being researched and this copy should be considered a draft.]

There are generally five forms of degradation that occur when broadcasting or recording:

1.  Bandwidth limiting

Very high or very low frequencies are removed, giving the result a muffled or thin sound respectively. Once the audio information is lost, it is irrecoverable. Digital audio must be sampled at no less than twice the frequency of the highest pitched sound to be preserved. So a 5000 Hz bandwidth AM broadcast must be sampled at 10,000 Hz or greater (in practice 11,025 Hz, the nearest common rate) to preserve all the information transmitted. If sampled at, say, 8000 Hz, then the audio between about 3500 and 5000 Hz will be lost, irrecoverably. The recording will sound muffled and dull. Instruments like the cymbal will be almost inaudible.
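The sampling rule above is simple arithmetic, and a minimal sketch may make it easier to apply to other sources. The list of standard rates and the function names are illustrative assumptions, not part of any encoder. The "highest frequency kept" figure is the theoretical limit; real converters need some filter margin, so the usable limit is a little lower.

```python
# Sample rates commonly offered by sound cards and encoders, in Hz
# (an illustrative list, not an exhaustive one).
STANDARD_RATES_HZ = [8000, 11025, 16000, 22050, 32000, 44100, 48000]


def minimum_sample_rate(bandwidth_hz):
    """Lowest usable rate: twice the highest frequency to be kept."""
    return 2 * bandwidth_hz


def lowest_standard_rate(bandwidth_hz):
    """Pick the lowest standard rate that preserves the whole bandwidth."""
    needed = minimum_sample_rate(bandwidth_hz)
    for rate in STANDARD_RATES_HZ:
        if rate >= needed:
            return rate
    raise ValueError("bandwidth exceeds the highest rate in the list")


def highest_frequency_kept(sample_rate_hz):
    """Theoretical upper limit of the audio a sample rate can carry."""
    return sample_rate_hz / 2


print(lowest_standard_rate(5000))    # usual AM radio
print(lowest_standard_rate(9000))    # good AM radio
print(highest_frequency_kept(8000))  # telephone-grade sampling
```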

Common bandwidths are:

Medium             Frequency Range
Human ear          20-20,000 Hz  (+2000/-15000 depending on how old you are, and your ear's history of abuse)
DVD, Vinyl         20-20,000 Hz
CD, R-R tape       20-20,000 Hz
Good FM radio      20-15000 Hz
BBC7 Real Audio    20-15000 Hz
Good Cassette      20-12000 Hz
Usual Cassette     40-10000 Hz
Good AM Radio      50-9000 Hz
Usual AM Radio     100-5000 Hz
Digital Phone      300-3400 Hz
Analog Phone       500-2400 Hz
2 way radio        300-2500 Hz

2.   Noise and Distortion

Unwanted sounds are introduced by the medium, e.g. hiss, hum, clipping, crossover/coring distortions, clicks, pops and whistles. These can be reduced by various techniques, provided that the wanted signal is substantially louder than the unwanted. However, the removal of the unwanted noises usually introduces other forms of distortion and/or information loss.

3.   Quantisation distortion

Digital media carve the audio up into chunks usually called samples. These samples can only take fixed levels and so some information is lost when the sound detail falls between the available levels.

4.   Time smear

Compression systems generally group a number of digital samples into a "frame." Lossless compression systems are able to reconstruct the original samples, but lossy systems (like MP3) have only a fixed number of possible frames and hence cannot reconstruct all possible samples accurately. The frames can be thought of as like the syllables in speech. Each frame can only have one "sound" associated with it. If a sound is shorter than the frame period then it will be lost or distorted to occupy the whole frame period. It is difficult to describe the effect of this distortion but it gives the audio a sort of bubbly "underwater" sound.

5.  Analog Compression

AM and some FM radio stations use "compressors" and limiters to reduce the dynamics (the difference between loud and quiet sounds) of the signal to make their signal "louder" or to correct for incompetent mix panel operators (or total lack of them). Unless analog compression is applied by a known and linear "law" (such as the Dolby and dBx Noise Reduction Systems) then the resulting signal distortion is irreversible. MP3 doesn't (intentionally) change the dynamics.


So choosing MP3 parameters has to bear all these things in mind. There are four main parameters to be set in an MP3 coder.  These are

1. Sample Rate (and implicitly the frame rate) and precision

Sample Rate:  Fairly easy to determine: 

Choose a rate that is at least equal to the source quality for the final output and one or two notches higher for any intermediate compressed files (if possible).  I stick to 44 KHz for almost everything so I can make CDs from the files without a lot of fiddle, and then the sample rate doesn't cause any noticeable degradation.

The MP3 is a proprietary standard so its internal frame rate info isn't easily come by, but, at a guess, it is roughly 4 thousandths of a second for 44 KHz, 5 for 32 K, 8 for 22 K, 15 for 11 K and 20 for 8 K. This means that an MP3 file at 22 KHz sample rate can only code about 120 unique sounds per second, which is why some people claim that such MP3 recordings lose their "detail." The sharp attacks of percussive instruments are lost. I suspect that they allow the frame length to be changed with the stream rate, so high stream rate to sample rate ratios will result in shorter frames, hence more of them, and hence greater precision in the output (see stream rate below).
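The "unique sounds per second" figure is just the reciprocal of the frame period. A small sketch follows, using the guessed frame periods quoted above; they are estimates from the text, not published figures, and the names are illustrative.

```python
# Guessed frame periods in milliseconds, keyed by sample rate in KHz.
# These are the estimates given in the text, not published figures.
GUESSED_FRAME_PERIOD_MS = {44: 4, 32: 5, 22: 8, 11: 15, 8: 20}


def frames_per_second(frame_period_ms):
    """Number of frames (distinct 'sounds') that fit into one second."""
    return 1000 / frame_period_ms


for rate_khz, period_ms in GUESSED_FRAME_PERIOD_MS.items():
    fps = frames_per_second(period_ms)
    print(f"{rate_khz} KHz: about {fps:.0f} frames per second")
```

With the 8 millisecond guess for 22 KHz this gives 125 frames per second, in line with the "about 120" quoted above.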

Sample Precision: Generally 16 bit audio precision is used for all sound recording except for extreme requirements.  16 bit can accurately reproduce sounds from a pin-drop to the sound of a noisy crowd (it roughly matches the linear dynamic range of the human ear).  24 or 32 bit sound is only needed for professional recordings when lots of tracks have to be mixed down, or if you need to accurately reproduce the noise of a turbojet engine flying through your living room followed by a pin dropping.  8 bit is used only for dictation machines, cheap toys and telephony (and very old computer sound cards).  12 bit is used for digital telephony.
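As a rough guide to what each precision buys, the ratio between the largest and smallest level a linear sample can hold works out to about 6 dB per bit. The sketch below assumes plain linear PCM; the function name is illustrative.

```python
import math


def dynamic_range_db(bits):
    """Ratio of full scale to one quantisation step for linear PCM, in dB."""
    return 20 * math.log10(2 ** bits)


for bits in (8, 12, 16, 24):
    print(f"{bits} bit: about {dynamic_range_db(bits):.0f} dB")
```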

2. The Audio Data Stream Rate

The stream rate is a little more tricky. This parameter sets the rules in the MP3 coder as to how much aural detail is to be discarded to "squeeze" the information into the digital form. As a rule of thumb, the information loss is not apparent at all if the stream rate is set to 3-4 times the sample rate (i.e., 128 Kb/s for a 44 KHz sample rate mono signal). At twice the sample rate (80 Kb/s for 44 KHz), the distortion effects are just noticeable on music and "difficult" sounds such as applause, and when set to equal the sample rate (48 Kb/s for 44 KHz), the degradation is quite noticeable. Set to half the sample rate (24 Kb/s) and the effect is almost unbearable, as the sound takes on that classic "my audio is being strangled in a huge glass bottle" MP3 type sound. Audio information is discarded both by reducing the precision of the frame's data and by reducing the frame rate.

So it is:
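The rule of thumb for a mono signal can be sketched as a lookup on the ratio of stream rate to sample rate. The band edges (2.5, 1.5 and 0.75) are assumptions chosen to sit between the ratios quoted above; they are not values from any encoder.

```python
def expected_quality(stream_rate_kbps, sample_rate_khz):
    """Classify a mono MP3 stream by its stream-rate to sample-rate ratio.

    The thresholds are assumed midpoints between the quoted ratios
    (3-4x, 2x, 1x and 0.5x), not measured values.
    """
    ratio = stream_rate_kbps / sample_rate_khz
    if ratio >= 2.5:
        return "loss not apparent"
    if ratio >= 1.5:
        return "just noticeable"
    if ratio >= 0.75:
        return "quite noticeable"
    return "almost unbearable"


for rate in (128, 80, 48, 24):
    print(f"{rate} Kb/s at 44 KHz: {expected_quality(rate, 44)}")
```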

Variable stream rate allows the CODEC to alter this data stream rate dynamically, depending on the complexity of the audio being coded at the given instant. I have never noticed a significant improvement in file size using this technique and compatibility with cheaper CODECs (in MP3 players) is an issue with variable rate streams. Generally, I don't use them for compatibility reasons.

3. The CODEC "Precision"

The CODEC precision is a very difficult parameter to judge. The CODEC is given a finite time to analyse each frame to ensure that it doesn't spend excessive time on unwanted detail. In "fast" modes, the CODEC skips some parts of its processes used to identify more "difficult" sounds. The upshot is that the fast (and cheap) CODECs sometimes misidentify sounds and the result is that the sound replayed does not match the sound recorded very well. Percussive and non-repetitive sounds such as applause seem to be the worst affected. Use the best CODEC and its "highest quality" setting to be sure that the MP3 representation of your recording is good. If lower settings (or "free" CODECs) are used, then check the result in percussive and applause sections to ensure that it has coded the sounds in a way you find acceptable. It's too late to do anything once the MP3 file is the only copy of a program that you have. Miscoding cannot be repaired.

4. Stereo coding

Stereo: A full 2 channel audio recording will require two times the bit stream rate of a mono recording (funny about that). However, as long as there is no complex coding of the stereo pair (such as occurs with Dolby Surround) then the stereo can be treated as a mono signal plus directional information. The stereo audio can be processed in four ways:

  1. Full dual channel stereo (perfect 2 channel stereo)
  2. Mid-Side: where the difference between the left and right speaker audio is treated as a slightly less complex audio channel but at the same sample rate.
  3. Low Complexity: where the difference is treated as a lower sample rate channel.
  4. Intensity: where the mono signal's position in the audible stereo field is controlled by a very low rate channel (sort of like an automated pan pot).

The Intensity Mode should only be used where there is normally only a single source of audio in the stereo field and you don't mind the stereo field sounding like it is a mono recording with somebody fiddling the balance control to make it sound like stereo.

The full dual channel uses 200% of the mono stream rate, the mid-side technique uses about 150%, Low Complexity uses about 125% and Intensity uses 110%, all other things being equal. Again the "rule of thumb" is that Dual Channel should be used for Dolby Surround audio and master recordings, M-S for "normal" stereo, Low Complexity for recordings where the stereo is not important (it just provides ambience or direction) and Intensity for material where the sound only comes from a single source. For example, a stereo panel discussion such as My Music would probably not lose too much by Intensity coding, whereas a dual miked duet or an orchestra probably would lose a lot.
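The percentages above translate directly into a stereo stream rate once a mono rate has been chosen. A minimal sketch follows; the factors are the approximate figures from the text, and the names are illustrative.

```python
# Approximate stream rate of each stereo mode relative to mono,
# using the percentages quoted in the text.
STEREO_MODE_FACTOR = {
    "dual channel": 2.00,
    "mid-side": 1.50,
    "low complexity": 1.25,
    "intensity": 1.10,
}


def stereo_stream_rate(mono_rate_kbps, mode):
    """Stream rate needed for stereo at the same quality as the mono rate."""
    return mono_rate_kbps * STEREO_MODE_FACTOR[mode]


for mode in STEREO_MODE_FACTOR:
    print(f"{mode}: about {stereo_stream_rate(64, mode):.0f} kbit/sec")
```

Since encoders only offer fixed steps, the result would normally be rounded up to the next supported stream rate.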


Generally, if you don't want excessive MP3 artifacts on your final recordings, then remember that the final result is the sum total of all the distortions of all the media used to transmit and record the program. Any alteration of the audio usually degrades it further unless the process is known to undo a specific distortion accurately. Generally, the more you can avoid changing the format of audio data and/or companding it, the better it will be.

Also, if the audio framing process is applied multiple times, it can introduce peculiar interactions between the framing rates that can lead to swooshing and bubbling effects that sound similar to shortwave radio reception. Thus any intermediate recordings should be made at equal or higher bandwidths, and either at higher framing rates than the final recording or in a medium that does not involve framing (like WAV). Ideally MP3 or other lossy compressions should only be applied once, when the final recording is committed to the final media after all processing and editing. It won't be too much worse if intermediate compressions are a factor of 2 or more less severe than the final compression. Applying two sets of the same MP3 compression and re-expansion will double the distortion of the final file (as will MiniDisk or Ogg-Vorbis compression).


Sony Minidisks are recorded in a proprietary compressed format similar to MP3 run at 48kHz sample rate and 256k stream rate. I have seen reports that it takes about 5 or so analog copies for the compression artifacts to become noticeable. The digital stream they record on the disk is not decodable by non-Sony products because Sony will not grant a licence to produce a software or hardware decoder for the ATRAC data stream. Thus the need for the analog interface. I dislike proprietary standards so I have always avoided Sony Minidisk as a format (especially as Sony are not renowned for their consideration of the user when it comes to protecting "their" intellectual property).


So, a couple of examples of my reckoning as to what MP3 rates to use.

Input: BBC7 Real Audio stream
  Data Rate: 44 Ks/s
  Bandwidth: 15 KHz
  Stereo: low complexity - intensity
  Formats / Encoding:
    Intermediate storage file 1: 44 KHz dual chan. stereo wav file (stream rate = 1440 kbit/sec) = about 650 Megabyte per Hour
    Intermediate storage file 2: 44 KHz dual chan. stereo MP3 file (stream rate = 160 kbit/sec) = 70 Megabyte per Hour
    Final Edited file for CD: 44 KHz M-S stereo MP3 file (stream rate = 128 kbit/sec) = 60 Megabyte per Hour
    Internet mailing file: 22 KHz LowComplexity stereo MP3Pro file (stream rate = 64 kbit/sec) = 30 Megabyte per Hour

Input: Cassette/FM Radio to sound card
  Data Rate: 44 KHz
  Stereo: dual chan. stereo wav file (stream rate = 1440 kbit/sec) = about 650 Megabyte per Hour
  Formats / Encoding:
    Intermediate storage file 1: 44 KHz dual chan. stereo wav file (stream rate = 1440 kbit/sec) = about 650 Megabyte per Hour
    Intermediate storage file 2: 44 KHz dual chan. stereo MP3 file (stream rate = 160 kbit/sec) = 70 Megabyte per Hour
    Final Edited file for CD: 44 KHz M-S stereo MP3 file (stream rate = 128 kbit/sec) = 60 Megabyte per Hour
    Internet mailing file: 22 KHz LowComplexity stereo MP3Pro file (stream rate = 64 kbit/sec) = 30 Megabyte per Hour

Input: Full bandwidth Music Source to sound card
  Data Rate: 44 KHz
  Stereo: dual chan. stereo wav file (stream rate = 1440 kbit/sec) = about 650 Megabyte per Hour
  Formats / Encoding:
    Intermediate storage file 1: 44 KHz dual chan. stereo wav file (stream rate = 1440 kbit/sec) = about 650 Megabyte per Hour
    Intermediate storage file 2: 44 KHz dual chan. stereo MP3 file (stream rate = 256 kbit/sec) = about 115 Megabyte per Hour
    Final Edited file for CD: 44 KHz M-S stereo MP3 file (stream rate = 160 kbit/sec) = 70 Megabyte per Hour
    Internet mailing file: 22 KHz LowComplexity stereo MP3Pro file (stream rate = 64 kbit/sec) = 30 Megabyte per Hour

Even so, most recipients of a 14 meg file of a half hour comedy in their email get upset, and so the 7 meg comedy file is the norm on the net. This is best achieved by further reducing the sample rate to 16 KHz and the stream rate to 32 kb/sec, and using intensity stereo if possible. The results are barely passable, though, even on spoken word material.

Material that is already bandlimited to 5 KHz (poor AM and shortwave recordings) can be downsampled to 11 KHz without loss. Such material is usually mono anyway, so it can be run at a stream rate of 24 Kb/s without too much distortion. I have never heard anything below that rate that is acceptable in MP3 format, though.

If you really must get down that low, an 8 KHz sample rate with the GSM codec, which is the standard digital phone CODEC, is probably the best combination.
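The file sizes quoted above follow directly from the stream rate, so they are easy to check for any other rate. The sketch below assumes 1 Megabyte = 1,000,000 bytes and ignores file headers and tags; the function names are illustrative.

```python
def megabytes_per_hour(stream_rate_kbps):
    """Audio data for one hour, in Megabytes of 1,000,000 bytes."""
    bits = stream_rate_kbps * 1000 * 3600
    return bits / 8 / 1_000_000


def megabytes_for(stream_rate_kbps, minutes):
    """Audio data for a programme of the given length in minutes."""
    return megabytes_per_hour(stream_rate_kbps) * minutes / 60


for rate in (1440, 256, 160, 128, 64, 32):
    print(f"{rate} kbit/sec: {megabytes_per_hour(rate):.1f} Megabyte per Hour")

print(f"Half hour at 64 kbit/sec: {megabytes_for(64, 30):.1f} Megabyte")
print(f"Half hour at 32 kbit/sec: {megabytes_for(32, 30):.1f} Megabyte")
```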


Compatibility is also an issue when choosing datarates. If you want to be able to play your files on your DataKey, your MP3 CD player, your DVD player and your computer, then it is a good idea to stick to the commonly used rates.  I try to stick to 44 and 22 KHz sampling and 320/256/160/128/64/32 kbit/sec stream rates as these are well supported.  Obscure data rates and/or file formats could leave you with a lot of unplayable files in 20 years' time (the digital equivalent of having a 60 rpm home cut wax cylinder at the wrong track pitch).


In the special case of Real Audio from BBC7, the audio data received is obviously distorted but not disastrously so. However, coding with a similar rate (40 Kb/sec) would double the distortion. 128 kb/sec seems to introduce a little more distortion while 160 seems to be no worse than the original. Real Audio seems to use a mixture of M-S and Intensity stereo coding depending on the nature of the audio. Thus intensity is used for a single audio source such as a contestant in My Music, while M-S is used for the audience applause. The single voice (virtual mono) therefore has much more "bandwidth" for the voice information than an orchestra or audience applause, which needs almost two full channels for the true stereo field and hence only has half the bandwidth available for each channel.


So, from the above...

If your source is a quality CD and you want to keep it that way, then it has been established through experiment that there will be no significant compression degradation with an MP3 with a data rate that is 4 times the sample rate or higher.

The BBC7 webstream is 15 KHz audio bandwidth, so any audio should be sampled at 44 K/s if information is not to be lost. The Beeb has had a policy of the very best of technical standards for all its broadcasts. Most Beeb stuff (post 1950 anyway) is recorded in full audio bandwidth so the main limitation is the broadcast medium. This is why the Ted Kendall Goon Shows still can sound excellent even though they were recorded 50 years ago. Once the broadcast or recording medium has distorted the signal, however, there is little prospect of recovering the lost information.

The best trade-off between file size and quality depends on the purpose of the MP3 file. Files that are for "listening only" can generally be at much lower rates than those used as source material for editing. Adequate quality for listening is a matter of taste. Reference recordings for further editing must be the best copy that can be obtained and stored. Ideally, everything should be stored in a super CD format at 96 KHz sampling and 24 bit WAV audio, but that isn't going to happen. You only get 2 hours on a DVD in that format.


On Producing Monophonic Recordings from Pseudo-stereo Sources

[ Editor: BBC radio broadcasts were monophonic before 1972. Thus many of the classic comedy series such as the Goon Show, Hancock's Half Hour, Take It from Here, Beyond Our Ken, Round the Horne, and all but the last series of I'm Sorry I'll Read that Again are monophonic material. So are audiobooks and other recordings of readings by single actors. Yet we are frequently presented with "stereo" recordings and broadcast repeats of this material. How should we handle this situation?]

I would suggest that you combine the stereo pair of audio channels from the BBC7 Real Audio in a 50/50 mix as it will give a small improvement in the background noise. However this only applies to BBC sourced mono material and material where the left-right phasing can be guaranteed to be correct, such as vinyl records. The Beeb take great care to ensure that their channels are phased properly and vinyl is intrinsically correct. If you try the same trick on most tape recordings you get nasty variations in the high frequency output. For mono cassettes, it is recommended to use the right channel (unless obviously inferior to the left) as this track is slightly less susceptible to variation as it is close to the centre of the tape whereas the left is recorded on the upper edge.

I use a phase meter to check my recordings before combining them. If you use Cooledit/Adobe Audition, it has such a phase graph that can quickly identify if a program has any "Side" channel information. If there is any, then the stereo pair cannot be combined without degradation.
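For those without a phase meter, the same check can be sketched in a few lines: split the stereo pair into "Mid" (the 50/50 mix) and "Side" (the difference), and compare their levels. The threshold of 0.01 is purely an illustrative assumption, as are the function names, and the samples are taken to be plain lists of floating point values.

```python
def mid_side(left, right):
    """Split a stereo pair into mid (50/50 mono mix) and side (difference)."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def rms(samples):
    """Root mean square level of a list of samples."""
    return (sum(s * s for s in samples) / len(samples)) ** 0.5


def is_effectively_mono(left, right, threshold=0.01):
    """True when the side channel is negligible next to the mid channel.

    The threshold is an illustrative ratio, not a standard value.
    """
    mid, side = mid_side(left, right)
    mid_level = rms(mid)
    if mid_level == 0:
        # Nothing survives the 50/50 mix (silence or reversed phase).
        return rms(side) == 0
    return rms(side) / mid_level < threshold


mono_source = [0.5, -0.25, 0.75, 0.0]
print(is_effectively_mono(mono_source, mono_source))                # same on both channels
print(is_effectively_mono(mono_source, [-s for s in mono_source]))  # one channel phase-reversed
```

When the check passes, the mid list is the 50/50 mono mix; when one channel is phase-reversed, the mix cancels and the check fails, which is the case the phase meter is there to catch.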


On Providing Leader and Trailer Time for Compatibility

[Editor: Some MP3s played back by some decoders on some devices prematurely terminate playback, even though close scrutiny in another playback program or audio editor reveals the material is in fact still there. Differences have been seen playing the same MP3 file from an optical disk and from a hard drive. Some evidence suggests that a workaround but not a fix is available by adding padding or silence at the end of a recording before encoding.]

I use a 1 second lead-in and 6 seconds per hour of recording for the lead-out, but I agree with Bruce that the problem appears to be with the decoder and that the information is not lost without a trailer. There is no problem at the lead-in, but a lead-in is useful to separate items when played in a sequence. The lead-out trailer is useful, though, in that it gives compatibility with other people's CODECs. The only exception I make is where the tracks segue, as they do on some ripped CDs, in which case padding must not be added and they must be played on a player that performs the track join correctly. Usually I join such recordings to make a single track to avoid the segue dovetailing problem.
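The padding rule above scales with programme length, so it can be worked out per file. A minimal sketch, with illustrative names:

```python
LEAD_IN_SECONDS = 1.0
LEAD_OUT_SECONDS_PER_HOUR = 6.0


def padding_seconds(programme_seconds):
    """Return (lead-in, lead-out) padding in seconds for one recording."""
    lead_out = LEAD_OUT_SECONDS_PER_HOUR * programme_seconds / 3600
    return LEAD_IN_SECONDS, lead_out


lead_in, lead_out = padding_seconds(30 * 60)  # a half hour programme
print(f"lead-in {lead_in:.1f} s, lead-out {lead_out:.1f} s")
```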

As far as I have been able to determine the problem is that there are two measures of track length in most digital media. The header usually states the length in hh:mm:ss.ddd format, but there is also the actual recording frame count multiplied by the player's frame period. When the track is played back, the termination of the track can be triggered either by the header timecode or when the player runs out of frames. If the frame size of the recorder is different from the player's (even by a very small amount) then problems will result. Some MP3 parameter values and timings are inevitably subject to digital rounding errors and it is this that causes premature track termination. Only players that play all the available frames will play or convert the whole track all the time.
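To see how small a mismatch is needed, here is a sketch of the mechanism as described above. It is only a model of the suspected cause, not a description of any real player. The 8 millisecond frame period is the guess used earlier for 22 KHz, the 0.1% mismatch is hypothetical, and times are held in whole microseconds to keep the arithmetic exact.

```python
def frames_played(header_us, frame_count, player_period_us):
    """Frames a player gets through if it stops at the header time or at
    the last frame, whichever comes first."""
    frames_by_header = header_us // player_period_us
    return min(frame_count, frames_by_header)


def seconds_truncated(frame_count, recorder_period_us, player_period_us):
    """Programme time lost when recorder and player disagree on the period."""
    header_us = frame_count * recorder_period_us  # length written in the header
    played = frames_played(header_us, frame_count, player_period_us)
    return (frame_count - played) * recorder_period_us / 1_000_000


# One hour of 8 ms frames (guessed period), player assuming 0.1% longer frames.
one_hour_frames = 450_000
print(seconds_truncated(one_hour_frames, 8000, 8008))  # player period too long
print(seconds_truncated(one_hour_frames, 8000, 7992))  # player period too short
```

Under these assumptions a 0.1% disagreement costs about 3.6 seconds per hour, while a player that assumes a shorter period simply runs out of frames and loses nothing.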

Unfortunately, many editors (including Cooledit, which I use) seem to use the header's track time duration to construct the raw PCM audio from the MP3, and if this reconstructed file is re-saved after editing, then the last few seconds are permanently lost. This appears to be a "feature" of the Fraunhofer CODEC, which is unfortunate, as this is the "reference" MP3 CODEC. It can be fixed by using an MP3 decoder that uses all frames to reconstruct the PCM stream, but I have not had the time to characterise all the MP3 CODECs I have. It also depends on the sample and frame rates chosen. It seems that the problem is in rounding errors in the ratio of frame to sample and with rounding in the frame time period.

I did the calcs on the problem about 2 years ago and found that a rounding error in the sample to frame ratio would account for the problem, almost exactly, but I have never been able to confirm it.

This is one of the problems with using proprietary CODECs such as MP3. There is precious little information available to allow users to identify CODEC shortcomings and devise workarounds.  I approached Syntrillium on this issue and they were unable to find the problem. Presumably the Fraunhofer CODEC was as much a "black box" to them as it is to us poor users. Now that Adobe has taken them over, chances are we will never know, as they are even more secretive than Fraunhofer.

Eventually we probably should put together a properly researched FAQ regarding these issues. Perhaps some better informed users have already done this in other net groups. I may have a quick search to try to find such a document.

I have been having similar problems with the Real stream from the BBC. Every now and then, there is sufficient single packet loss in the Real Audio stream and/or slippage between the BBC7 encoder and my decoder that the Real Player inserts a 3-5 second gap to allow the decoder's FIFO data buffer to catch up. When I come to edit the file for archiving, the gap can be edited out and the recording segments join almost seamlessly. It is not a net transmission fault, as it would first appear, but simply a timing issue due to the inevitable errors in timing between the BBC in the UK and Yours Truly here in Oz, which cannot be absorbed by the Real Player's 10 second buffer during reception sessions that can last many hours.

I don't use variable rate and still have problems. The only major problem usually seen is by us folk trying to decode MP3 back to raw PCM for editing. Failure to play to the end is annoying; permanent truncation of the sketch's punchline is a disaster! The MP3 file contains the data but the editor's PCM is truncated by about 2-7 seconds per hour for certain combinations of sample and stream rate. From memory, 44K/96K was the worst offender on Cooledit 2000's CODEC. A check of other net groups shows that it has been observed by many others, but there is no good, solid, definitive explanation or fix for the problem, even from those that write the software.

As John suggests, putting a leading and trailing silence on edited MP3 files ensures that the file is playable on most decoders and can be re-edited easily. It's not mandatory but it saves having to find a CODEC that works 100% with a given file. Rather than silence, I usually leave the continuity announcements on BBC7 stuff, which has the same effect and often saves the original broadcast date for posterity as a bonus.

I have collected a few interesting technical links if people want to follow the subject further, but, be warned, it is heavy going for those without an audio engineering background: