3D-Audio For Sound Designers: Spatial Hearing Part 1B

April 1, 2021| systematic-sound

Now let's talk about spectral cues, the cone of confusion, head movement, head-related transfer functions, distance perception and much more.

Spectral Cues

In contrast to the binaural cues, the spectral cues work in a monaural way. They arise from reflection and diffraction at the individual's own anatomy, mainly the pinna but also the head and shoulders. Where the binaural cues derived from ITD and ILD are mainly related to lateralization, the spectral cues are mostly connected to vertical and front-to-back localization (although the so-called "head shadowing effect" plays a major part in lateralization).

Because of the geometry of our auditory system, we can observe different delays between the direct signal that goes straight into the ear canal and the parts that get reflected, diffracted or absorbed, mainly by the pinna, depending on the location of the source and its angle of incidence.

All this acts as a filter that alters the incident sound spectrum, creating direction-dependent notches and peaks. In addition, the geometry of the pinna differs significantly from individual to individual. It is astonishing that our localization capabilities nonetheless do not differ that much. This means that those spectral alterations must be regarded as individualized localization cues that we humans are able to adapt to. Taking into consideration the dimensions of the pinna (approx. 65 mm on average), we can deduce that the related effects become noticeable at higher frequencies, where the wavelength is comparable to the dimensions of the pinna. The effect starts around 2-3 kHz and is prominent around 5-6 kHz.
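As a rough plausibility check of these numbers, the following sketch converts the average pinna size into the frequency whose wavelength matches it. The speed of sound (343 m/s) is an assumed room-temperature value, and the function name is only illustrative.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 °C
PINNA_SIZE = 0.065      # m, average pinna dimension quoted in the text


def frequency_for_wavelength(wavelength_m):
    """Return the frequency (Hz) whose wavelength in air equals wavelength_m."""
    return SPEED_OF_SOUND / wavelength_m


# Frequency at which one full wavelength is as long as the pinna
pinna_frequency = frequency_for_wavelength(PINNA_SIZE)
print(f"{pinna_frequency:.0f} Hz")  # prints "5277 Hz"
```

The result of roughly 5.3 kHz falls into the 5-6 kHz region where the pinna effects are said to be prominent.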

Cone of Confusion and Head Movement

ITD and ILD are the dominant localization cues at low and high frequencies respectively, but they are inadequate for determining a unique position: an infinite number of spatial positions exist that possess identical ITD and ILD values. In an idealized model, where we disregard the shape of the head and assume that the ears are two separate points in free space, these positions form a cone around the interaural axis in three-dimensional space. This is what is called the cone of confusion. An extreme case is a source located in the median plane (for example at 0°), where the sound level and arrival time at both ears are identical. Another case of identical ITD and ILD is two sound sources at front-back mirror positions (such as azimuth 45° and 135°).

There is no obvious way of distinguishing between those front and rear sources with ITD and ILD alone; the binaural cues are ambiguous. This ambiguity can be resolved by head movement (a so-called dynamic cue, first mentioned by Wallach). If we turn our head clockwise and counterclockwise around the vertical axis, we change ITD, ILD and the sound pressure spectra at the ears. This way we are able to take more measurements and compare the localization cues dynamically to resolve ambiguities.
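The front-back mirror case can be reproduced with a minimal sketch of the idealized two-point model. It assumes a far-field source, an ear spacing of 0.18 m and azimuth measured clockwise from straight ahead; these values and the function name are illustrative assumptions, not taken from a measurement.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 °C
EAR_SPACING = 0.18      # m, assumed distance between the two point "ears"


def itd_seconds(azimuth_deg):
    """Far-field ITD of two point receivers with no head in between."""
    path_difference = EAR_SPACING * math.sin(math.radians(azimuth_deg))
    return path_difference / SPEED_OF_SOUND


itd_front = itd_seconds(45.0)   # source in front, to the right
itd_back = itd_seconds(135.0)   # mirrored source behind, to the right
print(itd_front, itd_back)
```

Both calls return the same delay (up to floating-point rounding), which is why ITD alone cannot tell the two positions apart in this model.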
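To illustrate how head rotation resolves the ambiguity, here is a small sketch based on the same idealized two-point ear model (far-field source, assumed ear spacing of 0.18 m, azimuth and head yaw measured clockwise from straight ahead). All names and values are assumptions made for the example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 °C
EAR_SPACING = 0.18      # m, assumed distance between the two point "ears"


def itd_after_rotation(source_azimuth_deg, head_yaw_deg):
    """ITD seen by the listener after turning the head by head_yaw_deg."""
    relative_azimuth = math.radians(source_azimuth_deg - head_yaw_deg)
    return EAR_SPACING * math.sin(relative_azimuth) / SPEED_OF_SOUND


HEAD_YAW = 10.0  # degrees, a small turn to the right

# Change in ITD caused by the head turn, for a front and a mirrored rear source
delta_front = itd_after_rotation(45.0, HEAD_YAW) - itd_after_rotation(45.0, 0.0)
delta_back = itd_after_rotation(135.0, HEAD_YAW) - itd_after_rotation(135.0, 0.0)
print(delta_front, delta_back)
```

In this model the ITD shrinks for the frontal source and grows for the rear source, so the direction of the change reveals whether the source is in front or behind.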

Head Related Transfer Functions

As mentioned before, when a sound from a certain position arrives at our ears after interacting with our anatomical structures, its properties have been changed by reflection, diffraction and absorption, mainly at the pinna, head, shoulders and torso. The various pieces of localization information obtained from the ITD, ILD and spectral cues come together for the auditory system to process and comprehensively locate the sound source.

If you take, for example, a fixed point source and a fixed head position, you can describe this transmission process as a linear time-invariant (LTI) process. The head-related transfer function (HRTF) for each ear describes the overall filtering effect imposed by our anatomical structures and is introduced as the acoustic transfer function of the LTI process, which can be expressed in these equations:

HL = PL / P0 (Left Ear HRTF)

HR = PR / P0 (Right Ear HRTF)

PL and PR represent the complex-valued sound pressures in the frequency domain at the left and right ear. P0 represents the complex-valued free-field sound pressure in the frequency domain at the position of the center of the head, with the head absent. Due to our unique anatomy, each human has their own individual HRTFs that their hearing relies on. They can be measured in a time-consuming and complicated process.

The main problem is that the subject has to stay still with measurement microphones in their ears while sound sources are recorded from all positions needed to create virtual sound sources in the application. In theory, this means that everyone would have to have their HRTFs measured and use their personal set. But there are two reasons why this is not a big issue in VR. First, we have head tracking and an open field of view with the head-mounted display on such a system.

This gives us a directional and a visual cue, so we can resolve uncertainties that may occur due to mismatched HRTFs, and we can slowly learn to get along with them. Second, it has been shown that despite all the differences between the measured HRTFs of individuals, there are some basic patterns that work for all humans. These are the so-called "directional bands".

They are basically frequency boosts and attenuations related to source positions. Regions around 0.3-0.6 kHz and 3-6 kHz seem to relate to frontal positions, 8 kHz seems to correspond to the overhead position, and the areas around 1.2 kHz and 12 kHz appear to be related to rear perception. This is the reason why binaural reproduction over headphones using averaged HRTFs does work.
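The following sketch shows how such a transfer function could be estimated numerically, assuming the common definition of the HRTF as the ear pressure spectrum divided by the free-field reference spectrum (HL = PL / P0). The signals are synthetic stand-ins, not measurements, and the helper names `dft` and `hrtf` are hypothetical.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform, sufficient for short test signals."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def hrtf(ear_signal, reference_signal):
    """Divide the ear spectrum by the reference spectrum, bin by bin.

    Real measurements would need regularization wherever the reference
    spectrum is close to zero; this sketch ignores that.
    """
    p_ear = dft(ear_signal)
    p_ref = dft(reference_signal)
    return [pe / pr for pe, pr in zip(p_ear, p_ref)]


N = 16
reference = [1.0] + [0.0] * (N - 1)  # P0: unit impulse at the head center
left_ear = [0.0] * N
left_ear[3] = 0.5                    # PL: same impulse, 3 samples later, half level

h_left = hrtf(left_ear, reference)
print(abs(h_left[1]), cmath.phase(h_left[1]))
```

For this toy input the magnitude is 0.5 in every bin (a pure attenuation) and the phase falls linearly with frequency (a pure delay). A measured HRTF would additionally show the direction-dependent peaks and notches described under spectral cues.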

Distance and Depth Perception

From an acoustic perspective, distance describes how far away a source appears to the listener, whereas depth describes the overall front-to-back distance of a scene and the sense of perspective created.

In terms of distance, there are a few basic rules that our hearing applies to determine the distance of a source (apart from what has been said about open and closed sound fields). The most obvious one is that we judge distance by level attenuation. But we need a context for that level difference. When a source is moving, we can use the relative loudness cue, meaning a level that changes over time, and from it extract whether the source is getting nearer or farther away. But this mechanism only works effectively for known sound sources. The brain is more responsive to relative changes in level and spectrum.

In addition to that, sound loses energy when it is traveling through the air. And since air absorbs high frequencies more strongly than low frequencies, we perceive sounds that are farther away not only at a lower level but also as duller, due to the loss of high-frequency energy (air damping).

The strongest cue for our distance perception in enclosed rooms is the ratio between the direct and the reverberant sound component (dry/wet ratio). The ratio between the direct signal, the early reflections and the reverberation tail, as well as the composition of the reverberation components and their timing, tells us a lot about the size and geometry of a room and the position of a source in it. As I have described under II. A. 1. b), the initial time delay is part of that impression.

The last rule of distance perception that I want to describe is motion parallax. When a sound traverses our sound field very quickly, it is an indication that it is very close to us, because we know, thanks to our hearing experience, that sound travels through air at a certain speed (approx. 343 m/s at 20 °C). Because it can't go faster, we can assume that the source must be very close to be able to cross our hearing field that suddenly. This phenomenon takes place at very close distances of around 0.25 m or less. More about the depth of sound sources follows in the next subchapter.
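The level cue can be put into numbers with a short sketch. It assumes an ideal point source in a free field, where sound pressure falls off with 1/r (the inverse distance law), and it deliberately leaves out air damping. The function name is illustrative.

```python
import math


def level_change_db(reference_distance_m, distance_m):
    """Level change in dB when moving from reference_distance_m to distance_m.

    Assumes a point source in a free field (pressure proportional to 1/r).
    """
    return -20.0 * math.log10(distance_m / reference_distance_m)


# Level change for each doubling of distance, starting at 1 m
for distance in (2.0, 4.0, 8.0):
    print(distance, round(level_change_db(1.0, distance), 1))
```

Each doubling of distance lowers the level by about 6 dB. In a real room, reflections reduce this drop, which is one reason why level alone is an unreliable distance cue for unfamiliar sources.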
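A minimal sketch of how a direct-to-reverberant ratio could be computed from a room impulse response is shown below. The 2.5 ms window around the direct sound, the synthetic impulse responses and all names are assumptions for illustration, not a standardized procedure.

```python
import math


def direct_to_reverberant_db(impulse_response, sample_rate, direct_window_ms=2.5):
    """Energy ratio (dB) of the direct sound to everything that follows it."""
    # Treat the strongest sample as the arrival of the direct sound
    peak = max(range(len(impulse_response)), key=lambda i: abs(impulse_response[i]))
    split = peak + int(sample_rate * direct_window_ms / 1000) + 1
    direct_energy = sum(s * s for s in impulse_response[:split])
    reverb_energy = sum(s * s for s in impulse_response[split:])
    return 10.0 * math.log10(direct_energy / reverb_energy)


def synthetic_ir(direct_amplitude):
    """Toy impulse response: one direct spike plus a fixed decaying tail."""
    tail = [0.1 * 0.99 ** n for n in range(400)]
    return [direct_amplitude] + [0.0] * 99 + tail


SAMPLE_RATE = 8000
drr_near = direct_to_reverberant_db(synthetic_ir(1.0), SAMPLE_RATE)
drr_far = direct_to_reverberant_db(synthetic_ir(0.25), SAMPLE_RATE)
print(drr_near, drr_far)
```

The toy model keeps the reverberant tail constant and only weakens the direct sound, which mimics a source moving away inside a room: the ratio drops as the source gets farther away.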

Source Width and Envelopment

Sound sources in space can be perceived as small, pinpoint events or subjectively as events with a larger dimension. This subjective phenomenon is subsumed under the term "apparent source width" (ASW). The ASW has been found to relate closely to a binaural measurement known as the interaural cross-correlation (IACC), which measures the degree of similarity between the signals at the two ears, comparing different frequency bands and time windows. If a short time window is measured (early IACC), that is, up to about 80 ms, then we can see that there is a correlation between the measured early reflections and the broadening of the sound source.

In a reverberant environment it can be hard to tell if a perceived source is "wide" or just diffuse and hard to localize. Furthermore, it can be quite difficult to distinguish the individual source width of a big sound source from the width of the overall sound stage, which describes the perceived distance between the outer left and right limits of the whole stereophonic scene.

When we try to describe the environmental spatial impression of a sound field, spaciousness can be used to describe the hearing impression of an "open" space, where a sound appears to exist outside of the listener's personal space, in their surroundings. Envelopment, on the other hand, is used to describe the sense of immersion and involvement in a reverberant sound field, with sound appearing to come from all around.
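The similarity measure can be sketched as the maximum of the normalized cross-correlation between the two ear signals over interaural lags of about ±1 ms. The lag range, the broadband (unfiltered) evaluation and the test signals are assumptions of this example; for an early IACC the input would be the first 80 ms of the binaural impulse response.

```python
import math
import random


def iacc(left, right, sample_rate, max_lag_ms=1.0):
    """Maximum absolute normalized cross-correlation within +/- max_lag_ms."""
    max_lag = int(sample_rate * max_lag_ms / 1000)
    norm = math.sqrt(sum(l * l for l in left) * sum(r * r for r in right))
    n = len(left)
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        acc = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                acc += left[i] * right[j]
        best = max(best, abs(acc) / norm)
    return best


SAMPLE_RATE = 10000
rng = random.Random(0)                             # fixed seed for repeatability
noise_a = [rng.gauss(0.0, 1.0) for _ in range(800)]  # 80 ms of noise
noise_b = [rng.gauss(0.0, 1.0) for _ in range(800)]  # independent noise

iacc_identical = iacc(noise_a, noise_a, SAMPLE_RATE)
iacc_independent = iacc(noise_a, noise_b, SAMPLE_RATE)
print(iacc_identical, iacc_independent)
```

Identical ear signals give a value of 1, which corresponds to a narrow, well-focused source, while independent signals give a value close to 0, which corresponds to a broad or diffuse impression.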

Spatial Hearing with Multiple Sources
Summing Localization Law for Two Sources

The major mechanism can be described by a phenomenon called two-channel stereophonic localization. If two sound sources emit the same signal at the same level, the listener will locate a virtual sound source symmetrically in the middle between the two sources. When they play at different levels, the virtual source will lean toward the source with the higher level in the panorama.

If the level difference is larger than 15 dB, then the virtual source will be located at the position of the respective (louder) real source, and its position does not change even if you increase the level difference further. And finally, the stereophonic law of sines states that the spatial position of the virtual source is completely determined by the amplitude ratio between the two loudspeaker signals and the angle the loudspeakers span with respect to the listener's position; frequency and head radius are irrelevant.
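The law of sines is commonly written as sin(θI) / sin(θ0) = (AL - AR) / (AL + AR), where θ0 is half the angle between the loudspeakers and AL, AR are the signal amplitudes. The sketch below evaluates it, assuming a standard ±30° setup and counting positive angles toward the left loudspeaker; symbols and names are chosen for this example.

```python
import math


def virtual_source_angle_deg(gain_left, gain_right, half_span_deg=30.0):
    """Virtual source angle predicted by the stereophonic law of sines.

    Positive angles point toward the left loudspeaker.
    """
    ratio = (gain_left - gain_right) / (gain_left + gain_right)
    sine = ratio * math.sin(math.radians(half_span_deg))
    return math.degrees(math.asin(sine))


centered = virtual_source_angle_deg(1.0, 1.0)    # equal levels
hard_left = virtual_source_angle_deg(1.0, 0.0)   # only the left speaker plays
leaning = virtual_source_angle_deg(1.0, 0.5)     # left about 6 dB louder
print(centered, hard_left, leaning)
```

Equal gains place the virtual source in the middle, a silent right channel places it at the left loudspeaker, and intermediate gain ratios move it smoothly in between. Note that only the amplitudes and the loudspeaker angle enter the formula.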

Cocktail Party Effect

The cocktail party effect is a psychoacoustic effect that refers to our ability to focus attention on the speech of a specific speaker while disregarding irrelevant information coming from the surroundings. Even when the sound components are similar in intensity and frequency, our auditory system is still able to separate the desired signal from the interfering noise signal.

From the physical point of view, one of the predominant elements of the cocktail party effect is the spatial separation of noise and speech. Consequently, on the psycho-physiological level, selective listening is governed by our capacity to discriminate sounds from different sources, that is, by our capacity to localize the noise. This means that the cocktail party effect is a kind of binaural auditory effect associated with the spatial hearing of multiple sources, through comprehensive processing of the binaural sound information by the high-level neural system.

In case you missed it: 3D-Audio For Sound Designers: Spatial Hearing Part 1A
