Spatial audio (also referred to as 3D audio or 360 audio) is a sonic experience in which the audio changes with the movement of the viewer's head; 3D audio effects manipulate the sound waves produced by stereo speakers, surround-sound speakers, speaker arrays, or headphones.
If we look at the fundamental characteristics of human hearing, we see that we perceive not only loudness, pitch, time, timbre and the spectral qualities of a sound, but also a subjective impression of its spatial attributes.
Since virtual worlds in games are 3D, we need to respect that and integrate audio with a certain amount of realism, in a way that makes sense to the player. I'm not talking about absolute realism in the sense of a simulation, but we will have to follow this path quite a long way to get audio right in a virtual 3D world.
And since spatialization is basically an attempt to reverse engineer human spatial hearing, we need to tackle this subject first. What makes up so-called "spatial hearing" is the ability of our auditory system to track the position of a sound source in space in terms of changes in direction and distance over time (including the information that the source is moving).
The human ability to locate the position and distance of a source depends on various factors: the position and direction of the source, the acoustic environment and geometry of the space, and the acoustic properties of the source itself. This ability varies between individuals and can't be generalized.
Sound Sources in Space
On a very fundamental level, the spatial characteristics of a natural sound come down to the distinction between an open and an enclosed sound field, and to whether the sound is perceived as a single source entity or more like a genuine environment.
In an open sound field, sound sources in the distance tend to be grouped together without a strong notion of directivity. Everything is more likely to blend together and to be perceived as coming from outside the listener's head. The sounds are often damped, because they have traveled a significant distance through the air, and therefore lack the presence of a sound that is only a few meters away and shaped by the reflections of a room or hall. Indoors, the sound tends to be strongly altered by reflections, absorption and diffraction, as well as pressure patterns and standing waves. Sources are often within a few meters, and their perception is strongly dominated by the reflections, which make up a large part of the spatial characteristics.
Besides that, we can categorize a sound in space as a discrete emitter when it can be perceived as a single, localizable entity. Environments, in contrast, often consist of a hard-to-distinguish mass of rather unspecific, general background sounds, perceived as an embedding ambience that is hard to localize due to its diffuse character.
Spatial Hearing in the Open Field
The free field is an acoustic term for an environment in which there are no, or only very few, reflections. In real life you can come close to it outdoors on a mountain top with no surroundings (though even then you would have the reflections of the ground). Such environments are rarely experienced in real life, but this idealized construction is well suited to explain, on a very basic level, how sound travels in space.
The first conclusion is that sound radiates from an omnidirectional point source as a spherical wave. As the wave expands away from the source, its energy is distributed over a sphere of increasing surface area. The surface of a sphere is described by its radius r, so that S = 4πr². Given a source with the power P₀, we can then derive the sound intensity J₁ at a distance r₁:

J₁ = P₀ / (4π r₁²)
If we now double the distance to 2r₁, we can see that the sound intensity is reduced to a quarter of its initial value:

J₂ = P₀ / (4π (2r₁)²) = J₁ / 4
Sound intensity is proportional to the square of the sound pressure (J ∝ p²). Knowing the sound pressure, we can derive the sound pressure level from it:

Lp = 20 · log₁₀(p / p₀) dB SPL
This tells us that a drop of the sound intensity to a quarter of its initial value results in a reduction of the sound pressure to one half, which corresponds to a level drop of about 6 dB SPL for every doubling of the distance from the source.
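The inverse-square relationship above is easy to verify numerically. Here is a minimal Python sketch (function name and structure are my own, purely for illustration) that computes the free-field level change between two distances:

```python
import math

def spl_drop_db(r_ref: float, r: float) -> float:
    """Level change (dB) of a point source heard at distance r instead
    of r_ref, assuming free-field spherical spreading (inverse-square law)."""
    return 20.0 * math.log10(r_ref / r)

# Doubling the distance quarters the intensity and halves the pressure:
print(round(spl_drop_db(1.0, 2.0), 2))  # -6.02 dB per doubling of distance
```

Note that game engines usually expose this as a configurable distance-attenuation curve rather than a strict physical law, so the 6 dB figure is a reference point, not a fixed rule.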
The second conclusion we can draw is that sources tend to radiate omnidirectionally at low frequencies and more directionally as the frequency rises. Close to the source (near field), the radiation patterns are clearly noticeable and the level drop can be quite significant; both become less and less obvious as one moves further away (far field). With increasing distance from the source, the curvature of the wavefront decreases, to the point where it becomes so shallow that the wave can be considered a plane wave.
How much distance is needed for this to happen depends on the dimensions of the source, on whether there is a single source or several in a row (wavefront superposition), and lastly on the wavelength of the sound.
Spatial Hearing in Enclosed Spaces
In closed spaces with a geometric form and reflective surfaces, we perceive not only the direct part of the auditory event but also the reflections of the room. From these reflections, the human brain can decode the spatial dimensions of both the environment and the source (plus the source position and distance), as well as information about the source's position relative to the room geometry (indirect localization). A schematic view of these reflections can be illustrated with a room impulse response in the time domain:
The first component that arrives at the receiver is the direct signal. It is followed by the early reflections, which result from the sound bouncing off reflective surfaces; they represent the shortest delay times and the shortest reflection paths.
The pre-delay (or initial time delay) is the time between the direct signal and the first early reflections. The bigger the room dimensions, the longer the pre-delay, because the early reflections need more time to travel through the air. The pre-delay also gives us information about the position of the source in the room, as well as its relation to the room geometry.
A longer pre-delay means that the source is more likely to be away from a wall or reflective surface; vice versa, a shorter pre-delay means it is most likely near a reflective obstacle. The early reflections will also be sparser in bigger rooms and denser in smaller ones. After them, many more reflections arrive and blend together into a reverb tail with significantly lower energy, due to surface absorption and air damping, compared to the direct signal and the early reflections.
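To build intuition for these numbers, here is a small Python sketch (assuming a speed of sound of 343 m/s; the function name is purely illustrative) that estimates the pre-delay from the lengths of the direct and reflected paths:

```python
SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 °C

def pre_delay_ms(direct_path_m: float, reflected_path_m: float) -> float:
    """Initial time delay: arrival-time difference between the direct
    signal and a first reflection, given the two path lengths in metres."""
    return (reflected_path_m - direct_path_m) / SPEED_OF_SOUND * 1000.0

# Source 3 m from the listener; first reflection travels 10 m via a wall:
print(round(pre_delay_ms(3.0, 10.0), 2))  # about 20.41 ms
```

A source pushed right up against the wall shortens the reflected path toward the direct one, driving the pre-delay toward zero, which matches the rule of thumb in the text.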
This can be described with a high-frequency decay function, which represents the quicker decay of high frequencies over distance. With these properties in mind, we can later on effectively simulate convincing and distinctive reverbs in the game.
Roughly speaking, we can distinguish two kinds of sound fields in enclosed spaces: the direct sound field, at close distance to the source, where the direct signal component dominates in level, and the diffuse sound field, where the early and late reflections dominate in level.
They are separated by the critical distance: the distance from the source at which the direct and reverberant sound components have the same energy. The variables to consider are the directivity factor of the source (D = 1 for an omnidirectional source), the room volume in m³ (V), the equivalent absorption surface of the room (A), and the reverberation time according to the Sabine equation, in seconds (T60). With these, the critical distance can be approximated as:

d_c ≈ 0.057 · √(D · V / T60)
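As a sketch, the Sabine-based approximation of the critical distance can be computed like this in Python (the constant 0.057 applies with V in m³ and T60 in seconds; the function name and example values are my own):

```python
import math

def critical_distance(volume_m3: float, rt60_s: float,
                      directivity: float = 1.0) -> float:
    """Critical distance in metres: where direct and reverberant energy
    are equal, using the Sabine-based approximation
    d_c = 0.057 * sqrt(D * V / T60)."""
    return 0.057 * math.sqrt(directivity * volume_m3 / rt60_s)

# A 5000 m³ hall with RT60 = 2 s and an omnidirectional source (D = 1):
print(round(critical_distance(5000.0, 2.0), 2))  # 2.85 m
```

The example illustrates how small the direct field of a reverberant hall really is: step further than about three metres from the source and the reverb already dominates.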
The critical distance is a fairly important line of orientation in a room. We can develop a feeling for the ratio of direct to indirect sound in our hearing and subconsciously figure out roughly where the midpoint of a room is. By moving in and out of the direct and diffuse sound fields, we very quickly get an impression of the reverberation pattern of the room, and therefore of its reflective materials (walls), size and geometry. We know from our hearing experience, for example, that a room must have reflective materials and a rather large size when the critical distance gets smaller and the room has a higher RT60 reverberation time.
Sound Source Localization
When it comes to the basic mechanisms, two terms need to be differentiated. Localization refers to determining the exact position of the source in three-dimensional space. Lateralization, on the other hand, tries to determine the lateral displacement of an auditory event in a strictly horizontal (one-dimensional) manner along the ear axis.
And finally, we have to take into consideration the spatial attributes of the source, i.e. whether it is a single event or consists of multiple auditory events.
Localization Cues for a Single Source
The first academic notes on this subject can be found in Lord Rayleigh's (1842-1919) Duplex Theory, which describes the basic mechanisms of our horizontal hearing (lateralization). He observed that a sound arriving at the listener from one side of the median plane could be located easily, and that the signal at the ear facing the source was received louder than at the other ear, due to the shadowing effect. He noticed, however, that this was not the case at frequencies below about 1000 Hz. Rayleigh also noticed that our hearing is sensitive to differences in the phase of low-frequency tones at the two ears. And lastly, he found that the localization system is not always accurate when sound arrives at the same angle from opposite directions.
The mechanisms he discovered are the interaural time difference and the interaural level difference. They are complemented by the Precedence Effect (law of the first wavefront), which describes the fact that the first wavefront falling on the ear determines the perceived direction of the sound, even if the early reflections in a room are louder than the direct signal.
Furthermore, the distance and size of the source matter too. Small objects appear as point sources that can easily be pinned down, while large objects are more likely to emit sound from a volumetric extent. Another thing to acknowledge is that, due to our hearing physiology, we are more likely to localize sounds with higher frequencies and sharp attacks than sustained sounds with lower frequencies.
In the same way we perceive perspective with our vision, our two ears allow us to form a spatial impression of our surroundings. The differences in time, level and spectrum between the signals arriving at each ear are the main mechanisms that drive our spatial hearing.
The first difference is described in the time domain. The interaural time difference (ITD) is the difference between the arrival times of the same sound at each ear in the horizontal plane. In publications, the average ear distance varies from 17 cm to 18 cm, and the maximum path the sound has to travel around the head is about 21 cm at an angle of 90°. This results in a maximum delay between the two ears of 0.21 m / c; depending on the assumed ear distance and speed of sound, published values range from about 0.61 ms to around 0.7 ms (binaural delay).
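For an angle-dependent estimate, a common textbook approximation is Woodworth's spherical-head model; the following Python sketch (the head radius and function name are assumptions for illustration, not from the text above) reproduces maximum delays in the range just quoted:

```python
import math

SPEED_OF_SOUND = 343.0  # m/s
HEAD_RADIUS_M = 0.0875  # half of a ~17.5 cm ear distance (assumption)

def itd_ms(azimuth_deg: float) -> float:
    """Woodworth spherical-head approximation of the interaural time
    difference for a distant source: ITD = r/c * (sin(theta) + theta)."""
    theta = math.radians(azimuth_deg)
    return HEAD_RADIUS_M / SPEED_OF_SOUND * (math.sin(theta) + theta) * 1000.0

print(round(itd_ms(90.0), 2))  # about 0.66 ms at 90°, in the 0.61-0.7 ms range
print(round(itd_ms(0.0), 2))   # 0.0 ms on the median plane
```

A source on the median plane yields zero ITD, which is exactly why the front/back confusions described below occur: many positions share the same delay.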
These time differences are registered particularly at the start and end points of a sound event. There is no way to distinguish front and back sources when their angle and distance result in the same delay times (e.g., right/front vs. left/back source position confusion). The ITD cue is most prominent at around 700-800 Hz and becomes completely ineffective beyond about 1.5 kHz. The reason this mechanism only works in that frequency range has to do with the proportions of the head relative to the wavelength of the perceived signal: when the path length from one ear to the other equals half the wavelength of the signal (at approximately 700 Hz), the interaural phase difference begins to provide ambiguous cues. For frequencies above 1500 Hz the dimensions of the head are larger than the wavelength, so more than one full cycle fits between the ears and the time cues become completely ambiguous.
When a sound source deviates from the median plane, the sound pressure at the farther ear is attenuated by the shadowing effect of the head, resulting in a difference in the level, or intensity, of the sound reaching the two ears. This effect starts to become noticeable at around 700 Hz and is fully developed above about 1500 Hz. Unlike the ITD, the interaural level difference contributes to lateralization across the entire frequency spectrum, though it is strongest at high frequencies. Experimental results show that interaural level differences above 15-20 dB will completely move an image to one side. When ILD cues contradict ITD cues, the ITD tends to win for signals containing frequencies below about 1500 Hz.
Summary: Directional Localization Cues
- From around 5-6 kHz up to 16 kHz and beyond, where the wavelength is smaller than the head, the main localization mechanism is the ILD, due to the shadowing effect of the head.
- The localization effect of spectral cues is particularly prominent in the 5-6 kHz area, mainly because of the dimensions of the pinna. These cues chiefly provide vertical and front-to-back localization and also help to resolve lateral ambiguities.
- In the 1500-700 Hz range both mechanisms, ITD and ILD, are active; the wavelengths are similar to the head size. Around 1500 Hz we start to see level differences due to head shadowing as well as interaural phase delay differences.
- For frequencies from 700 Hz down to 80 Hz, the ITD derived from interaural phase differences is the dominant localization cue, as the wavelengths become longer than the average head size.
- Below 80 Hz the wavelengths are so long that it becomes very hard for us to localize the sound. This is a gradual fade, which is why it is often overlooked that we are capable of locating subwoofers to some extent.
- And lastly, we have to consider the dynamic cue introduced by slight head movements, which resolves front-to-back ambiguity and aids vertical localization, and the visual cue that our eyes provide, which can enhance our auditory perception when we have a visual target to focus on.
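As a compact recap of the list above, the frequency bands can be expressed as a small lookup function. This Python sketch uses the band boundaries from the summary; the labels are my own shorthand:

```python
def dominant_cue(freq_hz: float) -> str:
    """Rough mapping from frequency to the dominant localization cue,
    following the band boundaries in the summary above."""
    if freq_hz < 80:
        return "hard to localize (very long wavelengths)"
    if freq_hz < 700:
        return "ITD (interaural phase differences)"
    if freq_hz < 1500:
        return "ITD + ILD transition zone"
    if freq_hz < 5000:
        return "ILD (head shadowing)"
    return "ILD + pinna spectral cues"

print(dominant_cue(440))   # ITD (interaural phase differences)
print(dominant_cue(8000))  # ILD + pinna spectral cues
```

Keep in mind that real hearing blends these cues continuously; the hard band edges here are a didactic simplification, not perceptual switch points.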
Let's move forward to 3D-Audio For Sound Designers – Spatial Hearing Part 1B