The Google Virtual Reality (VR) SDK features a best-in-class audio rendering engine that is highly optimized for mobile VR. The goal of the engine is to give listeners a truly realistic spatial audio experience by replicating how sound waves interact with the environment and the listener's head and ears.
How Spatial Audio works
Spatial Audio is a powerful tool that you can use to control user attention. You can present sounds from any direction to draw a listener's attention and give them cues on where to look next. But most importantly, Spatial Audio is essential for providing a believable VR experience. When VR users detect a mismatch between their senses, the illusion of being in another world breaks down.
The Google VR SDK simulates the main audio cues humans use to localize sounds:
- Interaural time differences.
- Interaural level differences.
- Spectral filtering done by our outer ears.
Interaural time differences: When a sound wave hits a person's head, it takes a different amount of time to reach the listener's left and right ears. This time difference varies depending on where the sound source is in relationship to the listener's head. The farther to the left or right side of the head the object is located, the larger this time difference is.
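To give a feel for the size of this cue, the sketch below uses Woodworth's spherical-head approximation, a textbook model and not necessarily what the Google VR SDK computes internally. The head radius and speed of sound are assumed typical values, and the function name is hypothetical.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air at room temperature
HEAD_RADIUS_M = 0.0875      # assumed average head radius; real listeners vary


def interaural_time_difference(azimuth_deg: float) -> float:
    """Approximate ITD in seconds for a distant source (Woodworth model).

    azimuth_deg: 0 is straight ahead, 90 is directly to one side.
    The approximation is intended for azimuths between -90 and 90 degrees.
    """
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS_M / SPEED_OF_SOUND_M_S) * (theta + math.sin(theta))


for azimuth in (0, 30, 60, 90):
    itd_us = interaural_time_difference(azimuth) * 1e6
    print(f"{azimuth:>3} degrees -> {itd_us:.0f} microseconds")
```

Under these assumptions the difference grows from zero for a source straight ahead to roughly 0.66 milliseconds for a source directly to one side.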
Interaural level differences: For higher frequencies, humans are unable to discern the time of arrival of sound waves. When a sound source lies to one side of the head, the ear on the opposite side lies within the head's acoustic shadow. Above about 1.5 kHz, we mainly use level (volume) differences between our ears to tell which direction sounds are coming from.
Spectral filtering: Sounds coming from different directions bounce off the inside of the outer ears in different ways. The outer ears modify the sound's frequencies in unique ways depending on the direction of the sound. These changes in frequency are what humans use to determine the elevation of a sound source.
Spatial Audio in Google VR
To simulate sound waves coming from virtual objects, we use a technology known as ambisonics to envelop the listener's head in a sphere of sound. The Google VR audio system surrounds the listener with a high number of virtual loudspeakers to reproduce sound waves coming from any direction in the listener's environment. The denser the array of virtual loudspeakers, the higher the accuracy of the synthesized sound waves.
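As a rough illustration of what an ambisonic representation looks like, the sketch below encodes one mono sample into four first-order channels. It assumes the AmbiX convention (channel order W, Y, Z, X with SN3D normalization); the function name is hypothetical and this is not the SDK's own encoder.

```python
import math


def encode_first_order(sample, azimuth_deg, elevation_deg=0.0):
    """Encode one mono sample into first-order ambisonics.

    Assumes the AmbiX convention: channel order W, Y, Z, X with SN3D
    normalization. Azimuth is measured counter-clockwise from straight
    ahead (positive is the listener's left); elevation is positive upward.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample                                # omnidirectional component
    y = sample * math.sin(az) * math.cos(el)  # left-right component
    z = sample * math.sin(el)                 # up-down component
    x = sample * math.cos(az) * math.cos(el)  # front-back component
    return (w, y, z, x)


# A source straight ahead excites only W and X; a source to the left excites W and Y.
print(encode_first_order(1.0, 0.0))
print(encode_first_order(1.0, 90.0))
```

A decoder then turns these channels into feeds for the virtual loudspeakers; that step is omitted here.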
Virtual loudspeakers are made possible through the use of head-related transfer functions (HRTFs). The cues discussed in the previous section are captured within these HRTFs. When audio is played through HRTFs over headphones, the listener is fooled into thinking the sound is located at a particular point in 3D space.
In the real world, as sound waves travel through the air, they bounce off of every surface in our environment, resulting in a complex mix of reflections. The Google VR SDK breaks this complex set of sound waves down into three components:
- Direct sound
- Early reflections
- Late reverb
The first wave that hits our ears is the direct sound, which has travelled directly from the source to the listener. The farther a sound source is from the listener, the less energy it has, resulting in a lower volume than closer sounds.
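The drop in level with distance can be sketched with the inverse distance law for a point source in free space. This is a general acoustics rule of thumb, not necessarily the rolloff curve the SDK applies; the function names and the 1 m reference distance are assumptions.

```python
import math


def direct_sound_gain(distance_m, reference_distance_m=1.0):
    """Linear gain of the direct sound under the inverse distance law.

    Distances closer than the reference are clamped so the gain never exceeds 1.
    """
    distance = max(distance_m, reference_distance_m)
    return reference_distance_m / distance


def gain_to_db(gain):
    """Convert a linear amplitude gain to decibels."""
    return 20.0 * math.log10(gain)


for distance in (1.0, 2.0, 4.0, 8.0):
    gain = direct_sound_gain(distance)
    print(f"{distance:.0f} m -> gain {gain:.3f} ({gain_to_db(gain):.1f} dB)")
```

Under this model each doubling of distance lowers the level by about 6 dB.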
The first few reflected waves that arrive at your ears are known as the early reflections. These give the listener an impression of the size and shape of the room they are in. The Google VR SDK spatializes the early reflections in real time and then creates new, artificial sources for each of them.
Over time, the density of reflections arriving at your ears builds more and more until the individual waves are indistinguishable. This is what we refer to as the late reverb. The Google VR SDK has a powerful built-in reverb engine that can be used to very closely match the sound of real rooms. If you change the size of the room or the surface materials of the walls around you, the reverb engine reacts in real time and adjusts the sound waves to match the new conditions.
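To see why room size and surface materials matter, the sketch below uses Sabine's classic formula for reverberation time. It is offered only as an illustration: the SDK's reverb engine is not documented here as using this formula, the function name is hypothetical, and a single average absorption coefficient stands in for real surface materials.

```python
def sabine_rt60(width_m, depth_m, height_m, absorption):
    """Estimate reverberation time (RT60, in seconds) with Sabine's formula.

    absorption: average absorption coefficient of the surfaces, between
    0 (fully reflective) and 1 (fully absorbent). The value passed in is an
    illustrative input, not a measured material property.
    """
    volume = width_m * depth_m * height_m
    surface_area = 2.0 * (width_m * depth_m + width_m * height_m + depth_m * height_m)
    return 0.161 * volume / (surface_area * absorption)


# Same room, reflective versus absorbent surfaces.
print(f"{sabine_rt60(5.0, 4.0, 3.0, 0.1):.2f} s")
print(f"{sabine_rt60(5.0, 4.0, 3.0, 0.5):.2f} s")
```

Larger rooms and more reflective surfaces both lengthen the estimated reverb tail, which matches the behavior described above.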
The Google VR audio system can also simulate the ways in which sound waves traveling between the source and listener are blocked by objects in between. It simulates these occlusion effects by treating high- and low-frequency components differently, with high frequencies being blocked more than low frequencies. This mimics what happens in the real world.
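One simple way to approximate this frequency-dependent blocking is a low-pass filter. The sketch below uses a one-pole low-pass, which is a stand-in chosen for brevity and not the SDK's actual occlusion filter; the function names and the 1 kHz cutoff are assumptions.

```python
import math


def one_pole_lowpass(samples, cutoff_hz, sample_rate_hz=48000.0):
    """Filter samples with a one-pole low-pass (a crude occlusion stand-in)."""
    coeff = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate_hz)
    output = []
    state = 0.0
    for sample in samples:
        state += coeff * (sample - state)  # move part of the way toward the input
        output.append(state)
    return output


def sine(freq_hz, count, sample_rate_hz=48000.0):
    """Generate count samples of a unit-amplitude sine wave."""
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate_hz) for i in range(count)]


def peak(samples):
    """Return the largest absolute sample value."""
    return max(abs(s) for s in samples)


low = one_pole_lowpass(sine(100.0, 4800), cutoff_hz=1000.0)
high = one_pole_lowpass(sine(10000.0, 4800), cutoff_hz=1000.0)
# Skip the first half so the filter has settled before measuring.
print(f"100 Hz peak after filtering: {peak(low[2400:]):.2f}")
print(f"10 kHz peak after filtering: {peak(high[2400:]):.2f}")
```

The low tone passes almost unchanged while the high tone is strongly attenuated, which is the qualitative effect of an occluding object.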
Closely related to the effect of occlusion is a sound object’s directivity pattern. A directivity pattern is a shape or pattern that describes the way in which sound emanates from a source in different directions. For example, if you walk in a circle around someone playing a guitar, it sounds much louder from the front (where the strings and sound hole are) than from behind. When you are behind, the body of the guitar and the person holding it occlude the sound coming from the strings.
With the GVR audio system, a user can change the shape of a directivity pattern for an object and mimic the non-uniform ways in which real-world objects emit sound. There are two available parameters:
- Alpha: Changes the shape of the sound emission pattern.
- Sharpness: Controls how wide or narrow the emission pattern is.
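A common way to model such patterns is to blend an omnidirectional term with a cosine term and raise the result to a power. The sketch below assumes that formulation; whether the GVR audio system maps Alpha and Sharpness to exactly this expression is an assumption, and the function name is hypothetical.

```python
import math


def directivity_gain(angle_deg, alpha, sharpness=1.0):
    """Gain of a source heard from angle_deg off its forward axis.

    alpha blends the shape: 0 is omnidirectional, 0.5 is a cardioid,
    1 is a figure-eight. sharpness is an exponent that narrows the pattern
    as it increases. Assumed formula, for illustration only.
    """
    cos_theta = math.cos(math.radians(angle_deg))
    return abs((1.0 - alpha) + alpha * cos_theta) ** sharpness


for angle in (0, 90, 180):
    print(
        f"{angle:>3} degrees: "
        f"omni {directivity_gain(angle, 0.0):.2f}, "
        f"cardioid {directivity_gain(angle, 0.5):.2f}, "
        f"sharper cardioid {directivity_gain(angle, 0.5, 2.0):.2f}"
    )
```

With a cardioid shape the source is loudest from the front and silent from directly behind, much like the guitar example above.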
Head movements and sound
By moving our heads, we can perceive the relative changes in all of the time, level, and frequency cues. This helps us to localize sounds more accurately.
When users move their heads in VR, the head-mounted display tracks the movements. The Google VR SDK uses this rotation information to rotate sounds inside the virtual loudspeaker array in the opposite direction of the head movement. In this way, virtual sounds stay locked in position.
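The counter-rotation can be sketched for a single first-order ambisonic frame and a turn of the head about the vertical axis. The sketch assumes AmbiX channel order (W, Y, Z, X) and a yaw angle that is positive when the head turns left; the function names are hypothetical, and a full implementation would also handle pitch and roll.

```python
import math


def rotate_yaw(frame, angle_deg):
    """Rotate a first-order frame (W, Y, Z, X) about the vertical axis.

    A positive angle rotates the sound field counter-clockwise (to the left).
    W and Z are unaffected by a rotation about the vertical axis.
    """
    w, y, z, x = frame
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    return (w, x * s + y * c, z, x * c - y * s)


def compensate_head_yaw(frame, head_yaw_deg):
    """Counter-rotate the field so sources stay fixed in the virtual world."""
    return rotate_yaw(frame, -head_yaw_deg)


# A source straight ahead: only W and X carry signal.
ahead = (1.0, 0.0, 0.0, 1.0)
# After the head turns 90 degrees to the left, the source should be heard on the right,
# which in this convention means a negative Y component.
print(compensate_head_yaw(ahead, 90.0))
```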
GVR Audio Room
For each part of your GVR experience, you should first determine if you need a GVR Audio Room. Audio Rooms provide early reflections and reverb, which help make the sound more realistic when there are nearby walls or structures. Not surprisingly, they are most useful when your scene takes place in an actual room. For outdoor scenes, an Audio Room can feel less natural, because you may have only one reflective surface (the ground).
You have full control over the amount of reverb and the material of the surfaces, so take care to match the room sound to the environment.
For producing the general ambience of a scene, like the wind in the trees, ocean waves, and birds, there are two choices for sound playback.
GVR Audio Sources
GVR Audio Sources are the most flexible and work best for objects that move dynamically or that users might interact with. For example, if you attach an Audio Source to a bird, the listener hears the sound of the bird change naturally as it flies near, and then farther away. You can sprinkle Audio Sources throughout the environment to create the general ambience.
GVR SoundField Sources
GVR SoundField Sources play back ambisonic files that let you hear audio from every direction. This is similar to how skybox or 360 photos work. Since ambisonic files only respond to head rotation, they work best as sounds in the distance.
For more information, see YouTube's ambisonic spatial audio support.
Animate the sound source
If you want to command the listener's attention, but a sound source is out of view, you can animate the position of the sound. This enables the listener to pinpoint sounds much more quickly.
Repeat the sound
To help the listener pinpoint a sound, play it more than once. This is why, for example, your phone's ringtone is not a single beep. If it were, you would have a hard time finding it, and you might not even be sure it was your phone. You can achieve the same effect by using sounds that comprise many distinct elements.
Use more complex sounds
You should avoid using overly quiet sounds, sounds lacking in high frequencies, or simple tones like a sine wave beep. Instead, craft sounds that have sufficient volume levels, are complex, and contain a full spectrum of frequencies.
Tips for crafting sounds
Audio Source sounds
For Audio Sources, make sure the sound files you use are monophonic and don't include reverb.
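A quick way to confirm that a file is monophonic before importing it is to read its channel count. The sketch below uses Python's standard `wave` module, which only reads uncompressed PCM WAV files; the function names and the example file name are hypothetical.

```python
import wave


def channel_count(path):
    """Return the number of channels in a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnchannels()


def is_mono(path):
    """Return True if the WAV file has exactly one channel."""
    return channel_count(path) == 1


# Example usage (hypothetical file name):
# if not is_mono("bird_chirp.wav"):
#     print("Mix this file down to mono before using it as an Audio Source.")
```

The same check can flag first-order ambisonic files, which carry four channels. Whether a file contains baked-in reverb cannot be detected this way and still needs a listening check.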
SoundField Source sounds (ambisonic sounds)
For SoundField Source sounds, we currently support first-order ambisonic files. These files are more complex than Audio Source files, and the tools and libraries that support them are still in the early stages.
With digital audio workstation (DAW) software and a plugin such as Ambix, you can create ambisonic files in two ways:
- Using monophonic files, place sounds on a virtual sphere around the user. You can move the sounds around and add effects to them.
- Using an ambisonic microphone like the SoundField ST450, TetraMic, or Zoom H2n, capture the sound of an environment in 3D. You can load the captured sound into the Ambix plugin and run effects on it, rotate it if needed, and then mix.
Check your work
After you get your VR experiences up and running, make sure to check your work. You want to ensure that what you see matches what you hear. For example, if you can hear the sound of ocean waves crashing, but the ocean looks frozen, it takes away from the sense of realism.
Be sure to visit all the places your users will go in your VR experience to confirm that everything sounds natural. In experiences where users can roam freely, they love putting their ear right up to sound sources.
Take extra care to ensure that the sounds you're using are of high quality, are at a clear but comfortable volume, and adjust realistically with any movement. Because most of your users will be listening through headphones, make sure to test your sound on a variety of headphones, not just on laptop or desktop speakers.