Personalized Spatial Audio Content for VR Applications

In any 3D experience, be it a movie, a game, or interactive virtual reality (VR), objects don’t just jump out of the screen. They whirl around you and come to life through sound. Hear the crackle of the fire just outside of view. Hear the hockey puck shoot from one end of the rink to the other, bounce off the wall, then slide under your feet.

Most people have heard of 5.1 or 7.1 surround sound setups. These give the sound of an experience a sense of space by placing multiple speakers around the listener in specific configurations. This works fine for 3D movies, but most VR applications use headphones to get that immersive kick. What if we could deliver audio that moves around the VR world in 3D with just a pair of headphones?

This is possible using a technique known as binaural rendering. The theory is simple: we have only two ears, which is just like having a two-channel stereo setup, yet we can hear sounds above, below, in front of, and behind us. We can do this because our ears have evolved to listen for ‘spatial cues’ in the sound that tell us where it is coming from, such as small differences in arrival time and loudness between the two ears and the way the outer ear filters sound from different directions. By understanding these cues, we can render any sound we want over headphones and give it the spatial properties of a real-world sound.
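To make the idea concrete, here is a minimal sketch of binaural rendering in Python with NumPy. It assumes the spatial cues for one direction are available as a pair of impulse responses, one per ear, and applies them to a mono signal by convolution. The function name render_binaural and the toy impulse responses are our own illustrative assumptions, not measured data: the right ear simply hears a delayed, quieter copy, which roughly mimics a source on the listener's left.

```python
import numpy as np


def render_binaural(mono, hrir_left, hrir_right):
    """Convolve a mono signal with a left/right impulse-response pair.

    Returns an array of shape (samples, 2): column 0 is the left ear,
    column 1 is the right ear.
    """
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right], axis=1)


sample_rate = 48_000  # samples per second (assumed)

# Toy impulse responses (assumed, NOT measured): the left ear gets the sound
# immediately at full level, the right ear gets it 30 samples later
# (about 0.6 ms at 48 kHz) at half the amplitude.
hrir_left = np.zeros(64)
hrir_left[0] = 1.0
hrir_right = np.zeros(64)
hrir_right[30] = 0.5

# A 0.1 second, 440 Hz test tone as the mono source.
t = np.arange(sample_rate // 10) / sample_rate
mono = np.sin(2 * np.pi * 440 * t)

stereo = render_binaural(mono, hrir_left, hrir_right)
```

Played over headphones, a signal rendered this way should be perceived toward the left. Real impulse responses also encode the filtering of the outer ear, which this toy pair leaves out.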

This raises a question: how do we capture these spatial cues? The usual answer is to measure them in an anechoic chamber, a carefully designed room that absorbs over 90% of reverberation and acoustic reflections, so that we can capture sound in its purest form, without any interference from the environment. By analyzing a sound played in the chamber and how a person’s ears hear it, we can extract the cues. However, this is very time-consuming: it can take hours and requires not just an anechoic chamber but also specialized equipment.
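As a rough illustration of what extracting cues can look like, the sketch below estimates two simple ones from a pair of per-ear impulse responses: the time difference between the ears, taken from the peak of their cross-correlation, and the level difference, taken from their energy ratio. The function name estimate_itd_ild and the toy responses are assumptions made for this example. A real measurement pipeline involves far more than this.

```python
import numpy as np


def estimate_itd_ild(hrir_left, hrir_right, sample_rate):
    """Estimate (time difference in seconds, level difference in dB).

    A positive time difference means the right ear hears the sound later
    than the left. A positive level difference means the left ear is louder.
    """
    # Cross-correlate the two responses and find the lag of the strongest match.
    xcorr = np.correlate(hrir_right, hrir_left, mode="full")
    lag = int(np.argmax(np.abs(xcorr))) - (len(hrir_left) - 1)
    itd_seconds = lag / sample_rate

    # Compare the energy reaching each ear, expressed in decibels.
    energy_left = np.sum(hrir_left ** 2)
    energy_right = np.sum(hrir_right ** 2)
    ild_db = 10 * np.log10(energy_left / energy_right)
    return itd_seconds, ild_db


sample_rate = 48_000  # samples per second (assumed)

# Toy responses (assumed, NOT measured): the right ear is 30 samples late
# and at half the amplitude of the left ear.
hrir_left = np.zeros(64)
hrir_left[0] = 1.0
hrir_right = np.zeros(64)
hrir_right[30] = 0.5

itd, ild = estimate_itd_ild(hrir_left, hrir_right, sample_rate)
print(f"time difference: {itd * 1e6:.0f} microseconds, level difference: {ild:.1f} dB")
```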

We are solving this problem by studying how physical body shape (head size, outer ear shape, etc.) interacts with spatial cues in sound, drawing on advances in computer vision, acoustic simulation, and machine learning. With computer vision, we can create a 3D model of a person from merely a set of images, and taking pictures does not require specialized hardware. From that model, we can synthesize the spatial cues for that person and then render 3D audio without the laborious capture method.
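To give a feel for how body shape feeds into a spatial cue, here is a small sketch using Woodworth's classic spherical-head approximation, in which the time difference between the ears for a distant source is (a / c) * (theta + sin(theta)), with head radius a, speed of sound c, and azimuth theta in radians. This is a textbook simplification chosen for illustration, not the image-based modeling and acoustic simulation described above, and the head radii below are assumed example values.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second in air at about 20 degrees Celsius


def woodworth_itd(head_radius_m, azimuth_rad):
    """Time difference between the ears (seconds) for a rigid spherical head.

    Valid for a distant source at an azimuth between 0 (straight ahead)
    and pi/2 (directly to one side).
    """
    if not 0.0 <= azimuth_rad <= math.pi / 2:
        raise ValueError("azimuth must be between 0 and pi/2 radians")
    return head_radius_m / SPEED_OF_SOUND * (azimuth_rad + math.sin(azimuth_rad))


# Assumed example head radii, with the source directly to one side.
for radius in (0.080, 0.0875, 0.095):
    itd_us = woodworth_itd(radius, math.pi / 2) * 1e6
    print(f"head radius {radius * 100:.2f} cm -> time difference {itd_us:.0f} microseconds")
```

Even this crude model shows the cue shifting with head size, which is one reason cues measured on one person do not transfer perfectly to another.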
