Meta AI has partnered with the University of Texas to release three new audiovisual perception models aimed at improving AR/VR experiences. The release is another step in Meta's ongoing transition toward a virtual universe.
The first model, the Visual Acoustic Matching model (AViTAR), transforms the acoustics of an audio clip so that it sounds as though it was recorded in the space shown in a target image. For example, a clip that sounds like it was recorded in an empty room could be paired with a photo of a crowded restaurant, and the output would sound as if it had been captured inside that restaurant.
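Conceptually, this kind of acoustic matching amounts to filtering a "dry" recording with the impulse response of the target room. The Python sketch below illustrates that step with a synthetic impulse response; the names and the toy signal are illustrative only, and this is not Meta's code, which predicts the transformation from the target image itself.

```python
# Illustrative sketch of visual acoustic matching: convolve a "dry" recording
# with a room impulse response (RIR) so it takes on the acoustics of a target
# space. The RIR here is a toy synthetic decay; AViTAR instead infers the
# acoustic transformation from the image of the target room.
import numpy as np
from scipy.signal import fftconvolve

def apply_room_acoustics(dry_audio: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve dry audio with a room impulse response and renormalize."""
    wet = fftconvolve(dry_audio, rir, mode="full")[: len(dry_audio)]
    return wet / (np.max(np.abs(wet)) + 1e-8)

sr = 16000
# Exponentially decaying noise as a stand-in for a predicted RIR.
rir = np.random.randn(sr // 2) * np.exp(-np.linspace(0, 8, sr // 2))
dry = np.random.randn(sr * 2)          # placeholder for a real recording
wet = apply_room_acoustics(dry, rir)   # now "sounds like" the target room
```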
The second model, Visually-Informed Dereverberation (VIDA), performs the opposite function: it uses the observed sounds together with visual cues to remove reverberation from audio. Stripping out the echo improves speech quality and, in turn, makes automatic speech recognition easier.
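Dereverberation is commonly done by attenuating time-frequency bins dominated by lingering reverberant energy. The sketch below uses a crude spectral-floor heuristic as the mask purely for illustration; in a model like VIDA the mask would be predicted by a network that also sees the visual stream, so none of this reflects Meta's actual implementation.

```python
# Sketch of mask-based dereverberation in the time-frequency domain.
# The mask here is a simple heuristic stand-in for a learned, visually
# informed estimate.
import numpy as np
from scipy.signal import stft, istft

def dereverberate(reverberant: np.ndarray, sr: int = 16000) -> np.ndarray:
    f, t, spec = stft(reverberant, fs=sr, nperseg=512)
    mag = np.abs(spec)
    # Heuristic mask: attenuate bins close to the slowly decaying energy floor.
    noise_floor = np.percentile(mag, 20, axis=1, keepdims=True)
    mask = np.clip((mag - noise_floor) / (mag + 1e-8), 0.0, 1.0)
    _, clean = istft(spec * mask, fs=sr, nperseg=512)
    return clean[: len(reverberant)]

cleaned = dereverberate(np.random.randn(32000))  # placeholder reverberant audio
```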
The third model, VisualVoice, separates speech from background noise using audio-visual cues.
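Audio-visual separation typically follows the same masking recipe: a network looks at the speaker's face and the mixed audio and predicts a time-frequency mask that keeps the target voice and suppresses everything else. The sketch below shows only the final masking step with a dummy mask; the function name and inputs are hypothetical and not part of VisualVoice's published API.

```python
# Sketch of speech separation by applying a predicted time-frequency mask
# to the mixture's spectrogram. The mask would come from an audio-visual
# network in a system like VisualVoice; here it is a dummy constant.
import numpy as np
from scipy.signal import stft, istft

def separate_speech(mixture: np.ndarray, mask: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Apply a predicted time-frequency mask to a mixture and resynthesize."""
    f, t, spec = stft(mixture, fs=sr, nperseg=512)
    assert mask.shape == spec.shape, "mask must match the spectrogram shape"
    _, speech = istft(spec * mask, fs=sr, nperseg=512)
    return speech[: len(mixture)]

sr = 16000
mixture = np.random.randn(sr * 2)              # placeholder for a real mixture
_, _, spec = stft(mixture, fs=sr, nperseg=512)
dummy_mask = np.full(spec.shape, 0.5)          # stand-in for a predicted mask
voice = separate_speech(mixture, dummy_mask)
```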
While considerable research has gone into creating better visuals, Meta AI also intends to deliver equally immersive audio for users. "Achieving spatially correct audio is key to delivering a realistic sense of presence in the Metaverse," said Mark Zuckerberg, the company's founder and CEO. "Whether you're attending a concert or just chatting with friends around a virtual table, having a realistic idea of where the sound is coming from makes you feel like you're actually there."