DenseAV can learn the meaning of words and the location of sounds using only self-supervision from video. To learn these patterns, DenseAV uses audio-video contrastive learning to associate sound with the visual world. Intuitively speaking, it's much easier to predict what you are seeing from what you are hearing when you understand language and can recognize sounds. This is how DenseAV can learn without labels.
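
To make the idea concrete, here is a minimal sketch of a symmetric audio-video contrastive (InfoNCE-style) objective on pooled clip embeddings. The function name, temperature, and pooling choices are illustrative assumptions, not DenseAV's exact training code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired clips.

    audio_emb, video_emb: (B, D) pooled embeddings from the two branches.
    Matching audio-video pairs sit on the diagonal of the similarity matrix.
    """
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.T / temperature          # (B, B) pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Pull true pairs together, push mismatched pairs apart, in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```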


Interestingly, contrastive learning with CLS tokens or average-pooled representations isn't enough to localize objects from sound and language. DenseAV instead uses a contrastive similarity based on inner products between local audio and visual representation tokens, which dramatically improves its ability to localize information.
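
The sketch below shows one way to build a clip-level similarity out of local token inner products; the full similarity volume it computes is also what lets you read off where a sound lands in the image. The max-then-mean aggregation here is an assumption for illustration and may differ from the paper's exact pooling.

```python
import torch

def dense_similarity(audio_tokens, visual_tokens):
    """Clip similarity built from local token inner products.

    audio_tokens:  (B, T, D)  one feature per audio frame
    visual_tokens: (B, P, D)  one feature per image patch

    The (T, P) inner-product volume doubles as a localization map:
    reading off which patches a given audio frame activates grounds
    that sound in the image.
    """
    # (B, T, P) similarity between every audio frame and every patch.
    sim_volume = torch.einsum('btd,bpd->btp', audio_tokens, visual_tokens)
    # For each audio frame keep its best-matching patch, then average
    # over time to get one score per clip (illustrative aggregation).
    return sim_volume.max(dim=2).values.mean(dim=1)
```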


There are many ways that a sound can be related to a visual object. For instance, the word "dog" and the sound of a bark both conjure the image of a dog despite being very different types of sound. In an analogy with multi-head attention, we provide DenseAV with multiple features to compute inner products with, as sketched below. Amazingly, DenseAV naturally organizes its features into sound features and language features without knowing a priori what is sound and what is language.
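
Here is a hedged sketch of that multi-head idea: the feature dimension is split into heads, each head gets its own similarity volume, and the volumes are summed. The head split and aggregation are assumptions for illustration, not the paper's exact code.

```python
import torch

def multi_head_similarity(audio_tokens, visual_tokens, num_heads=2):
    """Multi-head variant of the dense inner-product similarity.

    With K heads, each head contributes its own (T, P) similarity volume;
    summing them lets heads specialize, e.g. one head coupling the word
    "dog" to dogs and another coupling barking sounds to dogs.
    """
    B, T, D = audio_tokens.shape
    _, P, _ = visual_tokens.shape
    a = audio_tokens.reshape(B, T, num_heads, D // num_heads)
    v = visual_tokens.reshape(B, P, num_heads, D // num_heads)
    # Per-head similarity volumes: (B, K, T, P)
    per_head = torch.einsum('btkd,bpkd->bktp', a, v)
    # Sum head volumes, then pool as before (max over patches, mean over time).
    volume = per_head.sum(dim=1)
    return volume.max(dim=2).values.mean(dim=1)
```

Because nothing forces a head to handle a particular modality of sound, the observed split into a "language" head and a "sound" head emerges purely from training.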

Source: https://mhamilton.net/denseav
Demo: DenseAV, a Hugging Face Space by mhamilton723
