MIT/CSAIL researchers add realistic sounds to silent videos, a step toward automating sound effects for movies?
“From the gentle blowing of the wind to the buzzing of laptops, at any given moment there are so many ambient sounds that aren’t related to what we’re actually looking at,” says MITPhD student Andrew Owens . “What would be really exciting is to somehow simulate sound that is less directly associated to the visuals.”
The notion of artificial sound generation has been around for sometime now, with concepts such as procedural audio, and in many ways its long overdue that the same amount of attention and computing power that is afforded to visual effects, be directed towards sound generation. CSAIL is directed by Tim Berners-Lee and is the largest research laboratory at MIT and one of the world’s most important centres of information technology research. I have found several articles which discuss this new development and have selected sections of them here :
the following is a selection from articles:
“Researchers envision future versions of similar algorithms being used to automatically produce sound effects for movies and TV shows, as well as to help robots better understand objects’ properties.
“When you run your finger across a wine glass, the sound it makes reflects how much liquid is in it,” says CSAIL PhD student Andrew Owens, who was lead author on an upcoming paper describing the work. “An algorithm that simulates such sounds can reveal key information about objects’ shapes and material types, as well as the force and motion of their interactions with the world.”
The team used techniques from the field of “deep learning,” which involves teaching computers to sift through huge amounts of data to find patterns on their own. Deep learning approaches are especially useful because they free computer scientists from having to hand-design algorithms and supervise their progress.
The paper’s co-authors include recent PhD graduate Phillip Isola and MIT professors Edward Adelson, Bill Freeman, Josh McDermott, and Antonio Torralba. The paper will be presented later this month at the annual conference on Computer Vision and Pattern Recognition (CVPR) in Las Vegas.
In a series of videos of drumsticks striking things — including sidewalks, grass and metal surfaces — the computer learned to pair a fitting sound effect, such as the sound of a drumstick hitting a piece of wood or of rustling leaves.
The findings are an example of the power of deep learning, a type of artificial intelligence whose application is trendy in tech circles. With deep learning, a computer system learns to recognize patterns in huge piles of data and applies what it learns in useful ways.
In this case, the researchers at MIT’s Computer Science and Artificial Intelligence Lab recorded about 1,000 videos of a drumstick scraping and hitting real-world objects. These videos were fed to the computer system, which learns what sounds are associated with various actions and surfaces. The sound of the drumstick hitting a piece of wood is different than when it disrupts a pile of leaves.
Once the computer system had all these examples, the researchers gave it silent videos of the same drumstick hitting other surfaces, and they instructed the computer system to pair an appropriate sound with the video.
To do this, the computer selects a pitch and loudness that fits what it sees in the video, and it finds an appropriate sound clip in its database to play with the video.
To demonstrate their accomplishment, the researcher then played half-second video clips for test subjects, who struggled to tell apart whether the clips included an authentic sound or one that a computer system had added artificially.
But the technology is not perfect, as MIT PhD candidate Andrew Owens, the lead author on the research, acknowledged. When the team tried longer video clips, the computer system would sometimes misfire and play a sound when the drumstick was not striking anything. Test subjects immediately knew the audio was not real.
And the researchers were able to get the computer to produce fitting sounds only when they used videos with a drumstick. Creating a computer that automatically provides the best sound effect for any video — the kind of development that could disrupt the sound-effects industry — remains out of reach for now.
Although the technology world has seen significant strides of late in artificial intelligence, there are still big differences in how humans and machines learn. Owens wants to push computer systems to learn more similarly to the way an infant learns about the world, by physically poking and prodding its environment. He sees potential for other researchers to use sound recordings and interactions with materials such as sidewalk cement as a step toward machines’ better understanding our physical world.