May 2014 — ECE Assistant Professor Dhruv Batra has earned a National Science Foundation (NSF) Faculty Early Career Development (CAREER) Award for his machine perception research in high-level, holistic scene understanding. The CAREER grant is the NSF's most prestigious award, given to junior faculty members who are expected to become academic leaders in their fields.
Batra has also been awarded a three-year, $150,000 Young Investigator Program (YIP) award from the Army Research Office to support his research.
Instead of enhancing any single aspect of machine perception (such as face recognition), Batra and his students are taking a fundamentally different approach—they plan to build a "holistic scene understanding" system. Their approach employs a "society of agents" that develops understanding from the interaction of multiple computer vision modules. Batra says that he draws inspiration from the work of pioneering AI researchers, such as Minsky, McCarthy, Papert, and Marr from the 1960s and 1970s, who were "simply ahead of their time, and had ideas that needed the computational and statistical tools of today (millions of images and thousands of CPUs and GPUs) to be brought to fruition."
Although there are computer vision systems for applications such as face recognition, handwriting recognition, and pedestrian detection, "these systems are inherently naive and limited in their understanding of an image," Batra says. For example, he notes, "a patch from an image may seem like a face, but may simply be an incidental arrangement of tree branches and shadows."
"A vision module operating in isolation often produces nonsensical results, such as faces floating in thin air, a mistake that no human observer of the scene will ever make," he continues. Using multiple vision modules, such as 3-D scene layout, object layout, and pose and activity recognition, Batra plans to create a vision system that holistically understands a scene well enough to realize that a human face is unlikely to be floating on a tree.
"My...goal is to develop models, algorithms, and large-scale implementations to enable the next generation of computer vision systems that understand the scene behind the image as well as humans do," says Batra. His proposed systems will attempt to answer questions such as "where is the ground" or "what is the person in the image doing." Additional capabilities may include interpreting intent or anticipating the future, answering questions such as "is the person paying attention?" or "is she headed for an accident?"
Batra will use several vision modules in conjunction to generate a small number of guesses, or "diverse plausible hypotheses," that can help interpret a scene. These modules might make their guesses by identifying flat surfaces, detecting and segmenting all objects in the image, estimating human poses, and categorizing the scene into a type such as natural, urban, or beach. Once each module has a guess for the scene, a "mediator" program will score the guesses and identify the one possibility that is most consistent across the modules. Batra's approach can also produce multiple possibilities that can then be sent to a human operator for feedback.
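The mediator idea described above can be illustrated with a small sketch. This is not Batra's actual system; the module names, the "scene facts" representation, and the penalty for implausible combinations are all hypothetical stand-ins chosen to show the general scheme: each module proposes a few diverse hypotheses, and the mediator ranks joint hypotheses by mutual consistency.

```python
# Illustrative sketch only (not the published system): each vision
# "module" proposes a few hypotheses about the scene, and a simple
# "mediator" selects the joint hypothesis that is most consistent.
from itertools import product

# Hypothetical module outputs: each hypothesis is a set of scene
# facts the module believes hold, plus that module's confidence.
MODULE_HYPOTHESES = {
    "surface_layout": [
        ({"ground_plane", "vertical_tree"}, 0.7),
        ({"ground_plane", "vertical_wall"}, 0.3),
    ],
    "object_detector": [
        ({"face", "vertical_tree"}, 0.6),      # a "face" up in a tree
        ({"branches", "vertical_tree"}, 0.4),  # branches and shadows
    ],
    "scene_classifier": [
        ({"natural_scene", "vertical_tree"}, 0.8),
        ({"urban_scene", "vertical_wall"}, 0.2),
    ],
}

# Fact combinations a holistic system should treat as implausible,
# e.g. a disembodied face floating in a tree.
IMPLAUSIBLE = {frozenset({"face", "vertical_tree"})}

def consistency_score(joint):
    """Score a joint hypothesis: the product of module confidences,
    heavily penalized if the combined facts are mutually implausible."""
    facts, score = set(), 1.0
    for hypothesis, confidence in joint:
        facts |= hypothesis
        score *= confidence
    for pair in IMPLAUSIBLE:
        if pair <= facts:      # both implausible facts present together
            score *= 0.01
    return score

def mediate(modules, top_k=1):
    """Enumerate all joint hypotheses across modules and return the
    top_k most consistent (top_k > 1 yields candidates for a human)."""
    joints = product(*modules.values())
    return sorted(joints, key=consistency_score, reverse=True)[:top_k]

best = mediate(MODULE_HYPOTHESES)[0]
```

In this toy example the face detection wins in isolation (0.6 versus 0.4), but the mediator's consistency penalty flips the outcome, so the top joint hypothesis explains the patch as branches rather than a floating face. Returning `top_k > 1` corresponds to sending multiple plausible interpretations to a human operator for feedback.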
Computer vision systems that can understand the meaning of an image will improve a wide range of applications. As one example, Batra emphasizes how important it is for a pedestrian detector on an autonomous car to know the difference between a real person and a picture of a person on a billboard.
"Improved vision systems will fundamentally change our lives—from self-driving cars bringing mobility to the physically impaired, to unmanned aircraft helping law enforcement with search and rescue in disasters," writes Batra.