Dhruv Batra (left) and Devi Parikh (right) are developing Visual Question Answering capability for computers. Visual machine perception demands substantial computing power: the team shares a GPU cluster and a 500-core CPU cluster whose nodes are each roughly an order of magnitude more powerful than a laptop.
Teaching computers to understand images is a complex undertaking, especially when the goal is for the computer to give a natural-language answer to a specific question. That capability is vital in applications such as assisting vision-impaired users in real-world situations.
Devi Parikh and Dhruv Batra, assistant professors of ECE, have won a $92,600 Google Research Award to develop a new approach to this task, which they call Visual Question Answering, or VQA. Answering such questions requires an understanding of vision, language, and common-sense knowledge.
This is Parikh’s third Google Research grant, and Batra’s second.
Parikh and Batra will use the funding to collect a large dataset of images, questions, and answers that can be used to develop and evaluate VQA systems.
“Answering any possible question about an image is in one sense the holy grail of image understanding,” Parikh and Batra wrote in their proposal. “With the recent advances in image classification, object detection, and image captioning, we believe the time is ripe to take on such an endeavor.”
Current image description approaches tend to produce generic descriptions of videos and images. But that isn't what real-world users want or need.
Parikh and Batra are using millions of question-answer examples on images of common scenes to endow computers with natural-language visual question answering capability.
“What they want is to be able to poll an intelligent device and ask specific, goal-driven questions,” according to Parikh and Batra. “What they want is to be able to elicit situationally relevant information: Can I cross the street? Is there something sharp in the scene that I should avoid?”
The researchers’ immediate goal is to compile a dataset of a quarter of a million images and about 10 million question-answer pairs. This publicly available dataset will be useful to evaluate a number of different approaches for teaching machines to provide real-world answers to specific questions about visual content.
The question-answer pairs will be written by crowd workers, who will be asked to pose questions that a human looking at the image could easily answer but a smart robot probably could not.
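To make the idea of such a dataset concrete, the sketch below shows what a few image-grounded question-answer records might look like, along with a simple exact-match scoring function for a model's predicted answers. The article does not specify a data format or an evaluation metric, so the field names, sample records, and scoring rule here are illustrative assumptions only.

```python
# Hypothetical VQA-style records: each pairs an image with a free-form
# question and a short human answer. The field names are illustrative,
# not the researchers' actual dataset schema.
records = [
    {"image_id": 1, "question": "Can I cross the street?", "answer": "no"},
    {"image_id": 1, "question": "Is there something sharp here?", "answer": "yes"},
    {"image_id": 2, "question": "What color is the bench?", "answer": "red"},
]

def exact_match_accuracy(predictions, records):
    """Score predicted answers by case-insensitive exact string match."""
    correct = sum(
        1 for pred, rec in zip(predictions, records)
        if pred.strip().lower() == rec["answer"].strip().lower()
    )
    return correct / len(records)

# Two of the three hypothetical predictions match the human answers.
print(exact_match_accuracy(["no", "yes", "blue"], records))
```

A public dataset of this shape lets any candidate VQA system be evaluated the same way: run it over the questions and compare its answers against the human-provided ones.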
Parikh, who received her Ph.D. from Carnegie Mellon University, has been with Virginia Tech since January 2013. She leads the Computer Vision Lab. She is a recipient of the Army Research Office (ARO) Young Investigator Program (YIP) award (2014), the Allen Distinguished Investigator Award in Artificial Intelligence from the Paul G. Allen Family Foundation (2014), the Virginia Tech ICTAS JFC award (2015), and an Outstanding New Assistant Professor award from the College of Engineering at Virginia Tech (2015).
Batra also came to Virginia Tech in January 2013 and earned his doctorate at Carnegie Mellon University. He heads the Machine Learning & Perception Lab. He is a recipient of the ARO YIP award (2014), the National Science Foundation (NSF) CAREER award (2014), the Virginia Tech ICTAS JFC award (2014), and the Virginia Tech College of Engineering Outstanding New Assistant Professor award (2015). Research from his lab has been featured in Bloomberg Business, The Boston Globe, and a number of popular press magazines and newspapers.