Machine Perception Research | ECE | Virginia Tech

Research Areas

Teaching machines to 'see' in a more human way

Machine Perception gives a machine the ability to explain, in a human manner, why it is making its decisions, to warn when it is about to fail, and to provide an understandable characterization of its failures. Computer Vision builds machines that can see the world like humans do, and involves designing algorithms that can answer questions about a photograph or a video.

Current Research

Teaching Computers about Facial Expressions

ECE researchers are working with colleagues in Psychology to create a database of human facial expressions that will be of interest to a broad research community. The database was captured with a Kinect sensor, which provides standard 2-D images as well as 3-D representations. The fully annotated dataset includes seven expressions (happiness, sadness, surprise, disgust, fear, anger, and neutral) for 32 subjects, male and female, aged 10 to 30, with a variety of skin tones. The dataset has been instrumental in creating a preliminary system that automatically recognizes human facial expressions from both 2-D and 3-D data.
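One common way to combine 2-D and 3-D cues, sketched below, is to fuse the two feature vectors and classify the result. This is only an illustrative nearest-neighbour toy under assumed synthetic features, not the team's actual system or dataset.

```python
import math

# The seven annotated expressions described above.
EXPRESSIONS = ["happiness", "sadness", "surprise", "disgust", "fear", "anger", "neutral"]

def combine(features_2d, features_3d):
    """Fuse the two modalities by simple concatenation (an illustrative choice)."""
    return features_2d + features_3d

def nearest_expression(sample, labelled):
    """Label a fused feature vector with its closest labelled neighbour."""
    return min(labelled, key=lambda pair: math.dist(sample, pair[0]))[1]

# Toy labelled examples: (fused 2-D + 3-D feature vector, expression label).
training = [
    (combine([0.9, 0.1], [0.8]), "happiness"),
    (combine([0.1, 0.9], [0.2]), "sadness"),
]
query = combine([0.85, 0.2], [0.75])
print(nearest_expression(query, training))  # -> happiness
```

A real system would replace the hand-made vectors with learned features from the 2-D images and 3-D depth maps, but the fusion-then-classify structure is the same.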

Visual Question Answering

Given an image and a free-form natural-language question about the image (e.g., "What kind of store is this?" or "Is it safe to cross the street?"), the machine's task is to automatically produce a concise, accurate, free-form natural-language answer ("bakery", "Yes"). An ECE team is investigating Visual Question Answering (VQA), which has applications with high societal impact that involve humans collaborating with machines to elicit and extract situationally relevant information from visual data. This research could improve the daily lives of visually impaired users and revolutionize how society at large interacts with visual data.
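The task interface described above can be sketched as a function from an image and a question to a short answer. Everything below is an illustrative assumption: the toy keyword rules stand in for the learned vision and language models a real VQA system would use.

```python
from collections import Counter

def image_features(image):
    """Stand-in for a vision model: detected objects and their counts."""
    return Counter(image["objects"])

def answer(image, question):
    """Toy fusion of image content and question words into a short answer."""
    img = image_features(image)
    words = question.lower().rstrip("?").split()
    q = set(words)
    if "how" in q and "many" in q:
        # Answer counting questions by looking the queried object up in the scene.
        for obj, count in img.items():
            if obj in q or obj + "s" in q:
                return str(count)
    if words[0] in ("is", "are", "does"):
        # Yes/no heuristic: does the questioned object appear in the scene at all?
        return "yes" if any(obj in q for obj in img) else "no"
    return "bakery"  # illustrative fallback answer

scene = {"objects": ["bread", "bread", "croissant"]}
print(answer(scene, "How many breads are there?"))  # -> 2
print(answer(scene, "Is there a croissant?"))       # -> yes
```

The point of the sketch is the signature, not the rules: a learned system replaces the keyword heuristics with image features from a vision model and question features from a language model, fused to score candidate answers.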

Our main thesis is that VQA represents not a single, narrowly defined problem (e.g., image classification) but a rich spectrum of semantic scene-understanding problems and associated research directions. Each question in VQA may lie at a different point on this spectrum, from questions that map directly to existing, well-studied computer-vision problems ("What is this room called?" maps to indoor scene recognition) all the way to questions that require an integrated approach spanning language (semantics), vision (scene), and reasoning (understanding) over a knowledge base ("Does the pizza in the back row next to the Coke seem vegetarian?").

We explore approaches that map to a sequence of waypoints along this spectrum: (i) pure computer vision; (ii) integrating vision + language; and (iii) integrating vision + language + knowledge bases. We are also exploring approaches to (a) make these models interpretable; (b) train the machine to be curious and to actively ask questions in order to learn; (c) use VQA as a new modality to learn more about the visual world than existing annotation modalities allow; and (d) train the machine to know what it knows and what it does not.
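Direction (d), a machine that knows what it does not know, is often approached by having the model abstain when its answer confidence is low. The sketch below is a minimal illustration of that idea; the raw scores and the threshold are assumed values, not outputs of any real model.

```python
import math

def softmax(scores):
    """Convert raw per-answer scores into a probability distribution."""
    exps = {a: math.exp(s) for a, s in scores.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

def answer_or_abstain(scores, threshold=0.6):
    """Return the top answer only if the model is confident enough, else abstain."""
    probs = softmax(scores)
    best = max(probs, key=probs.get)
    return best if probs[best] >= threshold else "I don't know"

confident = {"yes": 4.0, "no": 0.5, "bakery": -1.0}   # one answer dominates
uncertain = {"yes": 1.0, "no": 0.9, "bakery": 0.8}    # scores nearly tied
print(answer_or_abstain(confident))  # -> yes
print(answer_or_abstain(uncertain))  # -> I don't know
```

Thresholding softmax confidence is only the simplest form of this idea; calibrating those confidences so they reflect true error rates is itself a research question.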