Place and scene recognition from video

While navigating in an environment, a vision system has to be able to recognize where it is and what the main objects in the scene are. We present a context-based vision system for place and object recognition. The goal is to identify familiar locations (e.g., office 610, conference room 941, Main Street), to categorize new environments (office, corridor, street) and to use that information to provide contextual priors for object recognition (e.g., table, chair, car, computer). We have trained a system to recognize over 60 locations (indoors and outdoors) and to suggest the presence and locations of more than 20 different object types. The algorithm has been integrated into a mobile system that provides real-time feedback to the user.

As a test-bed for the proposed approach, we use a helmet-mounted mobile system. The system consists of a web-cam set to capture 4 images/second in color at a resolution of 120x160 pixels. The web-cam is mounted on a helmet so that it follows the user's head movements as they explore the environment. The user receives feedback about system performance through a head-mounted display.

Kevin Murphy, Antonio Torralba

We use a low-dimensional global image representation that captures the "gist" of the scene. This can be used as input to a Bayes net/HMM, as shown below. (See our ICCV03 paper for details.)

Below we show the performance of place recognition for a sequence that starts indoors and then goes outdoors (ICCV03 Figure 3). Top: The solid line represents the true location, and the dots represent the posterior probability associated with each location. There are 63 possible locations, but we only show those with non-negligible probability mass. Middle: Estimated category of each location. Bottom: Estimated probability of being indoors or outdoors.

Some images from the dataset.

Publications

Context-based vision system for place and object recognition
Antonio Torralba, Kevin P. Murphy, William T. Freeman and Mark Rubin, ICCV 2003.
Using the forest to see the trees: a graphical model relating features, objects and scenes
Kevin P. Murphy, Antonio Torralba and William T. Freeman, NIPS 2003.

Movies

AVI of place recognition using wearable camera. If P(place-category(t)|vG(1:t)) > threshold, we print the category of the place (office, kitchen, etc.) in the top right corner (black = correct, red = incorrect). If P(place(t)|vG(1:t)) > threshold, we print the name of the specific place (office 101, kitchen #3, etc.) in the bottom right corner (black = correct, red = incorrect).
AVI of place recognition using wearable camera. This one shows the HMM belief state superimposed on a topological map.
Text output is the same as in the movie above. The bottom half shows a map of the 9th floor of the AI lab (NE43). The blue solid circle indicates P(place(t)|vG(1:t)) as computed using the HMM; the black hollow circle indicates P(place(t)|vG(t)) as computed using the instantaneous gist; the red/green cross marks the true location. The size of each circle is proportional to the probability. Notice how the HMM provides temporal smoothing. Nevertheless, there are discontinuous jumps that appear to violate topological constraints. These arise because we apply Dirichlet smoothing to the transition matrix, which gives every transition a nonzero probability. This effect can be reduced (at the cost of increased latency upon moving to a new location) by down-weighting the likelihood by an exponential factor (see the equation for \tilde{b}_t on p. 4 of the ICCV03 paper).
WMV movie which shows how Dan Roth ported our place recognition system to an ER1 mobile robot.
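
The temporal smoothing described for the movies above can be illustrated with a small forward-filtering sketch. This is not the code used in the paper: the function names, the toy transition counts, the pseudo-count `alpha`, and the threshold value are all assumptions made for illustration, and raising the likelihood to a power `gamma` is only one common way to down-weight it (the exact form of \tilde{b}_t is given in the ICCV03 paper). The per-frame likelihoods stand in for whatever the gist-based observation model would produce.

```python
def smooth_transitions(counts, alpha=1.0):
    """Row-normalise transition counts after adding a Dirichlet pseudo-count.

    With alpha > 0 every transition gets nonzero probability, including
    ones between places that are not adjacent on the map.
    """
    rows = []
    for row in counts:
        total = sum(row) + alpha * len(row)
        rows.append([(c + alpha) / total for c in row])
    return rows


def forward_filter(likelihoods, transitions, prior, gamma=1.0):
    """Return the filtered belief P(place(t) | observations 1..t) for each t.

    likelihoods[t][i] is p(observation at t | place i). gamma in (0, 1]
    tempers the evidence (an assumed form of down-weighting).
    """
    n = len(prior)
    belief = list(prior)
    posteriors = []
    for t, lik in enumerate(likelihoods):
        if t > 0:
            # Predict: push the previous belief through the transition matrix.
            belief = [sum(belief[i] * transitions[i][j] for i in range(n))
                      for j in range(n)]
        # Update: weight by the tempered likelihood, then renormalise.
        belief = [b * (l ** gamma) for b, l in zip(belief, lik)]
        z = sum(belief)
        belief = [b / z for b in belief]
        posteriors.append(belief)
    return posteriors


def label_if_confident(posterior, names, threshold=0.9):
    """Return the most probable name, or None if it does not exceed threshold."""
    best = max(range(len(posterior)), key=lambda i: posterior[i])
    return names[best] if posterior[best] > threshold else None


# Toy example with three hypothetical places.
names = ["office", "corridor", "kitchen"]
counts = [[8, 2, 0], [2, 6, 2], [0, 2, 8]]  # office -> kitchen never observed
A = smooth_transitions(counts, alpha=0.5)
frames = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7]]
for belief in forward_filter(frames, A, [1 / 3] * 3, gamma=1.0):
    print([round(b, 3) for b in belief], label_if_confident(belief, names))
```

Lowering `gamma` makes each frame count for less, so the belief follows the transition model more closely. That reduces spurious jumps but also delays the response when the wearer really does move to a new place, which is the trade-off noted above.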

Data

The video data used to generate the results in Figure 3 of the ICCV03 paper is available as part of the MIT CSAIL database of objects and scenes. Look for the folder called "paperSequence".
The MATLAB file here contains the 80-dimensional gist vectors for the video sequence, and the place numbers and names:
```
placeNames: {1x20 cell}
 placeNums: [1x3430 double]
     gists: [80x3430 double]
```
If you type plot(foo.placeNums,'o-') the result looks slightly different from Figure 3, since the names of the places were changed somewhat, but it is qualitatively similar. Note that although we considered 63 places in the ICCV03 paper, only 20 occur in this particular sequence.
The file gistsICCV03.zip (14MB) contains 17 files, similar to the above, for the 17 video sequences used in the ICCV03 paper (see here for the list of files used for training and testing).
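
As a rough guide to working with the per-frame labels, the sketch below collapses a label sequence like placeNums into contiguous visits. It assumes, without confirmation from the data files, that each entry of placeNums is a 1-based index into placeNames. The function name and the toy data are hypothetical and stand in for the real 1x3430 sequence.

```python
from itertools import groupby


def place_segments(place_nums, place_names):
    """Collapse per-frame labels into (name, first_frame, n_frames) runs.

    Assumes place_nums holds 1-based indices into place_names
    (MATLAB-style indexing); frame positions are reported 0-based.
    """
    segments = []
    start = 0
    for num, run in groupby(place_nums):
        length = sum(1 for _ in run)
        segments.append((place_names[int(num) - 1], start, length))
        start += length
    return segments


# Toy stand-in for the real labels.
demo_names = ["office", "corridor", "street"]
demo_nums = [1, 1, 1, 2, 2, 3, 3, 3, 3, 2]
for name, first, length in place_segments(demo_nums, demo_names):
    print(name, first, length)
```

Each run corresponds to one flat step in the plot of placeNums, so the segment list gives a compact text view of the same path through the building.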
