Torr Vision Group

Projects

Incremental Dense Semantic Stereo Fusion for Large-Scale Semantic Scene Reconstruction
Vibhav Vineet*, Ondrej Miksik*, Morten Lidegaard, Matthias Nießner, Stuart Golodetz, Victor A. Prisacariu, Olaf Kahler, David W. Murray, Shahram Izadi, Patrick Perez, Philip H. S. Torr

We propose an end-to-end system that can process the data incrementally and perform real-time dense stereo reconstruction and semantic segmentation of unbounded outdoor environments. The system outputs a per-voxel probability distribution instead of a single label (soft predictions are desirable in robotics, as the vision output is usually fed as input into other subsystems). Our system is also able to handle moving objects more effectively than prior approaches by incorporating knowledge of object classes into the reconstruction process. In order to achieve fast test times, we extensively use the computational power of modern GPUs.
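As a rough illustration of what per-voxel soft predictions involve, below is a minimal Python sketch of an incremental label fusion update: each voxel keeps a distribution over classes that is multiplicatively updated and renormalised as newly segmented frames arrive. All names and the class count are hypothetical; the actual system maintains these distributions in GPU voxel storage with considerably more machinery.

```python
import numpy as np

NUM_CLASSES = 11  # hypothetical number of semantic classes

def init_voxel_beliefs(num_voxels):
    """Every voxel starts with a uniform distribution over classes."""
    return np.full((num_voxels, NUM_CLASSES), 1.0 / NUM_CLASSES)

def fuse_frame(beliefs, voxel_ids, frame_probs):
    """Multiplicative (Bayesian-style) update for the voxels observed in
    the current frame, followed by renormalisation.

    voxel_ids   -- unique indices of the voxels hit by this frame's rays
    frame_probs -- (len(voxel_ids), NUM_CLASSES) class posteriors from the
                   2D semantic segmentation, projected into the volume
    """
    beliefs[voxel_ids] *= frame_probs
    beliefs[voxel_ids] /= beliefs[voxel_ids].sum(axis=1, keepdims=True)
    return beliefs
```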

The Semantic Paintbrush: Interactive 3D Mapping and Recognition in Large Outdoor Spaces
Ondrej Miksik*, Vibhav Vineet*, Morten Lidegaard, Ram Prasaath, Matthias Nießner, Stuart Golodetz, Stephen L. Hicks, Patrick Perez, Shahram Izadi, Philip H. S. Torr

We have been developing smart glasses with which a user can interactively capture a full 3D map, segment it into objects of interest, and refine both the segmentation and 3D parts of the model during capture, all by simply exploring the space and ‘painting’ or ‘brushing’ onto the world with a handheld laser pointer device. These enhanced images are then displayed to the user on head-mounted AR glasses, thereby stimulating the user's residual vision.

SemanticPaint: Interactive Segmentation and Learning of 3D Worlds
Stuart Golodetz*, Michael Sapienza*, Julien Valentin, Vibhav Vineet, Ming-Ming Cheng, Anurag Arnab, Victor Adrian Prisacariu, Olaf Kaehler, Carl Yuheng Ren, David Murray, Shahram Izadi and Philip Torr

SIGGRAPH Emerging Technologies 2015

We present a real-time, interactive, open-source framework for the geometric reconstruction, object-class segmentation and learning of 3D scenes. The user interacts physically with the real-world scene, touching objects and using voice commands to assign them appropriate labels. These user-generated labels are leveraged by an online random forest-based machine learning algorithm, which is used to predict labels for previously unseen parts of the scene.

SemanticPaint: Interactive 3D Labeling and Learning at your Fingertips
Julien Valentin, Vibhav Vineet, Ming-Ming Cheng, David Kim, Jamie Shotton, Pushmeet Kohli, Matthias Niessner, Antonio Criminisi, Shahram Izadi and Philip Torr

ACM Transactions on Graphics 2015 (TOG)

We present a new interactive and online approach to 3D scene understanding. Our system, SemanticPaint, allows users to simultaneously scan their environment, while interactively segmenting the scene simply by reaching out and touching any desired object or surface.

Joint Object-Material Category Segmentation from Audio-Visual Cues
A. Arnab, M. Sapienza, S. Golodetz, J. Valentin, O. Miksik, S. Izadi, P. H. S. Torr.

It is not always possible to recognise objects and infer material properties for a scene from visual cues alone, since objects can look visually similar whilst being made of very different materials. In this paper, we therefore present an approach that augments the available dense visual cues with sparse auditory cues in order to estimate dense object and material labels. Since estimates of object class and material properties are mutually informative, we optimise our multi-output labelling jointly using a random-field framework. We evaluate our system on a new dataset with paired visual and auditory data that we make publicly available. We demonstrate that this joint estimation of object and material labels significantly outperforms the estimation of either category in isolation.

Conditional Random Fields as Recurrent Neural Networks
S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, P. H. S. Torr.

Pixel-level labelling tasks, such as semantic segmentation and depth estimation from a single RGB image, play a central role in image understanding. Recent approaches have attempted to harness the capabilities of deep learning techniques for image recognition to tackle pixel-level labelling tasks. To solve this problem, we introduce a new form of convolutional neural network, called CRF-RNN, which expresses a Conditional Random Field (CRF) as a Recurrent Neural Network (RNN). Our network can be plugged in as a part of a deep Convolutional Neural Network (CNN) to obtain an end-to-end system that has the desirable properties of both CNNs and CRFs. Importantly, our system fully integrates CRF modelling with CNNs, making it possible to train the whole system end-to-end with the usual backpropagation algorithm. We apply this framework to the problem of semantic image segmentation, obtaining results competitive with the state-of-the-art without the need for any post-processing step for object delineation.
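To make the CRF-as-RNN construction concrete, here is a minimal NumPy sketch of a single dense-CRF mean-field iteration, the computation that CRF-RNN unrolls for a fixed number of steps as a recurrent layer. A plain Gaussian blur stands in for the bilateral (appearance-dependent) filtering of the full model, and all parameters here are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mean_field_step(unary, q, compat, sigma=3.0):
    """One mean-field update of a dense CRF (simplified).

    unary  -- (C, H, W) negative unary energies (CNN logits)
    q      -- (C, H, W) current marginal estimates
    compat -- (C, C) label compatibility penalties (e.g. Potts)
    """
    # 1. message passing: filter each class plane of the current marginals
    msgs = np.stack([gaussian_filter(q[c], sigma) for c in range(q.shape[0])])
    # 2. compatibility transform: mix messages across labels
    mixed = np.einsum('lc,chw->lhw', compat, msgs)
    # 3. local update and renormalisation
    return softmax(unary - mixed, axis=0)

# usage: q = softmax(unary); then iterate mean_field_step a fixed number of
# times with shared parameters -- exactly the unrolled RNN view.
```

Because each step is built from differentiable operations, stacking a few such steps behind a CNN lets ordinary backpropagation train the CRF parameters and the CNN jointly.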

InfiniTAM
V. A. Prisacariu, O. Kahler, M.M. Cheng, J. Valentin, P. H. S. Torr, I. D. Reid, D. W. Murray.

InfiniTAM is an open-source, multi-platform framework for real-time, large-scale depth fusion and tracking, released under an Oxford Isis Innovation Academic License.

Hexcopter Guide "Dog" for the Visually Impaired
Morten Lidegaard, Stephen Hicks, Philip H. S. Torr

In the UK alone, over 1.5 million people are currently living with some degree of sight loss; of these, nearly 190,000 (2014) are severely sight-impaired. These people experience difficulty getting around and travelling as a result of their impairment. Travelling freely and being independent is a huge problem for blind and partially sighted people. Travelling in familiar environments is fine as long as they do not change too much, but visiting unfamiliar places can be very difficult, and even dangerous, for someone who cannot see properly or at all. This project aims to leverage robotics to address this lack of independence when travelling in both familiar and unfamiliar places, indoors and outdoors. An aerial robotic platform is being developed with this task in mind. The platform will carry cameras and an onboard processing unit for tasks such as autonomous navigation. Furthermore, a video link to a powerful stationary PC will be included for heavier processing tasks. High-level control commands can be sent from a ground station to the robot for user guidance.

Dense Semantic Image Segmentation with Objects and Attributes
S. Zheng, M. Cheng, J. Warrell, P. Sturgess, V. Vineet, C. Rother, P. H. S. Torr

The concepts of objects and attributes are both important for precisely describing images, since verbal descriptions often contain both adjectives and nouns (e.g. 'I see a glossy red wall'). In this paper, we formulate the problem of joint visual attribute and object class image segmentation as a dense multi-labeling problem, where each pixel in an image can be associated with both an object class and a set of visual attribute labels. In order to learn the label correlations, we adopt a boosting-based piecewise training approach with respect to the visual appearance and co-occurrence cues. We use a filtering-based mean-field approximation approach for efficient joint inference. Further, we develop a hierarchical model to incorporate region-level object and attribute information. Experiments on the aPascal, CORE and attribute-augmented NYU indoor scenes (aNYU) datasets show that the proposed approach is able to achieve state-of-the-art results.

BING: Binarized Normed Gradients for Objectness Estimation at 300fps
Ming-Ming Cheng, Ziming Zhang, Wen-Yan Lin, Philip Torr

Training a generic objectness measure to produce a small set of candidate object windows has been shown to speed up the classical sliding-window object detection paradigm. We observe that generic objects with well-defined closed boundaries share surprisingly strong correlation in normed gradient space when their corresponding image windows are resized to a small fixed size. Based on this observation, and for computational reasons, we propose to resize an image window to 8 x 8 and use the normed gradients as a simple 64D feature to describe it, for explicitly training a generic objectness measure. We further show how the binarized version of this feature, namely binarized normed gradients (BING), can be used for efficient objectness estimation, which requires only a few atomic operations (e.g. ADD, BITWISE SHIFT, etc.). Experiments on the challenging PASCAL VOC 2007 dataset show that our method efficiently (300fps on a single laptop CPU) generates a small set of category-independent, high-quality object windows, yielding a 96.2% object detection rate (DR) with 1,000 proposals. By increasing the number of proposals and colour spaces used for computing BING features, our performance can be further improved to 99.5% DR.
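As an informal sketch of the feature itself (simplifying the paper's pipeline, which computes normed-gradient maps over resized images and scores 8 x 8 windows of those maps; OpenCV and NumPy assumed):

```python
import numpy as np
import cv2

def ng_feature(image_window):
    """64D normed-gradient descriptor of a candidate window.

    image_window -- grayscale uint8 array of any size.
    """
    win = cv2.resize(image_window, (8, 8))
    gx = cv2.Sobel(win, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(win, cv2.CV_32F, 0, 1)
    # normed gradient: min(|gx| + |gy|, 255), one value per cell
    ng = np.minimum(np.abs(gx) + np.abs(gy), 255)
    return ng.flatten()  # 64-dimensional descriptor
```

The binarized (BING) variant keeps only a few top bits of each cell, so scoring a window against the learned linear model reduces to the ADD and BITWISE SHIFT operations mentioned above.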

Urban 3D Semantic Modelling Using Stereo Vision
Sunando Sengupta, Eric Greveson, Ali Shahrokni, Philip Torr

In this paper we propose a robust algorithm that generates an efficient and accurate dense 3D reconstruction with associated semantic labellings. Intelligent autonomous systems require accurate 3D reconstructions for applications such as navigation and localisation. Such systems also need to recognise their surroundings in order to identify and interact with objects of interest. Considerable emphasis has been given to generating a good reconstruction, but less effort has gone into generating a 3D semantic model.

Simultaneous Human Segmentation, Depth and Pose Estimation via Dual Decomposition
Glenn Sheasby, Jonathan Warrell, Yuhang Zhang, Nigel Crook, Philip Torr

The tasks of stereo matching, segmentation, and human pose estimation have been popular in computer vision in recent years, but attempts to combine the three tasks have so far resulted in compromises: either using infra-red cameras, or a greatly simplified body model. We propose a framework for estimating a detailed human skeleton in 3D from a stereo pair of images. Within this framework, we define an energy function that incorporates the relationship between the segmentation results, the pose estimation results, and the disparity space image.
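Schematically (this is not the paper's exact notation), dual decomposition makes such a joint energy tractable by splitting it into subproblems over copies of the shared variables and maximising a lower bound on the joint minimum:

```latex
% Schematic only: the joint energy couples segmentation s, pose p and
% disparity d, E(s,p,d) = E_seg(s) + E_pose(p) + E_disp(d) + E_couple(s,p,d).
% Writing E(x) = \sum_i E_i(x) and giving each subproblem its own copy x_i
% of the shared variables, the dual bound
\max_{\lambda \,:\, \sum_i \lambda_i = 0} \;
  \sum_i \min_{x_i} \left( E_i(x_i) + \lambda_i^{\top} x_i \right)
  \;\le\; \min_{x} E(x)
% is maximised by subgradient steps that nudge the copies into agreement.
```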

Automatic Dense Visual Semantic Mapping from Street-Level Imagery
Sunando Sengupta, Paul Sturgess, Lubor Ladicky, Philip Torr

This paper describes a method for producing a semantic map from multi-view street-level imagery. We define a semantic map as an overhead, or bird's eye view, of a region with associated semantic object labels, such as car, road and pavement. We formulate the problem using two conditional random fields. The first is used to model the semantic image segmentation of the street view imagery, treating each image independently. The output of this stage is then aggregated over many images and forms the input for our semantic map, which is a second random field defined over a ground plane. Each image is related by a simple, yet effective, geometrical function that back-projects a region from the street view image into the overhead ground-plane map. We introduce, and make publicly available, a new dataset created from real world data. Our qualitative evaluation is performed on this data, consisting of a 14.8 km track, and we also quantify our results on a representative subset.
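A minimal sketch (all names hypothetical) of the back-projection step: if a 3 x 3 homography H maps ground-plane map coordinates to street-view pixels, each map cell can collect a label vote from the corresponding pixel, and the votes accumulated over many images become the input to the second, ground-plane random field.

```python
import numpy as np

def backproject_labels(label_image, H, map_shape, num_classes):
    """Accumulate per-cell label votes from one street-view image.

    label_image -- (h, w) per-pixel class ids from the street-view CRF
    H           -- 3x3 homography from map cell coords (u, v) to pixels
    map_shape   -- (rows, cols) of the overhead ground-plane map
    """
    votes = np.zeros(map_shape + (num_classes,))
    h, w = label_image.shape
    for v in range(map_shape[0]):
        for u in range(map_shape[1]):
            p = H @ np.array([u, v, 1.0])
            x, y = int(p[0] / p[2]), int(p[1] / p[2])
            if 0 <= x < w and 0 <= y < h:
                votes[v, u, label_image[y, x]] += 1
    return votes  # summed over many images before map inference
```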

Human Instance Segmentation from Video using Detector-based Conditional Random Fields
Vibhav Vineet, Jonathan Warrell, Lubor Ladicky, Philip Torr

In this work, we propose a method for instance-based human segmentation in images and videos, extending the recent detector-based conditional random field model of Ladicky et al. Instance-based human segmentation involves pixel-level labeling of an image, partitioning it into distinct human instances and background. To achieve our goal, we add three new components to their framework. First, we include human parts-based detection potentials to take advantage of the structure present in human instances. Further, in order to generate a consistent segmentation from different human parts, we incorporate shape prior information, which biases the segmentation towards characteristic overall human shapes. Also, we enhance the representative power of the energy function by adopting exemplar instance-based matching terms, which help our method adapt easily to different human sizes and poses. Finally, we extensively evaluate our proposed method on the Buffy dataset with our new segmented ground truth images, and show a substantial improvement over existing CRF methods.

Computer Games
Philip Torr, Jon Rihan, Nicolas Lord, Amir Saffari, Glenn Sheasby and Sam Hare

This proposal concerns research into vision algorithms that might be useful for real world commercial games. Sony Entertainment Europe are an ideal partner in this enterprise as they have pioneered this form of human/machine interaction in the games industry, with the launch of the EyeToy, and continue to be the lead player.
Complex computer software developed by Oxford Brookes Vision Group has helped to create a 'magical' new technology for the Sony PlayStation that author J.K. Rowling calls 'a reading experience like no other'.
Wonderbook™ is a physical book that interacts with the PlayStation 3™ via a camera, and allows the player to control the computer through natural hand gestures. Books come to life on the screen in dramatic new ways that can be used equally for entertainment and education. The first implementation, the 'Book of Spells' from the World of Harry Potter series, complete with new writing by the author, was released by Christmas 2012.
Wonderbook was announced in June at a Sony press conference in Los Angeles. Astonished comments in the press called it 'unique', 'innovative', 'ground-breaking', and 'a new medium for storytelling'. Of her Wonderbook Book of Spells, J.K. Rowling said 'it's the closest a Muggle can come to a real spellbook. I've loved working with Sony's creative team to bring my spells, and some of the history behind them, to life.'
The contribution from Oxford Brookes Vision Group was to enable the computer to distinguish the skin of the player's hands from the background. It came as part of a Knowledge Transfer Partnership with Sony Computer Entertainment Europe, led by Professor Philip Torr, who is a world expert in computer vision systems. The Computer Vision Group previously won the prestigious Best KTP of the Year award for a collaboration with Vicon that resulted in a new product range with added video functionality.

SONY WONDERBOOK - Maths and Magic: the challenge

Creation of Content for 3D Displays
Philip Torr, Karteek Alahari, Srikumar Ramalingam

3D display technology has the potential to be the most important display innovation since the introduction of colour. Evidence that this move to 3D is imminent is provided by the recent introduction of a UK-developed commercial 3D display on Sharp's Actius range of laptops. The major problem standing in the way is a shortage of 3D content. This research aims to address this problem by developing basic science in the area of 3D content generation, in collaboration with Sharp Laboratories Europe.

VideoTrace
Anton van den Hengel, Anthony Dick, Thorsten Thormahlen, Ben Ward, Philip Torr

VideoTrace is a system for interactively generating, from video, realistic 3D models of objects that might be inserted into a video game, a simulation environment, or another video sequence. The user interacts with VideoTrace by tracing the shape of the object to be modelled over one or more frames of the video. By interpreting the sketch drawn by the user in light of 3D information obtained from vision techniques, a small number of simple 2D interactions can be used to generate a realistic 3D model. Each of the sketching operations in VideoTrace provides an intuitive and powerful means of modelling shape from video, and executes quickly enough to be used interactively. Immediate feedback allows the user to rapidly model those parts of the scene which are of interest, and to the level of detail required. The combination of automated and manual reconstruction allows VideoTrace to model parts of the scene that are not visible, and to succeed in cases where purely automated approaches would fail. VideoTrace has been featured across much of the internet (Slashdot, etc.) and a spinout company is planned.

Analysis of Human Motion
Philip Torr, Andrew Stoddart, Manish Jethwa, Matthieu Bray, Morne Pistorius, David Jarzebowski, Carl Ek

In collaboration with the world-leading motion capture company Vicon, we are exploring new methods for markerless motion capture, e.g. inferring the pose of a person from video alone. Vicon's marker-based technology is used throughout the film industry. Work from Oxford Brookes has recently been licensed by Vicon for inclusion in forthcoming products.

Combinatorial Optimization for Vision
Philip Torr, Pushmeet Kohli, Pawan Kumar

We are actively developing new combinatorial optimization algorithms for vision. These algorithms include improvements to belief propagation and more efficient ways of performing graph cuts, with applications in dense stereo, segmentation, image editing and motion capture.
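As a toy illustration of the graph-cut primitive underlying this work, the following sketch poses a four-pixel binary segmentation as an s-t min-cut and solves it with SciPy's max-flow routine; the capacities are made-up unary and smoothness costs.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_flow

# Four pixels in a row. Node 0 is the source, node 5 the sink, nodes 1-4
# are pixels. maximum_flow requires integer capacities.
n = 6
cap = np.zeros((n, n), dtype=np.int32)
src_cap = [9, 7, 2, 1]   # terminal edges derived from one label's unary costs
snk_cap = [1, 2, 8, 9]   # terminal edges derived from the other label's costs
for i in range(4):
    cap[0, i + 1] = src_cap[i]   # source -> pixel
    cap[i + 1, 5] = snk_cap[i]   # pixel -> sink
for i in range(3):               # pairwise smoothness between neighbours
    cap[i + 1, i + 2] = cap[i + 2, i + 1] = 3

res = maximum_flow(csr_matrix(cap), 0, 5)
print("min-cut value (optimal energy):", res.flow_value)
```

Recovering the actual labels amounts to finding which pixel nodes remain reachable from the source in the residual graph of the returned flow.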

New View Synthesis: Stereo Views from Video
Philip Torr, Oliver Woodford, Andrew Fitzgibbon

With Sharp we are exploring ways of producing a stereo pair for each frame in a video sequence, enabling the conversion of 2D movies to 3D and with it the creation of content for Sharp's new 3D LCDs. While similar to standard new view synthesis, we are solving the additional problems of narrow baseline camera motion, independently moving objects and synthesising non-visible surfaces.

Object Recognition - Face Detection
Sami Romdhani, Philip Torr, Bernhard Schölkopf

We developed face detection methods based on cascaded classifiers, predating Viola's, though perhaps not as fast, as the features we used were not as efficient as Haar wavelets. Cascades and tree-based cascades have been a feature of our work on object recognition and tracking.
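The cascade idea is simple enough to state in a few lines: each candidate window passes through a sequence of increasingly expensive classifiers and is rejected as soon as any stage's score falls below its threshold, so the overwhelmingly negative windows are discarded cheaply. A minimal sketch with placeholder stage classifiers:

```python
def cascade_detect(window, stages):
    """stages -- list of (classifier, threshold) pairs, cheapest first."""
    for classifier, threshold in stages:
        if classifier(window) < threshold:
            return False  # early rejection: later stages never run
    return True           # survived every stage: report a detection
```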

Object Recognition and Segmentation
Pawan Kumar, Philip Torr, Andrew Zisserman

We solve the problem of object recognition using a part-based model. Given an image, multiple hypotheses are generated for the position of each part of the object. A Markov random field is defined over the parts of the object, where each state represents a putative position of the part. The MAP estimate of the location of the object is obtained using belief propagation. Once the parts of the object are localized, we use graph cuts to obtain the segmentation of the object by constraining the shape of the segmentation to be object-like.
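When the parts form a chain (a tree in general), the max-product belief propagation used here reduces to a Viterbi-style dynamic programme over the candidate positions of each part. A minimal sketch with illustrative inputs:

```python
import numpy as np

def map_part_positions(unary, pairwise):
    """MAP assignment for a chain of parts via max-product BP.

    unary    -- list of length P of numpy arrays; unary[p][k] = score of
                hypothesis k for part p (higher is better)
    pairwise -- list of length P-1; pairwise[p][j, k] = compatibility of
                hypothesis j for part p with hypothesis k for part p+1
    """
    P = len(unary)
    best = unary[0]
    back = []
    for p in range(1, P):
        scores = best[:, None] + pairwise[p - 1] + unary[p][None, :]
        back.append(scores.argmax(axis=0))  # best predecessor per hypothesis
        best = scores.max(axis=0)
    # trace back the MAP assignment from the best final hypothesis
    k = int(best.argmax())
    path = [k]
    for bp in reversed(back):
        k = int(bp[k])
        path.append(k)
    return path[::-1]  # one hypothesis index per part
```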

Single View Reconstruction
Philip Torr, Karteek Alahari, Srikumar Ramalingam

We are developing a system for recovering approximate 3D from single views of the scene. We are exploring this reconstruction problem in light of the recent developments in combinatorial optimization techniques for vision problems.

Department of Engineering Science, University of Oxford,
Parks Road, Oxford, OX1 3PJ,
United Kingdom.