Free keywords:
Computer Science, Computer Vision and Pattern Recognition, cs.CV
Abstract:
We propose a new efficient single-shot method for multi-person 3D pose
estimation in general scenes from a monocular RGB camera. Our fully
convolutional DNN-based approach jointly infers 2D and 3D joint locations on
the basis of an extended 3D location map supported by body part associations.
This new formulation enables the readout of full body poses at a subset of
visible joints without the need for explicit bounding box tracking. It
therefore succeeds even under strong partial body occlusions by other people
and objects in the scene. We also contribute the first training data set
showing real images of sophisticated multi-person interactions and occlusions.
To this end, we leverage multi-view video-based performance capture of
individual people for ground truth annotation and a new image compositing
scheme for user-controlled synthesis of large corpora of real multi-person images. We also
propose a new video-recorded multi-person test set with ground truth 3D
annotations. Our method achieves state-of-the-art performance on challenging
multi-person scenes.
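
The readout idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the map layout (one 3-channel map per joint, with the full pose readable at any visible joint's pixel), the function name read_pose, and the choice of the first visible joint as the readout location are all assumptions made for illustration.

```python
import numpy as np

def read_pose(location_maps, joints_2d, visible):
    """Read one person's 3D pose from location maps (hypothetical layout).

    location_maps: float array of shape (J, 3, H, W); assumed to store, at
        each visible joint's pixel, the 3D coordinates of all J joints.
    joints_2d: int array of shape (J, 2) holding (row, col) per joint.
    visible: bool array of shape (J,), True where the joint was detected.
    Returns an array of shape (J, 3), or None if no joint is visible.
    """
    visible_idx = np.flatnonzero(visible)
    if visible_idx.size == 0:
        return None  # nothing to read from for this person
    # Assumption: any visible joint is a valid readout location,
    # so the first one is used here.
    row, col = joints_2d[visible_idx[0]]
    return location_maps[:, :, row, col].copy()
```

Because the pose is read at a joint that is actually visible, the occluded joints of the same person still receive 3D coordinates in this sketch, which is the property the abstract attributes to the formulation.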