Free keywords:
Computer Science, Computer Vision and Pattern Recognition, cs.CV
Abstract:
In this paper we propose an approach for articulated tracking of multiple
people in unconstrained videos. Our starting point is a model that resembles
existing architectures for single-frame pose estimation but is several orders
of magnitude faster. We achieve this in two ways: (1) by simplifying and
sparsifying the body-part relationship graph and leveraging recent methods for
faster inference, and (2) by offloading a substantial share of computation onto
a feed-forward convolutional architecture that is able to detect and associate
body joints of the same person even in clutter. We use this model to generate
proposals for body joint locations and formulate articulated tracking as
spatio-temporal grouping of such proposals. This allows us to jointly solve the
association problem for all people in the scene by propagating evidence from
strong detections through time and enforcing the constraint that each proposal
can be assigned to at most one person. We report results on the public MPII
Human Pose benchmark and on a new dataset of videos with multiple people. We
demonstrate that our model achieves state-of-the-art results while using only a
fraction of the compute time, and that it is able to leverage temporal
information to improve over the state of the art in crowded scenes.
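The grouping idea described above can be illustrated with a toy sketch: joint proposals are seeded into tracks starting from the strongest detections and propagated forward through time, with each proposal assigned to at most one track. This is a greedy nearest-neighbor simplification for illustration only, not the paper's actual formulation; the class and function names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Proposal:
    # A hypothetical joint-location proposal: frame index, 2-D position, detector score.
    frame: int
    pos: tuple
    score: float

def dist(a, b):
    # Euclidean distance between two 2-D points.
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def group_proposals(proposals, max_dist=20.0):
    """Toy spatio-temporal grouping (assumed interface): seed tracks at strong
    detections, then propagate each track frame by frame, attaching the nearest
    unused proposal.  Each proposal joins at most one track."""
    by_frame = {}
    for i, p in enumerate(proposals):
        by_frame.setdefault(p.frame, []).append(i)
    used = set()
    tracks = []
    # Seed tracks in descending score order so strong detections anchor the grouping.
    for i in sorted(range(len(proposals)), key=lambda i: -proposals[i].score):
        if i in used:
            continue
        track = [i]
        used.add(i)
        f, cur = proposals[i].frame, proposals[i].pos
        while True:
            f += 1
            candidates = [j for j in by_frame.get(f, []) if j not in used]
            if not candidates:
                break
            j = min(candidates, key=lambda j: dist(proposals[j].pos, cur))
            if dist(proposals[j].pos, cur) > max_dist:
                break  # no plausible continuation in this frame
            track.append(j)
            used.add(j)
            cur = proposals[j].pos
        tracks.append(track)
    return tracks
```

On two well-separated people across two frames, the sketch recovers one track per person; the real model additionally reasons over body-part relationships within each frame.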