  Coherent Multi-sentence Video Description with Variable Level of Detail

Senina, A., Rohrbach, M., Qiu, W., Friedrich, A., Amin, S., Andriluka, M., et al. (2014). Coherent Multi-sentence Video Description with Variable Level of Detail. Retrieved from http://arxiv.org/abs/1403.6173.

Files

1403.6173v1.pdf (Preprint), 485KB
Name: 1403.6173v1.pdf
Description: -
OA-Status:
Visibility: Public
MIME-Type / Checksum: application/pdf / [MD5]
Technical Metadata:
Copyright Date: -
Copyright Info: -

Creators

Creators:
Senina, Anna (1), Author
Rohrbach, Marcus (1), Author
Qiu, Wei (1), Author
Friedrich, Annemarie (2), Author
Amin, Sikandar (1), Author
Andriluka, Mykhaylo (1), Author
Pinkal, Manfred (2), Author
Schiele, Bernt (1), Author
Affiliations:
(1) Computer Vision and Multimodal Computing, MPI for Informatics, Max Planck Society, ou_1116547
(2) External Organizations, ou_persistent22

Content

Free keywords: Computer Science, Computer Vision and Pattern Recognition (cs.CV); Computer Science, Computation and Language (cs.CL)
 Abstract: Humans can easily describe what they see in a coherent way and at varying levels of detail. However, existing approaches for automatic video description mainly focus on single-sentence generation and produce descriptions at a fixed level of detail. In this paper, we address both of these limitations: for a variable level of detail, we produce coherent multi-sentence descriptions of complex videos. We follow a two-step approach in which we first learn to predict a semantic representation (SR) from video and then generate natural language descriptions from the SR. To produce consistent multi-sentence descriptions, we model across-sentence consistency at the level of the SR by enforcing a consistent topic. We also contribute both to the visual recognition of objects, by proposing a hand-centric approach, and to the robust generation of sentences using a word lattice. Human judges rate our multi-sentence descriptions as more readable, correct, and relevant than those of related work. To understand the difference between more detailed and shorter descriptions, we collect and analyze a video description corpus with three levels of detail.

Details

Language(s): eng - English
 Dates: 2014-03-24; 2014
 Publication Status: Published online
 Pages: 10 p.
 Publishing info: -
 Table of Contents: -
 Rev. Type: -
 Identifiers: arXiv: 1403.6173
BibTex Citekey: 850
URI: http://arxiv.org/abs/1403.6173
 Degree: -
