Coherent Multi-sentence Video Description with Variable Level of Detail

Senina, Anna; Rohrbach, Marcus; Qiu, Wei; Friedrich, Annemarie; Amin, Sikandar; Andriluka, Mykhaylo; Pinkal, Manfred; Schiele, Bernt

Research paper

MPG authors

Senina, Anna
Computer Vision and Multimodal Computing, MPI for Informatics, Max Planck Society

Rohrbach, Marcus
Computer Vision and Multimodal Computing, MPI for Informatics, Max Planck Society

Qiu, Wei
Computer Vision and Multimodal Computing, MPI for Informatics, Max Planck Society

Amin, Sikandar
Computer Vision and Multimodal Computing, MPI for Informatics, Max Planck Society

Andriluka, Mykhaylo
Computer Vision and Multimodal Computing, MPI for Informatics, Max Planck Society

Schiele, Bernt
Computer Vision and Multimodal Computing, MPI for Informatics, Max Planck Society

Full texts (freely accessible)

1403.6173v1.pdf
(Preprint), 485KB

Citation

Senina, A., Rohrbach, M., Qiu, W., Friedrich, A., Amin, S., Andriluka, M., et al. (2014). Coherent Multi-sentence Video Description with Variable Level of Detail. Retrieved from http://arxiv.org/abs/1403.6173.

Citation link: https://hdl.handle.net/11858/00-001M-0000-0019-87B3-2

Abstract

Humans can easily describe what they see in a coherent way and at varying levels of detail. However, existing approaches for automatic video description mainly focus on single-sentence generation and produce descriptions at a fixed level of detail. In this paper, we address both of these limitations: for a variable level of detail, we produce coherent multi-sentence descriptions of complex videos. We follow a two-step approach in which we first learn to predict a semantic representation (SR) from video and then generate natural language descriptions from the SR. To produce consistent multi-sentence descriptions, we model across-sentence consistency at the level of the SR by enforcing a consistent topic. We also contribute to the visual recognition of objects, proposing a hand-centric approach, and to the robust generation of sentences using a word lattice. Human judges rate our multi-sentence descriptions as more readable, correct, and relevant than those of related work. To understand the difference between more detailed and shorter descriptions, we collect and analyze a video description corpus with three levels of detail.