Keywords:
-
Abstract:
Humans use rich natural language to describe and communicate visual
perceptions. In order to provide natural language descriptions for
visual content, this paper combines two important ingredients. First,
we generate a rich semantic representation of the visual content
including, e.g., object and activity labels. To predict the semantic
representation, we learn a conditional random field (CRF) to model the
relationships between different components of the visual input. Second, we propose
to formulate the generation of natural language as a machine translation
problem using the semantic representation as source language and
the generated sentences as target language. For this, we exploit the
power of a parallel corpus of videos and textual descriptions and
adapt statistical machine translation to translate between our two
languages. We evaluate our video descriptions on the TACoS dataset,
which contains video snippets aligned with sentence descriptions.
Using automatic evaluation and human judgments, we show significant
improvements over several baseline approaches motivated by prior
work. Our translation approach also shows improvements over related
work on an image description task.