Released

Research Paper

Seeing with Humans: Gaze-Assisted Neural Image Captioning

MPG Authors

Sugano, Yusuke
Computer Vision and Multimodal Computing, MPI for Informatics, Max Planck Society


Bulling, Andreas
Computer Vision and Multimodal Computing, MPI for Informatics, Max Planck Society

External Resources
No external resources are provided
Full Texts (restricted access)
No full texts are currently released for your IP range.
Full Texts (publicly accessible)

arXiv:1608.05203.pdf
(Preprint), 3MB

Supplementary Material (publicly accessible)
No publicly accessible supplementary materials are available
Citation

Sugano, Y., & Bulling, A. (2016). Seeing with Humans: Gaze-Assisted Neural Image Captioning. Retrieved from http://arxiv.org/abs/1608.05203.


Citation link: https://hdl.handle.net/11858/00-001M-0000-002B-AC67-2
Abstract
Gaze reflects how humans process visual scenes and is therefore increasingly used in computer vision systems. Previous works demonstrated the potential of gaze for object-centric tasks, such as object localization and recognition, but it remains unclear if gaze can also be beneficial for scene-centric tasks, such as image captioning. We present a new perspective on gaze-assisted image captioning by studying the interplay between human gaze and the attention mechanism of deep neural networks. Using a public large-scale gaze dataset, we first assess the relationship between state-of-the-art object and scene recognition models, bottom-up visual saliency, and human gaze. We then propose a novel split attention model for image captioning. Our model integrates human gaze information into an attention-based long short-term memory architecture, and allows the algorithm to allocate attention selectively to both fixated and non-fixated image regions. Through evaluation on the COCO/SALICON datasets we show that our method improves image captioning performance and that gaze can complement machine attention for semantic scene understanding tasks.
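The abstract describes a split attention mechanism that injects human gaze into an attention-based LSTM captioner and lets the model attend to both fixated and non-fixated image regions. Below is a minimal, hypothetical sketch of how one step of such a mechanism could look in PyTorch; the class name SplitAttention, the gaze_weights input (a per-region fixation density), and the gating scheme are illustrative assumptions, not the authors' actual implementation, whose exact formulation may differ.

# Minimal, hypothetical sketch of a gaze-aware "split attention" step (PyTorch).
# All names (SplitAttention, region_feats, gaze_weights) are illustrative, not
# the authors' code; the paper's exact formulation may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitAttention(nn.Module):
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)
        self.mix_gate = nn.Linear(hidden_dim, 1)  # mixes fixated / non-fixated context

    def forward(self, region_feats, gaze_weights, hidden):
        # region_feats: (B, R, feat_dim) CNN features for R image regions
        # gaze_weights: (B, R)           normalized human fixation density per region
        # hidden:       (B, hidden_dim)  current LSTM hidden state
        scores = self.score(torch.tanh(
            self.feat_proj(region_feats) + self.hidden_proj(hidden).unsqueeze(1)
        )).squeeze(-1)
        alpha = F.softmax(scores, dim=-1)          # machine attention over all regions

        # Split attention into a gaze-biased distribution and a complementary one,
        # so both fixated and non-fixated regions can contribute.
        alpha_gaze = alpha * gaze_weights
        alpha_gaze = alpha_gaze / (alpha_gaze.sum(-1, keepdim=True) + 1e-8)
        alpha_rest = alpha * (1.0 - gaze_weights)
        alpha_rest = alpha_rest / (alpha_rest.sum(-1, keepdim=True) + 1e-8)

        ctx_gaze = (alpha_gaze.unsqueeze(-1) * region_feats).sum(1)
        ctx_rest = (alpha_rest.unsqueeze(-1) * region_feats).sum(1)

        # A learned gate decides, per time step, how much to rely on fixated regions.
        g = torch.sigmoid(self.mix_gate(hidden))
        return g * ctx_gaze + (1.0 - g) * ctx_rest  # context vector fed to the LSTM

if __name__ == "__main__":
    B, R, FEAT, H = 2, 49, 512, 512
    attn = SplitAttention(FEAT, H, 256)
    feats = torch.randn(B, R, FEAT)
    gaze = torch.rand(B, R)
    gaze = gaze / gaze.sum(-1, keepdim=True)
    ctx = attn(feats, gaze, torch.randn(B, H))
    print(ctx.shape)  # torch.Size([2, 512])

The split into two renormalized distributions reflects the idea from the abstract that gaze should complement, not replace, machine attention: the scalar gate lets the decoder fall back on non-fixated regions when the next word refers to scene context rather than to fixated objects.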