Learning to Segment in Images and Videos with Different Forms of Supervision

Khoreva, Anna

doi:10.22028/D291-26995

Thesis

MPS-Authors

Khoreva, Anna
Computer Vision and Multimodal Computing, MPI for Informatics, Max Planck Society;
International Max Planck Research School, MPI for Informatics, Max Planck Society;

External Resource

http://dx.doi.org/10.22028/D291-26995

Citation

Khoreva, A. (2017). Learning to Segment in Images and Videos with Different Forms of Supervision. PhD Thesis, Universität des Saarlandes, Saarbrücken. doi:10.22028/D291-26995.

Cite as: https://hdl.handle.net/21.11116/0000-0000-293F-D

Abstract

Much progress has been made in image and video segmentation
in recent years. To a large extent, this success can be attributed to
strong appearance models learned entirely from data, in particular
using deep learning methods. However, to perform best, these methods require
large representative training datasets with expensive pixel-level
annotations, which in the case of videos are prohibitively costly to obtain. Therefore,
there is a need to relax this constraint and to consider alternative forms
of supervision, which are easier and cheaper to collect. In this thesis,
we aim to develop algorithms for learning to segment in images and videos
with different levels of supervision.
First, we develop approaches for training convolutional networks with weaker
forms of supervision, such as bounding boxes or image labels, for object
boundary estimation and semantic/instance labelling tasks. We propose to
generate approximate pixel-level ground truth from these weaker forms of
annotation to train a network, which achieves high-quality
results comparable to those of full supervision without any
modifications to the network architecture or the training procedure.
Second, we address the problem of the excessive computational and memory
costs inherent to solving video segmentation via graphs. We propose
approaches to improve the runtime and memory efficiency as well as the
output segmentation quality by learning from the available training data
the best representation of the graph. In particular, we contribute methods for
learning must-link constraints and the topology and edge weights of the graph,
as well as for enhancing the graph nodes (superpixels) themselves.
Third, we tackle the task of pixel-level object tracking and address the
problem of the limited amount of densely annotated video data for training
convolutional networks. We introduce an architecture which allows training
with static images only and propose an elaborate data synthesis scheme
which creates a large number of training examples close to the target
domain from the given first frame mask. With the proposed techniques, we
show that densely annotated consecutive video data is not necessary to
achieve high-quality, temporally coherent video segmentation results.
In summary, this thesis advances the state of the art in weakly supervised
image segmentation, graph-based video segmentation and pixel-level object
tracking, and contributes new ways of training convolutional
networks with a limited amount of pixel-level annotated training data.