Summary
In this work we proposed a novel transductive method to solve the problem of
learning from partially labeled data. Our main idea was to aggregate information
obtained from several clusterings to infer the labels of the unlabeled data.
While our method is not restricted to a specific clustering method, our
experiments used the normalized variant of 1-spectral clustering, which has
been shown to produce better clusterings than standard spectral clustering in
most cases. Our approach yielded results that were at least comparable to, and
in some cases significantly better than, the best results reported in the
literature for state-of-the-art methods.
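The idea of aggregating several clusterings can be illustrated with a minimal sketch. It assumes binary labels in {-1, +1}, a basis made of one indicator column per cluster, and a closed-form ridge regression fit on the labeled points only. The function names, the indicator basis, and the regularization value `lam` are illustrative assumptions, not the exact construction used in this work.

```python
import numpy as np

def cluster_indicator_basis(clusterings):
    """Stack the one-hot cluster indicators of several clusterings column-wise."""
    columns = []
    for labels in clusterings:
        labels = np.asarray(labels)
        for c in np.unique(labels):
            columns.append((labels == c).astype(float))
    return np.column_stack(columns)  # shape: (n_points, total number of clusters)

def ridge_transduction(clusterings, labeled_idx, y_labeled, lam=1e-2):
    """Fit ridge regression on the labeled points and score every point.

    Assumption: labels are -1/+1, so the sign of the score is the prediction.
    """
    B = cluster_indicator_basis(clusterings)
    A = B[np.asarray(labeled_idx, dtype=int)]
    y = np.asarray(y_labeled, dtype=float)
    # Closed-form ridge solution: w = (A^T A + lam * I)^{-1} A^T y
    w = np.linalg.solve(A.T @ A + lam * np.eye(B.shape[1]), A.T @ y)
    scores = B @ w
    return np.where(scores >= 0, 1, -1), scores

# Toy usage: six points, two clusterings, labels known only for points 0 and 5.
clusterings = [[0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]]
preds, scores = ridge_transduction(clusterings, [0, 5], [1, -1])
```

In this toy example, points that share clusters with a labeled point inherit its label, and a point that shares only one of the two clusterings with it receives a score of smaller magnitude.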
Furthermore, we proposed a novel active learning framework that queries the
labels of the points that are most informative for classifying the remaining
unlabeled points. For the majority vote scheme, we provided guarantees on the
number of points that should be drawn from each cluster in order to infer the
correct label of the cluster with high probability. Moreover, for the ridge
regression scheme, we proposed an algorithm that in each step selects the point
about which the prediction function of the classifier is most uncertain (the
point that lies closest to the decision boundary of the classifier). In both
cases, experimental results show the strength of our methods and confirm our
theoretical guarantees.
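Both active learning ingredients can be sketched in a few lines. The sample size below follows from the standard Hoeffding inequality and is a generic bound, not necessarily the guarantee derived in this work; it assumes binary labels, i.i.d. draws with replacement, and a known lower bound `purity` > 0.5 on the fraction of the dominant label in the cluster. The selection rule assumes -1/+1 labels, so the decision boundary sits at score 0. All function and parameter names are hypothetical.

```python
import math
import numpy as np

def majority_vote_sample_size(purity, delta):
    """Draws after which a majority vote is correct with probability >= 1 - delta.

    Hoeffding: P(vote wrong) <= exp(-2 * n * (purity - 0.5)**2), solved for n.
    """
    if not 0.5 < purity <= 1.0:
        raise ValueError("purity must lie in (0.5, 1]")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    margin = purity - 0.5
    return math.ceil(math.log(1.0 / delta) / (2.0 * margin ** 2))

def most_uncertain_point(scores, labeled_idx):
    """Index of the unlabeled point whose score lies closest to the boundary (0)."""
    uncertainty = np.abs(np.asarray(scores, dtype=float))  # new array, input untouched
    labeled = np.zeros(uncertainty.shape[0], dtype=bool)
    labeled[np.asarray(labeled_idx, dtype=int)] = True
    uncertainty[labeled] = np.inf  # never re-query an already labeled point
    return int(np.argmin(uncertainty))

# Toy usage: a cluster that is at least 90% pure, 5% failure probability.
n_queries = majority_vote_sample_size(purity=0.9, delta=0.05)
# Point 1 is closest to the boundary but already labeled, so another point is chosen.
next_query = most_uncertain_point([0.9, -0.1, 0.4, -0.7], labeled_idx=[1])
```

The Hoeffding bound is loose for very pure clusters, so it should be read as a conservative upper bound on the number of queries per cluster.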
The results look very promising and open several interesting directions for
future research. For the SSL scheme, it would be interesting to test the
performance of several other clustering approaches, such as k-means, standard
spectral clustering, and hierarchical clustering, and to combine them in one
general method. Our intuition is that the algorithm should be able to select
only the good clusterings that provide discriminative information for each
specific problem.
Apart from ridge regression, it would be beneficial to experiment with other
fitting approaches that produce sparse representations in our constructed
basis. For the active learning framework, one interesting direction is to
extend it to clusterings that take into account the hierarchical structure of
the data. In that way, we could take advantage of the underlying hierarchy and,
by adaptively selecting the pruning of the cluster tree, potentially further
improve our sampling strategy. Additionally,
we believe that in the multi-clustering scenario our algorithm can be improved
substantially to take better advantage of the variation among the multiple
clustering representations of the data. Finally, since our methods scale to
large problems and partially labeled data occur in many different areas,
ranging from web documents to protein data, there is room for many interesting
applications of the proposed methods.