Aggregation of Multiple Clusterings and Active Learning in a Transductive Setting


Arvanitopoulos-Darginis, Nikolaos
International Max Planck Research School, MPI for Informatics, Max Planck Society


Arvanitopoulos-Darginis, N. (2012). Aggregation of Multiple Clusterings and Active Learning in a Transductive Setting. Master Thesis, Universität des Saarlandes, Saarbrücken.

In this work we proposed a novel transductive method for learning from partially labeled data. Our main idea was to aggregate information obtained from several clusterings in order to infer the labels of the unlabeled data. While our method is not restricted to a specific clustering method, in our experiments we chose the normalized variant of 1-spectral clustering, which has been shown to produce better clusterings in most cases than the standard spectral clustering method. Our approach yielded results that were at least comparable to, and in some cases significantly better than, the best results obtained by state-of-the-art methods reported in the literature. Furthermore, we proposed a novel active learning framework that queries the labels of the most informative points, that is, the points that help most in the classification of the unlabeled points. For the majority vote scheme we provided guarantees on the number of points that should be drawn from each cluster in order to infer the correct label of the cluster with high probability. For the ridge regression scheme we proposed an algorithm that in each step selects the most uncertain point in terms of the prediction function of the classifier (the point that lies nearest the decision boundary of the classifier). In both cases, experimental results show the strength of our methods and confirm our theoretical guarantees.
The results look very promising and open several interesting directions for future research. For the SSL scheme, it would be interesting to test the performance of several other clustering approaches, such as k-means, standard spectral clustering, and hierarchical clustering, and to combine them in one general method. Our intuition is that the algorithm should be able to select only the good clusterings that provide discriminative information for each specific problem.
Apart from ridge regression, it would be beneficial to experiment with other fitting approaches that produce sparse representations in our constructed basis. For the active learning framework, one interesting direction is to extend it to more general clusterings that take into account the hierarchical structure of the data. In that way, we could take advantage of the underlying hierarchy and, by adaptively selecting the pruning of the cluster tree, potentially further improve our sampling strategy. Additionally, we believe that in the multi-clustering scenario our algorithm can be improved substantially to take better advantage of the variation in the multiple clustering representations of the data. Finally, as our methods scale to large problems and partially labeled data occur in many different areas, ranging from web documents to protein data, there is room for many interesting applications of the proposed methods.
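
As a rough illustration of the ridge regression scheme summarized in the abstract, the following sketch builds a basis of cluster-membership indicators from several clusterings, fits ridge regression on the labeled points only, and then picks the unlabeled point whose score lies closest to the decision boundary. It is a minimal sketch under stated assumptions, not the thesis implementation: labels are assumed binary in {-1, +1}, the clusterings are passed in as precomputed assignment lists (the thesis used normalized 1-spectral clustering, which is not reproduced here), and all function names, the parameter `lam`, and the toy data are hypothetical.

```python
from typing import List, Optional, Sequence

def indicator_basis(clusterings: Sequence[Sequence[int]]) -> List[List[float]]:
    """One row per point: one-hot cluster membership, concatenated over clusterings."""
    n = len(clusterings[0])
    rows: List[List[float]] = [[] for _ in range(n)]
    for assignment in clusterings:
        cluster_ids = sorted(set(assignment))
        for i, c in enumerate(assignment):
            rows[i].extend(1.0 if c == k else 0.0 for k in cluster_ids)
    return rows

def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def ridge_fit(rows: List[List[float]], labels: Sequence[Optional[float]],
              lam: float = 0.1) -> List[float]:
    """Minimize sum over labeled i of (y_i - x_i.w)^2 + lam * |w|^2 (lam > 0)."""
    d = len(rows[0])
    gram = [[lam if j == k else 0.0 for k in range(d)] for j in range(d)]
    rhs = [0.0] * d
    for x, y in zip(rows, labels):
        if y is None:  # unlabeled points do not enter the fit
            continue
        for j in range(d):
            rhs[j] += x[j] * y
            for k in range(d):
                gram[j][k] += x[j] * x[k]
    return solve(gram, rhs)

def predict(rows: List[List[float]], w: List[float]) -> List[float]:
    """Real-valued scores; the sign is the predicted label."""
    return [sum(xj * wj for xj, wj in zip(x, w)) for x in rows]

def most_uncertain(scores: Sequence[float],
                   labels: Sequence[Optional[float]]) -> int:
    """Index of the unlabeled point whose score is closest to the boundary at 0."""
    candidates = [i for i, y in enumerate(labels) if y is None]
    return min(candidates, key=lambda i: abs(scores[i]))

# Toy data (hypothetical): six points, two clusterings, two known labels.
clusterings = [[0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]]
labels = [+1, None, None, None, None, -1]

basis = indicator_basis(clusterings)
weights = ridge_fit(basis, labels, lam=0.1)
scores = predict(basis, weights)
predicted = [1 if s > 0 else -1 for s in scores]
query = most_uncertain(scores, labels)  # next point to send to the labeler
```

In an active learning loop, the label returned for `query` would be written into `labels` and the fit repeated. The tiny linear solver keeps the sketch dependency-free; for real data a numerical library would be the more sensible choice.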