Keywords:
-
Abstract:
This paper addresses the problem of semi-supervised classification on document
collections using retraining (also called self-training). A possible
application is focused Web crawling, which may start with very few manually
selected training documents but can be enhanced by automatically adding
initially unlabeled, positively classified Web pages for retraining. Such an
approach is by itself not robust and faces tuning problems regarding parameters
like the number of selected documents, the number of retraining iterations, and
the ratio of positively to negatively classified samples used for retraining. The
paper develops methods for automatically tuning these parameters, based on
predicting the leave-one-out error of a retrained classifier and on preventing
the classifier from being diluted by selecting too many or too weak documents for
retraining. Our experiments with three different datasets confirm the practical
viability of the approach.
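
The retraining loop described in the abstract can be illustrated with a minimal
sketch. Everything here is an assumption made for illustration and not the
paper's method: the nearest-centroid scorer stands in for whatever classifier
the paper uses, and the names `self_train`, `iterations`, `per_iter`, and
`min_confidence` are hypothetical. The three parameters are fixed by hand,
whereas the paper tunes them automatically from a predicted leave-one-out
error; that estimator is not shown.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def centroid(points: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def score(x: Vector, pos_c: Vector, neg_c: Vector) -> float:
    """Positive when x is closer to the positive centroid than to the negative."""
    d_pos = sum((a - b) ** 2 for a, b in zip(x, pos_c))
    d_neg = sum((a - b) ** 2 for a, b in zip(x, neg_c))
    return d_neg - d_pos


def self_train(
    pos: List[Vector],
    neg: List[Vector],
    unlabeled: List[Vector],
    iterations: int = 3,          # hypothetical: number of retraining rounds
    per_iter: int = 1,            # hypothetical: samples added per class per round
    min_confidence: float = 1.0,  # hypothetical: guards against weak samples
) -> Tuple[Callable[[Vector], bool], List[Vector], List[Vector]]:
    """Toy self-training: repeatedly add the most confident unlabeled samples."""
    if min_confidence <= 0 or per_iter <= 0:
        raise ValueError("min_confidence and per_iter must be positive")
    pos, neg, pool = list(pos), list(neg), list(unlabeled)
    for _ in range(iterations):
        if not pool:
            break
        pos_c, neg_c = centroid(pos), centroid(neg)
        # Rank pool indices by score, most positive first.
        scored = sorted(
            ((score(x, pos_c, neg_c), i) for i, x in enumerate(pool)),
            reverse=True,
        )
        pos_idx = [i for s, i in scored[:per_iter] if s >= min_confidence]
        neg_idx = [i for s, i in scored[-per_iter:] if s <= -min_confidence]
        if not pos_idx and not neg_idx:
            break  # nothing confident enough; stop rather than dilute
        pos.extend(pool[i] for i in pos_idx)
        neg.extend(pool[i] for i in neg_idx)
        chosen = set(pos_idx) | set(neg_idx)
        pool = [x for i, x in enumerate(pool) if i not in chosen]

    pos_c, neg_c = centroid(pos), centroid(neg)

    def predict(x: Vector) -> bool:
        return score(x, pos_c, neg_c) > 0

    return predict, pos, neg
```

In this sketch, `per_iter` bounds the number of selected documents, `iterations`
bounds the retraining rounds, and using the same `per_iter` for both classes
fixes the positive-to-negative ratio at 1:1. The confidence threshold is one
simple way to keep weak documents out of the training set.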