Automated Retraining Methods for Document Classification and Their Parameter Tuning

Siersdorfer, Stefan; Weikum, Gerhard; Ngu, Anne H. H.; Kitsuregawa, Masaru; Neuhold, Erich J.; Chung, Jen-Yao; Sheng, Quan Z.

Released

Conference Paper

MPS-Authors

Siersdorfer, Stefan
Databases and Information Systems, MPI for Informatics, Max Planck Society;

Weikum, Gerhard
Databases and Information Systems, MPI for Informatics, Max Planck Society;

Citation

Siersdorfer, S., & Weikum, G. (2005). Automated Retraining Methods for Document Classification and Their Parameter Tuning. In Web Information Systems Engineering - WISE 2005: 6th International Conference on Web Information Systems Engineering (pp. 478-486). Berlin, Germany: Springer.

Cite as: https://hdl.handle.net/11858/00-001M-0000-000F-25D8-D

Abstract

This paper addresses the problem of semi-supervised classification on document collections using retraining (also called self-training). A possible application is focused Web crawling, which may start with very few manually selected training documents but can be enhanced by automatically adding initially unlabeled, positively classified Web pages for retraining. Such an approach is not robust by itself and faces tuning problems regarding parameters such as the number of selected documents, the number of retraining iterations, and the ratio of positively to negatively classified samples used for retraining. The paper develops methods for automatically tuning these parameters, based on predicting the leave-one-out error of a retrained classifier and on preventing the classifier from being diluted by the selection of too many or too weak documents for retraining. Our experiments with three different datasets confirm the practical viability of the approach.
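
To make the retraining loop described in the abstract concrete, here is a minimal sketch of generic self-training in Python. It is not the paper's method: the centroid-based cosine scorer and the names `vectorize`, `score`, and `self_train` are illustrative assumptions, as are the parameters `iterations`, `per_iter`, and `min_margin`. These are fixed by hand here, standing in for the quantities the paper tunes automatically; the leave-one-out error prediction is not implemented.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (hypothetical stand-in for real features)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(docs):
    """Sum of the term counts of all documents in one class."""
    total = Counter()
    for d in docs:
        total.update(d)
    return total


def score(doc, pos, neg):
    """Positive values favour the positive class, negative the negative one."""
    return cosine(doc, centroid(pos)) - cosine(doc, centroid(neg))


def self_train(pos, neg, unlabeled, iterations=3, per_iter=1, min_margin=0.2):
    """Move the most confidently classified unlabeled documents into the
    training sets, then rescore with the enlarged sets (assumed parameters)."""
    pos, neg, pool = list(pos), list(neg), list(unlabeled)
    for _ in range(iterations):
        ranked = sorted(
            ((score(d, pos, neg), i) for i, d in enumerate(pool)), reverse=True
        )
        # Take at most per_iter documents per class; weak ones are skipped
        # so that the training sets are not diluted.
        top_pos = [i for s, i in ranked[:per_iter] if s >= min_margin]
        top_neg = [
            i
            for s, i in ranked[::-1][:per_iter]
            if s <= -min_margin and i not in top_pos
        ]
        if not top_pos and not top_neg:
            break  # nothing confident enough is left
        pos += [pool[i] for i in top_pos]
        neg += [pool[i] for i in top_neg]
        chosen = set(top_pos) | set(top_neg)
        pool = [d for i, d in enumerate(pool) if i not in chosen]
    return pos, neg, pool


if __name__ == "__main__":
    pos = [vectorize("python code compiler")]
    neg = [vectorize("football goal match")]
    pool = [
        vectorize(t)
        for t in ["python compiler bug", "football match score", "weather today"]
    ]
    new_pos, new_neg, rest = self_train(pos, neg, pool)
    print(len(new_pos), len(new_neg), len(rest))
```

In this sketch, `per_iter` plays the role of the number of selected documents and `iterations` the number of retraining rounds. Taking up to `per_iter` documents per class fixes the positive-to-negative ratio at 1:1, and `min_margin` is one simple way to keep weak documents out of the training sets.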