Abstract:
A fundamental issue in statistics, pattern recognition, and machine learning is
that of classification. In a traditional classification problem, we wish to
assign one of k labels (or classes) to each of n objects (or documents), in a
way that is consistent with some observed data available about that problem.
To achieve better classification results, we try to capture the information
derived from pairwise relationships between objects, in particular hyperlinks
between web documents. The use of hyperlinks poses new problems not addressed
in the extensive text classification literature. Links contain high-quality
semantic clues that a purely text-based classifier cannot take advantage of.
However, exploiting link information is non-trivial because it is noisy, and a
naive use of terms in the link neighborhood of a document can degrade accuracy.
The problem becomes even harder when only a very small fraction of document
labels are known to the classifier and can be used for training, as is the
case in a real classification scenario. Our work is based on an algorithm
proposed by Soumen Chakrabarti and uses the theory of Markov Random Fields to
derive a relaxation labelling technique for the class assignment problem. We
show that the extra information contained in the hyperlinks between the
documents can be exploited to achieve significant improvement in the performance
of classification. We implemented our algorithm in Java and ran our experiments
on two sets of data obtained from the DBLP and IMDB databases. We observed up to
5.5% improvement in the accuracy of the classification and up to 10% higher
recall and precision results.
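
To make the idea of relaxation labelling over linked documents concrete, the following is a minimal sketch in Python. It is not the Java implementation described in the abstract, nor Chakrabarti's exact algorithm. The function name `relaxation_labeling`, the class-compatibility matrix `compat`, the toy graph, and all numeric values are assumptions chosen for illustration. Each unlabeled document starts from a text-only class distribution, which is then repeatedly reweighted by how compatible each class is with the current label estimates of its linked neighbors. Known labels are held fixed.

```python
def relaxation_labeling(text_probs, edges, compat, known=None, iterations=10):
    """Simplified relaxation labelling sketch (illustrative, hypothetical API).

    text_probs: per-document class distributions from a text-only classifier
    edges:      undirected links (i, j) between document indices
    compat:     compat[c][d] = assumed affinity between a document of class c
                and a linked neighbor of class d
    known:      {document index: class index} for the labeled documents
    """
    known = known or {}
    n, k = len(text_probs), len(compat)

    # Build an undirected neighbor list from the link pairs.
    neighbors = [[] for _ in range(n)]
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    # Labeled documents start (and stay) as one-hot distributions.
    probs = []
    for i in range(n):
        if i in known:
            probs.append([1.0 if c == known[i] else 0.0 for c in range(k)])
        else:
            probs.append(list(text_probs[i]))

    for _ in range(iterations):
        updated = []
        for i in range(n):
            if i in known:
                updated.append(probs[i])
                continue
            scores = []
            for c in range(k):
                # Text evidence times the expected compatibility with
                # each neighbor's current label estimate.
                s = text_probs[i][c]
                for j in neighbors[i]:
                    s *= sum(compat[c][d] * probs[j][d] for d in range(k))
                scores.append(s)
            total = sum(scores)
            updated.append([s / total for s in scores] if total > 0 else probs[i])
        probs = updated
    return probs


# Toy example (assumed data): documents 0 and 1 are labeled class 0;
# document 2 links to both, and its text alone slightly favors class 1.
text_probs = [[0.5, 0.5], [0.5, 0.5], [0.45, 0.55]]
edges = [(0, 2), (1, 2)]
compat = [[0.9, 0.1], [0.1, 0.9]]  # linked documents tend to share a class
probs = relaxation_labeling(text_probs, edges, compat, known={0: 0, 1: 0})
predicted = max(range(2), key=lambda c: probs[2][c])
```

In this toy setting the link evidence outweighs the weak text evidence, so document 2 ends up assigned to class 0 even though a text-only classifier would pick class 1. With noisier links or a less informative compatibility matrix, the same update can just as easily hurt accuracy, which is the difficulty the abstract points out.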