Abstract:
We show that eigenvector decomposition can be used to extract a term taxonomy
from a given collection of text documents. Until now, methods based on eigenvector
decomposition, such as latent semantic indexing (LSI) or principal component
analysis (PCA), were known to be useful only for extracting symmetric relations
between terms. We give a precise mathematical criterion for distinguishing
between four kinds of relations that a pair of terms in a given collection can have:
unrelated (car - fruit), symmetrically related (car - automobile),
asymmetrically related with the first term being more specific than the second
(banana - fruit), and asymmetrically related in the other direction (fruit -
banana). We give theoretical evidence for the soundness of our criterion by
showing that, in a simplified mathematical model, the criterion yields the
apparently correct classification. We applied our scheme to the reconstruction of a
selected part of the Open Directory Project (ODP) hierarchy, with promising
results.