Abstract:
A common representation used in text categorization is the bag-of-words model
(a.k.a. unigram model). Learning with this representation typically involves
some preprocessing, e.g., stopword removal and stemming, which results in a
single explicit tokenization of the corpus. In this work, we introduce a
logistic regression approach in which learning involves automatic
tokenization. This weakens the a priori knowledge required about the corpus
and yields a tokenization with variable-length (word or character) n-grams as
basic tokens. We accomplish this by solving logistic regression using gradient
ascent in the space of all n-grams. We show that this can be done very
efficiently using a branch-and-bound approach that chooses the maximum
gradient ascent direction projected onto a single dimension (i.e., candidate
feature). Although the space is very large, our method allows us to
investigate variable-length n-gram learning. We demonstrate the efficiency of
our approach in comparison to state-of-the-art classifiers used for text
categorization, such as cyclic coordinate descent logistic regression and
support vector machines.
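To illustrate the core idea, the following is a minimal, hypothetical sketch (all names are our own, not from the paper) of selecting the single n-gram feature with the largest absolute gradient of the logistic likelihood, pruned with a branch-and-bound argument: since any extension of an n-gram g can only occur in documents that contain g, the sums of positive and negative per-document residuals over g's occurrences bound the gradient magnitude of every supersequence of g.

```python
from collections import defaultdict

def best_ngram_feature(docs, residuals, max_len=5):
    """Find the character n-gram g whose binary-occurrence feature
    maximizes |sum_i residual_i * 1[g in doc_i]|.

    Pruning bound: for any extension g' of g,
    |grad(g')| <= max(sum of positive residuals,
                      -sum of negative residuals) over docs containing g.
    """
    best_g, best_val = None, 0.0
    # Start from all unigrams and the set of documents each occurs in.
    frontier = defaultdict(set)
    for i, d in enumerate(docs):
        for ch in set(d):
            frontier[ch].add(i)
    stack = list(frontier.items())
    while stack:
        g, occ = stack.pop()
        pos = sum(residuals[i] for i in occ if residuals[i] > 0)
        neg = -sum(residuals[i] for i in occ if residuals[i] < 0)
        grad = pos - neg
        if abs(grad) > best_val:
            best_g, best_val = g, abs(grad)
        # Prune: no supersequence of g can beat the current best.
        if max(pos, neg) <= best_val or len(g) == max_len:
            continue
        # Branch: extend g by one character, restricted to docs in occ.
        ext = defaultdict(set)
        for i in occ:
            d, start = docs[i], 0
            while True:
                j = d.find(g, start)
                if j < 0:
                    break
                if j + len(g) < len(d):
                    ext[g + d[j + len(g)]].add(i)
                start = j + 1
        stack.extend(ext.items())
    return best_g, best_val
```

In the full method, this selection step would be repeated inside a coordinate-wise gradient ascent loop, recomputing the per-document residuals after each chosen feature; the sketch above shows only a single selection.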