
Released

Conference Paper

Fast logistic regression for text categorization with variable-length n-grams

MPG Authors

Ifrim, Georgiana
Databases and Information Systems, MPI for Informatics, Max Planck Society

Weikum, Gerhard
Databases and Information Systems, MPI for Informatics, Max Planck Society

Citation

Ifrim, G., Bakir, G., & Weikum, G. (2008). Fast logistic regression for text categorization with variable-length n-grams. In B. Liu, S. Sarawagi, & Y. Li (Eds.), KDD 2008: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 354-362). New York, NY: ACM.


Citation link: https://hdl.handle.net/11858/00-001M-0000-000F-1BAB-6
Abstract
A common representation used in text categorization is the bag-of-words model (also known as the unigram model). Learning with this representation typically involves some preprocessing, e.g., stopword removal and stemming, which results in one explicit tokenization of the corpus. In this work, we introduce a logistic regression approach in which learning involves automatic tokenization. This allows us to weaken the a priori knowledge required about the corpus and results in a tokenization with variable-length (word or character) n-grams as basic tokens. We accomplish this by solving logistic regression using gradient ascent in the space of all n-grams. We show that this can be done very efficiently using a branch-and-bound approach that chooses the maximum gradient ascent direction projected onto a single dimension (i.e., a single candidate feature). Although the space is very large, our method allows us to investigate variable-length n-gram learning. We demonstrate the efficiency of our approach compared to state-of-the-art classifiers used for text categorization, such as cyclic coordinate descent logistic regression and support vector machines.
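
To make the search described in the abstract concrete, below is a minimal Python sketch of the core idea: coordinate-wise descent on the logistic loss over the space of all word n-grams, where a branch-and-bound search selects the single n-gram feature with the largest gradient magnitude. It is a sketch under simplifying assumptions, not the paper's implementation: binary presence features, word n-grams only, and a fixed step size in place of the paper's actual step-size selection. Function names (e.g., best_feature) and the toy corpus are illustrative, not from the paper. The pruning rule relies on the anti-monotonicity of occurrence: any extension of an n-gram occurs in a subset of the documents containing it, which bounds the gradient of every feature in the subtree.

# Sketch of branch-and-bound coordinate selection for logistic regression
# over variable-length word n-grams (assumptions noted in the text above).

import math
from collections import defaultdict

def contains(ngram, words):
    """True if the word tuple `ngram` occurs contiguously in `words`."""
    n = len(ngram)
    return any(tuple(words[i:i + n]) == ngram for i in range(len(words) - n + 1))

def extensions(prefix, docs):
    """All (n+1)-grams occurring in the corpus that extend `prefix` by one word."""
    exts, n = set(), len(prefix)
    for words in docs:
        for i in range(len(words) - n):
            if tuple(words[i:i + n]) == prefix:
                exts.add(tuple(words[i:i + n + 1]))
    return exts

def score(words, beta):
    """Linear score of a document under the current sparse weight vector."""
    return sum(b for g, b in beta.items() if contains(g, words))

def best_feature(docs, y, beta, max_len=4):
    """Branch-and-bound search for the n-gram with the largest |gradient|.

    With p_i = 1 / (1 + exp(y_i * score_i)), the gradient of the logistic
    loss w.r.t. a binary n-gram feature g is -sum_{i: g in d_i} y_i * p_i.
    Every extension of g occurs in a subset of the documents containing g,
    so max(sum of p_i over positive docs, sum over negative docs) bounds
    |gradient| for the entire subtree of extensions of g; subtrees whose
    bound cannot beat the best gradient found so far are pruned.
    """
    p = [1.0 / (1.0 + math.exp(y_i * score(words, beta)))
         for words, y_i in zip(docs, y)]
    best, best_grad = None, 0.0
    stack = [(w,) for w in {w for words in docs for w in words}]  # unigrams
    while stack:
        g = stack.pop()
        idx = [i for i, words in enumerate(docs) if contains(g, words)]
        grad = -sum(y[i] * p[i] for i in idx)
        if abs(grad) > abs(best_grad):
            best, best_grad = g, grad
        bound = max(sum(p[i] for i in idx if y[i] > 0),
                    sum(p[i] for i in idx if y[i] < 0))
        if bound > abs(best_grad) and len(g) < max_len:  # else: prune subtree
            stack.extend(extensions(g, docs))
    return best, best_grad

def train(docs, y, iterations=20, step=0.5):
    beta = defaultdict(float)  # sparse weights over the selected n-grams
    for _ in range(iterations):
        g, grad = best_feature(docs, y, beta)
        if g is None or abs(grad) < 1e-9:
            break
        beta[g] -= step * grad  # descend the loss along a single coordinate
    return dict(beta)

if __name__ == "__main__":
    corpus = ["the movie was very good", "a very good read",
              "the movie was very bad", "a very bad experience"]
    docs = [s.split() for s in corpus]
    y = [+1, +1, -1, -1]  # binary class labels
    for g, w in sorted(train(docs, y).items(), key=lambda kv: -abs(kv[1])):
        print(" ".join(g), round(w, 3))

The sketch only illustrates why the search over an exponentially large feature space stays feasible: most of the n-gram tree is never visited because the occurrence-based bound eliminates whole subtrees. The paper applies the same idea to character n-grams as well and at a much larger scale.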