Abstract:
A common representation used in text categorization is the bag-of-words model
(a.k.a. unigram model). Learning with this representation typically involves
some preprocessing, e.g., stopword removal and stemming, which results in a
single explicit tokenization of the corpus. In this work, we introduce a
logistic regression approach in which learning involves automatic
tokenization. This weakens the a priori knowledge required about the corpus
and yields a tokenization with variable-length (word or character) n-grams as
basic tokens. We accomplish this by solving logistic regression using gradient
ascent in the space of all n-grams. We show that this can be done very
efficiently using a branch-and-bound approach that chooses the maximum
gradient ascent direction projected onto a single dimension (i.e., candidate
feature). Although the space is very large, our method allows us to
investigate variable-length n-gram learning. We demonstrate the efficiency of
our approach in comparison to state-of-the-art classifiers used for text
categorization, such as cyclic coordinate descent logistic regression and
support vector machines.
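To illustrate the core idea, the following is a minimal, hypothetical sketch (all names are our own, not from the paper) of selecting the single n-gram feature with the largest absolute gradient of the logistic likelihood, pruned with a branch-and-bound argument: since any extension of an n-gram g can only occur in documents that contain g, the sums of positive and negative per-document residuals over g's occurrences bound the gradient magnitude of every supersequence of g.

```python
from collections import defaultdict

def best_ngram_feature(docs, residuals, max_len=5):
    """Find the character n-gram g whose binary-occurrence feature
    maximizes |sum_i residual_i * 1[g in doc_i]|.

    Pruning bound: for any extension g' of g,
    |grad(g')| <= max(sum of positive residuals,
                      -sum of negative residuals) over docs containing g.
    """
    best_g, best_val = None, 0.0
    # Start from all unigrams and the set of documents each occurs in.
    frontier = defaultdict(set)
    for i, d in enumerate(docs):
        for ch in set(d):
            frontier[ch].add(i)
    stack = list(frontier.items())
    while stack:
        g, occ = stack.pop()
        pos = sum(residuals[i] for i in occ if residuals[i] > 0)
        neg = -sum(residuals[i] for i in occ if residuals[i] < 0)
        grad = pos - neg
        if abs(grad) > best_val:
            best_g, best_val = g, abs(grad)
        # Prune: no supersequence of g can beat the current best.
        if max(pos, neg) <= best_val or len(g) == max_len:
            continue
        # Branch: extend g by one character, restricted to docs in occ.
        ext = defaultdict(set)
        for i in occ:
            d, start = docs[i], 0
            while True:
                j = d.find(g, start)
                if j < 0:
                    break
                if j + len(g) < len(d):
                    ext[g + d[j + len(g)]].add(i)
                start = j + 1
        stack.extend(ext.items())
    return best_g, best_val
```

In the full method, this selection step would be repeated inside a coordinate-wise gradient ascent loop, recomputing the per-document residuals after each chosen feature; the sketch above shows only a single selection.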