Efficient Large-Scale Clustering of Spelling Variants, with Applications to 
Error-Tolerant Text Search

Celikik, Marjan

Celikik, M. (2007). Efficient Large-Scale Clustering of Spelling Variants, with Applications to Error-Tolerant Text Search. Master Thesis, Universität des Saarlandes, Saarbrücken.

Item is public

Basic information

Item permalink: https://hdl.handle.net/11858/00-001M-0000-0027-C33D-4
Version permalink: https://hdl.handle.net/11858/00-001M-0000-0027-C33E-2

Genre: Thesis

Files

Masterarbeit-Celikik-Marjan-2007.pdf (Any fulltext), 3MB

File permalink:
-

File name:
Masterarbeit-Celikik-Marjan-2007.pdf

Description:
-

OA-Status:

Visibility:
Restricted (Max Planck Institute for Informatics, MSIN; )

MIME type / checksum:
application/pdf

Technical metadata:

Copyright date:
-

Copyright info:
-

CC license:
-

Creators

Creators:
Celikik, Marjan¹,², Author
Weikum, Gerhard³, Referee
Bast, Holger¹, Advisor

Affiliations:
1 Algorithms and Complexity, MPI for Informatics, Max Planck Society, ou_24019
2 International Max Planck Research School, MPI for Informatics, Max Planck Society, Campus E1 4, 66123 Saarbrücken, DE, ou_1116551
3 Databases and Information Systems, MPI for Informatics, Max Planck Society, ou_24018

Content

Keywords: -

Abstract: In this thesis, the following spelling variants clustering problem is considered: Given a list of distinct words, called the lexicon, compute (possibly overlapping) clusters of words which are spelling variants of each other. We are looking for algorithms that are both efficient and accurate. Accuracy is measured with respect to human judgment, e.g., a cluster is 100% accurate if it contains all true spelling variants of the unique correct word it contains and no other words, as judged by a human. We have sifted the large body of literature on approximate string searching and the spelling correction problem for its applicability to our problem. We have combined various ideas from previous approaches into two new algorithms, with two distinctly different trade-offs between efficiency and accuracy. We have analyzed both algorithms and tested them experimentally on a variety of test collections, which were chosen to exhibit the whole spectrum of spelling errors as they occur in practice (human-made, OCR-induced, garbage). Our largest lexicon, containing roughly 25 million words, can be processed in half an hour on a single machine. The accuracies we obtain range from 88% to 95%. We show that previous approaches, if directly applied to our problem, are either significantly slower or significantly less accurate or both. Our spelling variants clustering problem arises naturally in the context of search engine spelling correction of the following kind: For a given query, return not only documents matching the query words exactly but also those matching their spelling variants. This is inverse to the well-known "did you mean: ..." web search engine feature, where the error tolerance is on the side of the query, and not on the side of the documents. We have integrated our algorithms with the CompleteSearch engine, and show that this feature can be achieved without significant blowup in either index size or query processing time.
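
To make the problem statement in the abstract concrete, the following is a minimal sketch of a naive baseline, not either of the two algorithms developed in the thesis. It assumes that "spelling variant" can be approximated by a small Levenshtein distance, which the abstract does not state; the function names, the `max_distance` threshold and the sample lexicon are hypothetical. It compares every pair of words, so its cost grows quadratically with the lexicon size, which is the kind of cost an efficient algorithm for a 25-million-word lexicon has to avoid.

```python
from typing import Dict, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # drop ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def naive_variant_clusters(lexicon: List[str],
                           max_distance: int = 1) -> Dict[str, List[str]]:
    """Map each word to the other lexicon words within max_distance.

    Assumption (not from the thesis): closeness in edit distance stands in
    for "is a spelling variant of". Clusters may overlap.
    """
    clusters = {}
    for word in lexicon:
        clusters[word] = sorted(
            other for other in lexicon
            if other != word and edit_distance(word, other) <= max_distance
        )
    return clusters


# Hypothetical sample lexicon, for illustration only.
lexicon = ["algorithm", "algoritm", "algorithem", "search", "serach"]
clusters = naive_variant_clusters(lexicon, max_distance=1)
```

With this sample, "algorithm" is grouped with both misspellings, while "algoritm" and "algorithem" are each grouped only with "algorithm", so the clusters overlap without being identical. The transposition in "serach" costs two edits under plain Levenshtein distance, so it is only caught with `max_distance=2`, which hints at why a fixed threshold alone gives a crude accuracy trade-off.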

Details

Language: eng - English

Dates: Accepted: 2007-10; Published: 2007

Publication status: Published

Pages: -

Publishing info: Saarbrücken : Universität des Saarlandes

Table of contents: -

Review method: -

Identifiers (DOI, ISBN, etc.): BibTeX citekey: Celikik2007

Degree: Master
