In this thesis, the following spelling variants clustering problem is
considered: Given a list of distinct words, called a lexicon, compute (possibly
overlapping) clusters of words which are spelling variants of each other. We
are looking for algorithms that are both efficient and accurate. Accuracy is
measured with respect to human judgment: a cluster is 100% accurate if it
contains all true spelling variants of the unique correct word it contains and
no other words, as judged by a human.
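To make the problem concrete, here is a minimal sketch of spelling variants clustering under one simple (and deliberately naive) variant criterion: two words are considered variants if their edit distance is at most 1. This threshold and the quadratic all-pairs comparison are illustrative assumptions only, not the algorithms developed in this thesis; note how the resulting clusters may overlap, as the problem statement allows.

```python
def edit_distance(a, b):
    # standard dynamic-programming Levenshtein distance
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (a[i - 1] != b[j - 1]))  # substitution
    return dp[n]

def cluster(lexicon, max_dist=1):
    # naive O(n^2) clustering: each word's cluster is the set of all
    # lexicon words within edit distance max_dist; clusters may overlap
    return {w: {v for v in lexicon if edit_distance(w, v) <= max_dist}
            for w in lexicon}

clusters = cluster(["algorithm", "algoritm", "allgorithm", "banana"])
# clusters["algorithm"] contains "algoritm" and "allgorithm" but not "banana"
```

A real instance would replace the edit-distance threshold with a criterion that matches human judgment, and the all-pairs loop with something far more efficient, which is exactly the gap this thesis addresses.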
We have sifted the large body of literature on approximate string searching and
spelling correction for its applicability to our problem. We have combined
various ideas from previous approaches into two new algorithms, with two
different trade-offs between efficiency and accuracy. We have analyzed both
and tested them experimentally on a variety of test collections, which were
chosen to exhibit the whole spectrum of spelling errors as they occur in
practice (human-made, OCR-induced, garbage). Our largest lexicon, containing
roughly 25 million words, can be processed in half an hour on a single machine.
The accuracies we obtain range from 88% to 95%. We show that previous
approaches, if directly applied to our problem, are either significantly slower
or significantly less accurate or both.
Our spelling variants clustering problem arises naturally in the context of
search-engine spelling correction of the following kind: For a given query,
return not only documents matching the query words exactly but also those
matching their spelling variants. This is the inverse of the well-known “did
you mean: ...” web-search feature, where the error tolerance is on the side
of the query, and not on the side of the documents. We have integrated our
algorithms with the CompleteSearch
engine, and show that this feature can be achieved without significant blowup
in either index size or query processing time.
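The search-engine integration can be illustrated with a simplified query-expansion sketch: each query word is replaced by the disjunction of its precomputed spelling-variant cluster, so that documents containing any variant match. The `variants` dictionary and the OR-query syntax here are hypothetical stand-ins; they do not depict the actual CompleteSearch interface or index layout.

```python
def expand_query(query, variants):
    # replace each query word by the disjunction of its spelling-variant
    # cluster, so documents containing any variant of the word will match;
    # words without a cluster are kept as-is
    terms = []
    for word in query.split():
        cluster = variants.get(word, {word})
        terms.append("(" + " OR ".join(sorted(cluster)) + ")")
    return " ".join(terms)

# hypothetical precomputed clusters (output of the clustering step)
variants = {"algorithm": {"algorithm", "algoritm", "allgorithm"}}
expanded = expand_query("fast algorithm", variants)
# -> "(fast) (algorithm OR algoritm OR allgorithm)"
```

In a real engine the expansion would be resolved against the index rather than rewritten as a textual query, which is why keeping the index-size and query-time blowup small is the interesting part.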