Better Filtering with Gapped q-Grams

Burkhardt, Stefan; Kärkkäinen, Juha

Item

ITEM ACTIONSEXPORT

Add to Basket

Local TagsRelease HistoryDetailsSummary

Released

Conference Paper

Better Filtering with Gapped q-Grams

MPS-Authors

/persons/resource/persons44209

Burkhardt, Stefan
Algorithms and Complexity, MPI for Informatics, Max Planck Society;

/persons/resource/persons44717

Kärkkäinen, Juha
Algorithms and Complexity, MPI for Informatics, Max Planck Society;

External Resource

No external resources are shared

Fulltext (restricted access)

There are currently no full texts shared for your IP range.

Fulltext (public)

There are no public fulltexts stored in PuRe

Supplementary Material (public)

There is no public supplementary material available

Citation

Burkhardt, S., & Kärkkäinen, J. (2001). Better Filtering with Gapped q-Grams. In A. Amir, & G. Landau (Eds.), Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching (pp. 73-85). Berlin, Germany: Springer.

Cite as: https://hdl.handle.net/11858/00-001M-0000-000F-3179-D

Abstract

The q-gram filter is a popular filtering method for approximate string matching. It compares substrings of length q (the q-grams) in the pattern and the text to identify the text areas that might contain a match. A generalization of the method is to use gapped q-grams, subsets of q characters in some fixed non-contiguous shape, instead of contiguous substrings. Although mentioned a few times in the literature, this generalization has never been studied in any depth. In ths paper, we report the first results from a study on gapped q-grams. We show that gapped q-grams can provide orders of magnitude faster and/or more efficient filtering than contiguous q-grams. The performance, however, depends on the shape of the q-grams. The best shaoes are rare and often posess no apparen regularity. We show how to recognize good shapes and demonstrate with experiments their advantage over both contiguous and average shapes. We concentrate here on the k mismatches problem, but also outline an approach for extending the results to the more common k differences problem.