hide
Free keywords:
-
Abstract:
A popular and well-studied class of filters for approximate string
matching compares substrings of length $q$, \emph{the $q$-grams}, in
the pattern and the text to identify text areas that contain potential
matches. A generalization of the method that uses {\em gapped}
$q$-grams instead of contiguous substrings is mentioned a few times in
literature but has never been analyzed in any depth. In this paper,
we report the first results of a study on gapped $q$-grams. We show
that gapped $q$-grams can provide orders of magnitude faster and/or
more efficient filtering than contiguous $q$-grams. To achieve these
results the arrangement of the gaps in the $q$-gram and a filter
parameter called \emph{threshold} have to be optimized. Both of these
tasks are nontrivial combinatorial optimization problems for which we
present efficient solutions. We concentrate on the $k$~mismatches
problem, i.e, approximate string matching with the Hamming distance.