Keywords:
-
Abstract:
Top-$k$ query processing is an important building block for ranked retrieval,
with applications ranging from text and data integration to distributed
aggregation of network logs and sensor data.
Top-$k$ queries operate on index lists for a query's elementary conditions
and aggregate scores for result candidates. One of the best implementation
methods in this setting is the family of threshold algorithms, which aim
to terminate the index scans as early as possible based on lower and upper
bounds for the final scores of result candidates. This procedure
performs sequential disk accesses for sorted index scans, but also has the
option of performing random accesses to resolve score uncertainty. This entails
scheduling for the two kinds of accesses: 1) the prioritization of different
index lists in the sequential accesses, and 2) the decision on when to perform
random accesses and for which candidates.
The prior literature has studied some of these scheduling issues, but only for
each of the two access types in isolation.
The current paper takes an integrated view of the scheduling issues and develops
novel strategies that outperform prior proposals by a large margin.
Our main contributions are new, principled scheduling methods based on a
knapsack-related optimization for sequential accesses and a cost model for
random accesses.
The methods can be further boosted by harnessing probabilistic estimators for
scores, selectivities, and index list correlations.
In performance experiments with three different datasets (TREC Terabyte, HTTP
server logs, and IMDB), our methods achieved significant performance gains
compared to the best previously known methods.
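
To make the role of lower and upper score bounds concrete, the following is a minimal sketch of a baseline threshold algorithm that uses sorted accesses only. It is not the scheduling method contributed by the paper: it scans all index lists round-robin, performs no random accesses, and assumes non-negative scores aggregated by summation. The function name `nra_top_k` and the data layout are illustrative assumptions.

```python
def nra_top_k(index_lists, k):
    """Baseline threshold-style top-k with sorted accesses only (a sketch).

    index_lists: one list per query condition, each holding (doc_id, score)
    pairs sorted by descending score. Scores are assumed non-negative and
    are aggregated by summation (both are assumptions of this sketch).
    Returns (top_k, depth), where top_k holds (doc_id, lower_bound) pairs
    and depth is the number of round-robin scan rounds performed.
    """
    m = len(index_lists)
    seen = {}  # doc_id -> {list index: score seen in that list}
    # high[i] bounds the score of any entry not yet read from list i.
    high = [lst[0][1] if lst else 0.0 for lst in index_lists]
    depth = max((len(lst) for lst in index_lists), default=0)

    def ranked(worst):
        # Order candidates by lower bound, breaking ties by doc_id.
        return sorted(worst, key=lambda d: (-worst[d], d))[:k]

    for pos in range(depth):
        # One round of sorted accesses: read the next entry of every list.
        for i, lst in enumerate(index_lists):
            if pos < len(lst):
                doc, score = lst[pos]
                seen.setdefault(doc, {})[i] = score
                high[i] = score
            else:
                high[i] = 0.0  # list exhausted, nothing more to gain

        # Lower bound: sum of the scores already seen for a candidate.
        worst = {d: sum(sc.values()) for d, sc in seen.items()}
        # Upper bound: lower bound plus high[i] for each list still missing.
        best = {
            d: worst[d] + sum(high[i] for i in range(m) if i not in sc)
            for d, sc in seen.items()
        }

        top = ranked(worst)
        if len(top) < k:
            continue
        min_k = worst[top[-1]]
        unseen_best = sum(high)  # bound for documents never seen so far
        others_best = max((best[d] for d in seen if d not in top), default=0.0)
        # Stop once no other candidate can still overtake the current top-k.
        if max(unseen_best, others_best) <= min_k:
            return [(d, worst[d]) for d in top], pos + 1

    worst = {d: sum(sc.values()) for d, sc in seen.items()}
    return [(d, worst[d]) for d in ranked(worst)], depth
```

On early termination the reported scores are lower bounds rather than exact final scores, which is the score uncertainty that random accesses can resolve. Deciding which list to scan next and when to issue random accesses is exactly the scheduling question the abstract describes, and this sketch deliberately leaves it out.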