ausblenden:
Schlagwörter:
-
Zusammenfassung:
Counting in general, and estimating the cardinality of (multi-) sets in
particular, is highly desirable for a large variety of applications,
representing a foundational block for the efficient deployment and access of
emerging internetscale information systems. Examples of such applications range
from optimizing query access plans in internet-scale databases, to evaluating
the significance (rank/score) of various data items in information retrieval
applications. The key constraints that any acceptable solution must satisfy
are: (i) efficiency: the number of nodes that need be contacted for counting
purposes must be small in order to enjoy small latency and bandwidth
requirements; (ii) scalability, seemingly contradicting the efficiency goal:
arbitrarily large numbers of nodes nay need to add elements to a (multi-) set,
which dictates the need for a highly distributed solution, avoiding
server-based scalability, bottleneck, and availability problems; (iii) access
and storage load balancing: counting and related overhead chores should be
distributed fairly to the nodes of the network; (iv) accuracy: tunable, robust
(in the presence of dynamics and failures) and highly accurate cardinality
estimation; (v) simplicity and ease of integration: special, solution-specific
indexing structures should be avoided. In this paper, first we contribute a
highly-distributed, scalable, efficient, and accurate (multi-) set cardinality
estimator. Subsequently, we show how to use our solution to build and maintain
histograms, which have been a basic building block for query optimization for
centralized databases, facilitating their porting into the realm of
internet-scale data networks.