Abstract:
Indexing the Web and meeting the throughput, response-time, and
failure-resilience requirements of a search engine requires massive storage and
computational resources and a careful system design for scalability. This is
exemplified by the big data centers of the leading commercial search engines.
Various proposals and debates have appeared in the literature as to whether Web
indexes can be implemented in a fully distributed or even peer-to-peer manner
without impeding scalability, and different partitioning strategies have been
worked out. In this paper, we resume this ongoing discussion by analyzing the
design space for distributed Web indexing, considering the influence of
partitioning strategies as well as different storage technologies including
Flash-RAM. We outline and discuss the pros and cons of three fundamental
alternatives, and characterize their total costs for meeting all performance
and availability requirements. We give arguments in favor of a system design
based on term partitioning over a DHT-based peer-to-peer network with modern
top-k query processing and a judiciously designed combination of disk and
Flash-RAM storage, and we show that this design has intriguing properties and a
very attractive cost/performance ratio.