Abstract:
Web archives preserve the history of born-digital content and offer great
potential for sociologists, business analysts, and legal experts on
intellectual property and compliance issues. Data quality is crucial for
these purposes. Ideally, crawlers should gather coherent captures of entire
Web sites, but the politeness etiquette and completeness requirement
mandate very slow, long-duration crawling while Web sites undergo changes.
This paper presents the SHARC framework for assessing the data quality in
Web archives and for tuning capturing strategies towards better quality
with given resources. We define data-quality measures, characterize their
properties, and develop a suite of quality-conscious scheduling strategies
for archive crawling. Our framework includes single-visit and
visit-revisit crawls. Single-visit crawls download every page of a site
exactly once in an order that aims to minimize the ``blur'' in capturing
the site. Visit-revisit strategies revisit pages after their initial
downloads to check for intermediate changes. The revisiting order aims to
maximize the ``coherence'' of the site capture (the number of pages that did not
change during the capture). The quality notions of blur and coherence are
formalized in the paper. Blur is a stochastic notion that reflects the
expected number of page changes that a time-travel access to a site capture
would accidentally see, instead of the ideal view of an instantaneously
captured, ``sharp'' site. Coherence is a deterministic quality measure
that counts the number of unchanged and thus coherently captured pages in a
site snapshot. Strategies that aim to either minimize blur or maximize
coherence are based on prior knowledge of or predictions for the change
rates of individual pages. Our framework includes fairly accurate
classifiers for change predictions.
All strategies are fully implemented in a testbed, and shown to be
effective by experiments with both synthetically generated sites and a
periodic crawl series for different Web sites.
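
As a rough illustration of the two quality notions described above, the following sketch computes coherence as a count of pages that did not change during the capture, and an expected blur for a single time-travel access time. It is not the paper's formalization: the Poisson change model, the `PageCapture` record, its field names, and the sample values are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class PageCapture:
    """One downloaded page of a site capture (hypothetical record layout)."""
    url: str
    change_rate: float            # assumed Poisson change rate per time unit
    download_time: float          # when the crawler fetched the page
    changed_during_capture: bool  # e.g. as detected by a revisit


def coherence(captures):
    """Deterministic measure: number of pages unchanged during the capture."""
    return sum(1 for c in captures if not c.changed_during_capture)


def expected_blur(captures, reference_time):
    """Expected number of page changes between each download and the
    reference time, assuming changes follow a Poisson process per page."""
    return sum(
        c.change_rate * abs(reference_time - c.download_time)
        for c in captures
    )


# Made-up sample capture of a three-page site.
captures = [
    PageCapture("/index", 0.5, 0.0, True),
    PageCapture("/about", 0.1, 1.0, False),
    PageCapture("/news", 1.0, 2.0, False),
]

print(coherence(captures))           # 2
print(expected_blur(captures, 1.0))  # 1.5
```

Under this assumed model, a page with a high change rate contributes less blur when it is downloaded close to the reference time, which gives some intuition for why the download order matters.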