Abstract:
Web archives preserve the history of Web sites and have high long-term value for media and business analysts. Such archives are maintained by periodically re-crawling entire Web sites of interest.
From an archivist's point of view, the ideal way to ensure the highest possible data quality of the archive would be to "freeze" the complete contents of an entire Web site during the time span of crawling and capturing the site. Of course, this is practically infeasible.
To comply with the politeness specification of a Web site, the crawler needs to pause between successive HTTP requests in order to avoid unduly high load on the site's HTTP server. As a consequence, capturing a large Web site may span hours or even days, which increases the risk that the contents collected so far are incoherent with the parts that are still to be crawled.
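To make the time scale concrete, the following is a minimal sketch of such a polite capture loop, written in Python; the POLITENESS_DELAY constant and the capture_site helper are illustrative assumptions, not the crawler described in the paper, and real crawlers would typically derive the delay from the site's robots.txt or from observed server load.

    import time
    import urllib.request

    # Hypothetical pause (in seconds) between successive HTTP requests
    # to the same host, as required by the site's politeness rules.
    POLITENESS_DELAY = 2.0

    def capture_site(urls):
        # Fetch each page of the site in sequence, sleeping between
        # requests to keep the load on the site's HTTP server low.
        snapshot = {}
        for url in urls:
            with urllib.request.urlopen(url) as response:
                snapshot[url] = response.read()
            time.sleep(POLITENESS_DELAY)
        return snapshot

At a two-second pause per request, a site of 100,000 pages already takes more than two days to capture, which is exactly the time scale on which coherence problems arise.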
This paper introduces a model for identifying coherent sections of an archive and thus for measuring data quality in Web archiving. Additionally, we present a crawling strategy that aims to ensure archive coherence by minimizing the diffusion of Web site captures. Preliminary experiments demonstrate the usefulness of the model and the effectiveness of the strategy.