Overlap-aware global df estimation in distributed information retrieval systems

Bender, Matthias; Michel, Sebastian; Weikum, Gerhard; Triantafilou, Peter

Bender, M., Michel, S., Weikum, G., & Triantafilou, P. (2006). Overlap-aware global df estimation in distributed information retrieval systems (MPI-I-2006-5-001). Saarbrücken: Max-Planck-Institut für Informatik.

Item is Released

Basic

Item Permalink: https://hdl.handle.net/11858/00-001M-0000-0014-6719-8
Version Permalink: https://hdl.handle.net/11858/00-001M-0000-0014-7898-3

Genre: Report

Files

MPI-I-2006-5-001.pdf (Any fulltext), 571KB

File Permalink:
https://hdl.handle.net/11858/00-001M-0000-0014-671B-4

Name:
MPI-I-2006-5-001.pdf

Description:
-

OA-Status:

Visibility:
Public

MIME-Type / Checksum:
application/pdf / [MD5]

Copyright Date:
-

Copyright Info:
-

License:
-

Creators

Creators:
Bender, Matthias¹, Author
Michel, Sebastian¹, Author
Weikum, Gerhard¹, Author
Triantafilou, Peter², Author

Affiliations:
¹ Databases and Information Systems, MPI for Informatics, Max Planck Society, ou_24018
² External Organizations, ou_persistent22

Content

Free keywords: -

Abstract: Peer-to-Peer (P2P) search engines and other forms of distributed information retrieval (IR) are gaining momentum. Unlike in centralized IR, it is difficult and expensive to compute statistical measures about the entire document collection as it is widely distributed across many computers in a highly dynamic network. On the other hand, such network-wide statistics, most notably, global document frequencies of the individual terms, would be highly beneficial for ranking global search results that are compiled from different peers. This paper develops an efficient and scalable method for estimating global document frequencies in a large-scale, highly dynamic P2P network with autonomous peers. The main difficulty that is addressed in this paper is that the local collections of different peers may arbitrarily overlap, as many peers may choose to gather popular documents that fall into their specific interest profile. Our method is based on hash sketches as an underlying technique for compact data synopses, and exploits specific properties of hash sketches for duplicate elimination in the counting process. We report on experiments with real Web data that demonstrate the accuracy of our estimation method and also the benefit for better search result ranking.
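The duplicate-elimination property mentioned in the abstract can be illustrated with a minimal Python sketch of a Flajolet-Martin style hash sketch. This is not the report's implementation: the function names (`make_sketch`, `union`, `estimate`), the parameters `NUM_BITMAPS` and `BITMAP_BITS`, the use of MD5 as the hash function, and the example document IDs are all assumptions made for illustration. The point it shows is that the bitwise OR of two peers' sketches for a term equals the sketch of the union of their document sets, so documents held by both peers are counted only once.

```python
import hashlib

PHI = 0.77351        # correction constant from Flajolet and Martin's analysis
NUM_BITMAPS = 64     # assumed number of bitmaps (stochastic averaging)
BITMAP_BITS = 32     # assumed width of each bitmap


def _hash64(doc_id: str) -> int:
    """Map a document ID to a 64-bit integer (MD5 is an illustrative choice)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def _rho(value: int) -> int:
    """0-based position of the least-significant 1 bit, capped at the bitmap width."""
    if value == 0:
        return BITMAP_BITS - 1
    return min((value & -value).bit_length() - 1, BITMAP_BITS - 1)


def make_sketch(doc_ids):
    """Build a hash sketch over the documents in which a term occurs."""
    sketch = [0] * NUM_BITMAPS
    for doc_id in doc_ids:
        h = _hash64(doc_id)
        bucket = h % NUM_BITMAPS       # which bitmap this document updates
        rest = h // NUM_BITMAPS        # remaining hash bits pick the bit position
        sketch[bucket] |= 1 << _rho(rest)
    return sketch


def union(a, b):
    """Combine two sketches; duplicates set the same bits, so they vanish."""
    return [x | y for x, y in zip(a, b)]


def estimate(sketch):
    """Estimate the number of distinct documents summarized by the sketch."""
    total_r = 0
    for bitmap in sketch:
        r = 0
        while bitmap & (1 << r):       # index of the lowest unset bit
            r += 1
        total_r += r
    return NUM_BITMAPS / PHI * 2 ** (total_r / NUM_BITMAPS)


# Two hypothetical peers whose posting lists for one term overlap in 1000 documents.
peer_a = [f"doc{i}" for i in range(2000)]
peer_b = [f"doc{i}" for i in range(1000, 3000)]
combined = union(make_sketch(peer_a), make_sketch(peer_b))
print("naive sum of local df:", len(peer_a) + len(peer_b))
print("sketch-based global df estimate:", round(estimate(combined)))
```

The naive sum counts the shared documents twice, whereas the combined sketch is identical to one built over the 3000 distinct documents. The estimate itself is approximate, and its accuracy depends on the number of bitmaps.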

Details

Language(s): eng - English

Dates: Date issued: 2006

Publication Status: Issued

Pages: 25 p.

Publishing info: Saarbrücken : Max-Planck-Institut für Informatik

Table of Contents: -

Rev. Type: -

Identifiers: URI: http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2006-5-001
Report Nr.: MPI-I-2006-5-001
BibTex Citekey: BenderMichelWeikumTriantafilou2006

Degree: -

Source 1

Title: Research Report / Max-Planck-Institut für Informatik

Source Genre: Series

Creator(s):

Affiliations:

Publ. Info: -

Pages: -
Volume / Issue: -
Sequence Number: -
Start / End Page: -
Identifier: -