An Approach for Weakly-Supervised Deep Information Retrieval

MacAvaney, Sean; Hui, Kai; Yates, Andrew


Released

Paper

MPS-Authors

Hui, Kai
Databases and Information Systems, MPI for Informatics, Max Planck Society;

Yates, Andrew
Databases and Information Systems, MPI for Informatics, Max Planck Society;

Fulltext (public)

arXiv:1707.00189.pdf (Preprint), 632KB

Supplementary Material (public)

There is no public supplementary material available

Citation

MacAvaney, S., Hui, K., & Yates, A. (2017). An Approach for Weakly-Supervised Deep Information Retrieval. Retrieved from http://arxiv.org/abs/1707.00189.

Cite as: https://hdl.handle.net/11858/00-001M-0000-002E-06C5-C

Abstract

Recent developments in neural information retrieval models have been promising, but a problem remains: human relevance judgments are expensive to produce, while neural models require a considerable amount of training data. In an attempt to fill this gap, we present an approach that, given a weak training set of pseudo-queries, documents, and relevance information, filters the data to produce effective positive and negative query-document pairs. This allows large corpora to be used as neural IR model training data, while eliminating training examples that do not transfer well to relevance scoring. The filters include unsupervised ranking heuristics and a novel measure of interaction similarity. We evaluate our approach using a news corpus with article headlines acting as pseudo-queries and article content as documents, with implicit relevance between an article's headline and its content. By using our approach to train state-of-the-art neural IR models and comparing to established baselines, we find that training data generated by our approach can lead to good results on a benchmark test collection.
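The filtering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify which ranking heuristic is used, so BM25 is assumed here, and the function names (`tokenize`, `bm25_scores`, `filter_pairs`), the `top_k` cutoff, and the rule for picking a negative are all hypothetical. The interaction-similarity filter is not sketched because the abstract gives no details of it. Each headline is treated as a pseudo-query, and a headline-content pair is kept only if the article's own content ranks near the top for that headline.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (assumption; a real system would do more).
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every tokenized document in `docs` against the `query` tokens."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def filter_pairs(articles, top_k=1):
    """Return (positive_index, negative_index) for pairs that pass the filter.

    A pair passes when the article's own content ranks within `top_k`
    for its headline. The negative is the best-ranked other document
    (a hypothetical choice made for this sketch).
    """
    docs = [tokenize(content) for _, content in articles]
    kept = []
    for i, (headline, _) in enumerate(articles):
        scores = bm25_scores(tokenize(headline), docs)
        ranking = sorted(range(len(docs)), key=lambda j: scores[j], reverse=True)
        negatives = [j for j in ranking if j != i]
        if i in ranking[:top_k] and scores[i] > 0 and negatives:
            kept.append((i, negatives[0]))
    return kept


# Toy data (invented for illustration): (headline, content) pairs.
articles = [
    ("stock markets fall sharply",
     "stock markets fall sharply as investors worry about interest rates"),
    ("local team wins championship",
     "the local team wins the championship after a dramatic final game"),
    ("you will not believe this",
     "city council approves new budget for road repairs"),
]
kept = filter_pairs(articles)
print(kept)
```

In this toy data the third headline shares no terms with its own article, so the pair is discarded as a training example that would not transfer to relevance scoring, while the first two pairs are kept.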