Abstract:
Phrase snippets of large text corpora like news articles or web search results
offer great insight and analytical value. While much of the prior work has
focused on efficient storage and retrieval of all candidate phrases, little
emphasis has been placed on the quality of the result set. In this thesis, we
define phrases of interest and propose a framework
for mining and post-processing interesting phrases. We focus on the quality of
phrases and develop techniques to mine minimal-length, maximally informative
sequences of words. The techniques developed are organized into a post-processing
pipeline and include exact and approximate match-based merging, incomplete
phrase detection with filtering, and heuristics-based phrase classification.
These strategies aim to prune the candidate set of phrases down to those that are
meaningful and rich in content. We characterize
the phrases with heuristics- and NLP-based features, and we use a supervised
regression model to predict their interestingness. Further, we
develop and analyze ranking and grouping models for presenting the phrases to
the user. Finally, we discuss relevance and performance evaluation of our
techniques. Our framework is evaluated using a recently released real-world
corpus of New York Times news articles.
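
To make the post-processing steps named above more concrete, the following is a minimal sketch of exact merging, approximate merging, and incomplete-phrase filtering. It is an illustration under stated assumptions, not the implementation developed in the thesis: the function names, the `DANGLING` word list, the use of `difflib.SequenceMatcher` as the similarity measure, and the 0.9 threshold are all hypothetical choices made for this example.

```python
from difflib import SequenceMatcher

# Hypothetical list of words that suggest a phrase was cut off mid-thought.
DANGLING = {"a", "an", "the", "of", "in", "on", "to", "and", "or", "for", "with"}


def normalize(phrase):
    """Lower-case the phrase and collapse runs of whitespace."""
    return " ".join(phrase.lower().split())


def merge_exact(candidates):
    """Sum the counts of phrases that are identical after normalization."""
    merged = {}
    for phrase, count in candidates.items():
        key = normalize(phrase)
        merged[key] = merged.get(key, 0) + count
    return merged


def merge_approximate(candidates, threshold=0.9):
    """Fold near-duplicates into the most frequent similar phrase.

    Similarity is the character-level SequenceMatcher ratio (an assumed
    stand-in for whatever approximate matching the thesis uses).
    """
    merged = {}
    # Visit frequent phrases first so they become the representatives.
    ordered = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
    for phrase, count in ordered:
        for representative in merged:
            ratio = SequenceMatcher(None, phrase, representative).ratio()
            if ratio >= threshold:
                merged[representative] += count
                break
        else:
            merged[phrase] = count
    return merged


def is_incomplete(phrase):
    """Flag phrases that are empty or end on a dangling function word."""
    words = phrase.split()
    return not words or words[-1] in DANGLING


def postprocess(candidates, threshold=0.9):
    """Run exact merging, approximate merging, then incomplete-phrase filtering."""
    merged = merge_approximate(merge_exact(candidates), threshold)
    return {p: c for p, c in merged.items() if not is_incomplete(p)}


if __name__ == "__main__":
    raw = {
        "New York Times": 5,
        "new york  times": 2,
        "New York Time": 1,
        "president of the": 4,
        "stock market": 3,
    }
    print(postprocess(raw))
```

In this sketch the input is a mapping from candidate phrase to frequency. Visiting phrases in descending order of frequency means that rarer variants are folded into the more common surface form, which is one plausible design choice among several.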