The facts that birds have feathers and that ice is cold seem trivially true. Yet,
most machine-readable sources of knowledge either lack such common sense facts
entirely or have only limited coverage. Prior work on automated knowledge base
construction has largely focused on relations between named entities and on
taxonomic knowledge, while disregarding common sense properties.
Extracting such structured data from text is challenging, especially due to the
scarcity of explicitly expressed knowledge. Even when relying on large document
collections, pattern-based information extraction approaches typically discover
insufficient amounts of information.
This thesis investigates how to harvest massive amounts of common sense
knowledge from the text of the entire Web while avoiding the massive
engineering effort of procuring a corpus as large as the Web itself. Despite
advances in knowledge harvesting, we observed that state-of-the-art methods
were limited in accuracy and discovered insufficient amounts of information in
our desired setting.
This thesis shows how to gather large amounts of common sense facts from Web
N-gram data, using seeds from existing knowledge bases such as ConceptNet. Our
novel contributions include scalable methods for tapping into Web-scale data
and a new scoring model to determine which patterns and facts are most reliable.
An extensive experimental evaluation is provided for three different binary
relations, comparing different sources of n-gram data as well as different
algorithms. The experimental results show that this approach extends ConceptNet
by more than two orders of magnitude (more than 200-fold) at comparable levels of