Abstract:
Is it possible to meaningfully analyze the structure of a Boolean matrix for
which 99% of the data is missing?
Real-life data sets usually contain a high percentage of missing values, which
hampers structure estimation, and the difficulty only increases when the
missing values outnumber the known elements. Good real-valued factorization
methods exist for such scenarios, but there is another class of data, Boolean
data, which demands a different handling strategy than its real-valued
counterpart.
Many applications have a logical representation only as Boolean matrices, for
which real-valued factorization methods do not provide correct and intuitive
solutions.
Currently, no method exists that can factorize a Boolean matrix containing the
percentage of missing values usually associated with non-trivial real-world
data sets. In this thesis, we introduce a method to fill this gap.
Our method is based on the correlation among the data records and is not
restricted by the percentage of unknowns in the matrix. It performs greedy
selection of the basis vectors, which represent the underlying
structure in the data.
This thesis also presents several experiments on a variety of synthetic and
real-world data, and discusses the performance of the algorithm for a range of
data properties.
However, direct comparison statistics against existing methods could not be
obtained, because no such methods exist. Hence we present indirect comparisons
with existing matrix completion methods that work with real-valued data sets.