Zusammenfassung
Matrix factorizations---where a given data matrix is approximated by a product
of two or more factor matrices---are powerful data mining tools. Among other
tasks, matrix factorizations are often used to separate global structure from
noise. This, however, requires solving the `model order selection problem' of
determining where fine-grained structure stops, and noise starts, i.e., what is
the proper size of the factor matrices.
Boolean matrix factorization (BMF)---where data, factors, and matrix product
are Boolean---has received increased attention from the data mining community
in recent years. The technique has
desirable properties, such as high interpretability and natural sparsity. But
so far no method for selecting the correct model order for BMF has been
available.
In this paper we propose to use the Minimum Description Length (MDL) principle
for this task. Besides solving the problem, this well-founded approach has
numerous benefits, e.g., it is automatic, does not require a likelihood
function, is fast, and, as experiments show, is highly accurate.
We formulate the description length function for BMF in general---making it
applicable for any BMF algorithm. We extend an existing algorithm for BMF to
use MDL to identify the best Boolean matrix factorization, analyze the
complexity of the problem, and perform an extensive experimental evaluation to
study its behavior.