Flexibly Mining Better Subgroups

Nguyen, Hoang-Vu; Vreeken, Jilles

doi:10.1137/1.9781611974348.66

Released

Conference Paper

MPG Authors

/persons/resource/persons187552

Nguyen, Hoang-Vu
Databases and Information Systems, MPI for Informatics, Max Planck Society

/persons/resource/persons79525

Vreeken, Jilles
Databases and Information Systems, MPI for Informatics, Max Planck Society

Citation

Nguyen, H.-V., & Vreeken, J. (2016). Flexibly Mining Better Subgroups. In S. Chawla Venkatasubramanian, & W. Meira (Eds.), Proceedings of the Sixteenth SIAM International Conference on Data Mining (pp. 585-593). Philadelphia, PA: SIAM. doi:10.1137/1.9781611974348.66.

Citation link: https://hdl.handle.net/11858/00-001M-0000-002B-A933-C

Abstract

Finding patterns in binary data is a classical problem in data mining, dating back at least to frequent itemset mining. More recently, approaches such as tiling and Boolean matrix factorization (BMF) have been proposed to find sets of patterns that aim to explain the full data well. These methods, however, are not robust against non-trivial destructive noise, i.e., when relatively many 1s are removed from the data: tiling can only model additive noise, while BMF assumes approximately equal amounts of additive and destructive noise. Most real-world binary datasets, however, exhibit mostly destructive noise. In presence/absence data, for instance, it is much more common to fail to observe something than to observe a spurious presence. To address this problem, we take the recent approach of employing the Minimum Description Length (MDL) principle for BMF and introduce a new algorithm, Nassau, that directly optimizes the description length of the factorization instead of the reconstruction error. In addition, unlike previous algorithms, it can adjust the factors it has already discovered during its search. Empirical evaluation on synthetic data shows that Nassau excels on datasets with high levels of destructive noise, and its performance on real-world datasets supports our hypothesis that real-world data contain many missing observations.
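
The distinction between additive and destructive noise can be made concrete with a small sketch. The code below is only an illustration of the terms used in the abstract, not the Nassau algorithm or its description-length computation: the function names `boolean_product` and `split_errors` and the toy matrices are hypothetical. It builds a Boolean matrix product from two factor matrices, perturbs it, and counts the two error types separately, whereas a plain reconstruction error only reports their sum.

```python
def boolean_product(B, C):
    """Boolean matrix product: entry (i, j) is 1 if some factor f has
    B[i][f] == 1 and C[f][j] == 1, and 0 otherwise."""
    n, k, m = len(B), len(C), len(C[0])
    return [
        [int(any(B[i][f] and C[f][j] for f in range(k))) for j in range(m)]
        for i in range(n)
    ]


def split_errors(data, reconstruction):
    """Count the disagreements between data and reconstruction by type.

    additive:    data has a 1 where the reconstruction has a 0 (spurious 1)
    destructive: data has a 0 where the reconstruction has a 1 (removed 1)
    """
    additive = destructive = 0
    for data_row, rec_row in zip(data, reconstruction):
        for d, r in zip(data_row, rec_row):
            if d == 1 and r == 0:
                additive += 1
            elif d == 0 and r == 1:
                destructive += 1
    return additive, destructive


# Hypothetical toy factors: 4 rows, 2 factors, 4 columns.
B = [[1, 0], [1, 0], [0, 1], [0, 1]]
C = [[1, 1, 0, 0], [0, 0, 1, 1]]
clean = boolean_product(B, C)

# Perturb a copy: remove three 1s (destructive) and add one 1 (additive).
noisy = [row[:] for row in clean]
noisy[0][0] = 0
noisy[1][1] = 0
noisy[2][2] = 0
noisy[0][3] = 1

additive, destructive = split_errors(noisy, clean)
reconstruction_error = additive + destructive
print(additive, destructive, reconstruction_error)
```

In this toy example most of the disagreements are destructive, yet the summed reconstruction error treats both kinds alike, which is the asymmetry the abstract argues an objective should account for.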