Identifying Consistent Statements about Numerical Data with 
Dispersion-Corrected Subgroup Discovery

Boley, Mario; Goldsmith, Bryan; Ghiringhelli, Luca M.; Vreeken, Jilles

doi:10.1007/s10618-017-0520-3

Boley, M., Goldsmith, B., Ghiringhelli, L. M., & Vreeken, J. (2017). Identifying Consistent Statements about Numerical Data with Dispersion-Corrected Subgroup Discovery. Data Mining and Knowledge Discovery, 31(5), 1391-1418. doi:10.1007/s10618-017-0520-3.

Item is Released

Basic

Item Permalink: https://hdl.handle.net/11858/00-001M-0000-002D-99F7-B

Version Permalink: https://hdl.handle.net/21.11116/0000-0000-7497-3

Genre: Journal Article

Files

s10618-017-0520-3.pdf (Publisher version), 2MB

File Permalink:
https://hdl.handle.net/11858/00-001M-0000-002D-F21E-9

Name:
s10618-017-0520-3.pdf

Description:
-

OA-Status:

Visibility:
Public

MIME-Type / Checksum:
application/pdf / [MD5]

Copyright Date:
2017

Copyright Info:
© The Author(s)

License:
http://creativecommons.org/licenses/by/3.0/

Creators

Creators:
Boley, Mario¹, Author
Goldsmith, Bryan², Author
Ghiringhelli, Luca M.², Author
Vreeken, Jilles¹, Author

Affiliations:
¹ Databases and Information Systems, MPI for Informatics, Max Planck Society, ou_24018
² Theory, Fritz Haber Institute, Max Planck Society, ou_634547

Content

Free keywords: -

Abstract: Existing algorithms for subgroup discovery with numerical targets do not optimize the error or target variable dispersion of the groups they find. This often leads to unreliable or inconsistent statements about the data, rendering practical applications, especially in scientific domains, futile. Therefore, we here extend the optimistic estimator framework for optimal subgroup discovery to a new class of objective functions: we show how tight estimators can be computed efficiently for all functions that are determined by subgroup size (non-decreasing dependence), the subgroup median value, and a dispersion measure around the median (non-increasing dependence). In the important special case when dispersion is measured using the mean absolute deviation from the median, this novel approach yields a linear time algorithm. Empirical evaluation on a wide range of datasets shows that, when used within branch-and-bound search, this approach is highly efficient and indeed discovers subgroups with much smaller errors.
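
As a rough illustration of the class of objective functions the abstract describes, the following Python sketch scores a subgroup by its relative size, its median, and its mean absolute deviation from the median. The particular product form and the names `mean_abs_dev_from_median` and `dispersion_corrected_score` are assumptions made for this example and are not taken from the record; the sketch only respects the stated monotonicity (non-decreasing in size, non-increasing in dispersion) and does not implement the paper's tight optimistic estimator or its branch-and-bound search.

```python
from statistics import median


def mean_abs_dev_from_median(values):
    """Mean absolute deviation of the values around their median."""
    m = median(values)
    return sum(abs(v - m) for v in values) / len(values)


def dispersion_corrected_score(subgroup, population):
    """Hypothetical objective: coverage * median shift * dispersion correction.

    Assumed form for illustration only. It grows with subgroup size and
    shrinks as the subgroup's dispersion around its median grows.
    """
    coverage = len(subgroup) / len(population)
    # Reward subgroups whose median lies above the population median.
    median_shift = max(median(subgroup) - median(population), 0.0)
    amd_population = mean_abs_dev_from_median(population)
    if amd_population == 0:
        return 0.0  # constant target: no dispersion to improve on
    amd_subgroup = mean_abs_dev_from_median(subgroup)
    # Zero out subgroups that are at least as dispersed as the population.
    dispersion_term = max(1.0 - amd_subgroup / amd_population, 0.0)
    return coverage * median_shift * dispersion_term


population = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
tight = [8, 9, 10]   # high median, small spread
loose = [1, 9, 10]   # same median, large spread

tight_score = dispersion_corrected_score(tight, population)
loose_score = dispersion_corrected_score(loose, population)
```

Both example subgroups have the same size and the same median, so an objective based only on size and central tendency could not separate them; under the assumed form, the widely spread subgroup scores zero while the tight one does not.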

Details

Language(s): eng - English

Dates: Submitted: 2017-06-28; Accepted: 2017-06-12; Published Online: 2017-09; Date issued: 2017-01-19

Publication Status: Issued

Pages: 28

Publishing info: -

Table of Contents: -

Rev. Type: Peer

Identifiers: DOI: 10.1007/s10618-017-0520-3

Degree: -

Source 1

Title: Data Mining and Knowledge Discovery

Source Genre: Journal

Creator(s):

Affiliations:

Publ. Info: London : Springer

Pages: 28

Volume / Issue: 31 (5)

Sequence Number: -

Start / End Page: 1391 - 1418

Identifier: -