Clustering on the Unit Hypersphere using von Mises-Fisher Distributions

Banerjee, A.; Dhillon, I.; Ghosh, J.; Sra, S.

Please note that a newer version of this record is available:
https://pure.mpg.de/pubman/item/item_1791300_2

Banerjee, A., Dhillon, I., Ghosh, J., & Sra, S. (2005). Clustering on the Unit Hypersphere using von Mises-Fisher Distributions. Journal of Machine Learning Research, 6, 1345-1382. Retrieved from http://jmlr.csail.mit.edu/papers/volume6/banerjee05a/banerjee05a.pdf.

Item status: Released

Basic data

Record permalink: https://hdl.handle.net/11858/00-001M-0000-0013-D419-1
Version permalink: https://hdl.handle.net/11858/00-001M-0000-0013-D41A-0

Genre: Journal article

Creators:
Banerjee, A., Author
Dhillon, I., Author
Ghosh, J., Author
Sra, S.¹, Author

Affiliations:
¹ Department Empirical Inference, Max Planck Institute for Biological Cybernetics, Max Planck Society, ou_1497795

Content

Keywords: -

Abstract: Several large scale data mining applications, such as text categorization and gene expression analysis, involve high-dimensional data that is also inherently directional in nature. Often such data is L2 normalized so that it lies on the surface of a unit hypersphere. Popular models such as (mixtures of) multi-variate Gaussians are inadequate for characterizing such data. This paper proposes a generative mixture-model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. In particular, we derive and analyze two variants of the Expectation Maximization (EM) framework for estimating the mean and concentration parameters of this mixture. Numerical estimation of the concentration parameters is non-trivial in high dimensions since it involves functional inversion of ratios of Bessel functions. We also formulate two clustering algorithms corresponding to the variants of EM that we derive. Our approach provides a theoretical basis for the use of cosine similarity that has been widely employed by the information retrieval community, and obtains the spherical kmeans algorithm (kmeans with cosine similarity) as a special case of both variants. Empirical results on clustering of high-dimensional text and gene-expression data based on a mixture of vMF distributions show that the ability to estimate the concentration parameter for each vMF component, which is not present in existing approaches, yields superior results, especially for difficult clustering tasks in high-dimensional spaces.
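
Illustrative sketch (not part of the record): the abstract names spherical kmeans, i.e. kmeans with cosine similarity on L2-normalized data, as a special case of the vMF mixture approach. The Python below is a minimal hard-assignment version of that idea, assuming a simple farthest-first seeding chosen here only for determinism. The helper names (normalize, dot, spherical_kmeans, approx_kappa) are hypothetical. The closed-form concentration estimate in approx_kappa is a commonly used approximation that sidesteps inverting the Bessel-function ratio; it is an assumption here, not quoted from this record, and should be checked against the paper's PDF.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm, i.e. place it on the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def dot(u, v):
    """Inner product; for unit vectors this equals the cosine similarity."""
    return sum(a * b for a, b in zip(u, v))


def spherical_kmeans(points, k, iters=50):
    """Hard-assignment clustering by cosine similarity.

    Returns (labels, centroids); every centroid is a unit vector.
    """
    data = [normalize(p) for p in points]
    # Seeding (an illustrative choice): start from the first point, then keep
    # adding the point that is least similar to its closest chosen centroid.
    centroids = [data[0]]
    while len(centroids) < k:
        centroids.append(min(data, key=lambda x: max(dot(x, c) for c in centroids)))
    labels = None
    for _ in range(iters):
        # Assignment step: choose the centroid with the largest cosine similarity.
        new_labels = [max(range(k), key=lambda j: dot(x, centroids[j])) for x in data]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move
        labels = new_labels
        # Update step: each centroid becomes the normalized sum of its members.
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if not members:
                continue  # keep the old centroid for an empty cluster
            total = [sum(col) for col in zip(*members)]
            if any(t != 0.0 for t in total):
                centroids[j] = normalize(total)
    return labels, centroids


def approx_kappa(members, dim):
    """Approximate concentration: kappa ~ (r*d - r^3) / (1 - r^2).

    r is the mean resultant length of the unit vectors in `members`.
    Assumption: this closed form stands in for the exact Bessel-ratio inversion.
    """
    total = [sum(col) for col in zip(*members)]
    r = math.sqrt(sum(t * t for t in total)) / len(members)
    if r >= 1.0:
        return math.inf  # all members coincide: unbounded concentration
    return (r * dim - r ** 3) / (1.0 - r ** 2)


if __name__ == "__main__":
    pts = [[1, 0.1, 0], [2, 0, 0.1], [0.9, 0.05, 0],
           [0, 1, 0.1], [0.1, 3, 0], [0, 0.8, 0.05]]
    labels, centroids = spherical_kmeans(pts, k=2)
    print(labels)
    for j in range(2):
        cluster = [normalize(p) for p, lab in zip(pts, labels) if lab == j]
        print(j, approx_kappa(cluster, dim=3))
```

A per-cluster concentration estimate like the one above is what distinguishes a vMF mixture from plain spherical kmeans, which in effect treats every cluster as equally concentrated.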

Details

Language(s):

Date: Published: 2005-09

Publication status: Published

Pages: -

Place, publisher, edition: -

Table of contents: -

Review type: -

Identifiers: URI: http://jmlr.csail.mit.edu/papers/volume6/banerjee05a/banerjee05a.pdf
BibTeX citekey: 5126

Degree type: -

Source 1

Title: Journal of Machine Learning Research

Source genre: Journal

Creators:

Affiliations:

Place, publisher, edition: -

Pages: -
Volume / Issue: 6
Article number: -
Start / end page: 1345 - 1382
Identifier: -
