Sparse Dictionary Learning with Simplex Constraints and Application to Topic Modeling

Zheng, Q. (2012). Sparse Dictionary Learning with Simplex Constraints and Application to Topic Modeling. Master Thesis, Universität des Saarlandes, Saarbrücken.

Basic Information

Genre: Thesis

Files

File: 2012_Qinqing Zheng_Master's Thesis.pdf (full text), 19MB
File permalink: -
Name: 2012_Qinqing Zheng_Master's Thesis.pdf
Description: -
OA-Status:
Visibility: Restricted (Max Planck Institute for Informatics, MSIN)
MIME-Type / Checksum: application/pdf
Technical Metadata:
Copyright Date: -
Copyright Info: -
CC License: -


Creators

Zheng, Qinqing1, Author
Hein, Matthias2, Referee
Slawski, Martin2, Advisor
Affiliations:
1 International Max Planck Research School, MPI for Informatics, Max Planck Society, ou_1116551
2 External Organizations, ou_persistent22

Content

Keywords: -
Abstract: A probabilistic mixture model is a powerful tool for providing a low-dimensional representation of count data. In the context of topic modeling, this amounts to representing the distribution of a document as a mixture of multiple distributions known as topics. The mixing proportions are called coefficients. A common approach is to introduce sparsity into both the topics and the coefficients for better interpretability. We first discuss the problem of recovering sparse coefficients of given documents when the topics are known. This is formulated as a penalized least squares problem on the probability simplex, where the sparsity is achieved through regularization. However, the typical ℓ1 regularizer is toothless in this case, since it is constant over the simplex. To overcome this issue, we propose a group of concave penalties for inducing sparsity. An alternative approach is to post-process the solution of the non-negative lasso to produce results that conform to the simplex constraint. Our experiments show that both kinds of approaches can effectively recover the sparsity pattern of the coefficients. We then compare in detail their robustness with respect to different characteristics of the input data. The second problem we discuss is modeling both the topics and the coefficients of a collection of documents via matrix factorization. We propose the LpT approach, in which all the topics and coefficients are constrained to the simplex, and the ℓp penalty is imposed on each topic to promote sparsity. We also consider procedures that post-process the solutions of other methods. For example, the L1 approach first solves the problem where the simplex constraints imposed on the topics are relaxed into non-negativity constraints, and the ℓp penalty is replaced by the ℓ1 penalty. Afterwards, L1 normalizes the estimated topics to produce results satisfying the simplex constraints.
As detecting the number of mixture components inherent in the data is of central importance for the probabilistic mixture model, we analyze how the regularization techniques can help us automatically determine this number. We compare the capabilities of these approaches to recover the low-rank structure underlying the data when the number of topics is correctly specified and over-specified, respectively. The empirical results demonstrate that LpT and L1 can discover the sparsity pattern of the ground truth. In addition, when the number of topics is over-specified, they adapt to the true number of topics.
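The abstract's observation that the ℓ1 penalty is constant over the probability simplex, while a concave ℓp penalty (0 < p < 1) is not, can be checked numerically. The sketch below uses NumPy with two illustrative coefficient vectors of my own choosing; it is not code from the thesis.

```python
import numpy as np

# Two points on the probability simplex: non-negative entries summing to 1.
sparse = np.array([0.75, 0.25, 0.0, 0.0])   # sparse mixing proportions
dense = np.full(4, 0.25)                    # dense (uniform) proportions

# On the simplex the l1 norm is identically 1, so an l1 penalty
# cannot distinguish sparse from dense coefficient vectors.
print(np.abs(sparse).sum())  # 1.0
print(np.abs(dense).sum())   # 1.0

# A concave lp penalty sum(x**p) with 0 < p < 1 does discriminate:
# it is strictly smaller for the sparser vector.
p = 0.5
print((sparse ** p).sum())   # ≈ 1.366
print((dense ** p).sum())    # 2.0
```

This is why the thesis replaces the ℓ1 regularizer with concave penalties when optimizing directly over the simplex.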

Details

Language(s): eng - English
Dates: 2012-03, 2012
Publication status: Published
Pages: -
Publishing info: Saarbrücken : Universität des Saarlandes
Table of Contents: -
Review method: -
Identifiers (DOI, ISBN, etc.): BibTex Citekey: Zheng2012
Degree: Master
