  Sparse Dictionary Learning with Simplex Constraints and Application to Topic Modeling

Zheng, Q. (2012). Sparse Dictionary Learning with Simplex Constraints and Application to Topic Modeling. Master Thesis, Universität des Saarlandes, Saarbrücken.

Files

2012_Qinqing Zheng_Master's Thesis.pdf (Any fulltext), 19 MB
 
File Permalink:
-
Name:
2012_Qinqing Zheng_Master's Thesis.pdf
Description:
-
OA-Status:
Visibility:
Restricted (Max Planck Institute for Informatics, MSIN)
MIME-Type / Checksum:
application/pdf
Technical Metadata:
Copyright Date:
-
Copyright Info:
-
License:
-

Creators

 Creators:
Zheng, Qinqing (1), Author
Hein, Matthias (2), Advisor
Slawski, Martin (2), Referee
Affiliations:
(1) International Max Planck Research School, MPI for Informatics, Max Planck Society, ou_1116551
(2) External Organizations, ou_persistent22

Content

Free keywords: -
 Abstract: The probabilistic mixture model is a powerful tool for providing a low-dimensional representation of count data. In the context of topic modeling, this amounts to representing the distribution of one document as a mixture of multiple distributions known as topics. The mixing proportions are called coefficients. A common aim is to introduce sparsity into both the topics and the coefficients for better interpretability. We first discuss the problem of recovering the sparse coefficients of given documents when the topics are known. This is formulated as a penalized least squares problem on the probability simplex, where sparsity is achieved through regularization. However, the typical ℓ1 regularizer is toothless in this case, since it is constant over the simplex. To overcome this issue, we propose a group of concave penalties for inducing sparsity. An alternative approach is to post-process the solution of the non-negative lasso to produce results that conform to the simplex constraint. Our experiments show that both kinds of approaches can effectively recover the sparsity pattern of the coefficients. We then compare their robustness in detail across different characteristics of the input data. The second problem we discuss is modeling both the topics and the coefficients of a collection of documents via matrix factorization. We propose the LpT approach, in which all topics and coefficients are constrained to the simplex, and an ℓp penalty is imposed on each topic to promote sparsity. We also consider procedures that post-process the solutions of other methods. For example, the L1 approach first solves the problem in which the simplex constraints on the topics are relaxed to non-negativity constraints and the ℓp penalty is replaced by the ℓ1 penalty; afterwards, L1 normalizes the estimated topics to produce results satisfying the simplex constraints.
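The two observations in the paragraph above can be illustrated numerically (a minimal sketch with hypothetical values, not the thesis's estimators): on the simplex the ℓ1 norm is identically 1, so ℓ1 regularization cannot prefer sparse coefficient vectors, while a sparse non-negative lasso solution can be post-processed onto the simplex by rescaling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Any point on the probability simplex has non-negative entries summing to 1,
# so its l1 norm is exactly 1 regardless of how sparse or dense it is.
w = rng.dirichlet(np.ones(5))
assert np.isclose(np.abs(w).sum(), 1.0)

# Post-processing a non-negative lasso solution (hypothetical values):
# rescale it onto the simplex; zeros stay zero, so sparsity is preserved.
beta = np.array([0.0, 0.3, 0.0, 0.9, 0.0])
w_hat = beta / beta.sum() if beta.sum() > 0 else np.full_like(beta, 1 / len(beta))
assert np.isclose(w_hat.sum(), 1.0) and np.all(w_hat >= 0)
```

Note that the rescaling step only restores the sum-to-one constraint; the sparsity pattern itself comes from the non-negative lasso estimate.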
As detecting the number of mixture components inherent in the data is of central importance for the probabilistic mixture model, we analyze how these regularization techniques can help to determine this number automatically. We compare the ability of the approaches to recover the low-rank structure underlying the data when the number of topics is correctly specified and when it is over-specified, respectively. The empirical results demonstrate that LpT and L1 can discover the sparsity pattern of the ground truth. In addition, when the number of topics is over-specified, they adapt to the true number of topics.
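How an over-specified topic count can "adapt to the true number of topics" can be sketched as follows (hypothetical values; the thesis's LpT factorization itself is not reproduced here): after sparse estimation, superfluous topics receive essentially no coefficient mass, so counting the topics with non-negligible usage recovers the effective number.

```python
import numpy as np

# Hypothetical output of a sparse factorization run with 6 requested topics:
# total coefficient mass assigned to each topic across the corpus.
# Sparse regularization drove three of the six topics to zero usage.
topic_usage = np.array([0.41, 0.00, 0.33, 0.26, 0.00, 0.00])

tol = 1e-8  # threshold for "non-negligible" mass
effective_topics = int((topic_usage > tol).sum())
print(effective_topics)  # 3 of the 6 requested topics carry mass
```

The threshold `tol` is an assumption of this sketch; in practice it would be chosen relative to the scale of the estimated coefficients.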

Details

Language(s): eng - English
 Dates: 2012-03, 2012
 Publication Status: Issued
 Pages: -
 Publishing info: Saarbrücken : Universität des Saarlandes
 Table of Contents: -
 Rev. Type: -
 Identifiers: BibTex Citekey: Zheng2012
 Degree: Master
