Probabilistic mixture models are a powerful tool for providing a low-dimensional
representation of count data. In the context of topic modeling, this amounts to
representing the distribution of one document as a mixture of multiple
distributions known as topics. The mixing proportions are called coefficients. A
common practice is to introduce sparsity into both the topics and the coefficients
for better interpretability. We first discuss the problem of recovering the sparse
coefficients of given documents when the topics are known. This is formulated as
a penalized least squares problem on the probability simplex, where the
sparsity is achieved through regularization. However, the typical ℓ1
regularizer becomes ineffective in this case since it is constant over the
simplex. To overcome this issue, we propose a group of concave penalties for
inducing sparsity. An alternative approach is to post-process the solution of
non-negative lasso to produce results that conform to the simplex constraint.
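The following is a minimal illustration (not the thesis code; all values are hypothetical) of the two points above: on the probability simplex the ℓ1 norm is constant, so an ℓ1 penalty cannot favor sparse solutions, while a non-negative lasso estimate can simply be rescaled onto the simplex afterwards.

```python
# On the simplex (non-negative entries summing to 1), the l1 norm is
# always 1, so an l1 penalty cannot distinguish sparse from dense points.

def l1_norm(x):
    return sum(abs(v) for v in x)

sparse = [1.0, 0.0, 0.0, 0.0]        # a vertex of the simplex
dense = [0.25, 0.25, 0.25, 0.25]     # the uniform point
assert l1_norm(sparse) == l1_norm(dense) == 1.0  # identical penalty

# Post-processing a non-negative lasso solution: rescale it onto the
# simplex; this preserves the recovered zero pattern.
def normalize_to_simplex(x):
    s = sum(x)
    return [v / s for v in x] if s > 0 else x

nn_lasso_estimate = [0.6, 0.0, 0.3, 0.0]   # hypothetical estimate
coefficients = normalize_to_simplex(nn_lasso_estimate)
```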
Our experiments show that both kinds of approaches can effectively recover the
sparsity pattern of the coefficients. We
then compare their robustness in detail for different characteristics of the input
data. The second problem we discuss is to model both the topics and the
coefficients of a collection of documents via matrix factorization. We propose
the LpT approach, in which all the topics and coefficients are constrained on
the simplex, and the ℓp penalty is imposed on each topic to promote sparsity.
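A quick sketch (illustrative values, not the thesis experiments) of why an ℓp penalty with 0 < p < 1 promotes sparsity on the simplex where ℓ1 cannot: for vectors with a fixed ℓ1 norm of 1, the quantity sum(x_i ** p) is strictly smaller for sparser vectors.

```python
# With 0 < p < 1, the lp penalty distinguishes points on the simplex,
# unlike the l1 norm, which equals 1 everywhere on it.

def lp_penalty(x, p):
    return sum(v ** p for v in x)

p = 0.5
sparse = [1.0, 0.0, 0.0, 0.0]       # one active component
dense = [0.25, 0.25, 0.25, 0.25]    # all components active

# Both have l1 norm 1, but the lp penalty is strictly smaller for the
# sparse topic, so minimizing it drives mass onto fewer components.
assert lp_penalty(sparse, p) < lp_penalty(dense, p)
```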
We also consider procedures that post-process the solutions of other methods.
For example, the L1 approach first solves the problem where the simplex
constraints imposed on the topics are relaxed into the non-negativity
constraints, and the ℓp penalty is replaced by the ℓ1 penalty. Afterwards,
L1 normalizes the estimated topics to produce results satisfying the simplex
constraints. As detecting the number of mixture components inherent in the data
is of central importance for the probabilistic mixture model, we analyze how
the regularization techniques can help automatically determine this
number. We compare the capabilities of these approaches to recover the low-rank
structure underlying the data, when the number of topics is correctly specified
and over-specified, respectively.
The empirical results demonstrate that LpT and L1 can discover the sparsity
pattern of the ground truth. In addition, when the number of topics is
over-specified, they adapt to the true number of topics.
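The L1 post-processing step described above can be sketched as follows (an illustration under assumed conventions, not the thesis code: topics are stored as rows of a non-negative matrix, and the relaxed ℓ1 problem is assumed already solved).

```python
# Sketch of the L1 post-processing: after solving the relaxed problem
# with non-negativity constraints and an l1 penalty, rescale each
# estimated topic so it lies on the probability simplex.

def normalize_topics(topic_matrix):
    """Row-normalize a non-negative matrix so each topic sums to 1."""
    normalized = []
    for topic in topic_matrix:
        s = sum(topic)
        normalized.append([v / s for v in topic] if s > 0 else topic)
    return normalized

# Two hypothetical unnormalized topics over a 3-word vocabulary.
raw_topics = [[2.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]]
topics = normalize_topics(raw_topics)
# each row now sums to 1: [[0.5, 0.0, 0.5], [0.0, 0.75, 0.25]]
```

The rescaling preserves each topic's zero pattern, so any sparsity recovered by the relaxed problem carries over to the simplex-constrained result.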