Michael's Wiki

When analyzing large text corpora, we frequently want to organize, summarize, index, or query them in some way. Identifying topic features can be an interpretable method for doing so.

Topic Modeling Applications
  • Web-indexing and exploration tools
  • Legal discovery
      • As part of the discovery process of some lawsuits, large numbers of emails are provided and have to be sifted through.
      • e.g. 250,000 Enron emails were released as part of the company's scandal.
  • Prediction and analysis of developing technologies
      • Using patent data or research publications
  • Humanities and political science
      • e.g. "modeling topic control"

Input

  • Bag-of-words: data format, typically a sparse matrix of word counts per document
      • Memoryless: doesn't take sequences of words into account, or prior words produced
  • Vocabulary: the set of words present in the model and data
  • Document: a group of words, one row of the bag-of-words data matrix


Statistical Topic Models

  • T topics, i.e. T topic-word probability vectors $\phi$
  • N sets of topic weights, one per document, $\theta$
  • An assignment of each word in each document to 1 of T topics, $z$

For count data, use a simple generative model for sparse counts in bag-of-words format. A document is modeled as a mixture model over topics. Each topic is a probability distribution over the vocabulary words.

The forward generative nature of this model means that it can generate sample documents, though the generated documents are in bag-of-words format and so not practical to read directly.

  generate_document():
      for each word in the document:
          sample a topic from p(topics|document)
          sample a word from p(words|topic)
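The forward mechanism above can be sketched in a few lines of Python. The vocabulary, the topic-word distributions `PHI`, and the document's topic weights `theta` are made-up illustrative values, not learned ones:

```python
import random

random.seed(0)

# Hypothetical 4-word vocabulary and T=2 topic-word distributions (phi).
VOCAB = ["gene", "dna", "ball", "game"]
PHI = [
    [0.5, 0.5, 0.0, 0.0],   # topic 0: "biology"
    [0.0, 0.0, 0.5, 0.5],   # topic 1: "sports"
]

def generate_document(theta, length):
    """Sample a bag-of-words document; theta is p(topics|document)."""
    doc = []
    for _ in range(length):
        topic = random.choices(range(len(PHI)), weights=theta)[0]  # p(topic|doc)
        word = random.choices(VOCAB, weights=PHI[topic])[0]        # p(word|topic)
        doc.append(word)
    return doc

doc = generate_document(theta=[0.9, 0.1], length=10)
```

With `theta=[0.9, 0.1]`, most sampled words come from the "biology" topic, illustrating how the topic weights shape a document's word distribution.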

This pseudocode is known as a forward mechanism. The model we want to learn is essentially inverting this code via Bayes' rule.

$p(w | \phi,z_d)$ is a multinomial distribution over words, given the current topic. $p(doc | \phi,z_d) = \prod_{i=1}^{L_{doc}} p(w_i | \phi, z_{d,i})$. Since documents are conditionally independent given $\phi$: $p(corpus | \phi) = \prod_d p(doc_d | \phi, z_d)$.

$$ p(w_i|d) = \sum_{j=1}^T p(w_i|z_j)p(z_j|d) $$

See the Blei paper, page 4, for derivation of likelihood from the generative model.
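The mixture equation can be checked numerically. The two words and all probability values here are illustrative (T=2 topics assumed):

```python
# p(w|d) = sum_j p(w|z_j) p(z_j|d), with made-up numbers for T=2 topics.
p_w_given_z = {"gene": [0.5, 0.0], "ball": [0.0, 0.5]}  # one row of phi per word
p_z_given_d = [0.9, 0.1]                                 # theta for one document

def word_prob(word):
    # Marginalize the word probability over topics.
    return sum(p_w * p_z for p_w, p_z in zip(p_w_given_z[word], p_z_given_d))

p_gene = word_prob("gene")  # 0.5*0.9 + 0.0*0.1 = 0.45
```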

Learning the Model

Can “easily” use Expectation Maximization for simple models where $Z$ is sampled once per document.

The most commonly used model, Latent Dirichlet Allocation (LDA), allows multiple topics per document. It is typical to perform stochastic search with Gibbs sampling over the unknown variables (historically, also variational inference). In practice, you can just learn the $\underline{z}$ variables:

$$ p(z_i=t|z_{-i}) = \frac{n_{td}^{-i} + \alpha}{\sum_{t'} n_{t'd}^{-i} + T\alpha} \cdot \frac{n_{wt}^{-i} + \beta}{\sum_{w'} n_{w't}^{-i} + W\beta} $$

  • $n_{td}^{-i}$ is the count of words in doc d assigned to topic t, excluding word i
  • $n_{wt}^{-i}$ is the count of word w assigned to topic t, excluding word i
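A toy collapsed Gibbs sweep implementing this update is sketched below. The two-document corpus and the `ALPHA`/`BETA` values are illustrative; a real implementation would run many sweeps over a large corpus and monitor convergence:

```python
import random
from collections import defaultdict

random.seed(0)

docs = [["gene", "dna", "gene"], ["ball", "game", "ball"]]  # toy corpus
T, ALPHA, BETA = 2, 0.1, 0.01
W = len({w for d in docs for w in d})  # vocabulary size

# z[d][i]: topic of word i in doc d; n_td, n_wt, n_t: the count tables in the update.
z = [[random.randrange(T) for _ in d] for d in docs]
n_td, n_wt, n_t = defaultdict(int), defaultdict(int), defaultdict(int)
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = z[d][i]
        n_td[t, d] += 1; n_wt[w, t] += 1; n_t[t] += 1

def gibbs_sweep():
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]  # remove word i's assignment, giving the "-i" counts
            n_td[t, d] -= 1; n_wt[w, t] -= 1; n_t[t] -= 1
            # Doc-topic term times topic-word term; the document-length
            # denominator is constant in t, so it is dropped.
            weights = [
                (n_td[t2, d] + ALPHA) * (n_wt[w, t2] + BETA) / (n_t[t2] + W * BETA)
                for t2 in range(T)
            ]
            t = random.choices(range(T), weights=weights)[0]
            z[d][i] = t
            n_td[t, d] += 1; n_wt[w, t] += 1; n_t[t] += 1

for _ in range(50):
    gibbs_sweep()
```

On this tiny corpus the sampler tends to settle into grouping the "gene"/"dna" words and the "ball"/"game" words under separate topics.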
Topics as Matrix Factorization

Directly analogous to principal components and factor models. Where principal components minimizes squared error, here we maximize the log-probability of the observed data.

Go from matrix of docs-by-terms to factorized matrices of docs-by-topics and topics-by-terms.
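The factorized view can be sketched directly: multiplying an (illustrative, made-up) docs-by-topics matrix $\theta$ by a topics-by-terms matrix $\phi$ reconstructs the docs-by-terms probability matrix:

```python
# theta: 2 docs x 2 topics (rows sum to 1); phi: 2 topics x 4 terms (rows sum to 1).
theta = [[0.9, 0.1],
         [0.2, 0.8]]
phi = [[0.5, 0.5, 0.0, 0.0],
       [0.0, 0.0, 0.5, 0.5]]

def matmul(A, B):
    """Plain-Python matrix product; zip(*B) iterates over B's columns."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

p_w_given_d = matmul(theta, phi)  # 2 docs x 4 terms; each row is p(w|d)
```

Each row of the product is a distribution over the vocabulary for one document, i.e. the mixture $\sum_j p(w|z_j)p(z_j|d)$ computed for every word at once.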

Methodological Issues
Evaluation of Topic Models

Human inspection: have human experts judge how many of the top-10 words from each topic seem incoherent.

Information retrieval performance: can you use high-probability topic terms as search queries to retrieve documents of that topic?

Feature selection in classification of documents.

Extensions to Topic Models

May add metadata or dependencies between documents.

Geographic information

Temporal Models

Which topics are “growing” over time? What percentage of the words produced within a given year came from that topic?
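One way to compute the per-year share of a topic, assuming each word token's sampled topic assignment is tagged with its document's year (the pairs below are toy data):

```python
from collections import Counter

# Hypothetical (year, topic) pairs, one per word token in the corpus.
assignments = [(2001, 0), (2001, 0), (2001, 1), (2002, 0), (2002, 1), (2002, 1)]

def topic_share_by_year(pairs, topic):
    """Fraction of each year's word tokens assigned to the given topic."""
    totals, hits = Counter(), Counter()
    for year, t in pairs:
        totals[year] += 1
        if t == topic:
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in totals}

shares = topic_share_by_year(assignments, topic=1)  # topic 1 grows: 1/3 then 2/3
```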

The Author-Topic Model

  for each document:
      for each word in the document:
          randomly select an author of the document
          sample a topic from p(topics|author)
          sample a word from p(words|topic)
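The author-topic process differs from the basic generative model only in where the topic weights come from. A sketch with made-up author-topic distributions (the author names and all probabilities are illustrative):

```python
import random

random.seed(1)

VOCAB = ["gene", "dna", "ball", "game"]
PHI = [[0.5, 0.5, 0.0, 0.0],   # topic 0
       [0.0, 0.0, 0.5, 0.5]]   # topic 1
# Hypothetical p(topics|author) for two authors.
AUTHOR_THETA = {"ada": [0.8, 0.2], "bob": [0.1, 0.9]}

def generate_document(authors, length):
    doc = []
    for _ in range(length):
        author = random.choice(authors)                                  # pick an author
        topic = random.choices([0, 1], weights=AUTHOR_THETA[author])[0]  # p(topic|author)
        word = random.choices(VOCAB, weights=PHI[topic])[0]              # p(word|topic)
        doc.append(word)
    return doc

doc = generate_document(["ada", "bob"], length=8)
```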

See Steyvers et al., 2004 and Rosen-Zvi et al., 2010.

Topical N-Grams

Multilabel Classification

During training, associate each topic with a training label and restrict the sampler to the known labels for the document.
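Restricting the sampler amounts to zeroing out the weights of topics whose label is not attached to the document. A sketch assuming one topic per label (the weights and label set are illustrative):

```python
import random

random.seed(2)

def sample_restricted_topic(weights, allowed_labels):
    """Sample a topic index, but only among topics whose label the document has."""
    masked = [w if t in allowed_labels else 0.0 for t, w in enumerate(weights)]
    return random.choices(range(len(weights)), weights=masked)[0]

# Topic weights for one word; this training document carries labels {0, 2} only,
# so topic 1 can never be sampled.
t = sample_restricted_topic([0.4, 0.3, 0.3], allowed_labels={0, 2})
```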


In practice, people tend to fix the number of topics in a heuristic manner rather than relying on non-parametric methods.

Distributed algorithms can approximate the update by allowing documents to be processed in parallel.

Stochastic methods provide further performance improvements. See Hoffman et al., 2013.