(See Smyth's notes: http://www.ics.uci.edu/~smyth/courses/cs277/public_slides/text_classification.pdf)
Example applications: spam email detection, tagging of news articles, and classifying web pages into categories.
The most common representation is bag of words: counts (or binary indicators) of the words in each document. Bigrams and longer phrases are used less frequently.
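A minimal sketch of the representation (the tokenized document below is made up for illustration):

```python
from collections import Counter

def bag_of_words(tokens, binary=False):
    """Map a tokenized document to word counts, or to binary indicators."""
    counts = Counter(tokens)
    return {w: 1 for w in counts} if binary else dict(counts)

def bigrams(tokens):
    """Adjacent word pairs -- a less frequently used alternative."""
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

doc = "the spam filter caught the spam".split()
print(bag_of_words(doc))               # {'the': 2, 'spam': 2, 'filter': 1, 'caught': 1}
print(bag_of_words(doc, binary=True))  # every present word maps to 1
print(bigrams(doc))                    # [('the', 'spam'), ('spam', 'filter'), ...]
```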
The classes themselves can vary. They may be general categories, e.g. the “Twenty Newsgroups”, or a large number of specific “tag” labels. In sentiment analysis, labels may be opinions on a person or product, e.g. “like”, “hate”, “neutral”.
Some applications attach multiple labels to a single document, and labels may have a hierarchical relationship; classifiers that exploit the hierarchy typically do better. Multi-label problems are an area where further progress is needed.
Labels may be manually assigned, either from a predefined dictionary or by human labelers. The human labelers may be domain experts, librarians/editors, or low-paid workers on Amazon Mechanical Turk.
Even with domain experts, there is still some disagreement among labelers; the level of agreement is an indicator of the difficulty of the classification problem. It can be quantified as inter-annotator agreement.
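One standard such measure is Cohen's kappa, which corrects raw agreement for agreement expected by chance; a minimal sketch (the annotator labels are hypothetical):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators on the same six documents.
annotator_1 = ["spam", "ham", "spam", "spam", "ham", "ham"]
annotator_2 = ["spam", "ham", "ham",  "spam", "ham", "spam"]

# 1.0 = perfect agreement, 0.0 = agreement no better than chance.
print(cohen_kappa_score(annotator_1, annotator_2))
```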
Labels can also be generated with the help of semi-supervised learning, where a bootstrapping approach is used to label additional data.
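A minimal self-training sketch of that bootstrapping idea (the classifier choice, confidence threshold, and dense count matrices are assumptions for illustration):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.95, rounds=5):
    """Train, promote high-confidence predictions to labels, retrain."""
    clf = MultinomialNB()
    for _ in range(rounds):
        clf.fit(X_labeled, y_labeled)
        if X_unlabeled.shape[0] == 0:
            break
        proba = clf.predict_proba(X_unlabeled)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break
        pseudo = clf.classes_[proba[confident].argmax(axis=1)]
        # Move confidently pseudo-labeled examples into the training set.
        X_labeled = np.vstack([X_labeled, X_unlabeled[confident]])
        y_labeled = np.concatenate([y_labeled, pseudo])
        X_unlabeled = X_unlabeled[~confident]
    return clf
```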
Data typically takes the form of N documents over a vocabulary of d terms. Counts are stored in an N-by-d sparse matrix, where each row is the “bag of words” for one document.
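A minimal sketch of building that matrix with scipy (the tokenized documents and vocabulary construction are simplified assumptions):

```python
from collections import Counter
from scipy.sparse import csr_matrix

docs = [["spam", "offer", "spam"], ["meeting", "agenda"], ["spam", "meeting"]]
vocab = {w: j for j, w in enumerate(sorted({w for d in docs for w in d}))}

rows, cols, vals = [], [], []
for i, doc in enumerate(docs):
    for word, count in Counter(doc).items():  # one bag of words per row
        rows.append(i)
        cols.append(vocab[word])
        vals.append(count)

# N-by-d sparse count matrix: here N = 3 documents, d = 4 terms.
X = csr_matrix((vals, (rows, cols)), shape=(len(docs), len(vocab)))
print(X.toarray())
```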
Variables:
$N$ = number of documents, $d$ = number of vocabulary terms, $\underline{x}$ = the vector of term counts for a document, $c_k$ = the $k$-th of $K$ class labels.
Probabilistic Classification:
Want to model $p(\underline{x} \mid c_k)$ for each class $k$ and perform classification via Bayes' rule. For 0-1 loss, select the class $c_k$ that maximizes $p(c_k \mid \underline{x})$; by Bayes' rule, this is the same $c_k$ that maximizes $p(\underline{x} \mid c_k)\, p(c_k)$.
Need to compute $p(c \mid \underline{x}) \propto p(\underline{x} \mid c)\, p(c)$.
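Written out: the denominator $p(\underline{x})$ is the same for every class, so it drops out of the maximization:

$$\hat{c} = \arg\max_k p(c_k \mid \underline{x}) = \arg\max_k \frac{p(\underline{x} \mid c_k)\, p(c_k)}{p(\underline{x})} = \arg\max_k p(\underline{x} \mid c_k)\, p(c_k)$$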
For both the Bernoulli and multinomial Naive Bayes models, the likelihood $p(\underline{x} \mid c)$ is a product of many probabilities and may underflow during computation. In practice, one works in log space, using $\log p(c \mid \underline{x}) = \log p(\underline{x} \mid c) + \log p(c) + \text{const}$ instead.
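A minimal sketch of the log-space computation for the multinomial model (the Laplace smoothing and toy data are assumptions for illustration):

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log p(c) and log p(word | c) with Laplace smoothing."""
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    log_lik = []
    for c in classes:
        counts = X[y == c].sum(axis=0) + alpha         # smoothed word counts
        log_lik.append(np.log(counts / counts.sum()))  # log p(word | c)
    return classes, log_prior, np.array(log_lik)

def predict(X, classes, log_prior, log_lik):
    # log p(c | x) = log p(c) + sum_j x_j * log p(word_j | c) + const;
    # summing logs avoids the underflow that multiplying probabilities causes.
    scores = X @ log_lik.T + log_prior
    return classes[scores.argmax(axis=1)]

# Toy 4-document, 3-term count matrix with binary labels.
X = np.array([[3, 0, 1], [2, 0, 0], [0, 2, 3], [0, 1, 2]])
y = np.array([0, 0, 1, 1])
print(predict(X, *fit_multinomial_nb(X, y)))  # -> [0 0 1 1]
```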
The number of words used in the model matters because many words are uninformative. To choose which words to use, it is common to start with an empty set and incrementally add the words with the highest mutual information with the class label, or to use some other heuristic.
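A minimal sketch of scoring words this way on binary occurrence features (since each word is scored independently here, the greedy loop reduces to taking the top-$k$ scores; the data shapes are assumptions):

```python
import numpy as np

def mutual_information(X_binary, y):
    """I(word; class) for each column of a binary document-term matrix."""
    mi = np.zeros(X_binary.shape[1])
    for j in range(X_binary.shape[1]):
        for w in (0, 1):                 # word absent / present
            for c in np.unique(y):
                p_wc = ((X_binary[:, j] == w) & (y == c)).mean()
                if p_wc > 0:
                    p_w = (X_binary[:, j] == w).mean()
                    p_c = (y == c).mean()
                    mi[j] += p_wc * np.log(p_wc / (p_w * p_c))
    return mi

def select_words(X_counts, y, k):
    """Keep the k words with the highest mutual information with the class."""
    scores = mutual_information((X_counts > 0).astype(int), y)
    return np.argsort(scores)[::-1][:k]
```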