Text preprocessing decisions:
Open-source lexer and parser generators: ANTLR, JFlex, JavaCC.
Stop words: very common words (e.g. “the”, “of”, “and”) that are of little use to non-statistical techniques. Removing them is not necessary when certain statistical techniques are used.
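A minimal sketch of stop-word filtering, assuming tokens are already split on whitespace. The stop list and the name remove_stop_words are hypothetical and chosen only for illustration; real stop lists are much longer.

```python
# Hypothetical, deliberately tiny stop list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def remove_stop_words(tokens):
    """Return the tokens with stop words dropped (case-insensitive match)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


tokens = "The cat is in the hat".split()
filtered = remove_stop_words(tokens)
print(filtered)  # ['cat', 'hat']
```

A set is used for the stop list so that each membership check is constant time.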
Normalization: “canonicalize” tokens to remove superficial differences, e.g. U.S.A. → USA → usa; C.A.T. → cat.
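One possible normalization rule, sketched under the assumption that dropping periods and lowercasing is enough. The function name normalize is hypothetical, and real pipelines usually apply more rules (accents, hyphens, and so on).

```python
def normalize(token):
    """Drop periods and lowercase the token (a simplifying assumption)."""
    return token.replace(".", "").lower()


# All three spellings of the acronym map to the same canonical form.
for raw in ["USA", "U.S.A.", "usa", "C.A.T."]:
    print(raw, "->", normalize(raw))
```

Note that “C.A.T.” ends up as the same term as the animal “cat”, so this kind of rule can merge tokens that are actually unrelated.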
Tokenization: convert a document into a bag of words (word counts). A common definition of a token is any nonempty sequence of characters that contains no whitespace or punctuation.
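A rough sketch of building bag-of-words counts with the Python standard library. The token pattern (runs of lowercase letters, digits, or apostrophes) is an assumption made for this example, and bag_of_words is a hypothetical name.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count tokens; a token here is a maximal run of letters, digits, or apostrophes."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


counts = bag_of_words("The cat sat on the mat. The mat was flat.")
print(counts["the"], counts["mat"], counts["cat"])  # 3 2 1
```

Word order is discarded: only the counts per term remain, which is what makes the representation a “bag”.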
Stemming: reduce all morphological variants of a word to a single term in order to reduce the dimensionality of the feature space.
Not a perfect technique: some information is lost. For example, in the popular Porter stemming algorithm, both “university” and “universal” become “univers”.
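To illustrate how suffix stripping conflates words, here is a toy stemmer. It is NOT the Porter algorithm, which applies several ordered rule phases with conditions on the stem; the suffix list, the min_stem threshold, and the name toy_stem are all assumptions made for this sketch.

```python
# Toy suffix list, checked in order; hypothetical and far cruder than Porter's rules.
SUFFIXES = ("ity", "al", "ing", "ed", "s")


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


print(toy_stem("university"), toy_stem("universal"))  # univers univers
print(toy_stem("walking"), toy_stem("walked"), toy_stem("walks"))  # walk walk walk
```

The second line shows the intended effect (variants of “walk” collapse to one term), while the first shows the loss of information described above.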