Part-of-Speech Tagging

In linguistics, part-of-speech tagging (POS tagging or POST) is the process of marking up each word in a text with its part of speech, based on both its definition and its context, i.e. its relationship with adjacent and related words in a phrase, sentence, or paragraph.

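Part-of-speech tagging is hard because many words can represent more than one part of speech depending on context, and because some parts of speech are complex or unspoken. Such ambiguity is not rare in natural languages. For example, "dogs" is usually a plural noun, but it is a verb in "The sailor dogs the hatch."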

Here we implement a part-of-speech tagger based on hidden Markov models (HMMs) in Java. Compared to more advanced algorithms (e.g. those based on maximum entropy classifiers or conditional random fields), this implementation is extremely fast while providing comparable accuracy.
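
To make the HMM approach concrete, the following is a minimal sketch of Viterbi decoding, the standard way to recover the most likely tag sequence from an HMM. It is not the implementation described above: the class name HmmTaggerSketch, the three-tag and three-word toy vocabulary, and every probability in it are assumptions chosen purely for illustration. A real tagger would estimate these tables from a tagged corpus and would need some strategy for unknown words.

```java
public class HmmTaggerSketch {
    // Hypothetical toy model: tag ids and word ids are array indices.
    static final String[] TAGS = {"DET", "NOUN", "VERB"};
    static final String[] WORDS = {"the", "dog", "barks"};

    // PI[s]: probability that a sentence starts with tag s (made-up values).
    static final double[] PI = {0.6, 0.3, 0.1};
    // A[p][s]: probability of moving from tag p to tag s (made-up values).
    static final double[][] A = {
        {0.05, 0.90, 0.05},
        {0.10, 0.30, 0.60},
        {0.50, 0.30, 0.20}
    };
    // B[s][w]: probability that tag s emits word w (made-up values).
    static final double[][] B = {
        {0.90, 0.05, 0.05},
        {0.05, 0.60, 0.35},
        {0.05, 0.15, 0.80}
    };

    /** Returns the most likely tag id sequence for the given word ids. */
    public static int[] viterbi(int[] obs, double[] pi, double[][] a, double[][] b) {
        int n = obs.length;
        int k = pi.length;
        if (n == 0) {
            return new int[0];
        }
        // delta[t][s]: best log-probability of any tag path ending in s at position t.
        double[][] delta = new double[n][k];
        // back[t][s]: the tag at position t - 1 on that best path.
        int[][] back = new int[n][k];

        for (int s = 0; s < k; s++) {
            delta[0][s] = Math.log(pi[s]) + Math.log(b[s][obs[0]]);
        }
        for (int t = 1; t < n; t++) {
            for (int s = 0; s < k; s++) {
                int best = 0;
                double bestScore = Double.NEGATIVE_INFINITY;
                for (int p = 0; p < k; p++) {
                    double score = delta[t - 1][p] + Math.log(a[p][s]);
                    if (score > bestScore) {
                        bestScore = score;
                        best = p;
                    }
                }
                delta[t][s] = bestScore + Math.log(b[s][obs[t]]);
                back[t][s] = best;
            }
        }

        // Pick the best final tag, then follow the back-pointers to the start.
        int last = 0;
        for (int s = 1; s < k; s++) {
            if (delta[n - 1][s] > delta[n - 1][last]) {
                last = s;
            }
        }
        int[] path = new int[n];
        path[n - 1] = last;
        for (int t = n - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }

    public static void main(String[] args) {
        int[] sentence = {0, 1, 2}; // "the dog barks"
        int[] tags = viterbi(sentence, PI, A, B);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tags.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(WORDS[sentence[i]]).append('/').append(TAGS[tags[i]]);
        }
        System.out.println(out); // prints "the/DET dog/NOUN barks/VERB"
    }
}
```

The sketch works in log space so that products of many small probabilities do not underflow on long sentences. Decoding costs time proportional to the sentence length times the square of the number of tags, which is one reason HMM taggers tend to be fast.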