The nltk.tag module defines functions and classes for manipulating tagged tokens, which combine a basic token value with a tag. Tags are case-sensitive strings that identify some property of a token, such as its part of speech. Tagged tokens are encoded as tuples (token, tag). For example, the following tagged token combines the word 'fly' with a noun part-of-speech tag ('NN'):
|
An off-the-shelf tagger is available. It uses the Penn Treebank tagset:
|
Tagged tokens are often written using the form 'fly/NN'. The nltk.tag module provides utility functions to convert between this string representation and the tuple representation:
|
To convert an entire sentence from the string format to the tuple format, we simply tokenize the sentence and then apply str2tuple to each word:
|
Similarly, we can convert from a list of tagged tuples to a single string by combining tuple2str with the string join method:
|
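The two conversions described above can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: the helper names str_to_tuple and tuple_to_str are hypothetical, and splitting at the last '/' and uppercasing the tag are assumptions of this sketch.

```python
def str_to_tuple(s, sep="/"):
    """Split 'word/TAG' at the last separator; the tag is None if absent."""
    word, found, tag = s.rpartition(sep)
    if not found:
        return (s, None)
    return (word, tag.upper())


def tuple_to_str(tagged_token, sep="/"):
    """Join a (word, tag) tuple back into 'word/TAG'."""
    word, tag = tagged_token
    return word if tag is None else f"{word}{sep}{tag}"


sentence = "The/AT dog/NN barked/VBD ./."

# String format -> list of (word, tag) tuples.
tagged = [str_to_tuple(t) for t in sentence.split()]

# List of tuples -> single string, using the string join method.
round_trip = " ".join(tuple_to_str(t) for t in tagged)
```

Splitting at the last separator rather than the first keeps a token such as './.' intact as the word '.' with the tag '.'.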
The nltk.tag module defines several taggers, which take a token list (typically a sentence), assign a tag to each token, and return the resulting list of tagged tokens. Most of the taggers defined in the nltk.tag module are built automatically based on a training corpus. For example, the unigram tagger tags each word w by checking what the most frequent tag for w was in a training corpus:
|
Note that words that the tagger has not seen before, such as decried, receive a tag of None.
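As a rough illustration of the unigram idea, the following plain-Python sketch counts tags per word and keeps the most frequent one. The function names and the tiny training set are made up for this example; it is not how the library stores its model.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each word to the tag it carried most often in training."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}


def tag_unigram(model, words):
    """Tag each word from the table; unseen words get None."""
    return [(w, model.get(w)) for w in words]


# Hypothetical toy training data.
train = [
    [("the", "AT"), ("fly", "NN"), ("landed", "VBD")],
    [("birds", "NNS"), ("fly", "VB")],
    [("a", "AT"), ("fly", "NN")],
]

model = train_unigram(train)
result = tag_unigram(model, ["the", "fly", "decried"])
```

Here 'fly' was seen twice as NN and once as VB, so it is tagged NN, while the unseen word 'decried' receives None.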
In the examples below, we'll look at developing automatic part-of-speech taggers based on the Brown Corpus. Here are the training & test sets we'll use:
|
(Note that these are on the small side, to make the tests run faster -- for real-world use, you would probably want to train on more data.)
The simplest tagger is the DefaultTagger, which just applies the same tag to all tokens:
|
Since 'NN' is the most frequent tag in the Brown corpus, we can use a tagger that assigns 'NN' to all words as a baseline.
|
Using this baseline, we achieve a fairly low accuracy:
|
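To make the baseline concrete, here is a small sketch of a constant tagger and a token-level accuracy measure. The helper names and the gold data are hypothetical, and the number it produces says nothing about the Brown Corpus figures.

```python
def default_tag(words, tag="NN"):
    """Assign the same tag to every token."""
    return [(w, tag) for w in words]


def accuracy(tagger, gold_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for gold in gold_sents:
        words = [w for w, _ in gold]
        predicted = tagger(words)
        correct += sum(p == g for p, g in zip(predicted, gold))
        total += len(gold)
    return correct / total


# Hypothetical gold-standard sentences.
gold = [
    [("the", "AT"), ("dog", "NN"), ("barked", "VBD")],
    [("a", "AT"), ("cat", "NN")],
]

baseline = accuracy(default_tag, gold)
```

Two of the five tokens are nouns, so the constant 'NN' tagger scores 0.4 on this toy data.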
The RegexpTagger class assigns tags to tokens by comparing their word strings to a series of regular expressions. The following tagger uses word suffixes to make guesses about the correct Brown Corpus part of speech tag:
|
This gives us a higher score than the default tagger, but accuracy is still fairly low:
|
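The suffix-based approach can be sketched with the standard re module. The pattern list below is an assumption chosen for illustration (first matching pattern wins, with a catch-all noun rule last); it is not guaranteed to match the patterns used in the original example.

```python
import re

# (pattern, tag) pairs, tried in order; all are illustrative assumptions.
PATTERNS = [
    (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),   # cardinal numbers
    (r"(The|the|A|a|An|an)$", "AT"),    # articles
    (r".*able$", "JJ"),                 # adjectives
    (r".*ness$", "NN"),                 # nouns formed from adjectives
    (r".*ly$", "RB"),                   # adverbs
    (r".*s$", "NNS"),                   # plural nouns
    (r".*ing$", "VBG"),                 # gerunds
    (r".*ed$", "VBD"),                  # past tense verbs
    (r".*", "NN"),                      # default: noun
]
COMPILED = [(re.compile(p), t) for p, t in PATTERNS]


def regexp_tag(words):
    """Tag each word with the tag of the first pattern that matches it."""
    tagged = []
    for w in words:
        tag = next((t for rx, t in COMPILED if rx.match(w)), None)
        tagged.append((w, tag))
    return tagged


result = regexp_tag(["the", "happiness", "quickly", "walked", "42", "dogs"])
```

Order matters: 'happiness' ends in 's', but the more specific 'ness' rule appears earlier and therefore wins.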
As mentioned above, the UnigramTagger class finds the most likely tag for each word in a training corpus, and then uses that information to assign tags to new tokens.
|
This gives us a significantly higher accuracy score than the default tagger or the regexp tagger:
|
As was mentioned above, the unigram tagger will assign a tag of None to any words that it never saw in the training data. We can avoid this problem by providing the unigram tagger with a backoff tagger, which will be used whenever the unigram tagger is unable to choose a tag:
|
Using a backoff tagger has another advantage -- it allows us to build a more compact unigram tagger, because the unigram tagger doesn't need to explicitly store the tags for words that the backoff tagger would get right anyway. We can see this by using the size() method, which reports the number of words for which a unigram tagger has stored the most likely tag.
|
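Both effects of a backoff tagger, filling in unseen words and shrinking the stored table, can be illustrated with a simplified sketch. The pruning rule used here (skip a word when the backoff already returns its most frequent tag) is an assumption of this example and may differ in detail from the library's training procedure; all names are hypothetical.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents, backoff=None):
    """Store a word's most frequent tag unless the backoff already gives it."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    model = {}
    for word, c in counts.items():
        best = c.most_common(1)[0][0]
        if backoff is None or backoff(word) != best:
            model[word] = best
    return model


def tag(model, words, backoff=None):
    """Use the table first; fall back to the backoff for missing words."""
    out = []
    for w in words:
        t = model.get(w)
        if t is None and backoff is not None:
            t = backoff(w)
        out.append((w, t))
    return out


def nn_backoff(word):
    """Backoff that tags everything as a noun."""
    return "NN"


train = [
    [("the", "AT"), ("dog", "NN"), ("barked", "VBD")],
    [("a", "AT"), ("cat", "NN")],
]

plain = train_unigram(train)                        # stores every word
compact = train_unigram(train, backoff=nn_backoff)  # skips the nouns
result = tag(compact, ["the", "dog", "decried"], backoff=nn_backoff)
```

On this toy data the plain table holds five entries and the compact one three, yet the compact tagger with its backoff still tags 'dog' correctly and gives the unseen 'decried' a tag instead of None.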
The bigram tagger is similar to the unigram tagger, except that it finds the most likely tag for each word, given the preceding tag. (It is called a "bigram" tagger because it uses two pieces of information -- the current word, and the previous tag.) When training, it can look up the preceding tag directly. When run on new data, it works through the sentence from left to right, and uses the tag that it just generated for the preceding word.
|
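The left-to-right procedure described above can be sketched as follows. This is a minimal illustration with hypothetical names and data; the "<s>" start marker is an assumption of the sketch, not something the library is known to use.

```python
from collections import Counter, defaultdict

START = "<s>"  # hypothetical marker for "no preceding tag"


def train_bigram(tagged_sents):
    """Most frequent tag for each (preceding tag, word) context."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = START
        for word, tag in sent:
            counts[(prev, word)][tag] += 1
            prev = tag  # during training the true preceding tag is known
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}


def tag_bigram(model, words):
    """Tag left to right, feeding each predicted tag into the next context."""
    tagged, prev = [], START
    for w in words:
        t = model.get((prev, w))
        tagged.append((w, t))
        prev = t
    return tagged


train = [
    [("they", "PPSS"), ("fly", "VB")],
    [("the", "AT"), ("fly", "NN")],
]

model = train_bigram(train)
after_article = tag_bigram(model, ["the", "fly"])
after_pronoun = tag_bigram(model, ["they", "fly"])
unseen_context = tag_bigram(model, ["a", "fly"])
```

The same word 'fly' is tagged NN after an article and VB after a pronoun. In the last call 'a' was never seen, so it gets None, and that None then becomes the context for 'fly', which is also unseen; this cascade is one reason such taggers are usually combined with a backoff.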
Similarly, the trigram tagger finds the most likely tag for a word, given the preceding two tags; and the n-gram tagger finds the most likely tag for a word, given the preceding n-1 tags. However, these higher-order taggers are only likely to improve performance if there is a large amount of training data available; otherwise, the sequences that they consider do not occur often enough to gather reliable statistics.
|
The Brill Tagger starts by running an initial tagger, and then improves the tagging by applying a list of transformation rules. These transformation rules are automatically learned from the training corpus, based on one or more "rule templates."
|
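To give a feel for what a transformation rule does, the sketch below applies one hand-written rule to the output of an initial tagger. The rule format (from_tag, to_tag, required preceding tag) is a hypothetical simplification, and the learning step that selects rules from templates is not shown.

```python
def apply_rule(tags, rule):
    """Change from_tag to to_tag wherever the preceding tag is prev_tag."""
    from_tag, to_tag, prev_tag = rule
    # Find all matching positions first, then rewrite them together,
    # so one change cannot trigger another within the same pass.
    positions = [
        i for i in range(1, len(tags))
        if tags[i] == from_tag and tags[i - 1] == prev_tag
    ]
    new_tags = list(tags)
    for i in positions:
        new_tags[i] = to_tag
    return new_tags


words = ["they", "want", "to", "fly"]
initial = ["PPSS", "VB", "TO", "NN"]  # output of some initial tagger

# Hypothetical rule: NN becomes VB when the preceding tag is TO.
rules = [("NN", "VB", "TO")]

tags = initial
for rule in rules:
    tags = apply_rule(tags, rule)

corrected = list(zip(words, tags))
```

The initial tagger's noun reading of 'fly' is corrected to a verb because it follows 'to'.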
The HMM tagger uses a hidden Markov model to find the most likely tag sequence for each sentence. (Note: this requires NumPy.)
|
Demo code lifted more or less directly from the HMM class.
|
Check the test sequence by hand -- calculate the joint probability for each possible state sequence, and verify that each is equal to what the model gives; then verify that their total is equal to what the model gives for the probability of the sequence with no states specified.
|
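The hand check described above can be sketched on a tiny model in plain Python. All probabilities, state names, and symbols below are made up for illustration and are not the values used in the demo; the forward recursion stands in for "what the model gives".

```python
from itertools import product

# Hypothetical two-state HMM.
states = ["N", "V"]
start = {"N": 0.6, "V": 0.4}
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit = {
    "N": {"fish": 0.5, "swim": 0.1, "fast": 0.4},
    "V": {"fish": 0.3, "swim": 0.6, "fast": 0.1},
}


def joint(obs, seq):
    """P(observations, state sequence), multiplied out term by term."""
    p = start[seq[0]] * emit[seq[0]][obs[0]]
    for prev, cur, o in zip(seq, seq[1:], obs[1:]):
        p *= trans[prev][cur] * emit[cur][o]
    return p


def forward(obs):
    """P(observations) with the states summed out (forward algorithm)."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            s: emit[s][o] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


obs = ["fish", "swim", "fast"]

# Enumerate every possible state sequence and total the joint probabilities.
brute = sum(joint(obs, seq) for seq in product(states, repeat=len(obs)))
model_prob = forward(obs)
```

Up to floating-point rounding, brute and model_prob agree, which is the property the check is meant to confirm.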
Find the most likely set of tags for the test sequence.
|
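Finding the most likely tag sequence is commonly done with the Viterbi algorithm. The sketch below runs it on a made-up two-state model and compares the result against brute-force enumeration; the numbers and names are assumptions for illustration only.

```python
from itertools import product

# Hypothetical two-state HMM.
states = ["N", "V"]
start = {"N": 0.6, "V": 0.4}
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit = {
    "N": {"fish": 0.5, "swim": 0.1, "fast": 0.4},
    "V": {"fish": 0.3, "swim": 0.6, "fast": 0.1},
}


def joint(obs, seq):
    """P(observations, state sequence)."""
    p = start[seq[0]] * emit[seq[0]][obs[0]]
    for prev, cur, o in zip(seq, seq[1:], obs[1:]):
        p *= trans[prev][cur] * emit[cur][o]
    return p


def viterbi(obs):
    """Return (probability, path) of the most likely state sequence."""
    # best[s] holds the best-scoring path that ends in state s.
    best = {s: (start[s] * emit[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new_best = {}
        for s in states:
            p, path = max(
                ((best[r][0] * trans[r][s], best[r][1]) for r in states),
                key=lambda item: item[0],
            )
            new_best[s] = (p * emit[s][o], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


obs = ["fish", "swim", "fast"]
best_prob, best_path = viterbi(obs)

# Cross-check against exhaustive search over all state sequences.
brute_path = max(product(states, repeat=len(obs)), key=lambda s: joint(obs, s))
```

Viterbi keeps only one best path per state at each step, so it avoids enumerating every sequence while still returning the same answer as the exhaustive search.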
Find some entropy values. These are all in base 2 (i.e., bits).
|
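One entropy that can be checked by hand is the entropy of the distribution over state sequences given the observations. The sketch below computes it in bits by brute force on a made-up model; whether this matches each quantity reported by the demo is an assumption.

```python
from itertools import product
from math import log2

# Hypothetical two-state HMM.
states = ["N", "V"]
start = {"N": 0.6, "V": 0.4}
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit = {
    "N": {"fish": 0.5, "swim": 0.1, "fast": 0.4},
    "V": {"fish": 0.3, "swim": 0.6, "fast": 0.1},
}


def joint(obs, seq):
    """P(observations, state sequence)."""
    p = start[seq[0]] * emit[seq[0]][obs[0]]
    for prev, cur, o in zip(seq, seq[1:], obs[1:]):
        p *= trans[prev][cur] * emit[cur][o]
    return p


obs = ["fish", "swim", "fast"]

# Joint probability of every state sequence with the observations.
joints = {seq: joint(obs, seq) for seq in product(states, repeat=len(obs))}
total = sum(joints.values())

# Normalise to get P(state sequence | observations).
posterior = {seq: p / total for seq, p in joints.items()}

# Shannon entropy in base 2, i.e. measured in bits.
entropy_bits = -sum(p * log2(p) for p in posterior.values() if p > 0)
```

With eight possible sequences the entropy can be at most 3 bits; a value well below that indicates the model is fairly confident about the tagging.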
The TaggerI interface defines two methods: tag and batch_tag:
|
The TaggerI interface should not be directly instantiated:
|
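The shape of such an interface can be sketched with Python's abc module. This is an analogy rather than the library's own mechanism: the class names are hypothetical, and the library may signal direct instantiation differently than the TypeError raised here.

```python
from abc import ABC, abstractmethod


class TaggerInterface(ABC):
    """Hypothetical stand-in for a tagger interface."""

    @abstractmethod
    def tag(self, tokens):
        """Return a list of (token, tag) tuples for one sentence."""

    def batch_tag(self, sentences):
        """Tag several sentences by calling tag() on each."""
        return [self.tag(sentence) for sentence in sentences]


class ConstantTagger(TaggerInterface):
    """Concrete tagger that assigns one fixed tag to every token."""

    def __init__(self, tag):
        self._tag = tag

    def tag(self, tokens):
        return [(token, self._tag) for token in tokens]


# The abstract interface itself cannot be instantiated.
try:
    TaggerInterface()
    instantiated = True
except TypeError:
    instantiated = False

tagger = ConstantTagger("NN")
batch = tagger.batch_tag([["a", "dog"], ["cats"]])
```

A subclass only has to supply tag(); batch_tag() comes for free from the base class.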
- test that fast & normal trainers get identical results when deterministic=True is used.
- check on some simple examples to make sure they're doing the right thing.
Make sure that get_neighborhoods is implemented correctly -- in particular, given index, it should return the indices i such that applicable_rules(token, i, ...) depends on the value of the token at position index. There used to be a bug where this was swapped -- i.e., it calculated the values of i such that applicable_rules(token, index, ...) depended on the token at position i.
|