Probability

 
>>> import nltk
>>> from nltk.probability import *

1   Testing some HMM estimators

We extract a small part of the Brown corpus: the first 500 tagged sentences of the 'adventure' category.

 
>>> corpus = nltk.corpus.brown.tagged_sents(categories='adventure')[:500]
>>> print(len(corpus))
500

We create an HMM trainer. Note that we need the tags and symbols from the whole corpus, not just the training corpus.

 
>>> tag_set = list(set([tag for sent in corpus for (word,tag) in sent]))
>>> print(len(tag_set))
92
>>> symbols = list(set([word for sent in corpus for (word,tag) in sent]))
>>> print(len(symbols))
1464
>>> trainer = nltk.HiddenMarkovModelTrainer(tag_set, symbols)

We divide the corpus into 90% training and 10% testing: every tenth sentence goes to the test set.

 
>>> train_corpus = []
>>> test_corpus = []
>>> for i in range(len(corpus)):
...     if i % 10:
...         train_corpus += [corpus[i]]
...     else:
...         test_corpus += [corpus[i]]
>>> print(len(train_corpus))
450
>>> print(len(test_corpus))
50
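
The same every-tenth-sentence split can also be written with slicing. The sketch below uses a stand-in list of integers rather than the Brown corpus; the names `sentences`, `test_part` and `train_part` are illustrative only.

```python
# Stand-in for the 500 tagged sentences; any sequence behaves the same way.
sentences = list(range(500))

# Every tenth item (indices 0, 10, 20, ...) goes to the test set ...
test_part = sentences[::10]
# ... and every other item goes to the training set.
train_part = [s for i, s in enumerate(sentences) if i % 10]

sizes = (len(train_part), len(test_part))
print(sizes)
```

This yields the same 450/50 division as the loop above, because `i % 10` is zero exactly for the indices selected by `[::10]`.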

And now we can test the estimators

 
>>> def train_and_test(est):
...     hmm = trainer.train_supervised(train_corpus, estimator=est)
...     print('%.2f%%' % (100 * nltk.tag.accuracy(hmm, test_corpus)))
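
As a rough illustration of what the reported percentage measures, the sketch below computes token-level accuracy: the share of tokens whose predicted tag equals the gold tag. The function `token_accuracy` and the toy tagger `always_nn` are hypothetical stand-ins, not part of NLTK, and the sketch assumes that `nltk.tag.accuracy` scores token by token in the same way.

```python
def token_accuracy(tag_sentence, gold_sents):
    """Fraction of tokens whose predicted tag matches the gold tag.

    tag_sentence: callable mapping a list of words to a list of (word, tag).
    gold_sents: list of sentences, each a list of (word, tag) pairs.
    """
    correct = 0
    total = 0
    for gold in gold_sents:
        words = [word for word, _ in gold]
        predicted = tag_sentence(words)
        for (_, gold_tag), (_, pred_tag) in zip(gold, predicted):
            if gold_tag == pred_tag:
                correct += 1
            total += 1
    # Guard against an empty test set.
    return float(correct) / total if total else 0.0


# Toy tagger (an assumption for this example): tags every word as a noun.
def always_nn(words):
    return [(w, 'NN') for w in words]


gold = [[('the', 'AT'), ('dog', 'NN')],
        [('dogs', 'NNS'), ('bark', 'VB'), ('man', 'NN')]]
print('%.2f%%' % (100 * token_accuracy(always_nn, gold)))
```

With this toy data, two of the five tokens are tagged correctly, so the sketch prints 40.00%.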

Maximum Likelihood Estimation - this resulted in an initialization error before r7209

 
>>> mle = lambda fd, bins: MLEProbDist(fd)
>>> train_and_test(mle)
14.43%

Laplace (= Lidstone with gamma==1)

 
>>> train_and_test(LaplaceProbDist)
66.04%

Expected Likelihood Estimation (= Lidstone with gamma==0.5)

 
>>> train_and_test(ELEProbDist)
73.01%

Lidstone Estimation, for gamma==0.1, 0.5 and 1 (the latter two should be exactly equal to ELE and Laplace above)

 
>>> def lidstone(gamma):
...     return lambda fd, bins: LidstoneProbDist(fd, gamma, bins)
>>> train_and_test(lidstone(0.1))
82.51%
>>> train_and_test(lidstone(0.5))
73.01%
>>> train_and_test(lidstone(1.0))
66.04%
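
The relationship between these estimators follows from the Lidstone formula P(w) = (c(w) + gamma) / (N + B * gamma), where c(w) is the count of w, N is the total count and B is the number of bins. The pure-Python sketch below shows that gamma = 1 gives the Laplace estimate, gamma = 0.5 gives the ELE estimate, and gamma = 0 gives the MLE estimate, which assigns zero probability to unseen samples. The helper `lidstone_prob` and the toy counts are hypothetical and are not taken from NLTK or from the corpus.

```python
def lidstone_prob(count, total, bins, gamma):
    """Lidstone estimate: (count + gamma) / (total + bins * gamma)."""
    return (count + gamma) / float(total + bins * gamma)


# Toy counts (made up for illustration): 3 observed types, 5 bins in all.
counts = {'a': 6, 'b': 3, 'c': 1}
total = sum(counts.values())   # N = 10
bins = 5                       # B: two possible samples were never observed

for gamma in (0.0, 0.1, 0.5, 1.0):
    seen = lidstone_prob(counts['a'], total, bins, gamma)
    unseen = lidstone_prob(0, total, bins, gamma)
    print('gamma=%.1f  P(a)=%.4f  P(unseen)=%.4f' % (gamma, seen, unseen))
```

For any gamma greater than zero the unseen samples receive some probability mass, and the estimates over all B bins still sum to one.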

Witten Bell Estimation - This resulted in ZeroDivisionError before r7209

 
>>> train_and_test(WittenBellProbDist)
88.12%
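
For reference, here is a minimal sketch of Witten-Bell smoothing as it is commonly stated: with N observed tokens, T observed types and Z = B - T unseen bins, a seen sample with count c gets c / (N + T), and each unseen sample gets T / (Z * (N + T)). The helper `witten_bell_prob` is hypothetical. The uniform fallback for an empty distribution (N = 0) is an assumption made here to avoid a division by zero; it is not a claim about what r7209 changed.

```python
def witten_bell_prob(counts, sample, bins):
    """Witten-Bell estimate for `sample`, given a dict of counts and a bin total."""
    n = sum(counts.values())                        # N: observed tokens
    t = sum(1 for c in counts.values() if c > 0)    # T: observed types
    z = bins - t                                    # Z: bins never observed
    c = counts.get(sample, 0)
    if c > 0:
        return c / float(n + t)
    if z == 0:
        return 0.0        # every bin was observed, so nothing is reserved
    if n == 0:
        return 1.0 / z    # assumed fallback: uniform over the unseen bins
    return t / float(z * (n + t))


# Toy counts (made up for illustration): 3 observed types, 5 bins in all.
counts = {'a': 6, 'b': 3, 'c': 1}
for sample in ('a', 'b', 'c', 'unseen'):
    print('%s: %.4f' % (sample, witten_bell_prob(counts, sample, 5)))
```

In this formulation the seen samples share N / (N + T) of the mass and the unseen bins share the remaining T / (N + T), so the distribution sums to one.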

Good-Turing Estimation - the accuracy is only 17% (see issues #26 and #133). The estimator should be fixed first, and then this test.

 
>>> train_and_test(GoodTuringProbDist)
THE GOOD TURING ESTIMATOR IS FLAWED

Remains to be added: tests for HeldoutProbDist, CrossValidationProbDist and MutableProbDist.