Package nltk :: Package tag :: Module tnt :: Class TnT

type TnT


 object --+    
          |    
api.TaggerI --+
              |
             TnT


TnT - Statistical POS tagger

IMPORTANT NOTES:

* DOES NOT AUTOMATICALLY DEAL WITH UNSEEN WORDS
  - An untrained POS tagger can be provided to create tags
    for unknown words; see the unk parameter of __init__

* SHOULD BE USED WITH SENTENCE-DELIMITED INPUT
  - Due to the nature of this tagger, it works best when
    trained over sentence-delimited input.
  - However, it still produces good results if the training
    data and testing data are separated on all punctuation, e.g. [,.?!]
  - Input for training is expected to be a list of sentences,
    where each sentence is a list of (word, tag) tuples
  - Input for the tag function is a single sentence
  - Input for the tagdata function is a list of sentences
  - Output is of a similar form

* A function is provided to process text that is unsegmented
  - Please see basic_sent_chop()
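
The training input shape described above can be illustrated with a small sketch. The helper below is hypothetical and is not NLTK's basic_sent_chop(); it only shows one way a flat stream of (word, tag) tuples could be split on the punctuation set [,.?!] into the list-of-sentences form that train() expects. The names chop_on_punctuation and DELIMITERS are assumptions made for this example.

```python
# Hypothetical helper (NOT nltk.tag.tnt.basic_sent_chop): split a flat
# stream of (word, tag) tuples into sentences at punctuation tokens.
DELIMITERS = {",", ".", "?", "!"}


def chop_on_punctuation(tagged_tokens, delimiters=DELIMITERS):
    sentences, current = [], []
    for word, tag in tagged_tokens:
        current.append((word, tag))
        if word in delimiters:
            # The delimiter closes the current sentence.
            sentences.append(current)
            current = []
    if current:
        # Keep trailing tokens that were not followed by a delimiter.
        sentences.append(current)
    return sentences


flat = [("The", "DT"), ("dog", "NN"), ("barks", "VBZ"), (".", "."),
        ("Cats", "NNS"), ("sleep", "VBP"), (".", ".")]

# List of sentences, each a list of (word, tag) tuples.
training_data = chop_on_punctuation(flat)
```

The result is a list of two sentences, which matches the documented input form for training.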


TnT uses a second order Markov model to produce tags for
a sequence of input, specifically:

  argmax [Prod_{i=1..T} P(t_i|t_i-1,t_i-2) P(w_i|t_i)] P(t_T+1 | t_T)

i.e. the tag sequence t_1..t_T that maximizes the product of a set
of probabilities

The set of possible tags for a given word is derived
from the training data. It is the set of all tags
that exact word has been assigned.
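
As an illustration only, the candidate tag set per word could be collected as follows. This is a sketch, not the class's internal data structure; the name possible_tags is an assumption. Note that lookup is by the exact word form, so "run" and "Run" are different keys.

```python
from collections import defaultdict


def possible_tags(training_data):
    """Map each exact word form to the set of tags it was seen with."""
    tags_for_word = defaultdict(set)
    for sentence in training_data:
        for word, tag in sentence:
            tags_for_word[word].add(tag)
    return tags_for_word


sample = [[("I", "PRP"), ("run", "VBP")],
          [("a", "DT"), ("run", "NN")]]
tags = possible_tags(sample)
```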

The probability of a tag given the two preceding tags is the
linear interpolation of 3 Markov models: a zero-order (unigram),
a first-order (bigram), and a second-order (trigram) model.

  P(t_i| t_i-1, t_i-2) = l1*P(t_i) + l2*P(t_i| t_i-1) +
                         l3*P(t_i| t_i-1, t_i-2)
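
The interpolation formula above can be written out directly. The sketch below uses made-up weights and probabilities purely for illustration; the function name interpolated_prob is an assumption, not part of the class.

```python
def interpolated_prob(l1, l2, l3, p_uni, p_bi, p_tri):
    """Linear interpolation of unigram, bigram and trigram estimates.

    l1, l2, l3 are the lambda weights; p_uni = P(t_i),
    p_bi = P(t_i | t_i-1), p_tri = P(t_i | t_i-1, t_i-2).
    """
    return l1 * p_uni + l2 * p_bi + l3 * p_tri


# Illustrative values only (not taken from any trained model).
p = interpolated_prob(0.2, 0.3, 0.5, p_uni=0.1, p_bi=0.4, p_tri=0.6)
# 0.2*0.1 + 0.3*0.4 + 0.5*0.6 = 0.44
```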

A beam search is used to limit the memory usage of the algorithm.
The degree of the beam can be changed using N in the initialization.
N represents the maximum number of possible solutions to maintain
while tagging.
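
The pruning step of a beam search can be sketched as below. This is a minimal illustration under the assumption that each state is a (tag_history, prob) pair, as in the current_states parameter of _tagword; it is not the class's own implementation, and prune_beam is a hypothetical name.

```python
import heapq


def prune_beam(states, n):
    """Keep only the n highest-probability (tag_history, prob) states."""
    return heapq.nlargest(n, states, key=lambda state: state[1])


states = [(["DT", "NN"], 0.02),
          (["DT", "VB"], 0.001),
          (["NN", "NN"], 0.005)]

# With N = 2, the least probable hypothesis is discarded.
kept = prune_beam(states, 2)
```

A smaller N lowers memory use but risks discarding a hypothesis that would have become the best one later in the sentence.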

It is possible to differentiate the tags which are assigned to
capitalized words. However this does not result in a significant
gain in the accuracy of the results. 

Instance Methods

__init__(self, unk=None, Trained=False, N=1000, C=False)
Construct a TnT statistical tagger.

train(self, data)
Uses a set of tagged data to train the tagger.

_compute_lambda(self)
Creates lambda values based upon training data.

_safe_div(self, v1, v2)
Safe floating-point division function. Does not allow division by 0; returns -1 if the denominator is 0.

tagdata(self, data)
Tags each sentence in a list of sentences.

tag(self, data)
Tags a single sentence. Returns a list of (token, tag).

_tagword(self, sent, current_states)

_cmp_tup(self, (_hq, p1), (_h2, p2))
Comparison function.

Inherited from api.TaggerI: batch_tag, evaluate

Inherited from api.TaggerI (private): _check_params

Method Details

__init__(self, unk=None, Trained=False, N=1000, C=False)
(Constructor)

Construct a TnT statistical tagger. Tagger must be trained before being used to tag input.

Initializer; creates the frequency distributions to be used for tagging.

The _lx values represent the portion of the tri/bi/uni taggers to be used to calculate the probability.

The N value is the number of possible solutions to maintain while tagging. A good value for this is 1000.

C is a boolean value which specifies whether to use the capitalization of the word as additional information for tagging. NOTE: using capitalization may not increase the accuracy of the tagger.

Parameters:
  • unk (TaggerI) - instance of a POS tagger, conforms to TaggerI
  • Trained (boolean) - Indication that the POS tagger is trained or not
  • N (int) - Beam search degree (see above)
  • C (boolean) - Capitalization flag
Overrides: object.__init__

train(self, data)

Uses a set of tagged data to train the tagger. If an unknown word tagger is specified, it is trained on the same data.

Parameters:
  • data (list of list of tuple) - List of lists of (word, tag) tuples

_compute_lambda(self)

Creates lambda values based upon training data.

NOTE: there is no need to explicitly reference C;
it is contained within the tag variable :: tag == (tag,C)

For each tag trigram (t1, t2, t3), depending on which of
the following values is the maximum:
- (f(t1,t2,t3) - 1) / (f(t1,t2) - 1)
- (f(t2,t3) - 1) / (f(t2) - 1)
- (f(t3) - 1) / (N - 1)

increment l3, l2, or l1, respectively, by f(t1,t2,t3).
Here f is a frequency count and N is the total number of
tokens in the training data (not the beam search N).

ISSUES -- Resolutions:
if 2 values are equal, increment both lambda values
by (f(t1,t2,t3) / 2)
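
The procedure above can be sketched in a few lines. This is an illustrative reimplementation under stated assumptions, not the method's source: it works on plain tag sequences rather than (tag, C) pairs, splits f equally among however many values tie for the maximum (the text above only specifies the two-way case), and normalizes the three totals so they sum to 1, as in Brants' description of deleted interpolation. The names safe_div and compute_lambdas are hypothetical.

```python
from collections import Counter


def safe_div(numerator, denominator):
    # Same contract as documented for _safe_div: -1 when dividing by 0.
    return -1.0 if denominator == 0 else numerator / denominator


def compute_lambdas(tag_sequences):
    """Return (l1, l2, l3) estimated from lists of tags."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for tags in tag_sequences:
        uni.update(tags)
        bi.update(zip(tags, tags[1:]))
        tri.update(zip(tags, tags[1:], tags[2:]))
    n = sum(uni.values())  # total number of tag tokens

    weights = {"l1": 0.0, "l2": 0.0, "l3": 0.0}
    for (t1, t2, t3), f in tri.items():
        scores = {
            "l3": safe_div(f - 1, bi[(t1, t2)] - 1),
            "l2": safe_div(bi[(t2, t3)] - 1, uni[t2] - 1),
            "l1": safe_div(uni[t3] - 1, n - 1),
        }
        best = max(scores.values())
        winners = [name for name, s in scores.items() if s == best]
        for name in winners:
            # Assumption: ties share f equally among all winners.
            weights[name] += f / len(winners)

    total = sum(weights.values())
    if total == 0:
        return 0.0, 0.0, 0.0
    # Assumption: normalize so that l1 + l2 + l3 == 1.
    return tuple(weights[name] / total for name in ("l1", "l2", "l3"))


l1, l2, l3 = compute_lambdas([["DT", "NN", "VB"],
                              ["DT", "NN", "VB"],
                              ["DT", "NN", "NN"]])
```

In this toy data the trigram (DT, NN, VB) occurs twice and credits l3, while (DT, NN, NN) occurs once and credits l1, so the trigram model receives the largest weight.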

tagdata(self, data)

Tags each sentence in a list of sentences.

Parameters:
  • data ([[string,],]) - list of list of words
Returns:
list of list of (word, tag) tuples

Invokes the tag(sent) function for each sentence and compiles the results into a list of tagged sentences. Each tagged sentence is a list of (word, tag) tuples.

tag(self, data)

Tags a single sentence.

Parameters:
  • data ([string,]) - list of words
Returns: list of (token, tag)
[(word, tag),]

Calls the recursive function '_tagword' to produce a list of tags, associates the sequence of returned tags with the corresponding words in the input sequence, and returns a list of (word, tag) tuples.

Overrides: api.TaggerI.tag

_tagword(self, sent, current_states)

Tags the first word in the sentence and recursively tags the remainder of the sentence.

Uses the formula specified above to calculate the probability of a particular tag.

Parameters:
  • sent ([word,]) - List of words remaining in the sentence
  • current_states ([([tag, ],prob), ]) - List of possible tag combinations for the sentence so far, and the probability associated with each tag combination

_cmp_tup(self, (_hq, p1), (_h2, p2))

Comparison function.

Parameters: each argument is a (_, prob) tuple