| Home | Trees | Indices | Help |
|
|---|
|
|
object --+
|
api.TaggerI --+
|
TnT
TnT - Statistical POS tagger
IMPORTANT NOTES:
* DOES NOT AUTOMATICALLY DEAL WITH UNSEEN WORDS
It is possible to provide an untrained POS tagger to
create tags for unknown words, see __init__ function
* SHOULD BE USED WITH SENTENCE-DELIMITED INPUT
- Due to the nature of this tagger, it works best when
trained over sentence delimited input.
- However it still produces good results if the training
data and testing data are separated on all punctuation eg: [,.?!]
- Input for training is expected to be a list of sentences
where each sentence is a list of (word, tag) tuples
- Input for tag function is a single sentence
Input for tagdata function is a list of sentences
Output is of a similar form
* Function provided to process text that is unsegmented
- Please see basic_sent_chop()
TnT uses a second order Markov model to produce tags for
a sequence of input, specifically:
argmax [Proj(P(t_i|t_i-1,t_i-2)P(w_i|t_i))] P(t_T+1 | t_T)
IE: the maximum projection of a set of probabilities
The set of possible tags for a given word is derived
from the training data. It is the set of all tags
that exact word has been assigned.
The probability of a tag for a given word is the linear
interpolation of 3 markov models; a zero-order, first-order,
and a second order model.
P(t_i| t_i-1, t_i-2) = l1*P(t_i) + l2*P(t_i| t_i-1) +
l3*P(t_i| t_i-1, t_i-2)
A beam search is used to limit the memory usage of the algorithm.
The degree of the beam can be changed using N in the initialization.
N represents the maximum number of possible solutions to maintain
while tagging.
It is possible to differentiate the tags which are assigned to
capitalized words. However this does not result in a significant
gain in the accuracy of the results.
|
|||
|
|||
|
|||
|
|||
|
|||
|
|||
list of (token, tag)
|
|
||
|
|||
|
|||
|
Inherited from Inherited from |
|||
|
|||
Construct a TnT statistical tagger. Tagger must be trained before being used to tag input.
|
Uses a set of tagged data to train the tagger. If an unknown word tagger is specified, it is trained on the same data.
|
creates lambda values based upon training data NOTE: no need to explicitly reference C, it is contained within the tag variable :: tag == (tag,C) for each tag trigram (t1, t2, t3) depending on the maximum value of - f(t1,t2,t3)-1 / f(t1,t2)-1 - f(t2,t3)-1 / f(t2)-1 - f(t3)-1 / N-1 increment l3,l2, or l1 by f(t1,t2,t3) ISSUES -- Resolutions: if 2 values are equal, increment both lambda values by (f(t1,t2,t3) / 2) |
Tags each sentence in a list of sentences
|
Tags a single sentence
|
|
comparison function @params : (_, prob) |
| Home | Trees | Indices | Help |
|
|---|
| Generated by Epydoc 3.0.1 on Mon Apr 11 14:39:53 2011 | http://epydoc.sourceforge.net |