object --+
|
api.TokenizerI --+
|
TreebankWordTokenizer
A word tokenizer that tokenizes sentences using the conventions
used by the Penn Treebank. Contractions, such as "can't", are
split into two tokens. E.g.:
- can't -> ca n't
- he'll -> he 'll
- weren't -> were n't
This tokenizer assumes that the text has already been segmented into
sentences. Any period other than one at the end of the string is
assumed to be part of the word it is attached to (e.g. in an
abbreviation) and is not tokenized separately.
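
The conventions described above can be illustrated with a small standard-library sketch. This is not NLTK's implementation: the function name `sketch_treebank_split` and its three regular expressions are hypothetical and cover only the contraction and final-period rules mentioned in this description.

```python
import re


def sketch_treebank_split(sentence):
    """Hypothetical helper illustrating two Treebank-style conventions.

    Assumes `sentence` is a single, already-segmented sentence.
    """
    # Separate a period only when it ends the whole string; periods
    # inside the string (e.g. in "Mr.") stay attached to their word.
    sentence = re.sub(r"\.\s*$", " .", sentence)
    # Split "n't" off the preceding word: can't -> ca n't
    sentence = re.sub(r"(\w)(n't)\b", r"\1 \2", sentence, flags=re.IGNORECASE)
    # Split common clitics off the preceding word: he'll -> he 'll
    sentence = re.sub(
        r"(\w)('(?:ll|re|ve|s|m|d))\b", r"\1 \2", sentence, flags=re.IGNORECASE
    )
    # Whatever is now separated by whitespace is a token.
    return sentence.split()


print(sketch_treebank_split("Mr. Smith can't go."))
```

In this sketch the period in "Mr." stays attached to its word, while the sentence-final period becomes its own token.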
CONTRACTIONS2 =
CONTRACTIONS3 =

Divide the given string into a list of substrings.
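
A brief usage sketch follows, assuming the method documented here is `tokenize`, as declared by the `TokenizerI` interface, and that the class can be imported from `nltk.tokenize`. The wrapper `tokenize_sentence` is a hypothetical convenience that returns `None` when NLTK is not installed.

```python
try:
    from nltk.tokenize import TreebankWordTokenizer
except ImportError:  # NLTK is not installed in this environment
    TreebankWordTokenizer = None


def tokenize_sentence(sentence):
    """Hypothetical wrapper: return the list of tokens, or None without NLTK."""
    if TreebankWordTokenizer is None:
        return None
    # The input should be a single sentence; segment the text beforehand.
    return TreebankWordTokenizer().tokenize(sentence)


print(tokenize_sentence("They weren't ready."))
```

Because the tokenizer expects one sentence at a time, longer text would need to pass through a sentence segmenter before each sentence is handed to it.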
| Generated by Epydoc 3.0.1 on Mon Apr 11 14:39:53 2011 | http://epydoc.sourceforge.net |