Package nltk :: Package tokenize :: Module treebank :: Class TreebankWordTokenizer

type TreebankWordTokenizer

    object --+    
             |    
api.TokenizerI --+
                 |
                TreebankWordTokenizer


A word tokenizer that tokenizes sentences using the conventions
used by the Penn Treebank.  Contractions, such as "can't", are
split into two tokens.  E.g.:

  - can't -> ca n't
  - he'll -> he 'll
  - weren't -> were n't

This tokenizer assumes that the text has already been segmented into
sentences.  Any periods -- apart from those at the end of a string --
are assumed to be part of the word they are attached to (e.g. for
abbreviations, etc.), and are not tokenized separately.
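
The period rule described above can be illustrated with a small standalone sketch. It is not NLTK's implementation: `split_final_period` is a hypothetical helper written only for this illustration, and it uses nothing but the standard `re` module.

```python
import re

def split_final_period(sentence):
    """Hypothetical helper (not part of NLTK): detach only a string-final period."""
    # A period at the very end of the string becomes its own token;
    # periods elsewhere (e.g. in "Mr.") stay attached to their word.
    return re.sub(r"\.\s*$", " .", sentence).split()

print(split_final_period("Mr. Smith arrived."))
```

Because only the final period is separated, text containing several sentences should be split into sentences before it is passed to a tokenizer that follows this convention.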

Instance Methods
 
tokenize(self, text)
Divide the given string into a list of substrings.

Inherited from api.TokenizerI: batch_span_tokenize, batch_tokenize, span_tokenize

Class Variables
  CONTRACTIONS2 = [re.compile(r'(?i)(.)(\'ll|\'re|\'ve|n\'t|\'s|...
  CONTRACTIONS3 = [re.compile(r'(?i)\b(Whad)(dd)(ya)\b'), re.com...

Method Details

tokenize(self, text)

Divide the given string into a list of substrings.

Returns:
list of str
Overrides: api.TokenizerI.tokenize
(inherited documentation)

Class Variable Details

CONTRACTIONS2

Value:
[re.compile(r'(?i)(.)(\'ll|\'re|\'ve|n\'t|\'s|\'m|\'d)\b'),
 re.compile(r'(?i)\b(can)(not)\b'),
 re.compile(r'(?i)\b(D)(\'ye)\b'),
 re.compile(r'(?i)\b(Gim)(me)\b'),
 re.compile(r'(?i)\b(Gon)(na)\b'),
 re.compile(r'(?i)\b(Got)(ta)\b'),
 re.compile(r'(?i)\b(Lem)(me)\b'),
 re.compile(r'(?i)\b(Mor)(\'n)\b'),
...

CONTRACTIONS3

Value:
[re.compile(r'(?i)\b(Whad)(dd)(ya)\b'),
 re.compile(r'(?i)\b(Wha)(t)(cha)\b')]
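
The following is a minimal sketch of how patterns like these could be applied, assuming each match is rewritten with a space between its capture groups. The function `split_contractions` is hypothetical and is not the library's `tokenize` method. Only entries listed on this page are copied, since the CONTRACTIONS2 value is truncated here.

```python
import re

# Copied from the values shown above; CONTRACTIONS2 is incomplete because
# the listing on this page is truncated.
CONTRACTIONS2 = [
    re.compile(r"(?i)(.)('ll|'re|'ve|n't|'s|'m|'d)\b"),
    re.compile(r"(?i)\b(can)(not)\b"),
    re.compile(r"(?i)\b(Gim)(me)\b"),
    re.compile(r"(?i)\b(Gon)(na)\b"),
]
CONTRACTIONS3 = [
    re.compile(r"(?i)\b(Whad)(dd)(ya)\b"),
    re.compile(r"(?i)\b(Wha)(t)(cha)\b"),
]

def split_contractions(text):
    """Hypothetical helper: insert spaces between the captured groups."""
    # Assumption: two-group patterns are rewritten as "group1 group2".
    for regexp in CONTRACTIONS2:
        text = regexp.sub(r"\1 \2", text)
    # Assumption: three-group patterns are rewritten as "group1 group2 group3".
    for regexp in CONTRACTIONS3:
        text = regexp.sub(r"\1 \2 \3", text)
    return text.split()

print(split_contractions("I can't believe he'll go"))
print(split_contractions("Whatcha gonna do"))
```

The `(?i)` flag makes each pattern case-insensitive, and the substitution keeps the original capitalization because it reuses the matched text rather than the literal in the pattern.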