Package nltk :: Package tokenize :: Module regexp :: Class WhitespaceTokenizer

type WhitespaceTokenizer


    object --+        
             |        
api.TokenizerI --+    
                 |    
   RegexpTokenizer --+
                     |
                    WhitespaceTokenizer

A tokenizer that divides a string into substrings by treating any sequence of whitespace characters as a separator. Whitespace characters are space (' '), tab ('\t'), and newline ('\n'). If you are performing the tokenization yourself (rather than building a tokenizer to pass to some other piece of code), consider using the string split() method instead:

>>> words = s.split()
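The same whitespace-splitting behavior can be sketched with the standard library alone; this emulates what the tokenizer does (split on runs of whitespace, discard empty tokens), using a hypothetical sample string:

```python
import re

# Hypothetical sample text (not from the original docs).
s = "Good muffins cost $3.88\nin New York."

# WhitespaceTokenizer treats any run of whitespace as a separator and
# discards empty tokens; re.split with r"\s+" does essentially the same.
tokens = [t for t in re.split(r"\s+", s) if t]

# str.split() with no argument already behaves this way, which is why
# the docs recommend it when you control the tokenization yourself.
assert tokens == s.split()
print(tokens)
```

Note that `s.split()` also handles leading and trailing whitespace without producing empty strings, which is what makes it the simpler choice here.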
Instance Methods
 
__init__(self)
Construct a new tokenizer that splits strings using the given regular expression pattern.

Inherited from RegexpTokenizer: __repr__, span_tokenize, tokenize

Inherited from api.TokenizerI: batch_span_tokenize, batch_tokenize

Instance Variables

Inherited from RegexpTokenizer (private): _discard_empty, _flags, _gaps, _pattern, _regexp

Method Details

__init__(self) (Constructor)

Construct a new tokenizer that splits strings using the given regular expression pattern. By default, the pattern will be used to find tokens; but if gaps is set to True, then the pattern will be used to find separators between tokens instead.

Parameters:
  • pattern - The pattern used to build this tokenizer. This pattern may safely contain grouping parentheses.
  • gaps - True if this tokenizer's pattern should be used to find separators between tokens; False if this tokenizer's pattern should be used to find the tokens themselves.
  • discard_empty - True if any empty tokens ('') generated by the tokenizer should be discarded. Empty tokens can only be generated if gaps is True.
  • flags - The regexp flags used to compile this tokenizer's pattern. By default, the following flags are used: re.UNICODE | re.MULTILINE | re.DOTALL.
Overrides: RegexpTokenizer.__init__
(inherited documentation)
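The gaps parameter described above can be illustrated with the standard library: with gaps=True the pattern matches the separators between tokens (like re.split), while with gaps=False it matches the tokens themselves (like re.findall). This is a sketch of the distinction, not nltk's implementation; the sample string and patterns are hypothetical:

```python
import re

s = "one, two,three"

# gaps=True: the pattern matches separators; tokens are the text between
# matches. Empty tokens are discarded, as discard_empty=True would do.
gap_pattern = r"[,\s]+"
gap_tokens = [t for t in re.split(gap_pattern, s) if t]

# gaps=False: the pattern matches the tokens themselves.
token_pattern = r"\w+"
word_tokens = re.findall(token_pattern, s)

print(gap_tokens)   # both approaches yield the same tokens here
print(word_tokens)
```

For this input the two approaches agree, but they differ on strings containing characters matched by neither pattern, which is why the choice of gaps matters when building a RegexpTokenizer.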