Package nltk :: Package tokenize :: Module simple :: Class LineTokenizer

type LineTokenizer

    object --+    
             |    
api.TokenizerI --+
                 |
                LineTokenizer

A tokenizer that divides a string into substrings by treating each newline character as a separator. Blank-line handling is controlled by the constructor's blanklines parameter.

Instance Methods
 
__init__(self, blanklines='discard')
 
tokenize(self, s)
Divide the given string into a list of substrings.
 
span_tokenize(self, s)
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.

Inherited from api.TokenizerI: batch_span_tokenize, batch_tokenize

Method Details

__init__(self, blanklines='discard')
(Constructor)

Parameters:
  • blanklines - Indicates how blank lines should be handled. Valid values are:
    • 'discard': strip blank lines out of the token list before returning it. A line is considered blank if it contains only whitespace characters.
    • 'keep': leave all blank lines in the token list.
    • 'discard-eof': if the string ends with a newline, then do not generate a corresponding token '' after that newline.
Overrides: object.__init__
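
The three blanklines modes can be illustrated with a small pure-Python sketch. This is not NLTK's source: tokenize_lines is a hypothetical stand-in for LineTokenizer(blanklines).tokenize(s), written only from the parameter descriptions above and assuming the string is split on '\n'.

```python
def tokenize_lines(s, blanklines="discard"):
    """Hypothetical stand-in for LineTokenizer(blanklines).tokenize(s)."""
    if blanklines not in ("discard", "keep", "discard-eof"):
        raise ValueError("blanklines must be 'discard', 'keep', or 'discard-eof'")
    # Assumption: lines are obtained by splitting on single newline characters.
    lines = s.split("\n")
    if blanklines == "discard":
        # Drop every line that is empty or contains only whitespace.
        lines = [line for line in lines if line.strip()]
    elif blanklines == "discard-eof":
        # A trailing newline would otherwise produce a final '' token.
        if s.endswith("\n"):
            lines.pop()
    return lines


text = "first line\n\n   \nsecond line\n"

discard = tokenize_lines(text, "discard")
# ['first line', 'second line']

keep = tokenize_lines(text, "keep")
# ['first line', '', '   ', 'second line', '']

discard_eof = tokenize_lines(text, "discard-eof")
# ['first line', '', '   ', 'second line']
```

Note that 'discard-eof' removes only the empty token after a final newline; blank lines elsewhere in the string are kept.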

tokenize(self, s)

Divide the given string into a list of substrings.

Returns:
list of str
Overrides: api.TokenizerI.tokenize
(inherited documentation)

span_tokenize(self, s)

Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.

Returns:
iter of tuple of int
Overrides: api.TokenizerI.span_tokenize
(inherited documentation)
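
The offset contract, s[start_i:end_i] == token, can be sketched in plain Python as follows. The helper line_spans is hypothetical and is not NLTK's implementation; it assumes every newline-separated line is reported, blank lines included (comparable to blanklines='keep').

```python
def line_spans(s):
    """Yield (start_i, end_i) for each newline-separated line of s.

    Hypothetical illustration of the span contract; blank lines are kept.
    """
    start = 0
    while True:
        end = s.find("\n", start)
        if end == -1:
            # No further newline: the last line runs to the end of the string.
            yield (start, len(s))
            return
        yield (start, end)
        start = end + 1  # skip past the newline separator


text = "alpha\n\nbeta"
spans = list(line_spans(text))
# [(0, 5), (6, 6), (7, 11)]

# Slicing with each span recovers the corresponding token.
tokens = [text[start:end] for start, end in spans]
# ['alpha', '', 'beta']
```

Returning offsets rather than substrings lets a caller map each token back to its position in the original string.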