
Package tokenize


Functions for tokenizing, i.e., dividing text strings into substrings.

Submodules

Classes
WhitespaceTokenizer
A tokenizer that divides a string into substrings by treating any sequence of whitespace characters as a separator.
SpaceTokenizer
A tokenizer that divides a string into substrings by treating any single space character as a separator.
LineTokenizer
A tokenizer that divides a string into substrings by treating any single newline character as a separator.
TabTokenizer
A tokenizer that divides a string into substrings by treating any single tab character as a separator.
BlanklineTokenizer
A tokenizer that divides a string into substrings by treating any sequence of blank lines as a separator.
WordPunctTokenizer
A tokenizer that divides a text into sequences of alphabetic and non-alphabetic characters.
RegexpTokenizer
A tokenizer that splits a string into substrings using a regular expression.
PunktSentenceTokenizer
A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries.
PunktWordTokenizer
SExprTokenizer
A tokenizer that divides strings into s-expressions.
TreebankWordTokenizer
A word tokenizer that tokenizes sentences using the conventions used by the Penn Treebank.
TextTilingTokenizer
A section tokenizer based on the TextTiling algorithm.
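The simple string-splitting tokenizers above can be approximated with the standard-library re module. The following sketch (a stdlib stand-in, not NLTK's implementation) shows the behavior of WhitespaceTokenizer and BlanklineTokenizer:

```python
import re

def whitespace_tokenize(text):
    """Split on runs of whitespace, as WhitespaceTokenizer does."""
    stripped = text.strip()
    return re.split(r"\s+", stripped) if stripped else []

def blankline_split(text):
    """Split on sequences of blank lines, as BlanklineTokenizer does."""
    return [part for part in re.split(r"\n\s*\n", text) if part]

print(whitespace_tokenize("Good muffins\tcost $3.88\nin New York."))
print(blankline_split("First paragraph.\n\nSecond paragraph."))
```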
Functions
 
line_tokenize(text, blanklines='discard')
 
regexp_tokenize(text, pattern, gaps=False, discard_empty=True, flags=56)
Split the given text string based on the given regular expression pattern.
 
blankline_tokenize(text)
 
wordpunct_tokenize(text)
 
sexpr_tokenize(text)
Tokenize the text into s-expressions.
 
sent_tokenize(text)
Use NLTK's currently recommended sentence tokenizer to tokenize sentences in the given text.
 
word_tokenize(text)
Use NLTK's currently recommended word tokenizer to tokenize words in the given sentence.
Variables
  WordTokenizer
  Deprecated
  BLOCK_COMPARISON = 0
  DEFAULT_SMOOTHING = [0]
  HC = 1
  LC = 0
  VOCABULARY_INTRODUCTION = 1
Function Details

regexp_tokenize(text, pattern, gaps=False, discard_empty=True, flags=56)

Split the given text string based on the given regular expression pattern. See the documentation for RegexpTokenizer.tokenize() for descriptions of the arguments.
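The gaps and discard_empty semantics can be sketched with the standard re module. This is a simplified stand-in that ignores the flags parameter and RegexpTokenizer's other options:

```python
import re

def regexp_tokenize_sketch(text, pattern, gaps=False, discard_empty=True):
    """Approximate regexp_tokenize semantics with the stdlib re module."""
    if gaps:
        # gaps=True: the pattern matches the separators BETWEEN tokens.
        tokens = re.split(pattern, text)
        return [t for t in tokens if t] if discard_empty else tokens
    # gaps=False: the pattern matches the tokens themselves.
    return re.findall(pattern, text)

print(regexp_tokenize_sketch("Good muffins cost $3.88", r"\w+"))
print(regexp_tokenize_sketch("a-b-c", r"-", gaps=True))
```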

sexpr_tokenize(text)


Tokenize the text into s-expressions. For example:

>>> SExprTokenizer().tokenize('(a b (c d)) e f (g)')
['(a b (c d))', 'e', 'f', '(g)']

All parentheses are assumed to mark sexprs. In particular, no special processing is done to exclude parentheses that occur inside strings or that follow backslash characters.

If the given expression contains non-matching parentheses, the tokenizer's behavior depends on the strict parameter to the constructor. If strict is True, a ValueError is raised. If strict is False, any unmatched close parenthesis is listed as its own s-expression, and the final partial sexpr with unmatched open parentheses is listed as its own sexpr:

>>> SExprTokenizer(strict=False).tokenize('c) d) e (f (g')
['c', ')', 'd', ')', 'e', '(f (g']
Parameters:
  • text (string or iter(string)) - the string to be tokenized
Returns:
An iterator over tokens (each of which is an s-expression)

sent_tokenize(text)


Use NLTK's currently recommended sentence tokenizer to tokenize sentences in the given text. Currently, this uses PunktSentenceTokenizer.
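Punkt learns abbreviations, collocations, and sentence-starting words from raw text, so it is far more robust than any fixed rule. The splitting task itself can be illustrated with a crude regex stand-in; note this is not the Punkt algorithm and will mis-split after abbreviations such as "Dr.":

```python
import re

def naive_sent_tokenize(text):
    """Split after ., !, or ? when followed by whitespace and a capital
    letter. A crude stand-in: unlike Punkt, it has no abbreviation model."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())

print(naive_sent_tokenize("Hello there. How are you? Fine!"))
```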

word_tokenize(text)


Use NLTK's currently recommended word tokenizer to tokenize words in the given sentence. Currently, this uses TreebankWordTokenizer. This tokenizer should be fed a single sentence at a time.
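Two of the Penn Treebank conventions, separating punctuation from adjoining words and splitting English contractions, can be sketched as follows. The real TreebankWordTokenizer applies many more substitution rules than this minimal stand-in:

```python
import re

def treebank_sketch(sentence):
    """A minimal sketch of two Penn Treebank tokenization conventions."""
    # Split off negative contractions: "don't" -> "do", "n't".
    s = re.sub(r"n't\b", " n't", sentence)
    # Split off other common contractions: "they'll" -> "they", "'ll".
    s = re.sub(r"'(s|re|ve|ll|d|m)\b", r" '\1", s)
    # Separate punctuation from adjoining words.
    s = re.sub(r"([.,!?;:])", r" \1 ", s)
    return s.split()

print(treebank_sketch("They'll save and don't spend, right?"))
```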