Tokenizers
(See Chapter 3 of the NLTK book for more detailed information about tokenization.)
1 Overview
Tokenizers divide strings into lists of substrings. For example,
tokenizers can be used to find the list of sentences or words in a
string.
| |
>>> from nltk import word_tokenize, wordpunct_tokenize
>>> s = ("Good muffins cost $3.88\nin New York. Please buy me\n"
... "two of them.\n\nThanks.")
>>> word_tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
>>> wordpunct_tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
|
|
When tokenizing text containing Unicode characters, be sure to tokenize
the Unicode string, and not (say), the UTF8-encoded version:
| |
>>> wordpunct_tokenize("das ist ein t\xc3\xa4ller satz".decode('utf8'))
[u'das', u'ist', u'ein', u't\xe4ller', u'satz']
>>> wordpunct_tokenize("das ist ein t\xc3\xa4ller satz")
['das', 'ist', 'ein', 't\xc3', '\xa4', 'ller', 'satz']
|
|
There are numerous ways to tokenize text. If you need more control over
tokenization, see the methods described below.
2 Simple Tokenizers
The following tokenizers, defined in nltk.tokenize.simple, just
divide the string using the string split() method.
| |
>>> from nltk.tokenize import *
|
|
| |
>>>
>>> WhitespaceTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.',
'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.']
|
|
| |
>>>
>>> SpaceTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$3.88\nin', 'New', 'York.', '',
'Please', 'buy', 'me\ntwo', 'of', 'them.\n\nThanks.']
|
|
| |
>>>
>>> LineTokenizer(blanklines='keep').tokenize(s)
['Good muffins cost $3.88', 'in New York. Please buy me',
'two of them.', '', 'Thanks.']
|
|
| |
>>>
>>> LineTokenizer(blanklines='discard').tokenize(s)
['Good muffins cost $3.88', 'in New York. Please buy me',
'two of them.', 'Thanks.']
|
|
| |
>>>
>>> TabTokenizer().tokenize('a\tb c\n\t d')
['a', 'b c\n', ' d']
|
|
The simple tokenizers are not available as separate functions;
instead, you should just use the string split() method directly:
| |
>>> s.split()
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.',
'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.']
>>> s.split(' ')
['Good', 'muffins', 'cost', '$3.88\nin', 'New', 'York.', '',
'Please', 'buy', 'me\ntwo', 'of', 'them.\n\nThanks.']
>>> s.split('\n')
['Good muffins cost $3.88', 'in New York. Please buy me',
'two of them.', '', 'Thanks.']
|
|
The simple tokenizers are mainly useful because they follow the
standard TokenizerI interface, and so can be used with any code
that expects a tokenizer. For example, these tokenizers can be used
to specify the tokenization conventions when building a CorpusReader.
3 Regular Expression Tokenizers
RegexpTokenizer splits a string into substrings using a regular
expression. By default, any substrings matched by this regexp will be
returned as tokens. For example, the following tokenizer selects out
only capitalized words, and throws everything else away:
| |
>>> capword_tokenizer = RegexpTokenizer('[A-Z]\w+')
>>> capword_tokenizer.tokenize(s)
['Good', 'New', 'York', 'Please', 'Thanks']
|
|
The following tokenizer forms tokens out of alphabetic sequences,
money expressions, and any other non-whitespace sequences:
| |
>>> tokenizer = RegexpTokenizer('\w+|\$[\d\.]+|\S+')
>>> tokenizer.tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
|
|
RegexpTokenizers can be told to use their regexp pattern to match
separators between tokens, using gaps=True:
| |
>>> tokenizer = RegexpTokenizer('\s+', gaps=True)
>>> tokenizer.tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.',
'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.']
|
|
The nltk.tokenize.regexp module contains several subclasses of
RegexpTokenizer that use pre-defined regular expressions:
| |
>>>
>>> WordPunctTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York',
'.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
|
|
| |
>>>
>>> BlanklineTokenizer().tokenize(s)
['Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.',
'Thanks.']
|
|
All of the regular expression tokenizers are also available as simple
functions:
| |
>>> regexp_tokenize(s, pattern='\w+|\$[\d\.]+|\S+')
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
>>> wordpunct_tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York',
'.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
>>> blankline_tokenize(s)
['Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.',
'Thanks.']
|
|
Warning
The function regexp_tokenize() takes the text as its
first argument, and the regular expression pattern as its second
argument. This differs from the conventions used by Python's
re functions, where the pattern is always the first argument.
But regexp_tokenize() is primarily a tokenization function, so
we chose to follow the convention among other tokenization
functions that the text should always be the first argument.
4 Treebank Tokenizer
The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank.
This is the method that is invoked by nltk.word_tokenize(). It assumes that the
text has already been segmented into sentences.
| |
>>> TreebankWordTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
|
|
5 S-Expression Tokenizers
SExprTokenizer is used to find parenthasized expressions in a
string. In particular, it divides a string into a sequence of
substrings that are either parenthasized expressions (including any
nested parenthasized expressions), or other whitespace-separated
tokens.
| |
>>> SExprTokenizer().tokenize('(a b (c d)) e f (g)')
['(a b (c d))', 'e', 'f', '(g)']
|
|
By default, SExprTokenizer will raise a ValueError exception if
used to tokenize an expression with non-matching parenthases:
| |
>>> SExprTokenizer().tokenize('c) d) e (f (g')
Traceback (most recent call last):
...
ValueError: Un-matched close paren at char 1
|
|
But the strict argument can be set to False to allow for
non-matching parenthases. Any unmatched close parenthases will be
listed as their own s-expression; and the last partial sexpr with
unmatched open parenthases will be listed as its own sexpr:
| |
>>> SExprTokenizer(strict=False).tokenize('c) d) e (f (g')
['c', ')', 'd', ')', 'e', '(f (g']
|
|
The characters used for open and close parenthases may be customized
using the parens argument to the SExprTokenizer constructor:
| |
>>> SExprTokenizer(parens='{}').tokenize('{a b {c d}} e f {g}')
['{a b {c d}}', 'e', 'f', '{g}']
|
|
The s-expression tokenizer is also available as a function:
| |
>>> sexpr_tokenize('(a b (c d)) e f (g)')
['(a b (c d))', 'e', 'f', '(g)']
|
|
6 Punkt Tokenizer
The PunktSentenceTokenizer divides a text into a list of sentences,
by using an unsupervised algorithm to build a model for abbreviation
words, collocations, and words that start sentences. It must be
trained on a large collection of plaintext in the taret language
before it can be used. The algorithm for this tokenizer is described
in Kiss & Strunk (2006):
Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence
Boundary Detection. Computational Linguistics 32: 485-525.
The NLTK data package includes a pre-trained Punkt tokenizer for
English.
| |
>>> import nltk.data
>>> text = """
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries. And sometimes sentences
... can start with non-capitalized words. i is a good variable
... name.
... """
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print '\n-----\n'.join(sent_detector.tokenize(text.strip()))
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.
|
|
(Note that whitespace from the original text, including newlines, is
retained in the output.)
Punctuation following sentences can be included with the realign_boundaries
flag:
| |
>>> text = """
... (How does it deal with this parenthesis?) "It should be part of the
... previous sentence."
... """
>>> print '\n-----\n'.join(
... sent_detector.tokenize(text.strip(), realign_boundaries=True))
(How does it deal with this parenthesis?)
-----
"It should be part of the
previous sentence."
|
|
The nltk.tokenize.punkt module also defines PunktWordTokenizer,
which uses a regular expression to divide a text into tokens, leaving all
periods attached to words, but separating off other punctuation:
| |
>>> PunktWordTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.', 'Please',
'buy', 'me', 'two', 'of', 'them.', 'Thanks.']
|
|
7 Regression Tests: Regexp Tokenizer
Some additional test strings.
| |
>>> s = ("Good muffins cost $3.88\nin New York. Please buy me\n"
... "two of them.\n\nThanks.")
>>> s2 = ("Alas, it has not rained today. When, do you think, "
... "will it rain again?")
>>> s3 = ("<p>Although this is <b>not</b> the case here, we must "
... "not relax our vigilance!</p>")
|
|
| |
>>> print regexp_tokenize(s2, r'[,\.\?!"]\s*', gaps=False)
[', ', '. ', ', ', ', ', '?']
>>> print regexp_tokenize(s2, r'[,\.\?!"]\s*', gaps=True)
['Alas', 'it has not rained today', 'When', 'do you think',
'will it rain again']
|
|
Make sure that grouping parenthases don't confuse the tokenizer:
| |
>>> print regexp_tokenize(s3, r'</?(b|p)>', gaps=False)
['<p>', '<b>', '</b>', '</p>']
>>> print regexp_tokenize(s3, r'</?(b|p)>', gaps=True)
['Although this is ', 'not',
' the case here, we must not relax our vigilance!']
|
|
Make sure that named groups don't confuse the tokenizer:
| |
>>> print regexp_tokenize(s3, r'</?(?P<named>b|p)>', gaps=False)
['<p>', '<b>', '</b>', '</p>']
>>> print regexp_tokenize(s3, r'</?(?P<named>b|p)>', gaps=True)
['Although this is ', 'not',
' the case here, we must not relax our vigilance!']
|
|
Make sure that nested groups don't confuse the tokenizer:
| |
>>> print regexp_tokenize(s2, r'(h|r|l)a(s|(i|n0))', gaps=False)
['las', 'has', 'rai', 'rai']
>>> print regexp_tokenize(s2, r'(h|r|l)a(s|(i|n0))', gaps=True)
['A', ', it ', ' not ', 'ned today. When, do you think, will it ',
'n again?']
|
|
The tokenizer should reject any patterns with backreferences:
| |
>>> print regexp_tokenize(s2, r'(.)\1')
Traceback (most recent call last):
...
ValueError: Regular expressions with back-references are
not supported: '(.)\\1'
>>> print regexp_tokenize(s2, r'(?P<foo>)(?P=foo)')
Traceback (most recent call last):
...
ValueError: Regular expressions with back-references are
not supported: '(?P<foo>)(?P=foo)'
|
|
A simple sentence tokenizer '.(s+|$)'
| |
>>> print regexp_tokenize(s, pattern=r'\.(\s+|$)', gaps=True)
['Good muffins cost $3.88\nin New York',
'Please buy me\ntwo of them', 'Thanks']
|
|