Package nltk :: Module text :: Class TokenSearcher

type TokenSearcher

object --+
         |
        TokenSearcher

A class that makes it easier to use regular expressions to search over tokenized strings. The list of tokens is converted to a single string in which each token is enclosed in angle brackets, e.g., '<the><window><is><still><open>'. The regular expression passed to the findall() method is then modified so that angle brackets act as non-capturing (nongrouping) parentheses in addition to matching the token boundaries, and so that '.' does not match an angle bracket.
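
The rewriting described above can be sketched with the standard `re` module alone. This is a minimal illustration of the idea, not the library's source: the helper names `mark_tokens` and `rewrite_pattern` are hypothetical, and the choice of `[^<>]` as the replacement for '.' is an assumption based on the description.

```python
import re

def mark_tokens(tokens):
    """Join tokens into one string, enclosing each token in angle brackets."""
    return "".join("<" + tok + ">" for tok in tokens)

def rewrite_pattern(regexp):
    """Turn a token-level pattern into an ordinary regular expression.

    Hypothetical helper for illustration only.
    """
    # '<' opens a non-capturing group and matches the start of a token.
    regexp = regexp.replace("<", "(?:<(?:")
    # '>' matches the end of a token and closes the non-capturing group.
    regexp = regexp.replace(">", ")>)")
    # An unescaped '.' must stay inside one token, so it may not match '<' or '>'.
    regexp = re.sub(r"(?<!\\)\.", "[^<>]", regexp)
    return regexp

raw = mark_tokens(["through", "the", "thick", "fog"])
pattern = rewrite_pattern("<th.*>{3,}")
match = re.search(pattern, raw)

print(raw)             # <through><the><thick><fog>
print(pattern)         # (?:<(?:th[^<>]*)>){3,}
print(match.group(0))  # <through><the><thick>
```

Because each bracketed token becomes its own non-capturing group, a quantifier such as `{3,}` written after `>` applies to the whole token rather than to the closing bracket.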

Instance Methods
 
__init__(self, tokens)
 
findall(self, regexp)
Find instances of the regular expression in the text.

Method Details

__init__(self, tokens)
(Constructor)

Overrides: object.__init__
(inherited documentation)

findall(self, regexp)

Find instances of the regular expression in the text. The text is a list of tokens, and a pattern that matches a single token must be enclosed in angle brackets. For example:

>>> ts.findall("<.*><.*><bro>")
you rule bro; telling you bro; u twizted bro
>>> ts.findall("<a>(<.*>)<man>")
monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave
>>> text9.findall("<th.*>{3,}")
thread through those; the thought that; that the thing; the thing
that; that that thing; through these than through; them that the;
through the thick; them that they; thought that the
Parameters:
  • regexp (str) - A regular expression
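
As a rough, self-contained sketch of how such a search might behave, the following uses only the standard `re` module. The function `find_token_sequences` is a hypothetical stand-in, not the library's implementation, and returning each hit as a list of tokens is an assumption made for the example.

```python
import re

def find_token_sequences(tokens, regexp):
    """Search a token list with a token-level pattern (illustrative only)."""
    # Mark every token with angle brackets and join into one string.
    raw = "".join("<" + tok + ">" for tok in tokens)
    # Angle brackets become non-capturing groups that also match boundaries.
    regexp = regexp.replace("<", "(?:<(?:").replace(">", ")>)")
    # An unescaped '.' must not match an angle bracket.
    regexp = re.sub(r"(?<!\\)\.", "[^<>]", regexp)
    hits = re.findall(regexp, raw)
    # Strip the outer brackets and split each hit back into tokens.
    return [hit[1:-1].split("><") for hit in hits]

tokens = "he is a wise man and she is a brave man".split()

print(find_token_sequences(tokens, "<a><.*><man>"))
# [['a', 'wise', 'man'], ['a', 'brave', 'man']]

print(find_token_sequences(tokens, "<a>(<.*>)<man>"))
# [['wise'], ['brave']]
```

Ordinary parentheses in the pattern still capture, which is why a pattern like "<a>(<.*>)<man>" yields only the token between "a" and "man" rather than the whole three-token match.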