Package nltk :: Module text :: Class Text

type Text

object --+
         |
        Text

A wrapper around a sequence of simple (string) tokens, intended to support initial exploration of texts via the interactive console. Its methods perform a variety of analyses on the text's contexts (e.g., counting, concordancing, collocation discovery) and display the results. To write a program that makes use of these analyses, bypass the Text class and use the appropriate analysis function or class directly.

Texts are typically initialized from a given document or corpus. E.g.:

>>> import nltk
>>> moby = nltk.Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))

Instance Methods

__init__(self, tokens, name=None)
Create a Text object.

__getitem__(self, i)

__len__(self)

concordance(self, word, width=79, lines=25)
Print a concordance for word with the specified context window.

collocations(self, num=20, window_size=2)
Print collocations derived from the text, ignoring stopwords.

count(self, word)
Count the number of times this word appears in the text.

index(self, word)
Find the index of the first occurrence of the word in the text.

readability(self, method)

generate(self, length=100)
Print random text, generated using a trigram language model.

search(self, pattern)
Search for instances of the regular expression pattern in the text.

similar(self, word, num=20)
Distributional similarity: find other words which appear in the same contexts as the specified word; list most similar words first.

common_contexts(self, words, num=20)
Find contexts where the specified words appear; list most frequent common contexts first.

dispersion_plot(self, words)
Produce a plot showing the distribution of the words through the text.

plot(self, *args)
See documentation for FreqDist.plot()

vocab(self)

findall(self, regexp)
Find instances of the regular expression in the text.

_context(self, tokens, i)
One left & one right token, both case-normalized.

__repr__(self)
Returns (string): A string representation of this Text.

Class Variables
  _COPY_TOKENS = True
  _CONTEXT_RE = re.compile(r'\w+|[\.!\?]')

Method Details

__init__(self, tokens, name=None)
(Constructor)

Create a Text object.

Parameters:
  • tokens (sequence of str) - The source text.
Overrides: object.__init__

concordance(self, word, width=79, lines=25)

Print a concordance for word with the specified context window. Word matching is not case-sensitive.

See Also: ConcordanceIndex
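
To make the behaviour concrete, here is a minimal pure-Python sketch of a concordance with case-insensitive matching. It does not use NLTK's ConcordanceIndex; the function name `concordance_lines` and the way `width` is split between left and right context are assumptions made for illustration only.

```python
def concordance_lines(tokens, word, width=79, lines=25):
    """Return up to `lines` context strings, each roughly `width` characters wide."""
    # Characters of context on each side of the matched word (illustrative choice).
    half = max((width - len(word) - 2) // 2, 1)
    target = word.lower()
    results = []
    for i, tok in enumerate(tokens):
        if tok.lower() != target:  # matching ignores case
            continue
        left = " ".join(tokens[:i])[-half:]
        right = " ".join(tokens[i + 1:])[:half]
        # Right-align the left context so the matched words line up in a column.
        results.append(f"{left:>{half}} {tok} {right}")
        if len(results) == lines:
            break
    return results


tokens = "The whale swam . A WHALE is not a fish , said the whale hunter".split()
hits = concordance_lines(tokens, "whale", width=31)
for line in hits:
    print(line)
```

Both "whale" and "WHALE" are matched, and each matched word starts in the same column of its output line.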

collocations(self, num=20, window_size=2)

Print collocations derived from the text, ignoring stopwords.

Parameters:
  • num (int) - The maximum number of collocations to print.
  • window_size (int) - The number of tokens spanned by a collocation (default=2)

See Also: find_collocations
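
As a rough illustration of the idea, the sketch below ranks adjacent word pairs by raw frequency after dropping stopwords. This is a simplification: it is not the scoring used by NLTK's collocation finder, and the name `frequent_bigrams` and the tiny stopword set are hypothetical.

```python
from collections import Counter

# Tiny illustrative stopword list; NLTK ships a much larger one.
STOPWORDS = {"the", "a", "of", "and", "in", "to"}


def frequent_bigrams(tokens, num=20):
    """Return the `num` most frequent adjacent word pairs, ignoring stopwords."""
    words = [t.lower() for t in tokens]
    pairs = Counter(
        (w1, w2)
        for w1, w2 in zip(words, words[1:])
        if w1.isalpha() and w2.isalpha()
        and w1 not in STOPWORDS and w2 not in STOPWORDS
    )
    return [pair for pair, _ in pairs.most_common(num)]


tokens = "the sperm whale and the sperm whale met the white whale in the sea".split()
top = frequent_bigrams(tokens, num=2)
print(top[0])
```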

generate(self, length=100)

Print random text, generated using a trigram language model.

Parameters:
  • length (int) - The length of text to generate (default=100)

See Also: NgramModel
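
The following is a minimal sketch of trigram-based generation in plain Python, assuming an unsmoothed model: each next token is sampled from the tokens that followed the previous two in the source. It is not NLTK's NgramModel, and `generate_trigram_text` and its `seed` parameter are hypothetical names.

```python
import random
from collections import defaultdict


def generate_trigram_text(tokens, length=100, seed=None):
    """Return up to `length` tokens sampled from an unsmoothed trigram model."""
    rng = random.Random(seed)
    # Map each pair of consecutive tokens to the tokens observed after it.
    followers = defaultdict(list)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        followers[(a, b)].append(c)
    if not followers:  # fewer than three tokens: nothing to model
        return list(tokens[:length])
    out = list(rng.choice(list(followers)))  # start from a random observed pair
    while len(out) < length:
        options = followers.get((out[-2], out[-1]))
        if not options:  # dead end: this pair only occurs at the end of the text
            break
        out.append(rng.choice(options))
    return out[:length]


tokens = "the whale saw the ship and the ship saw the whale and the sea".split()
sample = generate_trigram_text(tokens, length=10, seed=0)
print(" ".join(sample))
```

Every run of three consecutive tokens in the output also occurs in the source text, which is why such output tends to look locally plausible but globally incoherent.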

search(self, pattern)

Search for instances of the regular expression pattern in the text.

See Also: TokenSearcher

similar(self, word, num=20)

Distributional similarity: find other words which appear in the same contexts as the specified word; list most similar words first.

Parameters:
  • word (str) - The word used to seed the similarity search
  • num (int) - The number of words to generate (default=20)
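
One way to picture distributional similarity is the sketch below: record the (left, right) neighbours of every word, then rank other words by how many of those contexts they share with the seed word. This is a simplified stand-in for NLTK's ContextIndex, not its actual scoring; `similar_words`, `neighbour_context`, and the `*START*`/`*END*` markers are assumptions.

```python
from collections import Counter, defaultdict


def neighbour_context(tokens, i):
    """Return the lowercased (left, right) neighbours of tokens[i]."""
    left = tokens[i - 1].lower() if i > 0 else "*START*"
    right = tokens[i + 1].lower() if i < len(tokens) - 1 else "*END*"
    return (left, right)


def similar_words(tokens, word, num=20):
    """Rank words by the number of contexts they share with `word`."""
    contexts_of = defaultdict(set)
    for i, tok in enumerate(tokens):
        contexts_of[tok.lower()].add(neighbour_context(tokens, i))
    seed = contexts_of.get(word.lower(), set())
    scores = Counter()
    for other, contexts in contexts_of.items():
        shared = len(contexts & seed)
        if other != word.lower() and shared:
            scores[other] = shared
    return [w for w, _ in scores.most_common(num)]


tokens = "the cat sat on the mat while the dog sat on the rug".split()
print(similar_words(tokens, "cat"))
```

Here "dog" is the only word that also appears between "the" and "sat", so it is the only word reported as similar to "cat".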

common_contexts(self, words, num=20)

Find contexts where the specified words appear; list most frequent common contexts first.

Parameters:
  • words (sequence of str) - The words whose shared contexts are sought
  • num (int) - The maximum number of contexts to list (default=20)
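
A minimal sketch of the idea, assuming a context is the pair of immediate neighbours and that only contexts shared by every listed word are reported. The function name `shared_contexts` and the `left_right` output format are illustrative assumptions, not NLTK's implementation.

```python
from collections import Counter


def shared_contexts(tokens, words, num=20):
    """Return contexts in which every word in `words` occurs, most frequent first."""
    targets = [w.lower() for w in words]
    if not targets:
        return []
    seen = {w: Counter() for w in targets}
    # Skip the first and last token: they lack a left or right neighbour.
    for i in range(1, len(tokens) - 1):
        tok = tokens[i].lower()
        if tok in seen:
            seen[tok][(tokens[i - 1].lower(), tokens[i + 1].lower())] += 1
    # Keep only the contexts observed for all of the target words.
    common = set.intersection(*(set(c) for c in seen.values()))
    totals = Counter({ctx: sum(seen[w][ctx] for w in seen) for ctx in common})
    return [f"{left}_{right}" for (left, right), _ in totals.most_common(num)]


tokens = "a very big dog and a very small dog saw a very big cat".split()
contexts = shared_contexts(tokens, ["big", "small"])
print(contexts)
```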

dispersion_plot(self, words)

Produce a plot showing the distribution of the words through the text. Requires pylab to be installed.

Parameters:
  • words (list of str) - The words to be plotted
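
The data behind such a plot is simply the list of token offsets at which each word occurs. The sketch below computes those offsets without plotting anything; `dispersion_offsets` is a hypothetical helper, and the case-sensitive matching is an assumption.

```python
def dispersion_offsets(tokens, words):
    """Map each word to the token offsets at which it occurs (case-sensitive)."""
    offsets = {w: [] for w in words}
    for i, tok in enumerate(tokens):
        if tok in offsets:
            offsets[tok].append(i)
    return offsets


tokens = "the whale and the ship and the sea".split()
offsets = dispersion_offsets(tokens, ["whale", "sea", "harpoon"])
for word, positions in offsets.items():
    print(word, positions)
```

Each word would become one row of the plot, with a mark at every offset in its list.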

plot(self, *args)

See documentation for FreqDist.plot()

See Also: nltk.prob.FreqDist.plot()

vocab(self)

See Also: nltk.prob.FreqDist

findall(self, regexp)

Find instances of the regular expression in the text. The text is a list of tokens, and a regexp pattern to match a single token must be surrounded by angle brackets. E.g.

>>> text5.findall("<.*><.*><bro>")
you rule bro; telling you bro; u twizted bro
>>> text1.findall("<a>(<.*>)<man>")
monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave
>>> text9.findall("<th.*>{3,}")
thread through those; the thought that; that the thing; the thing
that; that that thing; through these than through; them that the;
through the thick; them that they; thought that the

Parameters:
  • regexp (str) - A regular expression
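
The angle-bracket notation can be understood as a rewrite into an ordinary regular expression over a string in which every token is wrapped in `<` and `>`. The sketch below shows one way to do that rewrite with the standard `re` module. It is written in the spirit of TokenSearcher but is not its code; `find_token_sequences` is a hypothetical name, and the sketch assumes the pattern has at most one capturing group.

```python
import re


def find_token_sequences(tokens, regexp):
    """Return matched token sequences for an angle-bracket token pattern."""
    # Encode the text as one string with each token wrapped in angle brackets.
    raw = "".join(f"<{t}>" for t in tokens)
    # Whitespace in the pattern is not significant.
    pattern = re.sub(r"\s", "", regexp)
    # Make each <...> a non-capturing group so quantifiers apply to whole tokens.
    pattern = pattern.replace("<", "(?:<(?:").replace(">", ")>)")
    # An unescaped '.' must not match across a token boundary.
    pattern = re.sub(r"(?<!\\)\.", "[^>]", pattern)
    hits = re.findall(pattern, raw)
    # Strip the outer brackets and split back into individual tokens.
    return [h[1:-1].split("><") for h in hits]


tokens = "he is a good man and a wise man".split()
print(find_token_sequences(tokens, "<a>(<.*>)<man>"))
print(find_token_sequences(tokens, "<a><.*><man>"))
```

With parentheses, only the bracketed part of each match is returned; without them, the whole matched token sequence is returned.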

_context(self, tokens, i)

One left & one right token, both case-normalized. Skip over non-sentence-final punctuation. Used by the ContextIndex that is created for similar() and common_contexts().
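
A sketch of how such a context function might behave, reusing the _CONTEXT_RE pattern listed under Class Variables: tokens that match neither a word nor sentence-final punctuation (for example a comma) are skipped when looking for neighbours. The standalone function `context` and the `*START*`/`*END*` boundary markers are assumptions for illustration.

```python
import re

# Same pattern as _CONTEXT_RE: a word, or sentence-final punctuation.
CONTEXT_RE = re.compile(r"\w+|[\.!\?]")


def context(tokens, i):
    """Return the lowercased (left, right) neighbours of tokens[i]."""
    # Walk left past tokens such as ',' that do not match the pattern.
    j = i - 1
    while j >= 0 and not CONTEXT_RE.match(tokens[j]):
        j -= 1
    left = tokens[j].lower() if j >= 0 else "*START*"
    # Walk right in the same way.
    k = i + 1
    while k < len(tokens) and not CONTEXT_RE.match(tokens[k]):
        k += 1
    right = tokens[k].lower() if k < len(tokens) else "*END*"
    return (left, right)


tokens = ["Call", "me", ",", "Ishmael", "."]
print(context(tokens, 1))  # the comma is skipped on the right
print(context(tokens, 3))  # the comma is skipped on the left; '.' is kept
```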

__repr__(self)
(Representation operator)

Returns: string
A string representation of this Text.
Overrides: object.__repr__