Package nltk :: Package corpus :: Package reader :: Module plaintext :: Class PlaintextCorpusReader

type PlaintextCorpusReader

      object --+    
               |    
api.CorpusReader --+
                   |
                  PlaintextCorpusReader
Known Subclasses:

Reader for corpora that consist of plaintext documents. Paragraphs are assumed to be separated by blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor.
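
The two conventions described above can be approximated with the standard library alone. The following is a minimal sketch, not NLTK's implementation: the helper names `split_paragraphs` and `tokenize_words` are hypothetical, and the regular expression is the `WordPunctTokenizer` pattern shown in the constructor signature.

```python
import re

# Pattern taken from the default word_tokenizer in the constructor signature:
# a run of word characters, or a run of non-word, non-space characters.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def split_paragraphs(text):
    """Hypothetical helper: split text into paragraphs at blank lines."""
    blocks = re.split(r"\n\s*\n", text)
    return [block.strip() for block in blocks if block.strip()]


def tokenize_words(text):
    """Hypothetical helper: split text into words and punctuation symbols."""
    return WORD_PUNCT.findall(text)


sample = "Hello, world!\nStill the first paragraph.\n\nA second paragraph."
paragraphs = split_paragraphs(sample)
tokens = [tokenize_words(paragraph) for paragraph in paragraphs]

print(paragraphs)
print(tokens)
```

The reader itself delegates these steps to the `para_block_reader` and `word_tokenizer` arguments, so either behaviour can be replaced at construction time.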

This corpus reader can be customized (e.g., to skip preface sections of specific document formats) by creating a subclass and overriding the CorpusView class variable.

Nested Classes
CorpusView
The corpus view class used by this reader.
Instance Methods
 
__init__(self, root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\\w+|[^\\w\\s]+', gaps=False, disc..., sent_tokenizer=nltk.data.LazyLoader('tokenizers/punkt/english.pickle'), para_block_reader=read_blankline_block, encoding=None)
Construct a new plaintext corpus reader for a set of documents located at the given root directory.
str
raw(self, fileids=None, sourced=False)
Returns: the given file(s) as a single string.
list of str
words(self, fileids=None, sourced=False)
Returns: the given file(s) as a list of words and punctuation symbols.
list of (list of str)
sents(self, fileids=None, sourced=False)
Returns: the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
list of (list of (list of str))
paras(self, fileids=None, sourced=False)
Returns: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
 
_read_word_block(self, stream)
 
_read_sent_block(self, stream)
 
_read_para_block(self, stream)

Inherited from api.CorpusReader: __repr__, abspath, abspaths, encoding, fileids, open, readme

Inherited from api.CorpusReader (private): _get_root

    Deprecated since 0.9.7

Inherited from api.CorpusReader: files

    Deprecated since 0.9.1

Inherited from api.CorpusReader: items

Inherited from api.CorpusReader (private): _get_items

Instance Variables

Inherited from api.CorpusReader (private): _encoding, _fileids, _root

Properties

Inherited from api.CorpusReader: root

Method Details

__init__(self, root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\\w+|[^\\w\\s]+', gaps=False, disc..., sent_tokenizer=nltk.data.LazyLoader('tokenizers/punkt/english.pickle'), para_block_reader=read_blankline_block, encoding=None)
(Constructor)

Construct a new plaintext corpus reader for a set of documents located at the given root directory. Example usage:

>>> from nltk.corpus.reader.plaintext import PlaintextCorpusReader
>>> root = '/...path to corpus.../'
>>> reader = PlaintextCorpusReader(root, r'.*\.txt')
Parameters:
  • root - The root directory for this corpus.
  • fileids - A list or regexp specifying the fileids in this corpus.
  • word_tokenizer - Tokenizer for breaking sentences or paragraphs into words.
  • sent_tokenizer - Tokenizer for breaking paragraphs into sentences.
  • para_block_reader - The block reader used to divide the corpus into paragraph blocks.
Overrides: api.CorpusReader.__init__
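
As a fuller illustration of the constructor parameters, the sketch below builds a throwaway corpus in a temporary directory and reads it back. It assumes NLTK is installed; the file name `doc.txt` and the sample text are made up for the example, and `LineTokenizer` is passed as `sent_tokenizer` only so that the example does not depend on the Punkt model being downloaded (it treats each line as one sentence, which is not what the default tokenizer does).

```python
import os
import tempfile

from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.tokenize import LineTokenizer

# Illustrative sample: two paragraphs separated by a blank line.
text = "Hello, world!\nSecond line here.\n\nNew paragraph.\n"

with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "doc.txt"), "w", encoding="utf8") as f:
        f.write(text)

    reader = PlaintextCorpusReader(
        root,
        r".*\.txt",                      # regexp selecting the fileids
        sent_tokenizer=LineTokenizer(),  # assumption: one sentence per line
        encoding="utf8",
    )

    # Materialize the results while the temporary files still exist.
    fileids = reader.fileids()
    words = list(reader.words())
    sents = [list(sent) for sent in reader.sents()]
    paras = [[list(sent) for sent in para] for para in reader.paras()]

print(fileids)
print(words)
print(paras)
```

Leaving `word_tokenizer` and `para_block_reader` at their defaults keeps the word splitting and blank-line paragraph detection described for this class.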

raw(self, fileids=None, sourced=False)

Returns: str
the given file(s) as a single string.

words(self, fileids=None, sourced=False)

Returns: list of str
the given file(s) as a list of words and punctuation symbols.

sents(self, fileids=None, sourced=False)

Returns: list of (list of str)
the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.

paras(self, fileids=None, sourced=False)

Returns: list of (list of (list of str))
the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
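
The nesting of these return types can be made concrete with plain lists. The data below is hand-written for illustration and is not the output of any particular corpus; it only shows how a paragraph structure relates to the flatter sentence and word shapes.

```python
# Hand-written stand-in for a paras() result:
# list of paragraphs -> list of sentences -> list of word strings.
paras = [
    [["Hello", ",", "world", "!"], ["Second", "line", "."]],
    [["New", "paragraph", "."]],
]

# Dropping the paragraph level gives the shape documented for sents().
sents = [sent for para in paras for sent in para]

# Dropping the sentence level as well gives the shape documented for words().
words = [word for sent in sents for word in sent]

print(len(paras), len(sents), len(words))
```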