Package nltk :: Package corpus :: Package reader :: Module aligned :: Class AlignedCorpusReader

type AlignedCorpusReader

      object --+    
               |    
api.CorpusReader --+
                   |
                  AlignedCorpusReader

Reader for corpora of word-aligned sentences. Tokens are assumed to be separated by whitespace. Sentences begin on separate lines.
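The tokenization assumptions above can be illustrated without NLTK. The following is a minimal sketch using only the standard re module; the sample text and variable names are hypothetical, and the code only imitates the newline and whitespace splitting patterns that appear in the constructor defaults rather than calling the NLTK tokenizers themselves.

```python
import re

# Hypothetical sample text: one sentence per line, tokens separated by whitespace.
text = "the house is small\ndas Haus ist klein\n"

# Split into sentences on newlines; this sketch drops empty pieces.
sentences = [s for s in re.split(r"\n", text) if s]

# Split each sentence into tokens on runs of whitespace; empty pieces are dropped too.
tokens = [[t for t in re.split(r"\s+", s) if t] for s in sentences]

print(sentences)  # ['the house is small', 'das Haus ist klein']
print(tokens)     # [['the', 'house', 'is', 'small'], ['das', 'Haus', 'ist', 'klein']]
```

Punctuation is only separated from words if the corpus text already puts whitespace around it, since the split is purely whitespace-based.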

Instance Methods
 
__init__(self, root, fileids, sep='/', word_tokenizer=WhitespaceTokenizer(pattern='\\s+', gaps=True, discard_empty=T..., sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, f..., alignedsent_block_reader=<function read_alignedsent_block at 0x132beb0>, encoding=None)
Construct a new Aligned Corpus reader for a set of documents located at the given root directory.
str
raw(self, fileids=None)
Returns: the given file(s) as a single string.
list of str
words(self, fileids=None)
Returns: the given file(s) as a list of words and punctuation symbols.
list of (list of str)
sents(self, fileids=None)
Returns: the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
list of AlignedSent
aligned_sents(self, fileids=None)
Returns: the given file(s) as a list of AlignedSent objects.

Inherited from api.CorpusReader: __repr__, abspath, abspaths, encoding, fileids, open, readme

Inherited from api.CorpusReader (private): _get_root

    Deprecated since 0.9.7

Inherited from api.CorpusReader: files

    Deprecated since 0.9.1

Inherited from api.CorpusReader: items

Inherited from api.CorpusReader (private): _get_items

Instance Variables

Inherited from api.CorpusReader (private): _encoding, _fileids, _root

Properties

Inherited from api.CorpusReader: root

Method Details

__init__(self, root, fileids, sep='/', word_tokenizer=WhitespaceTokenizer(pattern='\\s+', gaps=True, discard_empty=T..., sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, f..., alignedsent_block_reader=<function read_alignedsent_block at 0x132beb0>, encoding=None)
(Constructor)

Construct a new Aligned Corpus reader for a set of documents located at the given root directory. Example usage:

>>> from nltk.corpus.reader.aligned import AlignedCorpusReader
>>> root = '/...path to corpus.../'
>>> reader = AlignedCorpusReader(root, r'.*\.txt')
Parameters:
  • root - The root directory for this corpus.
  • fileids - A list or regexp specifying the fileids in this corpus.
Overrides: api.CorpusReader.__init__

raw(self, fileids=None)

Returns: str
the given file(s) as a single string.

words(self, fileids=None)

Returns: list of str
the given file(s) as a list of words and punctuation symbols.

sents(self, fileids=None)

Returns: list of (list of str)
the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.

aligned_sents(self, fileids=None)

Returns: list of AlignedSent
the given file(s) as a list of AlignedSent objects.
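
The four access methods can be compared side by side on a tiny corpus. The sketch below assumes a current NLTK release running on Python 3, which differs from the version documented here. The file name sample.txt, its contents, and the three-line block layout (source sentence, target sentence, then "i-j" index pairs) are assumptions made for illustration; check them against your own data before relying on them.

```python
import os
import tempfile

from nltk.corpus.reader.aligned import AlignedCorpusReader

# Hypothetical one-block corpus: source sentence, target sentence, then
# "i-j" index pairs (an assumed layout, not taken from this page).
sample = "the house\ndas Haus\n0-0 1-1\n"

with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "sample.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(sample)

    reader = AlignedCorpusReader(root, r".*\.txt", encoding="utf-8")

    # Copy everything into plain lists while the temporary files still exist,
    # because the reader may load its data lazily.
    raw_text = reader.raw()
    words = list(reader.words())
    sents = [list(sent) for sent in reader.sents()]
    aligned = list(reader.aligned_sents())

first = aligned[0]
pairs = sorted(first.alignment)  # alignment as sorted (source, target) index pairs

print(words)
print(sents)
print(first.words, first.mots, pairs)
```

In this sketch, words() and sents() expose only the first sentence of each block, while aligned_sents() carries both sentences together with the alignment between them.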