Reader for Europarl corpora that consist of plaintext documents.
Documents are divided into chapters rather than into paragraphs, as regular
plaintext documents are. Chapters are separated by blank lines. Everything
is inherited from PlaintextCorpusReader except that:
chapters(self, fileids=None)
Returns: list of (list of (list of str))
    the given file(s) as a list of chapters, each encoded as a list of
    sentences, which are in turn encoded as lists of word strings.
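
The nested return structure described for chapters() can be illustrated with a small standalone sketch. This is not the reader's actual implementation: it assumes pre-tokenized input with one sentence per line and whitespace-separated tokens, and the names split_chapters and sample are hypothetical, introduced only for this example.

```python
def split_chapters(text):
    """Split text into chapters -> sentences -> word strings.

    Assumptions (for illustration only): chapters are separated by
    blank lines, each non-blank line is one sentence, and tokens are
    separated by whitespace.
    """
    chapters = []
    current = []  # sentences of the chapter currently being built
    for line in text.splitlines():
        if line.strip():
            # Non-blank line: treat it as one sentence of word strings.
            current.append(line.split())
        elif current:
            # Blank line: close the chapter collected so far.
            chapters.append(current)
            current = []
    if current:
        # Flush the last chapter if the text does not end with a blank line.
        chapters.append(current)
    return chapters


# Hypothetical sample text with two chapters.
sample = (
    "Resumption of the session\n"
    "I declare the session resumed .\n"
    "\n"
    "Agenda\n"
    "The next item is the vote .\n"
)

chapters = split_chapters(sample)
print(len(chapters))   # 2
print(chapters[0][1])  # ['I', 'declare', 'the', 'session', 'resumed', '.']
```

Indexing follows the nesting: chapters[i] is a chapter, chapters[i][j] a sentence, and chapters[i][j][k] a single word string.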
paras(self, fileids=None)
Returns: list of (list of (list of str))
    the given file(s) as a list of paragraphs, each encoded as a list of
    sentences, which are in turn encoded as lists of word strings.

Inherited from PlaintextCorpusReader: __init__, raw, sents, words
Inherited from api.CorpusReader: __repr__, abspath, abspaths, encoding, fileids, open, readme

Inherited from api.CorpusReader: files

Inherited from api.CorpusReader: items