__init__(self, root, fileids, chunk_types, encoding=None, tag_mapping_function=None)
(Constructor)
- Parameters:
root - A path pointer identifying the root directory for this corpus.
If a string is specified, then it will be converted to a PathPointer automatically.
fileids - A list of the files that make up this corpus. This list can
either be specified explicitly, as a list of strings; or
implicitly, as a regular expression over file paths. The
absolute path for each file will be constructed by joining the
reader's root to each file name.
encoding - The default unicode encoding for the files that make up the
corpus. encoding's value can be any of the following:
- A string: encoding is the encoding name for all files.
- A dictionary: encoding[file_id] is the encoding name for the file
whose identifier is file_id. If file_id is not in encoding, then the
file contents will be processed using non-unicode byte strings.
- A list: encoding should be a list of (regexp, encoding) tuples. The
encoding for a file whose identifier is file_id will be the encoding
value for the first tuple whose regexp matches the file_id. If no
tuple's regexp matches the file_id, the file contents will be
processed using non-unicode byte strings.
- None: the file contents of all files will be processed using
non-unicode byte strings.
tag_mapping_function - A function for normalizing or simplifying the POS tags returned
by the tagged_words() or tagged_sents() methods.
- Overrides:
api.CorpusReader.__init__
- (inherited documentation)
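The four accepted forms of the encoding parameter can be illustrated with a small stand-alone sketch. The helper name resolve_encoding is hypothetical and is not part of the library. The use of re.match (which anchors at the start of the file identifier) is an assumption, since the text above only says the regexp "matches".

```python
import re

def resolve_encoding(encoding, file_id):
    """Return the encoding name for file_id, or None for raw byte strings.

    Hypothetical helper that mirrors the documented rules; it is not
    the library's own implementation.
    """
    if isinstance(encoding, dict):
        # Dictionary: direct lookup; a missing file_id means byte strings.
        return encoding.get(file_id)
    if isinstance(encoding, list):
        # List: the first (regexp, encoding) tuple whose regexp matches wins.
        # Assumption: matching is anchored at the start, as with re.match.
        for regexp, name in encoding:
            if re.match(regexp, file_id):
                return name
        return None
    # A string applies to every file; None means byte strings for all files.
    return encoding
```

With a list, order matters: in [(r".*\.txt", "utf8"), (r".*", "ascii")] a file named a.txt resolves to "utf8" because the first matching tuple is used.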
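The two ways of specifying fileids, together with the joining of each name to the root, can be sketched as follows. The helper name resolve_fileids is hypothetical. Treating a string argument as a regular expression that must match the whole root-relative path is also an assumption, as the text above does not say how the regular expression is anchored.

```python
import os
import re

def resolve_fileids(root, fileids):
    """Return (file_id, absolute_path) pairs for a corpus.

    Hypothetical helper illustrating the documented behaviour; it is
    not the library's own implementation.
    """
    if isinstance(fileids, str):
        # Implicit form: a regular expression over paths relative to root.
        # Assumption: the expression must match the entire relative path.
        pattern = re.compile(fileids)
        found = []
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                rel = os.path.relpath(os.path.join(dirpath, name), root)
                rel = rel.replace(os.sep, "/")
                if pattern.fullmatch(rel):
                    found.append(rel)
        fileids = sorted(found)
    # Explicit form (or the expanded regexp): join the root to each name.
    return [(file_id, os.path.join(root, file_id)) for file_id in fileids]
```

For example, resolve_fileids(root, ["train.txt"]) pairs the single named file with its path under root, while resolve_fileids(root, r".*\.txt") first scans root for matching files.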