type KNBCorpusReader
object --> api.CorpusReader --> api.SyntaxCorpusReader --> KNBCorpusReader
This class implements:
- L{__init__}, which specifies the location of the corpus
and a method for detecting the sentence blocks in corpus files.
- L{_read_block}, which reads a block from the input stream.
- L{_word}, which takes a block and returns a list of lists of words.
- L{_tag}, which takes a block and returns a list of lists of tagged
words.
- L{_parse}, which takes a block and returns a list of parsed
sentences.
The structure of tagged words:
tagged_word = (word(str), tags(tuple))
tags = (surface, reading, lemma, pos1, posid1, pos2, posid2, pos3, posid3, others ...)
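The tuple layout above can be unpacked as in the following minimal sketch. The helper name split_tags, the TAG_FIELDS constant, and the sample values are hypothetical placeholders rather than real corpus output; only the field order is taken from the description.

```python
# Field names in the order given by the tagged-word description.
TAG_FIELDS = (
    "surface", "reading", "lemma",
    "pos1", "posid1", "pos2", "posid2", "pos3", "posid3",
)


def split_tags(tagged_word):
    """Split a (word, tags) pair into the word, named fields, and the rest.

    Hypothetical helper: assumes `tags` is a tuple laid out as documented.
    """
    word, tags = tagged_word
    named = dict(zip(TAG_FIELDS, tags))   # first nine documented fields
    others = tags[len(TAG_FIELDS):]       # the trailing "others ..." part
    return word, named, others


# Placeholder values only; real entries come from the corpus files.
sample = ("WORD", ("WORD", "READING", "LEMMA",
                   "POS1", "1", "POS2", "2", "POS3", "3", "X", "Y"))
word, named, others = split_tags(sample)
```

Keeping the trailing fields separate avoids guessing at names for the "others" part, which the description leaves open.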
__init__(self,
root,
fileids,
encoding=None,
morphs2str=<function <lambda> at 0x19089b0>)
Initialize KNBCorpusReader. morphs2str is a function that converts a
morph list to a str for the tree representation built by _parse().
Inherited from api.SyntaxCorpusReader:
parsed_sents,
raw,
sents,
tagged_sents,
tagged_words,
words
Inherited from api.CorpusReader:
__repr__,
abspath,
abspaths,
encoding,
fileids,
open,
readme
Inherited from api.SyntaxCorpusReader:
parsed,
read,
tagged,
tokenized
Inherited from api.CorpusReader:
files
Inherited from api.CorpusReader:
items
__init__(self,
root,
fileids,
encoding=None,
morphs2str=<function <lambda> at 0x19089b0>)
(Constructor)
Initialize KNBCorpusReader. morphs2str is a function that converts a
morph list to a str for the tree representation built by _parse().
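As an illustration of the kind of function morphs2str accepts, here is a minimal sketch. It assumes that each morph is a sequence whose first element is the surface form and that an "EOS" entry marks the end of a sentence; neither assumption is stated in this page, and join_surfaces is a hypothetical name, not the reader's actual default.

```python
def join_surfaces(morphs, sep="/"):
    """Render a morph list as one string for use as a tree leaf label.

    Assumption: morph[0] is the surface form, and "EOS" is a sentence
    terminator that should not appear in the label.
    """
    return sep.join(m[0] for m in morphs if m[0] != "EOS")


# Placeholder morphs, not real corpus data.
label = join_surfaces([("a", ("x",)), ("b", ("y",)), ("EOS", ())])

# Such a function would be passed through the documented keyword argument:
#   KNBCorpusReader(root, fileids, morphs2str=join_surfaces)
```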
- Parameters:
root - A path pointer identifying the root directory for this corpus.
If a string is specified, then it will be converted to a PathPointer automatically.
fileids - A list of the files that make up this corpus. This list can
either be specified explicitly, as a list of strings; or
implicitly, as a regular expression over file paths. The
absolute path for each file will be constructed by joining the
reader's root to each file name.
encoding - The default unicode encoding for the files that make up the
corpus. encoding's value can be any of the following:
- A string: encoding is the encoding name for all files.
- A dictionary: encoding[file_id] is the encoding name for the file
whose identifier is file_id. If file_id is not in encoding, then the
file contents will be processed using non-unicode byte strings.
- A list: encoding should be a list of (regexp, encoding) tuples. The
encoding for a file whose identifier is file_id will be the encoding
value for the first tuple whose regexp matches the file_id. If no
tuple's regexp matches the file_id, the file contents will be
processed using non-unicode byte strings.
- None: the file contents of all files will be processed using
non-unicode byte strings.
tag_mapping_function - A function for normalizing or simplifying the POS tags returned
by the tagged_words() or tagged_sents() methods.
- Overrides:
api.CorpusReader.__init__
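The accepted forms of the encoding parameter can be summarized with the sketch below. resolve_encoding is a hypothetical helper written for illustration and is not part of the reader's API. It assumes that regexps are matched from the start of the file identifier with re.match, and it returns None in the cases where the description says the contents are processed as non-unicode byte strings.

```python
import re


def resolve_encoding(encoding, file_id):
    """Return the encoding name for file_id, or None for raw byte strings."""
    # None, or a single string that applies to every file.
    if encoding is None or isinstance(encoding, str):
        return encoding
    # Dictionary keyed by file identifier; a missing key means byte strings.
    if isinstance(encoding, dict):
        return encoding.get(file_id)
    # List of (regexp, encoding) tuples; the first matching regexp wins.
    for regexp, name in encoding:
        if re.match(regexp, file_id):
            return name
    return None


# Hypothetical file identifiers and rules, for illustration only.
rules = [(r".*\.txt", "utf-8"), (r"legacy/.*", "euc-jp")]
chosen = resolve_encoding(rules, "legacy/a.txt")  # first rule matches first
```

Because the list form is order-sensitive, more specific patterns need to come before general ones.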