Package nltk :: Package corpus :: Package reader :: Module knbc :: Class KNBCorpusReader

type KNBCorpusReader

        object --+        
                 |        
  api.CorpusReader --+    
                     |    
api.SyntaxCorpusReader --+
                         |
                        KNBCorpusReader


This class implements:
  - __init__, which specifies the location of the corpus
    and a method for detecting the sentence blocks in corpus files.
  - _read_block, which reads a block from the input stream.
  - _word, which takes a block and returns a list of lists of words.
  - _tag, which takes a block and returns a list of lists of tagged
    words.
  - _parse, which takes a block and returns a list of parsed
    sentences.
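
The division of labour between these hooks can be sketched with a toy reader. This is a simplified illustration, not the NLTK implementation: ToySyntaxReader and ToyReader are hypothetical names, the input format (whitespace-separated fields, one token per line, "EOS" closing a sentence) is an assumption, and each hook here handles a single sentence for brevity.

```python
import io


class ToySyntaxReader:
    """Hypothetical stand-in for a SyntaxCorpusReader-style base class."""

    def __init__(self, text):
        self._text = text

    def _blocks(self):
        # Keep calling _read_block until the stream is exhausted.
        stream = io.StringIO(self._text)
        while True:
            block = self._read_block(stream)
            if not block:
                break
            yield from block

    # The public methods are built entirely from the subclass hooks.
    def sents(self):
        return [self._word(t) for t in self._blocks()]

    def tagged_sents(self):
        return [self._tag(t) for t in self._blocks()]


class ToyReader(ToySyntaxReader):
    def _read_block(self, stream):
        # Collect lines up to an "EOS" marker; one block holds one sentence.
        lines = []
        for line in stream:
            line = line.strip()
            if line == "EOS":
                return ["\n".join(lines)]
            if line:
                lines.append(line)
        return ["\n".join(lines)] if lines else []

    def _word(self, t):
        # The first field of every line is taken as the word.
        return [line.split()[0] for line in t.splitlines()]

    def _tag(self, t):
        # Pair each word with a tuple of all fields on its line.
        fields = (line.split() for line in t.splitlines())
        return [(f[0], tuple(f)) for f in fields]


reader = ToyReader("a X\nb Y\nEOS\nc Z\nEOS\n")
print(reader.sents())
print(reader.tagged_sents())
```

The base class owns iteration and the public methods, so a subclass only has to say how to find a block and how to interpret it.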

The structure of tagged words:
  tagged_word = (word(str), tags(tuple))
  tags = (surface, reading, lemma, pos1, posid1, pos2, posid2, pos3, posid3, others ...)
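
As a rough illustration of this layout, the sketch below unpacks a tagged word into named fields. The helper name describe and the placeholder values are made up for the example; only the field order comes from the structure above.

```python
# Field order taken from the documented structure of tags.
FIELD_NAMES = (
    "surface", "reading", "lemma",
    "pos1", "posid1", "pos2", "posid2", "pos3", "posid3",
)


def describe(tagged_word):
    """Return (word, dict of named tag fields) for one tagged word."""
    word, tags = tagged_word
    named = dict(zip(FIELD_NAMES, tags))
    # Anything after the nine named fields is kept under "others".
    named["others"] = tuple(tags[len(FIELD_NAMES):])
    return word, named


# Placeholder values only; real corpus entries hold Japanese morphological data.
example = ("w", ("w", "r", "l", "p1", "1", "p2", "2", "p3", "3", "x", "y"))
word, named = describe(example)
print(word, named["pos1"], named["others"])
```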

Instance Methods
 
__init__(self, root, fileids, encoding=None, morphs2str=<lambda>)
Initialize KNBCorpusReader. morphs2str is a function that converts a list of morphs to a str for the tree representation built by _parse().
 
_read_block(self, stream)
 
_word(self, t)
 
_tag(self, t, simplify_tags=False)
 
_parse(self, t)

Inherited from api.SyntaxCorpusReader: parsed_sents, raw, sents, tagged_sents, tagged_words, words

Inherited from api.CorpusReader: __repr__, abspath, abspaths, encoding, fileids, open, readme

Inherited from api.CorpusReader (private): _get_root

    Block Readers
    Deprecated since 0.8

Inherited from api.SyntaxCorpusReader: parsed, read, tagged, tokenized

    Deprecated since 0.9.7

Inherited from api.CorpusReader: files

    Deprecated since 0.9.1

Inherited from api.CorpusReader: items

Inherited from api.CorpusReader (private): _get_items

Instance Variables

Inherited from api.CorpusReader (private): _encoding, _fileids, _root

Properties

Inherited from api.CorpusReader: root

Method Details

__init__(self, root, fileids, encoding=None, morphs2str=<lambda>)
(Constructor)

Initialize KNBCorpusReader. morphs2str is a function that converts a list of morphs to a str for the tree representation built by _parse().
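
One plausible shape for such a function is sketched below. The name join_surfaces, the "/" separator, and the assumption that each morph is a sequence whose first item is the surface form are all illustrative; they are not taken from the reader's source.

```python
def join_surfaces(morphs):
    """Join the surface form of each morph into one node label."""
    # Assumption: morph[0] is the surface form of the morph.
    return "/".join(m[0] for m in morphs)


# Placeholder morphs, each written as (surface, other fields...).
print(join_surfaces([("a", "x"), ("b", "y")]))
```

A function of this kind would be passed as the morphs2str argument when constructing the reader.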

Parameters:
  • root - A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.
  • fileids - A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader's root to each file name.
  • encoding - The default unicode encoding for the files that make up the corpus. encoding's value can be any of the following:
    • A string: encoding is the encoding name for all files.
    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.
    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple's regexp matches the file_id, the file contents will be processed using non-unicode byte strings.
    • None: the file contents of all files will be processed using non-unicode byte strings.
  • tag_mapping_function - A function for normalizing or simplifying the POS tags returned by the tagged_words() or tagged_sents() methods.
Overrides: api.CorpusReader.__init__
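
The four accepted forms of encoding can be summarised in a small lookup function. This is a sketch of the rules listed above, not the library's code: resolve_encoding is a hypothetical helper, and the use of re.match (which anchors at the start of the file identifier) is an assumption about what "matches" means.

```python
import re


def resolve_encoding(encoding, file_id):
    """Return the encoding name for file_id, or None for byte strings."""
    # None or a plain string applies to every file.
    if encoding is None or isinstance(encoding, str):
        return encoding
    # A dictionary maps file identifiers to encoding names.
    if isinstance(encoding, dict):
        return encoding.get(file_id)
    # Otherwise treat it as a list of (regexp, encoding) tuples; first match wins.
    for regexp, name in encoding:
        if re.match(regexp, file_id):
            return name
    return None


rules = [(r".*\.txt", "utf-8"), (r".*", "latin-1")]
print(resolve_encoding(rules, "a.txt"))
print(resolve_encoding({"a.txt": "euc-jp"}, "b.txt"))
```

A return value of None corresponds to the documented fallback of processing the file contents as non-unicode byte strings.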

_read_block(self, stream)

Overrides: api.SyntaxCorpusReader._read_block

_word(self, t)

Overrides: api.SyntaxCorpusReader._word

_tag(self, t, simplify_tags=False)

Overrides: api.SyntaxCorpusReader._tag

_parse(self, t)

Overrides: api.SyntaxCorpusReader._parse