Package nltk :: Package corpus :: Package reader :: Module ipipan :: Class IPIPANCorpusReader

type IPIPANCorpusReader

      object --+    
               |    
api.CorpusReader --+
                   |
                  IPIPANCorpusReader

Corpus reader designed to work with the corpus created by IPI PAN.
See http://korpus.pl/en/ for more details about the IPI PAN corpus.

The corpus includes information about text domain, channel and categories.
You can list the possible values using ipipan.domains(), ipipan.channels() and
ipipan.categories(). You can also use this metadata to filter files, e.g.:
    ipipan.fileids(channels='prasa')
    ipipan.fileids(categories='publicystyczny')

The reader supports the methods words, sents and paras, and their tagged
versions. To get only the part of speech instead of the full tag, pass the
parameter "simplify_tags=True", e.g.:
    ipipan.tagged_sents(simplify_tags=True)
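
The effect of simplifying a tag can be sketched in plain Python. This is an
illustration only: it assumes tags are colon-separated with the part of speech
in the first field, and the helper name simplify_tag and the sample tags are
hypothetical, not part of the reader's API.

```python
def simplify_tag(tag):
    """Keep only the first colon-separated field of a tag (assumed to be the POS)."""
    return tag.split(":")[0]


# Hypothetical sample of (word, full_tag) pairs, not taken from the corpus.
tagged = [("kota", "subst:sg:acc:m2"), (".", "interp")]
simplified = [(word, simplify_tag(tag)) for word, tag in tagged]
print(simplified)
```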

You can also get all disambiguated tags, rather than a single one, by
specifying the parameter "one_tag=False", e.g.:
    ipipan.tagged_paras(one_tag=False)

You can get all tags that were assigned by the morphological analyzer by
specifying the parameter "disamb_only=False", e.g.:
    ipipan.tagged_words(disamb_only=False)

The IPI PAN corpus contains tags indicating whether there is a space between two
tokens. To add special "no space" markers, specify the parameter
"append_no_space=True", e.g.:
    ipipan.tagged_words(append_no_space=True)
As a result, wherever there should be no space between two tokens, a new
pair ('', 'no-space') is inserted (for tagged data), or just '' for
methods without tags.
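
One way to use the '' markers is to rebuild running text from an untagged token
list. The following is a minimal sketch, assuming the list has the shape
described above; the function name join_tokens and the sample tokens are
hypothetical.

```python
def join_tokens(tokens):
    """Join tokens with single spaces, except where a '' marker says "no space"."""
    pieces = []
    glue = False  # True when the previous item was a no-space marker
    for token in tokens:
        if token == "":
            glue = True
            continue
        if pieces and not glue:
            pieces.append(" ")
        pieces.append(token)
        glue = False
    return "".join(pieces)


# Hypothetical token list in the shape of words(append_no_space=True) output.
sample = ["Ala", "ma", "kota", "", "."]
text = join_tokens(sample)
print(text)
```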

The corpus reader can also try to insert spaces between words. To enable this
option, specify the parameter "append_space=True", e.g.:
    ipipan.words(append_space=True)
As a result, either ' ' or (' ', 'space') is inserted between tokens.
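
If downstream code needs only real tokens, the marker pairs can be filtered out
of tagged output again. A minimal sketch, assuming marker pairs carry exactly
the tags 'space' and 'no-space' as described above; strip_markers and the
sample data are hypothetical.

```python
MARKER_TAGS = {"space", "no-space"}


def strip_markers(tagged_tokens):
    """Drop (' ', 'space') and ('', 'no-space') pairs, keeping real tokens."""
    return [(word, tag) for word, tag in tagged_tokens if tag not in MARKER_TAGS]


# Hypothetical tagged tokens that include both kinds of marker pairs.
sample = [
    ("Ala", "subst:sg:nom:f"),
    (" ", "space"),
    ("śpi", "fin:sg:ter:imperf"),
    ("", "no-space"),
    (".", "interp"),
]
tokens_only = strip_markers(sample)
print(tokens_only)
```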

By default, XML entities like &quot; and &amp; are replaced by the corresponding
characters. You can turn off this feature by specifying the parameter
"replace_xmlentities=False", e.g.:
    ipipan.words(replace_xmlentities=False)
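
To illustrate what entity replacement means, the sketch below decodes a string
with the standard library. It shows the general technique only; the reader's
own replacement code may differ, and the sample string is made up.

```python
from xml.sax.saxutils import unescape

# unescape() handles &amp;, &lt; and &gt; by default; &quot; is passed explicitly.
raw = "&quot;Tom &amp; Jerry&quot;"
decoded = unescape(raw, {"&quot;": '"'})
print(decoded)
```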

Instance Methods

__init__(self, root, fileids)

raw(self, fileids=None)

channels(self, fileids=None)

domains(self, fileids=None)

categories(self, fileids=None)

fileids(self, channels=None, domains=None, categories=None)
Return a list of file identifiers for the files that make up this corpus.

sents(self, fileids=None, **kwargs)

paras(self, fileids=None, **kwargs)

words(self, fileids=None, **kwargs)

tagged_sents(self, fileids=None, **kwargs)

tagged_paras(self, fileids=None, **kwargs)

tagged_words(self, fileids=None, **kwargs)

_list_morph_files(self, fileids)

_list_header_files(self, fileids)

_parse_header(self, fileids, tag)

_list_morph_files_by(self, tag, values, map=None)

_get_tag(self, f, tag)

_map_category(self, cat)

_view(self, filename, **kwargs)

Inherited from api.CorpusReader: __repr__, abspath, abspaths, encoding, open, readme

Inherited from api.CorpusReader (private): _get_root

    Deprecated since 0.9.7

Inherited from api.CorpusReader: files

    Deprecated since 0.9.1

Inherited from api.CorpusReader: items

Inherited from api.CorpusReader (private): _get_items

Instance Variables

Inherited from api.CorpusReader (private): _encoding, _fileids, _root

Properties

Inherited from api.CorpusReader: root

Method Details

__init__(self, root, fileids)
(Constructor)

Parameters:
  • root - A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.
  • fileids - A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader's root to each file name.
  • encoding - The default unicode encoding for the files that make up the corpus. encoding's value can be any of the following:
    • A string: encoding is the encoding name for all files.
    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.
    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple's regexp matches the file_id, the file contents will be processed using non-unicode byte strings.
    • None: the file contents of all files will be processed using non-unicode byte strings.
  • tag_mapping_function - A function for normalizing or simplifying the POS tags returned by the tagged_words() or tagged_sents() methods.
Overrides: api.CorpusReader.__init__
(inherited documentation)
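
The lookup rules listed for the encoding parameter can be sketched as a small
function. This is an illustration under stated assumptions, not the library's
implementation: resolve_encoding is a hypothetical name, re.match is assumed
for the regexp test, and a result of None stands for "process as byte strings".

```python
import re


def resolve_encoding(encoding, file_id):
    """Return the encoding name for file_id, or None for raw byte strings."""
    if encoding is None or isinstance(encoding, str):
        # One setting applies to every file.
        return encoding
    if isinstance(encoding, dict):
        # Per-file mapping; files missing from it fall back to None.
        return encoding.get(file_id)
    # Otherwise a list of (regexp, encoding) tuples: the first match wins.
    for regexp, name in encoding:
        if re.match(regexp, file_id):
            return name
    return None


# Hypothetical file identifiers and rules, for illustration only.
rules = [(r".*\.xml$", "utf-8"), (r"old/.*", "iso-8859-2")]
print(resolve_encoding(rules, "old/a.xml"))
```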

fileids(self, channels=None, domains=None, categories=None)

Return a list of file identifiers for the files that make up this corpus.

Overrides: api.CorpusReader.fileids
(inherited documentation)

sents(self, fileids=None, **kwargs)

Decorators:
  • @_parse_args

paras(self, fileids=None, **kwargs)

Decorators:
  • @_parse_args

words(self, fileids=None, **kwargs)

Decorators:
  • @_parse_args

tagged_sents(self, fileids=None, **kwargs)

Decorators:
  • @_parse_args

tagged_paras(self, fileids=None, **kwargs)

Decorators:
  • @_parse_args

tagged_words(self, fileids=None, **kwargs)

Decorators:
  • @_parse_args