      object --+
               |
api.CorpusReader --+
                   |
                  IPIPANCorpusReader
Corpus reader designed to work with the corpus created by IPI PAN.
See http://korpus.pl/en/ for more details about the IPI PAN corpus.
The corpus includes information about text domain, channel and categories.
You can access the possible values using ipipan.domains(), ipipan.channels() and
ipipan.categories(). You can also use this metadata to filter files, e.g.:
ipipan.fileids(channel='prasa')
ipipan.fileids(categories='publicystyczny')
The reader supports the methods words, sents and paras, and their tagged versions.
You can get the part of speech instead of the full tag by passing the
"simplify_tags=True" parameter, e.g.:
ipipan.tagged_sents(simplify_tags=True)
You can also get all disambiguated tags by specifying the parameter
"one_tag=False", e.g.:
ipipan.tagged_paras(one_tag=False)
You can get all tags that were assigned by the morphological analyzer by
specifying the parameter "disamb_only=False", e.g.:
ipipan.tagged_words(disamb_only=False)
The IPI PAN corpus contains tags indicating whether there is a space between
two tokens. To add special "no space" markers, specify the parameter
"append_no_space=True", e.g.:
ipipan.tagged_words(append_no_space=True)
As a result, wherever there should be no space between two tokens, a new
pair ('', 'no-space') is inserted (for tagged data), or just '' for
methods without tags.
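The marker convention described above can be undone with a few lines of plain Python. The sketch below assumes a token list shaped like the output of ipipan.words(append_no_space=True); the sample tokens and the helper name join_with_no_space_markers are made up for illustration and are not part of the reader.

```python
def join_with_no_space_markers(tokens):
    """Rebuild a string from tokens in which '' means "no space here".

    Hypothetical helper: it assumes every pair of neighbouring tokens is
    separated by one space unless a '' marker sits between them.
    """
    pieces = []
    glue = True  # True: attach the next token without a leading space
    for token in tokens:
        if token == '':
            glue = True  # the next real token follows with no space
            continue
        if pieces and not glue:
            pieces.append(' ')
        pieces.append(token)
        glue = False
    return ''.join(pieces)


# Hand-written sample, not taken from the corpus.
sample = ['Ala', 'ma', 'kota', '', '.']
print(join_with_no_space_markers(sample))
```

For tagged data the same idea applies if the ('', 'no-space') pairs are first reduced to their first element.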
The corpus reader can also try to insert spaces between words. To enable this
option, specify the parameter "append_space=True", e.g.:
ipipan.words(append_space=True)
As a result, either ' ' (for methods without tags) or (' ', 'space') (for
tagged data) is inserted between tokens.
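Downstream code often needs to separate such spacing markers from real tokens again. The following is a minimal sketch, assuming tagged output shaped like ipipan.tagged_words(append_space=True); the sample pairs, the MARKER_TAGS set and both helper names are illustrative assumptions, and the sketch further assumes that no real token carries the tag 'space' or 'no-space'.

```python
# Tags used by the spacing markers described in the text above.
MARKER_TAGS = {'space', 'no-space'}


def strip_markers(tagged_tokens):
    """Return only the real (word, tag) pairs, dropping spacing markers."""
    return [(word, tag) for word, tag in tagged_tokens
            if tag not in MARKER_TAGS]


def surface_text(tagged_tokens):
    """Concatenate the words; with append_space-style input the ' '
    markers supply the spaces between tokens."""
    return ''.join(word for word, _tag in tagged_tokens)


# Hand-written sample, not taken from the corpus.
sample = [('Ala', 'subst'), (' ', 'space'), ('ma', 'fin'),
          (' ', 'space'), ('kota', 'subst'), ('.', 'interp')]

print(surface_text(sample))
print(strip_markers(sample))
```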
By default, XML entities like &quot; and &amp; are replaced by the
corresponding characters. You can turn off this feature by specifying the
parameter "replace_xmlentities=False", e.g.:
ipipan.words(replace_xmlentities=False)
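If tokens were read with entity replacement turned off, the entities can still be decoded later. The sketch below uses html.unescape from the Python standard library as a stand-in; which entities the reader itself replaces is not stated here, so treating html.unescape as equivalent is an assumption. The sample tokens and the helper name decode_entities are illustrative.

```python
import html


def decode_entities(tokens):
    """Decode entities such as &quot; and &amp; in each token.

    Assumption: html.unescape covers at least the entities found in the
    corpus files; it is not the reader's own replacement routine.
    """
    return [html.unescape(token) for token in tokens]


# Hand-written sample shaped like words(replace_xmlentities=False) output.
raw = ['&quot;', 'A', '&amp;', 'B', '&quot;']
print(decode_entities(raw))
```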
Return a list of file identifiers for the files that make up this corpus.
| Generated by Epydoc 3.0.1 on Mon Apr 11 14:39:44 2011 | http://epydoc.sourceforge.net |