Package nltk :: Package corpus :: Package reader :: Module api :: Class CategorizedCorpusReader

type CategorizedCorpusReader

object --+
Known Subclasses:

A mixin class used to aid in the implementation of corpus readers for categorized corpora. This class defines the method categories(), which returns a list of the categories for the corpus or for a specified set of fileids; and overrides fileids() to take a categories argument, restricting the set of fileids to be returned.
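The two lookups described above can be pictured with a small stand-in. This is a minimal sketch, not NLTK's implementation: the class name ToyCategorizedReader and its attributes are hypothetical, and it assumes the mapping is supplied as a plain dict from file identifiers to lists of category labels.

```python
from collections import defaultdict


class ToyCategorizedReader:
    """Hypothetical stand-in (not NLTK's class) that keeps a two-way
    mapping between file identifiers and category labels."""

    def __init__(self, cat_map):
        self.file_to_cats = defaultdict(set)  # file id -> categories
        self.cat_to_files = defaultdict(set)  # category -> file ids
        for file_id, labels in cat_map.items():
            for label in labels:
                self.file_to_cats[file_id].add(label)
                self.cat_to_files[label].add(file_id)

    def categories(self, fileids=None):
        # No argument: every category known to the corpus.
        if fileids is None:
            return sorted(self.cat_to_files)
        if isinstance(fileids, str):
            fileids = [fileids]
        found = set()
        for file_id in fileids:
            found |= self.file_to_cats.get(file_id, set())
        return sorted(found)

    def fileids(self, categories=None):
        # No argument: every file in the corpus.
        if categories is None:
            return sorted(self.file_to_cats)
        if isinstance(categories, str):
            categories = [categories]
        found = set()
        for label in categories:
            found |= self.cat_to_files.get(label, set())
        return sorted(found)


reader = ToyCategorizedReader(
    {"a.txt": ["news"], "b.txt": ["news", "sports"], "c.txt": ["sports"]}
)
all_categories = reader.categories()        # ['news', 'sports']
cats_of_a = reader.categories("a.txt")      # ['news']
sports_files = reader.fileids("sports")     # ['b.txt', 'c.txt']
```

Keeping both directions of the mapping makes each query a direct dictionary lookup, at the cost of storing every pair twice.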

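Subclasses are expected to:

  • Call __init__() to set up the mapping.
  • Override all view methods to accept a categories parameter, which can be used instead of the fileids parameter, to select which fileids should be included in the returned view.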

Instance Methods [hide private]
__init__(self, **kwargs)
Initialize this mapping based on keyword arguments.
_init(self)
_add(self, file_id, category)
categories(self, fileids=None)
Return a list of the categories that are defined for this corpus, or for the given file(s) if fileids is specified.
fileids(self, categories=None)
Return a list of file identifiers for the files that make up this corpus, or that make up the given categories if specified.
Instance Variables [hide private]
_f2c: file-to-category mapping
_c2f: category-to-file mapping
_pattern: regexp specifying the mapping
_map: dict specifying the mapping
_file: fileid of file containing the mapping
_delimiter: delimiter for self._file
Method Details [hide private]

__init__(self, **kwargs)

Initialize this mapping based on keyword arguments, as follows:

  • cat_pattern: A regular expression pattern used to find the category for each file identifier. The pattern will be applied to each file identifier, and the first matching group will be used as the category label for that file.
  • cat_map: A dictionary, mapping from file identifiers to category labels.
  • cat_file: The name of a file that contains the mapping from file identifiers to categories. The argument cat_delimiter can be used to specify a delimiter.

The corresponding argument will be deleted from kwargs. If more than one of these arguments is specified, an exception will be raised.

Overrides: object.__init__
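
The keyword handling described for __init__ can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the helper name pop_category_mapping is hypothetical, cat_file is left out because its file format is not given here, and raising ValueError (including when none of the three arguments is present) is an assumption of this sketch.

```python
import re

MAPPING_KEYS = ("cat_pattern", "cat_map", "cat_file")


def pop_category_mapping(fileids, kwargs):
    """Hypothetical helper: build a file-id -> category dict from
    exactly one mapping argument, removing that argument from kwargs."""
    given = [key for key in MAPPING_KEYS if key in kwargs]
    if len(given) > 1:
        raise ValueError("Specify only one of: " + ", ".join(MAPPING_KEYS))
    if not given:
        # Assumption of this sketch: a missing mapping is also an error.
        raise ValueError("Expected one of: " + ", ".join(MAPPING_KEYS))

    key = given[0]
    value = kwargs.pop(key)  # the argument is deleted from kwargs

    if key == "cat_pattern":
        mapping = {}
        for file_id in fileids:
            match = re.match(value, file_id)
            if match:
                # The first group of the pattern is the category label.
                mapping[file_id] = match.group(1)
        return mapping
    if key == "cat_map":
        return dict(value)
    raise NotImplementedError("cat_file is not covered by this sketch")


kwargs = {"cat_pattern": r"(\w+)/.*", "encoding": "utf8"}
mapping = pop_category_mapping(["news/a.txt", "sports/b.txt"], kwargs)
# mapping is {'news/a.txt': 'news', 'sports/b.txt': 'sports'}
# kwargs is now {'encoding': 'utf8'}
```

Removing the mapping argument from kwargs lets the remaining keyword arguments be passed on unchanged to whatever other initializer the subclass calls.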