type CorpusReader
object --+
|
CorpusReader
- Known Subclasses:
- aligned.AlignedCorpusReader
- SyntaxCorpusReader
- xmldocs.XMLCorpusReader
- cmudict.CMUDictCorpusReader
- plaintext.PlaintextCorpusReader
- tagged.TaggedCorpusReader
- chasen.ChasenCorpusReader
- chunked.ChunkedCorpusReader
- conll.ConllCorpusReader
- ieer.IEERCorpusReader
- ipipan.IPIPANCorpusReader
- indian.IndianCorpusReader
- nombank.NombankCorpusReader
- ppattach.PPAttachmentCorpusReader
- propbank.PropbankCorpusReader
- senseval.SensevalCorpusReader
- string_category.StringCategoryCorpusReader
- wordlist.WordListCorpusReader
- switchboard.SwitchboardCorpusReader
- timit.TimitCorpusReader
- toolbox.ToolboxCorpusReader
- wordnet.WordNetCorpusReader
- wordnet.WordNetICCorpusReader
- ycoe.YCOECorpusReader
A base class for corpus reader classes, each of which can be used to
read a specific corpus format. Each individual corpus reader instance is
used to read a specific corpus, consisting of one or more files under a
common root directory. Each file is identified by its file
identifier, which is the relative path to the file from the root
directory.
A separate subclass is defined for each corpus format. These
subclasses define one or more methods that provide 'views' on the corpus
contents, such as words() (for a list of words) and
parsed_sents() (for a list of parsed sentences). Called
with no arguments, these methods will return the contents of the entire
corpus. For most corpora, these methods define one or more selection
arguments, such as fileids or categories, which
can be used to select which portion of the corpus should be returned.
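The 'view' pattern described above can be illustrated with a small stand-in class. This is only a sketch: TinyWordReader and demo are hypothetical names invented for this example, the class does not inherit from CorpusReader, and it uses plain string paths rather than PathPointer objects. It shows a words() view that returns the whole corpus when called with no arguments and a subset when fileids is given.

```python
import os
import tempfile


class TinyWordReader:
    """Hypothetical stand-in for a corpus reader subclass (not library code)."""

    def __init__(self, root, fileids):
        self._root = root                # root directory of the corpus
        self._fileids = list(fileids)    # paths relative to the root

    def fileids(self):
        return list(self._fileids)

    def words(self, fileids=None):
        # With no argument, the view covers the entire corpus.
        if fileids is None:
            fileids = self._fileids
        elif isinstance(fileids, str):
            fileids = [fileids]
        result = []
        for fileid in fileids:
            path = os.path.join(self._root, fileid)
            with open(path, encoding="utf-8") as stream:
                result.extend(stream.read().split())
        return result


def demo():
    # Build a throwaway two-file corpus and read it back.
    with tempfile.TemporaryDirectory() as root:
        for name, text in [("a.txt", "the cat sat"), ("b.txt", "on the mat")]:
            with open(os.path.join(root, name), "w", encoding="utf-8") as out:
                out.write(text)
        reader = TinyWordReader(root, ["a.txt", "b.txt"])
        return reader.words(), reader.words(fileids="b.txt")


all_words, b_words = demo()
```

Here all_words covers both files, while b_words is restricted to b.txt by the fileids selection argument.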
readme(self)
Return the contents of the corpus README file, if it exists.
fileids(self)
Return a list of file identifiers for the files that make up this
corpus.
PathPointer
abspath(self,
file)
Return the absolute path for the given file.
list of PathPointer
abspaths(self,
fileids=None,
include_encoding=False,
include_fileid=False)
Return a list of the absolute paths for all fileids in this corpus;
or for the given list of fileids, if specified.
open(self,
file,
sourced=False)
Return an open stream that can be used to read the given file.
encoding(self,
file)
Return the unicode encoding for the given corpus file, if known.
_fileids
A list of the relative paths for the files that make up this
corpus.
_root
The root directory for this corpus.
_encoding
The default unicode encoding for the files that make up this
corpus.
__init__(self,
root,
fileids,
encoding=None,
tag_mapping_function=None)
(Constructor)
- Parameters:
root (PathPointer or str) - A path pointer identifying the root directory for this corpus.
If a string is specified, then it will be converted to a PathPointer automatically.
fileids - A list of the files that make up this corpus. This list can
either be specified explicitly, as a list of strings; or
implicitly, as a regular expression over file paths. The
absolute path for each file will be constructed by joining the
reader's root to each file name.
encoding - The default unicode encoding for the files that make up the
corpus. encoding's value can be any of the following:
- A string: encoding is the encoding name for all files.
- A dictionary: encoding[file_id] is the encoding name for the file
whose identifier is file_id. If file_id is not in encoding, then the
file contents will be processed using non-unicode byte strings.
- A list: encoding should be a list of (regexp, encoding) tuples. The
encoding for a file whose identifier is file_id will be the encoding
value for the first tuple whose regexp matches the file_id. If no
tuple's regexp matches the file_id, the file contents will be
processed using non-unicode byte strings.
- None: the file contents of all files will be processed using
non-unicode byte strings.
tag_mapping_function - A function for normalizing or simplifying the POS tags returned
by the tagged_words() or tagged_sents() methods.
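The rules for the fileids and encoding parameters can be sketched as two small helpers. Both are illustrations written for this page, not the reader's actual code: resolve_encoding and select_fileids are hypothetical names, the available argument stands in for a scan of the root directory, and the use of re.match (anchored at the start of the identifier) is an assumption about what "matches" means here.

```python
import re


def resolve_encoding(file_id, encoding):
    """Return the encoding name for file_id, or None for raw byte strings."""
    if isinstance(encoding, str):
        # A string: one encoding name for all files.
        return encoding
    if isinstance(encoding, dict):
        # A dictionary: look up the file identifier; missing means bytes.
        return encoding.get(file_id)
    if isinstance(encoding, list):
        # A list: the first (regexp, encoding) tuple whose regexp matches wins.
        for regexp, name in encoding:
            if re.match(regexp, file_id):  # assumption: start-anchored match
                return name
        return None
    # None (or anything else): process the file as byte strings.
    return None


def select_fileids(available, fileids):
    """Resolve fileids given explicitly (a list) or implicitly (a regexp)."""
    if isinstance(fileids, str):
        # Implicit form: keep every available path that the regexp matches.
        return [f for f in available if re.match(fileids, f)]
    # Explicit form: use the list as given.
    return list(fileids)
```

For example, resolve_encoding("x.txt", [(r".*\.utf8", "utf-8"), (r".*", "latin-1")]) falls through the first tuple and takes the encoding from the second.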
- Overrides:
object.__init__
__repr__(self)
- Overrides:
object.__repr__
- (inherited documentation)
abspath(self,
file)
Return the absolute path for the given file.
- Parameters:
file (str) - The file identifier for the file whose path should be returned.
- Returns: PathPointer
abspaths(self,
fileids=None,
include_encoding=False,
include_fileid=False)
Return a list of the absolute paths for all fileids in this corpus; or
for the given list of fileids, if specified.
- Parameters:
fileids (None or str or list) - Specifies the set of fileids for which paths should be returned.
Can be None, for all fileids; a list of file
identifiers, for a specified set of fileids; or a single file
identifier, for a single file. Note that the return value is
always a list of paths, even if fileids is a single
file identifier.
include_encoding - If true, then return a list of (path_pointer,
encoding) tuples.
- Returns:
list of PathPointer
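The handling of the fileids argument described above (None, a single identifier, or a list, always yielding a list) can be sketched as follows. This is an assumption-laden illustration: sketch_abspaths is a hypothetical function, it joins plain strings instead of building PathPointer objects, the encodings dictionary is a simplified stand-in for the reader's encoding lookup, and include_fileid is left out because its behavior is not described here.

```python
import os


def sketch_abspaths(root, all_fileids, encodings, fileids=None,
                    include_encoding=False):
    """Return a list of paths (or (path, encoding) tuples) under root."""
    if fileids is None:
        # None selects every file in the corpus.
        fileids = all_fileids
    elif isinstance(fileids, str):
        # A single identifier still produces a one-element list.
        fileids = [fileids]
    paths = [os.path.join(root, fileid) for fileid in fileids]
    if include_encoding:
        # Pair each path with its encoding; None means byte strings.
        return [(path, encodings.get(fileid))
                for path, fileid in zip(paths, fileids)]
    return paths
```

Calling sketch_abspaths(root, ids, enc, "b.txt") returns a list with one path, never a bare string.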
open(self,
file,
sourced=False)
Return an open stream that can be used to read the given file. If the
file's encoding is not None, then the stream will
automatically decode the file's contents into unicode.
- Parameters:
file - The file identifier of the file to read.
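The decoding behavior can be sketched with the standard library alone. The helper below is hypothetical (open_corpus_file is not a method of this class) and is written for Python 3, where undecoded contents are bytes rather than the str byte strings this page refers to. It returns a decoding text stream when an encoding is known and a raw binary stream otherwise.

```python
import os
import tempfile


def open_corpus_file(path, encoding=None):
    """Open path as decoded text if encoding is given, else as raw bytes."""
    if encoding is None:
        # No known encoding: hand back undecoded bytes.
        return open(path, "rb")
    # Known encoding: the stream decodes the contents while reading.
    return open(path, "r", encoding=encoding)


def demo():
    # Write a Latin-1 file, then read it back both ways.
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "sample.txt")
        with open(path, "wb") as out:
            out.write("caf\u00e9".encode("latin-1"))
        with open_corpus_file(path, "latin-1") as stream:
            decoded = stream.read()
        with open_corpus_file(path) as stream:
            raw = stream.read()
    return decoded, raw


decoded, raw = demo()
```

The same file yields a unicode string through the first stream and the original bytes through the second.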
encoding(self,
file)
Return the unicode encoding for the given corpus file, if known. If
the encoding is unknown, or if the given file should be processed using
byte strings (str), then return None.
|
- Decorators:
@deprecated("Use corpus.fileids() instead")
|
- Decorators:
@deprecated("Use corpus.fileids() instead")
|
- Decorators:
@deprecated("Use corpus.fileids() instead")
|
_encoding
The default unicode encoding for the files that make up this corpus.
If encoding is None, then the file contents are
processed using byte strings (str).