Package reader

NLTK corpus readers. The modules in this package provide functions that can be used to read corpus files in a variety of formats. These functions can read both the corpus files distributed in the NLTK corpus package and corpus files that are part of external corpora.

Corpus Reader Functions

Each corpus module defines one or more corpus reader functions, which can be used to read documents from that corpus. These functions take an argument, item, which indicates which document should be read from the corpus.

Additionally, corpus reader functions can be given lists of item names, in which case they return a concatenation of the corresponding documents.
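
The single-item and list-of-items behavior described above can be illustrated with a small standard-library sketch. The helper read_words and the file names a.txt and b.txt are hypothetical and exist only for this illustration; this is not NLTK's implementation, and it assumes whitespace-separated plain text.

```python
import os
import tempfile


def read_words(root, item):
    """Toy reader function (hypothetical, not part of NLTK).

    `item` is either one document name or a list of document names.
    For a list, the words of all documents are concatenated in order.
    """
    items = [item] if isinstance(item, str) else list(item)
    words = []
    for name in items:
        with open(os.path.join(root, name), encoding="utf-8") as handle:
            words.extend(handle.read().split())
    return words


# Build a tiny throwaway corpus so the sketch is self-contained.
with tempfile.TemporaryDirectory() as root:
    for name, text in [("a.txt", "the quick fox"), ("b.txt", "jumps over")]:
        with open(os.path.join(root, name), "w", encoding="utf-8") as handle:
            handle.write(text)

    single = read_words(root, "a.txt")
    combined = read_words(root, ["a.txt", "b.txt"])

print(single)
print(combined)
```

Accepting either a string or a list keeps the call site uniform: callers do not need a separate function for reading several documents at once.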

Corpus reader functions are named after the type of information they return.

For example, to read a list of the words in the Brown Corpus, use nltk.corpus.brown.words():

>>> from nltk.corpus import brown
>>> print(brown.words())
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

Corpus Metadata

[Work in Progress] Metadata about the NLTK corpora, and their individual documents, is stored using Open Language Archives Community (OLAC) metadata records. These records can be accessed using nltk.corpus.corpus.olac().

Classes
CategorizedPlaintextCorpusReader
A reader for plaintext corpora whose documents are divided into categories based on their file identifiers.
PlaintextCorpusReader
Reader for corpora that consist of plaintext documents.
CategorizedTaggedCorpusReader
A reader for part-of-speech tagged corpora whose documents are divided into categories based on their file identifiers.
MacMorphoCorpusReader
A corpus reader for the MAC_MORPHO corpus.
CMUDictCorpusReader
ConllChunkCorpusReader
A ConllCorpusReader whose data file contains three columns: words, pos, and chunk.
ConllCorpusReader
A corpus reader for CoNLL-style files.
ChunkedCorpusReader
Reader for chunked (and optionally tagged) corpora.
SwadeshCorpusReader
WordListCorpusReader
List of words, one per line.
PPAttachmentCorpusReader
Each line of the corpus has the fields: sentence_id verb noun1 preposition noun2 attachment
SensevalCorpusReader
IEERCorpusReader
SinicaTreebankCorpusReader
Reader for the Sinica Treebank.
AlpinoCorpusReader
Reader for the Alpino Dutch Treebank.
IndianCorpusReader
List of words, one per line.
ToolboxCorpusReader
TimitCorpusReader
Reader for the TIMIT corpus (or any other corpus with the same file layout and use of file formats).
TaggedCorpusReader
Reader for simple part-of-speech tagged corpora.
BracketParseCorpusReader
Reader for corpora that consist of parenthesis-delineated parse trees.
YCOECorpusReader
Corpus reader for the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE), a 1.5 million word syntactically-annotated corpus of Old English prose texts.
RTECorpusReader
Corpus reader for corpora in RTE challenges.
StringCategoryCorpusReader
PropbankCorpusReader
Corpus reader for the PropBank corpus, which augments the Penn Treebank with information about the predicate argument structure of every verb instance.
VerbnetCorpusReader
BNCCorpusReader
Corpus reader for the XML version of the British National Corpus.
NPSChatCorpusReader
XMLCorpusReader
Corpus reader for corpora whose documents are XML files.
WordNetICCorpusReader
A corpus reader for the WordNet information content corpus.
WordNetCorpusReader
A corpus reader used to access WordNet or its variants.
SwitchboardCorpusReader
DependencyCorpusReader
SyntaxCorpusReader
An abstract base class for reading corpora consisting of syntactically parsed text.
CategorizedCorpusReader
A mixin class used to aid in the implementation of corpus readers for categorized corpora.
NombankCorpusReader
Corpus reader for the NomBank corpus, which augments the Penn Treebank with information about the predicate argument structure of every noun instance.
CorpusReader
A base class for corpus reader classes, each of which can be used to read a specific corpus format.
AlignedCorpusReader
Reader for corpora of word-aligned sentences.
CHILDESCorpusReader
Corpus reader for the XML version of the CHILDES corpus.
ChasenCorpusReader
EuroparlCorpusReader
Reader for Europarl corpora that consist of plaintext documents.
IPIPANCorpusReader
Corpus reader designed to work with the corpus created by IPI PAN.
KNBCorpusReader
This class implements __init__, which specifies the location of the corpus and a method for detecting the sentence blocks in corpus files.
Pl196xCorpusReader
PortugueseCategorizedPlaintextCorpusReader
TEICorpusView
TimitTaggedCorpusReader
A corpus reader for tagged sentences that are included in the TIMIT corpus.
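
Several of the classes above divide documents into categories based on their file identifiers. One way such a mapping could work is sketched below using only the standard library. The function categorize, the pattern, and the file identifiers are hypothetical illustrations and do not reproduce the actual logic of the categorized readers.

```python
import re
from collections import defaultdict


def categorize(fileids, cat_pattern):
    """Group file identifiers by the first capture group of `cat_pattern`.

    Hypothetical helper: file identifiers that do not match are skipped.
    """
    categories = defaultdict(list)
    for fileid in fileids:
        match = re.match(cat_pattern, fileid)
        if match:
            categories[match.group(1)].append(fileid)
    return dict(categories)


# Made-up identifiers where the directory name acts as the category.
fileids = ["news/a01.txt", "news/a02.txt", "fiction/k01.txt", "README"]
by_category = categorize(fileids, r"(\w+)/.*")
print(by_category)
```

Deriving the category from the identifier itself means no separate index file is needed, at the cost of requiring a consistent naming scheme.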
Functions
tagged_treebank_para_block_reader(stream)
find_corpus_fileids(root, regexp)
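
The signature find_corpus_fileids(root, regexp) suggests a search for files under a root directory whose identifiers match a regular expression. The sketch below shows one plausible reading of that behavior with the standard library only. The name find_fileids_sketch, the use of forward-slash relative paths as identifiers, and full-match semantics are all assumptions, not the documented behavior of the NLTK function.

```python
import os
import re
import tempfile


def find_fileids_sketch(root, regexp):
    """Return sorted relative paths under `root` that fully match `regexp`.

    Hypothetical stand-in; identifiers use '/' regardless of platform.
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            relpath = os.path.relpath(os.path.join(dirpath, filename), root)
            fileid = relpath.replace(os.sep, "/")
            if re.fullmatch(regexp, fileid):
                matches.append(fileid)
    return sorted(matches)


# Build a small throwaway directory tree to search.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "news"))
    for relpath in ["README", "news/a01.txt", "news/a02.txt"]:
        with open(os.path.join(root, *relpath.split("/")), "w") as handle:
            handle.write("")

    found = find_fileids_sketch(root, r".*\.txt")

print(found)
```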
Variables
  ADJ = 'a'
  ADJ_SAT = 's'
  ADV = 'r'
  ANA = re.compile(r'ana="(.*?)"')
  NOUN = 'n'
  NS = 'http://www.talkbank.org/ns/talkbank'
  PARA = re.compile(r'<p(?: [^>]*)?>(.*?)</p>')
  SENT = re.compile(r'<s(?: [^>]*)?>(.*?)</s>')
  TAGGEDWORD = re.compile(r'<([wc](?: [^>]*)?>)(.*?)</[wc]>')
  TEXTID = re.compile(r'text id="(.*?)"')
  TYPE = re.compile(r'type="(.*?)"')
  VERB = 'v'
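
To show how the PARA, SENT, TAGGEDWORD and ANA patterns listed above fit together, the following sketch applies them to a short sample string. The patterns are copied from the listing; the sample markup and the ana values (a1 to a4) are invented for illustration, and the nesting order (paragraph, then sentence, then word) is an assumption about how the patterns are meant to be combined.

```python
import re

# Patterns copied from the variable listing.
ANA = re.compile(r'ana="(.*?)"')
PARA = re.compile(r'<p(?: [^>]*)?>(.*?)</p>')
SENT = re.compile(r'<s(?: [^>]*)?>(.*?)</s>')
TAGGEDWORD = re.compile(r'<([wc](?: [^>]*)?>)(.*?)</[wc]>')

# Invented sample: one paragraph holding two sentences, on a single line
# because '.' does not match newlines without re.DOTALL.
sample = (
    '<p><s><w ana="a1">Ala</w> <w ana="a2">ma</w><c ana="a3">.</c></s>'
    '<s><w ana="a4">Kot</w></s></p>'
)

sentences = []
for para in PARA.findall(sample):
    for sent in SENT.findall(para):
        tagged = []
        # Group 1 holds the tag name and attributes, group 2 the token text.
        for tag_part, token in TAGGEDWORD.findall(sent):
            ana = ANA.search(tag_part)
            tagged.append((token, ana.group(1) if ana else None))
        sentences.append(tagged)

print(sentences)
```

Note that TAGGEDWORD captures the attribute string rather than the tag value, so a second pattern such as ANA is needed to pull a specific attribute out of the first group.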