Package corpus

NLTK corpus readers. The modules in this package provide functions for reading corpus files in a variety of formats. These functions can read both the corpus files distributed in the NLTK corpus package and corpus files that are part of external corpora.
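As a rough illustration of reading an external corpus, the sketch below writes two small text files to a temporary directory and points a reader class at them. The file names, their contents, and the `write_sample_corpus` helper are made up for this example; it assumes NLTK is installed and that `PlaintextCorpusReader` accepts a root directory and a file-name pattern.

```python
import os
import tempfile


def write_sample_corpus(root):
    """Write two small text files into root and return their sorted names."""
    texts = {"one.txt": "Hello world.", "two.txt": "Another file."}
    for name, text in texts.items():
        with open(os.path.join(root, name), "w") as handle:
            handle.write(text)
    return sorted(texts)


# Hypothetical external corpus: a temporary directory of .txt files.
root = tempfile.mkdtemp()
fileids = write_sample_corpus(root)

try:
    from nltk.corpus.reader import PlaintextCorpusReader

    # Treat every .txt file under root as one document of the corpus.
    reader = PlaintextCorpusReader(root, r".*\.txt")
    print(reader.fileids())
    print(reader.words("one.txt"))
except (ImportError, LookupError) as err:
    # NLTK (or a resource it needs) is missing; the sample files still exist.
    print("Could not read the sample corpus with NLTK:", err)
```

The same reader class appears in the Variables list for several of the bundled plain-text corpora, so a directory of plain text files can be explored in much the same way.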

Available Corpora

Please see http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml for a complete list. Install corpora using nltk.download().
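The install step can be sketched as a small helper that downloads a corpus only when it is missing. The names `corpus_resource` and `ensure_corpus` are illustrative and not part of NLTK; the sketch assumes corpora are stored under the `corpora/` resource prefix and that `nltk.data.find()` raises `LookupError` when a resource is absent.

```python
def corpus_resource(name):
    """Return the resource path NLTK is assumed to use for a corpus."""
    return "corpora/" + name


def ensure_corpus(name):
    """Download the named corpus only if it is not already installed."""
    import nltk  # imported lazily so corpus_resource() works without NLTK

    try:
        nltk.data.find(corpus_resource(name))
        return True
    except LookupError:
        # Fall back to the downloader; treat its result as success/failure.
        return bool(nltk.download(name))
```

Calling `ensure_corpus("brown")` before importing `nltk.corpus.brown` avoids a lookup failure on a fresh installation.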

Corpus Reader Functions

Each corpus module defines one or more corpus reader functions, which can be used to read documents from that corpus. These functions take an argument, item, which indicates which document should be read from the corpus.

Additionally, corpus reader functions can be given lists of item names, in which case they return a concatenation of the corresponding documents.
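The way a reader function treats a single item name versus a list of item names can be sketched in plain Python. This is a toy model, not NLTK's implementation: the in-memory `DOCUMENTS` table, its item names, and this `words` function are all hypothetical.

```python
# Hypothetical in-memory "corpus": item name -> document text.
DOCUMENTS = {
    "a": "the cat sat",
    "b": "on the mat",
}


def words(item):
    """Return the words of one document, or of several concatenated."""
    if isinstance(item, str):
        # A single item name selects a single document.
        return DOCUMENTS[item].split()
    # A list of item names yields the documents' words in the given order.
    result = []
    for name in item:
        result.extend(DOCUMENTS[name].split())
    return result


print(words("a"))         # ['the', 'cat', 'sat']
print(words(["a", "b"]))  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
```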

Corpus reader functions are named based on the type of information they return. Two common examples are words(), which returns a list of str, and sents(), which returns a list of lists of str.

For example, to read a list of the words in the Brown Corpus, use nltk.corpus.brown.words():

>>> from nltk.corpus import brown
>>> print(brown.words())
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
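Other reader functions follow the same naming pattern. The sketch below previews a few of them on the Brown Corpus; it assumes NLTK and the `brown` corpus are installed, and the `preview` helper is illustrative rather than part of NLTK.

```python
from itertools import islice


def preview(sequence, n=5):
    """Return the first n elements of any iterable as a list."""
    return list(islice(sequence, n))


try:
    from nltk.corpus import brown

    print(preview(brown.words()))         # individual words
    print(preview(brown.sents(), 1))      # sentences, each a list of words
    print(preview(brown.tagged_words()))  # (word, tag) pairs
except (ImportError, LookupError) as err:
    # Either NLTK or the Brown Corpus data is not available.
    print("Brown Corpus unavailable:", err)
```

Slicing through `preview` keeps the output short, which matters because a full corpus can contain over a million words.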

Corpus Metadata

Metadata about the NLTK corpora, and their individual documents, is stored using Open Language Archives Community (OLAC) metadata records. These records can be accessed using nltk.corpus.<corpus>.olac(), where <corpus> is the name of a corpus.

Functions

demo()

Variables
  abc = <PlaintextCorpusReader in '/usr/share/nltk_data/corpora/...
  alpino = <AlpinoCorpusReader in '/usr/share/nltk_data/corpora/...
  brown = <CategorizedTaggedCorpusReader in '.../corpora/brown' ...
  cess_cat = <BracketParseCorpusReader in '/usr/share/nltk_data/...
  cess_esp = <BracketParseCorpusReader in '/usr/share/nltk_data/...
  cmudict = <CMUDictCorpusReader in '/usr/share/nltk_data/corpor...
  comtrans = <AlignedCorpusReader in '/usr/share/nltk_data/corpo...
  conll2000 = <ConllChunkCorpusReader in '/usr/share/nltk_data/c...
  conll2002 = <ConllChunkCorpusReader in '/usr/share/nltk_data/c...
  conll2007 = <DependencyCorpusReader in '.../corpora/conll2007'...
  dependency_treebank = <DependencyCorpusReader in '.../corpora/...
  floresta = <BracketParseCorpusReader in '/usr/share/nltk_data/...
  gazetteers = <WordListCorpusReader in '/usr/share/nltk_data/co...
  genesis = <PlaintextCorpusReader in '.../corpora/genesis' (not...
  gutenberg = <PlaintextCorpusReader in '.../corpora/gutenberg' ...
  ieer = <IEERCorpusReader in '/usr/share/nltk_data/corpora/ieer...
  inaugural = <PlaintextCorpusReader in '.../corpora/inaugural' ...
  indian = <IndianCorpusReader in '/usr/share/nltk_data/corpora/...
  ipipan = <IPIPANCorpusReader in '.../corpora/ipipan' (not load...
  jeita = <ChasenCorpusReader in '/usr/share/nltk_data/corpora/j...
  knbc = <KNBCorpusReader in '/usr/share/nltk_data/corpora/knbc....
  mac_morpho = <MacMorphoCorpusReader in '/usr/share/nltk_data/c...
  machado = <PortugueseCategorizedPlaintextCorpusReader in '/usr...
  movie_reviews = <CategorizedPlaintextCorpusReader in '/usr/sha...
  names = <WordListCorpusReader in '/usr/share/nltk_data/corpora...
  nps_chat = <NPSChatCorpusReader in '/usr/share/nltk_data/corpo...
  pl196x = <Pl196xCorpusReader in '.../corpora/pl196x' (not load...
  ppattach = <PPAttachmentCorpusReader in '/usr/share/nltk_data/...
  qc = <StringCategoryCorpusReader in '/usr/share/nltk_data/corp...
  reuters = <CategorizedPlaintextCorpusReader in '.../corpora/re...
  rte = <RTECorpusReader in '/usr/share/nltk_data/corpora/rte.zi...
  semcor = <XMLCorpusReader in '.../corpora/semcor' (not loaded ...
  senseval = <SensevalCorpusReader in '/usr/share/nltk_data/corp...
  shakespeare = <XMLCorpusReader in '.../corpora/shakespeare' (n...
  sinica_treebank = <SinicaTreebankCorpusReader in '/usr/share/n...
  state_union = <PlaintextCorpusReader in '/usr/share/nltk_data/...
  stopwords = <WordListCorpusReader in '.../corpora/stopwords' (...
  swadesh = <SwadeshCorpusReader in '/usr/share/nltk_data/corpor...
  switchboard = <SwitchboardCorpusReader in '/usr/share/nltk_dat...
  timit = <TimitCorpusReader in '.../corpora/timit' (not loaded ...
  timit_tagged = <TimitTaggedCorpusReader in '.../corpora/timit'...
  toolbox = <ToolboxCorpusReader in '.../corpora/toolbox' (not l...
  treebank = <BracketParseCorpusReader in '/usr/share/nltk_data/...
  treebank_chunk = <ChunkedCorpusReader in '/usr/share/nltk_data...
  treebank_raw = <PlaintextCorpusReader in '/usr/share/nltk_data...
  udhr = <PlaintextCorpusReader in '.../corpora/udhr' (not loade...
  verbnet = <VerbnetCorpusReader in '/usr/share/nltk_data/corpor...
  webtext = <PlaintextCorpusReader in '/usr/share/nltk_data/corp...
  wordnet = <WordNetCorpusReader in '/usr/share/nltk_data/corpor...
  wordnet_ic = <WordNetICCorpusReader in '/usr/share/nltk_data/c...
  words = <WordListCorpusReader in '/usr/share/nltk_data/corpora...
  ycoe = <YCOECorpusReader in '.../corpora/ycoe' (not loaded yet)>
  propbank = <PropbankCorpusReader in '.../corpora/propbank' (no...
  nombank = <NombankCorpusReader in '/usr/share/nltk_data/corpor...
Variables Details

abc

Value:
<PlaintextCorpusReader in '/usr/share/nltk_data/corpora/abc.zip/abc/'>

alpino

Value:
<AlpinoCorpusReader in '/usr/share/nltk_data/corpora/alpino.zip/alpino/'>

brown

Value:
<CategorizedTaggedCorpusReader in '.../corpora/brown' (not loaded yet)>

cess_cat

Value:
<BracketParseCorpusReader in '/usr/share/nltk_data/corpora/cess_cat.zip/cess_cat/'>

cess_esp

Value:
<BracketParseCorpusReader in '/usr/share/nltk_data/corpora/cess_esp.zip/cess_esp/'>

cmudict

Value:
<CMUDictCorpusReader in '/usr/share/nltk_data/corpora/cmudict'>

comtrans

Value:
<AlignedCorpusReader in '/usr/share/nltk_data/corpora/comtrans.zip/comtrans/'>

conll2000

Value:
<ConllChunkCorpusReader in '/usr/share/nltk_data/corpora/conll2000'>

conll2002

Value:
<ConllChunkCorpusReader in '/usr/share/nltk_data/corpora/conll2002.zip/conll2002/'>

conll2007

Value:
<DependencyCorpusReader in '.../corpora/conll2007' (not loaded yet)>

dependency_treebank

Value:
<DependencyCorpusReader in '.../corpora/dependency_treebank' (not loaded yet)>

floresta

Value:
<BracketParseCorpusReader in '/usr/share/nltk_data/corpora/floresta.zip/floresta/'>

gazetteers

Value:
<WordListCorpusReader in '/usr/share/nltk_data/corpora/gazetteers.zip/gazetteers/'>

genesis

Value:
<PlaintextCorpusReader in '.../corpora/genesis' (not loaded yet)>

gutenberg

Value:
<PlaintextCorpusReader in '.../corpora/gutenberg' (not loaded yet)>

ieer

Value:
<IEERCorpusReader in '/usr/share/nltk_data/corpora/ieer.zip/ieer/'>

inaugural

Value:
<PlaintextCorpusReader in '.../corpora/inaugural' (not loaded yet)>

indian

Value:
<IndianCorpusReader in '/usr/share/nltk_data/corpora/indian.zip/indian/'>

ipipan

Value:
<IPIPANCorpusReader in '.../corpora/ipipan' (not loaded yet)>

jeita

Value:
<ChasenCorpusReader in '/usr/share/nltk_data/corpora/jeita.zip/jeita/'>

knbc

Value:
<KNBCorpusReader in '/usr/share/nltk_data/corpora/knbc.zip/knbc/corpus1/'>

mac_morpho

Value:
<MacMorphoCorpusReader in '/usr/share/nltk_data/corpora/mac_morpho.zip/mac_morpho/'>

machado

Value:
<PortugueseCategorizedPlaintextCorpusReader in '/usr/share/nltk_data/corpora/machado.zip/machado/'>

movie_reviews

Value:
<CategorizedPlaintextCorpusReader in '/usr/share/nltk_data/corpora/movie_reviews.zip/movie_reviews/'>

names

Value:
<WordListCorpusReader in '/usr/share/nltk_data/corpora/names.zip/names/'>

nps_chat

Value:
<NPSChatCorpusReader in '/usr/share/nltk_data/corpora/nps_chat.zip/nps_chat/'>

pl196x

Value:
<Pl196xCorpusReader in '.../corpora/pl196x' (not loaded yet)>

ppattach

Value:
<PPAttachmentCorpusReader in '/usr/share/nltk_data/corpora/ppattach.zip/ppattach/'>

qc

Value:
<StringCategoryCorpusReader in '/usr/share/nltk_data/corpora/qc.zip/qc/'>

reuters

Value:
<CategorizedPlaintextCorpusReader in '.../corpora/reuters' (not loaded yet)>

rte

Value:
<RTECorpusReader in '/usr/share/nltk_data/corpora/rte.zip/rte/'>

semcor

Value:
<XMLCorpusReader in '.../corpora/semcor' (not loaded yet)>

senseval

Value:
<SensevalCorpusReader in '/usr/share/nltk_data/corpora/senseval.zip/senseval/'>

shakespeare

Value:
<XMLCorpusReader in '.../corpora/shakespeare' (not loaded yet)>

sinica_treebank

Value:
<SinicaTreebankCorpusReader in '/usr/share/nltk_data/corpora/sinica_treebank.zip/sinica_treebank/'>

state_union

Value:
<PlaintextCorpusReader in '/usr/share/nltk_data/corpora/state_union.zip/state_union/'>

stopwords

Value:
<WordListCorpusReader in '.../corpora/stopwords' (not loaded yet)>

swadesh

Value:
<SwadeshCorpusReader in '/usr/share/nltk_data/corpora/swadesh.zip/swadesh/'>

switchboard

Value:
<SwitchboardCorpusReader in '/usr/share/nltk_data/corpora/switchboard.zip/switchboard/'>

timit

Value:
<TimitCorpusReader in '.../corpora/timit' (not loaded yet)>

timit_tagged

Value:
<TimitTaggedCorpusReader in '.../corpora/timit' (not loaded yet)>

toolbox

Value:
<ToolboxCorpusReader in '.../corpora/toolbox' (not loaded yet)>

treebank

Value:
<BracketParseCorpusReader in '/usr/share/nltk_data/corpora/treebank.zip/treebank/combined/'>

treebank_chunk

Value:
<ChunkedCorpusReader in '/usr/share/nltk_data/corpora/treebank.zip/treebank/tagged/'>

treebank_raw

Value:
<PlaintextCorpusReader in '/usr/share/nltk_data/corpora/treebank.zip/treebank/raw/'>

udhr

Value:
<PlaintextCorpusReader in '.../corpora/udhr' (not loaded yet)>

verbnet

Value:
<VerbnetCorpusReader in '/usr/share/nltk_data/corpora/verbnet.zip/verbnet/'>

webtext

Value:
<PlaintextCorpusReader in '/usr/share/nltk_data/corpora/webtext.zip/webtext/'>

wordnet

Value:
<WordNetCorpusReader in '/usr/share/nltk_data/corpora/wordnet'>

wordnet_ic

Value:
<WordNetICCorpusReader in '/usr/share/nltk_data/corpora/wordnet_ic.zip/wordnet_ic/'>

words

Value:
<WordListCorpusReader in '/usr/share/nltk_data/corpora/words.zip/words/'>

propbank

Value:
<PropbankCorpusReader in '.../corpora/propbank' (not loaded yet)>

nombank

Value:
<NombankCorpusReader in '/usr/share/nltk_data/corpora/nombank.1.0.zip/nombank.1.0/'>