Source Code for Package nltk.corpus.reader

# Natural Language Toolkit: Corpus Readers
#
# Copyright (C) 2001-2011 NLTK Project
# Author: Steven Bird <sb@ldc.upenn.edu>
#         Edward Loper <edloper@gradient.cis.upenn.edu>
# URL: <http://www.nltk.org/>
# For license information, see LICENSE.TXT

"""
NLTK corpus readers.  The modules in this package provide functions
that can be used to read corpus files in a variety of formats.  These
functions can be used to read both the corpus files that are
distributed in the NLTK corpus package and corpus files that are part
of external corpora.

Corpus Reader Functions
=======================
Each corpus module defines one or more X{corpus reader functions},
which can be used to read documents from that corpus.  These functions
take an argument, C{item}, which is used to indicate which document
should be read from the corpus:

  - If C{item} is one of the unique identifiers listed in the corpus
    module's C{items} variable, then the corresponding document will
    be loaded from the NLTK corpus package.

  - If C{item} is a fileid, then that file will be read.

Additionally, corpus reader functions can be given lists of item
names, in which case they return a concatenation of the
corresponding documents.

Corpus reader functions are named based on the type of information
they return.  Some common examples, and their return types, are:

  - I{corpus}.words(): list of str
  - I{corpus}.sents(): list of (list of str)
  - I{corpus}.paras(): list of (list of (list of str))
  - I{corpus}.tagged_words(): list of (str,str) tuple
  - I{corpus}.tagged_sents(): list of (list of (str,str))
  - I{corpus}.tagged_paras(): list of (list of (list of (str,str)))
  - I{corpus}.chunked_sents(): list of (Tree with (str,str) leaves)
  - I{corpus}.parsed_sents(): list of (Tree with str leaves)
  - I{corpus}.parsed_paras(): list of (list of (Tree with str leaves))
  - I{corpus}.xml(): a single XML ElementTree
  - I{corpus}.raw(): unprocessed corpus contents

For example, to read a list of the words in the Brown Corpus, use
C{nltk.corpus.brown.words()}:

    >>> from nltk.corpus import brown
    >>> print(brown.words())
    ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

[Work in Progress:
Corpus Metadata
===============
Metadata about the NLTK corpora, and their individual documents, is
stored using U{Open Language Archives Community (OLAC)
<http://www.language-archives.org/>} metadata records.  These records
can be accessed using C{nltk.corpus.I{corpus}.olac()}.]
"""

from nltk.corpus.reader.plaintext import *
from nltk.corpus.reader.util import *
from nltk.corpus.reader.api import *
from nltk.corpus.reader.tagged import *
from nltk.corpus.reader.cmudict import *
from nltk.corpus.reader.conll import *
from nltk.corpus.reader.chunked import *
from nltk.corpus.reader.wordlist import *
from nltk.corpus.reader.xmldocs import *
from nltk.corpus.reader.ppattach import *
from nltk.corpus.reader.senseval import *
from nltk.corpus.reader.ieer import *
from nltk.corpus.reader.sinica_treebank import *
from nltk.corpus.reader.bracket_parse import *
from nltk.corpus.reader.indian import *
from nltk.corpus.reader.toolbox import *
from nltk.corpus.reader.timit import *
from nltk.corpus.reader.ycoe import *
from nltk.corpus.reader.rte import *
from nltk.corpus.reader.string_category import *
from nltk.corpus.reader.propbank import *
from nltk.corpus.reader.verbnet import *
from nltk.corpus.reader.bnc import *
from nltk.corpus.reader.nps_chat import *
from nltk.corpus.reader.wordnet import *
from nltk.corpus.reader.switchboard import *
from nltk.corpus.reader.dependency import *
from nltk.corpus.reader.nombank import *
from nltk.corpus.reader.ipipan import *
from nltk.corpus.reader.pl196x import *
from nltk.corpus.reader.knbc import *
from nltk.corpus.reader.chasen import *
from nltk.corpus.reader.childes import *
from nltk.corpus.reader.aligned import *

# Make sure that nltk.corpus.reader.bracket_parse gives the module, not
# the function bracket_parse() defined in nltk.tree:
import bracket_parse

__all__ = [
    'CorpusReader', 'CategorizedCorpusReader',
    'PlaintextCorpusReader', 'find_corpus_fileids',
    'TaggedCorpusReader', 'CMUDictCorpusReader',
    'ConllChunkCorpusReader', 'WordListCorpusReader',
    'PPAttachmentCorpusReader', 'SensevalCorpusReader',
    'IEERCorpusReader', 'ChunkedCorpusReader',
    'SinicaTreebankCorpusReader', 'BracketParseCorpusReader',
    'IndianCorpusReader', 'ToolboxCorpusReader',
    'TimitCorpusReader', 'YCOECorpusReader',
    'MacMorphoCorpusReader', 'SyntaxCorpusReader',
    'AlpinoCorpusReader', 'RTECorpusReader',
    'StringCategoryCorpusReader', 'EuroparlCorpusReader',
    'CategorizedTaggedCorpusReader',
    'CategorizedPlaintextCorpusReader',
    'PortugueseCategorizedPlaintextCorpusReader',
    'tagged_treebank_para_block_reader',
    'PropbankCorpusReader', 'VerbnetCorpusReader',
    'BNCCorpusReader', 'ConllCorpusReader',
    'XMLCorpusReader', 'NPSChatCorpusReader',
    'SwadeshCorpusReader', 'WordNetCorpusReader',
    'WordNetICCorpusReader', 'SwitchboardCorpusReader',
    'DependencyCorpusReader', 'NombankCorpusReader',
    'IPIPANCorpusReader', 'Pl196xCorpusReader',
    'TEICorpusView', 'KNBCorpusReader', 'ChasenCorpusReader',
    'CHILDESCorpusReader', 'AlignedCorpusReader',
    'TimitTaggedCorpusReader'
]
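
The package docstring describes the return shapes of words(), sents() and paras(), and says that passing a list of item names yields a concatenation of the corresponding documents. The sketch below illustrates those shapes with a small in-memory stand-in. ToyCorpus, its constructor, and the file names are hypothetical and are not part of NLTK; the splitting rules (one sentence per line, blank line between paragraphs, whitespace tokenization) are assumptions chosen only to keep the example self-contained.

```python
# Hypothetical toy reader: NOT part of NLTK. It only mirrors the return
# shapes listed in the package docstring, using simplistic splitting rules.
class ToyCorpus:
    def __init__(self, documents):
        # documents: dict mapping a fileid (str) to that document's raw text
        self._documents = documents

    def fileids(self):
        return sorted(self._documents)

    def _resolve(self, fileids):
        # Accept None (all documents), a single fileid, or a list of fileids.
        if fileids is None:
            return self.fileids()
        if isinstance(fileids, str):
            return [fileids]
        return list(fileids)

    def raw(self, fileids=None):
        # Unprocessed contents of the selected documents, concatenated.
        return "".join(self._documents[f] for f in self._resolve(fileids))

    def paras(self, fileids=None):
        # list of (list of (list of str)): paragraphs -> sentences -> words
        result = []
        for f in self._resolve(fileids):
            for block in self._documents[f].split("\n\n"):
                sents = [line.split() for line in block.splitlines() if line.strip()]
                if sents:
                    result.append(sents)
        return result

    def sents(self, fileids=None):
        # list of (list of str): flatten one level of paras()
        return [sent for para in self.paras(fileids) for sent in para]

    def words(self, fileids=None):
        # list of str: flatten one level of sents()
        return [word for sent in self.sents(fileids) for word in sent]


corpus = ToyCorpus({
    "a.txt": "The jury said .\nIt met .\n\nNo one spoke .",
    "b.txt": "A second file .",
})

print(corpus.words("b.txt"))
print(corpus.sents("a.txt"))
print(corpus.paras("a.txt"))
# A list of fileids gives the concatenation of the individual results.
print(corpus.words(["a.txt", "b.txt"]) == corpus.words("a.txt") + corpus.words("b.txt"))
```

Each method is derived from the one below it by flattening one level of nesting, which is why the documented return types differ only in list depth. The real readers in this package use format-specific tokenizers and may return lazy views rather than plain lists.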