Package nltk :: Package corpus :: Package reader :: Module childes :: Class CHILDESCorpusReader

type CHILDESCorpusReader

         object --+        
                  |        
   api.CorpusReader --+    
                      |    
xmldocs.XMLCorpusReader --+
                          |
                         CHILDESCorpusReader

Corpus reader for the XML version of the CHILDES corpus. The CHILDES corpus is available at http://childes.psy.cmu.edu/, and its XML version at http://childes.psy.cmu.edu/data-xml/. To use this reader, copy the CHILDES XML corpus into the NLTK data directory (nltk_data/corpora/CHILDES/). At the moment, this CorpusReader supports only the English corpora at http://childes.psy.cmu.edu/data-xml/Eng-USA/. For access to simple word lists and tagged word lists, use words() and sents().
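
A minimal sketch of opening the corpus once it has been copied into place. The helper names childes_root and open_childes, and the file pattern, are assumptions made for illustration; the constructor arguments follow the signature documented on this page. NLTK is imported only when open_childes is called.

```python
import os


def childes_root(nltk_data_dir):
    """Build the corpus path described above: <nltk_data>/corpora/CHILDES."""
    return os.path.join(nltk_data_dir, "corpora", "CHILDES")


def open_childes(nltk_data_dir, pattern=r".*\.xml"):
    """Return a reader over every XML file under the CHILDES directory.

    The regular-expression pattern is an assumption; adjust it to the
    layout of the copied corpus.
    """
    # Imported lazily so this module still loads where NLTK is absent.
    from nltk.corpus.reader import CHILDESCorpusReader

    return CHILDESCorpusReader(childes_root(nltk_data_dir), pattern)
```

With the corpus installed, open_childes(os.path.expanduser("~/nltk_data")).fileids() would list the files the reader found.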

Instance Methods

__init__(self, root, fileids, lazy=True)

words(self, fileids=None, speaker='ALL', sent=None, stem=False, relation=False, pos=False, strip_space=True, replace=False)
Returns: list of str
the given file(s) as a list of words

sents(self, fileids=None, speaker='ALL', sent=True, stem=False, relation=None, pos=False, strip_space=True, replace=False)
Returns: list of (list of str)
the given file(s) as a list of sentences

corpus(self, fileids=None)
Returns: list of dict
the given file(s) as a dict of (corpus_property_key, value)

_get_corpus(self, fileid)

participants(self, fileids=None)
Returns: list of dict
the given file(s) as a dict of (participant_property_key, value)

_get_participants(self, fileid)

age(self, fileids=None, month=False)
Returns: list or int
the age recorded in the given file(s), as a string (year-month-date) or, if month is true, as an int (months)

_get_age(self, fileid, month)

MLU(self, fileids=None)
Returns: list of float
the mean length of utterance (MLU) of the given file(s), as a floating-point number

_getMLU(self, fileid)

_get_words(self, fileid, speaker, sent, stem, relation, pos, strip_space, replace)

Inherited from xmldocs.XMLCorpusReader: raw, xml

Inherited from api.CorpusReader: __repr__, abspath, abspaths, encoding, fileids, open, readme

Inherited from api.CorpusReader (private): _get_root

Deprecated since 0.8
    Inherited from xmldocs.XMLCorpusReader: read

Deprecated since 0.9.7
    Inherited from api.CorpusReader: files

Deprecated since 0.9.1
    Inherited from api.CorpusReader: items
    Inherited from api.CorpusReader (private): _get_items

Instance Variables

Inherited from api.CorpusReader (private): _encoding, _fileids, _root

Properties

Inherited from api.CorpusReader: root

Method Details

__init__(self, root, fileids, lazy=True)
(Constructor)

Parameters:
  • root - A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.
  • fileids - A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader's root to each file name.
  • encoding - The default unicode encoding for the files that make up the corpus. encoding's value can be any of the following:
    • A string: encoding is the encoding name for all files.
    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.
    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple's regexp matches the file_id, the file contents will be processed using non-unicode byte strings.
    • None: the file contents of all files will be processed using non-unicode byte strings.
  • tag_mapping_function - A function for normalizing or simplifying the POS tags returned by the tagged_words() or tagged_sents() methods.
Overrides: api.CorpusReader.__init__
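(inherited documentation; encoding and tag_mapping_function are described above but do not appear in this constructor's signature, and lazy is not documented here)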
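
The lookup rules for the encoding parameter can be sketched in a few lines. This is an illustration of the rules as stated above, not the library's own code: resolve_encoding is a hypothetical helper, and matching each regexp from the start of the identifier with re.match is an assumption.

```python
import re


def resolve_encoding(encoding, file_id):
    """Pick the encoding name for file_id, or None for raw byte strings."""
    if encoding is None or isinstance(encoding, str):
        # A string applies to every file; None means no decoding at all.
        return encoding
    if isinstance(encoding, dict):
        # Identifiers missing from the dict fall back to byte strings.
        return encoding.get(file_id)
    # Otherwise expect a list of (regexp, encoding) tuples; first match wins.
    for regexp, name in encoding:
        if re.match(regexp, file_id):
            return name
    return None
```

Because the first matching tuple wins, more specific patterns belong before catch-all patterns in the list form.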

words(self, fileids=None, speaker='ALL', sent=None, stem=False, relation=False, pos=False, strip_space=True, replace=False)

Returns all of the words and punctuation symbols in the specified file that were in text nodes; i.e., tags are ignored. Like the xml() method, fileid can only specify one file.

Parameters:
  • speaker - If specified, select specific speakers defined in the corpus. Default is 'ALL'. Common choices are 'CHI' (all children) and 'MOT' (mothers).
  • stem - If true, then use word stems instead of word strings.
  • relation - If true, then return tuples of (stem, index, dependent_index)
  • pos - If true, then return tuples of (stem, part_of_speech)
  • strip_space - If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • replace - If true, then use the replaced word instead of the original word (e.g., 'wat' will be replaced with 'watch')
Returns: list of str
the given file(s) as a list of words
Overrides: xmldocs.XMLCorpusReader.words
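
A short sketch of how the keyword arguments above might be combined. The helper names are hypothetical, and reader stands for any object exposing the words() method documented here, such as a CHILDESCorpusReader instance.

```python
from collections import Counter


def child_stem_counts(reader, fileid):
    """Count word stems produced by the child ('CHI') in one file."""
    # stem=True asks for stems; replace=True prefers the replaced word
    # over the original one, per the parameter list above.
    stems = reader.words(fileid, speaker="CHI", stem=True, replace=True)
    return Counter(stems)


def tagged_words(reader, fileid, speaker="ALL"):
    """Return (stem, part_of_speech) tuples for the chosen speaker."""
    return list(reader.words(fileid, speaker=speaker, pos=True))
```

Calling child_stem_counts(reader, fileid).most_common(10) would then give the ten most frequent stems in that file.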

sents(self, fileids=None, speaker='ALL', sent=True, stem=False, relation=None, pos=False, strip_space=True, replace=False)

Parameters:
  • speaker - If specified, select specific speakers defined in the corpus. Default is 'ALL'. Common choices are 'CHI' (all children) and 'MOT' (mothers).
  • stem - If true, then use word stems instead of word strings.
  • relation - If true, then return tuples of (str, pos, relation_list). If manually annotated relation information is present, return tuples of tuples of (str, pos, test_relation_list, str, pos, gold_relation_list) instead.
  • pos - If true, then return tuples of (stem, part_of_speech)
  • strip_space - If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • replace - If true, then use the replaced word instead of the original word (e.g., 'wat' will be replaced with 'watch')
Returns: list of (list of str)
the given file(s) as a list of sentences
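
As an illustration, sentences can be grouped per speaker by calling sents() once for each speaker code. The helper names and the default pair of speakers are assumptions; reader is any object exposing the sents() method documented here.

```python
def utterances_by_speaker(reader, fileid, speakers=("CHI", "MOT")):
    """Map each speaker code to that speaker's sentences in one file."""
    return {
        who: [list(sent) for sent in reader.sents(fileid, speaker=who)]
        for who in speakers
    }


def longest_utterance(sentences):
    """Return the sentence with the most tokens, or None if there are none."""
    return max(sentences, key=len, default=None)
```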

corpus(self, fileids=None)

Returns: list of dict
the given file(s) as a dict of (corpus_property_key, value)

participants(self, fileids=None)

Returns: list of dict
the given file(s) as a dict of (participant_property_key, value)
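
A hedged sketch of gathering the metadata returned by corpus() and participants() for a single file. The helper file_metadata is hypothetical, and it assumes that one file contributes one dict to each returned list; the keys inside those dicts depend on the corpus files and are not listed on this page.

```python
def file_metadata(reader, fileid):
    """Collect corpus-level and participant-level properties for one file."""
    # Both methods are documented as returning a list of dict, so a single
    # file is assumed here to contribute one dict to each list.
    corpus_props = reader.corpus(fileid)
    participant_props = reader.participants(fileid)
    return {
        "corpus": corpus_props[0] if corpus_props else {},
        "participants": participant_props[0] if participant_props else {},
    }
```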

age(self, fileids=None, month=False)

Parameters:
  • month - If true, return months instead of year-month-date
Returns: list or int
the age recorded in the given file(s), as a string (year-month-date) or, if month is true, as an int (months)
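
To make the month option concrete, here is a sketch of converting a year-month-date age to whole months. It assumes ages are written as duration strings such as "P2Y6M14D", which this page does not confirm, and it simply drops the days; the reader's own conversion may round differently. The name age_in_months is hypothetical.

```python
import re

# Assumed input shape: years, then optional months and days, e.g. "P2Y6M14D".
_AGE = re.compile(r"P(\d+)Y(?:(\d+)M)?(?:(\d+)D)?$")


def age_in_months(age):
    """Convert a year-month-day age string to whole months (days dropped)."""
    match = _AGE.match(age)
    if match is None:
        raise ValueError("unrecognized age string: %r" % (age,))
    years = int(match.group(1))
    months = int(match.group(2) or 0)
    return years * 12 + months
```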

MLU(self, fileids=None)

Returns: list of float
the mean length of utterance (MLU) of the given file(s), as a floating-point number
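
One common use is to pair each file's age with its MLU. The sketch below assumes that age() and MLU() each return a one-element list when given a single file identifier, and that every file records an age; age_mlu_pairs is a hypothetical helper, and reader is any object exposing the two methods documented here.

```python
def age_mlu_pairs(reader, fileids):
    """Return (fileid, age_in_months, mlu) tuples, ordered by age."""
    rows = []
    for fileid in fileids:
        # Each call is documented as returning a list; with a single
        # file id, the first element is assumed to be that file's value.
        months = reader.age(fileid, month=True)[0]
        mlu = reader.MLU(fileid)[0]
        rows.append((fileid, months, mlu))
    # Sort by age so changes in utterance length are easy to read off.
    return sorted(rows, key=lambda row: row[1])
```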