Module nltk.corpus.reader.util

Classes
    Corpus View
StreamBackedCorpusView
A 'view' of a corpus file, which acts like a sequence of tokens: it can be accessed by index, iterated over, etc.
ConcatenatedCorpusView
A 'view' of a corpus file that joins together one or more StreamBackedCorpusViews.
    Corpus View for Pickled Sequences
PickleCorpusView
A stream backed corpus view for corpus files that consist of sequences of serialized Python objects (serialized using pickle.dump).
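The file layout that PickleCorpusView expects, a run of objects written one after another with pickle.dump, can be illustrated without NLTK. The following is a minimal sketch using only the standard library; the helper name iter_pickled is hypothetical and is not part of this module.

```python
import io
import pickle


def iter_pickled(stream):
    """Yield objects from a stream of back-to-back pickle.dump records.

    Hypothetical helper for illustration; not part of nltk.corpus.reader.util.
    """
    while True:
        try:
            yield pickle.load(stream)
        except EOFError:
            # pickle.load raises EOFError once no further record is available.
            return


# Write three "sentences" to an in-memory file, one pickle.dump call each.
buf = io.BytesIO()
for sent in (["the", "cat"], ["sat"], ["on", "the", "mat"]):
    pickle.dump(sent, buf)

# Rewind and read the records back in order.
buf.seek(0)
tokens = list(iter_pickled(buf))
```

Because each record is self-delimiting, a reader can stop after any object and resume later from the same file position, which is what makes this layout suitable for a stream-backed view.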
Functions
    Corpus View
concat(docs)
Concatenate together the contents of multiple documents from a single corpus, using an appropriate concatenation function.
    Block Readers
read_whitespace_block(stream)
read_wordpunct_block(stream)
read_line_block(stream)
read_blankline_block(stream)
read_alignedsent_block(stream)
read_regexp_block(stream, start_re, end_re=None)
Read a sequence of tokens from a stream, where tokens begin with lines that match start_re.
read_sexpr_block(stream, block_size=16384, comment_char=None)
Read a sequence of s-expressions from the stream, and leave the stream's file position at the end of the last complete s-expression read.
_sub_space(m)
Helper function: given a regexp match, return a string of spaces that's the same length as the matched string.
_parse_sexpr_block(block)
    Finding Corpus Items
find_corpus_fileids(root, regexp)
_path_from(parent, child)
    Paragraph structure in Treebank files
tagged_treebank_para_block_reader(stream)
Function Details

concat(docs)

Concatenate together the contents of multiple documents from a single corpus, using an appropriate concatenation function. This utility function is used by corpus readers when the user requests more than one document at a time.
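What counts as an "appropriate concatenation function" depends on the type of the documents. The sketch below shows one plausible type-based dispatch for plain strings, lists, and tuples only. The name concat_sketch and the exact rules are assumptions made for illustration; the real function may handle further types, such as corpus views.

```python
from functools import reduce
import operator


def concat_sketch(docs):
    """Join several documents of one kind into a single document.

    Illustrative stand-in, not the NLTK implementation.
    """
    docs = list(docs)
    if not docs:
        raise ValueError("concat_sketch() expects at least one document")
    if len(docs) == 1:
        # A single document needs no joining.
        return docs[0]
    if all(isinstance(d, str) for d in docs):
        # Raw text: plain string concatenation.
        return "".join(docs)
    if all(isinstance(d, list) for d in docs) or all(
        isinstance(d, tuple) for d in docs
    ):
        # Token sequences of one kind: chain them with +.
        return reduce(operator.add, docs)
    raise ValueError("don't know how to concatenate these documents")


words = concat_sketch([["a", "b"], ["c"]])
raw = concat_sketch(["first doc. ", "second doc."])
```

Dispatching on type lets one helper serve readers that return raw text as well as readers that return token lists.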

read_regexp_block(stream, start_re, end_re=None)

Read a sequence of tokens from a stream, where tokens begin with lines that match start_re. If end_re is specified, then tokens end with lines that match end_re; otherwise, tokens end whenever the next line matching start_re or EOF is found.
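The start_re/end_re behaviour described above can be sketched in a few lines of standard-library Python. The function regexp_block_sketch is a hypothetical stand-in, not the module's code. It assumes that a line matching end_re is consumed but left out of the token, and that each call returns at most one token in a list.

```python
import io
import re


def regexp_block_sketch(stream, start_re, end_re=None):
    """Return a list holding the next token, or [] at end of file."""
    # Skip ahead to the first line that opens a token.
    while True:
        line = stream.readline()
        if not line:
            return []
        if re.match(start_re, line):
            break

    lines = [line]
    while True:
        pos = stream.tell()
        line = stream.readline()
        if not line:
            break  # EOF closes the token.
        if end_re is not None:
            if re.match(end_re, line):
                break  # Assumption: the end line is dropped.
        elif re.match(start_re, line):
            # The next token starts here: rewind so the next call sees it.
            stream.seek(pos)
            break
        lines.append(line)
    return ["".join(lines)]


stream = io.StringIO("# one\na\nb\n# two\nc\n")
first = regexp_block_sketch(stream, r"#")
second = regexp_block_sketch(stream, r"#")
third = regexp_block_sketch(stream, r"#")
```

Rewinding to the saved position is what allows successive calls on the same stream to return successive tokens without losing the line that starts the next one.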

read_sexpr_block(stream, block_size=16384, comment_char=None)

Read a sequence of s-expressions from the stream, and leave the stream's file position at the end of the last complete s-expression read. This function will always return at least one s-expression, unless there are no more s-expressions in the file.

If the file ends in the middle of an s-expression, then that incomplete s-expression is returned when the end of the file is reached.

Parameters:
  • block_size - The default block size for reading. If an s-expression is longer than one block, then more than one block will be read.
  • comment_char - A character that marks comments. Any lines that begin with this character will be stripped out. (If spaces or tabs precede the comment character, then the line will not be stripped.)
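The two ideas in this entry, stripping comment lines and stopping after the last complete s-expression, can be sketched on an in-memory string. Both helpers below are hypothetical and simplified: they handle parenthesized expressions only and do not read from a stream in blocks. Replacing comments with spaces of equal length follows the _sub_space description; the likely reason (an assumption here) is that character offsets into the block remain valid afterwards.

```python
import re


def blank_comment_lines(block, comment_char):
    """Overwrite lines that start with comment_char with spaces."""
    pattern = r"(?m)^%s.*" % re.escape(comment_char)
    # Same-length replacement keeps every later character at its offset.
    return re.sub(pattern, lambda m: " " * len(m.group()), block)


def split_sexprs(block):
    """Return (complete top-level s-expressions, offset after the last)."""
    sexprs, depth, start, end = [], 0, None, 0
    for i, ch in enumerate(block):
        if ch == "(":
            if depth == 0:
                start = i
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                sexprs.append(block[start:i + 1])
                end = i + 1
    return sexprs, end


text = "; header\n(a (b c))\n ; kept\n(d e) (f"
clean = blank_comment_lines(text, ";")
sexprs, end = split_sexprs(clean)
leftover = clean[end:]
```

Here the indented "; kept" line survives because whitespace precedes the comment character, and leftover holds the unfinished "(f", which a block reader would revisit after reading more input.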