Package nltk :: Package corpus :: Package reader :: Module xmldocs :: Class XMLCorpusView

type XMLCorpusView

               object --+        
                        |        
util.AbstractLazySequence --+    
                            |    
  util.StreamBackedCorpusView --+
                                |
                               XMLCorpusView
Known Subclasses:

A corpus view that selects out specified elements from an XML file, and provides a flat list-like interface for accessing them. (Note: XMLCorpusView is not used by XMLCorpusReader itself, but may be used by subclasses of XMLCorpusReader.)

Every XML corpus view has a tag specification, indicating what XML elements should be included in the view; and each (non-nested) element that matches this specification corresponds to one item in the view. Tag specifications are regular expressions over tag paths, where a tag path is a list of element tag names, separated by '/', indicating the ancestry of the element. Some examples:
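As a rough illustration of how a tag specification relates to tag paths, the sketch below tests a few specifications against a few paths using ordinary regular-expression matching. The tag names foo, bar, and corpus and the helper matches_tagspec are hypothetical, and requiring the specification to match the whole path is an assumption of this sketch, not a statement about the library's internals.

```python
import re

def matches_tagspec(tagspec, tag_path):
    """Return True if the regular expression `tagspec` matches the
    entire '/'-separated tag path (hypothetical helper; whole-path
    matching is an assumption of this sketch)."""
    return re.fullmatch(tagspec, tag_path) is not None

# A top-level element whose tag is foo.
top_level = matches_tagspec(r"foo", "foo")
# A bar element whose parent is a top-level foo element.
child = matches_tagspec(r"foo/bar", "foo/bar")
# A foo element nested somewhere below the root.
nested = matches_tagspec(r".*/foo", "corpus/doc/foo")
# A foo or bar element nested somewhere below the root.
either = matches_tagspec(r".*/(foo|bar)", "corpus/bar")
# 'foo' alone does not select a foo element's children.
not_child = matches_tagspec(r"foo", "foo/bar")
```

Note that under these assumptions '.*/foo' does not match a top-level foo element, because the pattern requires at least one '/' in the path.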

The view items are generated from the selected XML elements via the method handle_elt(). By default, this method returns the element as-is (i.e., as an ElementTree object); but it can be overridden, either via subclassing or via the elt_handler constructor parameter.

Instance Methods
 
__init__(self, fileid, tagspec, elt_handler=None)
Create a new corpus view based on a specified XML file.
 
_detect_encoding(self, fileid)
 
handle_elt(self, elt, context)
Convert an element into an appropriate value for inclusion in the view.
 
_read_xml_fragment(self, stream)
Read a string from the given stream that does not contain any un-closed tags.
list of any
read_block(self, stream, tagspec=None, elt_handler=None)
Read from stream until we find at least one element that matches tagspec, and return the result of applying elt_handler to each element found.

Inherited from util.StreamBackedCorpusView: __add__, __getitem__, __len__, __mul__, __radd__, __rmul__, close, iterate_from

Inherited from util.StreamBackedCorpusView (private): _open

Inherited from util.AbstractLazySequence: __cmp__, __contains__, __hash__, __iter__, __repr__, count, index

Class Variables
  _DEBUG = False
If true, then display debugging output to stdout when reading blocks.
  _BLOCK_SIZE = 1024
The number of characters read at a time by this corpus reader.
  _VALID_XML_RE = re.compile(r'(?sx)[^<]*(((<!--.*?-->)|(<![CDAT...
A regular expression that matches XML fragments that do not contain any un-closed tags.
  _XML_TAG_NAME = re.compile(r'<\s*/?\s*([^\s>]+)')
A regular expression used to extract the tag name from a start tag, end tag, or empty-elt tag string.
  _XML_PIECE = re.compile(r'(?sx)(?P<COMMENT><!--.*?-->)|(?P<CDA...
A regular expression used to find all start-tags, end-tags, and empty-elt tags in an XML file.

Inherited from util.AbstractLazySequence (private): _MAX_REPR_SIZE

Instance Variables
  _tagspec
The tag specification for this corpus view.
  _tag_context
A dictionary mapping from file positions (as returned by stream.tell()) to XML contexts.
Properties

Inherited from util.StreamBackedCorpusView: fileid

Method Details

__init__(self, fileid, tagspec, elt_handler=None)
(Constructor)

Create a new corpus view based on a specified XML file.

Note that the XMLCorpusView constructor does not take an encoding argument, because the character encoding is specified by the XML files themselves.

Parameters:
  • tagspec (str) - A tag specification, indicating what XML elements should be included in the view. Each non-nested element that matches this specification corresponds to one item in the view.
  • elt_handler - A function used to transform each element to a value for the view. If no handler is specified, then self.handle_elt() is called, which returns the element as an ElementTree object. The handler takes the same arguments as handle_elt(), so its signature is:
       elt_handler(elt, context) -> value
    
Overrides: util.StreamBackedCorpusView.__init__

handle_elt(self, elt, context)

Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the elt_handler constructor argument, this method simply returns elt.

Parameters:
  • elt (ElementTree) - The element that should be converted.
  • context (str) - A string composed of element tags separated by forward slashes, indicating the XML context of the given element. For example, the string 'foo/bar/baz' indicates that the element is a baz element whose parent is a bar element and whose grandparent is a top-level foo element.
Returns:
The view value corresponding to elt.
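The following is a minimal sketch of a custom handler with the same (elt, context) arguments as handle_elt(). The function name text_with_context and the sample XML are hypothetical. The element is built directly with ElementTree here so that the example runs without a corpus file.

```python
from xml.etree import ElementTree

def text_with_context(elt, context):
    """Hypothetical handler: return the element's context string together
    with all of its text content, stripped of surrounding whitespace."""
    text = "".join(elt.itertext()).strip()
    return (context, text)

# Stand-in for an element that a view would hand to the handler.
elt = ElementTree.fromstring("<baz>Hello <b>world</b></baz>")
item = text_with_context(elt, "foo/bar/baz")
# item is ('foo/bar/baz', 'Hello world')
```

Per the constructor signature above, such a function would be supplied as XMLCorpusView(fileid, tagspec, elt_handler=text_with_context); overriding handle_elt() in a subclass is the other documented option.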

_read_xml_fragment(self, stream)

Read a string from the given stream that does not contain any un-closed tags. In particular, this function first reads a block from the stream of size self._BLOCK_SIZE. It then checks if that block contains an un-closed tag. If it does, then this function either backtracks to the last '<', or reads another block.
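The read-then-backtrack strategy described above can be sketched as follows. This is a simplified illustration, not the library's implementation: COMPLETE_TAGS_RE is a hypothetical, much reduced stand-in for _VALID_XML_RE (it ignores comments, CDATA sections, and DOCTYPE declarations), read_fragment is a hypothetical name, and the stream is assumed to be an io.StringIO whose tell() and seek() work in character offsets.

```python
import io
import re

# Reduced stand-in for _VALID_XML_RE: text, then any number of complete
# <...> tags, each followed by text. No comment/CDATA/DOCTYPE handling.
COMPLETE_TAGS_RE = re.compile(r"[^<]*(<[^>]*>[^<]*)*\Z")

def read_fragment(stream, block_size=16):
    """Return a string from `stream` that contains no un-closed tag."""
    start = stream.tell()
    fragment = ""
    while True:
        block = stream.read(block_size)
        fragment += block
        # Every tag in the fragment is closed: done.
        if COMPLETE_TAGS_RE.match(fragment):
            return fragment
        if not block:
            raise ValueError("unexpected end of file: tag not closed")
        # We stopped in the middle of a tag. If everything before the
        # last '<' is complete, rewind the stream to that '<'.
        last_open = fragment.rfind("<")
        if last_open > 0 and COMPLETE_TAGS_RE.match(fragment[:last_open]):
            stream.seek(start + last_open)
            return fragment[:last_open]
        # Otherwise loop and read another block.

xml_text = '<doc><item id="1">text</item></doc>'
stream = io.StringIO(xml_text)
fragments = []
while True:
    piece = read_fragment(stream)
    if not piece:
        break
    fragments.append(piece)
```

With a block size of 16, the first read ends inside the item start tag, so the first fragment is cut back to '<doc>' and the next read resumes at the '<' of that tag.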

read_block(self, stream, tagspec=None, elt_handler=None)

Read from stream until we find at least one element that matches tagspec, and return the result of applying elt_handler to each element found.

Parameters:
  • stream - an input stream
Returns: list of any
a block of tokens from the input stream
Overrides: util.StreamBackedCorpusView.read_block

Class Variable Details

_VALID_XML_RE

A regular expression that matches XML fragments that do not contain any un-closed tags.

Value:
re.compile(r'(?sx)[^<]*(((<!--.*?-->)|(<![CDATA\[\.\*\?]\])|(<!DOCTYPE\
\s+[^\[]*(\[[^\]]*\])?\s*>)|(<[^>]*>))[^<]*)*\Z')

_XML_PIECE

A regular expression used to find all start-tags, end-tags, and empty-elt tags in an XML file. This regexp is more lenient than the XML spec -- e.g., it allows spaces in some places where the spec does not.

Value:
re.compile(r'(?sx)(?P<COMMENT><!--.*?-->)|(?P<CDATA><![CDATA\[\.\*\?]\\
]>)|(?P<PI><\?.*?\?>)|(?P<DOCTYPE><!DOCTYPE\s+[^\[]*(\[[^\]]*\])?\s*>)\
|(?P<EMPTY_ELT_TAG><\s*[^>/\?!\s][^>]*/\s*>)|(?P<START_TAG><\s*[^>/\?!\
\s][^>]*>)|(?P<END_TAG><\s*/[^>/\?!\s][^>]*>)')

Instance Variable Details

_tag_context

A dictionary mapping from file positions (as returned by stream.tell()) to XML contexts. An XML context is a tuple of XML tag names, indicating which tags have not yet been closed.
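To make the notion of an XML context concrete, the sketch below computes the tuple of not-yet-closed tag names for a piece of XML text. The function open_tags and the pattern TAG_RE are hypothetical simplifications. TAG_NAME_RE copies the pattern shown for _XML_TAG_NAME. No well-formedness checking is attempted.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")                   # hypothetical: any <...> piece
TAG_NAME_RE = re.compile(r"<\s*/?\s*([^\s>]+)")   # pattern of _XML_TAG_NAME

def open_tags(xml_text):
    """Return a tuple of the tag names still open at the end of xml_text."""
    context = []
    for match in TAG_RE.finditer(xml_text):
        tag = match.group()
        if tag.startswith(("<?", "<!")):
            continue  # processing instruction, comment, or declaration
        if tag.endswith("/>"):
            continue  # empty-element tag: opens and closes at once
        if re.match(r"<\s*/", tag):
            if context:
                context.pop()  # end tag closes the innermost open element
        else:
            context.append(TAG_NAME_RE.match(tag).group(1))
    return tuple(context)

context = open_tags('<corpus><doc id="1"><s>Hello <br/>')
# context is ('corpus', 'doc', 's')
```

A dictionary such as {position: context} keyed on stream positions would then let a reader that resumes at a given position know which elements enclose it.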