Package nltk :: Module sourcedstring :: Class SourcedString

type SourcedString

object --+    
         |    
basestring --+
             |
            SourcedString
Known Subclasses:

A string that is annotated with information about the location in a document where it was originally found. Sourced strings are subclassed from Python strings. As a result, they can usually be used anywhere a normal Python string can be used.

There are two types of sourced strings: SimpleSourcedStrings, which correspond to a single substring of a document; and CompoundSourcedStrings, which are constructed by concatenating strings from multiple sources. Each of these types has two concrete subclasses: one for unicode strings (subclassed from unicode), and one for byte strings (subclassed from str).

Two sourced strings are considered equal if their contents are equal, even if their sources differ. This fact is important in ensuring that sourced strings act like normal strings. In particular, it allows sourced strings to be used with code that was originally intended to process plain Python strings.

If you wish to determine whether two sourced strings came from the same location in the same document, simply compare their sources attributes. If you know that both sourced strings are SimpleSourcedStrings, then you can compare their source attribute instead.
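The distinction between comparing contents and comparing sources can be illustrated with a minimal sketch. SourcedString itself targets Python 2 (basestring, unicode), so the example below uses a hypothetical Python 3 stand-in named SourcedText; the class name and the (docid, begin, end) layout of its source attribute are assumptions made for illustration, not part of this module.

```python
class SourcedText(str):
    """Hypothetical stand-in for a simple sourced string (not NLTK's class)."""

    def __new__(cls, contents, source):
        obj = super().__new__(cls, contents)
        # Assumed layout for illustration: (docid, begin, end).
        obj.source = source
        return obj


a = SourcedText("cat", ("doc1.txt", 0, 3))
b = SourcedText("cat", ("doc2.txt", 40, 43))

same_contents = (a == b)                # compares characters only
same_location = (a.source == b.source)  # compares where the text came from
```

Because equality and hashing are inherited unchanged from str, such an object can be used as a dictionary key or compared against plain strings, which is the behaviour the paragraph above describes.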

String operations that act on sourced strings will preserve location information whenever possible. However, there are a few types of string manipulation that can cause source information to be discarded. The most common examples of operations that will lose source information are:

Instance Methods
    Splitting & Stripping Methods

lstrip(self, chars=None)

rstrip(self, chars=None)

strip(self, chars=None)

split(self, sep=None, maxsplit=None)

rsplit(self, sep=None, maxsplit=None)

partition(self, sep)

rpartition(self, sep)

splitlines(self, keepends=False)
    String Concatenation Methods
 
__add__(self, other)

__radd__(self, other)

__mul__(self, other)

__rmul__(self, other)

join(self, sequence)
    Justification Methods
 
center(self, width, fillchar=' ')

ljust(self, width, fillchar=' ')

rjust(self, width, fillchar=' ')

zfill(self, width)
    Replacement Methods
 
__mod__(self, other)

replace(self, old, new, count=0)

expandtabs(self, tabsize=8)

translate(self, table, deletechars='')
    Unicode
 
encode(self, encoding=None, errors='strict')

decode(self, encoding=None, errors='strict')

_decode_one_to_one(unicode_chars)
Helper for self.decode().

_mixed_string_types(self, *args)
Return true if the list (self,)+args contains at least one unicode string and at least one byte string.

_decode_and_call(self, op, *args)
If self or any of the values in args is a byte string, then convert it to unicode by calling its decode() method.
    Display
 
pprint(self, vertical=False, wrap=70)
Return a string containing a pretty-printed display of this sourced string.

_pprint_vertical(self)

_pprint_docid(self, width, docid)

_pprint_char_repr(self, char)

_pprint_char(self, char, output_lines)
Helper for pprint(): add a character to the pretty-printed output.

_pprint_offset(self, offset, output_lines)
Helper for pprint(): add an offset marker to the pretty-printed output.
Static Methods
__new__(cls, contents, source)
Returns a new object with type S, a subtype of T.
    String Concatenation Methods

concat(substrings)
Return a sourced string formed by concatenating the given list of substrings.

__add_substring_to_list(substring, result)
Helper for concat(): add substring to the end of the list of substrings in result.

__merge_simple_substrings(lhs, rhs)
Helper for __add_substring_to_list(): Merge lhs and rhs into a single simple sourced string, and return it.
Class Variables
  _stringtype = None
A class variable, defined by subclasses of SourcedString, determining what type of string this class contains.
    Splitting & Stripping Methods
  _WHITESPACE_RE = re.compile(r'\s+')
  _NEWLINE_RE = re.compile(r'\n')
  _LINE_RE = re.compile(r'.*\n?')
    Display
  _PPRINT_CHAR_REPRS = {'\x07': '\\a', '\t': '\\t', '\n': '\\n',...
Instance Variables
  sources
A sorted tuple of (index, source) pairs.
Method Details

__new__(cls, contents, source)
Static Method

Returns: a new object with type S, a subtype of T
Overrides: basestring.__new__
(inherited documentation)

concat(substrings)
Static Method

Return a sourced string formed by concatenating the given list of substrings. Adjacent substrings will be merged when possible.

Depending on the types and values of the supplied substrings, the concatenated string's value may be a Python string (str or unicode), a SimpleSourcedString, or a CompoundSourcedString.

__add_substring_to_list(substring, result)
Static Method

Helper for concat(): add substring to the end of the list of substrings in result. If substring is compound, then add its own substrings instead. Merge adjacent substrings whenever possible. Discard empty un-sourced substrings.
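The flatten, discard, and merge rules described above can be sketched as follows. This is an illustrative Python 3 sketch, not the module's implementation: pieces are modelled as plain (text, span) tuples, a compound is modelled as a list of pieces, and the (docid, begin, end) span layout and the name add_piece are assumptions. Treating two spans as mergeable when they are contiguous in the same document is likewise an assumption about what "whenever possible" means.

```python
def add_piece(piece, result):
    """Append piece to result, flattening compounds and merging neighbours.

    A piece is either (text, span) or a list of pieces (a "compound").
    span is an assumed (docid, begin, end) tuple, or None for plain text.
    """
    if isinstance(piece, list):            # compound: add its parts instead
        for part in piece:
            add_piece(part, result)
        return
    text, span = piece
    if text == "" and span is None:        # empty un-sourced piece: discard
        return
    if result:
        prev_text, prev_span = result[-1]
        if span is None and prev_span is None:
            # Two plain neighbours collapse into one plain piece.
            result[-1] = (prev_text + text, None)
            return
        if (span and prev_span and prev_span[0] == span[0]
                and prev_span[2] == span[1]):
            # Same document and the previous end meets the next begin.
            result[-1] = (prev_text + text, (span[0], prev_span[1], span[2]))
            return
    result.append(piece)


result = []
for p in [("Hel", ("d", 0, 3)),
          [("lo", ("d", 3, 5)), ("", None)],
          (", ", None),
          ("hi", None),
          ("x", ("e", 0, 1))]:
    add_piece(p, result)
```

After the loop, result holds three pieces: one merged sourced piece covering ("d", 0, 5), one merged plain piece, and one sourced piece from a different document.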

_decode_one_to_one(unicode_chars)

Helper for self.decode(). Returns a unicode-decoded version of this SourcedString. unicode_chars is the unicode-decoded contents of this SourcedString.

This is used in the special case where the decoded string has the same length as the source string. In that case, we can safely assume that each character was encoded with one byte, so we can simply reuse our source. For example, this happens when decoding an ASCII string as UTF-8.

Decorators:
  • @abstract

Note: This method is abstract.
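The length check behind _decode_one_to_one() can be sketched in Python 3 as below. The function name decode_keeping_span and the (docid, begin, end) span are hypothetical, and the fallback of dropping the span when lengths differ is a simplification for this sketch; how decode() handles that case is not described here.

```python
def decode_keeping_span(data, span, encoding="utf-8"):
    """Decode bytes; reuse span only if every character took one byte."""
    text = data.decode(encoding)
    if len(text) == len(data):
        # One byte per character, so byte offsets equal character offsets.
        return text, span
    # Offsets no longer line up; this sketch simply gives up on the span.
    return text, None


ascii_result = decode_keeping_span(b"abc", ("d", 0, 3))
multibyte_result = decode_keeping_span("\u00e9".encode("utf-8"), ("d", 0, 2))
```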

_mixed_string_types(self, *args)

Return true if the list (self,)+args contains at least one unicode string and at least one byte string. (If this is the case, then all byte strings should be converted to unicode by calling decode() before the operation is performed. You can do this automatically using _decode_and_call().)
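A rough Python 3 analogue of this check is shown below, where str plays the role of unicode and bytes plays the role of the Python 2 byte string. The free function mixed_string_types is a hypothetical illustration, not the method's actual body.

```python
def mixed_string_types(*values):
    """Return True if values hold at least one str and at least one bytes."""
    has_text = any(isinstance(v, str) for v in values)
    has_bytes = any(isinstance(v, bytes) for v in values)
    return has_text and has_bytes
```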

_decode_and_call(self, op, *args)

If self or any of the values in args is a byte string, then convert it to unicode by calling its decode() method. Then return the result of calling self.op(*args). op is specified using a string, because if self is a byte string, then it will change type when it is decoded.
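The reason op is passed by name can be seen in a small Python 3 sketch. The free function decode_and_call and its encoding parameter are assumptions for illustration. The key point is that the method must be looked up after decoding, because a method bound to the original byte string would belong to the wrong type.

```python
def decode_and_call(target, op, *args, encoding="utf-8"):
    """Decode any bytes among target and args, then call target.<op>(*args)."""
    def to_text(value):
        return value.decode(encoding) if isinstance(value, bytes) else value

    target = to_text(target)
    args = tuple(to_text(a) for a in args)
    # Look the method up by name on the (possibly re-typed) target.
    return getattr(target, op)(*args)


replaced = decode_and_call(b"a-b", "replace", "-", "+")
joined = decode_and_call("x", "__add__", b"y")
```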

pprint(self, vertical=False, wrap=70)

Return a string containing a pretty-printed display of this sourced string.

Parameters:
  • vertical - If true, then the returned display string will have vertical orientation, rather than the default horizontal orientation.
  • wrap - Controls when the pretty-printed output is wrapped to the next line. If wrap is an integer, then lines are wrapped when they become longer than wrap. If wrap is a string, then lines are wrapped immediately following that string. If wrap is None, then lines are never wrapped.
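The three accepted forms of wrap can be illustrated with a simplified line-breaking helper. This sketch only shows where breaks would fall in a flat string; it does not reproduce pprint()'s actual multi-line display, and the name wrap_lines is hypothetical.

```python
def wrap_lines(text, wrap=70):
    """Split text according to an int, str, or None wrap setting."""
    if wrap is None:
        return [text]                      # never wrap
    if isinstance(wrap, str):
        parts = text.split(wrap)
        # Re-attach the marker so each break falls immediately after it.
        lines = [p + wrap for p in parts[:-1]] + [parts[-1]]
        return [line for line in lines if line]
    # Integer: break whenever a line would exceed wrap characters.
    return [text[i:i + wrap] for i in range(0, len(text), wrap)]
```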

Class Variable Details

_stringtype

A class variable, defined by subclasses of SourcedString, determining what type of string this class contains. Its value must be either str or unicode.

Value:
None

_PPRINT_CHAR_REPRS

Value:
{'\x07': '\\a', '\t': '\\t', '\n': '\\n', '\r': '\\r'}

Instance Variable Details

sources

A sorted tuple of (index, source) pairs. Each such pair specifies that the source of self[index:index+len(source)] is source. Any characters for which no source is specified are sourceless (e.g., plain Python characters that were concatenated to a sourced string).

When working with simple sourced strings, it's usually easier to use the source attribute instead; however, the sources attribute is defined for both simple and compound sourced strings.
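As an illustration of how a sorted tuple of (index, source) pairs can be read, the sketch below maps a character position back to a document offset. It assumes each source is a (docid, begin, end) tuple whose length is end - begin; that layout and the helper name source_at are hypothetical and are not taken from this module.

```python
def source_at(sources, position):
    """Return (docid, document_offset) for position, or None if sourceless.

    sources is a sorted tuple of (index, source) pairs, where each source
    is an assumed (docid, begin, end) tuple.
    """
    for index, (docid, begin, end) in sources:
        if index <= position < index + (end - begin):
            return (docid, begin + (position - index))
    return None


# Characters 0-2 come from "d", 3-4 are sourceless, 5-6 come from "e".
sources = ((0, ("d", 10, 13)), (5, ("e", 0, 2)))
```

With this layout, source_at(sources, 1) points into document "d", while a position that falls in the gap between the two pairs returns None, matching the description of sourceless characters.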