Package nltk :: Package tokenize :: Module punkt :: Class PunktLanguageVars

type PunktLanguageVars

object --+
         |
        PunktLanguageVars

Stores variables, mostly regular expressions, that may be language-dependent and must be set appropriately for the algorithm to work correctly. A subclass may override these properties to suit a language other than English; an instance of that subclass can then be passed as an argument to the PunktSentenceTokenizer and PunktTrainer constructors.
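The override mechanism described above can be illustrated without NLTK installed. The following is a minimal sketch, not NLTK's implementation: `LanguageVarsSketch`, `HindiLanguageVars`, and the `sent_end_pattern` property are hypothetical names, and the way the character class is built from `sent_end_chars` is an assumption made for illustration.

```python
import re


class LanguageVarsSketch:
    """Hypothetical stand-in (not NLTK's class) holding language-dependent settings."""

    sent_end_chars = ('.', '?', '!')

    @property
    def sent_end_pattern(self):
        # Build a character class from whatever sent_end_chars the (sub)class defines.
        return re.compile('[%s]' % re.escape(''.join(self.sent_end_chars)))


class HindiLanguageVars(LanguageVarsSketch):
    """Hypothetical extension that also treats the danda (U+0964) as a sentence end."""

    sent_end_chars = ('.', '?', '!', '\u0964')


text = 'pahla vakya\u0964 dusra vakya\u0964'
default_hits = LanguageVarsSketch().sent_end_pattern.findall(text)
custom_hits = HindiLanguageVars().sent_end_pattern.findall(text)
print(default_hits, custom_hits)
```

With the real class, the same pattern would apply: subclass PunktLanguageVars, override the class variables that differ for the target language, and pass an instance to the tokenizer or trainer constructor.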

Instance Methods
 
__getstate__(self)
 
__setstate__(self, state)
 
_word_tokenizer_re(self)
Compiles and returns a regular expression for word tokenization.
 
word_tokenize(self, s)
Tokenize a string to split off punctuation other than periods.
 
period_context_re(self)
Compiles and returns a regular expression to find contexts including possible sentence boundaries.
Class Variables
  sent_end_chars = ('.', '?', '!')
Characters which are candidates for sentence boundaries
  internal_punctuation = ',:;'
Sentence-internal punctuation, which indicates an abbreviation if preceded by a period-final token.
  re_boundary_realignment = re.compile(r'(?m)["\'\)\]\}]+?(?:\s+...
Used to realign punctuation that should be included in a sentence although it follows the period (or ?, !).
  _re_word_start = '[^\\(\\"\\`{\\[:;&\\#\\*@\\)}\\]\\-,]'
Excludes some characters from starting word tokens
  _re_non_word_chars = '(?:[?!)\\";}\\]\\*:@\\\'\\({\\[])'
Characters that cannot appear within words
  _re_multi_char_punct = '(?:\\-{2,}|\\.{2,}|(?:\\.\\s){2,}\\.)'
Runs of two or more hyphens and ellipses (including spaced ones such as ". . .") are multi-character punctuation
  _word_tokenize_fmt = '(\n %(MultiChar)s\n |\n ...
Format of a regular expression to split punctuation from words, excluding period.
  _period_context_fmt = '\n \\S* ...
Format of a regular expression to find contexts including possible sentence boundaries.
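Three of the fragments listed above are shown in full, so their behaviour can be checked in isolation with the standard `re` module. This is a small sketch: the constant names and sample strings are chosen here for illustration, and the patterns are the listed values rewritten as raw strings.

```python
import re

# Values copied from the class-variable listing, written as raw strings.
RE_WORD_START = r'''[^\(\"\`{\[:;&\#\*@\)}\]\-,]'''
RE_NON_WORD_CHARS = r'''(?:[?!)\";}\]\*:@\'\({\[])'''
RE_MULTI_CHAR_PUNCT = r'''(?:\-{2,}|\.{2,}|(?:\.\s){2,}\.)'''

# Multi-character punctuation: dashes, ellipses, and spaced ellipses.
multi = re.findall(RE_MULTI_CHAR_PUNCT, 'Wait... no -- yes . . . ok.')

# Tokens whose first character is allowed to start a word token.
candidates = ['word', '(aside', '-dash', '9am']
starts = [tok for tok in candidates if re.match(RE_WORD_START, tok)]

# Characters that cannot appear within words (note: no period or comma).
non_word = re.findall(RE_NON_WORD_CHARS, 'Stop! (Why?) "Now": go.')

print(multi)
print(starts)
print(non_word)
```

The single period at the end of each sample is deliberately left unmatched, consistent with the description that word tokenization splits off punctuation other than periods.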
Properties
  _re_sent_end_chars
  _re_period_context
  _re_word_tokenizer
Class Variable Details

re_boundary_realignment

Used to realign punctuation that should be included in a sentence although it follows the period (or ?, !).

Value:
re.compile(r'(?m)["\'\)\]\}]+?(?:\s+|(?=--)|$)')
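To make the purpose of this pattern concrete, here is a hedged sketch of how it could be applied when a split has been made right after a period but closing punctuation follows. The helper `realign` is hypothetical and written for illustration only; it is not taken from the module.

```python
import re

# The value listed above.
RE_BOUNDARY_REALIGNMENT = re.compile(r'(?m)["\'\)\]\}]+?(?:\s+|(?=--)|$)')


def realign(first, second):
    """Move closing punctuation that leads `second` onto the end of `first`."""
    m = RE_BOUNDARY_REALIGNMENT.match(second)
    if m is None:
        return first, second
    # Attach the punctuation (without its trailing whitespace) to the first
    # sentence and drop the matched span from the start of the second.
    return first + m.group(0).strip(), second[m.end():]


print(realign('He said "stop.', '" Then he left.'))
print(realign('(He said "No.', '") Then he left.'))
```

In both calls the quote (and parenthesis) that followed the period ends up at the end of the first sentence, and the second sentence starts at "Then".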

_word_tokenize_fmt

Format of a regular expression to split punctuation from words, excluding period.

Value:
'''(
        %(MultiChar)s
        |
        (?=%(WordStart)s)\\S+?  # Accept word characters until end is found
        (?= # Sequences marking a word\'s end
            \\s|                                 # White-space
            $|                                  # End-of-string
...

_period_context_fmt

Format of a regular expression to find contexts including possible sentence boundaries. Matches the token that the possible sentence boundary ends, and matches the following token within a lookahead expression.

Value:
'''
        \\S*                          # some word material
        %(SentEndChars)s             # a potential sentence ending
        (?=(?P<after_tok>
            %(NonWord)s              # either other punctuation
            |
            \\s+(?P<next_tok>\\S+)     # or whitespace and some other token
...
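The format is filled in with `%`-style mapping keys and compiled in verbose mode. The sketch below shows that mechanism under stated assumptions: the listed value is truncated, so the two closing parentheses are assumed; the `SentEndChars` character class is assumed to be built from `sent_end_chars`; and `NON_WORD` is the `_re_non_word_chars` value from the class-variable listing.

```python
import re

# Visible lines of the format, with the two open groups closed by assumption
# because the listed value is truncated.
PERIOD_CONTEXT_FMT = r'''
    \S*                          # some word material
    %(SentEndChars)s             # a potential sentence ending
    (?=(?P<after_tok>
        %(NonWord)s              # either other punctuation
        |
        \s+(?P<next_tok>\S+)     # or whitespace and some other token
    ))
'''

SENT_END_CHARS = ('.', '?', '!')
NON_WORD = r'''(?:[?!)\";}\]\*:@\'\({\[])'''

pattern = re.compile(
    PERIOD_CONTEXT_FMT % {
        # Assumed construction of the sentence-end character class.
        'SentEndChars': '[%s]' % re.escape(''.join(SENT_END_CHARS)),
        'NonWord': NON_WORD,
    },
    re.UNICODE | re.VERBOSE,
)

text = 'Mr. Smith arrived. He sat down.'
contexts = [(m.group(), m.group('next_tok')) for m in pattern.finditer(text)]
print(contexts)
```

Each match is the token that ends in a candidate boundary character, and `next_tok` is captured inside the lookahead without being consumed. The final "down." is not reported here because nothing follows it, so the lookahead cannot succeed.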

Property Details

_re_sent_end_chars