Stores variables, mostly regular expressions, that may need to be
language-dependent for the algorithm to be applied correctly. A subclass
may override these properties to suit a language other than English; an
instance of it can then be passed as an argument to the
PunktSentenceTokenizer and PunktTrainer constructors.
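A minimal sketch of that subclassing pattern, assuming NLTK is installed and that the constructor accepts the instance through a `lang_vars` keyword. The subclass name `DandaLanguageVars` and the extra sentence-final character are illustrative choices, not part of the library.

```python
from nltk.tokenize.punkt import PunktLanguageVars, PunktSentenceTokenizer


class DandaLanguageVars(PunktLanguageVars):
    # Hypothetical subclass: also treat the Devanagari danda (U+0964)
    # as a candidate sentence-final character.
    sent_end_chars = (".", "?", "!", "\u0964")


# An untrained tokenizer is enough to show how the instance is passed in.
tokenizer = PunktSentenceTokenizer(lang_vars=DandaLanguageVars())
sentences = tokenizer.tokenize("First sentence. Second sentence.")
print(sentences)
```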
sent_end_chars = ('.', '?', '!')
Characters which are candidates for sentence boundaries
internal_punctuation = ',:;'
Sentence-internal punctuation, which indicates an abbreviation if
preceded by a period-final token.
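The heuristic can be sketched on an already-tokenized sentence as follows. This is an illustration of the idea only, not the library's implementation; the function name `abbreviation_hints` is hypothetical.

```python
INTERNAL_PUNCTUATION = ",:;"


def abbreviation_hints(tokens):
    """Return period-final tokens directly followed by internal punctuation."""
    hints = []
    for current, following in zip(tokens, tokens[1:]):
        if current.endswith(".") and following[0] in INTERNAL_PUNCTUATION:
            hints.append(current)
    return hints


tokens = ["We", "sell", "apples", ",", "pears", ",", "etc.", ",", "at", "cost."]
print(abbreviation_hints(tokens))  # ['etc.']
```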
re_boundary_realignment = re.compile(r'(?m)["\'\)\]\}]+?(?:\s+...
Used to realign punctuation that should be included in a sentence
although it follows the period (or ?, !).
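A rough sketch of what realignment means in practice. The pattern `CLOSERS` below is a simplified stand-in chosen for illustration, not the truncated expression shown above, and `realign` is a hypothetical helper.

```python
import re

# Simplified stand-in: closing quotes/brackets followed by whitespace or end of text.
CLOSERS = re.compile(r'["\')\]}]+(?:\s+|$)')


def realign(sentences):
    """Move leading closing punctuation back onto the previous sentence."""
    result = []
    for sentence in sentences:
        match = CLOSERS.match(sentence)
        if match and result:
            result[-1] += match.group().rstrip()
            sentence = sentence[match.end():]
        if sentence:
            result.append(sentence)
    return result


print(realign(["(He left.", ") Then she spoke."]))
# ['(He left.)', 'Then she spoke.']
```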
_re_word_start = '[^\\(\\"\\`{\\[:;&\\#\\*@\\)}\\]\\-,]'
Excludes some characters from starting word tokens
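For illustration, the character class above can be compiled with the standard `re` module to check which characters may begin a word token. The helper `can_start_word` is a hypothetical name.

```python
import re

# Pattern copied from the entry above, written as a raw string.
RE_WORD_START = re.compile(r'[^\("\`{\[:;&\#\*@\)}\]\-,]')


def can_start_word(ch):
    """True if the single character ch is allowed to start a word token."""
    return RE_WORD_START.match(ch) is not None


print([c for c in "a(-,'9" if can_start_word(c)])  # ['a', "'", '9']
```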
_re_non_word_chars = '(?:[?!)\\";}\\]\\*:@\\\'\\({\\[])'
Characters that cannot appear within words
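A small sketch using the pattern exactly as printed above, showing where these characters would break a string apart. Note that the period and the comma are not in the set. The helper `split_on_non_word` is hypothetical.

```python
import re

# Pattern copied verbatim from the entry above.
RE_NON_WORD_CHARS = re.compile('(?:[?!)\\";}\\]\\*:@\\\'\\({\\[])')


def split_on_non_word(text):
    """Split text at every non-word character, dropping empty pieces."""
    return [piece for piece in RE_NON_WORD_CHARS.split(text) if piece]


print(split_on_non_word('say"hi"(now)!'))  # ['say', 'hi', 'now']
print(split_on_non_word("U.S.,"))          # ['U.S.,']
```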
_re_multi_char_punct = '(?:\\-{2,}|\\.{2,}|(?:\\.\\s){2,}\\.)'
Runs of two or more hyphens and ellipses (including spaced ones such as
". . .") are multi-character punctuation
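As an illustration, compiling the pattern above with the standard `re` module shows which runs it picks out. A single hyphen or a single period is left alone.

```python
import re

# Pattern copied from the entry above, written as a raw string.
RE_MULTI_CHAR_PUNCT = re.compile(r"(?:\-{2,}|\.{2,}|(?:\.\s){2,}\.)")

text = "Wait... what -- no. . . really."
print(RE_MULTI_CHAR_PUNCT.findall(text))  # ['...', '--', '. . .']
```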
_word_tokenize_fmt = '(\n %(MultiChar)s\n |\n ...
Format of a regular expression to split punctuation from words,
excluding period.
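The format string above is truncated, so the following is only a sketch of the mechanism: a verbose-mode template with a named `%(MultiChar)s` placeholder that is filled in before compiling. `SIMPLE_FMT` is a hypothetical, much-simplified template and should not be read as the elided original.

```python
import re

MULTI_CHAR = r"(?:\-{2,}|\.{2,}|(?:\.\s){2,}\.)"

# Hypothetical simplified template; only the %(MultiChar)s placeholder
# is taken from the entry above.
SIMPLE_FMT = r"""(
    %(MultiChar)s            # multi-character punctuation is tried first
    |
    \w+(?:\.(?!\.))?         # a word, keeping one trailing period attached
    |
    [^\s\w.]                 # any other single mark except the period
)"""

word_tokenizer = re.compile(SIMPLE_FMT % {"MultiChar": MULTI_CHAR}, re.VERBOSE)
tokens = word_tokenizer.findall("Well... yes, Mr. Smith left.")
print(tokens)  # ['Well', '...', 'yes', ',', 'Mr.', 'Smith', 'left.']
```

Keeping the period attached to the preceding word is what lets a later step decide whether "Mr." is an abbreviation or a sentence end.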
_period_context_fmt = '\n \\S* ...
Format of a regular expression to find contexts including possible
sentence boundaries.
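Because the format string above is truncated, the sketch below uses a hypothetical simplified pattern, `SIMPLE_CONTEXT`, to show the idea: find each token ending in a sentence-end character together with the token that follows it. These are candidates only; deciding which ones are real boundaries is a separate step.

```python
import re

SENT_END_CHARS = (".", "?", "!")
END_CLASS = "[%s]" % re.escape("".join(SENT_END_CHARS))

# Hypothetical simplified context pattern: a token ending in a sentence-end
# character, followed by whitespace and the next token (captured by lookahead).
SIMPLE_CONTEXT = re.compile(r"\S*%s(?=\s+(?P<next_tok>\S+))" % END_CLASS)

text = "Dr. Smith arrived. He sat down."
candidates = [(m.group(), m.group("next_tok")) for m in SIMPLE_CONTEXT.finditer(text)]
print(candidates)  # [('Dr.', 'Smith'), ('arrived.', 'He')]
```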