
Module punkt


The Punkt sentence tokenizer. The algorithm for this tokenizer is described in Kiss & Strunk (2006):

 Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence
   Boundary Detection.  Computational Linguistics 32: 485-525.
Classes
    Language-dependent variables
PunktLanguageVars
Stores variables, mostly regular expressions, which may be language-dependent for correct application of the algorithm.
    Punkt Word Tokenizer
PunktWordTokenizer
    Punkt Parameters
PunktParameters
Stores data used to perform sentence boundary detection with Punkt.
    PunktToken
PunktToken
Stores a token of text with annotations produced during sentence boundary detection.
    Punkt base class
_PunktBaseClass
Includes common components of PunktTrainer and PunktSentenceTokenizer.
    Punkt Trainer
PunktTrainer
Learns parameters used in Punkt sentence boundary detection.
    Punkt Sentence Tokenizer
PunktSentenceTokenizer
A sentence tokenizer that uses an unsupervised algorithm to build a model of abbreviations, collocations, and words that start sentences, then uses that model to find sentence boundaries.
Functions
    Helper Functions
 
_pair_iter(it)
Yields pairs of tokens from the given iterator such that each input token will appear as the first element in a yielded tuple.
    Punkt Sentence Tokenizer
 
main(text, tok_cls=<class 'nltk.tokenize.punkt.PunktSentenceTokenizer'>, train_cls=<class 'nltk.tokenize.punkt.PunktTrainer'>)
Builds a Punkt model from the given text and applies it to that same text.
Variables
    Orthographic Context Constants
  _ORTHO_BEG_UC = 2
Orthographic context: beginning of a sentence with upper case.
  _ORTHO_MID_UC = 4
Orthographic context: middle of a sentence with upper case.
  _ORTHO_UNK_UC = 8
Orthographic context: unknown position in a sentence with upper case.
  _ORTHO_BEG_LC = 16
Orthographic context: beginning of a sentence with lower case.
  _ORTHO_MID_LC = 32
Orthographic context: middle of a sentence with lower case.
  _ORTHO_UNK_LC = 64
Orthographic context: unknown position in a sentence with lower case.
  _ORTHO_UC = 14
Orthographic context: occurs with upper case.
  _ORTHO_LC = 112
Orthographic context: occurs with lower case.
  _ORTHO_MAP = {('initial', 'lower'): 16, ('initial', 'upper'): ...
A map from context position and first-letter case to the appropriate orthographic context flag.
    Language-dependent variables
  _re_non_punct = re.compile(r'(?u)[^\W\d]')
Matches token types that are not merely punctuation.
Function Details

_pair_iter(it)


Yields pairs of tokens from the given iterator such that each input token will appear as the first element in a yielded tuple. The last pair will have None as its second element.
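
The pairing behaviour described above can be sketched as a small generator. This is an illustrative reimplementation, not the module's source: the name `pair_iter` is hypothetical, and yielding nothing for an empty input is an assumption.

```python
def pair_iter(it):
    """Yield (current, following) pairs; the last pair is (last, None)."""
    it = iter(it)
    try:
        prev = next(it)
    except StopIteration:
        return  # assumption: an empty input yields no pairs
    for el in it:
        yield (prev, el)
        prev = el
    # The final token has no successor, so it is paired with None.
    yield (prev, None)


pairs = list(pair_iter(["Mr.", "Smith", "left", "."]))
```

Every input token appears exactly once as the first element of a pair, which lets a caller inspect each token together with the token that follows it.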


Variables Details

_ORTHO_MAP

A map from context position and first-letter case to the appropriate orthographic context flag.

Value:
{('initial', 'lower'): 16,
 ('initial', 'upper'): 2,
 ('internal', 'lower'): 32,
 ('internal', 'upper'): 4,
 ('unknown', 'lower'): 64,
 ('unknown', 'upper'): 8}
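
Each value in this map is a single-bit flag, and the two aggregate constants are the bitwise OR of their three members (2 | 4 | 8 = 14 and 16 | 32 | 64 = 112). The sketch below restates the constants without the leading underscore and shows one way such flags might be accumulated and queried. The `record` helper is hypothetical and only illustrates the bit arithmetic; it is not taken from the module.

```python
# Single-bit flags, with the values listed in the Variables section.
ORTHO_BEG_UC = 1 << 1  # 2
ORTHO_MID_UC = 1 << 2  # 4
ORTHO_UNK_UC = 1 << 3  # 8
ORTHO_BEG_LC = 1 << 4  # 16
ORTHO_MID_LC = 1 << 5  # 32
ORTHO_UNK_LC = 1 << 6  # 64

# Aggregate masks: "seen with upper case" / "seen with lower case".
ORTHO_UC = ORTHO_BEG_UC | ORTHO_MID_UC | ORTHO_UNK_UC  # 14
ORTHO_LC = ORTHO_BEG_LC | ORTHO_MID_LC | ORTHO_UNK_LC  # 112

ORTHO_MAP = {
    ("initial", "upper"): ORTHO_BEG_UC,
    ("internal", "upper"): ORTHO_MID_UC,
    ("unknown", "upper"): ORTHO_UNK_UC,
    ("initial", "lower"): ORTHO_BEG_LC,
    ("internal", "lower"): ORTHO_MID_LC,
    ("unknown", "lower"): ORTHO_UNK_LC,
}


def record(context, position, case):
    """Hypothetical helper: OR one observation into an accumulated context."""
    return context | ORTHO_MAP[(position, case)]


ctx = 0
ctx = record(ctx, "initial", "upper")   # seen capitalised at a sentence start
ctx = record(ctx, "internal", "lower")  # seen lower-cased mid-sentence

seen_upper = bool(ctx & ORTHO_UC)
seen_lower_at_start = bool(ctx & ORTHO_BEG_LC)
```

Because every flag occupies its own bit, a single integer can hold all contexts in which a word type has been observed, and the aggregate masks test a whole group at once.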

_re_non_punct

Matches token types that are not merely punctuation. (Types for numeric tokens are changed to ##number## and hence contain alphabetic characters.)

Value:
re.compile(r'(?u)[^\W\d]')
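
The character class `[^\W\d]` matches any single word character that is not a digit, which means a letter or an underscore. The short check below applies the same pattern to a few sample strings. The helper name `has_non_punct` is hypothetical, and using `search` is an assumption about how the pattern is applied.

```python
import re

# Same pattern as the value shown above.
re_non_punct = re.compile(r'(?u)[^\W\d]')


def has_non_punct(token_type):
    """Return True if the string contains at least one letter or underscore."""
    return re_non_punct.search(token_type) is not None


results = {
    s: has_non_punct(s)
    for s in ["word", "...", ",", "42", "3.14", "##number##"]
}
```

Strings made only of punctuation or digits do not match, while the `##number##` placeholder does, since it contains letters.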