The Punkt sentence tokenizer. The algorithm for this tokenizer is described in Kiss & Strunk (2006):
Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32: 485-525.
| Language-dependent variables | |
|---|---|
| PunktLanguageVars | Stores variables, mostly regular expressions, which may be language-dependent for correct application of the algorithm. |
| Punkt Word Tokenizer | |
| PunktWordTokenizer | |
| Punkt Parameters | |
| PunktParameters | Stores data used to perform sentence boundary detection with punkt. |
| PunktToken | |
| PunktToken | Stores a token of text with annotations produced during sentence boundary detection. |
| Punkt base class | |
| _PunktBaseClass | Includes common components of PunktTrainer and PunktSentenceTokenizer. |
| Punkt Trainer | |
| PunktTrainer | Learns parameters used in Punkt sentence boundary detection. |
| Punkt Sentence Tokenizer | |
| PunktSentenceTokenizer | A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries. |

| Helper Functions | |
|---|---|
| Punkt Sentence Tokenizer | |

| Orthographic Context Constants | |
|---|---|
| _ORTHO_BEG_UC = 2 | Orthographic context: beginning of a sentence with upper case. |
| _ORTHO_MID_UC = 4 | Orthographic context: middle of a sentence with upper case. |
| _ORTHO_UNK_UC = 8 | Orthographic context: unknown position in a sentence with upper case. |
| _ORTHO_BEG_LC = 16 | Orthographic context: beginning of a sentence with lower case. |
| _ORTHO_MID_LC = 32 | Orthographic context: middle of a sentence with lower case. |
| _ORTHO_UNK_LC = 64 | Orthographic context: unknown position in a sentence with lower case. |
| _ORTHO_UC = 14 | Orthographic context: occurs with upper case. |
| _ORTHO_LC = 112 | Orthographic context: occurs with lower case. |
| _ORTHO_MAP | A map from context position and first-letter case to the appropriate orthographic context flag. |
| Language-dependent variables | |
| _re_non_punct = re.compile(r' | Matches token types that are not merely punctuation. |

Yields pairs of tokens from the given iterator such that each input token will appear as the first element in a yielded tuple. The last pair will have None as its second element.
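
The pairing behavior described above can be sketched in a few lines of Python. The function name `pair_iter` is hypothetical and chosen only for illustration, and the handling of an empty input (yielding nothing) is an assumption. This is a minimal sketch of the described behavior, not necessarily the module's own implementation.

```python
def pair_iter(iterable):
    """Yield (current, next) pairs; the final pair has None as its second element.

    Hypothetical helper written to match the description above.
    """
    it = iter(iterable)
    try:
        prev = next(it)
    except StopIteration:
        return  # assumption: an empty input yields no pairs
    for item in it:
        yield (prev, item)
        prev = item
    # The last input token still appears as a first element, paired with None.
    yield (prev, None)


print(list(pair_iter(["Mr.", "Smith", "left."])))
# [('Mr.', 'Smith'), ('Smith', 'left.'), ('left.', None)]
```

Pairing each token with its successor lets a boundary detector inspect the token that follows a candidate sentence end without indexing into a list.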
|
|||
_ORTHO_MAP: A map from context position and first-letter case to the appropriate orthographic context flag.
|
_re_non_punct: Matches token types that are not merely punctuation. (Types for numeric tokens are changed to ##number## and hence contain alphabetic characters.)
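
The exact pattern is not shown in full on this page, so the example below uses an assumed pattern that searches for at least one word character that is not a digit. The names `NON_PUNCT`, `token_type`, and `is_non_punct`, and the simplified rule for recognizing numbers, are hypothetical. The sketch only illustrates why a numeric token still counts as non-punctuation once its type has been replaced with `##number##`.

```python
import re

# Assumed pattern: any word character that is not a digit (letters and "_").
NON_PUNCT = re.compile(r"[^\W\d]")


def token_type(token):
    """Hypothetical type normalization: numbers become ##number##, the rest is lowercased."""
    # Simplified number check for illustration only.
    if re.fullmatch(r"\d+([.,]\d+)*", token):
        return "##number##"
    return token.lower()


def is_non_punct(token):
    """True if the token's type contains something other than punctuation."""
    return NON_PUNCT.search(token_type(token)) is not None


for tok in ["Hello", "...", "3.14", "?!"]:
    print(tok, is_non_punct(tok))
# Hello True
# ... False
# 3.14 True
# ?! False
```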
|
Generated by Epydoc 3.0.1 on Mon Apr 11 14:39:42 2011 (http://epydoc.sourceforge.net)