|
Perform the first pass of annotation, which makes decisions based
purely on the word type of each word:
- '?', '!', and '.' are marked as sentence breaks.
- sequences of two or more periods are marked as ellipsis.
- any word ending in '.' that's a known abbreviation is marked as an
  abbreviation.
- any other word ending in '.' is marked as a sentence break.
Return these annotations as a tuple of three sets:
- sentbreak_toks: The indices of all sentence breaks.
- abbrev_toks: The indices of all abbreviations.
- ellipsis_toks: The indices of all ellipsis marks.
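
The rules above can be illustrated with a minimal sketch. It is not the
actual implementation: the function name annotate_first_pass, the
abbrev_types parameter, the use of plain strings as tokens, and the
lowercase lookup of a word without its trailing period are all
assumptions made for illustration.

```python
import re

# Two or more periods and nothing else, e.g. ".." or "...".
ELLIPSIS_RE = re.compile(r"\.{2,}")


def annotate_first_pass(tokens, abbrev_types):
    """Hypothetical first pass over plain-string tokens.

    abbrev_types is assumed to hold known abbreviations in lowercase,
    without the trailing period (e.g. {"dr", "mr"}).
    """
    sentbreak_toks, abbrev_toks, ellipsis_toks = set(), set(), set()
    for i, tok in enumerate(tokens):
        if tok in ("?", "!", "."):
            # Bare sentence-final punctuation.
            sentbreak_toks.add(i)
        elif ELLIPSIS_RE.fullmatch(tok):
            # Checked before the generic "ends in '.'" rule below.
            ellipsis_toks.add(i)
        elif tok.endswith("."):
            if tok[:-1].lower() in abbrev_types:
                abbrev_toks.add(i)
            else:
                sentbreak_toks.add(i)
    return sentbreak_toks, abbrev_toks, ellipsis_toks


tokens = ["Dr.", "Smith", "arrived", "late", "...", "He", "left", "early."]
sentbreak_toks, abbrev_toks, ellipsis_toks = annotate_first_pass(tokens, {"dr"})
print(sentbreak_toks, abbrev_toks, ellipsis_toks)  # {7} {0} {4}
```

The ellipsis test runs before the trailing-period test because a token
such as "..." also ends in '.', and would otherwise be counted as a
sentence break.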
|