Basic method for tokenizing input into sentences
for this tagger:
@param data: list of tokens;
each token is either a word
or a (word, tag) tuple
@type data: [string,]
or [(string, string),]
@param raw: boolean flag; True if the input data is a list
of words (raw), False if it is a list of tagged words
@type raw: Boolean
@return: list of sentences;
each sentence is a list of tokens,
and tokens have the same form as in the input
The function takes a list of tokens and separates them into lists,
where each list represents a sentence fragment.
It can separate both tagged and raw sequences into
basic sentences.
Sentence markers are the set of [,.!?]
This is a simple method which enhances the performance of the TnT
tagger. Better sentence tokenization will further enhance the results.
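
The splitting described above can be sketched roughly as follows. This is a minimal illustration, not the tagger's actual code: the names `chop_into_fragments` and `SENT_MARKERS` are hypothetical, and keeping trailing tokens that have no closing marker is an assumption of this sketch.

```python
SENT_MARKERS = {",", ".", "!", "?"}  # the marker set named in the docstring


def chop_into_fragments(data, raw=True):
    """Hypothetical sketch: split tokens into fragments at sentence markers."""
    fragments = []
    current = []
    for token in data:
        # For tagged input each token is a (word, tag) tuple; compare the word.
        word = token if raw else token[0]
        current.append(token)
        if word in SENT_MARKERS:
            # A marker closes the current fragment and stays inside it.
            fragments.append(current)
            current = []
    if current:
        # Assumption: keep trailing tokens that have no closing marker.
        fragments.append(current)
    return fragments


raw_fragments = chop_into_fragments(["Hello", ",", "world", ".", "Bye"])
tagged_fragments = chop_into_fragments(
    [("Hello", "UH"), (",", ","), ("world", "NN"), (".", ".")], raw=False
)
```

Because the comma is treated as a marker, the output lists are fragments rather than full sentences, which matches the wording above.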