A section tokenizer based on the TextTiling algorithm. The algorithm
detects subtopic shifts based on the analysis of lexical co-occurence
patterns.
The process starts by tokenizing the text into pseudosentences of a
fixed size w. Then, depending on the method used, similarity scores are
assigned at sentence gaps. The algorithm proceeds by detecting the peak
differences between these scores and marking them as boundaries. The
boundaries are normalized to the closest paragraph break and the
segmented text is returned.
|
|
__init__(self,
w=20,
k=10,
similarity_method=0,
stopwords=None,
smoothing_method=[0],
smoothing_width=2,
smoothing_rounds=1,
cutoff_policy=1,
demo_mode=False) |
source code
|
|
|
|
|
|
|
_block_comparison(self,
tokseqs,
token_table)
Implements the block comparison method |
source code
|
|
|
|
_smooth_scores(self,
gap_scores)
Wraps the smooth function from the SciPy Cookbook |
source code
|
|
|
|
_mark_paragraph_breaks(self,
text)
Identifies indented text or line breaks as the beginning of
paragraphs |
source code
|
|
|
|
_divide_to_tokensequences(self,
text)
Divides the text into pseudosentences of fixed size |
source code
|
|
|
|
_create_token_table(self,
token_sequences,
par_breaks)
Creates a table of TokenTableFields |
source code
|
|
|
|
_identify_boundaries(self,
depth_scores)
Identifies boundaries at the peaks of similarity score differences |
source code
|
|
|
|
|
|
|
_normalize_boundaries(self,
text,
boundaries,
paragraph_breaks)
Normalize the boundaries identified to the original text's paragraph
breaks |
source code
|
|
|
Inherited from api.TokenizerI:
batch_span_tokenize,
batch_tokenize,
span_tokenize
|