Package nltk :: Package tokenize :: Module texttiling :: Class TextTilingTokenizer
[hide private]
[frames] | no frames]

type TextTilingTokenizer

source code

    object --+    
             |    
api.TokenizerI --+
                 |
                TextTilingTokenizer

A section tokenizer based on the TextTiling algorithm. The algorithm detects subtopic shifts based on the analysis of lexical co-occurence patterns.

The process starts by tokenizing the text into pseudosentences of a fixed size w. Then, depending on the method used, similarity scores are assigned at sentence gaps. The algorithm proceeds by detecting the peak differences between these scores and marking them as boundaries. The boundaries are normalized to the closest paragraph break and the segmented text is returned.

Instance Methods [hide private]
 
__init__(self, w=20, k=10, similarity_method=0, stopwords=None, smoothing_method=[0], smoothing_width=2, smoothing_rounds=1, cutoff_policy=1, demo_mode=False) source code
 
tokenize(self, text)
The main function.
source code
 
_block_comparison(self, tokseqs, token_table)
Implements the block comparison method
source code
 
_smooth_scores(self, gap_scores)
Wraps the smooth function from the SciPy Cookbook
source code
 
_mark_paragraph_breaks(self, text)
Identifies indented text or line breaks as the beginning of paragraphs
source code
 
_divide_to_tokensequences(self, text)
Divides the text into pseudosentences of fixed size
source code
 
_create_token_table(self, token_sequences, par_breaks)
Creates a table of TokenTableFields
source code
 
_identify_boundaries(self, depth_scores)
Identifies boundaries at the peaks of similarity score differences
source code
 
_depth_scores(self, scores)
Calculates the depth of each gap, i.e.
source code
 
_normalize_boundaries(self, text, boundaries, paragraph_breaks)
Normalize the boundaries identified to the original text's paragraph breaks
source code

Inherited from api.TokenizerI: batch_span_tokenize, batch_tokenize, span_tokenize

Method Details [hide private]

__init__(self, w=20, k=10, similarity_method=0, stopwords=None, smoothing_method=[0], smoothing_width=2, smoothing_rounds=1, cutoff_policy=1, demo_mode=False)
(Constructor)

source code 
Parameters:
  • w (number) - Pseudosentence size
  • k (number) - Size(in sentences) of the block used in the block comparison method
  • similarity_method (constant) - The method used for determining similarity scores
  • stopwords (list) - A list of stopwords that are filtered out
  • smoothing_method (constant) - The method used for smoothing the score plot
  • smoothing_width (number) - The width of the window used by the smoothing method
  • smoothing_rounds (number) - The number of smoothing passes
  • cutoff_policy (constant) - The policy used to determine the number of boundaries
Overrides: object.__init__
(inherited documentation)

tokenize(self, text)

source code 

The main function. Follows a pipeline structure.

Returns:
list of str
Overrides: api.TokenizerI.tokenize

_depth_scores(self, scores)

source code 

Calculates the depth of each gap, i.e. the average difference between the left and right peaks and the gap's score