Module collocations
Tools to identify collocations (words that often appear consecutively)
within corpora. They may also be used to find other associations
between word occurrences. See Manning and Schütze, ch. 5, at
http://nlp.stanford.edu/fsnlp/promo/colloc.pdf and the Text::NSP Perl
package at http://ngram.sourceforge.net.
Finding collocations requires first calculating the frequencies of
words and their appearance in the context of other words. Often the
collection of words will then require filtering to retain only useful
content terms. Each ngram of words may then be scored according to some
association measure, in order to determine the relative likelihood of
each ngram being a collocation.
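The three steps described above (count, filter, score) can be sketched in plain Python without any library support. This is only an illustration, not the module's implementation: the function name `score_bigrams`, its parameters `min_freq` and `stopwords`, and the choice of pointwise mutual information (PMI) as the association measure are all assumptions made for the example.

```python
import math
from collections import Counter


def score_bigrams(words, min_freq=2, stopwords=frozenset()):
    """Hypothetical helper: rank adjacent word pairs by PMI."""
    # Step 1: frequencies of words and of words in the context of a neighbour.
    word_counts = Counter(words)
    bigram_counts = Counter(zip(words, words[1:]))
    total = len(words)

    scored = []
    for (w1, w2), n in bigram_counts.items():
        # Step 2: filter out rare pairs and pairs containing unwanted terms.
        if n < min_freq or w1 in stopwords or w2 in stopwords:
            continue
        # Step 3: score with an association measure (PMI, in bits).
        pmi = math.log2(n * total / (word_counts[w1] * word_counts[w2]))
        scored.append(((w1, w2), pmi))

    # Highest score first; ties are broken alphabetically.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


words = "the cat sat on the mat the cat ran".split()
ranked = score_bigrams(words)
filtered = score_bigrams(words, stopwords=frozenset({"the"}))
```

Here `ranked` holds the only pair that occurs at least twice, `("the", "cat")`, and `filtered` is empty because that pair contains a stopword.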
The BigramCollocationFinder and TrigramCollocationFinder classes provide this
functionality. Each depends on being given a function that scores an
ngram from the appropriate frequency counts. A number of standard
association measures are provided in bigram_measures and trigram_measures.
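A short usage sketch may help show how a finder and a scoring function fit together. It assumes a recent NLTK release, where the standard bigram measures are exposed as the `BigramAssocMeasures` class in `nltk.metrics`; the module-level `bigram_measures` name mentioned above may not be available in that form. The sample sentence is invented for the example.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures  # assumed location in recent NLTK

words = ("new york is big . new york is busy . "
         "i like new york . the city is big").split()

# Collect word and bigram frequencies from the token sequence.
finder = BigramCollocationFinder.from_words(words)

# Filter the candidates: drop bigrams seen fewer than twice,
# and any bigram that contains the "." token.
finder.apply_freq_filter(2)
finder.apply_word_filter(lambda w: w == ".")

# Rank the remaining candidates with two different association measures.
top_by_pmi = finder.nbest(BigramAssocMeasures.pmi, 3)
by_raw_freq = finder.score_ngrams(BigramAssocMeasures.raw_freq)
```

Swapping the scoring function changes only the ranking, not the counting or filtering, which is the design point of passing the measure in as an argument.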
AbstractCollocationFinder
An abstract base class for collocation finders whose purpose is to
collect collocation candidate frequencies, filter and rank them.
BigramCollocationFinder
A tool for the finding and ranking of bigram collocations or other
association measures.
TrigramCollocationFinder
A tool for the finding and ranking of trigram collocations or other
association measures.
demo(scorer=None, compare_scorer=None)
Finds trigram collocations in the files of the WebText corpus.