Package nltk :: Module probability :: Class SimpleGoodTuringProbDist

type SimpleGoodTuringProbDist


object --+    
         |    
 ProbDistI --+
             |
            SimpleGoodTuringProbDist


SimpleGoodTuringProbDist approximates the relationship between a frequency
and its frequency of frequency as a straight line in log-log space, fitted
by linear regression.
Details of the Simple Good-Turing algorithm can be found in:
    (1) Bill Gale and Geoffrey Sampson's joint paper
            "Good-Turing Frequency Estimation Without Tears", published in
            Journal of Quantitative Linguistics, vol. 2, pp. 217-237, 1995
    (2) Jurafsky & Martin's book "Speech and Language Processing",
            2e, Chap. 4.5, p. 103 (log(Nc) = a + b*log(c))
    (3) Website maintained by Geoffrey Sampson:
            http://www.grsampson.net/RGoodTur.html
        
Given a set of pairs (xi, yi), where xi denotes a frequency and
yi denotes the corresponding frequency of frequency, we want the line
that minimizes the sum of squared errors. E(x) and E(y) denote the
means of the xi and the yi.

    - Slope: b = sigma((xi - E(x)) * (yi - E(y))) / sigma((xi - E(x)) * (xi - E(x)))
    - Intercept: a = E(y) - b * E(x)
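
The slope and intercept formulas can be illustrated with a minimal, standalone sketch. It is not the NLTK source: the function name fit_log_log and the sample data are hypothetical, and the logarithms are taken before fitting on the assumption that the regression is done in log-log space, as the class description states.

```python
import math

def fit_log_log(r, nr):
    """Least-squares fit of log(nr) = a + b * log(r); returns (a, b)."""
    xs = [math.log(v) for v in r]
    ys = [math.log(v) for v in nr]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a = y_mean - b * x_mean
    return a, b

# Hypothetical data that follows Nr = 100 / r**2 exactly,
# so the fit should recover slope -2 and intercept log(100).
r = [1, 2, 4, 5, 10]
nr = [100, 25, 6.25, 4, 1]
a, b = fit_log_log(r, nr)
```

Because the sample data lie exactly on a power law, the fitted line reproduces it; real frequency-of-frequency data will only approximate a straight line.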

Instance Methods

__init__(self, freqdist, bins=None)

_r_Nr(self)
Split the frequency distribution into two lists (r, Nr), where Nr(r) > 0.

find_best_fit(self, r, nr)
Use simple linear regression to tune the parameters self._slope and self._intercept in log-log space, based on count and Nr(count). (Working in log space avoids floating-point underflow.)

_switch(self, r, nr)
Calculate the r frontier where we must switch from Nr to Sr when estimating E[Nr].

_variance(self, r, nr, nr_1)

_renormalize(self, r, nr)
Renormalize all the probability estimates so that they form a proper probability distribution.

float
smoothedNr(self, r)
Returns: The number of samples with count r.

float
prob(self, sample)
Returns: The sample's probability.

_prob_measure(self, count)

check(self)

float
discount(self)
Returns the total probability mass transferred from the seen samples to the unseen samples.

any
max(self)
Returns: the sample with the greatest probability.

list
samples(self)
Returns: A list of all samples that have nonzero probabilities.

freqdist(self)

string
__repr__(self)
Returns: A string representation of this ProbDist.

Inherited from ProbDistI: generate, logprob
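
The (r, Nr) split described for _r_Nr can be sketched as follows. This is an illustration under assumptions rather than the method's actual code: the helper r_and_nr is hypothetical, and a collections.Counter stands in for the FreqDist.

```python
from collections import Counter

def r_and_nr(counts):
    """Return parallel lists (r, nr): each observed count r, and the
    number of distinct samples that occur exactly r times."""
    freq_of_freq = Counter(c for c in counts.values() if c > 0)
    r = sorted(freq_of_freq)
    nr = [freq_of_freq[k] for k in r]
    return r, nr

# Hypothetical toy corpus: "the" occurs 3 times, "cat" twice, 5 words once.
counts = Counter("the cat sat on the mat with the other cat".split())
r, nr = r_and_nr(counts)
```

Only counts that actually occur appear in r, so every Nr in the result is greater than zero.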

Class Variables

Inherited from ProbDistI: SUM_TO_ONE

Method Details

__init__(self, freqdist, bins=None)
(Constructor)

Parameters:
  • freqdist (FreqDist) - The frequency counts upon which to base the estimation.
  • bins (int) - The number of possible event types. This must be at least as large as the number of bins in the freqdist. If None, it is assumed to be equal to the number of bins in the freqdist.
Overrides: ProbDistI.__init__

_renormalize(self, r, nr)

All the probability estimates must be renormalized to ensure that a proper probability distribution results. This is done by keeping the estimate of the probability mass for unseen items at N(1)/N and renormalizing all the estimates for previously seen items, as Gale and Sampson (1995) propose (see Manning & Schütze 1999, p. 213).
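
A minimal sketch of this renormalization, assuming the adjusted counts r* are already available. The function renormalize, its arguments, and the numbers are hypothetical and are not taken from the NLTK implementation.

```python
def renormalize(r_star, nr, n):
    """r_star: dict mapping count r to its adjusted count r*.
    nr: dict mapping count r to Nr. n: total number of observations.
    Returns (p_unseen, probs), where probs[r] is the renormalized
    probability of one sample that was seen r times."""
    # Mass reserved for unseen items is kept at N(1)/N.
    p_unseen = nr.get(1, 0) / n
    # Unnormalized per-sample estimates r*/N.
    raw = {r: r_star[r] / n for r in r_star}
    raw_total = sum(nr[r] * raw[r] for r in raw)
    # Scale the seen estimates so that they share the remaining mass.
    scale = (1.0 - p_unseen) / raw_total
    return p_unseen, {r: raw[r] * scale for r in raw}

# Hypothetical values: 5 samples seen once, 1 seen twice, 1 seen three times.
nr = {1: 5, 2: 1, 3: 1}
n = sum(r * count for r, count in nr.items())  # 10 observations
r_star = {1: 0.4, 2: 1.5, 3: 2.6}              # made-up adjusted counts
p_unseen, probs = renormalize(r_star, nr, n)
total = p_unseen + sum(nr[r] * probs[r] for r in probs)
```

After scaling, the unseen mass plus the mass of all seen samples adds up to one.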

smoothedNr(self, r)

Parameters:
  • r (int) - The frequency (count) of interest.
Returns: float
The number of samples with count r.
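
Given a fitted line log(Nr) = a + b*log(r), a smoothed Nr can be read off by exponentiating. The sketch below assumes that relationship; smoothed_nr and its parameters are hypothetical stand-ins for the fitted values held by the object (self._intercept and self._slope).

```python
import math

def smoothed_nr(r, intercept, slope):
    """Evaluate the fitted log-log line at count r and map it back
    out of log space."""
    return math.exp(intercept + slope * math.log(r))

# Hypothetical fitted parameters corresponding to Nr = 100 / r**2.
intercept = math.log(100)
slope = -2.0
s1 = smoothed_nr(1, intercept, slope)
s2 = smoothed_nr(2, intercept, slope)
```

Unlike the raw Nr, the smoothed value is defined for every positive r, including counts that never occur in the data.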

prob(self, sample)

Parameters:
  • sample (string) - The sample (event) whose probability is requested.
Returns: float
The sample's probability.
Overrides: ProbDistI.prob
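
For orientation, the standard Good-Turing adjusted count is r* = (r + 1) * S(r + 1) / S(r), and the unnormalized probability of a sample seen r times is r*/N. The sketch below shows only that formula. It assumes the smoothed estimate S is used for every r and leaves out the switch between Nr and Sr and the renormalization step, so it is not a reproduction of this method. The names adjusted_count and s are hypothetical.

```python
def adjusted_count(r, s):
    """Good-Turing adjusted count r* = (r + 1) * S(r + 1) / S(r),
    where s is a callable giving the smoothed frequency of frequency."""
    return (r + 1) * s(r + 1) / s(r)

def s(r):
    # Hypothetical smoothed curve S(r) = 100 / r**2.
    return 100.0 / r ** 2

n = 1000                        # hypothetical total number of observations
r_star = adjusted_count(2, s)   # discounted count for samples seen twice
p = r_star / n                  # unnormalized probability of one such sample
```

The adjusted count is smaller than the raw count, which is what frees up probability mass for unseen samples.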

discount(self)

Returns the total probability mass transferred from the seen samples to the unseen samples.

Returns: float
The ratio by which counts are discounted on average: c*/c
Overrides: ProbDistI.discount

max(self)

Returns: any
the sample with the greatest probability. If two or more samples have the same probability, return one of them; which sample is returned is undefined.
Overrides: ProbDistI.max
(inherited documentation)

samples(self)

Returns: list
A list of all samples that have nonzero probabilities. Use prob to find the probability of each sample.
Overrides: ProbDistI.samples
(inherited documentation)

__repr__(self)
(Representation operator)

Returns: string
A string representation of this ProbDist.
Overrides: object.__repr__