| Home | Trees | Indices | Help |
|
|---|
|
|
object --+
         |
 ProbDistI --+
             |
            SimpleGoodTuringProbDist
SimpleGoodTuringProbDist approximates the relationship between a
frequency and its frequency of frequency with a straight line in log
space, fitted by linear regression.
Details of the Simple Good-Turing algorithm can be found in:
(1) Bill Gale and Geoffrey Sampson's joint paper
"Good-Turing Frequency Estimation Without Tears", published in
Journal of Quantitative Linguistics, vol. 2, pp. 217-237, 1995
(2) Jurafsky & Martin's book "Speech and Language Processing",
2nd ed., Chap. 4.5, p. 103 (log(Nc) = a + b*log(c))
(3) Website maintained by Geoffrey Sampson:
http://www.grsampson.net/RGoodTur.html
Given a set of pairs (xi, yi), where xi denotes the frequency and
yi denotes the frequency of frequency (both taken in log space), we
want to minimize the sum of squared deviations from the fitted line.
E(x) and E(y) represent the means of xi and yi.
- Slope: b = sigma((xi - E(x)) * (yi - E(y))) / sigma((xi - E(x)) * (xi - E(x)))
- Intercept: a = E(y) - b * E(x)
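The slope and intercept formulas above can be sketched in a few lines of Python. This is only an illustration, not the class's own implementation: the function name `fit_log_log` and its arguments are hypothetical, and the sketch regresses directly on log(Nr), leaving out the averaging step that Gale and Sampson apply to the frequencies of frequencies before fitting.

```python
import math


def fit_log_log(freqs, freq_of_freqs):
    """Least-squares fit of log(Nr) = a + b * log(r).

    freqs         -- observed frequencies r (all > 0)
    freq_of_freqs -- matching frequencies of frequencies Nr (all > 0)
    Returns the pair (a, b): intercept and slope in log space.
    Hypothetical helper, written only to illustrate the formulas.
    """
    # Move both coordinates into log space.
    xs = [math.log(r) for r in freqs]
    ys = [math.log(nr) for nr in freq_of_freqs]

    # E(x) and E(y): the means of the log-space coordinates.
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)

    # Slope: b = sigma((xi - E(x)) * (yi - E(y))) / sigma((xi - E(x))^2)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sxy / sxx

    # Intercept: a = E(y) - b * E(x)
    a = y_mean - b * x_mean
    return a, b
```

Calling `fit_log_log([1, 2, 4, 8], [64.0, 16.0, 4.0, 1.0])` fits data that follows Nr = 64 * r^-2 exactly, so the slope comes out at -2 and the intercept at log(64), up to floating-point error.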
All probability estimates must be renormalized so that they form a proper probability distribution. This is done by keeping the estimate of the probability mass for unseen items at N(1)/N and renormalizing the estimates for previously seen items, as Gale and Sampson (1995) propose (see M&S 1999, p. 213).
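The renormalization step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the method's actual code: the function `renormalize` and its parameters are hypothetical, `seen_probs` is assumed to map each seen item to its unnormalized smoothed estimate, `n1` is N(1) (the number of items seen exactly once), and `n` is the total number of observations N.

```python
def renormalize(seen_probs, n1, n):
    """Rescale seen-item estimates so that they sum to 1 - N(1)/N.

    seen_probs -- dict mapping each seen item to its unnormalized estimate
    n1         -- N(1), the number of items observed exactly once
    n          -- N, the total number of observations
    Returns (unseen_mass, renormalized_probs).
    Hypothetical helper, written only to illustrate the step.
    """
    # Probability mass reserved for unseen items is kept at N(1)/N.
    unseen_mass = n1 / n

    # The seen estimates share the remaining mass in proportion
    # to their current values.
    total_seen = sum(seen_probs.values())
    scale = (1.0 - unseen_mass) / total_seen
    renormalized = {item: p * scale for item, p in seen_probs.items()}
    return unseen_mass, renormalized
```

After the call, the unseen mass plus the sum of the renormalized estimates equals 1 (up to floating-point error), which is the property the renormalization is meant to guarantee.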
This function returns the total probability mass transferred from the seen samples to the unseen samples.
| Generated by Epydoc 3.0.1 on Mon Apr 11 14:39:49 2011 | http://epydoc.sourceforge.net |