Manning, Christopher D.; Schütze, Hinrich;
Foundations of Statistical Natural Language Processing
Cambridge, MA: The MIT Press, 1999, xxxvii + 680 pages
ISBN 0262133601
topics: | language-nlp | statistics
Stanford University / Xerox Palo Alto Research Center
set of outcomes = samples; a stochastic variable ranges over them
[stochastic - indeterminacy in the outcome; as opposed to deterministic]
formally: a random or stochastic variable is a function from the sample space
to the real numbers; probabilities themselves lie between 0 and 1
sample space Omega = set of all basic outcomes; together with the event
space F and the probability function P it forms a probability space
[formally: an event is a subset of Omega and may include multiple outcomes,
e.g. several coin-toss sequences. The set of all events
constitutes a sigma-field F - closed under countable union, and under
complement, with maximal element Omega. Phi (the empty set) = impossible event
(this is the Kolmogorov axiomatization; alternatives exist)]
A discrete probability function P maps F -> [0,1] s.t.
a) There must be some outcome: P(Omega) = 1,
b) Countable Additivity: for pairwise disjoint events A1, A2, ... in F,
P(A1 U A2 U ...) = P(A1) + P(A2) + ...
[Note: since A and ~A are disjoint with union Omega, this gives
P(~A) = 1 - P(A)]
e.g. for 3 coin tosses, the prob space is {HHH, ... TTT} = Omega (8 outcomes)
UNIFORM DISTRIBUTION : each outcome is equally likely
prob of exactly 2 heads = P({HHT, HTH, THH}) = 3/8
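[a minimal Python sketch, mine rather than the book's: enumerate the
three-toss sample space and compute P(exactly 2 heads) under the uniform
distribution]

    from itertools import product
    from fractions import Fraction

    omega = list(product("HT", repeat=3))            # HHH, HHT, ..., TTT; |Omega| = 8
    event = [w for w in omega if w.count("H") == 2]  # HHT, HTH, THH

    # uniform distribution: P(E) = |E| / |Omega|
    print(Fraction(len(event), len(omega)))          # 3/8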
An UNBIASED SAMPLE is a set of objects chosen from a complete sample using a
selection process that does not depend on the properties of the
objects.
e.g. if choosing people for an experiment, all humans should have an equal
chance of being selected; email polls etc. are biased towards internet users
PRIOR PROBABILITY: prob of an event prior to some other knowledge,
i.e. P(event) based on P: Omega -> [0,1];
additional knowledge can alter the assessment.
e.g. prob of exactly 2 heads, when we know the first toss is a head,
is |{HHT, HTH}| / |{HHH, HHT, HTH, HTT}| = 2/4 = 1/2
CONDITIONAL PROBABILITY: P(A|B) = P(A^B) / P(B)
[venn diagram, A^B is inside circle B]
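[a small sketch checking the definition P(A|B) = P(A^B)/P(B) on the same
three-toss space; here A = exactly 2 heads, B = first toss is a head]

    from itertools import product
    from fractions import Fraction

    omega = list(product("HT", repeat=3))

    def P(event):
        # uniform distribution: P(E) = |E| / |Omega|
        return Fraction(len(event), len(omega))

    A = {w for w in omega if w.count("H") == 2}  # exactly two heads
    B = {w for w in omega if w[0] == "H"}        # first toss is H

    print(P(A & B) / P(B))                       # 1/2, as computed above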
MULTIPLICATION (PRODUCT) RULE: P(A^B) = P(A|B) * P(B) = P(B|A) * P(A)
[unlike the conditional form, this also holds when P(B) = 0]
SUM RULE:
P(X) = SUM_Y P(X^Y) = SUM_Y P(X|Y) P(Y)
CHAIN RULE: generalization of the multiplication rule
P(A^B^C^D) = P(A) * P(B|A) * P(C|A^B) * P(D|A^B^C) ...
INDEPENDENCE : A, B indep if P(A^B) = P(A) * P(B)
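[a numeric check of my own: verify the chain rule and the independence
definition on the uniform three-toss space]

    from itertools import product
    from fractions import Fraction

    omega = list(product("HT", repeat=3))

    def P(event):
        return Fraction(len(event), len(omega))

    A = {w for w in omega if w[0] == "H"}  # first toss is H
    B = {w for w in omega if w[1] == "H"}  # second toss is H
    C = {w for w in omega if w[2] == "T"}  # third toss is T

    # chain rule: P(A^B^C) = P(A) * P(B|A) * P(C|A^B)
    assert P(A & B & C) == P(A) * (P(A & B) / P(A)) * (P(A & B & C) / P(A & B))

    # independence: distinct tosses satisfy P(A^B) = P(A) * P(B)
    assert P(A & B) == P(A) * P(B) == Fraction(1, 4)
    print("chain rule and independence verified")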
BAYES' THEOREM: P(B|A) = P(A|B) * P(B) / P(A)
P(B) = prior; P(B|A) = posterior;
P(A|B) = likelihood
P(A) = normalizing constant [brings the result into the [0,1] range;
the posteriors over all B sum to 1]
e.g. B = some parameter value Theta, A = evidence x;
then the likelihood of Theta given x = prob(x | Theta) = prob_Theta(x)
now, given several choices of hypothesis B, we can compare them by looking
only at the numerator P(A^B) = P(A|B) P(B), ignoring the normalization
factor P(A) [choose argmax_B P(A|B) P(B)]
If the B_i are pairwise disjoint and A is contained in Union B_i (e.g. the
B_i partition Omega), then the denominator P(A) = SUM_i P(A|B_i) P(B_i)
Ex. 2: parasitic gap, a rare syntactic construction, occurs on average
once in 100,000 sentences. Suppose a pattern matcher attempts
to identify sentences with parasitic gaps:
if a sentence has a parasitic gap (G), it will say so (T) with prob
0.95; if it has no gap (~G), it will wrongly say it does (T) with prob
0.005. If the test says that a sentence contains a parasitic gap, what is
the probability that this is true? i.e. P(G|T)
P(G) = 10^-5; P(T|G) = 0.95; P(T|~G) = 0.005 = 5.10^-3
P(G|T) = P(T|G) * P(G) / [P(T|G) * P(G) + P(T|~G) * P(~G)]
= 0.95 * 10^-5 / [0.95 * 10^-5 + 5.10^-3 * (1 - 10^-5)]
= 9.5e-3 / (9.5e-3 + 5 * 0.99999) [dividing num and denom by 10^-3]
= 0.0095 / (0.0095 + 4.99995) = 0.0095 / 5.00945
= 0.0019
or abt 1 in 500 cases actually have a parasitic gap [owing to low prior]
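[the same calculation in Python, with the numbers given above]

    p_G = 1e-5              # prior P(G): 1 sentence in 100,000 has a parasitic gap
    p_T_given_G = 0.95      # P(T|G): matcher fires on a true gap
    p_T_given_notG = 0.005  # P(T|~G): false-positive rate

    # total probability: P(T) = P(T|G) P(G) + P(T|~G) P(~G)
    p_T = p_T_given_G * p_G + p_T_given_notG * (1 - p_G)

    # Bayes' theorem: P(G|T) = P(T|G) P(G) / P(T)
    print(p_T_given_G * p_G / p_T)  # ~0.0019, about 1 in 500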
r.v.: function X : Omega -> R^n (commonly n=1)
a discrete random variable is a fn X : Omega -> discrete set S = {a,b,c,d,...},
where a,b,c,d are the values X can take.
If S = {0,1} then X is called an indicator r.v. or a BERNOULLI TRIAL
roll one die once : UNIFORM DISTRIBUTION
sum of two dice : triangular distribution
sum of n dice : tends toward a GAUSSIAN (central limit theorem)
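[a quick simulation sketch, not from the book, showing the shape change:
one die uniform, sum of two triangular, sum of many roughly bell-shaped]

    import random
    from collections import Counter

    random.seed(0)

    def dice_sum_histogram(n_dice, trials=100_000):
        # empirical distribution of the sum of n_dice fair dice
        return Counter(sum(random.randint(1, 6) for _ in range(n_dice))
                       for _ in range(trials))

    for n in (1, 2, 10):
        counts = dice_sum_histogram(n)
        peak = max(counts.values())
        print(f"--- sum of {n} dice ---")
        for total in sorted(counts):
            print(f"{total:3d} {'#' * (50 * counts[total] // peak)}")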
for a discrete r.v., the mapping p(x_i) = P(X = x_i) into [0,1] is the
probability MASS FUNCTION (pmf)
SUM_i p(x_i) = P(Omega) = 1
any non-negative fn satisfying this is also a pmf
continuous r.v.: probability DENSITY function (pdf)
INTEGR p(x) dx = 1 [integrated over the whole domain]
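[a sketch, with the standard normal density as my example: numerically
confirm that a pdf integrates to 1, via a Riemann sum over [-10, 10],
where virtually all of the mass lies]

    import math

    def normal_pdf(x, mu=0.0, sigma=1.0):
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

    dx = 1e-3
    total = sum(normal_pdf(-10 + i * dx) * dx for i in range(int(20 / dx)))
    print(total)  # ~1.0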
SUM and PRODUCT rules carry over to r.v.s:
p(x) = SUM_y p(x,y); p(x,y) = p(y|x) p(x)
expectn of function f(x) = SUM_x p(x) f(x) [or INTEGR p(x) f(x) dx]
if x1..xi..xN are samples drawn from some distribn (cont or discrete),
then E(f) ~= (1/N) SUM_i f(xi)
Q. what is the expected value when rolling one die?
E(X) = (1/6) SUM(1..6) = 21/6 = 3.5
rolling two dice: E = SUM_{x=2..12} x p(x)
= (2+12).1/36 + (3+11).2/36 + ...
= 14.[1/36 + 2/36 + 3/36 + 4/36 + 5/36] + 7.1/6
= 210/36 + 42/36 = 252/36 = 7
i.e. E(X + Y) = E(X) + E(Y)
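[the dice expectations above, computed exactly and via the sample estimate]

    import random
    from fractions import Fraction

    random.seed(0)

    # one die: E(X) = SUM x * p(x), with p(x) = 1/6
    print(sum(Fraction(x, 6) for x in range(1, 7)))   # 7/2 = 3.5

    # two dice: E(X+Y) = SUM_{s=2..12} s * p(s); count ways to reach each sum
    ways = {s: sum(1 for a in range(1, 7) for b in range(1, 7) if a + b == s)
            for s in range(2, 13)}
    print(sum(Fraction(s * w, 36) for s, w in ways.items()))  # 7 = E(X) + E(Y)

    # Monte Carlo estimate: E(f) ~= (1/N) SUM f(xi)
    N = 100_000
    print(sum(random.randint(1, 6) + random.randint(1, 6) for _ in range(N)) / N)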