Variances, N vs N-1 (1995) Seal.
... divide by N or N-1 for variance? Seal
=====================David Seal, 06 Nov 1995========ssm, ...
From: dseal@armltd.co.uk (David Seal)
Newsgroups: sci.math.num-analysis,sci.math,sci.stat.math
Subject: Re: What's Standard Deviation?
Message-ID: <47l8tn$nrt@doc.armltd.co.uk>
zhy@pchindig (Zhang Hongyu) writes:
>Can u tell me which definition is correct for Standard Deviation,
>    SQR( (X-EX)2 / N )    or    SQR( (X-EX)2 / (N-1) )  ?
>
>where
> EX means the Expectation(average value) of X. X2 means square of X.
>
>I've met both of these definitions in several cases, so I wonder
>what's their difference?
They're both valid (apart from some typos), but in different
circumstances. Basically, the first is a formula in probability (where
you're dealing with a known distribution); the second is one in
statistics (where you're dealing with an unknown distribution).
In probability, given a known distribution for X, the variance is
E((X-E(X))^2). If there are a finite number N of equiprobable values
for X, this is the same as:
        SUM((X-E(X))^2)
        ---------------
               N
The standard deviation is the square root of this variance, giving a
formula akin to your first one above.
In statistics, given a set of N samples from an unknown distribution,
an unbiased estimate for the mean of the unknown distribution is:
        SUM(X_i)
    M = --------
           N
and an unbiased estimate for its variance is:
        SUM((X_i-M)^2)
    V = --------------
             N-1
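As a small illustration of the two divisors (not part of the original post), here is a sketch in Python. It assumes the standard library's statistics module, where pvariance divides by N and variance divides by N-1. The sample values are made up for the example.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(samples)

m = sum(samples) / n                      # sample mean M
ss = sum((x - m) ** 2 for x in samples)   # sum of squared differences from M

var_n = ss / n                # divide by N   (known-distribution formula)
var_n_minus_1 = ss / (n - 1)  # divide by N-1 (unbiased estimate V)

# The standard library agrees with the hand-rolled versions.
assert abs(var_n - statistics.pvariance(samples)) < 1e-12
assert abs(var_n_minus_1 - statistics.variance(samples)) < 1e-12

print(m, var_n, var_n_minus_1)
```

The N-1 result is always the larger of the two, by a factor of N/(N-1), so the difference matters most for small N.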
Why N-1 rather than N? Very roughly: if we could subtract the true
mean of the unknown distribution from the samples, square and sum the
results and then divide by N, we would get a good estimate of the
unknown variance. But we can't do this: we only know the mean M of the
samples we took, not the true mean of the distribution. Now, M tends
to follow the samples around a bit - e.g. if lots of the samples are
less than the true mean, our value for M will probably be below the
true mean as well. This effect tends to reduce the sum of the squared
differences, and if you do the mathematics, it turns out that the
factor by which it is expected to reduce it is (N-1)/N. So dividing by
N-1 instead of N compensates for the fact that we can only work with
M, not the true mean. (Except of course in the extreme case of N=1,
where M is always equal to the one and only sample, making the squared
difference equal to 0. This is in accordance with the reduction by a
factor of (N-1)/N = 0/1 = 0, but we can't compensate by dividing by 0
instead of 1: all we get is the undefined value 0/0. But even this
makes sense if you think about what is going on: seeing 1 sample from
an unknown distribution tells you *nothing* about how widely spread
that distribution is.)
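For readers who want "the mathematics" spelled out, here is one standard way to get the (N-1)/N factor. The symbols \mu and \sigma^2 (true mean and true variance) are notation introduced for this sketch, and the samples X_i are assumed to be independent draws from the same distribution.

```latex
\begin{align*}
\sum_{i=1}^{N} (X_i - M)^2
  &= \sum_{i=1}^{N} (X_i - \mu)^2 - N\,(M - \mu)^2 \\
E\!\left[\sum_{i=1}^{N} (X_i - \mu)^2\right]
  &= N\sigma^2 \\
E\!\left[(M - \mu)^2\right]
  &= \operatorname{Var}(M) = \frac{\sigma^2}{N} \\
E\!\left[\sum_{i=1}^{N} (X_i - M)^2\right]
  &= N\sigma^2 - N \cdot \frac{\sigma^2}{N} = (N-1)\,\sigma^2
\end{align*}
```

So dividing the sum by N gives an expected value of ((N-1)/N) times the true variance, while dividing by N-1 gives an expected value equal to the true variance.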
To show a very simple example of what is going on: consider a known
distribution which produces -1 with probability 1/2 and +1 with
probability 1/2. We can calculate the variance of this probability
distribution by:
    true mean = (-1 + +1)/2 = 0
    true variance = ((-1 - 0)^2 + (+1 - 0)^2)/2 = 1
    true standard deviation = SQR(1) = 1.
Now suppose that we're faced with this distribution as an unknown
distribution, and we do an experiment involving taking 2 samples.
There are four equally likely outcomes for the experiment:
1st sample  2nd sample  M     Variance calculated by       V
                              dividing by N:
                              From M   From true mean = 0
------------------------------------------------------------
-1          -1          -1    0        1                   0
-1          +1           0    1        1                   2
+1          -1           0    1        1                   2
+1          +1          +1    0        1                   0
The variance calculated using a division by N, with differences taken
from M, is too small in the cases where M is not equal to the true
mean. By dividing by N-1 instead of N, we get an estimate for the
variance which is 0 half the time and 2 the other half, making it an
unbiased estimate for the true variance of 1. (Not a very good
estimate, of course - but we can't expect a good estimate from
just two samples!)
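The two-sample experiment is small enough to enumerate exactly. The following Python sketch (an illustration, not part of the original post) lists the four equally likely outcomes and averages each of the three variance figures. The names rows and average are made up for the example, and exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product

values = [-1, +1]        # the two equally likely outcomes
true_mean = Fraction(0)  # known here, but unknown to the experimenter
n = 2                    # samples per experiment

rows = []
for xs in product(values, repeat=n):
    m = Fraction(sum(xs), n)                         # sample mean M
    ss_m = sum((x - m) ** 2 for x in xs)             # squared differences from M
    ss_true = sum((x - true_mean) ** 2 for x in xs)  # ... from the true mean
    rows.append((xs, m, ss_m / n, ss_true / n, ss_m / (n - 1)))

def average(column):
    """Expected value of one column over the equally likely outcomes."""
    return sum(row[column] for row in rows) / len(rows)

print("divide by N, from M:         ", average(2))
print("divide by N, from true mean: ", average(3))
print("V (divide by N-1, from M):   ", average(4))
```

Dividing by N with differences taken from M averages to 1/2, which is (N-1)/N times the true variance of 1; the other two columns both average to 1.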
Finally, note that I have been careful to talk about V being an
unbiased estimate for the variance, not SQR(V) being an unbiased
estimate for the standard deviation. This is because SQR(V) is no such
thing: in the case above, for instance, SQR(V) is 0 half the time and
SQR(2) the other half: its expected value is therefore SQR(2)/2, not
the true standard deviation (i.e. 1).
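The same enumeration can be used to check this numerically. This is a sketch only; the helper name v is invented for the example.

```python
import math
from itertools import product

# The four equally likely pairs of samples from the -1/+1 distribution.
outcomes = list(product([-1, +1], repeat=2))

def v(xs):
    """Unbiased variance estimate V (divide by N-1)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

mean_v = sum(v(xs) for xs in outcomes) / len(outcomes)
mean_sqrt_v = sum(math.sqrt(v(xs)) for xs in outcomes) / len(outcomes)

print(mean_v)        # expected value of V
print(mean_sqrt_v)   # expected value of SQR(V)
```

The first figure equals the true variance, 1, while the second comes out at SQR(2)/2, roughly 0.707, which is below the true standard deviation of 1.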
To summarise:
* The formula involving dividing by N is suitable for calculating the
variance of a known distribution having N equally probable outcomes.
(If the outcomes aren't equiprobable, go back to the E((X-E(X))^2)
formula.)
* The formula involving dividing by N-1 is suitable for estimating the
variance of an unknown distribution, given N samples from that
distribution.
David Seal
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Document by Rich Ulrich. E-mail to wpilib+@pitt.edu
http://www.pitt.edu/~wpilib/stats99.html