<- file 95varqn.html -> Variances, N vs N-1 (1995) Seal.
  • ... divide by N or N-1 for variance? Seal
  • =====================David Seal, 06 Nov 1995========ssm, ...
From: dseal@armltd.co.uk (David Seal)
Newsgroups: sci.math.num-analysis,sci.math,sci.stat.math
Subject: Re: What's Standard Deviation?
Message-ID: <47l8tn$nrt@doc.armltd.co.uk>

zhy@pchindig (Zhang Hongyu) writes:

>Can u tell me which definition is correct for Standard Deviation,
>     ----------             ----------
>    / (X-EX)2              / (X-EX)2
>   /-----------     or    /-----------   ?
>  V      N               V     N-1
>
>where
>  EX means the Expectation(average value) of X. X2 means square of X.
>
>I've met both of these definitions in several cases, so I wonder
>what's their difference?

They're both valid (apart from some typos), but in different circumstances. Basically, the first is a formula in probability (where you're dealing with a known distribution); the second is one in statistics (where you're dealing with an unknown distribution).

In probability, given a known distribution for X, the variance is E((X-E(X))^2). If there are a finite number N of equiprobable values for X, this is the same as:

    SUM((X-E(X))^2)
    ---------------
           N

The standard deviation is the square root of this variance, giving a formula akin to your first one above.

In statistics, given a set of N samples from an unknown distribution, an unbiased estimate for the mean of the unknown distribution is:

        SUM(X_i)
    M = --------
           N

and an unbiased estimate for its variance is:

        SUM((X_i-M)^2)
    V = --------------
             N-1

Why N-1 rather than N? Very roughly: if we could subtract the true mean of the unknown distribution from the samples, square and sum the results and then divide by N, we would get a good estimate of the unknown variance. But we can't do this: we only know the mean M of the samples we took, not the true mean of the distribution. Now, M tends to follow the samples around a bit - e.g. if lots of the samples are less than the true mean, our value for M will probably be below the true mean as well.
This effect tends to reduce the sum of the squared differences, and if you do the mathematics, it turns out that the factor by which it is expected to reduce it is (N-1)/N. So dividing by N-1 instead of N compensates for the fact that we can only work with M, not the true mean.

(Except of course in the extreme case of N=1, where M is always equal to the one and only sample, making the squared difference equal to 0. This is in accordance with the reduction by a factor of (N-1)/N = 0/1 = 0, but we can't compensate by dividing by 0 instead of 1: all we get is the undefined value 0/0. But even this makes sense if you think about what is going on: seeing 1 sample from an unknown distribution tells you *nothing* about how widely spread that distribution is.)

To show a very simple example of what is going on: consider a known distribution which produces -1 with probability 1/2 and +1 with probability 1/2. We can calculate the variance of this probability distribution by:

    true mean               = (-1 + +1)/2 = 0
    true variance           = ((-1 - 0)^2 + (+1 - 0)^2)/2 = 1
    true standard deviation = SQR(1) = 1.

Now suppose that we're faced with this distribution as an unknown distribution, and we do an experiment involving taking 2 samples. There are four equally likely outcomes for the experiment:

1st sample  2nd sample   M   Variance calculated by       V
                             dividing by N:
                             From M   From true mean = 0
-------------------------------------------------------------
    -1          -1      -1     0              1           0
    -1          +1       0     1              1           2
    +1          -1       0     1              1           2
    +1          +1      +1     0              1           0

The variance calculated using a division by N, with differences taken from M, is too small in the cases where M is not equal to the true mean. By dividing by N-1 instead of N, we get an estimate for the variance which is 0 half the time and 2 the other half, making it an unbiased estimate for the true variance of 1. (Obviously not a very good estimate, of course - but we can't expect a good estimate from just two samples!)
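
The table above can be checked mechanically. What follows is a minimal Python sketch, offered as an illustration only: the variable names (rows, average and so on) are chosen here and do not come from the post. It enumerates the four equally likely two-sample outcomes with exact fractions and averages each column.

```python
from fractions import Fraction
from itertools import product

values = (-1, +1)          # the two equally likely values of the distribution
true_mean = Fraction(0)    # known here, but unavailable to the experimenter
n = 2                      # samples taken per experiment

rows = []
for sample in product(values, repeat=n):
    m = Fraction(sum(sample), n)                          # sample mean M
    ss_from_m = sum((x - m) ** 2 for x in sample)         # squared differences from M
    ss_from_true = sum((x - true_mean) ** 2 for x in sample)
    rows.append((sample, m,
                 ss_from_m / n,          # divide by N, differences from M
                 ss_from_true / n,       # divide by N, differences from true mean
                 ss_from_m / (n - 1)))   # V: divide by N-1, differences from M

def average(column):
    """Average one column over the four equally likely outcomes."""
    return sum(row[column] for row in rows) / len(rows)

mean_div_n_from_m = average(2)
mean_div_n_from_true = average(3)
mean_v = average(4)

for row in rows:
    print(row)
print(mean_div_n_from_m, mean_div_n_from_true, mean_v)
```

Averaged over the four outcomes, dividing by N with differences taken from M gives 1/2, which is the true variance of 1 shrunk by (N-1)/N = 1/2. The other two columns both average to 1.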
Finally, note that I have been careful to talk about V being an unbiased estimate for the variance, not SQR(V) being an unbiased estimate for the standard deviation. This is because SQR(V) is no such thing: in the case above, for instance, SQR(V) is 0 half the time and SQR(2) the other half: its expected value is therefore SQR(2)/2, not the true standard deviation (i.e. 1).

To summarise:

* The formula involving dividing by N is suitable for calculating the variance of a known distribution having N equally probable outcomes. (If the outcomes aren't equiprobable, go back to the E((X-E(X))^2) formula.)

* The formula involving dividing by N-1 is suitable for estimating the variance of an unknown distribution, given N samples from that distribution.

David Seal

* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
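
For readers who want to try the two summary formulas, Python's standard statistics module appears to offer a direct counterpart to each: pvariance divides by N and variance divides by N-1. The sketch below is an illustration under that reading, not something from the post, and its variable names are chosen here. It reuses the -1/+1 example and also checks the remark about SQR(V).

```python
import math
import statistics
from itertools import product

# Known distribution with N equally probable outcomes: divide by N.
population = [-1, +1]
true_variance = statistics.pvariance(population)
true_sd = math.sqrt(true_variance)

# Unknown distribution: estimate the variance from each possible pair of
# samples, dividing by N-1 (this is what statistics.variance does).
estimates = [statistics.variance(pair) for pair in product(population, repeat=2)]

# Average over the four equally likely experiments.
expected_v = sum(estimates) / len(estimates)
expected_sqrt_v = sum(math.sqrt(v) for v in estimates) / len(estimates)

print(true_variance, true_sd)        # variance and SD of the known distribution
print(expected_v, expected_sqrt_v)   # V is unbiased; SQR(V) falls short of true_sd
```

Note that statistics.variance refuses a single data point (it raises StatisticsError), which matches the N=1 case discussed in the post.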
  • Document by Rich Ulrich. E-mail to wpilib+@pitt.edu
  • Ulrich FAQ. http://www.pitt.edu/~wpilib/stats99.html