'effective' sample N (1996) Volstad.
=====================Jon Volstad, 05 Jan 1996========ssc
From: Jon Helge Volstad
Subject: Re: Analysis of large data sets
Message-ID: <4cjas4$f17@news.internetmci.com>
>
> ... Maybe another problem is overconfidence, and the tendency to
> overlook one's limits, to get too impressed by p-levels? Consider
> the example (mentioned) of data from stock markets: You may have
> millions of DATA POINTS, but what is the FRAMEWORK of your random
> sampling? Do you really have "random sampling" or do you, perhaps,
> have VERY LITTLE justification for extrapolating to ANY other set
> of stock prices?
>
> Is it from one country? one year? some limited, defined set of stocks?
> - For Stock prices, for instance, autocorrelation of prices means that
> you don't have a legitimate ANOVA (etc.) if you just dump a bunch of
> series of prices into one stat package as if they WERE independent...
>
>
> Rich Ulrich, biostatistician wpilib@vms.cis.pitt.edu
I think that the concept of "effective sample size" (see, e.g., Kish
(1965); Skinner et al. (1989); Pennington and Volstad (1994)) should
be considered when dealing with "large data-sets". In my field (marine
fisheries surveys, environmental surveys), thousands of individual
fish (from < 100 randomly selected locations) may yield a poor
estimate of the population parameter of interest (say, mean length, or
age-distribution of a fish population). The reason is the effect of
"local homogeneity", i.e., the tendency of fish to be caught in
clusters, and of fish from one station to be more similar to each
other than to the general population. In one trawl survey we analyzed, measurements of
12,000+ fish had an effective sample size of 28!
That is, 28 randomly selected fish would be as good as 12,000+ from
the cluster-sampling actually employed. Of course, simple random
sampling of fish is not feasible in practice. However, it is possible
(though seldom done) to adjust for the design effects when analyzing
the data. Similar effects occur in many other fields; simple random
samples (i.e., independent observations) are rare.
In hypothesis testing, the effects of intra-cluster correlation, serial
correlation, etc. must be corrected for to yield useful results.
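
To make the idea of a design effect concrete, here is a minimal sketch
in Python. It uses Kish's textbook approximation for equal-sized
clusters, deff = 1 + (m - 1) * rho, where m is the mean cluster size and
rho is the intra-cluster correlation. The function names and the input
values are hypothetical, chosen only for illustration. This is not the
method used in the trawl survey analysis mentioned above, and it will
not reproduce the figure of 28 quoted there.

```python
def design_effect(mean_cluster_size: float, icc: float) -> float:
    """Kish's approximate design effect for equal-sized clusters.

    deff = 1 + (m - 1) * rho, with m the mean cluster size and
    rho the intra-cluster correlation (both supplied by the caller).
    """
    return 1.0 + (mean_cluster_size - 1.0) * icc


def effective_sample_size(n_total: int, n_clusters: int, icc: float) -> float:
    """Effective sample size n_eff = n / deff under the approximation above."""
    mean_cluster_size = n_total / n_clusters
    return n_total / design_effect(mean_cluster_size, icc)


if __name__ == "__main__":
    # Hypothetical inputs: 12,000 fish from 100 stations.
    # The rho values are assumptions, not estimates from any survey.
    for rho in (0.0, 0.1, 0.5, 1.0):
        n_eff = effective_sample_size(12000, 100, rho)
        print(f"rho = {rho:.1f}  ->  effective n = {n_eff:.1f}")
```

With rho = 0 the observations behave as independent and the effective
sample size equals the number of fish; with rho = 1 every fish at a
station carries the same information and the effective sample size
drops to the number of stations.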
References:
Kish, L. 1965. Survey Sampling. Wiley.
Skinner, C. J., D. Holt, and T. M. F. Smith. 1989. Analysis of complex
surveys. Wiley.
Pennington, M. and J. H. Volstad. 1994. Assessing the effect of
intra-haul correlation and variable density on estimates of
population characteristics from marine surveys. Biometrics 50: 725-732.
Sincerely,
Jon Helge Volstad
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Document by Rich Ulrich. E-mail to wpilib+@pitt.edu
http://www.pitt.edu/~wpilib/stats99.html