<- file 96sampn.html -> 'effective' sample N (1996) Volstad.
  • 'effective' sample size. Volstad.
  • =====================Jon Volstad, 05 Jan 1996========ssc
From: Jon Helge Volstad <volstadjon@versar.com>
Subject: Re: Analysis of large data sets
Message-ID: <4cjas4$f17@news.internetmci.com>

> ... Maybe another problem is overconfidence, and the tendency to
> overlook one's limits, to get too impressed by p-levels? Consider
> the example (mentioned) of data from stock markets: You may have
> millions of DATA POINTS, but what is the FRAMEWORK of your random
> sampling? Do you really have "random sampling" or do you, perhaps,
> have VERY LITTLE justification for extrapolating to ANY other set
> of stock prices?
>
> Is it from one country? one year? some limited, defined set of stocks?
> - For Stock prices, for instance, autocorrelation of prices means that
> you don't have a legitimate ANOVA (etc.) if you just dump a bunch of
> series of prices into one stat package as if they WERE independent...
>
> Rich Ulrich, biostatistician wpilib@vms.cis.pitt.edu

I think that the concept of "effective sample size" (see, e.g., Kish (1965); Skinner et al. (1989); Pennington and Volstad (1994)) should be considered when dealing with "large data sets". In my field (marine fisheries surveys, environmental surveys), thousands of individual fish (from < 100 randomly selected locations) may yield a poor estimate of the population parameter of interest (say, mean length, or the age distribution of a fish population). The reason is "local homogeneity", i.e., the tendency of fish to be caught in clusters, and of fish from one station to be more similar to each other than to fish from the general population.

In one trawl survey we analyzed, measurements of 12,000+ fish had an effective sample size of 28! That is, 28 randomly selected fish would be as good as the 12,000+ from the cluster sampling actually employed. Of course, simple random sampling of fish is not feasible in practice. However, it is possible (though seldom done) to adjust for the design effects when analyzing the data.
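
The relation between nominal and effective sample size described above can be sketched with Kish's design effect. This is a minimal illustration only, not the calculation used in the trawl survey analysis: it assumes equal-sized clusters and the common approximation deff = 1 + (m - 1) * rho, where m is the cluster size and rho the intra-cluster correlation. The function names and the example figures (50 stations, 40 fish each, rho = 0.5) are hypothetical.

```python
def design_effect(cluster_size: float, icc: float) -> float:
    """Kish's approximation deff = 1 + (m - 1) * rho for equal-sized clusters."""
    return 1.0 + (cluster_size - 1.0) * icc


def effective_sample_size(n: int, deff: float) -> float:
    """Size of a simple random sample with the same precision as n clustered observations."""
    return n / deff


# Hypothetical survey (not data from the post): 50 stations, 40 fish per
# station, intra-cluster correlation 0.5.
n_stations, fish_per_station, icc = 50, 40, 0.5
deff = design_effect(fish_per_station, icc)
n_eff = effective_sample_size(n_stations * fish_per_station, deff)
print(f"deff = {deff:.1f}, effective n = {n_eff:.1f}")  # deff = 20.5, effective n = 97.6
```

Read the other way, an effective sample size of 28 from roughly 12,000 measured fish implies a design effect of about 12,000 / 28, or roughly 430, whatever the mechanism that produced it.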

Similar effects occur in many other fields; simple random sampling (i.e., independence of observations) is rare. In hypothesis testing, the effects of intra-cluster correlation, serial correlation, etc., must be corrected for to yield useful results.

References:
Kish, L. 1965. Survey Sampling. Wiley.
Skinner, C. J., D. Holt, and T. M. F. Smith. 1989. Analysis of Complex Surveys. Wiley.
Pennington, M. and J. H. Volstad. 1994. Assessing the effect of intra-haul correlation and variable density on estimates of population characteristics from marine surveys. Biometrics 50: 725-732.

Sincerely,
Jon Helge Volstad

* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
  • Document by Rich Ulrich. E-mail to wpilib+@pitt.edu
  • Ulrich FAQ. http://www.pitt.edu/~wpilib/stats99.html