<- file 95zeroes.html -> zeroes, and taking log? (1995). Ulrich, Velleman comments
  • Handling 0 when taking log(x)
===================Rich Ulrich, 14 Jul 1995==========ssc
Subject: Re: Handling X=0 when Y=f(log(X)
Message-ID: <3u5vb8$kvt@usenet.srv.cis.pitt.edu>

Michael Lacy (mglacy@lamar.ColoState.EDU) wrote:
: The title says it all: A colleague is trying to find literature
: that systematically treats the problem of what to do with legitimate
: values of 0 for the independent variable when trying to estimate
: a regression equation as described above. Various sources, of course,

When I have a legitimate log(X) probability model, then zeros exist
only as a function of errors of measurement, e.g., `the bio-assay was
insensitive.' Then I use, say, 1/2 the smallest real value. Or the
smallest value, for convenience, as my data minimum.

If I am just looking for a pragmatic re-scaling to normality, then I
try add-ons that are simple numbers like 1 or 2 or 10 or .1, whichever
values tend to minimize the computed skewness, and which may still be
usable the next time I see data of the same kind.

Do you have some other reason for looking at log of x?
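The two tactics in the reply above can be sketched in a few lines of Python. This is only an illustration, not code from the original post: the function names (`replace_zeros`, `skewness`, `best_start`) and the sample counts are made up, "minimize the computed skewness" is read here as minimizing the absolute value of the moment skewness, and natural logs are assumed.

```python
import math

def replace_zeros(values, fraction=0.5):
    """Replace zeros with fraction * (smallest positive value).

    fraction=0.5 gives '1/2 the smallest real value'; fraction=1.0
    uses the smallest value itself as the data minimum.
    """
    positives = [v for v in values if v > 0]
    if not positives:
        raise ValueError("need at least one positive value")
    floor = fraction * min(positives)
    return [v if v > 0 else floor for v in values]

def skewness(values):
    """Moment skewness m3 / m2**1.5 (assumes the values are not all equal)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def best_start(values, candidates=(0.1, 1, 2, 10)):
    """Pick the add-on c for which log(x + c) has the smallest |skewness|."""
    def abs_skew(c):
        return abs(skewness([math.log(v + c) for v in values]))
    return min(candidates, key=abs_skew)

# Made-up counts with two zeros, for illustration only.
counts = [0, 0, 1, 3, 4, 9, 15, 40, 120, 300]

floored = replace_zeros(counts)   # zeros become half the smallest positive count
start = best_start(counts)        # one of 0.1, 1, 2, 10

print(floored)
print(start)
```

Restricting the candidates to a few round numbers follows the suggestion that the add-on should still be usable the next time data of the same kind turn up.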
  • Is log(x+1) okay for this x=0?
  • =======================Paul Velleman, 06 Nov 1995==========ssc
From: pfv2@cornell.edu (Paul Velleman)
Subject: Re: regression question
Message-ID: <pfv2-061195222059@132.236.236.43>

> Dear Readers,
> I have several sets of regression data, some of which contain several
> pairs of zero values. Is there any rule that would help me determine whether
> a data set should be discarded because of the large number of zero values.
> this is an example of two data sets I have:
>
> Both sets of data compare the recovery of mosquito eggshells with two
> techniques. Each pair represents the same sample. DEC is taken as the independent
> variable. It is desired that the relationship be determined between the two
> methods, so as to reduce the amount of work to be carried out. I carried out
> a log(x+1) transformation (base 10). Would this have made any difference,
> compared to untransformed data. All regressions were significant whether the
> data was transformed or not. On a graph, the data 'looks better', compared
> to untransformed data (hardly a justification for transforming !), because
> the gap between the low values (clumped near the origin) and high values is
> less obvious. There is no extra time I can allocate to 'filling in the gaps'
> with more data.

By all means, transform these data. While Herman is technically correct
when he says:

>The usual assumptions for a regression are that the dependent variable is
>a linear combination of the independent variables and an "error" which is
>independent of the independent variables. This can be relaxed somewhat,
>but is that YOUR model? Nothing justifies transformation except the
>model.

he ignores the fact that the data have much information to help us decide
whether the simple regression model was reasonable. The data themselves
can justify a transformation by suggesting that the original regression
model must have been wrong.
These data, for example, make it clear that a linear model in the raw
data is not reasonable. There is good evidence that a log transformation
stabilizes the variance and that the relationship is linear and the error
additive on the log scale. This is quite an ordinary occurrence.

This is much of what you mean by the transformed data "looking better".
In that sense you are wrong to suggest that this is not a justification
for transformation. You should always listen to your data. Some data
whisper; these are shouting at you for a log transformation.

To answer your other questions, yes it does make a difference compared to
the untransformed data. It is an entirely different model. However, the
evidence in your data suggests strongly that it is a more appropriate
model. That should send you back to the theory to ask whether a model
relating the logs makes scientific sense. (The data are so strong that if
it doesn't, I would suggest that you rethink the science or redesign the
experiment.)

There is nothing wrong with adding a starting constant of 1 to the data
to save the zero point. You might consider whether the 1 has any meaning
in this context. Other starts are possible, but they make little
difference to the final analysis. The start merely has the effect of
weakening the effect of the log transformation slightly.

<<details about a suspicious data points...>>

(I don't know if any of these questions is scientifically interesting
because I know nothing about the data. However, they only arise if you
transform the data. This is another argument for transforming; it helps
you to see what may be going on in your data and thereby raises new
questions that may prove interesting.)

Hope this helps.

Paul Velleman
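For readers who want to try the started-log regression described in the reply above, here is a minimal Python sketch. It is not from either post: the original data were not preserved in this archive, so the paired counts below are invented placeholders, and `log_start` and `fit_line` are hypothetical helper names. It fits an ordinary least-squares line to log10(x + start) for a few starts so that the resulting fits can be compared side by side.

```python
import math

def log_start(values, start=1.0):
    """Base-10 log after adding a starting constant, so zeros stay finite."""
    return [math.log10(v + start) for v in values]

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (intercept, slope, r), where r is the Pearson correlation.
    Assumes neither x nor y is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return intercept, slope, r

# Invented paired counts (same samples, two methods), including zero pairs.
method_a = [0, 0, 2, 5, 11, 30, 75, 210]   # stands in for the independent variable
method_b = [0, 1, 1, 7, 9, 41, 60, 260]

# Refit with several starts to see how much the choice matters for these data.
fits = {
    start: fit_line(log_start(method_a, start), log_start(method_b, start))
    for start in (0.5, 1.0, 2.0)
}

for start, (intercept, slope, r) in fits.items():
    print(f"start={start}: intercept={intercept:.3f} slope={slope:.3f} r={r:.3f}")
```

Plotting the residuals from each fit against the fitted values is the usual way to check whether the variance really is stabilized on the log scale.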
  • Document by Rich Ulrich. E-mail to wpilib+@pitt.edu
  • Ulrich FAQ. http://www.pitt.edu/~wpilib/stats99.html