File 95zeroes.html: zeroes, and taking log? (1995).
Ulrich, Velleman comments
Handling 0 when taking log(x)
===================Rich Ulrich, 14 Jul 1995==========ssc
Subject: Re: Handling X=0 when Y=f(log(X)
Message-ID: <3u5vb8$kvt@usenet.srv.cis.pitt.edu>
Michael Lacy (mglacy@lamar.ColoState.EDU) wrote:
: The title says it all: A colleague is trying to find literature
: that systematically treats the problem of what to do with legitimate
: values of 0 for the independent variable when trying to estimate
: a regression equation as described above. Various sources, of course,
When I have a legitimate log(X) probability model, then zeros exist
only as a function of errors of measurement, e.g., `the bio-assay was
insensitive.' Then I use, say, 1/2 the smallest real value. Or the
smallest value, for convenience, as my data minimum.
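A minimal sketch of the substitution described above, in Python. The function name `log_with_floor`, its `fraction` parameter, and the sample readings are assumptions made for illustration; they are not from the original post.

```python
import math

def log_with_floor(values, fraction=0.5):
    """Take log10 after replacing zeros with fraction * smallest positive value.

    fraction=0.5 gives "1/2 the smallest real value"; fraction=1.0 uses the
    smallest observed value itself as the data minimum.
    """
    positives = [v for v in values if v > 0]
    if not positives:
        raise ValueError("need at least one positive value")
    floor = fraction * min(positives)
    return [math.log10(v if v > 0 else floor) for v in values]

assay = [0.0, 0.4, 1.2, 0.0, 3.5]   # made-up readings, zeros = below detection
logged = log_with_floor(assay)
print(logged)
```

Only the zeros are changed; every positive reading is logged as it stands.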
If I am just looking for a pragmatic re-scaling to normality, then
I try add-ons that are simple numbers like 1 or 2 or 10 or .1,
whichever values tend to minimize the computed skewness, and which
may still be usable the next time I see data of the same kind.
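The search over simple add-ons might look something like the following sketch. The helper names `skewness` and `best_start`, the candidate list, and the sample counts are all assumptions for illustration; the skewness used here is the plain moment ratio m3 / m2^1.5, which may differ slightly from what a given statistics package reports.

```python
import math

def skewness(xs):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def best_start(values, candidates=(0.1, 1, 2, 10)):
    """Return the candidate add-on c for which log10(x + c) is least skewed."""
    def abs_skew(c):
        return abs(skewness([math.log10(v + c) for v in values]))
    return min(candidates, key=abs_skew)

counts = [0, 0, 1, 3, 7, 20, 55, 150, 400]   # made-up data with zeros
start = best_start(counts)
print("add-on with smallest |skewness|:", start)
```

Restricting the candidates to round numbers keeps the choice reusable for the next data set of the same kind, rather than tuning it to one sample.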
Do you have some other reason for looking at log of x?
Is log(x+1) okay for this x=0?
=======================Paul Velleman, 06 Nov 1995==========ssc
From: pfv2@cornell.edu (Paul Velleman)
Subject: Re: regression question
Message-ID:
> Dear Readers,
> I have several sets of regression data, some of which contain several
> pairs of zero values. Is there any rule that would help me determine whether
> a data set should be discarded because of the large number of zero values?
> This is an example of two data sets I have:
>
> Both sets of data compare the recovery of mosquito eggshells with two
> techniques. Each pair represents the same sample. DEC is taken as the independent
> variable. It is desired that the relationship be determined between the two
> methods, so as to reduce the amount of work to be carried out. I carried out
> a log(x+1) transformation (base 10). Would this have made any difference,
> compared to untransformed data? All regressions were significant whether the
> data was transformed or not. On a graph, the data 'looks better', compared
> to untransformed data (hardly a justification for transforming !), because
> the gap between the low values (clumped near the origin) and high values is
> less obvious. There is no extra time I can allocate to 'filling in the gaps'
> with more data.
By all means, transform these data. While Herman is technically correct
when he says:
>The usual assumptions for a regression are that the dependent variable is
>a linear combination of the independent variables and an "error" which is
>independent of the independent variables. This can be relaxed somewhat,
>but is that YOUR model? Nothing justifies transformation except the
>model.
he ignores the fact that the data have much information to help us decide
whether the simple regression model was reasonable. The data themselves can
justify a transformation by suggesting that the original regression model
must have been wrong. These data, for example, make it clear that a
linear model in the raw data is not reasonable. There is good evidence that
a log transformation stabilizes the variance and that the relationship is
linear and the error additive on the log scale. This is quite an ordinary
occurrence. This is much of what you mean by the transformed data "looking
better". In that sense you are wrong to suggest that this is not a
justification for transformation. You should always listen to your data.
Some data whisper; these are shouting at you for a log transformation.
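What "stabilizes the variance" means can be illustrated with a small simulation in Python. This is only a sketch on simulated numbers, not the mosquito-eggshell data (which are not reproduced here): it assumes a multiplicative error, and the function `simulate`, the two levels, the error size, and the seed are all made up for the example.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the simulation is repeatable

def simulate(level, n=200, sigma=0.3):
    """Simulated responses with multiplicative error around a given level."""
    return [level * math.exp(rng.gauss(0.0, sigma)) for _ in range(n)]

low = simulate(10.0)      # made-up low-count group
high = simulate(1000.0)   # made-up high-count group

# Ratio of spreads between the groups: far from 1 means unequal variance.
raw_ratio = statistics.stdev(high) / statistics.stdev(low)
log_ratio = (statistics.stdev([math.log10(v) for v in high])
             / statistics.stdev([math.log10(v) for v in low]))

print(f"SD ratio, raw scale: {raw_ratio:.1f}")
print(f"SD ratio, log scale: {log_ratio:.2f}")
```

When the error is multiplicative, the spread grows with the level on the raw scale but is roughly the same at every level after taking logs, which is the pattern a plot of such data tends to show.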
To answer your other questions, yes it does make a difference compared to
the untransformed data. It is an entirely different model. However, the
evidence in your data suggests strongly that it is a more appropriate model.
That should send you back to the theory to ask whether a model relating
the logs makes scientific sense. (The data are so strong that if it
doesn't, I would suggest that you rethink the science or redesign the
experiment.)
There is nothing wrong with adding a starting constant of 1 to the data to
save the zero point. You might consider whether the 1 has any meaning in
this context. Other starts are possible, but they make little difference to
the final analysis. The start merely weakens the effect of the log
transformation slightly.
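One rough way to see how a start weakens the log, sketched in Python. The measure used (the share of the transformed range taken up by the low values), the function name `low_end_share`, and the values 0, 10 and 1000 are assumptions chosen for illustration only.

```python
import math

def low_end_share(start, low=10.0, high=1000.0):
    """Share of the 0..high span, after log10(x + start), taken up by 0..low."""
    def f(x):
        return math.log10(x + start)
    return (f(low) - f(0.0)) / (f(high) - f(0.0))

raw_share = 10.0 / 1000.0   # same share on the untransformed scale
shares = {start: low_end_share(start) for start in (0.1, 1, 10)}

for start, share in shares.items():
    print(f"start={start}: low values occupy {share:.2f} of the range")
print(f"untransformed: low values occupy {raw_share:.2f} of the range")
```

The larger the start, the closer the share moves toward its untransformed value, so the low values are spread out less than a pure log would spread them. How much that matters for a fitted regression depends on the data.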
<>
(I don't know if any of these questions is
scientifically interesting because I know nothing about the data. However,
they only arise if you transform the data. This is another argument for
transforming; it helps you to see what may be going on in your data and
thereby raises new questions that may prove interesting.)
Hope this helps.
Paul Velleman
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Document by Rich Ulrich. E-mail to wpilib+@pitt.edu
http://www.pitt.edu/~wpilib/stats99.html