Log-Linear Models


With two lists an estimate of population size N is available only by making an assumption about R, usually that the population is homogeneous and the lists are independent. If N were known we could test these assumptions by a χ² test - either the Pearson X² or the log likelihood-ratio test, sometimes denoted G² - for the 2 x 2 table of observed counts. With unknown N these assumptions are untestable from the observed n11, n10, n01. We have to have faith, not internal evidence, that E(n11) = NpApB.

Taking logs, this model can be written as log E(n11) = log(N) + log(pA) + log(pB). All the information from the 3 observations is used up in estimating the 3 unknown parameters N, pA, pB, and no information remains to test the assumptions. This is a model for the logarithms of our observations which is linear in a set of parameters: a log-linear model. If interaction exists between the lists then this will change the model, a change which can be incorporated by saying that log E(n11) is now log(N) + log(pA) + log(pB) + log(iAB) for some new interaction parameter iAB.

This kind of loglinear model has become the standard form of analysis for contingency tables and was proposed and developed for capture-recapture by Fienberg (1). The above description has been somewhat disingenuous in not giving details of the models for the other observations. Following strictly in the above form, log E(n10) would be log(N) + log(pA) + log(1-pB) + an interaction term, and it appears that the parameters here are not the same as, or linear functions of, the parameters used to model log E(n11), since log(1-pB), for example, is not a linear function of log(pB). This is readily remedied: there are two alternative ways of formulating a genuinely additive model with 4 parameters to describe the 4 observations in a 2 x 2 table. The first (Fienberg) is that proposed for the analysis of a complete contingency table:

      log E(n11) = u + uA + uB + uAB
      log E(n01) = u - uA + uB - uAB
      log E(n10) = u + uA - uB - uAB
      log E(n00) = u - uA - uB + uAB.

This has a symmetry between those that are known because they appear in a list and those that are unknown and to be estimated. It has the advantage that the terms uA, uB, etc represent marginal logits of the probabilities of appearing in the various lists. The alternative parameterisation (2) is tailored to the asymmetry of the capture-recapture problem by introducing the interaction parameter only in the modelling of the missing category:

      log E(n11) = u
      log E(n01) = u + uA
      log E(n10) = u + uB
      log E(n00) = u + uA + uB + uAB.

When n00 is unobserved, the second parameterisation contains only 3 parameters u, uA, uB to describe the 3 observations, and no constraint has to be imposed on these three parameters. From the three equations we can estimate u, uA and uB. Unfortunately we can do nothing with the fourth equation as it has an unknown on both sides. However, if we can apply another constraint such as uAB = 0, we can then estimate n00 from this equation. Here the interaction term uAB is the logarithm of the odds ratio [E(n00)E(n11)]/[E(n01)E(n10)] = (p00 p11)/(p01 p10). This ratio is discussed, for example, by Agresti (3, p.131) in the context of the complete contingency table. We note that when uAB is zero, u, uA and uB are respectively the logarithms of NpApB and of the odds (1-pA)/pA and (1-pB)/pB. We shall use this second formulation in all that follows.
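As a minimal sketch of the two-list case, assuming the constraint uAB = 0 and using the observed counts in place of their expectations, the three equations can be solved directly and the fourth then gives the missing cell. The function name and the example counts are illustrative only and do not come from any study.

```python
import math

def two_list_estimate(n11, n10, n01):
    """Solve the three observed-cell equations, then predict n00 with uAB = 0."""
    u = math.log(n11)              # log E(n11) = u
    u_a = math.log(n01) - u        # log E(n01) = u + uA
    u_b = math.log(n10) - u        # log E(n10) = u + uB
    n00 = math.exp(u + u_a + u_b)  # log E(n00) = u + uA + uB when uAB = 0
    total = n11 + n10 + n01 + n00
    return {"u": u, "uA": u_a, "uB": u_b, "n00": n00, "N": total}

# Hypothetical counts, chosen only for illustration.
result = two_list_estimate(n11=20, n10=30, n01=40)
```

Under these assumptions the predicted n00 equals n01 n10 / n11, so the estimate of N reduces to the product of the two list totals divided by n11.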

The model's extension to more than two lists is clear. With three lists there are 8 possible combinations of lists in which individuals do or do not appear - the combination of all non-appearance is, in capture-recapture studies, unobserved. There are in the general model 8 parameters, the common u - the logarithm of the number expected to be in all lists, 3 'main effects' uA, uB, uC - log odds against appearing in each list for individuals who appear in the others, 3 'two-factor interactions' uAB, uAC, uBC - log odds ratios between pairs of lists for individuals who appear in the other, and a 3-factor interaction uABC. The formulation makes explicit the total necessity of a firm assumption, untestable from the data of the study, about the value of uABC before any estimate can be given for n000 and hence N. Assumptions can be made about the other parameters, e.g. that uAB= 0 (i.e. lists A and B are independent: the probability that an individual appears on list A is unaffected by knowledge of whether or not he appears on list B), or that uAB = uAC = uBC. Such assumptions can be tested from the data, although tests may not be very powerful for smallish samples. Since the complete model, with all parameters separately present (including uABC), is merely a way of re-expressing the data, any model proposed in any form must be a special case of this model. If relevant, the second parametrization allows birth, death and migration between successive samples to be tested for and modelled within the same framework.
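The extension to several lists can be sketched as follows, assuming the second parameterisation generalises so that the log expected count of a cell is the sum of the u-terms for every subset of the lists from which the individual is absent. Each u-term is then an alternating sum of log counts, and setting the highest-order interaction to zero gives the missing cell. The function names and counts are hypothetical, observed counts stand in for expectations, and all counts are assumed positive.

```python
import math
from itertools import combinations

def subsets(items):
    """Yield every subset of items as a frozenset, smallest first."""
    items = sorted(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def u_parameters(counts, lists):
    """counts maps the set of lists an individual is ABSENT from to a count.

    Returns u_T for every proper subset T of lists; the entry for the
    empty set is the common term u.
    """
    full = frozenset(lists)
    u = {}
    for t in subsets(lists):
        if t == full:
            continue  # the highest-order interaction cannot be estimated
        u[t] = sum((-1) ** (len(t) - len(s)) * math.log(counts[s])
                   for s in subsets(t))
    return u

def missing_cell(counts, lists):
    """Predict the unobserved cell, ASSUMING the top interaction is zero."""
    return math.exp(sum(u_parameters(counts, lists).values()))

# Hypothetical three-list counts, keyed by the lists NOT containing the individual.
counts = {
    frozenset(): 40,  # on all of A, B and C
    frozenset("A"): 20, frozenset("B"): 30, frozenset("C"): 10,
    frozenset("AB"): 15, frozenset("AC"): 12, frozenset("BC"): 25,
}
u = u_parameters(counts, "ABC")
n000 = missing_cell(counts, "ABC")
```

With three lists and uABC = 0 this reduces to the product of the all-lists count and the three one-list-only counts, divided by the product of the three counts for individuals on exactly two lists.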

Programs for the analysis of such loglinear models exist in most large statistical packages, such as S+, GLIM and SAS: it should be noted that packages may not offer a choice between the two parametrizations. Estimates of the u-parameters are obtained by maximum-likelihood (ML) from an assumed multinomial model of the counts, from which the ML estimate of N can be obtained - for some, but not all, models these estimates are simple explicit functions of the counts (4). If a simpler model is fitted, its goodness-of-fit is measured either by the Pearson X² or by the deviance G². If a model truly represents the data, then the difference in Pearson X² or difference in deviance between it and a more complex model of which it is a special case has approximately a χ² distribution with degrees of freedom given by the difference in number of parameters between the models. For the difference between models the deviance is to be preferred both because the approximation to χ² is better and because it strictly adds up when a succession of models is considered, which X² does not.
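For a model without explicit estimates, the fit has to be found iteratively. The sketch below is one possible approach, not the method of any particular package: it fits the mutual-independence model (parameters u, uA, uB, uC only) to the 7 observed cells of a three-list table by iterative proportional fitting, then computes X² and G². The function names are hypothetical, and the counts are generated exactly from an assumed N = 1000 with pA = 0.5, pB = 0.4, pC = 0.2, so the fit should reproduce them.

```python
import math
from itertools import product

def fit_independence(observed, max_iter=10000, tol=1e-12):
    """Fit mutual independence to an incomplete table of 0/1 cells by
    iterative proportional fitting over the observed cells only."""
    k = len(next(iter(observed)))
    fitted = {cell: 1.0 for cell in observed}
    for _ in range(max_iter):
        biggest_change = 0.0
        for i in range(k):
            for level in (0, 1):
                cells = [c for c in observed if c[i] == level]
                target = sum(observed[c] for c in cells)
                current = sum(fitted[c] for c in cells)
                for c in cells:
                    new = fitted[c] * target / current
                    biggest_change = max(biggest_change, abs(new - fitted[c]))
                    fitted[c] = new
        if biggest_change < tol:
            break
    return fitted

def pearson_x2(observed, fitted):
    return sum((observed[c] - fitted[c]) ** 2 / fitted[c] for c in observed)

def deviance_g2(observed, fitted):
    return 2.0 * sum(n * math.log(n / fitted[c])
                     for c, n in observed.items() if n > 0)

# Hypothetical data: cells are (A, B, C) with 1 = on the list; (0, 0, 0) is unobserved.
probs = (0.5, 0.4, 0.2)
observed = {}
for cell in product((1, 0), repeat=3):
    if cell == (0, 0, 0):
        continue
    expected = 1000.0
    for present, p in zip(cell, probs):
        expected *= p if present else 1.0 - p
    observed[cell] = expected

m = fit_independence(observed)
# Under independence, log E(n000) = u + uA + uB + uC in the second parameterisation.
n000 = m[(0, 1, 1)] * m[(1, 0, 1)] * m[(1, 1, 0)] / m[(1, 1, 1)] ** 2
x2 = pearson_x2(observed, m)
g2 = deviance_g2(observed, m)
df = len(observed) - 4  # 7 observed cells minus the parameters u, uA, uB, uC
```

Because the example data satisfy the model exactly, both statistics are essentially zero; with real counts they would be referred to a χ² distribution on df degrees of freedom.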

The structure of these models corresponds exactly to that of the analysis of variance of a factorial experiment, with lists being factors at two levels - seen or not seen, interactions to be investigated, and the possibility of covariates. Residual sums of squares from any model in analysis of variance are replaced by deviances, the only change being that there is no unknown variance σ² to be estimated, so that tests are based directly on the χ² distribution for deviances rather than the F-distribution for ratios of mean squares, used in the analysis of variance. An appropriate set of interactions to be included in the model is then selected by the standard techniques of multiple regression. With 3 or possibly with 4 lists all models can be analyzed. With more, some form of stepwise regression can be used. Formal tests between nested models should be supplemented by residual plots and, above all, by substantive knowledge of the type of interactions which may be expected from the nature of the lists being analyzed. The final estimate should recognize that we do not know that the selected model is really true: it is over-optimistic to believe that statistical model-selection leads to one model which is the only one worth considering. What the model selection procedure may do is to show fairly conclusively that certain simple models are not true. Although there may be some statistic which asserts that one of the acceptable models is marginally better than the others, we should be willing to quote not just one model and its estimate, with a narrow theoretical confidence interval based on the absolute truth of that model, but a range of acceptable models with their estimates providing a range of values in which we may more realistically believe. Formal interval estimates can be obtained as a likelihood interval (5,6) or by a bootstrap procedure (7), which can allow for model uncertainty, but not reliably by taking the estimated number as Normally distributed.
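A formal test between two nested models can be sketched for the simplest case, in which the models differ by a single parameter, so that the deviance difference is referred to a χ² distribution with 1 degree of freedom. That tail probability has a closed form in terms of the complementary error function. The function names and the two deviances are hypothetical, chosen only for illustration.

```python
import math

def chi2_sf_1df(x):
    """Upper tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def compare_nested(dev_simple, dev_complex):
    """Deviance difference between two nested models differing by ONE parameter."""
    diff = max(dev_simple - dev_complex, 0.0)  # guard against rounding below zero
    return diff, chi2_sf_1df(diff)

# Hypothetical deviances, for illustration only.
diff, p_value = compare_nested(dev_simple=9.30, dev_complex=5.46)
```

A small tail probability argues against the simpler model, while a large one says only that the data do not rule it out, which is the reason for quoting a range of acceptable models.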

What then are the interactions which are likely to arise? In the context of a closed population repeatedly sampled over time according to the same protocol, Cormack (8) said that "this failure (of the standard independence assumptions) may be due to either or both of two causes: (a) The probability that a particular individual is caught in any sample is a property of the individual, this 'catchability' having some distribution over the population. (b) The probability that an individual is caught in any sample depends upon its previous history of capture." This distinction has become formalized in the suffixes h and b for the models M defined by Otis et al (9). However, for lists obtained under different protocols, and for human behavior, we should expect both heterogeneity and behavioral dependence, in a form more general than conceived by Otis et al. In particular their behavioral models, in which occurrence in one list affects all subsequent lists equally, seem unsuited to the present problem.

Direct dependence between lists has to be considered. By being on one agency's list an individual may learn of, or be approached by, another agency - this process being specific to the particular lists. Dependence between A and B will result in a non-zero value for uAB in the loglinear model. More complicated dependence may result in higher-order interaction terms uABC also becoming non-zero. Indirect dependence between lists will be caused by differences of character or behavior between individuals. If some lists refer to help-giving agencies, then individuals actively seeking help will have a greater tendency to appear on such lists than individuals unaware of, or trying to hide, their status. If all lists are of the same kind, heterogeneity can be modelled as a constant apparent dependence between all lists. This is the H-term used, with a purely empirical motivation, by Cormack (10) to test for heterogeneity, and given more formal justification by Darroch et al (11) and Agresti (12). Then the distinction between direct dependence and heterogeneity is similar to that between true and apparent contagion by which a negative binomial can be formed from a Poisson process (13).

However, all lists may not be of the same type. Plausible behavior patterns should be proposed by the experimenter before analysis, but the loglinear formulation allows the observed data to guide the choice: the analyst can follow up any pattern of observed interactions. To reduce heterogeneity, it is desirable that the population be stratified by any known factor thought likely to influence probability of being listed (14). A joint analysis can discover any parameter with constant values over different strata.

With K lists there are (2^K - 1) observations. The most general loglinear model has (2^K - 1) parameters u. Every feature of the data is completely represented somewhere in the set of parameters - nothing is lost. Conversely, if one u parameter is shown to become non-zero under two different types of relationships between the lists, then the data by themselves cannot distinguish between the two relationships. For example, if the lists are compiled sequentially over time, then migrants into the population between the first two lists A, B cause a nonzero value for uAB (2,15). Thus immigration and direct dependence between these lists cannot be distinguished. However, immigration between lists B and C causes nonzero uABC, so that the two mechanisms are here separable. This formulation makes explicit the limits on the amount of information which can be extracted from K lists. The advantages of the approach are:

  1. ALL models are expressed, and can be compared, within a single framework of analysis;
  2. flexibility for the analyst of model selection, based on data, guided by prior belief;
  3. availability of formal tests between models and informal inspection of residuals;
  4. understanding of the way in which information is used by models;
  5. all inference is within the mainstream of statistical data analysis.

However large K is, SOME untestable assumption has to be made about one of the u-parameters. However, with large K, the assumption that the highest-order interaction is zero is more likely to be approximately correct.
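The bookkeeping behind this limit can be sketched by enumerating the observed cells and the u-terms for K lists. The code assumes the second parameterisation extended to K lists, in which a u-term enters a cell's equation when the individual is absent from every list that the term names. The function name is hypothetical, and lists are indexed 0, 1, ..., K-1 instead of A, B, C.

```python
from itertools import combinations, product

def design_matrix(k):
    """Rows: observed cells (1 = on the list). Columns: u-terms, top one omitted."""
    # Every presence/absence pattern except absence from all lists.
    cells = [c for c in product((1, 0), repeat=k) if any(c)]
    # Every subset of lists except the full set (the untestable top interaction).
    terms = [t for r in range(k) for t in combinations(range(k), r)]
    rows = [[int(all(cell[i] == 0 for i in t)) for t in terms] for cell in cells]
    return cells, terms, rows

cells, terms, rows = design_matrix(3)
```

For K = 3 this gives 7 rows and 7 columns: once the highest-order interaction is fixed by assumption, the remaining parameters exactly re-express the observed counts, and any testable model is obtained by dropping or constraining columns.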



References

1. Fienberg SE. The multiple recapture census for closed populations and incomplete 2^k contingency tables. Biometrika 1972;59:591-603.

2. Cormack RM. Loglinear models for capture-recapture experiments on open populations. In The Mathematical Theory of the Dynamics of Biological Populations II, R.W. Hiorns and D. Cooke eds. London, Academic Press, 1981.

3. Agresti A. Categorical Data Analysis. New York, John Wiley & Sons, 1991.

4. Bishop YMM, Fienberg SE and Holland PW. Discrete Multivariate Analysis. Cambridge, MA: MIT Press, 1975.

5. Cormack RM. Interval estimation for mark-recapture studies of closed populations. Biometrics 1992;48:567-576.

6. Regal RR and Hook EB. Goodness-of-fit based confidence intervals for estimates of the size of a closed population. Stat. Med. 1984;3:287-291.

7. Buckland ST and Garthwaite PH. Quantifying precision of mark-recapture estimates using the bootstrap and related methods. Biometrics 1991;47:255-268.

8. Cormack RM. A test for equal catchability. Biometrics 1966;22:330-342.

9. Otis DL, Burnham KP, White GC and Anderson DR. Statistical inference from capture data on closed animal populations. Wildlife Monographs 1978;62.

10. Cormack RM. The flexibility of GLIM analyses of multiple recapture or resighting data. In Marked Individuals in the Study of Bird Population, J.-D. Lebreton and P.M. North, eds. Basel, Birkhäuser Verlag, 1993.

11. Darroch JN, Fienberg SE, Glonek GFG and Junker BW. A three-sample multiple-recapture approach to census population estimation with heterogeneous catchability. J Amer Stat Assoc 1993;88:1137-1148.

12. Agresti A. Simple capture-recapture models permitting unequal catchability and variable sampling effort. Biometrics, in press, 1994.

13. Johnson NL, Kotz S and Kemp AW. Univariate Discrete Distributions, 2nd ed. New York, John Wiley & Sons, 1993.

14. Sekar C and Deming WE. On a method of estimating birth and death rates and extent of registration. Journal of the American Statistical Association 1949;44:101-115.

15. Cormack RM. Log-linear models for capture-recapture. Biometrics 1989;45:395-413.