## Two-List Model

To demonstrate some of the difficulties in using lists, we look at the two-list model with lists A and B, where N denotes the unknown size of the population. Let n11, n10, n01 and n00 be the numbers of individuals on both lists, on just the first list, on just the second list, and on neither list, respectively. Assuming, for the moment, that there is no heterogeneity but possibly list dependence, these numbers have a multinomial distribution with probabilities pij (i = 1, 0; j = 1, 0) of being in the respective categories. Let r = n11 + n10 + n01 be the number of different individuals listed, so that n00 = N - r. Now nA = n11 + n10 is the number of individuals on the first list and nB = n11 + n01 is the number of individuals on the second list. Since "catching" people on lists is a random process, nA and nB are random variables. Then, denoting the expected value of a random variable by E[.], we have

`        E[nA] = NpA,`
`        E[nB] = NpB`

and

`        E[n11] = NpAB,`

where pA (= p11 + p10) is the probability of being on the first list, pB (= p11 + p01) is the probability of being on the second list and pAB (= p11) is the probability of being on both lists. If we assume that list A people occupy the same proportion of the population as they do of list B (which follows from assumption (iii)), then nA / N = n11 / nB and we are led to the well-known Petersen estimate

`      N^    = nAnB/n11,`

or

`      N^ = size of first list × adjustment factor,`

where the adjustment factor is the inverse of the estimated level of ascertainment by the first list. Another way of writing this which is useful in the log-linear formulation described later is N^ = r + n^00, where n^00 = n01n10/n11.
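The two forms of the Petersen estimate given above can be checked against each other numerically. The following is a minimal sketch; the function names and the cell counts are hypothetical and chosen only for illustration.

```python
def petersen_estimate(n11, n10, n01):
    """Petersen estimate N^ = nA * nB / n11 from the three observed cells."""
    if n11 <= 0:
        raise ValueError("n11 must be positive; the estimate is undefined without overlap")
    n_a = n11 + n10  # size of the first list
    n_b = n11 + n01  # size of the second list
    return n_a * n_b / n11


def petersen_via_missing_cell(n11, n10, n01):
    """Same estimate written as r + n^00, with n^00 = n01 * n10 / n11."""
    if n11 <= 0:
        raise ValueError("n11 must be positive; the estimate is undefined without overlap")
    r = n11 + n10 + n01          # number of different individuals listed
    n00_hat = n01 * n10 / n11    # estimated number on neither list
    return r + n00_hat


# Hypothetical counts, not taken from any study.
n11, n10, n01 = 60, 140, 90
est = petersen_estimate(n11, n10, n01)
alt = petersen_via_missing_cell(n11, n10, n01)

# Adjustment factor: inverse of the estimated ascertainment level n11 / nB.
adjustment = (n11 + n01) / n11
print(est, alt, (n11 + n10) * adjustment)
```

With these counts nA = 200 and nB = 150, so all three printed values agree at 500.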

Since we can expect good estimates to give us the right answer when N is very large, we look at what happens asymptotically (i.e. as N tends to infinity). Then, to a first approximation, it can be shown under certain conditions that the expectation of a function of random variables is asymptotically equal to the function of the expectations. Hence, for large N, we have approximately,

`        E[N^] = E[nA]E[nB]/E[n11]`
`              = NpApB/pAB`
`              = NR, say.`
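As a numerical illustration of the ratio R = pApB/pAB, the sketch below evaluates it for three sets of cell probabilities that share the same marginals (pA = 0.4, pB = 0.3) but differ in p11. The probabilities and the function name are assumptions made up for this example.

```python
def bias_ratio(p11, p10, p01):
    """R = pA * pB / pAB, so that E[N^] is approximately N * R for large N."""
    p_a = p11 + p10  # probability of being on the first list
    p_b = p11 + p01  # probability of being on the second list
    return p_a * p_b / p11


# All three cases have pA = 0.4 and pB = 0.3; only the overlap p11 changes.
r_indep = bias_ratio(0.12, 0.28, 0.18)  # p11 = pA * pB
r_pos = bias_ratio(0.20, 0.20, 0.10)    # p11 > pA * pB
r_neg = bias_ratio(0.06, 0.34, 0.24)    # p11 < pA * pB

print(round(r_indep, 3), round(r_pos, 3), round(r_neg, 3))
```

R equals 1 only in the first case; a larger overlap than pApB pushes R below 1 and a smaller overlap pushes it above 1.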

If the lists are independent so that pAB = pApB (or equivalently p11p00/(p10p01) = 1), then R = 1 and N^ is approximately unbiased. If the lists are not independent, then we can write

`        R = pApB/pAB = pB/pB|A,`

where pB|A is the probability of an individual being on list B given that the individual is on list A. If being on list A tends to increase the probability of being on list B, then pB|A > pB, R < 1 and N^ will underestimate N. However, there is another effect which acts in much the same way and therefore could be called "apparent dependence". This happens when there is heterogeneity in the population, so that the probability of being on a given list varies from individual to individual. This can be handled using a method similar to that of Seber (1, p.86). We assume that we can associate with each of the N members of the population a random triple (pA, pB, pAB), where pA, pB and pAB are now random variables. Then under fairly general conditions (for example, each of pA, pB and pAB bounded away from 0 and 1; see Alho (2, p.625)), we find that

`     R = E[pA]E[pB]/E[pAB].`

To examine what happens to R it is helpful to assume list independence so that E[pAB] = E[pApB]. Then

`     R = 1 - ( cov[pA, pB] / E[pApB] ),`

where cov[pA, pB] = E[pApB] - E[pA] E[pB]. Then R < 1 if the covariance is positive. This will be the case if pA and pB tend to be high together and low together, a common situation in epidemiology. Here list independence means that the value of pB for an individual does not depend on whether the individual is on list A or not. However, with heterogeneity, pB depends on pA, a subtle difference! Thus both positive dependence between samples and positive apparent dependence will bias the estimate downwards. These two effects cannot be distinguished from the data. Similarly negative dependence will bias the estimate upwards.
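The effect of heterogeneity on R can be seen in a small worked example. The sketch below assumes a hypothetical population made up of two equally sized classes, with the lists independent for every individual; the class probabilities are invented for illustration only.

```python
# Hypothetical two-class population: (population share, pA, pB) per class.
# Within each class the two lists are independent, so pAB = pA * pB.
classes = [
    (0.5, 0.8, 0.6),  # individuals who are easy to list
    (0.5, 0.2, 0.2),  # individuals who are hard to list
]


def heterogeneity_ratio(classes):
    """Return R computed two ways, plus cov[pA, pB]."""
    e_pa = sum(w * pa for w, pa, _ in classes)
    e_pb = sum(w * pb for w, _, pb in classes)
    e_papb = sum(w * pa * pb for w, pa, pb in classes)
    cov = e_papb - e_pa * e_pb
    r_direct = e_pa * e_pb / e_papb   # R = E[pA]E[pB]/E[pApB]
    r_from_cov = 1 - cov / e_papb     # R = 1 - cov[pA, pB]/E[pApB]
    return r_direct, r_from_cov, cov


r_direct, r_from_cov, cov = heterogeneity_ratio(classes)
print(round(r_direct, 4), round(r_from_cov, 4), round(cov, 4))
```

Here pA and pB are high together and low together, so the covariance is positive and R falls below 1 even though no individual's listing on A affects that individual's listing on B.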

The effect of heterogeneity which produces the apparent dependence can sometimes be reduced by stratification. One estimates N for each stratum and then adds the estimates together to get an estimate of total population size. An example of this is given by Doscher and Woodward (3). This problem of pooling strata is discussed theoretically by Sekar and Deming (4), Seber (1, pp.101-3) and Regal and Hook (5). Another approach is to try to model the heterogeneity in some way. For example, Alho (2) and Alho et al. (6) assume independence for the individuals and model the heterogeneity using logistic regression. Alternatively, since the adjustment factor (see above) and its precision will vary from stratum to stratum, one method suggested for the US census is to smooth these factors using a regression model which relates the adjustment factor to other variables (7).
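The difference between pooling and stratifying can be sketched as follows. The counts are hypothetical: they are the expected cell counts for two strata of 500 individuals each, with (pA, pB) equal to (0.8, 0.6) in one stratum and (0.2, 0.2) in the other, and with independent lists within each stratum. Real counts would also show sampling variation.

```python
def petersen(n11, n10, n01):
    """Petersen estimate nA * nB / n11 from the three observed cells."""
    return (n11 + n10) * (n11 + n01) / n11


# Hypothetical (n11, n10, n01) per stratum; true total N is 1000 by construction.
strata = {
    "high ascertainment": (240, 160, 60),
    "low ascertainment": (20, 80, 80),
}

# Stratified estimate: estimate N within each stratum, then add.
stratified = sum(petersen(*cells) for cells in strata.values())

# Pooled estimate: add the cells first, then apply the estimate once.
pooled_cells = tuple(sum(cells[i] for cells in strata.values()) for i in range(3))
pooled = petersen(*pooled_cells)

print(stratified, round(pooled, 1))
```

In this constructed case the stratified estimate recovers 1000, while the pooled estimate is about 769, because pooling mixes individuals with high and low listing probabilities.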

When there are more than two lists, Wittes et al. (8) suggest comparing the lists two at a time. If there is no dependence among the lists and no heterogeneity, then the "pairwise" estimates should be fairly similar. However, if one of the estimates is, say, much lower than the rest, then one might suspect positive dependence between the two lists. They also suggest pooling lists which appear to show high positive dependence. However, given the possible presence of heterogeneity as well, such methods are somewhat ad hoc and ignore "overlap" information which is lost with merging. They may indicate when the assumptions break down but they cannot determine the source of the problem.
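A pairwise comparison of this kind could be sketched as below. The three lists are hypothetical sets of made-up individual identifiers, and the helper name is an assumption; this only illustrates the idea of comparing lists two at a time and is not the procedure of Wittes et al. in detail.

```python
from itertools import combinations

# Hypothetical lists of individual identifiers (integers stand in for people).
lists = {
    "A": set(range(0, 40)),
    "B": set(range(20, 70)),
    "C": set(range(30, 60)) | set(range(80, 90)),
}


def pairwise_petersen(lists):
    """Petersen estimate for every pair of lists, keyed by the pair of names."""
    estimates = {}
    for (name1, list1), (name2, list2) in combinations(lists.items(), 2):
        overlap = len(list1 & list2)
        if overlap == 0:
            estimates[(name1, name2)] = float("inf")  # no overlap: estimate undefined
        else:
            estimates[(name1, name2)] = len(list1) * len(list2) / overlap
    return estimates


estimates = pairwise_petersen(lists)
for pair, value in estimates.items():
    print(pair, round(value, 1))
```

Widely differing pairwise values, as here, would suggest that some assumption fails, but they do not show whether dependence or heterogeneity is responsible.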

In conclusion it is clear that the two-list method should be avoided in epidemiology. Typically dependence and heterogeneity will be present. The same problem is multiplied when we have more than two lists. Clearly a different approach is needed. The key to the problem is the relationship between pA, pB and pAB or, at a more fundamental level, between the pij's. One approach, called log-linear modelling, sets up relationships for the log(pij).

### References

1. Seber GAF. The estimation of animal abundance and related parameters, 2nd ed. London: Charles Griffin & Co., 1982.

2. Alho JM. Logistic regression in capture-recapture models. Biometrics 1990;46:623-635.

3. Doscher ML and Woodward JA. Estimating the size of subpopulations of heroin users: Applications of log-linear models to capture-recapture sampling. International J Addictions 1983;18:167-182.

4. Sekar CC and Deming WE. On a method of estimating birth and death rates and extent of registration. J Amer Stat Assoc 1949;44:101-115.

5. Regal RR and Hook EB. Heterogeneity of more kinds than we have considered. Amer J Epid 1993;137:1148-1166.

6. Alho JM, Mulry MH, Wurdeman K and Kim J. Estimating heterogeneity in the probabilities of enumeration for the dual-system estimation. J Amer Stat Assoc 1993;88:1130-1136.

7. Hogan H. The 1990 post-enumeration survey: operations and results. J Amer Stat Assoc 1993;88:1047-1060.

8. Wittes JT, Colton T and Sidel VW. Capture-recapture models for assessing the completeness of case ascertainment using multiple information sources. J Chronic Dis 1974;27:25-36.