CR History

Capture-recapture methods have a long history. They were first applied to the study of fish and wildlife populations before being adapted for other purposes. Their application to epidemiologic problems came relatively late in this history and has therefore been able to draw on advances in the other areas as well as in statistical methods more broadly.

The simplest capture-recapture model is the so-called two-sample model, used solely to estimate the unknown size of a population. The individuals in the first sample are marked or tagged and returned to the population, while the second sample provides the recaptures. Using the number of individuals caught in both samples (the recaptures) and the numbers caught in just one sample, it is possible to estimate the number not caught in either sample, thus providing an estimate of the total population size. The assumptions required for this estimate to be valid can be spelt out in a number of ways. However, the key ingredients are:

(i) There is no change to the population during the investigation (the population is closed).
(ii) There is no loss of tags (individuals can be matched from capture to recapture).
(iii) For each sample, each individual has the same chance of being in the sample.
(iv) The two samples are independent.

Assumption (iv) really follows from (iii), since the latter implies that marked and unmarked individuals have the same probability of being caught in the second sample, so that capture in the first sample does not affect capture in the second; that is, the samples are independent. However, it is convenient to list (iv) separately.
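The arithmetic behind the two-sample estimate can be sketched as follows. Under assumptions (i) to (iv), the proportion of marked individuals in the second sample should be close to the proportion marked in the whole population, which gives the usual two-sample (Petersen) estimate of the population size: the product of the two sample sizes divided by the number of recaptures. The function name, argument names, and counts below are hypothetical and chosen only for illustration.

```python
def petersen_estimate(both, first_only, second_only):
    """Two-sample (Petersen) estimate of population size.

    both        -- number caught in both samples (the recaptures)
    first_only  -- number caught in the first sample only
    second_only -- number caught in the second sample only
    """
    if both <= 0:
        raise ValueError("at least one recapture is needed")
    n1 = both + first_only    # size of the first sample
    n2 = both + second_only   # size of the second sample
    total = n1 * n2 / both    # estimated population size
    # Equivalent view: estimated number caught in neither sample.
    missed = first_only * second_only / both
    return total, missed


# Hypothetical counts, for illustration only.
total, missed = petersen_estimate(both=20, first_only=80, second_only=30)
print(total, missed)
```

The estimated total always equals the number of distinct individuals observed plus the estimated number missed, so the two returned values are two views of the same calculation.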

In ecology, the method is generally called the Petersen method because of Petersen's work in 1894 associated with tagged fish, though its first use in fisheries was by Dahl in 1917. It was also used by Lincoln in 1930 to estimate the size of a duck population [Le Cren (1)]. In 1949, Sekar and Deming (2) used the method to estimate birth and death rates and the extent of registration. Their paper may be regarded as the first serious application of the capture-recapture method to human health, and it contains a good discussion of some of the practical problems associated with the method. Using a similar approach, Shapiro (3) applied the technique to birth registration in the USA using census data. There is also a substantial literature going back to the 1940s (Tracey (4)), under the title of dual record systems or dual-system estimators, dealing with the application of the two-sample method to census data. By taking another sample in addition to the census, the capture-recapture method can be used to estimate the undercount of the census. The method and the controversy currently surrounding its application to the US census are described by Hogan (5). A helpful bibliography of the literature relating to this problem is given by Fienberg (6).

The above method can, in principle, be applied to any situation where there are two incomplete lists. One simply replaces "being caught in sample i" by "being on list i". This is the case in epidemiology, where lists can be constructed from a variety of sources such as hospital records, doctors' medical files, medical prescriptions and so on. By their very nature these lists are incomplete, and the problem is to estimate the number of individuals missing from both lists. Despite the early work described above, such applications to epidemiology came later, with Wittes and her colleagues (7,8) pointing out the connections.

With regard to applying the assumptions to epidemiology, the experiment can generally be set up so that (i) is at least approximately true. For assumption (ii), matching will depend on the quality of the patients' records and the uniqueness of the patients' code names. In some parts of the world matching is a real problem. Unfortunately, assumption (iii), that each individual has the same probability of being on a given list, is generally false; that is, patients tend to be heterogeneous with regard to being "caught" on a list. Some methods for minimizing heterogeneity are described later. However, even if something could be done about this, assumption (iv) is invariably false. For example, if certain doctors refer their patients to certain hospitals, then hospital admissions and doctors' records will not give two independent lists. This question of dependence is discussed in detail by Sekar and Deming (2) and Wolter (9). One can think of decomposing assumption (iii) into two parts: dependence and heterogeneity of capture probabilities. For human populations, the latter component has been considered only recently (10,11), although those working in ecology and other areas had done so earlier.
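The effect of list dependence on the two-sample estimate can be illustrated with a small calculation. The sketch below works with expected counts rather than simulated data, and all of the probabilities are hypothetical. It compares two situations with the same expected list sizes: one in which being on the second list is unrelated to being on the first, and one in which individuals on the first list are more likely to appear on the second (as when doctors refer patients to particular hospitals).

```python
def expected_petersen(N, p1, p2_if_on_list1, p2_if_off_list1):
    """Two-sample estimate computed from expected cell counts.

    N               -- true population size (known here only because
                       this is an illustration)
    p1              -- probability of being on list 1
    p2_if_on_list1  -- probability of being on list 2, given on list 1
    p2_if_off_list1 -- probability of being on list 2, given not on list 1
    """
    both = N * p1 * p2_if_on_list1
    first_only = N * p1 * (1 - p2_if_on_list1)
    second_only = N * (1 - p1) * p2_if_off_list1
    n1 = both + first_only
    n2 = both + second_only
    return n1 * n2 / both


# Hypothetical values: both cases give expected list sizes of 500 and 400.
independent = expected_petersen(1000, 0.5, 0.4, 0.4)
dependent = expected_petersen(1000, 0.5, 0.6, 0.2)
print(round(independent), round(dependent))
```

With these assumed values, the independent case recovers the true size of 1000, while the positively dependent case gives an estimate of about 667. Positive dependence inflates the overlap between the lists and so leads to an underestimate; negative dependence would work in the opposite direction.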

In animal population studies, the two-sample method was extended to the K-sample method. By taking more than two samples, one can utilize the information from the multiple recaptures. The unmarked animals in each sample are now marked before being returned to the population. If individual (e.g. numbered) marks are used, then the capture history of each marked individual is known.
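A capture history can be represented as a sequence of 0/1 indicators, one per sample. The following minimal sketch, using hypothetical identifiers and samples, shows how individually marked records from K samples can be turned into capture histories and tabulated.

```python
from collections import Counter


def capture_histories(samples):
    """Map each observed individual to a tuple of 0/1 flags, one per sample.

    samples -- list of K sets, each holding the identifiers (e.g. tag
               numbers) of the individuals caught in that sample
    """
    seen = set().union(*samples)
    return {ind: tuple(int(ind in s) for s in samples) for ind in seen}


# Hypothetical data: three samples of individually marked animals.
samples = [{"A", "B", "C"}, {"B", "C", "D"}, {"C", "E"}]
histories = capture_histories(samples)
counts = Counter(histories.values())

for history, count in sorted(counts.items(), reverse=True):
    print(history, count)
```

The all-zero history never appears in such a table, since it corresponds to individuals that were never caught. Estimating the count for that history is the aim of the K-sample method.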

The first person to introduce the K-sample capture-recapture method was Schnabel in 1938 (12), in the context of fishing in a lake. She made the usual assumptions about the sampling and marking processes, such as that each sample is a simple random sample and that animals do not lose their tags. The theory of this model was developed more fully by Chapman, Darroch and others in the 1950s (13, chapter 4). However, it was recognized that some of the underlying assumptions may not hold. For example, there was the problem of heterogeneity: unmarked animals had different probabilities of being captured in a given sample, and marked animals behaved differently from unmarked ones. To cater for populations with these problems, a range of different models was introduced in the 1970s; these are associated with the names of Anderson, Burnham, Otis, White and others (see the review by Seber (14), p.275). These models have since been added to by Chao, so that a hierarchy of eight models is now available (see the reviews by Pollock (15); Seber (16, pp.141-3)).

The K-sample method has also been applied to populations in which migration, birth, and death take place during the period of the study (the open population). There is a very extensive and expanding literature on the subject (17,18). However, such models depend on the assumption that samples are independent. As this is not the case with lists, it is unlikely that these general models will be directly useful in epidemiology.

Another method for handling the breakdown of the assumptions is the log-linear model, which was applied by Fienberg (19) to capture-recapture data. In fact, a general log-linear framework allows for the representation and incorporation of most of these models for K lists, as well as some extensions for the generalization from closed to open populations (20).

Clearly, the above methodology has the potential to be applied to K lists. Unfortunately, we run into the same problem again, namely that of list dependence. Current thinking suggests that, of all the above approaches, only the log-linear model has the flexibility to handle this particular problem. However, such a model has to be used with caution, as some assumptions must still hold for the model to be useful (see the Appendix for details).
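As an illustration of how a log-linear model can allow for list dependence while still relying on an assumption, consider three lists. For the log-linear model that includes all pairwise list interactions but assumes no three-way interaction, the estimate of the number missing from all three lists has a well-known closed form in terms of the seven observed cell counts. The sketch below computes it. The function name and counts are hypothetical, and this particular model is used here only as an example; it is not necessarily the one treated in the Appendix.

```python
def missing_cell_three_lists(n):
    """Estimate the count missing from all three lists.

    n -- dict mapping capture histories such as (1, 0, 1) to observed
         counts; (1, 0, 1) means on lists 1 and 3 but not on list 2.

    Assumes a log-linear model with all pairwise interactions between
    lists and no three-way interaction.
    """
    numerator = n[(1, 1, 1)] * n[(1, 0, 0)] * n[(0, 1, 0)] * n[(0, 0, 1)]
    denominator = n[(1, 1, 0)] * n[(1, 0, 1)] * n[(0, 1, 1)]
    if denominator == 0:
        raise ValueError("estimate undefined when a two-list-only cell is empty")
    return numerator / denominator


# Hypothetical observed counts for the seven observable histories.
observed = {
    (1, 1, 1): 10,
    (1, 1, 0): 20, (1, 0, 1): 10, (0, 1, 1): 15,
    (1, 0, 0): 40, (0, 1, 0): 30, (0, 0, 1): 20,
}
missing = missing_cell_three_lists(observed)
total = sum(observed.values()) + missing
print(missing, total)
```

The pairwise interaction terms absorb dependence between any two lists, but the estimate is only as good as the assumption that no three-way interaction is present, which cannot be checked from the observed cells alone.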

References

1. Le Cren ED. A note on the history of mark-recapture population estimates. J Anim Ecol 1965;34:453-454.

2. Sekar C, Deming WE. On a method of estimating birth and death rates and extent of registration. J Am Stat Assoc 1949;44:101-115.

3. Shapiro S. Estimating birth registration completeness. J Am Stat Assoc 1950;45:261-264.

4. Tracey WR. Fertility of the population of Canada. Reprinted from Seventh Census of Canada, 1931 (Vol 2), Census Monograph No. 3. Ottawa: Cloutier.

5. Hogan H. The 1990 post-enumeration survey: operations and results. J Am Stat Assoc 1993;88:1047-1060.

6. Fienberg SE. Bibliography on capture-recapture modeling with application to census undercount adjustment. Survey Methodology 1992;18:143-154.

7. Wittes J, Sidel VW. A generalization of the simple capture-recapture model with applications to epidemiological research. J Chronic Dis 1968;21:287-301.

8. Wittes JT, Colton T, Sidel VW. Capture-recapture models for assessing the completeness of case ascertainment using multiple information sources. J Chronic Dis 1974;27:25-36.

9. Wolter KM. Some coverage error models for census data. J Am Stat Assoc 1986;81:338-346.

10. Hook EB, Regal RR. Effect of variation in probability of ascertainment by sources ("variable catchability") upon "capture-recapture" estimates of prevalence. Am J Epidemiol 1993;137:1148-1166.

11. Darroch JN, Fienberg SE, Glonek GFG, et al. A three-sample multiple-recapture approach to census population estimation with heterogeneous catchability. J Am Stat Assoc 1993;88:1137-1148.