One essential aspect of planning a capture-recapture study is selecting the sources from which the data will be collected. Before conducting any capture-recapture study, it is helpful to check the literature, which can reveal potential sources and the capture-recapture experience of other researchers. Source selection is critical because it affects the whole study design and the validity of the results. For a specific disease or type of injury, there are usually several sources available to identify cases. To decide which are suitable for capture-recapture use, one should keep in mind that every case should have a chance of being listed in each of the selected sources. Therefore, complementary or mutually exclusive sources, which are often used in traditional surveillance systems, are not encouraged. Instead, the sources should overlap, because the key component of capture-recapture analysis is the overlap information among the sources. The sources should also have the same geographic coverage and time frame. If the study concerns both the morbidity and mortality of a disease or type of injury, death certificates, for example, should be used with extra caution, since only the deceased have a chance of being listed on a death certificate, and such a list can only be compared with other "death" lists. One may consider stratifying the study by level of severity and/or geographic area if the sources differ in which cases they are able to capture; different sources could then be used to identify cases at different levels of severity or in different areas, and the stratum results combined afterwards. Finally, all the sources should apply the same case definition to recruit cases; e.g. the case definition should be consistent between laboratory-based and clinician-based sources.
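Because the overlap among sources carries the information used in the analysis, it can help to tabulate how many cases fall into each pattern of sources before committing to a design. The following is a minimal sketch, assuming each source has already been reduced to a set of matched case identifiers. The function name, the source names and the identifiers are all hypothetical and serve only as an illustration.

```python
from collections import Counter

def capture_histories(sources):
    """Count cases by the pattern of sources that listed them.

    `sources` maps a source name to a set of case identifiers.
    Returns a Counter keyed by tuples of 0/1 flags, one flag per
    source, in the order the sources appear in the mapping.
    """
    names = list(sources)
    all_cases = set().union(*sources.values())
    histories = Counter()
    for case in all_cases:
        pattern = tuple(int(case in sources[name]) for name in names)
        histories[pattern] += 1
    return histories

# Hypothetical identifiers, for illustration only.
lists = {
    "hospital": {"a", "b", "c", "d"},
    "physician": {"b", "c", "e"},
    "laboratory": {"c", "d", "e", "f"},
}
hist = capture_histories(lists)

for pattern, count in sorted(hist.items(), reverse=True):
    print(pattern, count)
```

A pattern such as (1, 1, 0) counts cases listed by the first two sources but not the third. If nearly all cases sit in single-source patterns, the sources behave as complementary lists and give little overlap information.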
Optimal number of sources?
One may think the more the better. However, it is neither necessary nor practical to have
as many sources as possible. The first reason is cost. Every source costs money and time: there
are costs for identifying each case and costs for locating and maintaining the identifiers. The more
sources, the higher the cost, so budgetary limitations usually will not allow researchers to include
a large number of sources in a study. The second reason is the variation of the estimates. The
variation of a capture-recapture estimate may increase as the number of sources increases.
Although more sources will identify more cases, this does not guarantee a more accurate
estimate; instead, the variation of the estimate may become large because too few cases overlap
between sources. There are no hard and fast rules, but it is likely that most epidemiologic
studies using capture-recapture should have 3-5 sources. With 3, 4 or 5 sources, one will be able
to determine whether there is dependence among the sources and/or heterogeneity among the
population. With only two sources, one cannot perform such an analysis, as it is impossible to
assess dependencies. If only two sources are available, it is necessary to check that the sources
meet the assumptions of the two-sample capture-recapture method, especially the independence
of the sources. When the capture-recapture method is used in the health area, the assumption of
source independence is the most problematic, as health lists are typically dependent. For
example, if one is investigating the incidence of a certain type of cancer and uses hospital
discharge data and physician records, it is very likely that an individual identified by the hospital
record was also identified by the physician record because of referral patterns. Therefore, the
two-sample method should be used with caution. A cost-benefit analysis of the sources would
help decide how many and which sources to include in a capture-recapture study.
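To see why dependence matters for the two-sample method, consider a small numerical sketch. It assumes the classical Lincoln-Petersen estimator (the text above does not name a specific estimator), and the population size, list sizes and overlap counts are hypothetical values chosen only for illustration.

```python
def lincoln_petersen(n1, n2, m):
    """Two-source estimate of the total number of cases: n1 * n2 / m.

    n1, n2: number of cases listed by each source
    m:      number of cases listed by both sources
    """
    if m == 0:
        raise ValueError("no overlap between sources; estimate is undefined")
    return n1 * n2 / m

# Hypothetical setting: 1,000 true cases, 500 on the hospital list,
# 400 on the physician list.
true_n, n_hospital, n_physician = 1000, 500, 400

# If the two lists were independent, the expected overlap would be
# n1 * n2 / N = 200 cases.
independent = lincoln_petersen(n_hospital, n_physician, 200)

# Positive dependence (e.g. referral patterns) inflates the overlap;
# 300 is an assumed value for illustration.
dependent = lincoln_petersen(n_hospital, n_physician, 300)

print(f"independent lists: {independent:.0f}")
print(f"positively dependent lists: {dependent:.0f}")
```

Under these assumed numbers, the inflated overlap pulls the estimate well below the true 1,000 cases. With only two sources, the data contain nothing that would reveal this.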
Pooling the sources together?
To reduce the number of sources, one can consider collapsing similar sources, or sources
that are positively dependent on one another, and treating the collapsed sources as one source.
If there are both major and minor sources and each minor source contributes only a small
number of cases, the minor sources could be combined into one major source. In some studies
the geographic coverage is large and each area has a local hospital that identifies the cases. In
such a capture-recapture application, all the hospitals should be considered one source rather
than separate sources, since patients are typically seen in only one hospital. To decide whether
to pool sources, it is also helpful to check the coefficients of variation of the estimates. If the
coefficient of variation drops substantially after combining (that is, the precision of the estimate
improves considerably), the sources should be combined.
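The coefficient-of-variation check can be sketched in a deliberately simplified two-source setting. The sketch assumes Chapman's nearly unbiased two-source estimator with its usual variance approximation, which is a choice made here and not something the text specifies. The case identifiers and helper names are hypothetical. A real multi-source analysis would normally compare models fitted to all sources, which this sketch does not attempt.

```python
import math

def chapman_cv(n1, n2, m):
    """Chapman two-source estimate and its coefficient of variation.

    Assumed formulas (not taken from the text above):
      N_hat = (n1 + 1)(n2 + 1) / (m + 1) - 1
      Var   = (n1 + 1)(n2 + 1)(n1 - m)(n2 - m) / ((m + 1)^2 (m + 2))
    """
    n_hat = (n1 + 1) * (n2 + 1) / (m + 1) - 1
    var = ((n1 + 1) * (n2 + 1) * (n1 - m) * (n2 - m)) / ((m + 1) ** 2 * (m + 2))
    return n_hat, math.sqrt(var) / n_hat

def compare(source_a, source_b):
    """Apply chapman_cv to two sets of case identifiers."""
    return chapman_cv(len(source_a), len(source_b), len(source_a & source_b))

# Hypothetical case identifiers, for illustration only.
major = set(range(0, 60))
clinic_a = set(range(50, 75))                       # minor source
clinic_b = set(range(40, 48)) | set(range(75, 90))  # minor source
pooled = clinic_a | clinic_b                        # minor sources combined

_, cv_a = compare(major, clinic_a)
_, cv_b = compare(major, clinic_b)
_, cv_pooled = compare(major, pooled)

print(f"CV with clinic A alone: {cv_a:.3f}")
print(f"CV with clinic B alone: {cv_b:.3f}")
print(f"CV with pooled clinics: {cv_pooled:.3f}")
```

In this made-up example the pooled source overlaps the major source on more cases than either clinic alone, so the coefficient of variation is smaller after pooling. With real data the comparison could go either way, which is why the check is worth making.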
Example of Capture-Recapture Sources