Matching Individuals From Different Sources

One of the most important steps in a capture-recapture application is determining the number of overlap cases, that is, individuals recorded by more than one source. This step is critical because capture-recapture methods assume "perfect matching": there are no misclassification errors in determining whether a particular individual has been recorded by both sources or by only one of them. If many true matches are not identified by the matching procedure (false negative matches), the overlap is undercounted and the number of missing cases is very likely to be overestimated. To determine the overlap cases, it is necessary to link, or match, the records of the same individual across sources. The variables used for linking are usually those common to all the sources, for example name, social security number, date of birth, address, and zip code. For injury data, the date of the event can also be used in the matching.
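The effect of missed matches can be illustrated with a short calculation. The sketch below assumes the simple two-source (Lincoln-Petersen) estimator, which this section does not itself specify; the function name and the counts are hypothetical and chosen only for illustration.

```python
def lincoln_petersen(n1, n2, m):
    """Two-source estimate (assumed estimator, not taken from the text).

    n1, n2: number of cases recorded by source 1 and source 2
    m:      number of overlap cases identified by matching
    Returns (estimated total cases, estimated missing cases).
    """
    if m <= 0:
        raise ValueError("at least one overlap case is required")
    total = n1 * n2 / m
    observed = n1 + n2 - m          # distinct cases seen by either source
    return total, total - observed


# Hypothetical counts: 60 true overlap cases, all identified.
print(lincoln_petersen(200, 150, 60))   # (500.0, 210.0)

# Same sources, but 10 true matches are missed (false negatives).
print(lincoln_petersen(200, 150, 50))   # (600.0, 300.0)
```

Because the overlap count sits in the denominator, missing 10 of 60 true matches in this example raises the estimated number of missing cases from 210 to 300.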

There are several matching techniques: exact matching, relaxed exact matching, probabilistic matching, and combinations of them. Exact matching finds the individuals from different sources who have exactly the same field values, while relaxed exact matching does not demand "exact" correspondence on the variables but allows a minimal degree of error in them. Probabilistic matching assigns different probabilities or weights to the variables, because some variables provide more information and are more reliable than others. For each record-pair comparison, a total weight is then generated by summing the individual weights from each variable or field comparison, and the pairs with highly positive total weights are taken as matches (Jaro, Ding-Fienberg).
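The three techniques can be sketched side by side as follows. This is a minimal illustration rather than the procedure of any cited author: the field names, the similarity threshold, and the agreement probabilities in M_U are hypothetical, and the weights use the common log-ratio form (agreement weight log2(m/u), disagreement weight log2((1-m)/(1-u))), which is an assumption here.

```python
import math
from difflib import SequenceMatcher

FIELDS = ("name", "dob", "zip")     # hypothetical linking variables


def exact_match(a, b):
    """Exact matching: every field must be identical."""
    return all(a[f] == b[f] for f in FIELDS)


def relaxed_match(a, b, min_similarity=0.85):
    """Relaxed exact matching: small discrepancies in a field are tolerated."""
    return all(
        SequenceMatcher(None, a[f], b[f]).ratio() >= min_similarity
        for f in FIELDS
    )


# Hypothetical (m, u) per field: m = P(fields agree | same person),
# u = P(fields agree | different persons).
M_U = {"name": (0.95, 0.01), "dob": (0.98, 0.003), "zip": (0.90, 0.05)}


def match_weight(a, b):
    """Probabilistic matching: sum the field weights for one record pair."""
    total = 0.0
    for f in FIELDS:
        m, u = M_U[f]
        if a[f] == b[f]:
            total += math.log2(m / u)               # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))   # disagreement weight
    return total


rec_a = {"name": "JOHN SMITH", "dob": "1960-03-12", "zip": "15213"}
rec_b = {"name": "JON SMITH", "dob": "1960-03-12", "zip": "15213"}
rec_c = {"name": "MARY JONES", "dob": "1971-11-02", "zip": "15213"}

print(exact_match(rec_a, rec_b), relaxed_match(rec_a, rec_b))
print(match_weight(rec_a, rec_b), match_weight(rec_a, rec_c))
```

In this example the misspelled name defeats exact matching, while relaxed matching accepts the pair and the probabilistic weight is positive; the unrelated pair receives a negative weight even though the zip codes agree.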

Purely exact matching is not recommended, even for a "perfect" data set, because typing errors and coding mistakes usually occur during data collection. If name identifiers are available, e.g. name or social security number, a combination of exact matching with relaxed exact matching or probabilistic matching is encouraged: exact matching on the name identifiers is applied first, and relaxed exact or probabilistic matching on the other variables is then used to match the remaining cases. If there are no name identifiers, or the data set is not "clean", probabilistic matching or relaxed exact matching should be used. Whatever technique is used, the researchers should have a deep understanding of the data and the sources, and the data should be inspected visually before any matching is done.
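The combined strategy can be sketched as a two-pass procedure. The code below is an illustration under stated assumptions, not a prescribed implementation: the field names (ssn, name, dob), the similarity threshold, and the sample records are all hypothetical, and the second pass uses a simple string-similarity rule as a stand-in for whichever relaxed or probabilistic rule is chosen.

```python
from difflib import SequenceMatcher


def similar(x, y, threshold=0.85):
    """True if two strings are nearly identical (hypothetical threshold)."""
    return SequenceMatcher(None, x, y).ratio() >= threshold


def two_pass_match(source_a, source_b):
    """Pass 1: exact match on the identifier. Pass 2: relaxed match on the rest."""
    matches, used_b, remaining_a = [], set(), []

    # Index source B by identifier, skipping records where it is missing.
    ssn_index = {}
    for j, rec in enumerate(source_b):
        if rec.get("ssn"):
            ssn_index.setdefault(rec["ssn"], j)

    # Pass 1: exact matching on the identifier.
    for i, rec in enumerate(source_a):
        j = ssn_index.get(rec.get("ssn"))
        if j is not None and j not in used_b:
            matches.append((i, j, "exact"))
            used_b.add(j)
        else:
            remaining_a.append(i)

    # Pass 2: relaxed matching on other variables for the remaining cases.
    for i in remaining_a:
        for j, rec_b in enumerate(source_b):
            if j in used_b:
                continue
            rec_a = source_a[i]
            if similar(rec_a["name"], rec_b["name"]) and rec_a["dob"] == rec_b["dob"]:
                matches.append((i, j, "relaxed"))
                used_b.add(j)
                break
    return matches


# Made-up records for illustration only.
source_a = [
    {"ssn": "000-00-0001", "name": "JOHN SMITH", "dob": "1960-03-12"},
    {"ssn": None, "name": "ANNE MILLER", "dob": "1975-07-30"},
    {"ssn": "000-00-0002", "name": "PAUL BROWN", "dob": "1982-01-05"},
]
source_b = [
    {"ssn": "000-00-0001", "name": "J SMITH", "dob": "1960-03-12"},
    {"ssn": None, "name": "ANN MILLER", "dob": "1975-07-30"},
    {"ssn": "000-00-0003", "name": "LISA WHITE", "dob": "1990-09-09"},
]

print(two_pass_match(source_a, source_b))
# [(0, 0, 'exact'), (1, 1, 'relaxed')]
```

Pairs accepted in the second pass are the ones most worth reviewing by eye, since they rest on weaker evidence than an identifier match.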

Several statistical software packages can perform exact matching; however, these matching procedures can handle only one variable at a time. For example, the "match" command in SPSS can match two files based on a variable specified by the user, and the "find duplicate" command in the S1032 database system can identify duplicate cases in a data set.
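Where a package matches on only one variable, one workaround is to combine several variables into a single composite key and match on that. The sketch below shows the idea in plain Python; it does not reproduce the SPSS or S1032 commands, and the field names, helper names, and records are hypothetical.

```python
from collections import defaultdict

KEY_FIELDS = ("name", "dob", "zip")     # hypothetical linking variables


def composite_key(rec):
    """Combine several variables into one key, normalizing case and spacing."""
    return tuple(rec[f].strip().upper() for f in KEY_FIELDS)


def find_duplicates(records):
    """Return groups of record indices that share the same composite key."""
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[composite_key(rec)].append(idx)
    return [idxs for idxs in groups.values() if len(idxs) > 1]


def exact_overlap(source_a, source_b):
    """Indices of records in source_a whose composite key appears in source_b."""
    keys_b = {composite_key(r) for r in source_b}
    return [i for i, r in enumerate(source_a) if composite_key(r) in keys_b]


# Made-up records for illustration only.
hospital = [
    {"name": "John Smith", "dob": "1960-03-12", "zip": "15213"},
    {"name": "JOHN SMITH ", "dob": "1960-03-12", "zip": "15213"},
    {"name": "Mary Jones", "dob": "1971-11-02", "zip": "15217"},
]
police = [
    {"name": "MARY JONES", "dob": "1971-11-02", "zip": "15217"},
]

print(find_duplicates(hospital))        # [[0, 1]]
print(exact_overlap(hospital, police))  # [2]
```

Removing duplicates within each source before matching across sources keeps a repeated record from being counted as more than one case.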