Model Selection


In the early days of capture-recapture, model selection was of little interest: few models existed, and the most appropriate of those available was easily identified. However, the work of Otis et al. (1), Cormack (see log-linear model section), and others has generated a large number of potential models. In epidemiological studies in particular, such a choice of models is necessary to handle the dependence between lists and the extreme heterogeneity that is often observed between individuals. Model selection is thus part of the estimation procedure, and should be fully integrated with it.

Estimation is usually carried out within the framework of maximum likelihood. Thus a natural way to select between nested models is to use likelihood ratio tests. The likelihood function can also be used to select between non-nested models, for example using Akaike's Information Criterion (AIC; (2)), an approach that was advocated for use in capture-recapture by Burnham and Anderson (3). The criterion is calculated as

      AIC = -2 × [log(L) - q],

where log(L) is the log-likelihood evaluated at the maximum likelihood estimates of the model parameters, and q is the number of parameters in the model. The first term is a measure of how well the model fits the data, and the second term is a penalty for the addition of parameters (and hence model complexity). The model giving the smallest value of AIC is selected by the criterion. For the special case of nested models in which one model has exactly one more parameter than the other, AIC is equivalent to a likelihood ratio test of size 15.7%, and hence it is more likely to select the more complex model than a likelihood ratio test carried out at the conventional size of 5% (p = 0.05). In some applications, the Bayes information criterion (BIC) is preferred to AIC:

      BIC = -2 × log(L) + q × log(n),

with log(L) and q as above, and n the sample size. In the context of mark-recapture, it is questionable what n should be. One choice is the total number of recorded individuals in the population. Burnham et al. (4) argue that it should instead be the number of releases, excluding those released from the last sample (a sample corresponds to a list in the present context). However, in the context of epidemiology, there is no chronological ordering of the lists, so this solution is unsatisfactory. Burnham et al. also make a compelling case for use of AIC in preference to information criteria that have 'dimensional consistency' such as BIC, in which case the problem is avoided.
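The two criteria above can be sketched in a few lines of Python. This is only an illustration: the log-likelihoods, parameter counts and values of n are hypothetical, and the helper names (`aic`, `bic`, `select`) are not taken from any package. The sketch also checks the 15.7% figure quoted above. With one extra parameter, AIC prefers the larger model when the likelihood ratio statistic exceeds 2, and the upper tail probability of a chi-square distribution with one degree of freedom at x is erfc(sqrt(x/2)).

```python
import math

def aic(log_lik, q):
    """AIC = -2 * [log(L) - q]."""
    return -2.0 * (log_lik - q)

def bic(log_lik, q, n):
    """BIC = -2 * log(L) + q * log(n)."""
    return -2.0 * log_lik + q * math.log(n)

def select(models, criterion):
    """Return the name of the model with the smallest criterion value."""
    return min(models, key=lambda name: criterion(*models[name]))

# Hypothetical fits: name -> (maximised log-likelihood, number of parameters).
models = {"simple": (-110.0, 3), "complex": (-107.0, 4)}

# Size of the likelihood ratio test implied by AIC for one extra parameter:
# P(chi-square with 1 df > 2) = erfc(sqrt(2 / 2)).
implied_size = math.erfc(math.sqrt(2.0 / 2.0))

aic_choice = select(models, aic)

# BIC depends on what is taken as n; two hypothetical choices are compared.
bic_choice_small_n = select(models, lambda ll, q: bic(ll, q, 300))
bic_choice_large_n = select(models, lambda ll, q: bic(ll, q, 800))

print(f"implied LRT size: {implied_size:.3f}")
print("AIC:", aic_choice,
      "| BIC (n=300):", bic_choice_small_n,
      "| BIC (n=800):", bic_choice_large_n)
```

With these made-up numbers, AIC and BIC with n = 300 select the larger model, whereas BIC with n = 800 selects the smaller one, which illustrates why the ambiguity over n matters for BIC.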

It is usually neither practical nor sensible to fit all possible models so that AIC can be evaluated for each one. In any case, the data often do not contain enough information to select between contending models (see log-linear model section). Additional knowledge should be used wherever available to eliminate implausible models from consideration, or to select between models that yield identical likelihood functions.

As has been noted earlier, having obtained an estimate of population size and a corresponding standard error, it is not satisfactory to set a 95% confidence interval as estimate ± 1.96 standard errors. This is partly because the estimate is frequently far from normally distributed. Improved methods, using a transformation, likelihood intervals or the bootstrap, have already been noted. However, the resulting intervals are still too short in general, because they do not reflect model mis-specification. In other words, they are conditional on the correct model having been selected. If model selection is to be an integral part of estimation, allowance must be made for having to estimate which model is appropriate. In principle, this may be done in a likelihood or a Bayesian framework, although it is not trivial.

A simpler approach is to generate bootstrap resamples, and to apply the model selection procedure to each resample. Buckland and Garthwaite (5) show how this may be done under different models. The simplest method is to list the capture data for each observed individual, and to sample with replacement from the list until the resample contains the same number of individuals as the original sample. Now follow through the estimation procedure, including model selection, on this resample exactly as if it had been the observed sample. Repeat b times (typically b = 400 to 1000) to generate the bootstrap estimates of population size. Variance may be estimated as the sample variance of these bootstrap estimates, and a 95% 'percentile' confidence interval obtained by ordering the bootstrap estimates from smallest to largest and reading off the r-th and s-th values from the ordered list, where r = 0.025(b+1) and s = 0.975(b+1) (6). Thus for b = 999, r = 25 and s = 975. Both the variance and the confidence interval include a component for model mis-specification bias. Typically in epidemiological studies, this bias can be large, and the improvement in confidence interval coverage gained by allowing for it is far greater than that achieved by replacing the naive estimate ± 1.96 standard errors by a method that does not assume that the estimator is normally distributed but is still conditional on the fitted model.
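The resampling scheme just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of Buckland and Garthwaite (5): the function names are hypothetical, the data are invented, and the argument `estimate_with_selection` stands for whatever routine carries out model selection and estimation on a set of capture histories. In the example, Chapman's two-list estimator is used as a stand-in for that routine, so no real model selection takes place.

```python
import random
import statistics

def percentile_ranks(b):
    """1-based positions r = 0.025(b+1) and s = 0.975(b+1) in the ordered list."""
    return round(0.025 * (b + 1)), round(0.975 * (b + 1))

def bootstrap_with_selection(histories, estimate_with_selection, b=999, seed=1):
    """Resample individuals with replacement and re-run the whole procedure."""
    r, s = percentile_ranks(b)
    if r < 1:
        raise ValueError("b is too small for a 95% percentile interval")
    rng = random.Random(seed)
    n = len(histories)
    estimates = []
    for _ in range(b):
        # Same number of individuals as in the original sample.
        resample = [rng.choice(histories) for _ in range(n)]
        # Model selection and estimation are repeated on every resample.
        estimates.append(estimate_with_selection(resample))
    estimates.sort()
    variance = statistics.variance(estimates)
    return variance, (estimates[r - 1], estimates[s - 1])

def chapman(histories):
    """Stand-in estimator for two lists (assumption: no model selection step)."""
    n1 = sum(h[0] for h in histories)          # individuals on list 1
    n2 = sum(h[1] for h in histories)          # individuals on list 2
    m = sum(h[0] and h[1] for h in histories)  # individuals on both lists
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1

# Invented data: one capture history per observed individual.
histories = [(1, 0)] * 60 + [(0, 1)] * 40 + [(1, 1)] * 20
variance, (lower, upper) = bootstrap_with_selection(histories, chapman)
print(f"variance = {variance:.1f}, 95% interval = ({lower:.1f}, {upper:.1f})")
```

In a real analysis the stand-in would be replaced by a function that fits the candidate models to the resample, chooses among them (for example by AIC), and returns the population size estimate from the chosen model.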

Although the above approach is undoubtedly an improvement on a strategy that ignores model selection uncertainty, it is still imperfect. For example, we would like to resample from the full set of individuals in the population, but we do not know the number of individuals that appear on no list. Additionally, in bootstrapping individuals, we assume that the presence of one individual on a list in no way affects whether another appears on the list, yet dependence of this type is one of the sources of heterogeneity that is so problematic in epidemiological applications of mark-recapture. Use of the nonparametric bootstrap assumes that the frequencies associated with each 'capture history' (i.e. each possible combination of lists in which an individual might appear) are multinomial. If we wish to relax this assumption, for example to allow different individuals with the same capture history to have different probabilities of appearing on a list, we can fit a mark-recapture model that allows this, and generate resamples from this fitted model, using the parametric bootstrap.
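One way a parametric bootstrap resample might be generated is sketched below. The heterogeneity model is an assumption made purely for illustration: each individual is given its own listing probability drawn from a Beta distribution and appears on each list independently with that probability. The values of `n_hat`, `alpha` and `beta` are hypothetical stand-ins for quantities that would come from the fitted model.

```python
import random

def simulate_histories(n_hat, n_lists, alpha, beta, rng):
    """Simulate capture histories for n_hat individuals under an assumed
    Beta heterogeneity model, keeping only individuals seen on some list."""
    histories = []
    for _ in range(n_hat):
        # Individual-specific probability of appearing on any one list.
        p = rng.betavariate(alpha, beta)
        history = tuple(int(rng.random() < p) for _ in range(n_lists))
        # Individuals on no list are never observed, so they are dropped.
        if any(history):
            histories.append(history)
    return histories

# Hypothetical fitted values: estimated population size and Beta parameters.
rng = random.Random(7)
resample = simulate_histories(n_hat=240, n_lists=3, alpha=2.0, beta=5.0, rng=rng)
print(len(resample), "of 240 simulated individuals appear on at least one list")
```

Each simulated resample would then be passed through the same model selection and estimation procedure as the observed data.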

The fitting of log-linear models to capture-recapture data can be computer intensive, and model selection is not easily fully automated, even using AIC, because it usually benefits from interaction between the scientist(s) and the computer to home in on the 'optimal' model. The scientist is unlikely to pay the same attention to detail for each of 1000 bootstrap resamples as for the real data! Thus considerable computer power, and development of an artificial intelligence algorithm to mimic the scientist's model selection methods, are required to take full advantage of the bootstrap approach.


References

1. Otis DL, Burnham KP, White GC and Anderson DR. Statistical inference from capture data on closed animal populations. Wildlife Monographs 1978;62.

2. Akaike H. Prediction and entropy. In A Celebration of Statistics (eds A.C. Atkinson and S.E. Fienberg). Springer-Verlag, Berlin, 1985;1-24.

3. Burnham KP and Anderson DR. Data-based selection of an appropriate biological model: the key to modern data analysis. In Wildlife 2001: Populations (eds D.R. McCullough and R.H. Barrett). Elsevier Science Publishers, London, 1992;16-30.

4. Burnham KP, Anderson DR and White GC. Evaluation of the Kullback-Leibler discrepancy for model selection in open population capture-recapture models. Biom J 1994;36:299-315.

5. Buckland ST and Garthwaite PH. Quantifying precision of mark-recapture estimates using the bootstrap and related methods. Biometrics 1991;47:255-268.

6. Buckland ST. Monte Carlo confidence intervals. Biometrics 1984;40:811-817.