Information criteria: AIC, AICc, BIC

Information-theoretic approaches view inference as a problem of model selection. The best model is the one that has the least information loss relative to the true model. Information criteria (IC) are estimates of the Kullback–Leibler information loss, which cannot be calculated for real-life models. The best-known IC is the Akaike IC,

    \[AIC = -2\ln(l) + 2k\]

and its corrected form for small sample sizes,

    \[AIC_c=AIC+\frac{2k(k+1)}{N-k-1}\]

as well as its Bayesian alternative,

    \[BIC = -2\ln(l) + \ln(N) \cdot k\]

where l is the maximized likelihood of the model, k is the number of degrees of freedom (or independently adjusted parameters) in the model, and N is the number of observations (total sample size) (Baguley, p403).
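
As a sanity check on these definitions, here is a minimal R sketch (on simulated data, so the numbers themselves are only illustrative) that computes AIC, AICc, and BIC by hand from a fitted lm and compares them with R's built-in AIC() and BIC(); note that R's parameter count k includes the residual variance.

set.seed(1)
N <- 30
x <- rnorm(N)
y <- 2 + 0.5 * x + rnorm(N)
fit <- lm(y ~ x)

ll <- logLik(fit)            # maximized log-likelihood
k  <- attr(ll, "df")         # parameters as counted by R (intercept, slope, sigma)

aic  <- -2 * as.numeric(ll) + 2 * k
aicc <- aic + 2 * k * (k + 1) / (N - k - 1)
bic  <- -2 * as.numeric(ll) + log(N) * k

c(by_hand = aic, builtin = AIC(fit))   # should match
c(by_hand = bic, builtin = BIC(fit))   # should match
aicc                                   # small-sample corrected AIC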

Model selection

Model selection is not a science. Except in rare circumstances, there is no one perfect model, or even one “true” model; there is rarely even one “best” model. (Testing the difference in AIC of two non nested models)

Roughly speaking, when choosing between two models, the more parsimonious one (having fewer parameters to estimate, and therefore more residual degrees of freedom) will be suggested as preferable. An information criterion introduces a penalty function that restricts the inclusion of additional explanatory variables into a linear model, conceptually similar to the restriction introduced by adjusted R². (Test equivalence of non nested models)
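
To make the analogy with adjusted R² concrete, here is a small, hedged R sketch (simulated data of my own choosing): adding a pure-noise predictor always raises raw R², but adjusted R² and AIC both penalize the extra parameter.

set.seed(42)
N <- 50
x <- rnorm(N)
z <- rnorm(N)                  # irrelevant predictor
y <- 1 + 0.8 * x + rnorm(N)

fit_small <- lm(y ~ x)
fit_big   <- lm(y ~ x + z)

c(R2_small = summary(fit_small)$r.squared,
  R2_big   = summary(fit_big)$r.squared)          # raw R^2 always increases
c(adjR2_small = summary(fit_small)$adj.r.squared,
  adjR2_big   = summary(fit_big)$adj.r.squared)   # adjusted R^2 typically does not
c(AIC_small = AIC(fit_small), AIC_big = AIC(fit_big))  # AIC penalizes the extra term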

A few other simple definitions of AIC:

AIC is a number that is helpful for comparing models as it includes measures of both how well the model fits the data and how complex the model is.

AIC is just one of several reasonable ways to capture the trade-off between goodness of fit (which is improved by adding model complexity in the form of extra explanatory variables, or adding caveats like “but only on Thursday, when raining”) and parsimony (simpler==better) in comparing non-nested models.

AIC is a measure of how well the data are explained by the model, corrected for how complex the model is.

They are “both mathematically convenient approximations one can make in order to efficiently compare models. If they give you different ‘best’ models, it probably means you have high model uncertainty, which is more important to worry about than whether you should use AIC or BIC.”

Information criteria are different solutions to the problem of parsimony, or different ‘brands’ of Occam’s razor. ICs are a composite of ‘model fit’ and a ‘penalty for complexity’ (Baguley, p403).

Which IC is best

Discussions of AIC vs. AICc vs BIC vs. SBC vs. whatever leave me somewhat nonplussed. I think the idea is to get some GOOD models. You then choose among them based on a combination of substantive expertise and statistical ideas. If you have no substantive expertise (rarely the case; much more rarely than most people suppose) then choose the lowest AIC (or AICc or whatever). But you usually DO have some expertise – else why are you investigating these particular variables? (Testing the difference in AIC of two non nested models)

Note that BIC replaces the 2 in the AIC penalty with ln(N). Because e^2 \approx 7.39, BIC will select simpler models than AIC once N ≥ 8. AICc is preferred for small samples (and, in general, AICc is always preferred (Baguley p403)). For a simple, linear, one-predictor model (univariate regression), k=2 (slope and intercept).
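
A quick numeric check of that threshold in R (just evaluating the penalties, nothing model-specific):

exp(2)      # BIC's per-parameter penalty log(N) exceeds AIC's 2 once N passes this value
## [1] 7.389056
log(7:9)    # log(7) < 2, but log(8) and log(9) > 2, so BIC penalizes harder from N = 8 on
## [1] 1.945910 2.079442 2.197225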

AIC might overfit, whereas BIC might underfit. AIC is a measure of the goodness of fit of a model, adjusting (or penalizing) for the number of parameters in the model.

AIC should rarely be used, as it is really only valid asymptotically. It is almost always better to use AICc (AIC with a correction for finite sample size). AIC tends to overparameterize (because the model-size penalty is pretty low): that problem is greatly lessened with AICc. The main exception to using AICc is when the underlying distributions are heavily leptokurtic. (Model Selection by Burnham & Anderson).

The absolute value of an IC does not matter; only the differences between the ICs of different models matter.

AIC is a measure of predictive discrimination whereas the Brier score is a combined measure of discrimination + calibration.

AIC tries to select the model that most adequately describes an unknown, high-dimensional reality. This means that reality is never in the set of candidate models being considered. BIC, on the contrary, tries to find the TRUE model among the set of candidates. I find it quite odd to assume that reality is instantiated in one of the models that the researchers built along the way; this is a real issue for BIC…. Use both AIC and BIC. Most of the time they will agree on the preferred model; when they don’t, just report it.

Another:

My quick explanation is that AIC is best for prediction, as it is asymptotically equivalent to cross-validation. BIC is best for explanation, as it allows consistent estimation of the underlying data-generating process.

AIC corresponds to leave-one-out (LOO) cross-validation, while BIC corresponds to k-fold cross-validation.
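
A hedged illustration of the AIC–LOO connection in R, using two non-nested models for the built-in mtcars data (my choice of example; the equivalence is only asymptotic, but the rankings usually agree). The PRESS statistic gives the exact leave-one-out squared prediction error for a linear model.

press <- function(fit) sum((residuals(fit) / (1 - hatvalues(fit)))^2)  # exact LOO sum of squared errors for lm

fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ hp, data = mtcars)

c(AIC_wt = AIC(fit1), AIC_hp = AIC(fit2))          # lower is better
c(PRESS_wt = press(fit1), PRESS_hp = press(fit2))  # lower is better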

Harrell:

In my experience, BIC results in serious underfitting and AIC typically performs well, when the goal is to maximize predictive discrimination.

Testing the difference between two non-nested models

dAIC <- 5
exp(-0.5 * dAIC)  # relative likelihood of the model whose AIC is 5 units higher
## [1] 0.082085
exp(-0.5 * -2)    # a model 2 AIC units lower is ~2.72 times as probable to minimize information loss
## [1] 2.718282
exp(-0.5 * -10)   # a model 10 AIC units lower is ~148 times as probable
## [1] 148.4132

Find the model with the minimum AIC_c. All other models will have a positive \Delta AIC_c and will lose some information. If \Delta AIC_c = 5, then the second model is e^{-0.5 \times 5} \approx 0.08 times as probable as the first model to minimize the information loss.
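
The same arithmetic extends to more than two models via Akaike weights (Burnham & Anderson); a minimal R sketch with made-up AICc values:

aicc  <- c(m1 = 100.0, m2 = 102.3, m3 = 105.0)   # hypothetical AICc values
delta <- aicc - min(aicc)                        # Delta AICc relative to the best model
rel_L <- exp(-0.5 * delta)                       # relative likelihood of each model
w     <- rel_L / sum(rel_L)                      # Akaike weights, sum to 1
round(cbind(delta, rel_L, w), 3)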

R functions:

  • library(lmtest)
  • coxtest(fit1, fit2)
  • jtest(fit1, fit2)

where fit1 and fit2 are two non-nested fitted linear regression models, coxtest is the Cox likelihood-ratio test, and jtest is the Davidson–MacKinnon J test (Test equivalence of non nested models). A runnable example follows below.
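
A minimal runnable version of those calls, assuming the lmtest package is installed and using two non-nested models for the built-in mtcars data (an example of my choosing):

library(lmtest)

fit1 <- lm(mpg ~ wt, data = mtcars)   # weight as the predictor
fit2 <- lm(mpg ~ hp, data = mtcars)   # horsepower as the predictor

coxtest(fit1, fit2)   # Cox likelihood-ratio-style test for non-nested models
jtest(fit1, fit2)     # Davidson-MacKinnon J test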

References:

AIC questions on SO: Model selection
