One of the basic things the students are taught in statistics classes is that the comparison of models using information criteria can only be done when the models have the same response variable. This means, for example, that when you have \(\log(y_t)\) and calculate AIC, then this value is not comparable with AIC from a model with \(y_t\). The reason for this is because the scales of variables are different. But there is a way to make the criteria in these two cases comparable: both variables need to be transformed into the original scale, and we need to understand what are the distributions of these variables in that scale. In our example, if we assume that \(\log(y_t) \sim \mathcal{N}(0, \sigma^2_{l}) \) (where \(\sigma^2_{l}\) is the variance of the residuals of the model in logarithms), then the exponent of this variable will have log-normal distribution:

\begin{equation}

y_t \sim \text{log}\mathcal{N}(0, \sigma^2_{l})

\end{equation}

Just as a reminder, all the information criteria rely on the log-likelihood. For example, here’s the formula of AIC:

\begin{equation} \label{eq:AIC}

AIC = 2k -2\ell ,

\end{equation}

where \(k\) is the number of all the estimated parameters and \(\ell\) is the value of the log-likelihood.

If we use the likelihood of log-normal distribution instead of the likelihood of the normal in \eqref{eq:AIC} for the variable \(y_t\), then the information criteria will become comparable. In order to understand what needs to be done for this transformation, we need to compare the formulae for the two distributions: normal and log-normal. Here’s normal for the variable \(\log y_t\):

\begin{equation} \label{eq:normal}

f(y_t | \theta, \sigma^2_{l}) = \frac{1}{\sqrt{2 \pi \sigma^2_{l}}} e ^{-\frac{\left(\log y_t -\log \mu_{t} \right)^2}{2 \sigma^2_{l}}}

\end{equation}

and here’s the log-normal for the variable \(y_t = \exp(\log(y_t))\) (the multiplicative model in the original scale):

\begin{equation} \label{eq:log-normal}

f(y_t | \theta, \sigma^2_{l}) = \frac{1}{y_t} \frac{1}{\sqrt{2 \pi \sigma^2_{l}}} e ^{-\frac{\left(\log y_t -\log \mu_{t} \right)^2}{2 \sigma^2_{l}}} ,

\end{equation}

where \(\theta\) is the vector of parameters of our model. The main difference between the two distributions is in \(\frac{1}{y_t}\). If we derive the log-likelihood based on \eqref{eq:log-normal}, here is what we get:

\begin{equation} \label{eq:loglikelihoodlognormal}

\ell(\theta, \sigma^2_{l} | Y) = -\frac{1}{2} \left( T \log \left( 2 \pi {\sigma}^2_{l} \right) +\sum_{t=1}^T \frac{\left(\log y_t -\log \mu_{t} \right)^2}{2\sigma^2_{l}} \right) – \sum_{t=1}^T \log y_t ,

\end{equation}

where \(Y\) is the vector of all the actual values in the sample. When we extract likelihood of the model in logarithms, we calculate only the first part of \eqref{eq:loglikelihoodlognormal}, before the “\(-\sum_{t=1}^T \log y_t \)”, which corresponds to the normal distribution. So, in order to produce the likelihood of the model with the variable in the original scale, we need to subtract the sum of logarithms of the response variable from the extracted likelihood.

The function
AIC() in R, applied to the model in logarithms, will extract the value based on that first part of \eqref{eq:loglikelihoodlognormal}. As a result, in order to fix this and get AIC in the same scale as the variable \(y_t\) we need to take the remaining part into account, modifying equation \eqref{eq:AIC}:

\begin{equation} \label{eq:AICNew}

AIC^{\prime} = 2k -2\ell + 2 \sum_{t=1}^T \log y_t = AIC + 2 \sum_{t=1}^T \log y_t,

\end{equation}

Let’s see an example in R. We will use longley data from datasets package. First we construct additive and multiplicative models:

1 2 |
modelAdditive <- lm(GNP~Employed,data=longley) modelMultiplicative <- lm(log(GNP)~Employed,data=longley) |

And extract the respective AICs:

1 2 3 4 |
AIC(modelAdditive) > 142.7824 AIC(modelMultiplicative) > -44.5661 |

As we see, the values are not comparable. Now let’s modify the second AIC:

1 2 |
AIC(modelMultiplicative) + 2*sum(log(longley$GNP)) > 145.118 |

So, now the values are comparable, and we can conclude that the first model (additive) is better than the second one in terms of AIC.

Similar technique can be used for the other transformed response variables (square root, Box-Cox transformation), but the respective distributions of the variables would need to be derived, which is not always a simple task.

UPDATE: Tim Cole pointed out to me that he has a package in R that implements the adjusted AIC and BIC (including more advanced adjustements). It is called sitar.