Empirical error

Published:

In a previous entry we studied the concepts of bias and variance in an additive context. In this entry we dive deeper into the analysis of the mean squared error and how to asses it using actual data.

Mean squared error revised

In the previous entry, we introduced bias and variance them as part of a decomposition of the mean squared error. Recall our model consisted of $d$-dimensional inputs $x^{(1)},\dots,x^{(m)}$ and real valued outputs/responses $y^{(1)},\dots,y^{(m)}$. We also assume there is a function $f:\mathbb{R}^d\to\mathbb{R}$ and random variables $\epsilon^{(k)}:\mathbb{R}^d\to\mathbb{R}$ such that

$y^{(k)} = f(x^{(k)}) + \epsilon^{(k)}.$

Our model then is

$\widehat y = \widehat f (x),$

where $\widehat f:\mathbb{R}^d\to\mathbb{R}$ is a random function depending on the data we have. In this setting, the mean squared error is

$MSE = \mathbb{E}(y-\widehat f)^2.$

The problem is that evaluating this expectation is not a trivial task, as we do not really have access to the true distribution of $y$. Fortunately we have at our disposition one of the fundamental tools of probability theory, which allows us to approximate expectations.

The law of large numbers

In previous entries (see here and here) we have seen how we can use the law of large numbers to study the almost sure behavior of averages of i.i.d. random variables. We use it now the other way around: in order to evaluate an expectation, we compute averages of realizations of random variables. Recall the LLN:

Theorem (LLN): suppose the i.i.d. sequence $\{X_n \}$ has expectation $\mathbb{E}(X_1) = \mu$ and finite fourth moment $\mathbb{E}(X_1^4)<\infty$. Then $S_n/n \to \mu$ almost surely.

Here $S_n = X_1 + \dots + X_n$. In general the assumption that the fourth moment of $X_n$ is finite can be relaxed to the assumption that only the first moment is finite, but the proof is much harder.

Now back to our problem: we proposed the MSE as a measure to asses the goodness of the fit of our model to the data. The MSE can be expressed as an expectation: $MSE = \mathbb{E}(y-\widehat f)^2$. If we want to evaluate this expectation using the LLN, we need to find a sequence of i.i.d. random variables $\{X_n\}$ such as

$\dfrac{X_1+\dots +X_n}{n} \to \mathbb{E}(X_n) = MSE$

almost surely.

Empirical error

Consider the (finite) sequence of random variables $X_k = (y - \hat f)^2$. It can be proved under reasonable assumptions that these functions are integrable, and hence we can apply the law of large numbers. If we consider their averages, we have that

$\dfrac{1}{n}\sum_{k=1}^n (y - \hat f)^2 \to \mathbb{E}(y - \hat f)^2 = MSE$

as $n\to\infty$. We call the sum in the left hand side the empirical error. Note that taking $n\to\infty$ can be thought as adding more points to our dataset. In other words, the bigger the dataset is, the better the approximation of the MSE by the empirical error. Recall that the MSE can be decomposed in the noise + bias^2 + variance decomposition (for the additive case):

$MSE = \sigma^2 + \text{var} (\widehat f) + \left(\mathbb{E}(f-\widehat f)\right)^2.$

being $\sigma^2$ the variance of $\epsilon^{(k)}$. Using similar ideas, one can approximate the bias and the variance. This is in general not necessary, as one can use directly the MSE to asses the fit of the model. One of the main paradigms nowadays in Statistical Learning is to fit models by finiding parameters which minimize the empirical risk (or analogue quantities). The idea is that we can find the parameters of our model using a portion of the dataset (called training set) and asses the fit using a portion of the dataset which was not used to train the model (called test set). We can formulate then a procedure to find appropriate models for our dataset:

1. For each model, we find the optimal train and test error,
2. We plot in the x-axis the complexity of the model (i.e., the number of parameters), and in the y-axis the error. We plot separately two curves, one for the train error and one for the test error.
3. In general, we observe the following: as the complexity increases, the training error decreases; on the other hand, the test error decreases for the first part of the plot, and then it increases again.
4. To select the model, we chose the one with complexity so that the test error is minimal, while trying to keep the training error as low as possible.

It is worth noticing that this procedure may be used to chose some of the hyperparameters of a model, such as the number of hidden layers and nodes of a neural network.