<h2>Is overfitting… good?</h2>
<p>Felipe Pérez (fel.prz@gmail.com), mathematician &amp; data scientist. Published 2020-03-06 at https://felperez.github.io/posts/2020/03/blog-post-31 (feed https://felperez.github.io/feed.xml, generated by Jekyll, last updated 2020-11-10).</p>
<p>Conventional wisdom in Data Science/Statistical Learning tells us that when we try to fit a model that is able to learn from our data and generalize what it learned to unseen data, we must keep in mind the <a href="/posts/2020/01/blog-post-27/">bias/variance trade-off</a>. This means that as we increase the complexity of our models (let us say, the number of learnable parameters), it is more likely that they will just <em>memorize</em> the data and will not be able to generalize well to unseen data. On the other hand, if we keep the complexity low, our models will not be able to <em>learn</em> too much from our data and will not do well either. We are told to find the <em>sweet spot</em> in the middle. But is this paradigm about to change? In this article we review new developments that suggest that this might be the case.</p>
<p>If you like my content, consider following my <a href="https://www.linkedin.com/in/felperez/">LinkedIn</a> page to stay updated.</p>
<h3 id="revising-bias-and-variance">Revising bias and variance</h3>
<p>While we framed it as conventional wisdom, the bias/variance trade-off is based on the observation that the loss functions that we minimize when fitting models to our data can be decomposed as</p>
\[\operatorname{loss} = \operatorname{bias}^2 + \operatorname{variance} + \operatorname{error}\]
<p>where the error term is due to random fluctuations of our data with respect to the true function which determines responses from features; in other words, we can think of it as noise over which we have no control. For the purpose of this entry, we will suppose that the error term can be taken to be zero.</p>
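<p>The decomposition can be checked numerically. The following is a minimal sketch, not taken from any of the papers discussed here: the true function $\sin(2\pi x)$, the polynomial models, the noise level and the test point <code>x0</code> are all assumptions made for illustration. Predictions are compared against the true function, so the irreducible error term drops out and the mean squared error equals $\operatorname{bias}^2 + \operatorname{variance}$.</p>

```python
import numpy as np

def bias_variance_at_point(degree, x0=0.3, n_train=30, n_repeats=500,
                           noise_sd=0.3, seed=0):
    """Estimate bias^2, variance and MSE of a polynomial fit at the point x0."""
    rng = np.random.default_rng(seed)

    def true_f(x):
        # Hypothetical true function, chosen only for illustration.
        return np.sin(2 * np.pi * x)

    preds = np.empty(n_repeats)
    for r in range(n_repeats):
        # Draw a fresh training set and refit the model each time.
        x = rng.uniform(0.0, 1.0, n_train)
        y = true_f(x) + rng.normal(0.0, noise_sd, n_train)
        coeffs = np.polyfit(x, y, degree)
        preds[r] = np.polyval(coeffs, x0)

    bias2 = (preds.mean() - true_f(x0)) ** 2
    variance = preds.var()
    # Error measured against the true function, so there is no noise term.
    mse = np.mean((preds - true_f(x0)) ** 2)
    return bias2, variance, mse

results = {degree: bias_variance_at_point(degree) for degree in (1, 3, 9)}
for degree, (bias2, variance, mse) in results.items():
    print(f"degree={degree}: bias^2={bias2:.4f}, variance={variance:.4f}, mse={mse:.4f}")
```

<p>In this sketch the degree plays the role of the complexity: low degrees should show a large bias term, while higher degrees trade it for variance.</p>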
<p>When we fit models to our data, we try to minimize the loss function on the training data, while keeping track of its value on the test data. The bias/variance trade-off means that when we look at the test loss as a function of complexity, we try to keep it minimal by balancing both the bias and the variance. This is normally explained with the conventional U-shaped plot of complexity vs loss:</p>
<p><img src="/files/ushaped.png" alt="U shape" /></p>
<p>We see that low complexity models do not generalize well, as they are not strong enough to learn the patterns in our data, while high complexity models do not generalize well either, as they learn patterns which are not really representative of the underlying relationship between features and responses. Thus, we look for the middle ground. There have been signs (which can be traced back to Leo Breiman, <em>Reflections after refereeing papers for NIPS</em>, 1995) that this paradigm must be re-thought and that there might be more to it. I am by no means giving an extensive literature review on this, but I will point out some articles which seem particularly interesting in this regard.</p>
<h3 id="understanding-deep-learning-requires-re-thinking-generalization">Understanding deep learning requires re-thinking generalization</h3>
<p>The article <a href="https://arxiv.org/pdf/1611.03530.pdf">Understanding deep learning requires re-thinking generalization</a> by Zhang et al suggested that we should revise our current notions of <em>generalization to unseen data</em>, as they can become meaningless in the examples they exhibit. More concretely, they look at the <strong><a href="https://en.wikipedia.org/wiki/Rademacher_complexity">(empirical) Rademacher complexity</a></strong>, and examine if it is a viable candidate measure to explain the incredible generalization capabilities of the state-of-the-art neural networks.</p>
<p>Suppose that our data has points $\{ (x_k,y_k) : k = 1\dots n\}$ and we are trying to fit functions from a class $\mathcal{H}$, that is, we are trying to find $h\in\mathcal{H}$ such that $h(x_k)$ is as close to $y_k$ as possible, while keeping the error on new samples low. In order to see how flexible the class $\mathcal{H}$ is for learning, consider independent random variables $\sigma_1,\dots,\sigma_n$ taking the values $1$ and $-1$ with probability $1/2$ each (Rademacher variables), which we think of as a realization of labels for our points $\{x_1,\dots,x_n \}$. A fixed function $h\in\mathcal{H}$ with values in $\{-1,1\}$ performs well if it outputs the correct label for most points, and since the labels are $1$ and $-1$ with equal probability, a good performance would mean that the average</p>
\[\dfrac{1}{n}\sum_{k=1}^n \sigma_k h(x_k)\]
<p>is close to one, as a correct prediction adds one to the sum (both $\sigma_k$ and $h(x_k)$ have the same sign in this case), while an incorrect prediction subtracts one from the sum (the factors have opposite signs). We can find the best achievable value in the class by simply taking the supremum over $h$:</p>
\[r_n(\mathcal{H},\sigma) = \sup_{h\in\mathcal{H}}\dfrac{1}{n}\sum_{k=1}^n \sigma_k h(x_k).\]
<p>The above number quantifies how well the class $\mathcal{H}$ is able to learn a particular realization of labels $\sigma_1,\dots,\sigma_n$. To measure the overall capacity of the class $\mathcal{H}$, we look at the <em>average</em> of $r_n(\mathcal{H},\sigma)$ over all realizations, so we can measure how well the class $\mathcal{H}$ is able to perform over all balanced problems:</p>
\[\mathfrak{R}_n(\mathcal{H}) = \mathbb{E}_\sigma\left[ \sup_{h\in\mathcal{H}}\dfrac{1}{n}\sum_{k=1}^n \sigma_k h(x_k) \right].\]
<p>We call this number the <strong>Rademacher complexity</strong> of $\mathcal{H}$. Zhang et al put this measure to a test: they found that multiple modern neural network architectures are able to easily fit randomized labels. In other words, they took data $\{(x_k,y_k), k = 1,\dots, n \}$, applied permutations $\phi\in S_n$ to the labels, and then trained neural networks on the new data $\{(x_k,y_{\phi(k)}), k = 1,\dots, n \}$; the networks were capable of achieving near-zero training error. This means that the Rademacher complexity of these classes is close to its maximal value $1$, so any bound based on it is trivial. These were the same networks that yield state-of-the-art results, so their conclusion is that Rademacher complexity is not a measure capable of explaining the great generalization potential of these networks. In their words:</p>
<p><em>This situation poses a conceptual challenge to statistical learning theory as traditional measures of model complexity struggle to explain the generalization ability of large artificial neural networks.</em></p>
<p>They also looked at different measures, such as VC dimension and fat-shattering dimension, for which they reach the same conclusions. We therefore need to re-think how to measure the complexity of our models.</p>
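<p>For a finite class of $\{-1,1\}$-valued functions, the Rademacher complexity defined above can be estimated by Monte Carlo. The following is a small sketch under assumptions of our own: the helper <code>empirical_rademacher</code>, the class of threshold functions ("stumps") and the class of all labelings are illustrative choices and do not come from the paper. A class able to fit every labeling has complexity exactly $1$, which is the situation that fitting random labels points to.</p>

```python
import itertools
import numpy as np

def empirical_rademacher(predictions, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity.

    predictions: array of shape (num_hypotheses, n) with entries in {-1, +1};
    row j holds (h_j(x_1), ..., h_j(x_n)) for a finite class of hypotheses.
    """
    rng = np.random.default_rng(seed)
    predictions = np.asarray(predictions, dtype=float)
    n = predictions.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
    # correlations[d, j] = (1/n) * sum_k sigma[d, k] * h_j(x_k)
    correlations = sigma @ predictions.T / n
    # Supremum over the class for each draw, then average over the draws.
    return correlations.max(axis=1).mean()

n = 8
x = np.linspace(0.0, 1.0, n)

# A small class: threshold functions h_t(x) = sign(x - t) and their negatives.
thresholds = np.linspace(-0.1, 1.1, 25)
stumps = np.array([np.where(x > t, 1.0, -1.0) for t in thresholds])
stumps = np.vstack([stumps, -stumps])

# The largest possible class: every labeling of the n points.
all_labelings = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))

r_stumps = empirical_rademacher(stumps)
r_all = empirical_rademacher(all_labelings)
print(f"threshold functions: {r_stumps:.3f}")
print(f"all labelings:       {r_all:.3f}")
```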
<h3 id="to-understand-deep-learning-we-need-to-understand-kernel-learning">To understand deep learning we need to understand kernel learning</h3>
<p>The follow-up article <a href="https://arxiv.org/pdf/1802.01396.pdf">To understand deep learning we need to understand kernel learning</a> by Belkin et al dives deeper into the phenomenon studied in the previous article, and provides evidence that overfitted models with good generalization are not a characteristic of <em>deep</em> neural networks only; it is, in fact, something we can also see in more <em>shallow</em> models. We will not go into further details of this paper, as we just want to remark that the phenomenon is wider than deep learning.</p>
<h3 id="reconciling-modern-machine-learning-practice-and-the-bias-variance-trade-off">Reconciling modern machine learning practice and the bias-variance trade-off</h3>
<p>In <a href="https://arxiv.org/pdf/1812.11118.pdf">Reconciling modern machine learning practice and the bias-variance trade-off</a> by Belkin et al, we see again strong signs that the classical bias/variance trade-off picture is incomplete. In their paper, the authors show evidence that (shallow) neural networks and decision trees (and their respective ensembles) exhibit a behavior different from the classical U-shaped curve for the test error as a function of the complexity of the model (number of parameters). It is observed that there is a threshold point (called the <strong>interpolation threshold</strong>) at which the training error becomes zero (that is, the model has effectively memorized the data) and the test error is high. Typically this point is presented (if at all) at the rightmost end of the test error against complexity graph. In the paper, the authors show that <em>beyond</em> the interpolation threshold, the test error starts <strong>decreasing</strong> again, while the training error stays minimal.</p>
<p><img src="/files/loss.png" alt="Loss" /></p>
<p>We can see that after the interpolation threshold, the test error decreases again. The x axis represents the number of features $\times 10^3$.</p>
<p><img src="/files/uextended.png" alt="U shape extended" /></p>
<p>The U shape is only the first part of a larger picture.</p>
<p>The authors explain this new regime in terms of regularization: when we reach the interpolation threshold, the solutions are not unique. In order to choose the solution which is optimal in some sense, we can pick the one which gives minimal $\ell^2$ norm to the vector of parameters defining the model. In the case of a shallow neural network, this corresponds to minimizing the norm in the reproducing kernel Hilbert space corresponding to the Gaussian kernel (i.e., an infinite dimensional model). As we expand the class of functions over which we look for our solution, this gives more room to find a candidate with smaller $\ell^2$ norm for the coefficients, acting as a regularization term for the solutions interpolating the data. This mechanism turns out to be an inductive bias, meaning that we strengthen the bias of our already overfitting models.</p>
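<p>The norm argument can be illustrated with a small experiment. This is only a sketch under assumptions of our own (synthetic data, random Fourier features $\cos(x^T w_j + b_j)$, and nested classes obtained by keeping the first $p$ features); it is not the setup of the paper. Past the interpolation threshold the training data is fitted exactly, and since each class contains the previous one, the norm of the minimum norm interpolating solution cannot increase as $p$ grows.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p_max = 20, 5, 320

# Synthetic data, assumed only for illustration.
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

# Hypothetical random Fourier features; the first p columns give nested classes.
W = rng.normal(size=(d, p_max))
b = rng.uniform(0.0, 2.0 * np.pi, size=p_max)
Phi = np.cos(X @ W + b)

feature_counts = (40, 80, 160, 320)
train_errors, norms = [], []
for p in feature_counts:
    # For an underdetermined system, lstsq returns the minimum norm solution.
    theta, *_ = np.linalg.lstsq(Phi[:, :p], y, rcond=None)
    train_errors.append(np.max(np.abs(Phi[:, :p] @ theta - y)))
    norms.append(np.linalg.norm(theta))
    print(f"p={p:4d}  max train error={train_errors[-1]:.2e}  norm={norms[-1]:.4f}")
```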
<h3 id="benign-overfitting-in-linear-regression">Benign Overfitting in Linear Regression</h3>
<p>The previous paper showed strong evidence for a double descent behavior in the test error against complexity curve: after the interpolation threshold, the test error goes down again. In <em>Benign Overfitting in Linear Regression</em>, Bartlett et al characterize the phenomenon in the setting of linear regression. More precisely, for i.i.d. points $(x_1,y_1),\dots,(x_n,y_n),(x,y)$, we want to find $\theta^*$ such that $\mathbb{E}(x^T\theta^* - y)^2 = \min_{\theta}\mathbb{E}(x^T\theta - y)^2$. For any $\theta$, the <strong>excess risk</strong> is</p>
\[R(\theta) = \mathbb{E}\left[ (x^T\theta - y)^2 - (x^T\theta^* -y)^2 \right],\]
<p>which measures the average quadratic error of our estimations using $\theta$ with respect to the optimal estimations. The article gives upper and lower bounds for the excess risk of the minimum norm estimator $\hat\theta$, defined by</p>
\[\min_\theta \| \theta \|^2 \quad \text{ such that } \quad \|X\theta - Y\| = \min_\beta\|X\beta - Y\|,\]
<p>where $X$ is the matrix containing all the realizations $x_1,\dots,x_n$ of $x$ as rows and $Y$ is similarly defined for $y$. The bounds in the paper depend on the notion of <strong>effective ranks</strong> of the covariance matrix of $x$, $\Sigma = \mathbb{E}[xx^T]$, defined by</p>
\[r_{k}(\Sigma)=\frac{\sum_{i>k} \lambda_{i}}{\lambda_{k+1}} \quad , \quad R_{k}(\Sigma)=\frac{\left(\sum_{i>k} \lambda_{i}\right)^{2}}{\sum_{i>k} \lambda_{i}^{2}}\]
<p>where $\lambda_i$ are the eigenvalues of $\Sigma$ in decreasing order.</p>
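<p>Both effective ranks are straightforward to compute from the spectrum of $\Sigma$. The helper below, <code>effective_ranks</code>, is a hypothetical name introduced for illustration; it simply evaluates the two formulas above for a given $k$.</p>

```python
import numpy as np

def effective_ranks(eigenvalues, k):
    """Return (r_k, R_k) for a covariance spectrum, following the formulas above."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # decreasing order
    tail = lam[k:]  # lambda_{k+1}, lambda_{k+2}, ... in 1-indexed notation
    r_k = tail.sum() / tail[0]
    R_k = tail.sum() ** 2 / np.sum(tail ** 2)
    return r_k, R_k

# For the identity covariance in dimension 10, both ranks at k = 0 equal 10.
print(effective_ranks(np.ones(10), k=0))

# A spectrum with one dominant direction: r_1 = 4/2 and R_1 = 16/6.
print(effective_ranks([4.0, 2.0, 1.0, 1.0], k=1))
```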
<h3 id="final-words">Final words</h3>
<p>We have seen throughout this article that the notions of complexity need to be re-thought, as well as the traditional idea that the bias/variance trade-off is a static picture. There is evidence, both theoretical and practical, from experiments as well as from the daily practice of many ML practitioners, that overfitting is not necessarily a bad thing, as long as we keep control of the test error and observe that it decays after the interpolation threshold. It is worth pointing out that for this to happen, it is necessary to have a number of parameters much higher than the number of points in the dataset (taking into account the number of dimensions of each point), which can become computationally expensive. We may see practices like this more often in the future, and maybe one day the books will have to be changed and the bias/variance trade-off will be a thing of the past. In pragmatic terms, as long as it works, we can keep doing it!</p>
<p>Images from:</p>
<p>1: The Elements of Statistical Learning,</p>
<p>2, 3: Reconciling modern machine learning practice and the bias-variance trade-off.</p>
<p>Thanks to Juan Pablo Vigneaux and Bertrand Nortier for pointing my interest to some of the papers discussed here.</p>
<h2>PCA and supervised learning</h2>
<p>Felipe Pérez. Published 2020-02-28 at https://felperez.github.io/posts/2020/02/blog-post-30</p>
<p>The situation is this: you have been given data, with several variables $x_1,\dots,x_d$ and a response $y$ that we want to predict using such variables. You perform some basic statistical analysis on your variables, see their averages, ranges, distribution. Then you look at the correlation between these variables, and find that there is some strong correlation between some of them. You decide to perform principal components analysis (PCA) to reduce the dimension of your features to $w_1,\dots,w_m$, with $m < d$. Now you fit your model, and you find that it gives terrible results, even though your PCA variables are capable of explaining most of the variance of the features. What went wrong?</p>
<p>There is a fundamental problem with what we did: PCA is blind to the dependence of the response with respect to the features. In what follows we will explore this situation with a concrete example in low dimension and very simple functions. In higher dimensions the situation can get much worse.</p>
<h3 id="how-it-can-go-wrong">How it can go wrong</h3>
<p>Suppose that we have data of the form $(x_1,x_2,y)$ where $x_i$ are the features and $y$ is the response. Suppose that $X_1 \sim \mathcal{N}(0,1)$ and that $X_2 = X_1 + \mathcal{N}(0,0.4)$, that is, $X_2$ is equal to $X_1$ plus some small noise (in the code below, $0.4$ is the standard deviation of the noise). Essentially, $X_1$ and $X_2$ are the same feature, or in other words, they are highly correlated. Let us generate some data following such distributions:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
np.random.seed(0)
x1 = np.random.normal(0,1,100)
x2 = x1 + np.random.normal(0,0.4,100)
</code></pre></div></div>
<p>We can quantify the correlation of these two variables</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>np.corrcoef(x1,x2)
</code></pre></div></div>
<p>from which we obtain a correlation of approximately $0.93$, which is pretty high. Suppose now that our response $y$ is of the form $y = (x_1-x_2) + \mathcal{N}(0,0.2)$, that is, it depends linearly on $(x_1-x_2)$ plus small noise.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>y = x1 - x2 + np.random.normal(0,0.2,100)
</code></pre></div></div>
<p>We can plot the triplets $(x_1,x_2,y)$ and see clearly the dependence on $(x_1-x_2)$:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import matplotlib.pyplot as plt
ax = plt.axes(projection='3d')
ax.view_init(elev=40, azim=20)
ax.scatter3D(x1, x2, y)
plt.title("Scatterplot of our data")
plt.show()
</code></pre></div></div>
<p><img src="/files/corrscatter.png" alt="3d scatterplot" /></p>
<p>In this setting, a simple linear regression would be able to capture the trend and give good results, as the plot of $(x_1-x_2)$ against $y$ suggests:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>plt.scatter(x1-x2,y)
plt.title("x1-x2 against y")
plt.show()
</code></pre></div></div>
<p><img src="/files/difference.png" alt="Difference" /></p>
<p>Suppose that we decide to run PCA on our data to reduce the dimensionality. Given that the data is approximately contained in a linear subspace (manifold), it makes sense that we could be able to fit a model using a single feature instead of two. Let us see what happens:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.decomposition import PCA
A = np.hstack([x1.reshape(-1,1),x2.reshape(-1,1)])
pca = PCA(n_components=1)
w = pca.fit_transform(A)
</code></pre></div></div>
<p>Here $w$ is our new feature. We can quantify how much of the variance of our original features is explained by $w$:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pca.explained_variance_ratio_
</code></pre></div></div>
<p>We obtain that approximately $96\%$ is explained by $w$, so it <em>seems</em> like a good feature. Suppose now that we try to find a relation between $w$ and $y$. Since the pairs (feature, response) are two dimensional, a simple plot should reveal any relationship between these two variables. Let us see:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>plt.scatter(w,y)
plt.title("w against y")
plt.show()
</code></pre></div></div>
<p><img src="/files/pcafeature.png" alt="PCA feature" /></p>
<p>Surprise, no obvious trend! What happened? In order to understand this, we need to think about what PCA is really doing.</p>
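<p>The absence of a trend can also be quantified. The sketch below is an illustration under assumptions of our own: it regenerates data with the same distributions but a larger sample and a separate random generator (so the numbers will not match the plots above), computes the first principal component with a plain SVD instead of scikit-learn, and uses a hypothetical helper <code>r_squared</code> to compare a linear fit on both original features with a linear fit on the PCA feature alone.</p>

```python
import numpy as np

def r_squared(features, y):
    """R^2 of an ordinary least squares fit with intercept."""
    design = np.column_stack([np.ones(len(y)), features])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    return 1.0 - residual.var() / y.var()

rng = np.random.default_rng(0)
n = 1000  # larger than in the text, to make the estimates stable
x1 = rng.normal(0.0, 1.0, n)
x2 = x1 + rng.normal(0.0, 0.4, n)
y = x1 - x2 + rng.normal(0.0, 0.2, n)

A = np.column_stack([x1, x2])
A_centered = A - A.mean(axis=0)
# The first right singular vector of the centered data is the first principal direction.
_, _, vt = np.linalg.svd(A_centered, full_matrices=False)
w = A_centered @ vt[0]

r2_full = r_squared(A, y)
r2_pca = r_squared(w, y)
print(f"R^2 using x1 and x2: {r2_full:.3f}")
print(f"R^2 using w only:    {r2_pca:.3f}")
```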
<h3 id="a-bit-about-pca">A bit about PCA</h3>
<p>Let us look in a bit more detail at what we are doing with PCA: for the features $x_1,x_2$, we construct the covariance matrix $C$ with entries $C_{i,j} = \operatorname{cov}(x_i,x_j)$, then look at its largest eigenvalue $\lambda_1$ and the associated eigenvector $u_1$ (which for convenience we take of norm equal to 1), and then project our data onto that vector, that is, we look at the points $w = (x_1,x_2)\cdot u_1$. We will not go into the details of why this gives the direction that maximizes the variance, but we will see what happens when we do this in our case. If the vectors $x_1$ and $x_2$ are highly correlated and have variance close to $1$, then the covariance matrix will be very close to the $2\times 2$ matrix with all entries equal to $1$. The eigenvalues of this matrix are $\lambda_1=2, \lambda_2=0$, with associated vectors $u_1 = \frac{1}{\sqrt{2}}(1,1)$ and $u_2 = \frac{1}{\sqrt2}(-1,1)$. Thus, we project our data onto the vector $u_1$, obtaining the new feature $w = (x_1,x_2)\cdot u_1 = \frac{1}{\sqrt{2}}(x_1 + x_2)$. We can check that the feature $w$ that we obtained using PCA is essentially $(x_1+x_2)$ by plotting $(x_1+x_2)$ against $y$ and observing the plot is virtually the same as the one we obtained when we plotted $w$ against $y$:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>plt.scatter(x1+x2,y)
plt.title('x1+x2 against y')
plt.show()
</code></pre></div></div>
<p><img src="/files/sumfeature.png" alt="PCA feature" /></p>
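<p>The eigenvector computation described above can also be checked directly. This is a sketch on freshly simulated data (same distributions, larger sample, separate random generator, all assumptions for illustration): the leading eigenvector of the sample covariance is almost parallel to $\frac{1}{\sqrt{2}}(1,1)$, and it is the discarded direction $u_2$, not $u_1$, that correlates with the response.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(0.0, 1.0, n)
x2 = x1 + rng.normal(0.0, 0.4, n)
y = x1 - x2 + rng.normal(0.0, 0.2, n)

C = np.cov(np.vstack([x1, x2]))                # 2x2 sample covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(C)  # eigenvalues in ascending order
u1 = eigenvectors[:, -1]  # direction of largest variance, kept by PCA
u2 = eigenvectors[:, 0]   # direction of smallest variance, discarded by PCA

A = np.column_stack([x1, x2])
w1 = A @ u1
w2 = A @ u2

# Absolute values are used because eigenvectors are defined up to sign.
alignment = abs(u1 @ np.array([1.0, 1.0]) / np.sqrt(2.0))
corr_w1 = abs(np.corrcoef(w1, y)[0, 1])
corr_w2 = abs(np.corrcoef(w2, y)[0, 1])
print(f"|cos angle between u1 and (1,1)/sqrt(2)|: {alignment:.4f}")
print(f"|corr(w1, y)| = {corr_w1:.3f}, |corr(w2, y)| = {corr_w2:.3f}")
```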
<p>In this sense, while PCA is capable of obtaining the feature that captures most of the variance, it is not aware of how this feature and the response are related. Our final comment: be careful when using PCA in supervised learning, and always perform evaluation before and after applying PCA so we are able to spot situations like the one shown here. Thanks to my friend Alexis Moraga for making me aware of this phenomenon.</p>
<h2>Martingales 0</h2>
<p>Felipe Pérez. Published 2020-02-20 at https://felperez.github.io/posts/2020/02/blog-post-29</p>
<p>I want to talk about martingales, but unfortunately in order to do that properly, we need to talk first about sigma-algebras and conditional expectations, subjects which can be a bit harsh at first. These concepts are essential, and while we could work with them just as formal objects with certain properties, it is fundamental to have a deeper understanding of them so we do not get lost in formalism and we are able to capture the intuition behind this theory.</p>
<h3 id="sigma-algebras-and-information">Sigma-algebras and information</h3>
<p>In this section, I will try to give <em>some</em> intuition for what the measure-theoretic conditional expectation is, as it is one of the central objects of probability theory, and in particular, martingale theory.</p>
<p>Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, and $X$ a random variable on $\Omega$. We can think of the sigma-algebra $\mathcal{F}$ as all the events that are somehow accessible for us to measure. For instance, the question <em>did the rv $X$ take a positive value?</em> can be represented with the event $\{ \omega\in \Omega : X(\omega) > 0\}$, which by definition of random variables, is an element of $\mathcal{F}$. In general most reasonable questions about what our random variable is doing can be represented by an element of $\mathcal{F}$. To each of these, we assign a probability using $\mathbb{P}$. We can then answer questions such as <em>what is the probability that $X$ takes a positive value?</em> by computing $\mathbb{P}\{ \omega\in \Omega : X(\omega) > 0\}$.</p>
<p>One caveat of the previous discussion is that normally we do not have all the information of $\mathcal{F}$ available to us; we only have a subset of it. In other words, we normally have a sub sigma-algebra $\mathcal{G}$ of $\mathcal{F}$. If we have no information at all, then $\mathcal{G} = \{ \Omega,\emptyset\}$. In order to gain more information about the events of our space, we can use random variables (or observables of our space). If $X$ is a random variable that we can measure (let us say for instance, the energy of our space), then we can use it to extract information by looking at events of the form $X^{-1}(A)$ for Borel sets $A\subset\mathbb{R}$. The set of all these events is a sigma-algebra which we denote $\sigma(X)$. This sigma-algebra contains essentially all the information that can be extracted using measurements of $X$.</p>
<p>Normally, we do not really have all the values of $X$ so making any kind of inference is not a trivial task. Let us say for instance that our space $\Omega$ corresponds to the box $\{ (x,y) : -1\leq x,y\leq 1 \}$ and $X$ is the temperature at each point of the box (we will not discuss the sigma-algebra in too much detail here). If we do not know the exact values of $X$ at every point but we still need to report some information about the temperature of the box, then we can look at the average of $X$ over the box, or</p>
\[\mathbb{E}(X) = \int_\Omega X d\mathbb{P}.\]
<p>While this number does not provide detailed information about the distribution of the temperature (as knowing $X$ would do), it is still a decent summary about what is going on in the box. If we take a random point $p$ of the box, then our best guess for the temperature at that point would be $\mathbb{E}(X)$ (of course there are multiple situations where this is not true, but let us keep it simple for now). Suppose now that we partition the box into two different sides $A = \{ (x,y) \in \Omega : x \geq 0\},B =A^c= \{ (x,y) \in \Omega : x < 0\}$, and measure the average temperature in each side, that is,</p>
\[\mathbb{E}_A(X) = \dfrac{1}{\mathbb{P}(A)}\int_{A} X d\mathbb{P} \quad , \quad \mathbb{E}_B(X) = \dfrac{1}{\mathbb{P}(B)}\int_{B} X d\mathbb{P} .\]
<p>If we are given a point $p$ of the box $\Omega$, what temperature would be the best guess for this new point? We can now give a more informed guess, by saying $\mathbb{E}_A(X)$ if $p\in A$ and $\mathbb{E}_B(X)$ if $p\in B$.</p>
<p>This improvement comes from the fact that we were able to make two measurements using $X$: the average temperature in $A$ and in $B$. In general, we gain information from performing such measurements. Technically, we now have the information of the sigma-algebra $\mathcal{G} = \{ \emptyset,A,B,\Omega\}\subset \mathcal{F}$. If we obtain the information of bigger sigma-algebras, our estimate for $X$ will become more precise.</p>
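<p>The gain from the two-sided measurement can be seen in a quick simulation. The temperature field below is a made-up example (an assumption for illustration only), and the uniform sample of points stands in for $\mathbb{P}$ on the box. The guess based on the two sides has a smaller mean squared error than the guess based on the global average.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(10_000, 2))  # uniform points of the box
# Hypothetical temperature field, warmer on the right-hand side of the box.
temperature = 20.0 + 5.0 * points[:, 0] + points[:, 1] ** 2

in_A = points[:, 0] >= 0.0             # A = {x >= 0}, B is its complement
mean_all = temperature.mean()          # estimate of E(X)
mean_A = temperature[in_A].mean()      # estimate of E_A(X)
mean_B = temperature[~in_A].mean()     # estimate of E_B(X)

guess_no_info = np.full_like(temperature, mean_all)
guess_two_sides = np.where(in_A, mean_A, mean_B)

mse_no_info = np.mean((temperature - guess_no_info) ** 2)
mse_two_sides = np.mean((temperature - guess_two_sides) ** 2)
print(f"E(X)={mean_all:.2f}, E_A(X)={mean_A:.2f}, E_B(X)={mean_B:.2f}")
print(f"MSE without information: {mse_no_info:.3f}")
print(f"MSE knowing the side:    {mse_two_sides:.3f}")
```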
<p>It is important to remark the following measurability issues: the fact that $X$ is a random variable means that $X$ is $\mathcal{F}$-measurable, that is, for any Borel set $E\subset \mathbb{R}$, the set $X^{-1}(E) = \{ \omega : X(\omega) \in E\}$ is an element of $\mathcal{F}$. If we only have at our disposition the information of the sigma-algebra $\mathcal{G}$, we will not be able to measure the events $X^{-1}(E)$ for arbitrary choices of $E$. In other words, $X$ is not necessarily $\mathcal{G}$-measurable.</p>
<h3 id="conditional-expectation">Conditional expectation</h3>
<p>The process we described in the previous section is a particular instance of a more general construction: the <strong>conditional expectation</strong>. Suppose that $\mathcal{G}$ is a sub sigma-algebra of $\mathcal{F}$ and that $X$ is a random variable with finite expectation. In the last paragraph of the previous section, we discussed the issue of measurability of our random variables. In practical terms, this means that we cannot answer questions such as <em>what is the probability that $X$ is between $4$ and $5$?</em> or other more general questions, using only the information of $\mathcal{G}$. What we can do to solve this problem, is to come up with a simplified version $X_\mathcal{G}$ of $X$ which is adapted to $\mathcal{G}$, in the sense that we can answer most questions about it using our knowledge of the sub sigma-algebra $\mathcal{G}$.</p>
<p>We say that a $\mathcal{G}$-measurable function $X_\mathcal{G}$ is a <strong>conditional expectation</strong> of $X$ with respect to the sub sigma-algebra $\mathcal{G}$ if</p>
\[\int_G X_\mathcal{G} d\mathbb{P} = \int_G X d\mathbb{P}\]
<p>for all $G\in\mathcal{G}$. We denote such function by $\mathbb{E}(X|\mathcal{G})$.</p>
<p>Note that the property of conditional expectations essentially says that $\mathbb{E}(X|\mathcal{G})$ integrates the same as $X$ wherever it can be integrated. For our example where $\mathcal{G} = \{ \emptyset,A,B,\Omega\}\subset \mathcal{F}$, note that the function</p>
\[X_\mathcal{G} = \dfrac{1}{\mathbb{P}(A)}\mathbb{E}(X 1_A)1_A+\dfrac{1}{\mathbb{P}(B)}\mathbb{E}(X 1_B)1_B\]
<p>is a conditional expectation of $X$ with respect to $\mathcal{G}$:</p>
<ul>
<li>$X_\mathcal{G}$ is clearly $\mathcal{G}$-measurable,</li>
<li>The integral condition can be easily checked for all elements of the sigma-algebra, for instance for $G=A$</li>
</ul>
\[\int_A X_\mathcal{G} d\mathbb{P} = \int_A \dfrac{1}{\mathbb{P}(A)}\mathbb{E}(X 1_A)1_A d\mathbb{P} = \int_A Xd\mathbb{P}.\]
<p>The existence of such functions $\mathbb{E}(X|\mathcal{G})$ can be proved using the Radon-Nikodym theorem, and uniqueness (up to almost-sure equality) follows from the definition.</p>
<p>We can think of $\mathbb{E}(X|\mathcal{G})$ as the best estimate of $X$ given the information that we have in $\mathcal{G}$.</p>
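<p>On a finite probability space the defining property can be verified by direct computation. The sketch below uses a six-point space with arbitrarily chosen probabilities and values of $X$ (all assumptions for illustration), builds $X_\mathcal{G}$ for $\mathcal{G} = \{ \emptyset,A,B,\Omega\}$ with the formula above, and compares both integrals on each of the four events.</p>

```python
import numpy as np

# Finite probability space Omega = {0, ..., 5} with hypothetical probabilities.
prob = np.array([0.1, 0.2, 0.1, 0.25, 0.15, 0.2])
X = np.array([3.0, -1.0, 4.0, 1.0, 5.0, 9.0])
A = np.array([True, True, True, False, False, False])
B = ~A

def integral(values, event):
    """Integral of a random variable over an event, i.e. E(values * 1_event)."""
    return np.sum(values[event] * prob[event])

# X_G = E(X 1_A)/P(A) * 1_A + E(X 1_B)/P(B) * 1_B
X_G = np.where(A, integral(X, A) / prob[A].sum(), integral(X, B) / prob[B].sum())

events = {
    "empty": np.zeros_like(A),
    "A": A,
    "B": B,
    "Omega": np.ones_like(A),
}
for name, G in events.items():
    print(f"{name:5s}  integral of X_G = {integral(X_G, G):.4f}  integral of X = {integral(X, G):.4f}")
```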
<h3 id="independence">Independence</h3>
<p>We will have now a little detour to talk about information and independence. Suppose that $X$ is a random variable, and we want to perform measurements using such $X$. Our original sigma-algebra $\mathcal{F}$ contains <strong>all</strong> the information to be ever available to us. Normally, to measure $X$ we do not really need all such information. The information that we do need, is by definition, the information that defines all the events related to $X$ (tautological statement!). As $X$ is a random variable, that means that all such information is contained in a sub sigma-algebra of $\mathcal{F}$, which we previously called $\sigma(X)$. Now, suppose that we have a sub sigma-algebra $\mathcal{H}$ of $\mathcal{F}$ that we believe has nothing to do with $X$, or in other words, that it shares no information with $X$. We say that $\mathcal{H}$ and $X$ are <strong>independent</strong> if</p>
\[\mathbb{P}(H\cap E) = \mathbb{P}(H)\mathbb{P}(E)\]
<p>for all $H\in\mathcal{H}$ and $E\in\sigma(X)$. Of course this definition has nothing special about the sigma-algebra $\sigma(X)$ and can be extended to arbitrary sigma-algebras. We can think of them as having independent information.</p>
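<p>The definition can be tested exhaustively on a small example. Here we assume a space made of two independent biased coin flips (the biases are arbitrary) and two hypothetical helpers, <code>generated_events</code> and <code>independent</code>. The sigma-algebras generated by the first and by the second coin satisfy the product rule for every pair of events, while the sigma-algebra of the first coin is, unsurprisingly, not independent of itself.</p>

```python
import itertools
import numpy as np

# Omega = {0,1} x {0,1}: two independent biased coin flips (hypothetical biases).
outcomes = list(itertools.product([0, 1], repeat=2))
p_first, p_second = 0.3, 0.6
prob = np.array([
    (p_first if a else 1 - p_first) * (p_second if b else 1 - p_second)
    for a, b in outcomes
])
first = np.array([a for a, _ in outcomes])    # random variable: first coin
second = np.array([b for _, b in outcomes])   # random variable: second coin

def generated_events(values):
    """All events of sigma(values) for a random variable with values in {0, 1}."""
    n = len(values)
    return [values == 0, values == 1, np.zeros(n, dtype=bool), np.ones(n, dtype=bool)]

def independent(events_1, events_2):
    """Check P(E and H) = P(E) P(H) for every pair of events."""
    return all(
        np.isclose(prob[E & H].sum(), prob[E].sum() * prob[H].sum())
        for E in events_1 for H in events_2
    )

print(independent(generated_events(first), generated_events(second)))
print(independent(generated_events(first), generated_events(first)))
```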
<h3 id="properties-of-mathbbecdotmathcalg">Properties of $\mathbb{E}(\cdot|\mathcal{G})$</h3>
<ol>
<li>$\mathbb{E}(\cdot|\mathcal{G})$ is linear: $\mathbb{E}(aX+Y|\mathcal{G}) = a\mathbb{E}(X|\mathcal{G}) +\mathbb{E}(Y|\mathcal{G})$;</li>
<li>If $X\geq 0$ then $\mathbb{E}(X|\mathcal{G})\geq 0$;</li>
<li>If $\mathcal{G}$ is independent of $X$, then $\mathbb{E}(X|\mathcal{G}) = \mathbb{E}(X)$;</li>
<li>If $X$ is $\mathcal{G}$-measurable, then $\mathbb{E}(X|\mathcal{G}) = X$;</li>
<li>$\mathbb{E}(\mathbb{E}(X|\mathcal{G})) = \mathbb{E}(X)$;</li>
<li>If $\mathcal{H}\subset\mathcal{G}\subset\mathcal{F}$ are sigma-algebras, then $\mathbb{E}(\mathbb{E}(X|\mathcal{G})|\mathcal{H}) = \mathbb{E}(X|\mathcal{H})$.</li>
</ol>
<p>We give a few words about each of these properties:</p>
<ol>
<li>self-explanatory,</li>
<li>self-explanatory,</li>
<li>This property is saying that if our sigma-algebra $\mathcal{G}$ does not know anything about $X$ then our best estimate for $X$ given our available information is the same as if we did not have any information, that is, $\mathbb{E}(X)$;</li>
<li>If $\mathcal{G}$ contains <strong>all</strong> the information that defines $X$, then our best estimate of $X$ given $\mathcal{G}$ is precisely $X$;</li>
<li>If we look at the average of our simplified estimate of $X$, then we obtain our no-information estimate $\mathbb{E}(X)$;</li>
<li>If we give an estimate of $X$ with respect to a sigma-algebra, and then estimate that resulting function with respect to a smaller sigma-algebra, we obtain what we would have obtained if we estimated with respect to the smaller sigma-algebra from the start.</li>
</ol>
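<p>Properties 4, 5 and 6 can be checked numerically on a finite space, where a sigma-algebra is generated by a partition and the conditional expectation is the weighted average over each block. The space, the probabilities and the helper <code>conditional_expectation</code> below are assumptions made for illustration.</p>

```python
import numpy as np

# Finite probability space Omega = {0, ..., 5} with hypothetical probabilities.
prob = np.array([0.1, 0.2, 0.1, 0.25, 0.15, 0.2])
X = np.array([3.0, -1.0, 4.0, 1.0, 5.0, 9.0])

def conditional_expectation(values, labels):
    """E(values | G), where G is generated by the partition {labels == c}."""
    result = np.empty_like(values)
    for c in np.unique(labels):
        block = labels == c
        result[block] = np.sum(values[block] * prob[block]) / prob[block].sum()
    return result

fine = np.array([0, 0, 1, 1, 2, 2])    # partition generating G
coarse = np.array([0, 0, 0, 0, 1, 1])  # partition generating H, contained in G

E_X_G = conditional_expectation(X, fine)
E_X_H = conditional_expectation(X, coarse)
tower = conditional_expectation(E_X_G, coarse)

print("E(X|G)        =", E_X_G)
print("E(X|H)        =", E_X_H)
print("E(E(X|G)|H)   =", tower)                  # property 6
print("E(E(X|G))     =", np.sum(E_X_G * prob))   # property 5
print("E(X)          =", np.sum(X * prob))
```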
<p>With this we have set up all the basic ingredients needed to start talking about martingales.</p>
<h2>Extreme value theory III</h2>
<p>Felipe Pérez. Published 2020-02-06 at https://felperez.github.io/posts/2020/02/blog-post-28</p>
<p>In previous entries (<a href="/posts/2019/12/blog-post-16/">here</a> and <a href="/posts/2020/01/blog-post-26/v">here</a>) we introduced and discussed the basic elements of Extreme Value Theory (EVT), such as the extreme value distributions and the generalized extreme value distribution, and we saw examples of such distributions, as well as simulated data and their corresponding fits. In this entry we get our hands on real data and see how we can make some inference using EVT. In particular, we focus on Maximum Likelihood methods for parameter estimation of a temperature dataset from my home city, <a href="https://en.wikipedia.org/wiki/Santiago">Santiago de Chile</a>.</p>
<p>If you like my content, consider following my <a href="https://www.linkedin.com/in/felperez/">linkedIn</a> page to stay updated.</p>
<h3 id="estimating-the-parameters-of-the-gev">Estimating the parameters of the GEV</h3>
<p>Recall the setting from our previous entries: consider an i.i.d. process $\{ X_n \}$ with distribution function $F$ and define $M_n:= \max\{X_1,\dots,X_n \}$. The extremal types theorem says that if there are constants $a_n,b_n$ and a non-degenerate distribution $G$ such that</p>
\[\dfrac{M_n - b_n}{a_n} \Rightarrow G,\]
<p>then $G$ is the generalized extreme value distribution, which has the form</p>
\[G(z) = \exp \left\{-\left[1+\xi\left(\frac{z-\mu}{\sigma}\right)\right]^{-1 / \xi}\right\}\]
<p>for parameters $\mu \in \mathbb{R}$, $\sigma > 0$, $\xi \in \mathbb{R}$ and $z \in \{z: 1+\xi(z-\mu) / \sigma>0\}$. We have seen that the possible types (arising from the choices $\xi > 0$, $\xi = 0$ and $\xi < 0$) present different qualitative properties, such as the bounds on their supports or how heavy their tails are. It is therefore important to be able to make inferences about the parameters $\mu,\sigma,\xi$ when we are presented with data.</p>
<p>Suppose that we are given observations $m_1,\dots,m_n$ following a GEV $G(\mu,\sigma,\xi)$. All the following equations hold under the assumption that $z$ is in the correct region. The density of the GEV is given by</p>
\[\dfrac{d}{dz}G(z) = \dfrac{1}{\sigma}\exp\left(-\left[1+\xi\left(\dfrac{z-\mu}{\sigma}\right)\right]^{-1/\xi} \right)\left( 1+\xi\left(\dfrac{z-\mu}{\sigma}\right)\right)^{-1/\xi -1},\]
<p>from which it follows that the likelihood of this set of observations is</p>
\[L(m_1,\dots,m_n| \mu,\sigma,\xi) = \prod_{k=1}^n \dfrac{1}{\sigma}\exp\left(-\left[1+\xi\left(\dfrac{m_k-\mu}{\sigma}\right)\right]^{-1/\xi} \right)\left( 1+\xi\left(\dfrac{m_k-\mu}{\sigma}\right)\right)^{-1/\xi -1}.\]
<p>This is the likelihood for the $\xi \neq 0$ case. Taking the logarithm of both sides, we obtain the log-likelihood</p>
\[\ell(m_1,\dots,m_n| \mu,\sigma,\xi) = -n\log\sigma - (1+1/\xi)\sum_{k=1}^n \log\left[ 1 + \xi\left(\dfrac{m_k-\mu}{\sigma}\right) \right] - \sum_{k=1}^n\left[1 + \xi\left(\dfrac{m_k - \mu}{\sigma} \right) \right]^{-1/\xi}.\]
<p>Similarly, for the $\xi = 0$ case we obtain the log-likelihood</p>
\[\ell(m_1,\dots,m_n| \mu,\sigma,\xi = 0) = -n\log\sigma - \sum_{k=1}^n \left( \dfrac{m_k - \mu}{\sigma} \right) - \sum_{k=1}^n \exp\left[ -\left(\dfrac{m_k-\mu}{\sigma} \right)\right] .\]
<p>We then find the parameters that maximize $\ell$, normally by means of numerical optimization routines such as those in <code class="language-plaintext highlighter-rouge">scipy.optimize</code>, or with functions that find the optimal parameters for us directly, such as the <code class="language-plaintext highlighter-rouge">fit</code> method of the distributions in <code class="language-plaintext highlighter-rouge">scipy.stats</code>.</p>
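<p>As a rough sketch of the first route, the negative of the log-likelihood above can be handed to a general-purpose optimizer. The function name <code class="language-plaintext highlighter-rouge">gev_neg_log_likelihood</code>, the synthetic Gumbel sample and the starting point are assumptions made purely for illustration; they are not part of the analysis of the temperature data.</p>

```python
import numpy as np
from scipy.optimize import minimize


def gev_neg_log_likelihood(params, m):
    """Negative GEV log-likelihood in the (mu, sigma, xi) parametrization of this entry."""
    mu, sigma, xi = params
    if sigma <= 0:
        return np.inf  # outside the parameter space
    z = (m - mu) / sigma
    if abs(xi) < 1e-8:
        # Gumbel case (xi = 0)
        return len(m) * np.log(sigma) + np.sum(z) + np.sum(np.exp(-z))
    t = 1.0 + xi * z
    if np.any(t <= 0):
        return np.inf  # some observation falls outside the support
    return (len(m) * np.log(sigma)
            + (1.0 + 1.0 / xi) * np.sum(np.log(t))
            + np.sum(t ** (-1.0 / xi)))


# Hypothetical data: a Gumbel sample, i.e. a GEV with xi = 0
rng = np.random.default_rng(0)
sample = rng.gumbel(loc=30.0, scale=2.0, size=500)

# Crude starting point based on the sample moments
start = np.array([np.mean(sample), np.std(sample), 0.1])
result = minimize(gev_neg_log_likelihood, start, args=(sample,), method="Nelder-Mead")
mu_hat, sigma_hat, xi_hat = result.x
print(mu_hat, sigma_hat, xi_hat)
```

<p>Returning infinity outside the admissible region is a simple way to keep a derivative-free optimizer inside the support; other choices of optimizer or starting point may behave better on real data.</p>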
<h3 id="hands-on-data">Hands-on data</h3>
<p>Now we will apply all we have seen to a data set consisting of temperature measurements at the Quinta Normal weather station in Santiago de Chile. The data was obtained <a href="https://climatologia.meteochile.gob.cl/application/historicos/datosDescarga/330020">here</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd
df = pd.read_csv("../input/330020_XXXX_Temperatura_.csv", sep = ';' ,parse_dates = True)
</code></pre></div></div>
<p>The data set consists of a table with three columns: <code class="language-plaintext highlighter-rouge">CodigoNacional</code>, corresponding to the station code, which has the same value for all rows; <code class="language-plaintext highlighter-rouge">momento</code>, corresponding to the time and date of the measurement; and <code class="language-plaintext highlighter-rouge">Ts_valor</code>, corresponding to the temperature measurement in Celsius. We will disregard the first column, as it carries no valuable information when we are dealing with only one station, and rename the other columns to <code class="language-plaintext highlighter-rouge">date</code> and <code class="language-plaintext highlighter-rouge">temp</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>df = df.drop(columns = 'CodigoNacional')
df.columns = ['date','temp']
</code></pre></div></div>
<p>There are 140091 observations, ranging from 01-03-1967 to 30-11-2019. There are 122 missing values in the <code class="language-plaintext highlighter-rouge">temp</code> column, which we will replace with the mean value of the column.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
# Count the missing values per column, then impute the column mean
np.sum(df.isnull(), axis = 0)
df['temp'] = df['temp'].fillna(df['temp'].mean())
</code></pre></div></div>
<p>We can now plot the time series to have a look at it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import matplotlib.pyplot as plt
plt.plot(df['temp'])
plt.title("Time series of temperatures")
plt.show()
</code></pre></div></div>
<p><img src="/files/meteo.png" alt="Time series" /></p>
<p>We can see that the series looks more compressed towards its first part. This might be due to changes in the sampling frequency. For our purposes, this will not be a problem, as we will not need all the samples of the time series.</p>
<p>We will now make the column <code class="language-plaintext highlighter-rouge">date</code> the index of the dataframe, which makes it easier to handle.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>df = df.set_index('date')
df.index = pd.to_datetime(df.index)
df = df.sort_index()
</code></pre></div></div>
<p>With this we can make some quick checks:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>days_freq = []
# dayofweek runs from 0 (Monday) to 6 (Sunday)
for i in range(7):
    days_freq.append(np.sum(df.index.dayofweek == i))
    print(days_freq[i])
plt.bar(range(7),days_freq)
plt.title("Frequency of week days in our data set")
plt.show()
</code></pre></div></div>
<p><img src="/files/days_freq.png" alt="Days frequency" /></p>
<p>Now we create a new time series consisting of the maximum temperature observed each day:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>df_max = df.groupby([df.index.date]).max()
plt.plot(df_max['temp'])
plt.title("Time series of maxima")
plt.show()
</code></pre></div></div>
<p><img src="/files/meteo_max.png" alt="Time series of maxima" /></p>
<p>We can see that it does not look compressed anymore. We can also see that there is a section missing at the start of the time series, so we will consider observations starting from 1970 onwards. We will also only consider maxima corresponding to the summer season, that is, between 22nd of December and 21st of March (we omit Dec 21st for rounding reasons but this is not relevant).</p>
<p>We will consider blocks of length equal to ten days, and for each of them, we will consider its respective maximum. Thus, in a 90-day summer, we will obtain 9 different observations. Now, some important observations:</p>
<ul>
<li>We will assume that the observations are independent.</li>
</ul>
<p>While there are tests to check if this is actually plausible, we will not focus on it for now.</p>
<ul>
<li>We will also assume that the data is stationary.</li>
</ul>
<p>Again, there are corresponding tests for this, but we will assume this is true, fit the model and then evaluate it. We construct now the block maxima:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>block_max = []
for i in range(1970,2018):
    # Summer runs from Dec 22nd of year i to Mar 21st of year i+1
    startdate = pd.to_datetime(str(i) +"-12-22").date()
    enddate = pd.to_datetime(str(i+1)+"-3-21").date()
    # Slice the daily maxima (df_max), so that ten rows correspond to ten days
    summer = df_max.loc[startdate:enddate]
    for j in range(9):
        block_max.append(summer.iloc[j*10:(j+1)*10].max(axis = 0).values[0])
block_max = np.array(block_max)
</code></pre></div></div>
<p>Let us see the corresponding time series of the block maxima:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>plt.plot(block_max)
plt.title("Time series of block maxima")
plt.show()
</code></pre></div></div>
<p><img src="/files/block_max.png" alt="Time series of block maxima" /></p>
<p>More interesting than the time series itself is its histogram, since it shows how the block maxima are distributed:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>plt.hist(block_max)
plt.title("Histogram of block maxima")
plt.show()
</code></pre></div></div>
<p><img src="/files/block_max_hist.png" alt="Histogram of block maxima" /></p>
<p>Looks quite familiar, right? Let us see if we can fit a GEV distribution to it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from scipy.stats import genextreme as gev
shape, loc, scale = gev.fit(block_max)
x = np.linspace(20, 50, num=100)
y = gev.pdf(x, shape, loc, scale)
plt.hist(block_max, density = True , alpha=.3 ,bins = 10)
plt.plot(x, y, 'r-')
plt.title("Histogram with fitted GEV")
plt.show()
</code></pre></div></div>
<p><img src="/files/block_max_hist_gev.png" alt="Histogram of block maxima with GEV" /></p>
<p><strong>Remark:</strong> while in the first section we introduced maximum likelihood (ML) methods to estimate the parameters of the GEV, there are other methods for doing so that might yield better/faster results. I will not go into the details of how scipy is doing this, but one can always check the documentation/source code if interested.</p>
<p>We can see how we obtained a pretty good fit. We can see what the parameters of the fit are:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>print(shape,loc,scale)
> 0.255993226608202 30.786910300677764 1.7675333367793673
</code></pre></div></div>
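<p>In particular, we can see that the shape parameter reported by scipy is positive. This value has to be read with care, because of the parametrization that scipy uses.</p>
<p><strong>Word of caution:</strong> the form of the density/CDF of the GEV distribution that scipy uses is slightly different to the one we are using: its shape parameter <code class="language-plaintext highlighter-rouge">c</code> corresponds to $-\xi$ in our notation. A positive value of <code class="language-plaintext highlighter-rouge">c</code> therefore means $\xi < 0$, that is, a type 3 or Weibull distribution, whose support is bounded above, rather than a type 2 or Frechet distribution.</p>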
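<p>The difference in parametrization can be checked numerically. The sketch below plugs the parameters printed above into the formula for $G$ from the first section with $\xi = -c$, where $c$ is scipy's shape parameter, and compares the result with <code class="language-plaintext highlighter-rouge">genextreme.cdf</code>. The evaluation grid and the variable names are arbitrary choices made for illustration.</p>

```python
import numpy as np
from scipy.stats import genextreme

# Parameters printed by the fit above (scipy's convention)
shape, loc, scale = 0.255993226608202, 30.786910300677764, 1.7675333367793673

# In the notation of this entry, the shape parameter is xi = -c
xi = -shape

# Grid of temperatures inside the support of the fitted distribution
z = np.linspace(25.0, 37.0, 50)
cdf_article = np.exp(-(1.0 + xi * (z - loc) / scale) ** (-1.0 / xi))
cdf_scipy = genextreme.cdf(z, shape, loc=loc, scale=scale)
print(np.max(np.abs(cdf_article - cdf_scipy)))

# For xi < 0 the support is bounded above by mu - sigma / xi
upper_endpoint = loc - scale / xi
print(upper_endpoint)
```

<p>Under this reading, the fitted model assigns probability zero to block maxima above the printed upper endpoint, which is the qualitative behavior of the $\xi < 0$ case.</p>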
<p>In order to have a better idea of how good the fit is, we can use both <a href="https://en.wikipedia.org/wiki/P–P_plot">P-P</a> and <a href="https://en.wikipedia.org/wiki/Q–Q_plot">Q-Q</a> plots. We recall their construction for the sake of completeness:</p>
<p>For a P-P plot, suppose our observations $M_1,\dots,M_n$ with estimated distribution $\hat G$ are ordered increasingly, $M_{(1)}\leq \dots \leq M_{(n)}$. The P-P plot consists of the points</p>
\[\left( \hat G(M_{(k)}) , \dfrac{k}{n+1} \right)\]
<p>for $k = 1,\dots,n$. The idea is the following: for any such $k$, there are exactly $k$ observations less than or equal to $M_{(k)}$, so the empirical estimate of the probability $\mathbb{P}(M\leq M_{(k)}) = G(M_{(k)})$ is given by $\frac{k}{n}$. In practice, we will use $\frac{k}{n+1}$, as we do not necessarily want the support of our distribution to be bounded (i.e., $\mathbb{P}(M\leq M_{(n)}) = 1$). Thus if our model is reasonable, the points $\left( \hat G(M_{(k)}) , \frac{k}{n+1} \right)$ should lie close to the diagonal.</p>
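<p>The coordinates of the P-P plot can also be computed by hand, which makes the construction explicit. The following is a minimal sketch on a synthetic GEV sample (the sample, its parameters and the variable names are assumptions for illustration, not the temperature data):</p>

```python
import numpy as np
from scipy.stats import genextreme

# Hypothetical block maxima: a synthetic GEV sample
sample = genextreme.rvs(0.25, loc=30.0, scale=2.0, size=400, random_state=0)

# Estimated distribution G-hat
c_hat, loc_hat, scale_hat = genextreme.fit(sample)

# Order statistics M_(1) <= ... <= M_(n)
ordered = np.sort(sample)
n = len(ordered)

# P-P coordinates: (G-hat(M_(k)), k / (n + 1)) for k = 1, ..., n
model = genextreme.cdf(ordered, c_hat, loc=loc_hat, scale=scale_hat)
empirical = np.arange(1, n + 1) / (n + 1)

# Largest vertical distance of the points from the diagonal
max_gap = np.max(np.abs(model - empirical))
print(max_gap)
```

<p>Plotting <code class="language-plaintext highlighter-rouge">model</code> against <code class="language-plaintext highlighter-rouge">empirical</code> together with the diagonal gives the P-P plot described above.</p>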
<p>Let us check with our data:</p>
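<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import scipy.stats
import statsmodels.api as sm
x = np.array(block_max)
# Draw a synthetic sample from the fitted GEV to compare against the data
y = scipy.stats.genextreme(loc = loc, scale = scale ,c = shape).rvs(len(block_max))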
pp_x = sm.ProbPlot(x, fit=True)
pp_y = sm.ProbPlot(y, fit=True)
fig = pp_y.ppplot(line='45', other=pp_x,alpha = 0.3, marker='.')
plt.title('P-P plot of our empirical data vs fitted distribution')
plt.show()
</code></pre></div></div>
<p><img src="/files/pp_plot.png" alt="PP plot" /></p>
<p>As remarked in Coles’ book, it is worth pointing out one weakness of the P-P plot in the context of EVT: since we are plotting probabilities (empirical and predicted), at the right end of the plot the points will inevitably get closer together. This represents a difficulty when we are interested in the goodness of the fit around the extremes of the distribution.</p>
<p>The Q-Q plot is similar, only that we apply $\hat G^{-1}$ to both coordinates, hence we plot the set of points</p>
\[\left(M_{(k)}, \hat G^{-1}\left( \dfrac{k}{n+1} \right) \right)\]
<p>for $k = 1,\dots, n$ (remark: in some sources the plot is flipped with respect to $y = x$). While this might seem a trivial change to the P-P plot, it actually has important consequences: while we are plotting the very same information in both graphs, by applying $\hat G^{-1}$ to the P-P plot we are changing its scale. The Q-Q plot can also be interpreted as a plot of quantile against quantile for the empirical data and the fitted distribution respectively: if the model is reasonable, the numbers $M_{(k)}$ and $\hat G^{-1}\left( \frac{k}{n+1}\right)$, corresponding to the empirical and fitted quantiles, are in agreement and the points lie close to the diagonal (note that it is not trivial to determine which quantile these points correspond to). For our data, the Q-Q plot looks as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import statsmodels.api as sm
import scipy
sm.qqplot(block_max, scipy.stats.genextreme, fit = True ,distargs=(4,) , line='45' , alpha = 0.3, marker='.')
plt.title('Q-Q plot of our empirical data vs fitted distribution')
plt.show()
</code></pre></div></div>
<p><img src="/files/qq_plot.png" alt="QQ plot" /></p>
<p>We can see both from the P-P and the Q-Q plot that our model fits quite well, as there are no extreme deviations of the points from the diagonal of each plot.</p>
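<p>As with the P-P plot, the Q-Q coordinates can be computed directly from their definition, using the quantile function <code class="language-plaintext highlighter-rouge">ppf</code> of the fitted distribution as $\hat G^{-1}$. This is only a sketch on a synthetic sample; the data and the variable names are assumptions for illustration.</p>

```python
import numpy as np
from scipy.stats import genextreme

# Hypothetical block maxima: a synthetic GEV sample
sample = genextreme.rvs(0.25, loc=30.0, scale=2.0, size=400, random_state=1)
c_hat, loc_hat, scale_hat = genextreme.fit(sample)

# Empirical quantiles: the order statistics M_(k)
ordered = np.sort(sample)
n = len(ordered)

# Model quantiles: G-hat^{-1}(k / (n + 1)) for k = 1, ..., n
probs = np.arange(1, n + 1) / (n + 1)
model_quantiles = genextreme.ppf(probs, c_hat, loc=loc_hat, scale=scale_hat)

# If the fit is reasonable, the two sets of quantiles are almost perfectly correlated
corr = np.corrcoef(ordered, model_quantiles)[0, 1]
print(corr)
```

<p>Since both coordinates are now on the scale of the data, deviations in the upper tail are not squeezed together as they are in the P-P plot.</p>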
<h3 id="final-comments">Final comments</h3>
<p>In this entry we have seen how to estimate the parameters of a GEV from empirical data with the method of maximum likelihood. While this works in a wide range of situations, there are some problems with numerical stability that can be avoided using other methods. For now, we will trust that scipy is doing its best to find good candidates for our parameters. It is also worth mentioning that there are different methods to assess goodness of fit in this context, and particularly relevant are the <em>return level</em> plots. We will not go into the details of this tool in this entry, but possibly in the next ones. It is also important to remark that the <em>asymptotic models</em> that we have discussed in these three entries are not the only models in EVT and, on most occasions, they are not the most useful, as they do not make use of the whole time series but just the maxima over a fixed period of time. In future entries we will introduce <em>threshold methods</em> that do make use of such information and are able to yield stronger results. Finally, it is also worth mentioning that in future entries we will address the issue of <em>predicting</em> using EVT, although with what we have already seen it is more than possible to make some basic predictions.</p>
<p>One last thing to mention: this entry was inspired by the work of my ex-colleagues Meagan Carney from Max Planck Institute, and Matthew Nicol and Robert Azencott from the University of Houston.</p>Felipe Pérezfel.prz@gmail.comIn previous entries (here and here) we introduced and discussed the basic elements of Extreme Value Theory (EVT), such as the extreme value distributions and the generalized extreme value distribution, and saw examples of such distributions, as well as simulated data and their corresponding fits. In this entry we get our hands on real data and see how we can make some inference using EVT. In particular, we focus on Maximum Likelihood methods for parameter estimation on a temperature dataset from my home city, Santiago de Chile.Empirical error2020-01-25T00:00:00-08:002020-01-25T00:00:00-08:00https://felperez.github.io/posts/2020/01/blog-post-27<p>In a <a href="/posts/2019/12/blog-post-20/">previous entry</a> we studied the concepts of bias and variance in an additive context. In this entry we dive deeper into the analysis of the mean squared error and how to assess it using actual data.</p>
<p>If you like my content, consider following my <a href="https://www.linkedin.com/in/felperez/">linkedin</a> page to stay updated.</p>
<h3 id="mean-squared-error-revised">Mean squared error revised</h3>
<p>In the <a href="posts/2019/12/blog-post-20/">previous entry</a>, we introduced bias and variance as part of a decomposition of the <strong>mean squared error</strong>. Recall our model consisted of $d$-dimensional inputs $x^{(1)},\dots,x^{(m)}$ and real valued outputs/responses $y^{(1)},\dots,y^{(m)}$. We also assume there is a function $f:\mathbb{R}^d\to\mathbb{R}$ and random variables $\epsilon^{(k)}:\mathbb{R}^d\to\mathbb{R}$ such that</p>
\[y^{(k)} = f(x^{(k)}) + \epsilon^{(k)}.\]
<p>Our model then is</p>
\[\widehat y = \widehat f (x),\]
<p>where $\widehat f:\mathbb{R}^d\to\mathbb{R}$ is a random function depending on the data we have. In this setting, the mean squared error is</p>
\[MSE = \mathbb{E}(y-\widehat f)^2.\]
<p>The problem is that evaluating this expectation is <strong>not</strong> a trivial task, as we do not really have access to the true distribution of $y$. Fortunately we have at our disposal one of the fundamental tools of probability theory, which allows us to approximate expectations.</p>
<h3 id="the-law-of-large-numbers">The law of large numbers</h3>
<p>In previous entries (see <a href="posts/2019/05/blog-post-3/">here</a> and <a href="posts/2019/06/blog-post-10/">here</a>) we have seen how we can use the law of large numbers to study the almost sure behavior of averages of i.i.d. random variables. We use it now the other way around: in order to evaluate an expectation, we compute averages of realizations of random variables. Recall the LLN:</p>
<p><strong>Theorem (LLN):</strong> suppose the i.i.d. sequence $\{X_n \}$ has expectation $\mathbb{E}(X_1) = \mu$ and finite fourth moment $\mathbb{E}(X_1^4)<\infty$. Then $S_n/n \to \mu$ almost surely.</p>
<p>Here $S_n = X_1 + \dots + X_n$. In general the assumption that the fourth moment of $X_n$ is finite can be relaxed to the assumption that only the first moment is finite, but the proof is much harder.</p>
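<p>A quick simulation illustrates the statement. The sketch below uses exponential random variables with mean $\mu = 1$ (an arbitrary choice for illustration, which has a finite fourth moment) and tracks the running averages $S_n/n$.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# i.i.d. exponential samples with mean mu = 1
samples = rng.exponential(scale=1.0, size=100_000)

# Running averages S_n / n for n = 1, ..., 100000
running_mean = np.cumsum(samples) / np.arange(1, samples.size + 1)

# The averages settle around mu = 1 as n grows
print(running_mean[9], running_mean[999], running_mean[-1])
```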
<p>Now back to our problem: we proposed the MSE as a measure to assess the goodness of the fit of our model to the data. The MSE can be expressed as an expectation: $MSE = \mathbb{E}(y-\widehat f)^2$. If we want to evaluate this expectation using the LLN, we need to find a sequence of i.i.d. random variables $\{X_n\}$ such that</p>
\[\dfrac{X_1+\dots +X_n}{n} \to \mathbb{E}(X_n) = MSE\]
<p>almost surely.</p>
<h3 id="empirical-error">Empirical error</h3>
<p>Consider the (finite) sequence of random variables $X_k = \left(y^{(k)} - \widehat f(x^{(k)})\right)^2$. It can be proved under reasonable assumptions that these functions are integrable, and hence we can apply the law of large numbers. If we consider their averages, we have that</p>
\[\dfrac{1}{n}\sum_{k=1}^n \left(y^{(k)} - \widehat f(x^{(k)})\right)^2 \to \mathbb{E}(y - \widehat f)^2 = MSE\]
<p>as $n\to\infty$. We call the average on the left-hand side the <strong>empirical error</strong>. Note that taking $n\to\infty$ can be thought of as adding more points to our dataset. In other words, the bigger the dataset is, the better the approximation of the MSE by the empirical error. Recall that the MSE can be decomposed as noise + variance + $\text{bias}^2$ (for the additive case):</p>
\[MSE = \sigma^2 + \text{var} (\widehat f) + \left(\mathbb{E}(f-\widehat f)\right)^2,\]
<p>where $\sigma^2$ is the variance of $\epsilon^{(k)}$. Using similar ideas, one can approximate the bias and the variance. This is in general not necessary, as one can use the MSE directly to assess the fit of the model. One of the main paradigms nowadays in Statistical Learning is to fit models by finding parameters which minimize the empirical risk (or analogous quantities). The idea is that we can find the parameters of our model using a portion of the dataset (called <strong>training set</strong>) and assess the fit using a portion of the dataset which was <strong>not</strong> used to train the model (called <strong>test set</strong>). We can then formulate a procedure to find appropriate models for our dataset:</p>
<ol>
<li>For each model, we find the optimal train and test error,</li>
<li>We plot in the x-axis the complexity of the model (i.e., the number of parameters), and in the y-axis the error. We plot separately two curves, one for the train error and one for the test error.</li>
<li>In general, we observe the following: as the complexity increases, the training error decreases; on the other hand, the test error decreases for the first part of the plot, and then it increases again.</li>
<li>To select the model, we choose the one whose complexity makes the test error minimal, while trying to keep the training error as low as possible.</li>
</ol>
<p>It is worth noticing that this procedure may be used to choose some of the hyperparameters of a model, such as the number of hidden layers and nodes of a neural network.</p>
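<p>The procedure above can be sketched on a toy problem. Here the family of models consists of polynomials, the complexity is the degree, and the data are generated from a hypothetical true function $f(x) = \sin(2\pi x)$ with additive Gaussian noise; all of these choices, as well as the helper names, are assumptions made for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(0)


def true_f(x):
    # Hypothetical true function f
    return np.sin(2 * np.pi * x)


# y = f(x) + epsilon, with a small training set and a larger test set
noise_sd = 0.3
x_train = rng.uniform(0, 1, 30)
y_train = true_f(x_train) + rng.normal(0, noise_sd, x_train.size)
x_test = rng.uniform(0, 1, 200)
y_test = true_f(x_test) + rng.normal(0, noise_sd, x_test.size)


def empirical_error(coeffs, x, y):
    # Average of the squared residuals: the empirical version of the MSE
    return np.mean((y - np.polyval(coeffs, x)) ** 2)


degrees = list(range(1, 10))
train_err, test_err = [], []
for d in degrees:
    coeffs = np.polyfit(x_train, y_train, d)  # least squares fit on the training set
    train_err.append(empirical_error(coeffs, x_train, y_train))
    test_err.append(empirical_error(coeffs, x_test, y_test))

# Select the complexity with the smallest test error
best_degree = degrees[int(np.argmin(test_err))]
print(best_degree)
```

<p>Plotting <code class="language-plaintext highlighter-rouge">train_err</code> and <code class="language-plaintext highlighter-rouge">test_err</code> against <code class="language-plaintext highlighter-rouge">degrees</code> gives the two curves described in the second step of the procedure.</p>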
<h3 id="final-comments">Final comments</h3>
<p>We have seen that we can use the empirical risk as a numerical measure to assess the goodness of fit of our models. This is a standard practice in Machine Learning. In the next entry, we will see how this paradigm is starting to be challenged by modern researchers, and that we might be entering a new era of Machine Learning.</p>Felipe Pérezfel.prz@gmail.comIn a previous entry we studied the concepts of bias and variance in an additive context. In this entry we dive deeper into the analysis of the mean squared error and how to assess it using actual data.Extreme value theory II2020-01-20T00:00:00-08:002020-01-20T00:00:00-08:00https://felperez.github.io/posts/2020/01/blog-post-26<p>In a <a href="/posts/2019/12/blog-post-16/">previous entry</a> we introduced the basics of Extreme Value Theory (EVT), such as the degeneracy of the maxima distribution, the extremal types theorem, as well as the Gumbel, Frechet, Weibull and GEV distributions. In this entry we will see a few examples of random variables and their respective maxima distribution, both theoretical and by performing simulations.</p>
<p>If you like my content, consider following my <a href="https://www.linkedin.com/in/felperez/">linkedin</a> page to stay updated.</p>
<h3 id="a-few-different-distributions">A few different distributions</h3>
<p><strong>Example 1:</strong> Consider an i.i.d. sequence $X_i$ with exponential distribution of parameter $\lambda=1$, whose <a href="https://en.wikipedia.org/wiki/Cumulative_distribution_function">CDF</a> is given by $F(x) = 1 - \exp(-x)$ for $x > 0$, and let $M_n = \max\{X_1,\dots,X_n \}$. If we take $a_n = 1$ and $b_n = \log n$ we can see that</p>
\[\mathbb{P}\left( \dfrac{M_n - b_n}{a_n}\leq z \right) = (F(z+\log n))^n = (1 - n^{-1}e^{-z})^n\to \exp(-e^{-z}).\]
<p>The first equality follows from the argument in the <a href="/posts/2019/12/blog-post-16/">previous entry</a> and the limit is just the definition of the exponential function. This implies that the maxima of our random variables follows a Gumbel distribution.</p>
<p><strong>Example 2:</strong> In <a href="https://stats.stackexchange.com/questions/105745/extreme-value-theory-show-normal-to-gumbel/105749#105749">this post</a> a sufficient condition is given to ensure that $M_n$ has Gumbel extremal type (the condition can be found in H.A. David & H.N. Nagaraja (2003), “Order Statistics” (3rd edition)). We reproduce here the condition and its applications:</p>
<p>Suppose that the CDF $F$ and density $f$ of $X_i$ are such that</p>
\[L = \lim_{x\to F^{-1}(1)}\left( \dfrac{d}{dx}\dfrac{1-F(x)}{f(x)} \right) = 0.\]
<p>Then $M_n$ has Gumbel extremal type. The result also gives a way to compute the sequences $a_n$ and $b_n$: in fact, one can take $a_n = (nf(b_n))^{-1}$ and $b_n =F^{-1}(1-\frac{1}{n})$.</p>
<p>Suppose that $X_i$ is an i.i.d. sequence with common distribution $\mathcal{N}(0,1)$. We apply the previous result to the standard normal sequence, where $F(x)=\Phi(x)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^x e^{-t^2/2}dt$, $f(x)= \phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ and $F^{-1}(1)=\infty$. Note that using the quotient rule we obtain that the expression in the parenthesis is equivalent to</p>
\[\dfrac{d}{dx}\dfrac{1-F(x)}{f(x)} = -\dfrac{\phi'(x)}{\phi(x)}\dfrac{(1-\Phi(x))}{\phi(x)} - 1.\]
<p>The first part of the product can be calculated explicitly:</p>
\[-\dfrac{\phi'(x)}{\phi(x)} = -\dfrac{-x e^{-x^2/2}}{e^{-x^2/2}} = x.\]
<p>The second part of the product is a bit more delicate: the term $\frac{(1-\Phi(x))}{\phi(x)}$ is known as the <a href="https://en.wikipedia.org/wiki/Mills_ratio">Mills ratio</a> of the standard normal distribution. As stated in the article, we have the following asymptotic: $\frac{(1-\Phi(x))}{\phi(x)}\sim \frac{1}{x}$ as $x\to\infty$ (a proof follows immediately from the bounds found in <a href="https://math.stackexchange.com/questions/3163568/bound-on-mills-ratio-of-normal-distribution">this post</a>). Hence the product tends to $1$, which implies that the limit $L$ is indeed equal to zero, and consequently, $M_n$ has extremal type Gumbel. In this case there is no explicit form for $b_n$, so we will not bother to compute it.</p>
<p>Similarly, if $X\sim\exp(\lambda)$, we have that $F(x)=1-e^{-\lambda x}$, $f(x)=\lambda e^{-\lambda x}$ and $F^{-1}(1)=\infty$. The first part of the product is</p>
\[-\dfrac{f'(x)}{f(x)} = \lambda,\]
<p>while the second part is</p>
\[\dfrac{1-F(x)}{f(x)} = \dfrac{1}{\lambda},\]
<p>hence we have $\frac{d}{dx}\frac{1-F(x)}{f(x)}\equiv0$. As a consequence, $M_n$ has Gumbel extremal type. Additionally, the sequences $a_n$ and $b_n$ can be taken as $b_n = F^{-1}(1-\frac{1}{n}) = \frac{\log n}{\lambda}$ and $a_n = (n\lambda e^{-\lambda b_n})^{-1} = \frac{1}{\lambda}$, which for $\lambda = 1$ agrees with the choice made in Example 1.</p>
<p>The same method proves that the extremal type of a Gamma distribution is Gumbel as well.</p>
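<p>For the exponential case, the convergence can be checked numerically. Solving $F(b_n) = 1 - \frac{1}{n}$ with $F(x) = 1-e^{-\lambda x}$ gives $b_n = \log(n)/\lambda$, and then $a_n = (nf(b_n))^{-1} = 1/\lambda$. The sketch below compares $\left(F(a_n z + b_n)\right)^n$ with the Gumbel CDF $\exp(-e^{-z})$ for increasing $n$; the value of $\lambda$ and the grid of points are arbitrary choices for illustration.</p>

```python
import numpy as np
from scipy.stats import expon, gumbel_r

lam = 2.0
z = np.linspace(-2.0, 5.0, 50)

max_gap = {}
for n in (10, 100, 10_000):
    a_n = 1.0 / lam
    b_n = np.log(n) / lam
    # P((M_n - b_n) / a_n <= z) = F(a_n z + b_n)^n for i.i.d. observations
    cdf_max = expon.cdf(a_n * z + b_n, scale=1.0 / lam) ** n
    # Largest distance to the standard Gumbel CDF over the grid
    max_gap[n] = np.max(np.abs(cdf_max - gumbel_r.cdf(z)))

print(max_gap)
```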
<p><strong>Example 3:</strong> Consider an i.i.d. sequence $X_n$ with uniform distribution $U(0,1)$. Then $F(x) = x$ and $f(x) = 1$ for $x \in [0,1]$. Take $z < 0$ and $n> -z$, and the sequences $a_n = 1/n$ and $b_n = 1$. Then</p>
\[\mathbb{P}\left( \dfrac{M_n - b_n}{a_n}\leq z \right) = (F(a_nz+b_n))^n = \left( 1+ \dfrac{z}{n} \right)^n \to e^z,\]
<p>proving that $M_n$ has Weibull extremal type.</p>
<h3 id="simulations">Simulations</h3>
<p>In this section we simulate several random variables and see what the distribution of their maxima looks like. The approach we follow is based on obtaining $nk$ samples and grouping them into $n$ groups of $k$ observations. We compute the maximum for each group and construct the corresponding histogram.</p>
<p><strong>Example 4:</strong> consider an i.i.d. sequence $X_i\sim\exp(1)$, and take $k=50$ and $n=2000$. We generate $100000$ observations of our random variable and group them into $2000$ groups of $50$ samples.</p>
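<p>As a side note, the grouping step can be written compactly with a reshape instead of explicit loops. The sketch below uses standard normal samples only as a placeholder; the examples that follow build the blocks with loops, which is equivalent.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

n, k = 2000, 50                       # n groups of k observations each
samples = rng.standard_normal(n * k)  # placeholder data

# Row i of the reshaped array holds samples[k*i : k*(i+1)]
blocks = samples.reshape(n, k)

# One maximum per block
maxima = blocks.max(axis=1)
print(maxima.shape)
```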
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import expon
n = 2000
k = 50
obs = n*k
s = expon.rvs(size=obs)
plt.hist(s, bins='auto')
plt.title("Histogram of exponential r.v.")
plt.show()
</code></pre></div></div>
<p><img src="/files/exponential.png" alt="exp dist" /></p>
<p>Now we look at the maxima of the blocks:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>blocks = np.empty([n,k])
for i in range(n):
    blocks[i,:] = s[k*i:k*(i+1)]
maxima = np.empty(n)
for i in range(n):
    maxima[i] = np.max(blocks[i,:])
plt.hist(maxima, bins='auto')
plt.title("Histogram of block maxima")
plt.show()
</code></pre></div></div>
<p><img src="/files/exp_gumbel.png" alt="Exp gumbel" /></p>
<p>We can see that it looks like a Gumbel distribution, as expected. We use scipy’s functions to fit the corresponding curve. We will see in the next entries how we can fit such curves.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from scipy.stats import gumbel_r
params = gumbel_r.fit(maxima)
fig, ax = plt.subplots(1, 1)
x = np.linspace(0,15, 100)
pdf_fitted = gumbel_r.pdf(x , loc = params[0] , scale = params[1] )
ax.plot(x,pdf_fitted,'r-')
ax.hist(maxima,density=True,alpha=.3, bins ='auto')
plt.show()
</code></pre></div></div>
<p><img src="/files/fit_exp_gumbel.png" alt="T dist" /></p>
<p>which looks like a very decent fit.</p>
<p><strong>Example 5:</strong> consider now uniform random variables $X_i\sim U(0,1)$. We proceed in a similar way as in the previous example: first we sample $X_i$ and take a look at the histogram:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from scipy.stats import uniform
n = 2000
k = 50
obs = n*k
s = uniform.rvs(size=obs)
plt.hist(s, bins='auto')
plt.title("Histogram of uniform r.v.")
plt.show()
</code></pre></div></div>
<p><img src="/files/uniform.png" alt="unif dist" /></p>
<p>Now we construct the blocks and their maxima:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>blocks = np.empty([n,k])
for i in range(n):
    blocks[i,:] = s[k*i:k*(i+1)]
maxima = np.empty(n)
for i in range(n):
    maxima[i] = np.max(blocks[i,:])
plt.hist(maxima, bins='auto')
plt.title("Histogram of block maxima")
plt.show()
</code></pre></div></div>
<p><img src="/files/unif_weibull.png" alt="unif dist" /></p>
<p>Finally we fit the curve of the Weibull distribution:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from scipy.stats import weibull_max
params = weibull_max.fit(maxima , floc = 1)
fig, ax = plt.subplots(1, 1)
# The maxima live just below the upper endpoint 1
x = np.linspace(maxima.min(), 1, 100)
# fit returns (c, loc, scale); evaluate the same distribution that was fitted
pdf_fitted = weibull_max.pdf(x , loc = params[1] , scale = params[2], c = params[0] )
ax.plot(x,pdf_fitted,'r-')
ax.hist(maxima,density=True,alpha=.3, bins ='auto')
plt.show()
</code></pre></div></div>
<p><img src="/files/fit_unif_weibull.png" alt="unif dist" /></p>
<p>Remark: there are some issues with the Weibull distribution in the scipy library.</p>
<h3 id="final-comments">Final comments</h3>
<p>In this entry we have seen how to find the extremal type of certain distributions by using some tricks. However, the examples were artificial, and in general these methods do not generalize well, so in the next entry we will see more consistent methods to find the extremal distribution of data using maximum likelihood.</p>Felipe Pérezfel.prz@gmail.comIn a previous entry we introduced the basics of Extreme Value Theory (EVT), such as the degeneracy of the maxima distribution, the extremal types theorem, as well as the Gumbel, Frechet, Weibull and GEV distributions. In this entry we will see a few examples of random variables and their respective maxima distribution, both theoretical and by performing simulations.Confidence intervals2020-01-15T00:00:00-08:002020-01-15T00:00:00-08:00https://felperez.github.io/posts/2020/01/blog-post-25<p>Confidence intervals represent one of the most powerful tools used by statisticians/data scientists. They allow us to quantify the uncertainty of our predictions, which proves crucial when making important decisions. In this entry we will take a first dive into this topic, finding confidence intervals for means of i.i.d. processes.</p>
<p>If you like my content, consider following my <a href="https://www.linkedin.com/in/felperez/">linkedin page</a> to stay updated.</p>
<h3 id="the-central-limit-theorem">The central limit theorem</h3>
<p>In previous entries we have seen how the <a href="/posts/2019/06/blog-post-10/">Law of large numbers</a>, the <a href="/posts/2019/06/blog-post-12/">Central limit theorem</a> (CLT) and the principle of <a href="/posts/2019/06/blog-post-13/">Large deviations</a> give a wide description of the asymptotic behavior of averages of i.i.d. random variables. They describe the almost sure behavior, the distribution of the fluctuations, and the decay of extreme probabilities. This time, we will use the CLT to study the probability of giving estimates which are close enough to the mean up to a certain confidence level. Let $X_1,X_2,\dots$ be an i.i.d. sequence with finite variance $0<\sigma^2 < \infty$ and mean $\mu$, and denote $S_n := X_1+\dots +X_n$, $\bar X:= S_n/n$. The CLT can then be stated as</p>
\[\dfrac{S_n -n\mu}{\sqrt{n\sigma^2}}\Rightarrow\mathcal{N}(0,1),\]
<p>where $\mathcal{N}(0,1)$ is the standard normal distribution. If we scale and translate, we obtain the equivalent asymptotics</p>
\[\bar X \Rightarrow\mathcal{N}\left( \mu , \dfrac{\sigma^2}{n}\right)\\
\dfrac{\bar X-\mu}{\frac{\sigma}{\sqrt{n}}}\Rightarrow \mathcal{N}(0,1)\]
<p>These results will be useful in the next section.</p>
<h3 id="estimating-the-mean">Estimating the mean</h3>
<p>Suppose we are in a situation where we observe i.i.d. data $X_1,X_2,\dots$ with mean $\mu$ and variance $0<\sigma^2<\infty$ (the question of how we know the data is i.i.d. is the topic for a whole other entry!). Suppose that we are interested in estimating the mean $\mu$. For this section, suppose that we know the variance $\sigma^2$ (which in real life is a pretty unlikely situation). A natural candidate to estimate $\mu$ is $\bar X$, which we call the <strong>sample mean</strong>. We can see from the results following from the CLT that:</p>
<ul>
<li>If we compute the sample mean, as we gather more data, it converges in distribution to the true parameter $\mu$, with decreasing variance $\sigma^2/n$. This means that if we perform many random samples of $\bar X$ and plot the corresponding histogram, it will tend to concentrate around $\mu$ and the curve will become more and more narrow.</li>
</ul>
<p>This provides a good starting point to quantify how certain we are that our estimate $\bar X$ is close enough to $\mu$, as we can actually measure probabilities using the normal distribution $\mathcal{N}$. Denote</p>
\[Z = \dfrac{\bar X - \mu}{\sigma/\sqrt{n}},\]
<p>which we call the <strong>Z statistic</strong>. Note now that $|Z| \leq \eta$ if and only if</p>
\[\bar X - \eta\dfrac{\sigma}{\sqrt{n}} \leq \mu \leq \bar X +\eta\dfrac{\sigma}{\sqrt{n}}.\]
<p>This means that if we can control the probability that the Z statistic is within a certain interval, we are exactly controlling the probability that the <strong>true</strong> value of the parameter is within an interval that we can explicitly write down. More concretely, we have that</p>
\[\mathbb{P}\left(\bar X - \eta\dfrac{\sigma}{\sqrt{n}} \leq \mu \leq \bar X +\eta\dfrac{\sigma}{\sqrt{n}} \right) = \mathbb{P}(|Z|\leq\eta) \approx \Phi(\eta) - \Phi(-\eta) =2 \Phi(\eta) - 1 = \alpha,\]
<p>where $\Phi$ is the cumulative distribution function of the standard normal distribution $\mathcal{N}(0,1).$ For instance, if we wanted an interval $I$ which contains $\mu$ with probability $\alpha = 0.95$, then we need to find $\eta$ such that $2\Phi(\eta) -1 = 0.95$. One can numerically invert $\Phi$ or look up its approximate values in a table and see that this means that $\eta \approx 1.96$ (see for instance <a href="http://z-scoretable.com">here</a>: look for the area in the table and find in the axis the corresponding value of $\eta$). This shows where the classical “1.96” number comes from. We call the interval $[\bar X - \eta\frac{\sigma}{\sqrt{n}},\bar X + \eta\frac{\sigma}{\sqrt{n}}]$ a <strong>confidence interval</strong> for $\mu$ when $\sigma^2$ is known.</p>
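<p>As a minimal sketch of the numerical inversion of $\Phi$, the snippet below uses <code>statistics.NormalDist</code> from the Python standard library. The helper name <code>z_critical</code> is only illustrative and does not come from the original entry.</p>

```python
from statistics import NormalDist


def z_critical(alpha: float) -> float:
    """Return eta such that 2 * Phi(eta) - 1 = alpha for the standard normal."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    # 2 * Phi(eta) - 1 = alpha  is equivalent to  Phi(eta) = (1 + alpha) / 2
    return NormalDist().inv_cdf((1.0 + alpha) / 2.0)


eta_95 = z_critical(0.95)
print(round(eta_95, 2))  # 1.96
```

<p>Any other confidence level can be plugged in the same way, for instance <code>z_critical(0.99)</code>.</p>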
<p>We can interpret the previous result as follows: if we repeatedly draw samples and construct a sequence of confidence intervals $I_1,\dots,I_m$, then the proportion of intervals that contain $\mu$ is approximately $\alpha$. In the previous example, roughly $95$ out of every $100$ intervals contain $\mu$. We stress that this is an asymptotic result, as it relies on both the law of large numbers and the central limit theorem approximations.</p>
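<p>The frequency interpretation can be checked with a small simulation. This is only a sketch under stated assumptions: the data are drawn from a normal distribution (so the Z statistic is exactly normal), and the function name <code>coverage</code>, the parameter values and the seed are all arbitrary choices made for illustration.</p>

```python
import random
from statistics import NormalDist, fmean


def coverage(mu, sigma, n, alpha, trials, seed=0):
    """Fraction of simulated known-sigma confidence intervals that contain mu."""
    rng = random.Random(seed)
    eta = NormalDist().inv_cdf((1 + alpha) / 2)
    half_width = eta * sigma / n ** 0.5
    hits = 0
    for _ in range(trials):
        # Draw a fresh sample of size n and compute its sample mean.
        xbar = fmean(rng.gauss(mu, sigma) for _ in range(n))
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / trials


# The printed proportion should be close to alpha = 0.95.
print(coverage(mu=2.0, sigma=3.0, n=30, alpha=0.95, trials=4000))
```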
<p>Note that the width of the interval is determined by three factors:</p>
<ol>
<li>The size of the variance of the distribution, i.e., $\sigma^2$: if the distribution has high variance, then we need a wider interval to capture $\mu$ with the same probability, since our data may lie far away from its mean.</li>
<li>The amount of data we have. This one is obvious: the more data, the smaller $\frac{1}{\sqrt n}$, since we know more about our distribution if we sample more data.</li>
<li>The confidence level $\alpha$, through $\eta$: note that $\eta$ is determined by the equation $\Phi(\eta) = (\alpha+1)/2$ and that $\Phi$ is a non-decreasing function (it is a <a href="https://en.wikipedia.org/wiki/Cumulative_distribution_function#Properties">CDF</a>). This means that the closer we want $\alpha$ to be to $1$, the bigger $\eta$ is (the normal distribution has no right endpoint, so $\eta$ grows without bound as $\alpha\to 1$).</li>
</ol>
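<p>The three factors in the list above can be seen directly in the formula for the width, $2\eta\sigma/\sqrt{n}$. The following sketch (the function name <code>ci_width</code> is hypothetical) evaluates it for a few choices of $\sigma$, $n$ and $\alpha$.</p>

```python
from math import sqrt
from statistics import NormalDist


def ci_width(sigma, n, alpha):
    """Width 2 * eta * sigma / sqrt(n) of the known-variance confidence interval."""
    eta = NormalDist().inv_cdf((1 + alpha) / 2)
    return 2 * eta * sigma / sqrt(n)


print(ci_width(sigma=1.0, n=100, alpha=0.95))  # baseline
print(ci_width(sigma=2.0, n=100, alpha=0.95))  # doubling sigma doubles the width
print(ci_width(sigma=1.0, n=400, alpha=0.95))  # four times the data halves the width
print(ci_width(sigma=1.0, n=100, alpha=0.99))  # higher confidence widens the interval
```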
<h3 id="but-we-do-not-know-sigma2">But we do not know $\sigma^2$</h3>
<p>Yes, that is a problem for the method described before. One would be tempted to replace the unknown quantity $\sigma^2$ with the estimator $s^2$ given by</p>
\[s^2 = \dfrac{1}{n-1}\sum_{k=1}^n (X_k - \bar X)^2\]
<p>and then estimate $\mathbb{P}(\mu \in I)$ by taking</p>
\[\mathbb{P}\left(\bar X - \eta\dfrac{s}{\sqrt{n}} \leq \mu \leq \bar X +\eta\dfrac{s}{\sqrt{n}} \right),\]
<p>where we replaced $\sigma$ by $s$. It turns out that the statistic</p>
\[T = \dfrac{\bar X - \mu}{\frac{s}{\sqrt n}}\]
<p>is <strong>not</strong> normally distributed, hence we cannot use the same idea as above. Luckily, there is a workaround: William Sealy Gosset, aka Student, proved that under certain assumptions$^*$ the statistic $T$ has a particular distribution, called <strong>Student’s t-distribution</strong>, whose density is given by</p>
\[f_k(t) = \dfrac{\Gamma\left(\frac{k+1}{2}\right)}{\sqrt{k \pi} \ \Gamma\left(\frac{k}{2}\right)}\left(1+\frac{t^{2}}{k}\right)^{-\frac{k+1}{2}}\]
<p>where $k$ is a parameter called the <strong>degrees of freedom</strong>, and $\Gamma(\cdot)$ is the <a href="https://en.wikipedia.org/wiki/Gamma_function">gamma function</a>. We can see the graph of the density for different choices of $k$ in the plot below.</p>
<p><img src="/files/t_distr.png" alt="T dist" /></p>
<p>For the $T$ statistic, $k=n-1$. Qualitatively, the t-distribution has a bell shape and is symmetric, so it <em>looks</em> like the Gaussian distribution, but has some different properties. For instance, we can see from its density that it has heavier tails than the normal distribution (it does not decay exponentially). If we call $F_k$ the CDF of the t-distribution, then we can apply the same technique as above to construct confidence intervals when $\sigma^2$ is not known:</p>
\[\mathbb{P}\left(\bar X - \eta\dfrac{s}{\sqrt{n}} \leq \mu \leq \bar X +\eta\dfrac{s}{\sqrt{n}} \right) = \mathbb{P}(|T|\leq\eta) = F_{n-1}(\eta) - F_{n-1}(-\eta) =2 F_{n-1}(\eta) - 1 = \alpha.\]
<p>The comments on the width of the resulting confidence intervals apply similarly here, so we will not go over them again. What is worth mentioning is that as $n\to\infty$, the t-distribution with $n-1$ degrees of freedom <em>tends to look like</em> the normal distribution. Arguments for this can be found <a href="https://stats.stackexchange.com/questions/110359/why-does-the-t-distribution-become-more-normal-as-sample-size-increases">here</a>. There is a big discussion on whether to use the t-distribution for large values of $n$ or not, but we will not go in that direction for now. The only remark we make about it is that for $\alpha = 0.95$ and a large enough number of degrees of freedom, we have that $\eta \approx 1.96$ as well, giving practically the same confidence interval as in the $Z$ statistic case.</p>
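<p>To make the last remark concrete, here is a rough sketch that computes $\eta$ for the t-distribution using only the standard library: it evaluates the density $f_k$ written above, integrates it with Simpson's rule and inverts the CDF by bisection. The function names and the number of integration steps are assumptions made for this illustration; in practice one would normally rely on a statistics library instead.</p>

```python
from math import exp, lgamma, log, log1p, pi


def t_density(t, k):
    """Density f_k(t) of Student's t-distribution with k degrees of freedom."""
    log_const = lgamma((k + 1) / 2) - lgamma(k / 2) - 0.5 * log(k * pi)
    return exp(log_const - (k + 1) / 2 * log1p(t * t / k))


def t_cdf(eta, k, steps=2000):
    """F_k(eta) for eta >= 0: symmetry gives 1/2, Simpson's rule gives the rest."""
    if eta == 0:
        return 0.5
    h = eta / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, k) + t_density(eta, k)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, k)
    return 0.5 + total * h / 3


def t_critical(alpha, k):
    """Find eta with 2 * F_k(eta) - 1 = alpha by bisection."""
    target = (1 + alpha) / 2
    lo, hi = 0.0, 1.0
    while t_cdf(hi, k) < target:  # enlarge the bracket until it contains eta
        hi *= 2
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, k) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


print(t_critical(0.95, 9))     # about 2.262 (sample size n = 10)
print(t_critical(0.95, 1000))  # close to the normal value 1.96
```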
<hr />
<p>$^*:$ The assumptions for this result to hold are the following: if we write the T statistic as $T= \frac{\bar X - \mu}{s/\sqrt{n}} = \frac{Z}{s/\sigma}$ with $Z = \frac{\bar X - \mu}{\sigma/\sqrt{n}}$, then</p>
<ol>
<li>$Z$ has standard normal distribution (i.e., $Z\sim \mathcal{N}(0,1)$);</li>
<li>$(n-1)s^2/\sigma^2$ has <a href="https://en.wikipedia.org/wiki/Chi-squared_distribution">chi-squared</a> distribution with $n-1$ degrees of freedom;</li>
<li>$Z$ and $s$ are independent.</li>
</ol>
<p>These assumptions hold in the case when $X_i\sim \mathcal{N}(\mu,\sigma^2)$ (see <a href="https://en.wikipedia.org/wiki/Cochran%27s_theorem">Cochran’s</a> (see also <a href="http://users.stat.umn.edu/~sandy/courses/8311/handouts/ch05.pdf?fbclid=IwAR0iRW6ah-vO_x_tvLCXmFrU1INkYxZE0bSFqb24EWbowj8l0oGOmr8nePY">here</a>) and <a href="https://en.wikipedia.org/wiki/Basu%27s_theorem">Basu’s theorem</a>). This is usually how the t-distribution is defined: as the distribution of the ratio $Z/\sqrt{V/k}$, where $Z$ is standard normal and $V$ is an independent chi-squared variable with $k$ degrees of freedom. Obtaining the density of the t-distribution is a matter of integration and careful analysis of integral kernels (see <a href="https://math.stackexchange.com/questions/474733/derivation-of-the-density-function-of-student-t-distribution-from-this-big-integ">here</a> and <a href="https://math.stackexchange.com/questions/1384338/math-intuition-and-natural-motivation-behind-t-student-distribution">here</a>).</p>
<hr />
<h3 id="final-comments">Final comments</h3>
<p>We have seen how to construct confidence intervals for the mean by using the Z and T statistics. In future entries we will explore confidence intervals for different estimators, as well as their use in hypothesis testing.</p>
<h3 id="references">References</h3>
<ul>
<li><em>Statistical inference</em>, Berger & Casella,</li>
<li><em>Statistical data analysis</em>, Cowan</li>
</ul>Felipe Pérezfel.prz@gmail.comConfidence intervals represent one of the most powerful tools used by statisticians/data scientists. They allow us to quantify the uncertainty of our predictions, which proves crucial when making important decisions. In this entry we will take a first dive into this topic, finding confidence intervals for means of i.i.d. processes.Problem 5: Unbiased and consistent estimators2020-01-12T00:00:00-08:002020-01-12T00:00:00-08:00https://felperez.github.io/posts/2020/01/blog-post-24<p>What does it mean for an estimator to be unbiased? What about consistent? Give examples of an unbiased but not consistent estimator, as well as a biased but consistent estimator.</p>
<p>If you like my content, consider following my <a href="https://www.linkedin.com/in/felperez/">linkedin</a> page to stay updated.</p>
<p><strong>Concise answer:</strong></p>
<p>An unbiased estimator is such that its expected value is the true value of the population parameter. A consistent estimator is such that it converges in probability to the true value of the parameter as we gather more samples.</p>
<p><strong>Long answer:</strong></p>
<p>When we have a population $S$ with a parameter $\theta$ we want to estimate, we propose <strong>estimators</strong> $\hat\theta$. These estimators are random variables and as such, they have a distribution. If the expected value of this estimator is equal to the true value of the parameter, we say that the estimator is <strong>unbiased</strong>, otherwise we say it is <strong>biased</strong>. Mathematically, unbiasedness can be written as $\mathbb{E}(\hat\theta)=\theta$. On the other hand, whether or not the estimator’s expected value is the true value of the parameter, it may still converge in probability to the true parameter as the sample size grows. When this is the case, we say that the estimator is <strong>consistent</strong>.</p>
<p><strong>Examples:</strong></p>
<p>An unbiased estimator which is not consistent: suppose $X_1,\dots,X_n$ are i.i.d. samples with mean $\mu$ and consider the estimator of $\mu$ given by $\hat\mu = X_1$. Then $\mathbb{E}(\hat\mu) =\mu$ since $X_1$ has mean $\mu$. On the other hand, $\hat\mu$ is not consistent: unless the distribution is degenerate, for $\epsilon$ small enough we have</p>
\[\mathbb{P}(|X_1 - \mu| > \epsilon) > 0\]
<p>and this probability is independent of $n$, so it will not converge to zero.</p>
<p>A biased estimator which is consistent: suppose $X_1,\dots,X_n$ are i.i.d. samples with mean $\mu$ and variance $\sigma^2$, and consider the estimator of $\mu$ given by</p>
\[\hat\mu = \dfrac{1}{n} + \dfrac{1}{n}\sum_{i=1}^n X_i .\]
<p>Then $\mathbb{E}(\hat\mu)=\mu + \frac{1}{n}\neq\mu$ for all $n$, while by <a href="https://en.wikipedia.org/wiki/Markov's_inequality">Markov’s inequality</a></p>
\[\mathbb{P}\left\{\left( \dfrac{1}{n} + \dfrac{1}{n}\sum_{i=1}^n X_i - \mu\right)^2 > \epsilon^2\right\}\leq \dfrac{\mathbb{E}( \frac{1}{n} + \frac{1}{n}\sum_{i=1}^n X_i - \mu)^2}{\epsilon^2},\]
<p>so we need to estimate the expectation of the square. Note that for this we can do</p>
\[\mathbb{E}\left( \dfrac{1}{n} + \dfrac{1}{n}\sum_{i=1}^n X_i - \mu\right)^2 = \dfrac{1}{n^2}+\dfrac{2}{n}\mathbb{E} \left(\dfrac{1}{n}\sum_{i=1}^n X_i - \mu\right) + \mathbb{E}\left(\dfrac{1}{n}\sum_{i=1}^n X_i - \mu \right)^2.\]
<p>The first term obviously converges to zero, while for the second we can see that the expectation of $X_i$ is $\mu$ so the whole term vanishes. For the last term, we can expand the square and see that</p>
\[\newcommand{\var}{var}
\mathbb{E}\left(\dfrac{1}{n}\sum_{i=1}^n X_i - \mu \right)^2= \var\left(\dfrac{1}{n}\sum_{i=1}^n X_i - \mu \right)=\var\left(\dfrac{1}{n}\sum_{i=1}^n X_i \right) =\dfrac{1}{n}\sigma^2\]
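<p>from which consistency follows: the right-hand side of Markov’s inequality is at most $\left(\frac{1}{n^2}+\frac{\sigma^2}{n}\right)/\epsilon^2$, which tends to zero as $n\to\infty$.</p>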
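<p>Both examples can be illustrated numerically. The sketch below assumes normally distributed samples with arbitrary parameters and a fixed seed, and the two function names are made up for this illustration. The first estimator keeps fluctuating no matter how large $n$ is, while the second one approaches $\mu$ even though its bias $1/n$ is never zero.</p>

```python
import random
from statistics import fmean


def unbiased_not_consistent(xs):
    """The estimator X_1: unbiased, but it ignores all the other samples."""
    return xs[0]


def biased_consistent(xs):
    """The estimator 1/n + sample mean: biased by 1/n, yet consistent."""
    return 1 / len(xs) + fmean(xs)


rng = random.Random(1)
mu, sigma = 5.0, 2.0
for n in (10, 1000, 100000):
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    print(n, unbiased_not_consistent(xs), biased_consistent(xs))
```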
<p><a href="https://stats.stackexchange.com/questions/303398/smarter-example-of-biased-but-consistent-estimator">Another nice example</a>.</p>Felipe Pérezfel.prz@gmail.comWhat does it mean for an estimator to be unbiased? What about consistent? Give examples of an unbiased but not consistent estimator, as well as a biased but consistent estimator.The Law of Anomalous Numbers2020-01-10T00:00:00-08:002020-01-10T00:00:00-08:00https://felperez.github.io/posts/2020/01/blog-post-23<p>We have seen in previous entries how multiple statistical results allow us to find patterns in randomness. Today we talk about a <em>strange</em> law of numbers, <strong>Benford’s law</strong>. This is an empirical law that explains the distribution of the <strong>leading digit</strong> of observed data in real life situations. We will then draw a parallel with the <strong>law of leading digits for the powers of 2</strong> and how can ergodic theory help us understand these phenomena.</p>
<p>If you like my content, consider following my <a href="https://www.linkedin.com/in/felperez/">linkedin</a> page to stay updated. The code generating the plots and computing the distributions can be found in my github <a href="https://github.com/felperez/statistical-laws/blob/master/benford.ipynb">here</a>.</p>
<h3 id="real-life-data-and-distribution-of-leading-digits">Real life data and distribution of leading digits</h3>
<p>Suppose we collect data from a real life setting: in this example, we will use the nominal GDP (in US dollars) of countries. The data has been obtained from <a href="https://datahub.io/core/gdp">https://datahub.io/core/gdp</a>. It consists of a list of regions and countries together with, among other indicators, their nominal GDP. We first take a look at what the table looks like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Load the GDP table downloaded from datahub.io and preview its first rows
import pandas as pd
gdp = pd.read_csv("../input/gdp.csv")  # path to the local copy of the CSV file
gdp.head()
</code></pre></div></div>
<p><img src="/files/table_gdp.png" alt="Table gdp" /></p>
<p>Taking a deeper look, we can see that the first rows consist of several different regions of the world. If we scroll far enough down, we find the list of countries with several years of measurements. With some quick filtering we obtain the list of countries and their GDP in 2016. If we look at the first digit of each of the values in the dataframe, we observe the following distribution:</p>
<p><img src="/files/observed_distr.png" alt="Observed distribution" /></p>
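<p>The computation of the first-digit distribution can be sketched as follows. The helper names are hypothetical and the sample values are made up purely for illustration (they are not taken from the GDP table); the actual notebook linked above may proceed differently.</p>

```python
from collections import Counter
from math import isfinite


def leading_digit(value):
    """First significant decimal digit of a finite, nonzero number."""
    # Scientific notation with 17 significant digits, e.g. '1.8700000000000000e+13'.
    return int(f"{abs(value):.16e}"[0])


def leading_digit_frequencies(values):
    """Relative frequency of each leading digit 1..9 among the usable values."""
    usable = [v for v in values if v and isfinite(v)]  # drop zeros, NaN and infinities
    counts = Counter(leading_digit(v) for v in usable)
    return {d: counts.get(d, 0) / len(usable) for d in range(1, 10)}


# Made-up values spanning several orders of magnitude, for illustration only.
sample = [1.87e13, 1.12e13, 4.9e12, 3.5e12, 2.6e12, 1.9e11, 9.3e10, 1.4e9]
print(leading_digit_frequencies(sample))
```

<p>Applied to the column of 2016 values of the dataframe, the same function would produce the frequencies shown in the plot.</p>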
<p>We can see that the distribution has a peak at 1 and decays logarithmically. This is what Benford’s law predicts:</p>
<p><strong>Benford’s law:</strong> we say that a set of numbers $S$ satisfies the Benford distribution if the frequency $f_d$ of the digit $d$ as the first digit of the numbers in $S$ is approximately</p>
\[f_d = \log_{10}\left(1+ \dfrac{1}{d}\right)\]
<p>We plot this function along with the observed distribution from the previous plot.</p>
<p><img src="/files/theoretical_distr.png" alt="Theoretical distribution" /></p>
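<p>For reference, the theoretical frequencies $f_d$ are easy to tabulate. The short sketch below (the function name is illustrative) also confirms that they add up to one, since the sum telescopes to $\log_{10} 10$.</p>

```python
from math import log10


def benford_probability(d):
    """Benford frequency log10(1 + 1/d) of the leading digit d in 1..9."""
    if d not in range(1, 10):
        raise ValueError("d must be an integer between 1 and 9")
    return log10(1 + 1 / d)


benford = {d: benford_probability(d) for d in range(1, 10)}
for d, p in benford.items():
    print(d, round(p, 3))
print(sum(benford.values()))  # equal to 1 up to floating point rounding
```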
<p>We can see how the distribution of the observed data closely follows Benford’s distribution. In general, many sets of naturally occurring numbers have been found to follow Benford’s distribution, in a wide variety of contexts. Two heuristic conditions that seem to ensure that a given dataset follows Benford’s distribution are:</p>
<ol>
<li>The dataset spans over several orders of magnitude;</li>
<li>The fluctuations are multiplicative.</li>
</ol>
<p>More about these heuristics can be read in Wikipedia’s page about Benford’s law <a href="https://en.wikipedia.org/wiki/Benford%27s_law#Example">here</a>. In particular, it has been observed to hold in several</p>
<h3 id="what-can-ergodic-theory-say-about-this">What can ergodic theory say about this?</h3>
<p>In previous entries (see <a href="/posts/2019/06/blog-post-10/">‘Law of large numbers, part 1’</a> and <a href="/posts/2020/01/blog-post-22/">‘The ergodic theorem’</a>) we have seen several equidistribution results, for i.i.d. processes, Markov chains and ergodic processes. All three describe the asymptotic behavior of averages of random variables: under certain conditions, the averages converge to the expected value of the random variable, almost surely. While this result is extremely powerful, we must be extremely careful with how we apply it to a real life situation. Let us see the following example:</p>
<p><strong>All digits are equally likely:</strong> consider $X = [0,1]$ endowed with the Borel sigma-algebra and the Lebesgue measure. If we pick a random number from $[0,1]$, then with probability one the asymptotic frequency of the digit $k$ in its decimal expansion is equal to $1/10$ for all $k\in\{0,\dots,9\}$. To prove this, we observe that we can define the following random variables: $X_{k,n}(\omega) = 1 \text{ if n-th digit of }\omega \text{ is } k$, and $X_{k,n}(\omega) = 0$ otherwise. For each fixed $k$, this sequence of random variables is i.i.d. (exercise). Note that the quantity</p>
\[f_{k,n}(\omega) = \dfrac{1}{n}\{\text{#of k's in first n digits of }\omega\}\]
<p>is equal to the averages of $X_{k,n}$:</p>
\[f_{k,n}(\omega) = \dfrac{X_{k,1}+\dots+X_{k,n}}{n}\]
<p>and by the <a href="/posts/2019/06/blog-post-10/">Law of large numbers</a>, this converges almost surely to the mean of the $X_{k,n}$:</p>
\[\mathbb{E}(X_{k,n}) = \int_{0}^1 X_{k,n} dx = \dfrac{1}{10}.\]
<p><strong>Words of caution:</strong> this does <strong>not</strong> mean that if we pick a particular number from $[0,1]$ we will actually observe such a distribution of digits. It does not follow from the LLN that this works for a given number.</p>
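<p>A simple way to see the exceptional set at work is to look at rational numbers, which form a set of Lebesgue measure zero and whose decimal expansions are eventually periodic. The sketch below (function names are illustrative) computes digits by exact long division and shows that for $1/3$ and $1/7$ the digit frequencies are far from $1/10$.</p>

```python
from collections import Counter


def decimal_digits(numerator, denominator, n):
    """First n decimal digits of numerator/denominator in [0, 1), by long division."""
    digits = []
    remainder = numerator % denominator
    for _ in range(n):
        remainder *= 10
        digits.append(remainder // denominator)
        remainder %= denominator
    return digits


def digit_frequencies(numerator, denominator, n):
    """Frequency of each digit 0..9 among the first n decimal digits."""
    counts = Counter(decimal_digits(numerator, denominator, n))
    return {k: counts.get(k, 0) / n for k in range(10)}


print(digit_frequencies(1, 3, 600))  # only the digit 3 ever appears
print(digit_frequencies(1, 7, 600))  # only the digits of the block 142857 appear
```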
<h3 id="overcoming-this-issue-unique-ergodicity">Overcoming this issue: unique ergodicity</h3>
<p>In the previous section we saw how even when we have almost sure asymptotic results, it is not clear if we can use this result with a given number. This is a problem which is intrinsic to almost sure convergence results: there is no way to tell if we are sampling from the zero measure set where the convergence may not be happening. To solve this, we need results with stronger convergence. It turns out that the notion of <strong>unique ergodicity</strong> is what we really need.</p>
<p>As in the setting of <a href="/posts/2020/01/blog-post-22/">‘The ergodic theorem’</a>, let $(X,\mathcal{B},\mu,T)$
be a probability preserving system on a compact space $X$, where the sigma-algebra is the Borel sigma-algebra. We say that the system is uniquely ergodic if $\mu$ is the unique invariant Borel probability measure for $T$. This implies that $\mu$ is ergodic: if it is not ergodic, there are disjoint invariant sets $A,B\subset X$ with positive measure. Thus the measures</p>
\[\mu_A(\cdot) = \dfrac{\mu(A\cap\cdot)}{\mu(A)} \quad , \quad \mu_B(\cdot) = \dfrac{\mu(B\cap\cdot)}{\mu(B)}\]
<p>are two different invariant probability measures for $T$. We also have the following characterization:</p>
<p><strong>Theorem (unique ergodicity):</strong> $\mu$ is uniquely ergodic if and only if for every continuous function $f$, we have that</p>
\[\lim_{n\to\infty}\sup_{x\in X} \left|\dfrac{1}{n}(f(x)+f\circ T(x) + \dots + f\circ T^{n-1}(x)) - \int_X fd\mu\right| = 0.\]
<p>Note that this implies that the convergence happens for <strong>all</strong> points in the space. This is particularly helpful, as we will see in the next example.</p>
<p><strong>Distribution of leading digit of the powers of 2:</strong></p>
<p>Benford’s law states that naturally occurring datasets have leading digits which tend to follow Benford’s distribution. In this section we <strong>prove</strong> that this is actually the case for powers of 2. The proof is based on the above approach using unique ergodicity.</p>
<p>Let $X = S^1 = [0,1]/\sim$ where $\sim$ identifies $0$ and $1$, and for $\alpha$ irrational, consider the map $T\colon X\to X$ given by $T(x) = x + \alpha \pmod 1$. It is easy to see that the Lebesgue measure $m$ on $X$ is invariant under $T$ (see picture below).</p>
<p><img src="/files/irrational_rotation.png" alt="Irrational rotation" /></p>
<p>We will prove that $T$ is uniquely ergodic, that is, that $m$ is the only invariant probability measure for $T$. Before the proof, we will see how this result yields the distribution law for the leading digits of the powers of 2. Let $\alpha = \log 2$ (from now to the end of the article logarithms are assumed to be in base 10). Using the <a href="https://personalpages.manchester.ac.uk/staff/charles.walkden/ergodic-theory/ergodic_theory.pdf">Stone-Weierstrass theorem</a> one can extend the convergence in the unique ergodicity theorem to indicator functions of intervals. Assuming that the averages converge <strong>everywhere</strong> to their expected value, we can in particular take $x = 0$ and $f(x)=\chi_{[\log k , \log (k +1))}(x)$ for a digit $k\in\{1,\dots,9\}$. Note that if we write a given power of 2 in base 10, we have that</p>
\[2^ n = 10^m a_m + \dots +10^0 a_0.\]
<p>Note that the leading digit of $2^n$ is then $a_m$ if and only if</p>
\[10^m a_m \leq 2^ n < 10^{m} (a_m+1),\]
<p>and this is equivalent to</p>
\[m+ \log a_m \leq n\log 2 < m + \log(a_m+1).\]
<p>If we work $\pmod {1}$ then the above inequality is equivalent to</p>
\[\log a_m \leq n\log 2 \pmod {1} = T^n(0) < \log(a_m+1).\]
<p>This means that the leading digit of $2^n$ is $a_m$ if and only if $T^n(0)\in [\log a_m, \log(a_m+1))$. Since the averages</p>
\[\dfrac{1}{n}(\chi_{[\log a_m, \log(a_m+1))}(T(0))+\dots + \chi_{[\log a_m, \log(a_m+1))}(T^n(0)))\]
<p>compute the proportion of indices $k=1,\dots, n$ for which $a_m$ is the leading digit of $2^k$, taking the limit we obtain the frequency $f_{a_m}$ of $a_m$ as leading digit, corresponding to the expected value of $\chi_{[\log a_m, \log(a_m+1))}:$</p>
\[f_{a_m} = \lim_{n\to\infty} \dfrac{1}{n}\{\text{ #of times }a_m\text{ is the leading digit of }2^k, k=1\dots n \}= \int_0^1 \chi_{[\log a_m, \log(a_m+1))}dx = \log\left(1+\dfrac{1}{a_m}\right)\]
<p>which is exactly Benford’s distribution.</p>
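<p>The statement can also be checked empirically with exact integer arithmetic. The following sketch (function names and the cutoff $n=2000$ are arbitrary choices for illustration) tabulates the leading digits of $2^k$ and compares them with $\log(1+1/d)$.</p>

```python
from collections import Counter
from math import log10


def leading_digit_of_power(base, k):
    """Leading decimal digit of base**k, computed with exact integers."""
    return int(str(base ** k)[0])


def power_leading_digit_frequencies(base, n):
    """Frequency of each leading digit among base**1, ..., base**n."""
    counts = Counter(leading_digit_of_power(base, k) for k in range(1, n + 1))
    return {d: counts.get(d, 0) / n for d in range(1, 10)}


freqs = power_leading_digit_frequencies(2, 2000)
for d in range(1, 10):
    # observed frequency next to Benford's prediction
    print(d, round(freqs[d], 4), round(log10(1 + 1 / d), 4))
```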
<h3 id="proof-of-unique-ergodicity">Proof of unique ergodicity</h3>
<p>This proof is a classic example of how harmonic analysis helps us prove uniformity results. We will show that any Borel probability measure $\nu$ invariant under $T$ must be equal to the Lebesgue probability measure $m$ on $X$. For this, we will show that they integrate the same against any continuous function $f\in\mathcal{C}(X)$.</p>
<p>First, we take a look at the <a href="https://www.math.u-bordeaux.fr/~jli004/publications/renewal.pdf">Fourier coefficients</a> of $\nu$:</p>
\[\hat\nu(k) = \int_X e^{2\pi i k \theta}d\nu(\theta) = \int_X e^{2\pi i k (\theta+\alpha)}d\nu(\theta) = e^{2\pi i k\alpha}\hat\nu(k),\]
<p>where the first equality is by definition and the second by invariance of $\nu$. Since $\alpha$ is irrational, we have that $e^{2\pi i k \alpha}\neq 1$ for all $k\neq 0$, and hence $\hat\nu(k) = 0$. For $k = 0$ it is immediate that $\hat\nu(0)=1$. This implies that $\nu$ and $m$ have the same Fourier coefficients, and hence correspond to the same measure.</p>
<h3 id="final-comments">Final comments</h3>
<p>In this entry we have seen how some real life datasets follow Benford’s distribution. We have also seen how ergodic theory is able to prove that a certain set of numbers (powers of 2) follows Benford’s distribution. While this does not prove that other real life sets follow the same distribution, it provides some evidence for it. We <em>dream</em> that there is some not yet known ergodic theoretic proof for this result, but one can only hope.</p>
<h3 id="exercise">Exercise</h3>
<p>This is an exercise from Barry Simon’s book on Harmonic Analysis: prove that the leading digits of $2^n$ and $3^n$ are asymptotically independent, that is,</p>
\[\lim_{n\to\infty} \left(\dfrac{1}{n} \#\{ k\leq n: ld(2^k) = a_1, \ ld(3^k)= a_2\} - \dfrac{1}{n^2} \#\{ k\leq n: ld(2^k) = a_1\} \# \{ k\leq n: ld(3^k)= a_2\} \right) = 0,\]
<p>where $ld(p)$ represents the leading digit of $p$, and $a_1,a_2\in\{1,\dots,9\}$. The hint provided by the book suggests to consider multi-dimensional <em>irrational</em> rotations.</p>
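<p>Before attempting the proof, one can gather some numerical evidence. The sketch below is only an experiment, not a solution: it evaluates the quantity inside the limit for a finite $n$, using exact integers. The function names and the choice $n = 3000$ are arbitrary.</p>

```python
def leading_digit(m):
    """Leading decimal digit of a positive integer."""
    return int(str(m)[0])


def independence_gap(a1, a2, n):
    """Joint frequency of (ld(2^k), ld(3^k)) = (a1, a2) minus the product of marginals."""
    both = first = second = 0
    power_of_2 = power_of_3 = 1
    for _ in range(n):
        power_of_2 *= 2
        power_of_3 *= 3
        hit_2 = leading_digit(power_of_2) == a1
        hit_3 = leading_digit(power_of_3) == a2
        first += hit_2
        second += hit_3
        both += hit_2 and hit_3
    return both / n - (first / n) * (second / n)


# The gap should be small for every pair of digits when n is large.
print(independence_gap(1, 1, 3000))
print(independence_gap(9, 2, 3000))
```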
<h3 id="further-references">Further references</h3>
<p>Further comments on Benford’s law
<a href="https://web.williams.edu/Mathematics/sjmiller/public_html/BrownClasses/197/benford/Benford_AVKontorovichSJMIller_Final06b_aa.pdf">https://web.williams.edu/Mathematics/sjmiller/public_html/BrownClasses/197/benford/Benford_AVKontorovichSJMIller_Final06b_aa.pdf</a></p>
<p>An excellent course in Ergodic Theory
<a href="https://personalpages.manchester.ac.uk/staff/charles.walkden/ergodic-theory/ergodic_theory.pdf">https://personalpages.manchester.ac.uk/staff/charles.walkden/ergodic-theory/ergodic_theory.pdf</a></p>Felipe Pérezfel.prz@gmail.comWe have seen in previous entries how multiple statistical results allow us to find patterns in randomness. Today we talk about a strange law of numbers, Benford’s law. This is an empirical law that explains the distribution of the leading digit of observed data in real life situations. We will then draw a parallel with the law of leading digits for the powers of 2 and how ergodic theory can help us understand these phenomena.The ergodic theorem2020-01-07T00:00:00-08:002020-01-07T00:00:00-08:00https://felperez.github.io/posts/2020/01/blog-post-22<p>In previous entries (see <a href="/posts/2019/05/blog-post-3/">here</a> and <a href="/-posts/2019/06/blog-post-10/">here</a>) we have seen how the weak and the strong law of large numbers give us asymptotics for the averages of i.i.d. sequences. While the assumption of independence allows us to apply this result in many different contexts, it is still a quite strong assumption. The ergodic theorem constitutes a generalization where we allow a certain degree of dependence between the random variables being considered.</p>
<p>If you like my content, consider following my <a href="https://www.linkedin.com/in/felperez/">linkedin</a> page to stay updated.</p>
<h3 id="dynamical-systems">Dynamical systems</h3>
<p>In order to formulate the ergodic theorem, we need some background from dynamical systems. Consider a probability space $(X,\mathcal{B},\mu)$, which we call the <strong>phase space</strong>. If our measure space is of finite measure, we can normalize and turn it into a probability space, while if the space is not of finite measure, most of the results we will talk about do not hold. We also consider a measurable transformation $T\colon X\to X$, which we call the <strong>dynamics</strong> of the space. We think of $T$ as a deterministic law of evolution of the phase space. In order to observe the phase space, we consider measurable functions $f\colon X\to\mathbb{R}$, which we call <strong>observables</strong> or <strong>potentials</strong>. We think of observables as internal characteristics of the space, such as its internal energy. We say that $T$ <strong>preserves</strong> $\mu$ if</p>
\[\mu(T^{-1}(A)) = \mu(A),\]
<p>for all $A\in\mathcal{B}$. This property is equivalent to the fact that for any integrable observable $f\in L^1(\mu)$, the time series $Z_n=f\circ T^n$ is identically distributed. We are interested in the averages of $f$ as we compose it with the iterates of $T$:</p>
\[\dfrac{1}{n}(f+f\circ T+\dots +f\circ T^{n-1})(x)\]
<p>for generic points $x\in X$.</p>
<p><strong>Example:</strong> consider $X=[0,1]$ equipped with the Borel sigma-algebra and the Lebesgue measure $m$ restricted to $X$. Consider also the transformation $T(x) = 2x \pmod 1$. Then $T$ preserves $m$.</p>
<p><img src="/files/doubling.png" alt="Doubling map" /></p>
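<p>For intervals, the invariance of $m$ under the doubling map can be verified by hand: the preimage of $[a,b]$ consists of two intervals of half the length each. The sketch below checks this with exact rational arithmetic; the function names are illustrative. (Simulating long orbits of the doubling map in floating point is not a reliable alternative, since every binary floating point number is mapped to $0$ after finitely many iterations.)</p>

```python
from fractions import Fraction


def doubling_preimage(a, b):
    """Preimage of [a, b] inside [0, 1] under T(x) = 2x mod 1, as two intervals."""
    return [(a / 2, b / 2), ((a + 1) / 2, (b + 1) / 2)]


def total_length(intervals):
    """Lebesgue measure of a union of disjoint intervals."""
    return sum(right - left for left, right in intervals)


a, b = Fraction(1, 5), Fraction(3, 4)
pre = doubling_preimage(a, b)
print(pre)                       # the two branches of the preimage
print(total_length(pre), b - a)  # both lengths agree: m(T^{-1}[a, b]) = m([a, b])
```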
<h3 id="ergodicity">Ergodicity</h3>
<p>In order to have a law of large numbers in the above context, we need to assume some irreducibility conditions. Consider the dynamics in the picture:</p>
<p><img src="/files/non-ergodic.png" alt="Non ergodic system" /></p>
<p>and assume that $T$ preserves the Lebesgue measure. Note that $T([0,1/2])\subset [0,1/2]$. Consider the observable $f (x)= 1_{[0,1/2]}(x)$, where $1_A$ is the indicator function of the set $A$. Suppose we have that the averages of $f\circ T^k$ converge to a constant $c$ almost surely. This would mean that for almost all points $x\in [0,1/2]$, we have that</p>
\[\dfrac{1}{n}(f+f\circ T+\dots +f\circ T^{n-1})(x)\to c.\]
<p>The left-hand side is the number of iterates $T^k(x)$, $0\leq k<n$, that lie in $[0,1/2]$, divided by $n$. But given $T([0,1/2])\subset [0,1/2]$, the left-hand side is equal to $1$ for all $x\in [0,1/2]$, hence $c$ must be equal to $1$. Similarly, if we take $f (x)= 1_{(1/2,1]}(x)$, the averages are zero for all $x\in[0,1/2]$, so $c$ must be equal to $0$. In both cases $\int f\,dm = 1/2$, so the averages cannot converge to the expected value of the observable.</p>
<p>From the example above we conclude that an <em>irreducibility</em> condition must be necessary in order for a LLN to hold in this context. It turns out that this is not only necessary but sufficient. Let’s formulate precisely the above condition: suppose that if $T^{-1}A = A$ for $A\in\mathcal{B}$, then $\mu(A) = 0$ or $\mu(A) = 1$. In that case we say that $\mu$ is <strong>ergodic</strong> with respect to $T$.</p>
<h3 id="the-ergodic-theorem">The ergodic theorem</h3>
<p>We now have all the elements to formulate the ergodic theorem.</p>
<p><strong>Theorem (Birkhoff):</strong> suppose that $T$ preserves the probability measure $\mu$, and that this is ergodic. For any integrable observable $f\in L^1(\mu)$, we have that</p>
\[\dfrac{1}{n}(f+f\circ T+\dots +f\circ T^{n-1})(x) \to \int_X f d\mu\]
<p>for $\mu$-almost every point $x$.</p>
<p>The proof of this result involves a lot of work and we postpone it for now. We now see how this generalizes the LLN: let $\{ Z_n\}$ be a sequence of i.i.d. integrable random variables on $(\Omega,\mathbb{P})$. Define $X:=\mathbb{R}^\mathbb{N}$ with the sigma-algebra generated by the cylinders and the dynamics $\sigma\colon X\to X$ given by $\sigma(x_1,x_2,\dots) = (x_2,x_3,\dots)$. We call this map the <strong>left shift</strong> on $X$. We can think of each realization of the sequence $\{Z_n\}$ as an element of $X$ given by $(Z_1(\omega),Z_2(\omega),\dots)$. The random variables $Z_n$ have a common distribution, which is a probability measure $\nu$ on $\mathbb{R}$. Consider the product measure $\mu=\nu^{\otimes\mathbb{N}}$ on $X$; the fact that the $Z_n$ are i.i.d. implies that $\mu$ is the distribution of the whole sequence and that $\sigma$ preserves $\mu$. Finally, consider the observable $f\colon X\to\mathbb{R}$ given by $f(x_1,\dots)=x_1$. Proving that $\mu$ is ergodic requires a bit more work: see <a href="https://math.stackexchange.com/questions/175369/how-follows-the-strong-law-of-large-numbers-from-birkhoffs-ergodic-theorem?rq=1">here</a>. Applying the ergodic theorem to $(X,\mathcal{B},\mu,\sigma,f)$ and noting that $f\circ\sigma^n(Z_1(\omega),Z_2(\omega),\dots) = Z_{n+1}(\omega)$ for $n\geq 0$ gives that</p>
\[\dfrac{1}{n}(Z_1+\dots + Z_n)\to \mathbb{E}(f) = \int_X f(x)d\mu(x)=\int_\Omega Z_1(\omega) d\mathbb{P}(\omega) =\mathbb{E}(Z_1)\]
<p>for $\mathbb{P}$ almost every realization as we wanted to prove.</p>
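<p>The identification of Birkhoff averages along the shift with sample means can be made concrete on a finite piece of a realization. In the sketch below a Python list stands in for the beginning of an infinite sequence, which is an assumption of this illustration, and all names are hypothetical.</p>

```python
from statistics import fmean


def shift(xs):
    """Left shift: drop the first coordinate of the sequence."""
    return xs[1:]


def first_coordinate(xs):
    """The observable f(x_1, x_2, ...) = x_1."""
    return xs[0]


def birkhoff_average(f, T, x, n):
    """(1/n) * (f(x) + f(T x) + ... + f(T^{n-1} x))."""
    total = 0.0
    for _ in range(n):
        total += f(x)
        x = T(x)
    return total / n


# A finite stand-in for one realization (Z_1, Z_2, ...).
z = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
print(birkhoff_average(first_coordinate, shift, z, 5))  # 2.8
print(fmean(z[:5]))                                     # 2.8, the sample mean of Z_1..Z_5
```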
<h3 id="applications-to-markov-chains">Applications to Markov chains</h3>
<p>One of the biggest applications of the ergodic theorem is proving that Markov chains have a LLN. In what follows we will consider only homogeneous Markov chains. Recall a stochastic process $\{Z_n\}$ on $(\Omega,\mathbb{P})$ taking values in a finite/countable set $\mathcal{S}$ is called a (discrete time) <strong>Markov chain</strong> if</p>
\[\mathbb{P}(Z_{n+1}= z | Z_n = z_n,\dots, Z_1 = z_1) = \mathbb{P}(Z_{n+1} = z | Z_n = z_n) = \mathbb{P}(Z_{2} = z | Z_1 = z_n)\]
<p>for all $z,z_n,\dots, z_1 \in \mathcal{S}$ and $n\geq 1$. We call $\mathcal{S}$ the space of <strong>states</strong>. The Markov chain is characterized by the probabilities $p_{i} = \mathbb{P}(Z_{1} = i)$ and $p_{ij} = \mathbb{P}(Z_{n+1} = j | Z_n = i)$ for all $i,j\in\mathcal{S}$ and $n\geq 1$. We denote the matrix with entries $p_{ij}$ by $P$ and we call it the matrix of <strong>transition probabilities</strong>, and the vector with entries $p_i$ by $p$, which we call the <strong>initial distribution</strong>. Before diving into the LLN, we will recall some properties of Markov chains.</p>
<p>The idea behind the model of Markov chains is that we jump from one state $i$ in $\mathcal{S}$ to another state $j$ at random, and the probability for that jump is given by $p_{ij}$. The dynamic properties of this process are interesting, and can be classified in terms of the matrix $P$. For instance, the probability that after $n$ steps the state is $i$ is given by $(p^T P^n)_i$.</p>
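<p>As an illustration of the formula $(p^T P^n)_i$, here is a minimal sketch in plain Python. The two-state matrix, the initial distribution and the function names are assumptions made up for this example, not taken from any library.</p>

```python
def step(dist, P):
    """One step of the chain: multiply the row vector dist by the matrix P."""
    size = len(P)
    return [sum(dist[i] * P[i][j] for i in range(size)) for j in range(size)]


def distribution_after(p, P, n):
    """Distribution of the state after n steps, that is, p^T P^n."""
    dist = list(p)
    for _ in range(n):
        dist = step(dist, P)
    return dist


# Hypothetical two-state chain; each row of P sums to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]
p = [1.0, 0.0]  # start in state 0 with probability 1

two_steps = distribution_after(p, P, 2)
many_steps = distribution_after(p, P, 200)

print(two_steps)
print(many_steps)
```

<p>For this particular matrix the distribution after many steps settles near $(5/6, 1/6)$, which is the vector $\pi$ solving $\pi^T P = \pi^T$.</p>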
<p>An important question about Markov chains is when two given states can be connected with non-zero probability. We say that a state $j$ is <strong>accessible</strong> from the state $i$ if there exists a sequence $(i_0 = i, i_1,\dots ,i_{n-1} , i_n = j)$ such that $p_{i_k i_{k+1}} > 0$ for all $0\leq k \leq n-1$. We say that $i$ and $j$ are <strong>connected</strong> if $i$ is accessible from $j$ and vice versa. Note that <em>being connected</em> is an equivalence relation, and so it partitions the space of states into classes, which we call <strong>communication classes</strong>. We say that the Markov chain is <strong>irreducible</strong> if there is only one communication class.</p>
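<p>For a finite chain, the definitions above can be turned into a short computation. The sketch below is only an illustration: the function names and the three-state matrix are hypothetical, and it assumes the convention that every state is accessible from itself through a path of length zero.</p>

```python
def accessible_from(P, i):
    """Set of states reachable from i through transitions of positive probability."""
    seen = {i}  # assumed convention: i is accessible from itself
    stack = [i]
    while stack:
        k = stack.pop()
        for j, prob in enumerate(P[k]):
            if prob > 0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen


def communication_classes(P):
    """Partition the states into classes of mutually accessible states."""
    reach = [accessible_from(P, i) for i in range(len(P))]
    classes = []
    assigned = set()
    for i in range(len(P)):
        if i in assigned:
            continue
        # j is connected to i when each one is accessible from the other.
        cls = sorted(j for j in reach[i] if i in reach[j])
        classes.append(cls)
        assigned.update(cls)
    return classes


def is_irreducible(P):
    """A chain is irreducible when there is a single communication class."""
    return len(communication_classes(P)) == 1


# Hypothetical chain: state 2 can reach states 0 and 1, but not the other way round.
P = [[0.5, 0.5, 0.0],
     [0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5]]

print(communication_classes(P))
print(is_irreducible(P))
```

<p>In this example the states $0$ and $1$ form one class and the state $2$ forms another, so the chain is not irreducible.</p>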
<p>The asymptotic properties we are interested in are <em>recurrence</em> properties. In order to give the proper definition, we need to define a relevant stopping time. For a state $i$, we define its <strong>return time</strong> $T_i = \inf\{ n \geq 1 : Z_{n+1} =i \}$. For a chain started at $Z_1 = i$, this is the number of steps it takes to return to the state $i$ for the first time. With this we can classify the states into two classes:</p>
<ol>
<li>If $\mathbb{P}(T_i < \infty | Z_1 = i) = 1$, we call $i$ <strong>recurrent</strong>.</li>
<li>If $\mathbb{P}(T_i < \infty | Z_1 = i) < 1$, we call $i$ <strong>transient</strong>.</li>
</ol>
<p>In other words, paths either return to their initial state with probability $1$, or there is a positive probability that they never return. In the case that they almost surely return, we can further divide the recurrent states into two classes:</p>
<ol>
<li>A state is <strong>positive recurrent</strong> if $\mathbb{E}(T_i | Z_{1}=i) < \infty$.</li>
<li>A state is <strong>null recurrent</strong> if $\mathbb{E}(T_i | Z_{1}=i) =\infty$.</li>
</ol>
<p>The states in a given communication class are either all recurrent or all transient. If a communication class is recurrent, then its elements are either all null recurrent or all positive recurrent. We can now state the ergodic theorem in the setting of Markov chains:</p>
<p><strong>Theorem:</strong> Let $\{Z_n\}$ be an irreducible Markov chain. Let $V_i(n)$ be the number of visits of the chain to the state $i$ in the first $n$ steps. Then</p>
\[\mathbb{P}\left\{ \dfrac{V_i(n)}{n} \to \dfrac{1}{\mathbb{E}(T_i| Z_1 = i)} \Big| Z_1 = i \right\} = 1\]
<p>If $\mathbb{E}(T_i| Z_1 = i) = \infty$, we understand $\mathbb{E}(T_i| Z_1 = i)^{-1}=0$.</p>
<p>This is essentially an equidistribution result: the long-run fraction of time that the Markov chain spends in a state is the inverse of the expected return time to that state. In other words, the longer it takes the chain to return, the less time it spends in that state. Null recurrent states take so long to be revisited that the chain spends asymptotically no time there. This theorem can be extended to functions on the Markov chain. The proof from the ergodic theorem follows a similar line to what we observed above for i.i.d. sequences: we form a space of sequences, construct an invariant ergodic measure and apply the ergodic theorem to indicator functions.</p>
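<p>The statement of the theorem can be explored with a small simulation. This is a sketch under assumptions: the two-state transition matrix, the seed, the path length and the function names are all hypothetical choices for illustration. It compares the fraction of time spent in a state with the inverse of the empirical mean return time.</p>

```python
import random


def simulate(P, start, n, rng):
    """Sample a path of length n of the chain with transition matrix P."""
    states = range(len(P))
    path = [start]
    state = start
    for _ in range(n - 1):
        # Jump to the next state using the row of P for the current state.
        state = rng.choices(states, weights=P[state])[0]
        path.append(state)
    return path


def visit_frequency(path, i):
    """V_i(n) / n: fraction of the path spent in state i."""
    return path.count(i) / len(path)


def mean_return_time(path, i):
    """Empirical average of the gaps between consecutive visits to i."""
    times = [t for t, s in enumerate(path) if s == i]
    gaps = [b - a for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps)


rng = random.Random(0)  # arbitrary seed, for reproducibility only
P = [[0.9, 0.1],
     [0.5, 0.5]]
path = simulate(P, 0, 100_000, rng)

freq = visit_frequency(path, 0)
inv_return = 1 / mean_return_time(path, 0)

print(freq, inv_return)
```

<p>For this matrix both quantities should be close to $5/6$, so the expected return time to the state $0$ is about $6/5$ steps.</p>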
<p>More about Markov chains can be read <a href="http://web.math.ku.dk/noter/filer/stoknoter.pdf">here</a>.</p>