
Statistics

Link to lectures

Instructor: Prof. Philippe Rigollet

ROUGH NOTES (!)

Lec-1 [Slides]

Idea: Use data to get insight and perhaps make decisions.
Computational view: Data is a (large) sequence of numbers that needs to be processed by a fast algorithm (approximate nearest neighbours, low dimensional embeddings, etc)
Statistical view: Data comes from a random process. Need to understand more about the process from given data.

A Silly Example: [This paper]

Say we want to understand ${ p },$ the proportion (globally) of couples that turn their head to the right when kissing.
Here is a statistical experiment:

  • Randomly observe some ${ n }$ kissing couples
  • For each ${ 1 \leq i \leq n }$ define ${ R _i = 1 }$ if the ${ i }$th couple turns to the right and ${ R _i = 0 }$ otherwise (We think of ${ R _i }$s as random variables)
  • An estimator of ${ p }$ is ${ \hat p = \frac{1}{n} (R _1 + \ldots + R _n) }$ (Even ${ \hat p }$ is a random variable)

One question is: How accurately does ${ \hat p }$ estimate ${ p }$ ?
Assuming independence, ${ R _1, \ldots, R _n \overset{i.i.d}{\sim} \text{Ber}(p) }.$ From the LLN, ${ \hat{p} = \overline{R _n} \overset{a.s}{\to} p }$ as ${ n \to \infty }.$
From the CLT, ${ \sqrt{n} \left( \frac{\overline{R _n} - p}{\sqrt{p(1-p)}} \right) \overset{(d)}{\to} N(0,1) }$ as ${ n \to \infty }.$ Now from the LLN above and Slutsky, ${ \sqrt{n} \left( \frac{\overline{R _n} - p}{\sqrt{\hat{p}(1-\hat{p})}} \right) \overset{(d)}{\to} N(0,1) }$ as ${ n \to \infty }.$
So if ${ x \gt 0 },$ assuming large ${ n }$ throughout, we have ${ \mathbb{P}(\vert \overline{R _n} - p \vert \geq x ) }$ ${ = \mathbb{P} \left( \sqrt{n} \frac{\vert \overline{R _n} - p \vert}{\sqrt{\hat{p}(1-\hat{p})}} \geq \sqrt{n} \frac{x}{\sqrt{\hat{p}(1-\hat{p})}} \right) }$ ${ \approx \mathbb{P} _{Z \sim N(0,1) } \left( \vert Z \vert \geq \sqrt{n} \frac{x}{\sqrt{\hat{p}(1-\hat{p})}} \right). }$
If ${ x }$ is such that ${ \sqrt{n} \frac{x}{\sqrt{\hat{p}(1-\hat{p})}} = q _{0.025} },$ the RHS is ${ 0.05 }$ and hence ${ \vert \overline{R _n} - p \vert \geq x }$ occurs with probability about ${ 0.05. }$ So ${ p }$ lies in the interval ${ \overline{R _n} \pm q _{0.025} \sqrt{ \frac{\hat{p}(1-\hat{p})}{n} } }$ with probability about ${ 0.95 }.$
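
A quick numerical sanity check of this coverage claim (not from the lecture; it assumes a hypothetical true value ${ p = 0.6 }$ and uses ${ n = 124 }$ observations per experiment):

```python
import numpy as np

# Sketch (not from the lecture): empirical coverage of the asymptotic 95% interval
# R_bar_n +/- q_{0.025} * sqrt(p_hat (1 - p_hat) / n), for a hypothetical true p = 0.6.
rng = np.random.default_rng(0)
p_true, n, trials = 0.6, 124, 10_000
q = 1.96  # q_{0.025} for N(0,1)

covered = 0
for _ in range(trials):
    R = rng.binomial(1, p_true, size=n)       # R_1, ..., R_n ~ Ber(p_true)
    p_hat = R.mean()                          # realisation of R_bar_n
    half_width = q * np.sqrt(p_hat * (1 - p_hat) / n)
    covered += (p_hat - half_width <= p_true <= p_hat + half_width)

print(f"empirical coverage ~ {covered / trials:.3f}")   # should be close to 0.95
```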

Since ${ \hat{p} (1-\hat{p}) \leq \frac{1}{4} }$ there is also a more conservative interval ${ \overline{R _n} \pm q _{0.025} \frac{1}{2\sqrt{n}}. }$

In the paper, ${ n = 124 }$ and the random variable ${ \hat{p} }$ realises as ${ \hat{p} = \frac{80}{124} \approx 0.645 }.$ So the ${ 95 \% }$ asymptotic confidence interval for ${ p }$ (this is a random interval) realises as ${ 0.645 \pm 1.96(\frac{1}{2 \times 11.14}) }$ ${ \approx 0.645 \pm 0.088 }$ i.e. roughly ${ [0.56 , 0.73] }.$
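
The same numbers as a minimal computation (both the plug-in and the conservative half-widths):

```python
import numpy as np

# Minimal sketch reproducing the numbers above: n = 124 couples, 80 turning right.
n, right = 124, 80
p_hat = right / n                                 # ~ 0.645
q = 1.96                                          # q_{0.025} for N(0,1)

plug_in = q * np.sqrt(p_hat * (1 - p_hat) / n)    # half-width using p_hat (1 - p_hat)
conservative = q / (2 * np.sqrt(n))               # half-width using p(1-p) <= 1/4

print(f"p_hat = {p_hat:.3f}")
print(f"plug-in CI:      [{p_hat - plug_in:.3f}, {p_hat + plug_in:.3f}]")
print(f"conservative CI: [{p_hat - conservative:.3f}, {p_hat + conservative:.3f}]")
# the conservative interval comes out to roughly [0.56, 0.73], as above
```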


Lec-2 [Slides]

Let the observed outcome of a statistical experiment be a sample ${ X _1, \ldots, X _n }$ of i.i.d random variables taking values in some measurable space ${ E \subseteq \mathbb{R} }.$ Let ${ \mathbb{P} }$ denote their common distribution.
Now a statistical model associated to that statistical experiment is a pair ${ (E, (\mathbb{P} _{\theta}) _{\theta \in \Theta}) }$ where ${ (\mathbb{P} _{\theta}) _{\theta \in \Theta} }$ is a family of probability measures on ${ E }$ (and ${ \Theta }$ is any set).
We assume ${ \Theta \subseteq \mathbb{R} ^d }$ for some ${ d \geq 1 }$ (i.e. the model is parametric), and that there is a ${ \theta \in \Theta }$ for which ${ \mathbb{P} _{\theta} = \mathbb{P} }$ (i.e. the model is well specified).
In the above setup, the parameter ${ \theta }$ is called identifiable if ${ \theta \mapsto \mathbb{P} _{\theta} }$ is injective.

Let ${ (E, (\mathbb{P} _{\theta}) _{\theta \in \Theta}) }$ be a statistical model based on observations/sample ${ X _1, \ldots, X _n }.$
A statistic is any measurable function of the sample (like ${ \overline{X _n}, \log(1+ \vert X _n \vert), }$ etc). A statistic whose expression doesn’t depend on the unknown ${ \theta }$ is called an estimator of ${ \theta }.$ An estimator ${ \hat{\theta} _n }$ of ${ \theta }$ is consistent if ${ \hat{\theta} _n \overset{(p)}{\to} \theta }$ as ${ n \to \infty }.$ (So in the kiss example, ${ \hat{p} = \overline{X _n} }$ is a consistent estimator of ${ p }$). Motivated by the CLT, an estimator ${ \hat{\theta} _n }$ of ${ \theta }$ is called asymptotically normal if ${ \sqrt{n} (\frac{\hat{\theta} _n - \theta}{\sigma}) \overset{(d)}{\to} N(0,1) }$ as ${ n \to \infty ,}$ and in this case ${ \sigma ^2 }$ is called the asymptotic variance of ${ \hat{\theta} _n }.$ For an estimator ${ \hat{\theta} _n }$ of ${ \theta },$ its bias is ${ \text{bias}(\hat{\theta} _n) = \mathbb{E}[\hat{\theta} _n] - \theta }$ and its quadratic risk is ${ R(\hat{\theta} _n) = \mathbb{E} [\vert \hat{\theta} _n - \theta \vert ^2] }$ ${ (= \text{var}(\hat{\theta} _n) + \text{bias}(\hat{\theta} _n) ^2 ).}$
Let ${ \alpha \in (0,1) }.$ A confidence interval (CI) of asymptotic level ${ 1-\alpha }$ for ${ \theta }$ is a random interval ${ \mathcal{I} }$ (depending on ${ X _1, \ldots, X _n }$ and not on ${ \theta }$) for which ${ \lim _{n \to \infty} \mathbb{P} _{\theta} (\mathcal{I} \ni \theta) \geq 1-\alpha }$ for all ${ \theta \in \Theta }.$
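
A minimal sketch (hypothetical numbers, not from the lecture) checking the decomposition ${ R(\hat{\theta} _n) = \text{var}(\hat{\theta} _n) + \text{bias}(\hat{\theta} _n) ^2 }$ by Monte Carlo, for the estimator ${ \hat{p} = \overline{X _n} }$ in the ${ \text{Ber}(p) }$ model:

```python
import numpy as np

# Sketch: Monte Carlo check of R(theta_hat) = var(theta_hat) + bias(theta_hat)^2
# for p_hat = X_bar_n with X_i ~ Ber(p). Here p = 0.3, n = 50 are arbitrary choices.
rng = np.random.default_rng(1)
p, n, trials = 0.3, 50, 200_000

p_hats = rng.binomial(n, p, size=trials) / n   # each entry is one realisation of X_bar_n

bias = p_hats.mean() - p
var = p_hats.var()
risk = np.mean((p_hats - p) ** 2)

print(f"bias ~ {bias:.5f}  (theory: 0)")
print(f"var  ~ {var:.5f}  (theory: p(1-p)/n = {p * (1 - p) / n:.5f})")
print(f"risk ~ {risk:.5f}  ~ var + bias^2 = {var + bias ** 2:.5f}")
```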

The delta method lets one consider transformations of asymptotically normal random variables:
Let ${ (Z _n) }$ be a sequence of random variables satisfying ${ \sqrt{n}(Z _n - \theta) \overset{(d)}{\to} N(0, \sigma ^2) }$ for some ${ \theta \in \mathbb{R} }$ and ${ \sigma ^2 \gt 0 }.$ Let ${ g : \mathbb{R} \to \mathbb{R} }$ be ${ \mathcal{C} ^1 }$ in a neighbourhood of ${ \theta }.$ Then ${ \sqrt{n} (g(Z _n) - g(\theta)) \overset{(d)}{\to} N(0, (g'(\theta)) ^2 \sigma ^2 ) }.$

Sketch: ${ g(Z _n) = g(\theta) + g'(\overline{\theta}) (Z _n - \theta) }$ for some (random variable) ${ \overline{\theta} }$ between ${ Z _n }$ and ${ \theta }.$ So ${ \sqrt{n} (g(Z _n) - g(\theta)) = \underbrace{g'(\overline{\theta})} _{\to g'(\theta)} \underbrace{ \sqrt{n} (Z _n - \theta) } _{\to N(0, \sigma ^2)}, }$ where ${ g'(\overline{\theta}) \overset{(p)}{\to} g'(\theta) }$ since ${ Z _n \overset{(p)}{\to} \theta }$ and ${ g' }$ is continuous near ${ \theta },$ and Slutsky gives the claimed limit.
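
A numerical check of the delta method, with my own hypothetical choice ${ g(x) = x ^2 }$ and ${ Z _n = \overline{X _n} }$ for ${ X _i \sim \text{Ber}(p) }$ (so ${ \sigma ^2 = p(1-p) }$ and the predicted limit variance is ${ (2p) ^2 p(1-p) }$):

```python
import numpy as np

# Sketch (hypothetical choice of g): delta method for Z_n = X_bar_n, X_i ~ Ber(p),
# so theta = p, sigma^2 = p(1-p), and g(x) = x^2 with g'(theta) = 2p.
rng = np.random.default_rng(2)
p, n, trials = 0.4, 2_000, 100_000

z_n = rng.binomial(n, p, size=trials) / n      # realisations of X_bar_n
lhs = np.sqrt(n) * (z_n ** 2 - p ** 2)         # sqrt(n) (g(Z_n) - g(theta))

predicted_var = (2 * p) ** 2 * p * (1 - p)     # (g'(theta))^2 sigma^2
print(f"sample variance  ~ {lhs.var():.4f}")
print(f"delta-method var = {predicted_var:.4f}")
```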


Lec-3 [Slides]

The total variation distance between two probability measures ${ \mathbb{P} _{\theta } }$ and ${ \mathbb{P} _{\theta '} }$ on ${ E }$ is ${ \text{TV}(\mathbb{P} _{\theta}, \mathbb{P} _{\theta '}) := \max _{A \subseteq E} \vert \mathbb{P} _{\theta} (A) - \mathbb{P} _{\theta '} (A) \vert. }$ (Subtleties involving measurability are ignored for now). If the measures are discrete, it turns out that ${ \text{TV}(\mathbb{P} _{\theta}, \mathbb{P} _{\theta '}) = \frac{1}{2} \sum _{x \in E} \vert p _{\theta} (x) - p _{\theta '} (x) \vert }$ where ${ p _{\theta}, p _{\theta '} }$ are the respective PMFs. If the measures are continuous, ${ \text{TV}(\mathbb{P} _{\theta}, \mathbb{P} _{\theta '}) = \frac{1}{2} \int _{E} \vert f _{\theta} (x) - f _{\theta '} (x) \vert dx }$ where ${ f _{\theta}, f _{\theta '} }$ are the respective PDFs. Total variation is a metric on the space of probability measures.
The Kullback-Leibler divergence between two probability measures ${ \mathbb{P} _{\theta} }$ and ${ \mathbb{P} _{\theta '} }$ on ${ E },$ namely ${ \text{KL}(\mathbb{P} _{\theta}, \mathbb{P} _{\theta '} ) },$ is given by ${ \sum _{x \in E} p _{\theta} (x) \log( \frac{p _{\theta} (x)}{p _{\theta '} (x)} ) }$ if ${ E }$ is discrete and ${ \int _E f _{\theta} (x) \log(\frac{f _{\theta} (x)}{f _{\theta '} (x)} )dx }$ if ${ E }$ is continuous. KL divergence is not a metric on the space of probability measures; for example it is asymmetric, ${ \text{KL}(\mathbb{P} _{\theta}, \mathbb{P} _{\theta '}) \neq \text{KL}(\mathbb{P} _{\theta '}, \mathbb{P} _{\theta} ) }$ in general.
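
A small sketch computing both quantities for two discrete laws, ${ \text{Ber}(0.5) }$ and ${ \text{Ber}(0.8) }$ (my own illustrative choice), which also shows the asymmetry of KL:

```python
import numpy as np

# Sketch: TV distance and KL divergence between Ber(0.5) and Ber(0.8) on E = {0, 1}.
def tv(p, q):
    """Total variation between two PMFs given as arrays over the same finite E."""
    return 0.5 * np.abs(p - q).sum()

def kl(p, q):
    """KL(p, q) = sum_x p(x) log(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

ber = lambda t: np.array([1 - t, t])   # PMF of Ber(t)

print(f"TV(Ber(0.5), Ber(0.8)) = {tv(ber(0.5), ber(0.8)):.3f}")   # = 0.3
print(f"KL(Ber(0.5), Ber(0.8)) = {kl(ber(0.5), ber(0.8)):.3f}")
print(f"KL(Ber(0.8), Ber(0.5)) = {kl(ber(0.8), ber(0.5)):.3f}")   # differs: not symmetric
```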

Let ${ (E, (\mathbb{P} _{\theta}) _{\theta \in \Theta}) }$ be a statistical model based on sample ${ X _1, \ldots, X _n }.$ One assumes there is a true parameter ${ \theta ^{\ast} \in \Theta }$ with ${ X _1 \sim \mathbb{P} _{\theta ^{\ast} } }$. A goal is, given ${ X _1, \ldots, X _n },$ to find an estimator ${ \hat{\theta} = \hat{\theta}(X _1, \ldots , X _n) }$ for which ${ \mathbb{P} _{\hat{\theta}} }$ is “close” to ${ \mathbb{P} _{\theta ^{\ast}} }.$
Assume ${ E }$ is discrete for now. For any ${ \theta \in \Theta },$ a measure of separation from the true distribution is ${ \text{KL}(\mathbb{P} _{\theta ^{\ast}} , \mathbb{P} _{\theta}) }$ ${ = \mathbb{E} _{\theta ^{\ast}} [\log(\frac{p _{\theta ^{\ast}} (X)}{p _{\theta} (X)} )] }$ ${ = \underbrace{\mathbb{E} _{\theta ^{\ast}} [\log p _{\theta ^{\ast}} (X) ]} _{\text{constant, } C(\theta ^{\ast}) } - \mathbb{E} _{\theta ^{\ast}} [\log p _{\theta} (X) ] }$ (where ${ X \sim \mathbb{P} _{\theta ^{\ast}} }$). This gives an estimator ${ \widehat{\text{KL}}(\mathbb{P} _{\theta ^{\ast}}, \mathbb{P} _{\theta}) = C(\theta ^{\ast}) - \frac{1}{n} \sum _{1} ^{n} \log p _{\theta} (X _i) }$ ${ = C(\theta ^{\ast}) - \frac{1}{n} \log(\prod _1 ^n p _{\theta} (X _i) ) }$ of ${ \text{KL}(\mathbb{P} _{\theta ^{\ast}} , \mathbb{P} _{\theta}) }.$ Since we want ${ \hat{\theta}(X _1, \ldots, X _n) }$ such that ${ \text{KL}(\mathbb{P} _{\theta ^{\ast}}, \mathbb{P} _{\hat{\theta}}) }$ is small, we can set ${ \hat{\theta} }$ to be a minimiser of ${ \theta \mapsto \widehat{\text{KL}}(\mathbb{P} _{\theta ^{\ast}}, \mathbb{P} _{\theta}) }.$ This minimisation occurs exactly when ${ \prod _{1} ^{n} p _{\theta} (X _i) }$ is maximised. The case of continuous ${ E }$ is similar, where one maximises ${ \prod _1 ^n f _{\theta} (X _i) }.$ So one considers the following.

Let ${ (E, (\mathbb{P} _{\theta}) _{\theta \in \Theta}) }$ be a statistical model based on sample ${ X _1, \ldots, X _n },$ with true parameter ${ \theta ^{\ast} }.$ The likelihood of the model is defined as follows: If ${ E }$ is discrete, it is the map ${ L _n : E ^n \times \Theta \to \mathbb{R} }$ sending ${ (x _1, \ldots, x _n; \theta) \mapsto \prod _{1} ^{n} p _{\theta} (x _i) }.$ If ${ E }$ is continuous, it is the map ${ L _n : E ^n \times \Theta \to \mathbb{R} }$ sending ${ (x _1, \ldots, x _n; \theta) \mapsto \prod _{1} ^{n} f _{\theta} (x _i) }.$
The maximum likelihood estimator of ${ \theta }$ is ${ \hat{\theta} ^{\text{MLE}} _{n} = \text{argmax} _{\theta \in \Theta} L _n(X _1, \ldots, X _n ; \theta) .}$ (One maximises the log-likelihood ${ \log L _n (X _1, \ldots, X _n ; \theta) }$ in practice). It turns out that when ${ \Theta \subseteq \mathbb{R} }$ and mild regularity conditions hold, we have consistency ${ \hat{\theta} ^{\text{MLE}} _n \overset{(p)}{\to} \theta ^{\ast}, }$ and asymptotic normality ${ \sqrt{n} (\hat{\theta} ^{\text{MLE}} _n - \theta ^{\ast}) \overset{(d)}{\to} N(0, \mathcal{I}(\theta ^{\ast} ) ^{-1} ) }$ where the Fisher information is ${ \mathcal{I}(\theta) := \text{var}[\ell ' (\theta)] = -\mathbb{E}[\ell '' (\theta)] }$ and the random variable ${ \ell(\theta) }$ is given by ${ \ell(\theta) := \log L _1 (X _1; \theta) }.$
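
A minimal sketch of the MLE in the ${ \text{Ber}(p) }$ model (here the maximiser is known in closed form, ${ \hat{p} ^{\text{MLE}} _n = \overline{X _n} }$, and ${ \mathcal{I}(p) = \frac{1}{p(1-p)} }$; the grid search below only illustrates the definition):

```python
import numpy as np

# Sketch: MLE for Ber(p), where log L_n(x; p) = sum_i [x_i log p + (1 - x_i) log(1 - p)].
# p_true = 0.7 and n = 1000 are arbitrary illustrative choices.
rng = np.random.default_rng(3)
p_true, n = 0.7, 1_000
x = rng.binomial(1, p_true, size=n)

def log_likelihood(p, x):
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

grid = np.linspace(0.01, 0.99, 981)                       # crude grid over (0, 1)
p_mle = grid[np.argmax([log_likelihood(p, x) for p in grid])]

print(f"grid MLE ~ {p_mle:.3f},  closed form X_bar_n = {x.mean():.3f}")
print(f"asymptotic variance 1/I(p_true) = p(1-p) = {p_true * (1 - p_true):.3f}")
```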

Let ${ (E, (\mathbb{P} _{\theta}) _{\theta \in \Theta} ) , \Theta \subseteq \mathbb{R} ^d }$ be a statistical model based on sample ${ X _1, \ldots, X _n },$ with true parameter ${ \theta ^{\ast} }.$ Now for ${ 1 \leq k \leq d }$ one has population moments ${ m _k (\theta) = \mathbb{E} _{\theta} [X _{1} ^{k}] }$ and sample moments ${ \hat{m} _k = \frac{1}{n} \sum _{i=1} ^{n} X _i ^k }.$ From the LLN, ${ (\hat{m} _1, \ldots, \hat{m} _d) \overset{a.s}{\to} (m _1 (\theta ^{\ast}), \ldots , m _d (\theta ^{\ast}) ) }$ as ${ n \to \infty }.$
Suppose the moment map ${ M : \Theta \to \mathbb{R} ^d }$ sending ${ \theta \mapsto (m _1 (\theta), \ldots, m _d (\theta)) }$ is injective. Then the true parameter is ${ \theta ^{\ast} = M ^{-1} (m _1 (\theta ^{\ast}), \ldots , m _d (\theta ^{\ast}) ), }$ and the LLN above suggests it can be estimated by ${ \hat{\theta} ^{\text{MM}} _n := M ^{-1} (\hat{m} _1, \ldots, \hat{m} _d) }$ provided it exists.
( ${ \hat{\theta} ^{\text{MM}} _n }$ is called the moments estimator for ${ \theta }$ ).
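
A sketch of the moments estimator in a hypothetical ${ N(\mu, \sigma ^2) }$ model (my own choice, not from the notes), where ${ d = 2 }$, ${ m _1 (\theta) = \mu }$, ${ m _2 (\theta) = \mu ^2 + \sigma ^2 }$, so ${ M }$ is inverted by ${ \mu = m _1 }$ and ${ \sigma ^2 = m _2 - m _1 ^2 }$:

```python
import numpy as np

# Sketch (hypothetical model): method of moments for N(mu, sigma^2), d = 2.
# m_1(theta) = mu and m_2(theta) = mu^2 + sigma^2, inverted below.
rng = np.random.default_rng(4)
mu_true, sigma_true, n = 2.0, 1.5, 5_000
x = rng.normal(mu_true, sigma_true, size=n)

m1_hat = np.mean(x)           # sample first moment
m2_hat = np.mean(x ** 2)      # sample second moment

mu_mm = m1_hat                # M^{-1} applied to the sample moments
sigma2_mm = m2_hat - m1_hat ** 2

print(f"mu_MM ~ {mu_mm:.3f}        (true {mu_true})")
print(f"sigma^2_MM ~ {sigma2_mm:.3f}   (true {sigma_true ** 2})")
```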
