Ref: Prof. Vershynin’s handwritten notes
Lec-1:
Consider the problem of numerically computing the integral of a function ${ f : [0,1] ^d \to \mathbb{R} }.$
Breaking ${ [0,1] ^d }$ into axis-aligned cubes of width ${ \epsilon },$ there are about ${ N \approx (\frac{1}{\epsilon}) ^{d} }$ such smaller cubes. Now the integral ${ \int _{[0,1] ^d} f(x) dx \approx \frac{1}{N} \sum _{1} ^{N} f(x _i) }$ where ${ x _i }$ is a point chosen from the ${ i }$-th smaller cube.
In the Monte-Carlo method, instead of taking the ${ x _i }$s from the smaller cubes, one takes random variables ${ X _1, \ldots, X _N \stackrel{iid}{\sim} \text{Unif}[0,1] ^d },$ and the random variable ${ \frac{1}{N} \sum _{1} ^{N} f(X _i) }$ approximates (i.e. is an estimator of) ${ \int _{[0,1] ^d} f (x) dx }.$
Notice expectation ${ \mathbb{E}[\frac{1}{N} \sum _{1} ^{N} f (X _i)] }$ ${ = \mathbb{E}[f(X)] = \int _{[0,1] ^d} f(x) dx }$ where ${ X \sim \text{Unif}[0,1] ^{d} }.$ Also, the mean squared error ${ \mathbb{E}[ \left( \frac{1}{N} \sum _{1} ^{N} f(X _i) - \int _{[0,1] ^d} f(x) dx \right) ^2] }$ ${ = \text{var} \left( \frac{1}{N} \sum _{1} ^{N} f (X _i) \right) }$ ${ = \frac{1}{N ^2} \text{var}(\sum _{1} ^{N} f (X _i) ) }$ ${ = \frac{1}{N} \text{var}(f(X)) },$ which is ${ \leq \frac{1}{N} \int f ^{2} (x) dx }$ ${ \leq \frac{1}{N} \lVert f \rVert _{\infty} ^{2} }.$ So standard deviation (root of mean squared error) of the estimator is ${ O(\frac{1}{\sqrt{N}}) }.$
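A minimal numpy sketch of this (the integrand ${ f(x) = \prod _j x _j }$ is an illustrative assumption, not from the notes): the Monte-Carlo estimate is unbiased and its error shrinks like ${ \frac{1}{\sqrt{N}} }$ regardless of ${ d }.$

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 5, 100_000

# Illustrative integrand (an assumption): f(x) = prod_j x_j, whose exact
# integral over [0,1]^d is (1/2)^d, so the error is easy to check.
def f(x):                      # x has shape (N, d)
    return x.prod(axis=1)

X = rng.uniform(size=(N, d))   # X_1, ..., X_N iid ~ Unif[0,1]^d
estimate = f(X).mean()         # (1/N) sum_i f(X_i)

print("MC estimate:", estimate)
print("exact value:", 0.5 ** d)
print("predicted std dev O(1/sqrt(N)):", f(X).std() / np.sqrt(N))
```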
Lec-2:
A set ${ T \subseteq \mathbb{R} ^d }$ is convex if for all ${ x, y \in T }$ the segment ${ [x,y] = \lbrace x + t(y-x) : t \in [0,1] \rbrace \subseteq T }.$
Intersection of any family of convex sets is convex. For ${ T \subseteq \mathbb{R} ^d }$ its convex hull is ${ \text{conv}(T) = \cap \lbrace \text{convex sets containing } T \rbrace }$ (the smallest convex set containing ${ T }$).
Obs: Let ${ T \subseteq \mathbb{R} ^d }.$ Now ${ \text{conv}(T) = \lbrace \text{convex combinations of } z _1, \ldots, z _n \in T \text{ for } n \in \mathbb{Z} _{> 0} \rbrace }$
(A convex combination of ${ z _1, \ldots, z _n \in \mathbb{R} ^d }$ is an expression of the form ${ \sum _{1} ^{n} \lambda _i z _i }$ where ${ \lambda _i \geq 0, \sum _{1} ^{n} \lambda _i = 1 }$)
Sketch: For the ${ T }$ finite case, that ${ \text{conv}(\lbrace p _1, \ldots, p _n \rbrace) = \lbrace \sum _{1} ^{n} \lambda _i p _i : \text{each } \lambda _i \geq 0, \sum _{1} ^{n} \lambda _i = 1 \rbrace }$ is shown inductively.
Thm [Caratheodory]: Let ${ T \subseteq \mathbb{R} ^d },$ and ${ z \in \text{conv}(T) }.$ Consider the smallest ${ m }$ such that ${ z }$ can be written as a convex combination of some ${ m }$ points in ${ T }.$ Now ${ m \leq d + 1 }.$
Pf: Since ${ z \in \text{conv}(T) }$ it can be written as ${ z = \sum _{i=1} ^{n} \lambda _i z _i }$ with each ${ \lambda _i \geq 0 ,}$ ${ \sum _{i=1} ^{n} \lambda _i = 1 }.$ If ${ n \leq d +1 }$ we are done, so suppose ${ n > d + 1 }.$ So ${ z _2 - z _1, \ldots, z _n - z _1 \in \mathbb{R} ^d }$ are linearly dependent. So ${ \sum _{i=2} ^{n} \beta _i (z _i - z _1) = 0 }$ for some ${ \beta _i }$s not all ${ 0 }.$ Equivalently ${ \sum _{i=1} ^{n} \gamma _i z _i = 0 }$ with ${ \sum _{i=1} ^{n} \gamma _i = 0 }$ and not all ${ \gamma _i }$ zero. The equations we now have are \({ z = \sum _{i=1} ^{n} \lambda _i z _i , \lambda _i \geq 0, \sum _{1} ^{n} \lambda _i = 1, }\) \({ \sum _{i =1} ^{n} \gamma _i z _i = 0, \sum _{i=1} ^{n} \gamma _i = 0, \text{ and not all } \gamma _i \text{ are } 0 }.\)
Focus on the indices ${ \mathscr{I} = \lbrace i \in \lbrace 1 , \ldots, n \rbrace : \gamma _i > 0 \rbrace },$ a nonempty set. The quantity ${ \lambda = \min _{i \in \mathscr{I}} \frac{\lambda _i}{\gamma _i} }$ lets us write ${ z = \sum _{i =1} ^{n} (\lambda _i - \lambda \gamma _i) z _i ,}$ and notice this is a convex combination as well: the coefficients still sum to ${ 1 }$ since ${ \sum _{i=1} ^{n} \gamma _i = 0 },$ and they stay nonnegative by the choice of ${ \lambda }$ (for ${ i \notin \mathscr{I} }$ we have ${ \gamma _i \leq 0 }$). This new convex combination has at least one coefficient ${ 0 }.$ So roughly, we showed whenever ${ z }$ is a convex combination of ${ n > d + 1 }$ points we can write it as a convex combination of one fewer point, as needed.
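Below is a small numpy sketch of this reduction step (the points and weights are made-up toy data): a null vector of the ${ (d+1) \times n }$ matrix with columns ${ \binom{z _i}{1} }$ plays the role of ${ \gamma },$ and repeating the step leaves at most ${ d+1 }$ points.

```python
import numpy as np

def caratheodory_reduce(points, weights, tol=1e-12):
    """One reduction step from the proof: if n > d+1, rewrite the same
    convex combination of the rows of `points` using at least one fewer point."""
    n, d = points.shape
    if n <= d + 1:
        return points, weights
    # gamma with sum_i gamma_i z_i = 0, sum_i gamma_i = 0, gamma != 0:
    # a null vector of the (d+1) x n matrix [z_i ; 1] (exists since n > d+1).
    A = np.vstack([points.T, np.ones(n)])
    gamma = np.linalg.svd(A)[2][-1]
    if gamma.max() <= 0:
        gamma = -gamma                       # ensure some gamma_i > 0
    pos = gamma > tol
    lam = np.min(weights[pos] / gamma[pos])  # lambda = min_{gamma_i > 0} lambda_i / gamma_i
    new_w = weights - lam * gamma            # still >= 0 and still sums to 1
    keep = new_w > tol                       # at least one coefficient drops to 0
    return points[keep], new_w[keep]

# Toy usage (assumed data): 10 random points in R^2, z a convex combination of them.
rng = np.random.default_rng(1)
P, w = rng.normal(size=(10, 2)), rng.dirichlet(np.ones(10))
z = w @ P
while len(P) > 3:                            # d + 1 = 3 in R^2
    P, w = caratheodory_reduce(P, w)
print(len(P), np.allclose(w @ P, z))         # at most 3 points, same z
```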
Thm [Approx Caratheodory]: Let ${ T \subseteq \mathbb{R} ^{d} }$ with ${ \text{diam}(T) < \infty }.$ Then for all ${ z \in \text{conv}(T) }$ and ${ k \in \mathbb{Z} _{> 0} ,}$ there exist ${ z _1, \ldots, z _k \in T }$ such that ${ \left\lVert z - \frac{1}{k} \sum _{i=1} ^{k} z _i \right\rVert _{2} \leq \frac{\text{diam}(T)}{\sqrt{2k}} .}$
Pf: Recall the following fact about variances.
For a random variable ${ X },$ its variance ${ \mathbb{E}[(X - \mathbb{E}[X]) ^2] = \frac{1}{2} \mathbb{E}[(X - X') ^2] }$ where ${ X' }$ is an i.i.d. copy of ${ X }.$ The proof is by expanding the RHS. Similarly, for a random vector ${ Z }$ we have ${ \mathbb{E}\lVert Z - \mathbb{E}[Z] \rVert ^2 = \frac{1}{2} \mathbb{E}\lVert Z - Z' \rVert ^2 }$ where ${ Z' }$ is an i.i.d. copy of ${ Z }.$
Since ${ z \in \text{conv}(T) }$ we can write ${ z = \sum _{i=1} ^{n} \lambda _i z _i }$ with ${ \lambda _i \geq 0 ,}$ ${ \sum _{i=1} ^{n} \lambda _i = 1 ,}$ ${ z _i \in T }.$
Let ${ Z }$ be a random vector taking value ${ z _i }$ with probability ${ \lambda _i }.$ (So especially ${ \mathbb{E}[Z] = z }$). Consider its i.i.d copies ${ Z, Z _1, Z _2, \ldots }$ By SLLN (Strong Law of Large Numbers), ${ \frac{1}{k} \sum _{i=1} ^{k} Z _i \stackrel{a.s}{\longrightarrow} \mathbb{E}[Z] = z }.$ Looking at the error, ${ \mathbb{E} \lVert z - \frac{1}{k} \sum _{i=1} ^{k} Z _i \rVert ^{2} }$ ${ = \mathbb{E} \lVert \frac{1}{k} \sum _{i=1} ^{k} (z - Z _i) \rVert ^{2} }$ ${ = \frac{1}{k ^2} \mathbb{E} \lVert \sum _{i=1} ^{k} (z - Z _i) \rVert ^2 }.$
For random vectors ${ A, B \in \mathbb{R} ^{d} }$ note ${ \mathbb{E}[\lVert A + B \rVert ^2] }$ ${ = \mathbb{E}[\lVert A \rVert ^2] + \mathbb{E}[\lVert B \rVert ^2] + 2 \mathbb{E}[A \cdot B] }.$ When say ${ A = (Z _1 - \mathbb{E}[Z _1]) }$ and ${ B = (Z _2 - \mathbb{E}[Z _2]) }$ where ${ Z _1, Z _2 }$ are i.i.d vectors, the dot product has expectation ${ \mathbb{E}[A \cdot B] = 0 }$ and hence ${ \mathbb{E}[\lVert A + B \rVert ^2] = \mathbb{E}[\lVert A \rVert ^2] + \mathbb{E}[\lVert B \rVert ^2] }$ holds.
So ${ \mathbb{E} \lVert z - \frac{1}{k} \sum _{i=1} ^{k} Z _i \rVert ^{2} = \frac{1}{k} \mathbb{E}\lVert z - Z \rVert ^2 }$ ${ = \frac{1}{2k} \mathbb{E} \lVert Z' - Z \rVert ^2 }$ ${ \leq \frac{\text{diam}(T) ^2}{2k} }.$
So there exist realisations ${ z _1, \ldots, z _k }$ of ${ Z _1, \ldots, Z _k }$ for which ${ \lVert z - \frac{1}{k} \sum _{i=1} ^{k} z _i \rVert ^{2} \leq \frac{\text{diam}(T) ^2}{2k} , }$ as needed.
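A quick numeric check under assumed toy data (${ T }$ = 20 random points in ${ \mathbb{R} ^{10} },$ ${ z }$ a random convex combination of them): sampling ${ Z _1, \ldots, Z _k }$ from the ${ \lambda }$-distribution, the mean squared error stays below ${ \frac{\text{diam}(T) ^2}{2k} },$ so some realisation achieves the bound.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)

# Assumed toy setup: T = 20 random points in R^10, z a convex combination of them.
T = rng.normal(size=(20, 10))
lam = rng.dirichlet(np.ones(20))
z = lam @ T
diam = max(np.linalg.norm(a - b) for a, b in combinations(T, 2))

k = 8
errors = [np.linalg.norm(z - T[rng.choice(len(T), size=k, p=lam)].mean(axis=0))
          for _ in range(2000)]              # Z_1, ..., Z_k iid with P(Z = z_i) = lambda_i

print("mean squared error:", np.mean(np.square(errors)))
print("bound diam^2/(2k):  ", diam**2 / (2 * k))
print("best realisation:", min(errors), "<=", diam / np.sqrt(2 * k))
```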
Lec-3:
The covering number of a set ${ T \subseteq \mathbb{R} ^d }$ at scale ${ \epsilon }$ is the smallest number of Euclidean balls of radius ${ \epsilon }$ needed to cover ${ T }.$ It is denoted ${ N(T, \epsilon ) }.$
Eg: If ${ B \subseteq \mathbb{R} ^d }$ is the unit Euclidean ball, covering number ${ N(B, \epsilon) \geq (\frac{1}{\epsilon}) ^{d} }.$
Pf: Euclidean balls of radius ${ \epsilon }$ have volume ${ \epsilon ^d \text{vol}(B) }.$ Considering the cover of ${ B }$ by ${ N(B, \epsilon ) }$ many balls, and comparing volumes, ${ \epsilon ^d \text{vol}(B) N(B, \epsilon) \geq \text{vol}(B) }.$
Thm: Let ${ P }$ be a polytope in ${ \mathbb{R} ^d }$ with ${ m }$ vertices. Then covering number ${ N(P, \epsilon) \leq m ^{\text{diam}(P) ^2 / (2 \epsilon ^2)} .}$
Pf: Firstly ${ P = \text{conv}(T) }$ where ${ T = \lbrace \text{vertices of } P \rbrace }$ (so ${ \text{diam}(T) = \text{diam}(P) }$). By Approx Caratheodory, every ${ x \in P \, (= \text{conv}(T)) }$ is within ${ \frac{\text{diam}(T)}{\sqrt{2k}}}$ of some point in the set ${ \mathscr{N} = \lbrace \frac{1}{k} \sum _{i=1} ^{k} z _i : z _i \in T \rbrace. }$ (That is, the balls of radius ${ \frac{\text{diam}(T)}{\sqrt{2k}}}$ centred at the points of ${ \mathscr{N}}$ cover ${ P }$). So ${ N(P, \frac{\text{diam}(T)}{\sqrt{2k}}) \leq \vert \mathscr{N} \vert }.$ Further ${ \vert \mathscr{N} \vert \leq m ^{k} }.$ Given ${ \epsilon > 0 },$ picking ${ k \approx \frac{\text{diam}(T) ^2}{2 \epsilon ^2} }$ so that ${ \frac{\text{diam}(T)}{\sqrt{2k}} \approx \epsilon }$ gives ${ N(P, \epsilon) \leq m ^k = m ^{\text{diam}(T) ^2 / (2 \epsilon ^2)}. }$
Lec-4:
Let ${ B \subseteq \mathbb{R} ^d }$ be unit ball, and ${ P \subseteq B }$ a polytope with ${ m }$ vertices.
Note ${ \text{vol}(P) \leq N(P, \epsilon) \text{vol}(\epsilon B ) }$ ${ \leq m ^{2/ \epsilon ^2} \epsilon ^d \text{vol}(B) }$ (using ${ \text{diam}(P) \leq 2 }$ since ${ P \subseteq B }$). So ${ \frac{\text{vol}(P)}{\text{vol}(B)} \leq \inf _{\epsilon > 0} m ^{2/ \epsilon ^2} \epsilon ^d . }$
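A numeric look at the last bound, in logs to avoid overflow (the values ${ m = 100 },$ ${ d = 1000 }$ are illustrative assumptions): the optimal ${ \epsilon }$ is around ${ \sqrt{4 \log(m) / d} }$ and the volume ratio comes out astronomically small, i.e. a polytope with few vertices occupies a tiny fraction of the ball.

```python
import numpy as np

# Illustrative values (assumptions): a polytope with m vertices inside the unit ball of R^d.
m, d = 100, 1000
eps = np.linspace(0.01, 1.0, 5000)

# log of the bound m^(2/eps^2) * eps^d, to avoid overflow for small eps.
log_bound = (2.0 / eps**2) * np.log(m) + d * np.log(eps)
i = np.argmin(log_bound)

print("optimal eps ~", eps[i])                        # roughly sqrt(4 log(m) / d)
print("log of vol(P)/vol(B) bound:", log_bound[i])    # hugely negative
```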
Obs [Gaussian Tails]: If ${ g \sim N(0,1) },$ then tail probability ${ \mathbb{P}[g \geq t] \leq \frac{1}{t \sqrt{2\pi}} e ^{-t ^2 /2} }$ for ${ t \geq 0 }.$
Pf: Note ${ \mathbb{P}[g \geq t] }$ ${ = \int _{t} ^{\infty} \frac{1}{\sqrt{2 \pi}} e ^{- x ^2 /2} dx }$ ${ = \int _{0} ^{\infty} \frac{1}{\sqrt{2 \pi}} e ^{-(t+y) ^2 /2} dy }$ ${ \leq \frac{1}{\sqrt{2 \pi}} e ^{-t ^2 /2} \int _{0} ^{\infty} e ^{-ty} dy }$ ${ = \frac{1}{t\sqrt{2 \pi}} e ^{-t ^2 /2} . }$
(More simply, ${ \mathbb{P}(\vert g \vert \geq t) }$ ${ \leq \frac{1}{t} \sqrt{\frac{2}{\pi}} e ^{- t ^2 /2 } }$ ${ \leq e ^{- t ^2 /2} }$ for ${ t \geq 1 }$).
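A quick check of the bound against the exact tail (computed with the error function):

```python
from math import erf, sqrt, pi, exp

# Exact P[g >= t] via the error function, versus the bound exp(-t^2/2) / (t sqrt(2 pi)).
for t in [1.0, 2.0, 3.0, 4.0]:
    tail = 0.5 * (1 - erf(t / sqrt(2)))
    bound = exp(-t * t / 2) / (t * sqrt(2 * pi))
    print(f"t = {t}: tail = {tail:.3e}, bound = {bound:.3e}")
```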
Lec-5:
[Markov] For random variable ${ X \geq 0 }$ and real ${ t > 0 },$ we have ${ \mathbb{E}[X] \geq t \mathbb{P}(X \geq t). }$
Pf: Writing ${ X = X \mathbb{1}(X \geq t) + X \mathbb{1}(X < t) }$ we have ${ \mathbb{E}[X] }$ ${ = \mathbb{E}[X \mathbb{1}(X \geq t)] + \mathbb{E}[X \mathbb{1}(X < t)] }$ ${ \geq \mathbb{E}[X \mathbb{1}(X \geq t)] }$ ${ \geq t \mathbb{P}(X \geq t). }$
[Chebyshev] Consider random variable ${ X }$ with mean ${ \mu }$ and variance ${ \sigma ^2 }.$ Now ${ \mathbb{E}[\vert {X-\mu} \vert ^2] \geq t ^2 \mathbb{P}(\vert X - \mu \vert ^2 \geq t ^2) }$ that is ${ \sigma ^2 \geq t ^2 \mathbb{P}(\vert X - \mu \vert \geq t) }.$
[Berry-Esseen] Let ${ X _1, X _2, \ldots }$ be i.i.d. random variables with mean ${ 0 }$ and variance ${ 1 }.$ By CLT, ${ \frac{1}{\sqrt{N}} \sum _{1} ^{N} X _i }$ is approximately ${ N (0,1) }$ distributed. Berry-Esseen says the error in tails is ${ \left\vert \mathbb{P}\left(\frac{1}{\sqrt{N}} \sum _{1} ^{N} X _i \geq t \right) - \mathbb{P}(g \geq t) \right\vert \leq \frac{\mathbb{E}\vert X _1 \vert ^3}{\sqrt{N}} }$ (where ${ g \sim N(0,1) }$).
Lec-6:
[Hoeffding] Let ${ X _1, X _2, \ldots }$ be independent ${ \text{Unif}\lbrace -1, 1 \rbrace }$ variables. Then the tail ${ \mathbb{P}\left( \frac{1}{\sqrt{N}} \sum _{1} ^{N} X _i \geq t \right) \leq e ^{- t ^2 / 2 }}$ for ${ t \geq 0 . }$
Pf: Note ${ \mathbb{P}\left( \frac{1}{\sqrt{N}} \sum _{i=1} ^{N} X _i \geq t \right) }$ ${ = \mathbb{P}\left( \exp(\lambda \sum _{i=1} ^{N} X _i ) \geq \exp( \lambda \sqrt{N} t ) \right) }$ ${ \leq \frac{\mathbb{E}[\exp(\lambda \sum _{i=1} ^{N} X _i )]}{\exp( \lambda \sqrt{N} t )} }$ ${ = e ^{-\lambda \sqrt{N} t} \Pi _{i=1} ^{N} \mathbb{E}[e ^{\lambda X _i}] }$ ${ = e ^{-\lambda \sqrt{N} t} \left( \frac{e ^{\lambda} + e ^{-\lambda}}{2} \right) ^{N} }$ ${ \leq e ^{-\lambda \sqrt{N} t } (e ^{\lambda ^2 /2}) ^N ,}$ and RHS viewed as a function of ${ \lambda > 0 }$ attains minimum at ${ \lambda = \frac{t}{\sqrt{N}} .}$ So ${ \mathbb{P}\left( \frac{1}{\sqrt{N}} \sum _{1} ^{N} X _i \geq t \right) \leq e ^{- t ^2 / 2 }}.$
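A simulation sketch (the values of ${ N },$ ${ t }$ and the number of trials are arbitrary choices): the empirical tail of ${ \frac{1}{\sqrt{N}} \sum _{1} ^{N} X _i }$ sits below ${ e ^{-t ^2 /2} }.$

```python
import numpy as np

rng = np.random.default_rng(3)
N, trials, t = 100, 50_000, 2.0

# X_i uniform on {-1, +1}; compare the empirical tail of S/sqrt(N) with exp(-t^2/2).
X = rng.choice([-1, 1], size=(trials, N))
S = X.sum(axis=1) / np.sqrt(N)

print("empirical tail:", np.mean(S >= t))
print("Hoeffding bound:", np.exp(-t**2 / 2))
```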
Prob: Estimate the mean ${ \mu }$ of a distribution (with variance ${ \sigma ^2 }$) from a sample ${ X _1, \ldots, X _N }$ drawn independently from this distribution.
Classical estimator: Sample mean ${ \hat{\mu} = \frac{1}{N} \sum _{i=1} ^{N} X _i }$ (unbiased because its expectation is ${ \mu }$, and with mean squared error ${ \mathbb{E}[\vert \mu - \hat{\mu} \vert ^2] = (\frac{\sigma}{\sqrt{N}}) ^2 }$).
Confidence interval: We bound ${ \mathbb{P}(\vert \hat{\mu} - \mu \vert \geq t \frac{\sigma}{\sqrt{N}} ) }$ (deviations are measured in units of ${ \frac{\sigma}{\sqrt{N}} }$ since the mean squared error is ${ (\frac{\sigma}{\sqrt{N}}) ^2 }$). By Markov applied to ${ \vert \hat{\mu} - \mu \vert ^2 },$ ${ \mathbb{P}(\vert \hat{\mu} - \mu \vert \geq t \frac{\sigma}{\sqrt{N}} ) }$ ${ = \mathbb{P}(\vert \hat{\mu} - \mu \vert ^2 \geq \frac{t ^2 \sigma ^2}{N} ) }$ ${ \leq \frac{1}{t ^2 \sigma ^2 /N} \mathbb{E}[\vert \mu - \hat{\mu} \vert ^2] }$ ${ = \frac{1}{t ^2} . }$
For normal sample ${ X _i \sim N(\mu ,\sigma ^2) }$ a much better confidence interval (since ${ \hat{\mu} \sim N(\mu, \frac{\sigma ^2}{N}) }$ that is ${ \frac{\hat{\mu} - \mu}{\sigma / \sqrt{N}} \sim N(0,1) }$) using Gaussian tail bound is ${ \mathbb{P}(\vert \hat{\mu} - \mu \vert \geq t \frac{\sigma}{\sqrt{N}} ) \leq e ^{- t ^2 /2}}$ for ${ t \geq 1 }.$
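A simulation sketch for normal samples (the values of ${ \mu, \sigma, N, t }$ below are arbitrary choices): the empirical deviation probability is well below the Chebyshev bound ${ \frac{1}{t ^2} }$ and below the Gaussian bound ${ e ^{-t ^2 /2} }.$

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, N, trials, t = 1.0, 2.0, 50, 100_000, 2.0

# Sample means of N(mu, sigma^2) samples; compare P(|mu_hat - mu| >= t sigma/sqrt(N)) with both bounds.
X = rng.normal(mu, sigma, size=(trials, N))
mu_hat = X.mean(axis=1)

print("empirical:", np.mean(np.abs(mu_hat - mu) >= t * sigma / np.sqrt(N)))
print("Chebyshev bound 1/t^2:", 1 / t**2)
print("Gaussian tail bound exp(-t^2/2):", np.exp(-t**2 / 2))
```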
Lec-7:
[Hoeffding again] Let ${ X _1, \ldots, X _N }$ with ${ X _i \in [a _i, b _i] }$ be independent random variables. Then the sum ${ S _N = \sum _{i=1} ^{N} X _i }$ satisfies ${ \mathbb{P}(S _N - \mathbb{E}[S _N] \geq t ) }$ ${ \leq \exp\left(\frac{-2 t ^2}{\sum _{i=1} ^{N} (b _i - a _i ) ^2} \right) }.$
Pf: [Taken from the Measure Concentration part of Shalev-Shwartz and Ben-David’s ML book] Write ${ Z _i = X _i - \mathbb{E}[X _i] ,}$ now ${ \mathbb{P}(S _N - \mathbb{E}[S _N] \geq t ) }$ ${ = \mathbb{P}(\sum _{i =1} ^{N} Z _i \geq t ) }$ ${ = \mathbb{P}(e ^{\lambda \sum _{i=1} ^{N} Z _i } \geq e ^{\lambda t}) }$ ${ \leq \frac{1}{e ^{\lambda t}} \mathbb{E}[e ^{\lambda \sum _{i=1} ^{N} Z _i}] }$ ${ = e ^{-\lambda t} (\Pi _{i=1} ^{N} \mathbb{E}[e ^{\lambda Z _i}]) }.$ Notice ${ Z _i = X _i - \mathbb{E}[X _i] }$ lies in ${ [a _i - \mathbb{E}[X _i], \, b _i - \mathbb{E}[X _i]] },$ an interval of length ${ b _i - a _i },$ and has mean ${ 0 }.$ By the lemma below, ${ \mathbb{P}(S _N - \mathbb{E}[S _N] \geq t ) }$ ${ \leq e ^{-\lambda t} (\Pi _{i=1} ^{N} \mathbb{E}[e ^{\lambda Z _i}]) }$ ${ \leq e ^{-\lambda t} e ^{\frac{\lambda ^2 \sum _{i=1} ^{N} (b _i - a _i) ^2}{8} }}.$ The RHS as a function of ${ \lambda > 0 }$ is minimized at ${ \lambda = \frac{4t}{\sum _{i=1} ^{N} (b _i - a _i) ^2} }.$ This minimum value is ${ \exp\left(\frac{-2 t ^2}{\sum _{i=1} ^{N} (b _i - a _i ) ^2} \right)},$ as needed.
Lemma: Let ${ Z \in [a,b] }$ be a random variable with ${ \mathbb{E}[Z] = 0 }.$ Then ${ \mathbb{E}[e ^{\lambda Z}] \leq \exp(\frac{\lambda ^2 (b-a) ^2}{8}) }$ for ${ \lambda > 0 }.$
Pf: The function ${ e ^{\lambda x} }$ is below the line ${ e ^{\lambda a} + (\frac{e ^{\lambda b} - e ^{\lambda a}}{b-a}) (x-a) }$ on ${ [a,b] }.$ Taking expectation (and using ${ \mathbb{E}[Z] = 0 }$), ${\mathbb{E}[e ^{\lambda Z}] \leq \frac{b e ^{\lambda a } - a e ^{ \lambda b }}{b-a} }.$ RHS is ${ e ^{\lambda a} (\frac{b}{b-a} - \frac{a}{b-a} e ^{\lambda (b-a)} ) }$ ${= e ^{\lambda a} (1 + \frac{a}{b-a} {\color{red}{- \frac{a}{b-a}}} e ^{\color{blue}{\lambda (b-a)}} ) }$ ${ = e ^{L(h)} }$ where ${ L(h) = -hp + \log(1-p+{\color{red}{p}} e^{\color{blue}{h}}) }$ with ${ {\color{red}{p = - \frac{a}{b-a}}} }$ and ${ {\color{blue}{h = \lambda (b-a)}} }.$ It suffices to show ${ L(h) \leq \frac{h ^2}{8} }.$ This follows from verifying ${ L(0) = L'(0) = 0 }$ and ${ L''(h) \leq \frac{1}{4} }$ for all ${ h > 0 }.$
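A quick numeric check of the lemma with an assumed ${ Z }$ (uniform on an interval, centred to have mean ${ 0 }$):

```python
import numpy as np

rng = np.random.default_rng(5)

# Assumed Z: uniform on [a, b] shifted to mean 0, so Z lies in an interval of length b - a.
a, b = -1.0, 3.0
Z = rng.uniform(a, b, size=1_000_000) - (a + b) / 2

for lam in [0.1, 0.5, 1.0, 2.0]:
    mgf = np.exp(lam * Z).mean()                      # Monte-Carlo estimate of E[exp(lam Z)]
    bound = np.exp(lam**2 * (b - a)**2 / 8)
    print(f"lambda = {lam}: E[exp(lam Z)] ~ {mgf:.4f}, bound {bound:.4f}")
```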
Obs: Consider independent variables ${ X _1 ^{(N)}, \ldots, X _{N} ^{(N)} \sim \text{Ber}(\frac{\mu}{N}) }.$ Then sum ${ \sum _{i=1} ^{N} X _i ^{(N)} \to \text{Poi}(\mu) }$ in distribution.
Pf: Probability ${ \mathbb{P}(\sum _{i=1} ^{N} X _i ^{(N)} = k) }$ ${ = \binom{N}{k} (\frac{\mu}{N}) ^k (1 - \frac{\mu}{N}) ^{N-k} }$ ${ = \frac{N ^k + (\text{poly in N of degree k-1})}{k!} (\frac{\mu}{N}) ^k (1 - \frac{\mu}{N}) ^{N-k} }$ ${ \to \frac{\mu ^k}{k!} e ^{-\mu} }$ as ${ N \to \infty }.$
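A simulation sketch (the values of ${ \mu },$ ${ N }$ and the number of trials are arbitrary): ${ \text{Bin}(N, \frac{\mu}{N}) }$ is exactly the sum of ${ N }$ i.i.d. ${ \text{Ber}(\frac{\mu}{N}) }$ variables, and its pmf is close to the ${ \text{Poi}(\mu) }$ pmf.

```python
import numpy as np
from math import exp, factorial

rng = np.random.default_rng(6)
mu, N, trials = 3.0, 1000, 500_000

# Bin(N, mu/N) is the sum of N iid Ber(mu/N); compare its empirical pmf with Poi(mu).
S = rng.binomial(N, mu / N, size=trials)
for k in range(6):
    print(f"k = {k}: empirical {np.mean(S == k):.4f}, "
          f"Poi(mu) pmf {exp(-mu) * mu**k / factorial(k):.4f}")
```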
[Chernoff, upper tail] Consider independent ${ X _i \sim \text{Ber}(p _i) }$ random variables. Then ${ S _N = \sum _{i=1} ^{N} X _i }$ has mean ${ \mu = \sum _{i=1} ^{N} p _i , }$ and upper tail ${ \mathbb{P}(S _N \geq t) \leq e ^{-\mu} (\frac{e\mu}{t}) ^{t} }$ for all ${ t \geq \mu }.$
Pf: Same theme, ${ \mathbb{P}(S _N \geq t) }$ ${ = \mathbb{P}(e ^{\lambda S _N} \geq e ^{\lambda t}) }$ ${ \leq \frac{1}{e ^{\lambda t}} \mathbb{E}[e ^{\lambda S _N}] }$ ${ = e ^{-\lambda t} \Pi _{i=1} ^{N} \mathbb{E}[e ^{\lambda X _i}] }$ ${ = e ^{-\lambda t} \Pi _{i=1} ^{N} ((1-p _i) + e ^{\lambda} p _i ) }.$ Each term ${(1-p) + e ^{\lambda} p }$ ${ = 1 + p (e ^{\lambda} - 1) }$ ${ \leq \exp(p(e ^{\lambda} - 1)). }$ So ${ \mathbb{P}(S _N \geq t) }$ ${ \leq e ^{-\lambda t} \exp((\sum _{i=1} ^{N} p _i)(e ^{\lambda} - 1)) }$ ${ = e ^{-\lambda t} \exp(\mu(e ^{\lambda} -1)) .}$ RHS as a function of ${ \lambda > 0 }$ is minimized at ${ \lambda = \log(\frac{t}{\mu}) (>0) ,}$ so the bound is ${ \leq e ^{-\mu} (\frac{e\mu}{t}) ^t }$ for ${ t \geq \mu }.$
[Chernoff, lower tail] Consider independent ${ X _i \sim \text{Ber}(p _i) }$ random variables. Then ${ S _N = \sum _{i=1} ^{N} X _i }$ has mean ${ \mu = \sum _{i=1} ^{N} p _i ,}$ and lower tail ${ \mathbb{P}(S _N \leq t) \leq e ^{-\mu} (\frac{e\mu}{t}) ^t }$ for ${ 0 < t < \mu }.$
Pf: Same theme, ${ \mathbb{P}(S _N \leq t) }$ ${ = \mathbb{P}(-S _N \geq -t) }$ ${ = \mathbb{P}(e ^{-\lambda S _N} \geq e ^{-\lambda t}) }$ ${ \leq e ^{\lambda t} \mathbb{E}[e ^{-\lambda S _N}] }$ ${ = e ^{\lambda t} \Pi _{i=1} ^{N} \mathbb{E}[e ^{-\lambda X _i}] }$ ${ = e ^{\lambda t} \Pi _{i=1} ^{N} (e ^{-\lambda} p _i + (1-p _i)) }.$ Each term ${ (1-p _i) + e ^{-\lambda} p _i }$ ${= 1 - p _i (1 - e ^{-\lambda}) }$ ${ \leq \exp(-p _i (1 - e ^{-\lambda})) }.$ So ${\mathbb{P}(S _N \leq t) }$ ${ \leq e ^{\lambda t} \exp(-\mu (1-e ^{-\lambda})) }.$ RHS as a function of ${ \lambda \gt 0 }$ is minimized at ${ \lambda = \log(\frac{\mu}{t}) (> 0) }$ for ${ t < \mu }.$ So the bound is ${ \mathbb{P}(S _N \leq t) }$ ${ \leq (\frac{\mu}{t}) ^t e ^{-(\mu - t)} }$ for ${ 0 < t < \mu }$ as needed.
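A simulation sketch for both tails under assumed heterogeneous ${ p _i }$s (the ${ p _i }$ values and the choices of ${ t }$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
trials = 100_000

# Assumed setup: 50 independent Ber(p_i) with random p_i; check both Chernoff tails.
p = rng.uniform(0.01, 0.2, size=50)
mu = p.sum()
S = (rng.uniform(size=(trials, len(p))) < p).sum(axis=1)

def chernoff(t):                       # e^{-mu} (e mu / t)^t
    return np.exp(-mu) * (np.e * mu / t) ** t

t_hi, t_lo = 2 * mu, 0.5 * mu
print("upper tail: empirical", np.mean(S >= t_hi), " bound", chernoff(t_hi))
print("lower tail: empirical", np.mean(S <= t_lo), " bound", chernoff(t_lo))
```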
Lec-8:
Another ref: Lec-1 handwritten notes by Prof. Arindam Khan at https://www.csa.iisc.ac.in/~arindamkhan/courses/toolkit21/TheoristsToolkit.html
Consider independent ${ X _i \sim \text{Ber}(p _i) }$ random variables. As usual we have sum ${ S _N = \sum _{i=1} ^{N} X _i }$ with mean ${ \mu = \sum _{i=1} ^{N} p _i }.$
For ${ \delta \in (0,1) },$ the tail ${ \mathbb{P}(S _N \geq \mu(1+\delta) ) }$ ${ \leq e ^{-\mu} (\frac{e}{(1+\delta)}) ^{\mu(1+\delta)} }$ ${ = e ^{\delta \mu} (\frac{1}{1+\delta}) ^{(1+\delta) \mu } }$ ${ = \text{exp}[\mu (\delta - (1+\delta) \text{log}(1+\delta) ) ] }$ ${ {\color{red}{\leq}} e ^{-\mu \frac{\delta ^2}{3} } }.$
For ${ \delta \in (0,1) },$ the tail ${ \mathbb{P}(S _N \leq \mu (1 - \delta) ) }$ ${ \leq e ^{-\mu} (\frac{e}{1-\delta}) ^{\mu (1 - \delta)} }$ ${ = e ^{- \mu \delta } (\frac{1}{1-\delta}) ^{\mu (1-\delta)} }$ ${ = \text{exp}[\mu(-\delta - (1-\delta) \text{log}(1-\delta) )] }$ ${ {\color{blue}{\leq}} e ^{-\mu \frac{\delta ^2}{2}} }.$
So, for ${ \delta \in (0,1) }$ we have ${ \mathbb{P}(\vert S _N - \mu \vert \geq \mu \delta) }$ ${ = \mathbb{P}(S _N \geq \mu + \mu \delta) + \mathbb{P}(S _N \leq \mu - \mu \delta) }$ ${ \leq 2 e ^{- \mu \frac{\delta ^2}{3} } }.$
${ {\color{red}{\text{Lemma}}} }$: For ${ \delta \in (0,1) },$ we have ${ \delta - (1+\delta ) \text{log}(1+\delta) }$ ${ \leq - \frac{\delta ^2}{3} .}$
Pic: https://lensdump.com/i/BZC1v2
Pf: From the observation https://math.stackexchange.com/a/4311294/303300, informally the odd Taylor polynomials are above and the even Taylor polynomials are below ${ \text{log}(1+x) }.$ So ${ (1 + \delta )\text{log}(1+\delta) }$ ${ \geq (1+\delta)(\delta - \frac{\delta ^2}{2} + \frac{\delta ^3}{3} - \frac{\delta ^4}{4}) }$ ${ = \delta + \frac{\delta ^2}{2} - \frac{\delta ^3}{6} + \frac{\delta ^4}{12} - \frac{\delta ^5}{4} .}$ If say ${ \delta \in (0, \frac{1}{3} ) }$ the RHS is ${ \geq \delta + \frac{\delta ^2}{2} - \frac{\delta ^3}{6} }$ ${ \geq \delta + \frac{\delta ^2}{2} - \frac{\delta ^2}{6} }$ ${ = \delta + \frac{\delta ^2}{3}, }$ giving the required bound. The bound can be shown to hold for all ${ \delta \in (0,1) }.$
AltPf: Informally, ${ (1 + \delta) \text{log}(1+\delta) }$ ${ = (1+\delta) (\delta - \frac{\delta ^2}{2} + \frac{\delta ^3}{3} - \ldots ) }$ ${ = \delta + \frac{\delta ^2}{2} - \frac{\delta ^3}{6} + (\text{higher degree terms}) },$ and ${ \delta + \frac{\delta ^2}{2} - \frac{\delta ^3}{6} }$ ${ \geq \delta + \frac{\delta ^2}{2} - \frac{\delta ^2}{6} }$ ${= \delta + \frac{\delta ^2}{3}. }$ So it suffices to show ${ (1+\delta) \text{log}(1+\delta) }$ ${ \geq \delta + \frac{\delta ^2}{2} - \frac{\delta ^3}{6}.}$
Derivatives of LHS (wrt ${ \delta }$) are ${ \text{log}(1+\delta) + 1, }$ ${ \frac{1}{1+\delta}, \ldots }$ Derivatives of RHS (wrt ${ \delta }$) are ${ 1 + \delta - \frac{\delta ^2}{2} ,}$ ${ 1 - \delta, \ldots }$ Both sides and their first derivatives agree at ${ \delta = 0 },$ and the second derivative of the LHS dominates: ${ \frac{1}{1+\delta} \geq 1 - \delta }$ since ${ (1+\delta)(1-\delta) = 1 - \delta ^2 \leq 1 }.$ Integrating twice gives the inequality.
${ {\color{blue}{\text{Lemma}}} }$: For ${ \delta \in (0,1), }$ we have ${ - \delta - (1-\delta) \text{log}(1-\delta) \leq - \frac{\delta ^2}{2} }.$
Pf: Informally ${ (1-\delta) \text{log}(1-\delta) }$ ${ = (1-\delta) (-\delta - \frac{\delta ^2}{2} - \frac{\delta ^3}{3} - \ldots) }$ ${ = -\delta + \frac{\delta ^2}{2} + (\text{higher degree terms}) },$ and it suffices to show ${ (1-\delta) \text{log}(1-\delta) }$ ${ \geq -\delta + \frac{\delta ^2}{2} }.$ Derivatives of LHS are ${ -1 - \text{log}(1-\delta) , }$ ${ \frac{1}{1-\delta} , \ldots}$ and derivatives of RHS are ${ -1 + \delta, 1 }.$ Since ${ -1 - \text{log}(1-\delta) }$ ${ \geq -1 + \delta }$ the inequality follows.
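A grid check of the two lemmas over ${ \delta \in (0,1) }$ (using log1p for accuracy near ${ 0 }$):

```python
import numpy as np

# Check  delta - (1+delta)log(1+delta) <= -delta^2/3  and
#       -delta - (1-delta)log(1-delta) <= -delta^2/2  on a grid of delta in (0,1).
delta = np.linspace(1e-6, 1 - 1e-6, 100_000)
red  =  delta - (1 + delta) * np.log1p(delta)  + delta**2 / 3
blue = -delta - (1 - delta) * np.log1p(-delta) + delta**2 / 2

print("red lemma holds on grid: ", bool(np.all(red  <= 0)))
print("blue lemma holds on grid:", bool(np.all(blue <= 0)))
```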
Erdos-Renyi Model:
Consider vertices ${ 1, \ldots, N }.$ Connect each pair of vertices with an edge independently, with probability ${ p }.$ The random graph is denoted ${ G(N,p) }.$
For every ${ i \in \lbrace 1 , \ldots, N \rbrace, }$ random variable ${ \text{deg}(i) = ( \text{no. of vertices connected to i} ) }$ is sum of ${ (N-1) }$ independent ${ \text{Ber}(p) }$ variables hence ${ \text{deg}(i) \sim \text{Bin}(N-1, p) }.$ Especially, ${ \mathbb{E}[\text{deg}(i)] = (N-1)p =: d },$ called expected degree of ${ G(N,p) }.$
Recall a graph is called d-regular if the degree of each vertex is ${ d }.$
For example, a cycle graph is 2-regular and the tetrahedral graph is 3-regular.
Thm: Let ${ p \in (0,1) }.$ There is an absolute constant ${ C > 0 }$ such that degree ${ d = (N-1) p \geq C \text{log}(N) }$ implies ${ \mathbb{P}[\forall i : 0.9d \leq \text{deg}(i) \leq 1.1d] \geq 0.9. }$
Rem: The conclusion roughly means “${ G(N,p) }$ is almost ${d-}$regular with high probability”.
Pf: Let ${ E _i }$ be the event ${ \lbrace 0.9d \leq \text{deg}(i) \leq 1.1d \rbrace }.$ By Chernoff, ${ \mathbb{P}(E _i ^c) }$ ${ = \mathbb{P}(\vert \text{deg}(i) - d \vert \geq 0.1d ) }$ ${ \leq 2 e ^{-d \frac{(0.1) ^2}{3} } }$ ${ \leq 2 e ^{-C\text{log}(N)/300 } }$ ${ = 2 N ^{-C/300} }$ which is ${ \leq \frac{1}{10N} }$ if we pick ${ C }$ large enough (e.g. ${ C = 900 }$). So ${ \mathbb{P}(\cap E _i) }$ ${ = 1 - \mathbb{P}(\cup E _i ^c) }$ ${ \geq 1 - \sum \mathbb{P}(E _i ^c) }$ ${ \geq 1 - N (\frac{1}{10N}) }$ ${ = 0.9 }$ as needed.
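A simulation sketch of the theorem (the value of ${ N }$ and the constant multiplying ${ \text{log}(N) }$ are arbitrary choices; with these values the check typically succeeds):

```python
import numpy as np

rng = np.random.default_rng(8)
N = 500
d = 60 * np.log(N)                 # expected degree d = (N-1)p, chosen >= C log N
p = d / (N - 1)

# Sample G(N, p): symmetric 0/1 adjacency matrix with no self-loops.
upper = np.triu(rng.random((N, N)) < p, k=1)
A = upper | upper.T
deg = A.sum(axis=1)

print("expected degree d =", d)
print("all degrees in [0.9d, 1.1d]:", bool(np.all((deg >= 0.9 * d) & (deg <= 1.1 * d))))
```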