Sampling Distribution

From Rice Wiki


Let there be <math>Y_1, \ldots, Y_n</math>, where each <math>Y_i</math> is a random variable from the population.

Every <math>Y_i</math> has the same mean and distribution, which we don't know.

We then have the sample mean <math>\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i</math>.

The sample mean is expected to be <math>E(\bar{Y}) = \mu</math>, through a pretty easy direct proof.

The variance of the sample mean is <math>Var(\bar{Y}) = \frac{\sigma^2}{n}</math>, also through a pretty easy direct proof.
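
For reference, both facts follow in one line from linearity of expectation and independence of the <math>Y_i</math>:

<math>E(\bar{Y}) = \frac{1}{n} \sum_{i=1}^{n} E(Y_i) = \frac{n\mu}{n} = \mu, \qquad Var(\bar{Y}) = \frac{1}{n^2} \sum_{i=1}^{n} Var(Y_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}</math>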

= Central Limit Theorem =

The central limit theorem states that the distribution of the sample mean follows a normal distribution:

<math>\bar{Y} \sim N\left(\mu, \frac{\sigma^2}{n}\right)</math>

As long as one of the following two conditions is satisfied, the CLT applies, regardless of the population's distribution.

# The population distribution of <math>Y</math> is normal, ''or''
# The sample size <math>n</math> is large (<math>n > 30</math>)

By extension, we also have the distribution of the sum:

<math>S \sim N\left(\mu_S = n\mu,\ \sigma_S = \sqrt{n}\,\sigma\right)</math>

where <math>S = \sum Y_i</math>; the standard deviation scales as <math>\sqrt{n}</math> because <math>Var(S) = n\sigma^2</math>.
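
As a quick sanity check (a hypothetical simulation, not from the lecture), we can watch the sample mean of a decidedly non-normal population land on the CLT prediction:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, reps = 50, 100_000          # sample size, number of simulated samples
# Exponential(1) is skewed, with mu = 1 and sigma = 1
samples = rng.exponential(scale=1.0, size=(reps, n))
means = samples.mean(axis=1)   # one sample mean per simulated sample

print(means.mean())            # ~ 1.0, matching mu
print(means.std())             # ~ 0.1414, matching sigma / sqrt(n) = 1 / sqrt(50)
</syntaxhighlight>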

== Proportion Approximation ==

The sampling distribution of the proportion of successes of ''n'' Bernoulli random variables (<math>\hat{p}</math>) can also be approximated by a normal distribution under the CLT.

Consider [[Discrete Random Variable#Bernoulli|Bernoulli]] random variables <math>Y_1, \ldots, Y_n</math> with <math>E(Y_i) = p</math> and <math>Var(Y_i) = p(1 - p)</math>.

The proportion of successes <math>\hat{p}</math> is the sum over the count, so its expected value is

<math>E(\hat{p}) = E\left(\frac{1}{n} \sum Y_i\right) = p</math>

and the variance of <math>\hat{p}</math> is

<math>Var\left(\frac{1}{n} \sum Y_i\right) = \frac{1}{n^2} Var\left(\sum Y_i\right) = \frac{n\,p(1 - p)}{n^2} = \frac{p(1 - p)}{n}</math>

With a large sample size ''n'', we can approximate this with a normal distribution. Notably, the criterion for a large ''n'' is different from that for a continuous random variable. If

<math>np > 5, \quad n(1 - p) > 5,</math>

then we have

<math>\hat{p} \sim N\left(p, \frac{p(1 - p)}{n}\right)</math>

The reasoning behind the weird criterion relates to the [[Discrete Random Variable|binomial distribution]]. It's not elaborated on much in the lecture, but <math>np</math> is the mean of the binomial (i.e. the expected number of successes). The criterion essentially makes sure that no negative values are plausible in the approximation; with a small mean and a large variance, the left tail of a normal approximation dips into the negatives, but a Bernoulli/binomial count must always be non-negative.
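
To make that concrete, here is a small hypothetical check of how much probability mass the normal approximation puts on impossible negative values of <math>\hat{p}</math>:

<syntaxhighlight lang="python">
from scipy.stats import norm

p = 0.1
for n in (20, 60):  # np = 2 fails the np > 5 criterion; np = 6 passes it
    sd = (p * (1 - p) / n) ** 0.5
    # P(phat < 0) under the normal approximation; should be negligible
    print(n, norm.cdf(0, loc=p, scale=sd))
# n = 20 -> ~0.07 of the mass sits below zero; n = 60 -> ~0.005
</syntaxhighlight>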

== Binomial Approximation ==

I'm short on time. This is based on the above section and a bit of math: the count of successes is <math>Y = \sum Y_i = n\hat{p}</math>, so scaling <math>\hat{p}</math> by ''n'' multiplies the mean by ''n'' and the variance by <math>n^2</math>, giving

<math>Y \sim N(np,\ np(1 - p))</math>
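
A minimal check of this approximation with hypothetical numbers, comparing the exact binomial CDF against the normal one:

<syntaxhighlight lang="python">
from scipy.stats import binom, norm

n, p = 100, 0.3                 # satisfies np > 5 and n(1 - p) > 5
print(binom.cdf(35, n, p))      # exact P(Y <= 35), ~ 0.88
# Normal approximation, evaluated at 35.5 (continuity correction)
print(norm.cdf(35.5, loc=n * p, scale=(n * p * (1 - p)) ** 0.5))  # ~ 0.88
</syntaxhighlight>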

= Confidence Interval =

Estimation is the guess for the unknown parameter. A '''point estimate''' is a "best guess" of the population parameter, whereas the '''confidence interval''' is the range of reasonable values intended to contain the parameter of interest with a certain degree of confidence, calculated with

''(point estimate - margin of error, point estimate + margin of error)''

== Standard Error ==

The '''standard error''' measures how much error we expect to make when estimating <math>\mu_Y</math> by <math>\bar{y}</math>. For the sample mean, <math>SE(\bar{y}) = \frac{\sigma}{\sqrt{n}}</math>.

== Constructing CIs ==

By CLT, <math>\bar{Y} \sim N\left(\mu, \frac{\sigma^2}{n}\right)</math>. The confidence interval is the range of plausible <math>\bar{Y}</math>.

If we define the middle 90% to be plausible, then to find the confidence interval, simply find the 5th and 95th percentiles.

Generalized, if we want a confidence interval of the middle <math>1 - \alpha</math>, we have a confidence interval of

<math>\left(\bar{y} - z_{1 - \alpha/2} \frac{\sigma}{\sqrt{n}},\ \bar{y} + z_{1 - \alpha/2} \frac{\sigma}{\sqrt{n}}\right)</math>

where <math>\bar{y}</math> is the sample mean and <math>z_x</math> is the z-score of the ''x''-th percentile.
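
A sketch of the computation with made-up data, assuming the population standard deviation <math>\sigma</math> is known:

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import norm

y = np.array([4.8, 5.1, 5.5, 4.9, 5.2, 5.0, 5.3, 4.7])  # hypothetical sample
sigma = 0.3                      # assumed known population SD
n, ybar = len(y), y.mean()

z = norm.ppf(0.95)               # z score of the 95th percentile -> middle 90%
moe = z * sigma / np.sqrt(n)     # margin of error
print((ybar - moe, ybar + moe))  # 90% confidence interval
</syntaxhighlight>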

== T-Distribution ==

[[File:T distribution table.png|thumb|T distribution table]]

CLT is based on the population variance. Since we don't know the population variance <math>\sigma^2</math>, we have to use the sample variance <math>s^2</math> to estimate it. This introduces more uncertainty, accounted for by the '''t-distribution'''.

The t-distribution is the distribution of the sample mean based on the population mean, the sample variance, and the degrees of freedom (covered later). It looks very similar to the normal distribution.

When the sample size is small, there is greater uncertainty in the estimates, and the t-distribution has correspondingly heavier tails.

The spread of the t-distribution depends on the degrees of freedom, which are based on the sample size (for a single sample, <math>df = n - 1</math>). When looking up the table, round the df down.

As the sample size increases, the degrees of freedom increase, the spread of the t-distribution decreases, and the t-distribution approaches the normal distribution.

Based on the CLT and the normal distribution, we had the confidence interval

<math>\left(\bar{y} - z_{1 - \alpha/2} \frac{\sigma}{\sqrt{n}},\ \bar{y} + z_{1 - \alpha/2} \frac{\sigma}{\sqrt{n}}\right)</math>

Now, based on the t-distribution, we have the CI

<math>\left(\bar{y} - t_{df,\,1 - \alpha/2} \frac{s}{\sqrt{n}},\ \bar{y} + t_{df,\,1 - \alpha/2} \frac{s}{\sqrt{n}}\right)</math>
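
The same sketch as before, but with <math>\sigma</math> unknown, now using the t critical value (hypothetical data again):

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import t

y = np.array([4.8, 5.1, 5.5, 4.9, 5.2, 5.0, 5.3, 4.7])  # hypothetical sample
n, ybar, s = len(y), y.mean(), y.std(ddof=1)  # ddof=1 gives the sample SD

tcrit = t.ppf(0.95, df=n - 1)    # t score of the 95th percentile, df = n - 1
moe = tcrit * s / np.sqrt(n)
print((ybar - moe, ybar + moe))  # 90% confidence interval
</syntaxhighlight>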

== Find Sample Size ==

To calculate the sample size needed for a desired error margin ''m'' and sample variance, assume that

<math>m = z_{1 - \alpha/2} \frac{\sigma}{\sqrt{n}}</math>

and solve for the sample size:

<math>n = \left( \frac{z_{1 - \alpha/2}\, \sigma}{m} \right)^2</math>

We want to always round up to stay within the error margin: the margin of error shrinks as ''n'' grows, so rounding down would leave the margin slightly larger than desired.
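
A worked example with hypothetical numbers: the ''n'' needed for a 95% interval with margin of error 0.5, using a planning value of <math>\sigma \approx 2</math>:

<syntaxhighlight lang="python">
import math
from scipy.stats import norm

sigma, m = 2.0, 0.5
z = norm.ppf(0.975)              # 95% interval -> 97.5th percentile z score
n = (z * sigma / m) ** 2         # ~ 61.5
print(math.ceil(n))              # round up -> 62 keeps us within the margin
</syntaxhighlight>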

== Sampling Distribution of Difference ==

By linear combination of RVs, the sampling distribution of <math>\bar{Y}_1 - \bar{Y}_2</math> is

<math>\bar{Y}_1 - \bar{Y}_2 \sim N\left(\mu_1 - \mu_2,\ \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right)</math>

However, we do not know the population variances <math>\sigma_1^2, \sigma_2^2</math>. If the CLT assumptions hold, then we have a t-distribution with degrees of freedom

<math>df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(s_1^2 / n_1\right)^2}{n_1 - 1} + \frac{\left(s_2^2 / n_2\right)^2}{n_2 - 1}}</math>

Trust me bro. Remember to round down to use the t-table. With this degree of freedom, we can use the sample variances to estimate the distribution.
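
For what it's worth, scipy's Welch t-test uses this same degrees-of-freedom formula internally; a minimal sketch with hypothetical data:

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import ttest_ind

y1 = np.array([5.1, 4.9, 5.4, 5.0, 5.2, 4.8])  # hypothetical group 1
y2 = np.array([4.6, 4.9, 4.5, 4.7, 4.8])       # hypothetical group 2

# equal_var=False -> Welch's test, using the df formula above
res = ttest_ind(y1, y2, equal_var=False)
print(res.statistic, res.pvalue)
</syntaxhighlight>

[[Category:Sample Statistics]]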