Sampling Distribution
Let there be <math>Y_1, \ldots, Y_n</math>, where each <math>Y_i</math> is a random variable from the population.
Every <math>Y_i</math> has the same mean <math>\mu</math>, variance <math>\sigma^2</math>, and distribution, which we don't know.
We then have the sample mean <math>\bar{Y} = \frac{1}{n} \sum_{i=1}^n Y_i</math>.
The sample mean is expected to be <math>E(\bar{Y}) = E \left( \frac{1}{n} \sum Y_i \right) = \frac{1}{n} \sum E(Y_i) = \mu</math> through a pretty easy direct proof.
The variance of the sample mean is <math>Var(\bar{Y}) = \frac{1}{n^2} \sum Var(Y_i) = \frac{\sigma^2}{n}</math>, also through a pretty easy direct proof (assuming the <math>Y_i</math> are independent).
= Central limit theorem =
The central limit theorem states that the distribution of the sample mean follows a normal distribution:
<math>\bar{Y} \sim N \left( \mu, \frac{\sigma^2}{n} \right)</math>
As long as one of the following two conditions is satisfied, the CLT applies, regardless of the population's distribution.
* The population distribution of <math>Y_i</math> is normal, or
* The sample size ''n'' is large
By extension, we also have the distribution of the sum
<math>T \sim N \left( n\mu, n\sigma^2 \right)</math>
where <math>T = \sum_{i=1}^n Y_i</math>.
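To see this numerically, here is a minimal simulation sketch (Python with NumPy is an assumption, since the lecture doesn't prescribe a tool; the exponential population and the constants are arbitrary choices for illustration):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(seed=1)

# Population: exponential with mean 2 (clearly not normal).
# For the exponential, sigma^2 = mu^2 = 4.
mu, n, reps = 2.0, 50, 10_000

# Each row is one sample of size n; take the mean of each row.
sample_means = rng.exponential(scale=mu, size=(reps, n)).mean(axis=1)

# CLT prediction: mean mu = 2, variance sigma^2 / n = 4 / 50 = 0.08.
print(sample_means.mean())  # close to 2.0
print(sample_means.var())   # close to 0.08
</syntaxhighlight>

A histogram of <code>sample_means</code> looks bell-shaped even though the population itself is heavily skewed.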
= Confidence Interval =
Estimation is the guess for the unknown parameter. A point estimate is a "best guess" of the population parameter, whereas the confidence interval is a range of reasonable values intended to contain the parameter of interest with a certain degree of confidence, calculated with
(point estimate - margin of error, point estimate + margin of error)
== Constructing CIs ==
By CLT, <math>\bar{Y} \sim N \left( \mu, \frac{\sigma^2}{n} \right)</math>. The confidence interval is the range of plausible values of <math>\mu</math>.
If we define the middle 90% to be plausible, to find the confidence interval, simply find the 5th and 95th percentiles.
Generalized, if we want a confidence interval of the middle <math>(1 - \alpha) \times 100\%</math>, we have a confidence interval of
<math>\left( \bar{y} - z_{1 - \alpha/2} \frac{\sigma}{\sqrt{n}}, \ \bar{y} + z_{1 - \alpha/2} \frac{\sigma}{\sqrt{n}} \right)</math>
where <math>\bar{y}</math> is the sample mean and <math>z_x</math> is the z score of the <math>x</math>-th percentile.
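As a concrete sketch of this formula (the sample numbers are hypothetical, and SciPy's <code>norm.ppf</code> stands in for a z-table lookup):

<syntaxhighlight lang="python">
from math import sqrt
from scipy.stats import norm

# Hypothetical sample summary, for illustration only.
y_bar, sigma, n = 12.3, 4.0, 64
conf = 0.90

# Middle 90% -> 5th and 95th percentiles, i.e. z at 1 - alpha/2 = 0.95.
alpha = 1 - conf
z = norm.ppf(1 - alpha / 2)

margin = z * sigma / sqrt(n)
print((y_bar - margin, y_bar + margin))  # the 90% confidence interval
</syntaxhighlight>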
= Binomial Normal Approximation =
The sampling distribution of the proportion of successes of ''n'' bernoulli random variables (<math>\hat{p}</math>) can also be approximated by a normal distribution under the CLT.
Consider [[Discrete Random Variable#Bernoulli|bernoulli]] random variables
<math>Y_1, \ldots, Y_n</math> with <math>E(Y_i) = p, \ Var(Y_i) = p(1 - p)</math>
The proportion of successes <math>\hat{p}</math> is the sum over the count, so the expected probability is
<math>E(\hat{p}) = E \left( \frac{1}{n} \sum Y_i \right) = p</math>
and the variance of <math>\hat{p}</math> is
<math>Var(\hat{p}) = Var \left( \frac{1}{n} \sum Y_i \right) = \frac{p(1 - p)}{n}</math>
With a large sample size ''n'', we can approximate this with a normal distribution. Notably, the criteria for a large ''n'' are different from those for the continuous random variable:
<math>np > 5, \ n(1 - p) > 5</math>
then we have
<math>\hat{p} \sim N \left( p, \frac{p(1 - p)}{n} \right)</math>
The reasoning behind the weird criteria relates to the [[Discrete Random Variable|binomial distribution]]. It's not elaborated on much in the lecture, but <math>np</math> is the mean of the binomial (i.e. the expected number of successes). The criteria essentially make sure that no negative values are plausible in the approximation; with a small mean and a large variance, the left side of a normal approximation goes into the negative, but bernoulli/binomial counts should always be non-negative by common sense.
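A short sketch of the criteria check and the resulting interval (the counts are hypothetical, and using <math>\hat{p}</math> in place of the unknown ''p'' when checking the criteria is a practical assumption, not something from the lecture):

<syntaxhighlight lang="python">
from math import sqrt
from scipy.stats import norm

# Hypothetical sample: 18 successes out of 40 trials.
n, successes = 40, 18
p_hat = successes / n

# Normal-approximation criteria from above, with p_hat estimating p.
assert n * p_hat > 5 and n * (1 - p_hat) > 5

# 95% interval for p under p_hat ~ N(p, p(1 - p) / n).
z = norm.ppf(0.975)
margin = z * sqrt(p_hat * (1 - p_hat) / n)
print((p_hat - margin, p_hat + margin))
</syntaxhighlight>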
= T-Distribution =
CLT has several restrictions, the biggest one being a large sample size. The t-distribution addresses the case where the sample size is small.
Since we don't know the population variance <math>\sigma^2</math>, we have to use the sample variance <math>s^2</math> to estimate it. This introduces more uncertainty, accounted for by the t-distribution.
The t-distribution is the distribution of the sample mean based on the population mean, sample variance, and degrees of freedom (covered later). It looks very similar to the normal distribution.
When the sample size is small, there is greater uncertainty in the estimates. The t-distribution accounts for this with heavier tails than the normal distribution.
The spread of the t-distribution depends on the degrees of freedom, which are based on the sample size (<math>df = n - 1</math> for a single sample). When looking up the table, round down the df.
As the sample size increases, the degrees of freedom increase, the spread of the t-distribution decreases, and the t-distribution approaches the normal distribution.
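This convergence is easy to see by comparing critical values (a quick sketch using SciPy's <code>t.ppf</code> and <code>norm.ppf</code> in place of table lookups):

<syntaxhighlight lang="python">
from scipy.stats import norm, t

# 97.5th percentile: the t critical value shrinks toward the
# z critical value (~1.96) as the degrees of freedom grow.
print(norm.ppf(0.975))
for df in (2, 5, 10, 30, 100):
    print(df, t.ppf(0.975, df))
</syntaxhighlight>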
Based on CLT and the normal distribution, we had the confidence interval
<math>\bar{y} \pm z_{1 - \alpha/2} \frac{\sigma}{\sqrt{n}}</math>
Now, based on the t-distribution, we have the CI
<math>\bar{y} \pm t_{1 - \alpha/2, \, df} \frac{s}{\sqrt{n}}</math>
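A minimal sketch of the t-based CI (the numbers are hypothetical; <code>t.ppf</code> replaces the t-table lookup):

<syntaxhighlight lang="python">
from math import sqrt
from scipy.stats import t

# Hypothetical small sample, for illustration only.
y_bar, s, n = 12.3, 4.1, 15
df = n - 1  # single-sample degrees of freedom

t_crit = t.ppf(0.975, df)  # 95% CI
margin = t_crit * s / sqrt(n)
print((y_bar - margin, y_bar + margin))
</syntaxhighlight>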
= Find Sample Size =
To calculate the sample size needed for a desired margin of error ''m'' (using the sample variance to estimate <math>\sigma^2</math>), assume that
<math>m = z_{1 - \alpha/2} \frac{\sigma}{\sqrt{n}}</math>
and solve for ''n'':
<math>n = \left( \frac{z_{1 - \alpha/2} \, \sigma}{m} \right)^2</math>
We want to always round up to stay within the error margin: rounding up makes ''n'' at least as large as the computed value, so the actual margin of error is no larger than the desired one.
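A sketch of the calculation (the confidence level, margin, and sigma are hypothetical):

<syntaxhighlight lang="python">
from math import ceil
from scipy.stats import norm

# Hypothetical targets: 95% confidence, margin of error 0.5, sigma ~ 4.
sigma, m = 4.0, 0.5
z = norm.ppf(0.975)

n = (z * sigma / m) ** 2
print(n, ceil(n))  # ~245.9 -> 246; round up to stay within the margin
</syntaxhighlight>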
= Sampling Distribution of Difference =
By linear combination of RVs, the sampling distribution of <math>\bar{Y}_1 - \bar{Y}_2</math> is
<math>\bar{Y}_1 - \bar{Y}_2 \sim N \left( \mu_1 - \mu_2, \ \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} \right)</math>
However, we do not know the population variances <math>\sigma_1^2</math> and <math>\sigma_2^2</math>. If the CLT assumptions hold, then we have the degrees of freedom
<math>df = \frac{\left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2}{\frac{(s_1^2 / n_1)^2}{n_1 - 1} + \frac{(s_2^2 / n_2)^2}{n_2 - 1}}</math>
Trust me bro. Remember to round down when using the t-table. With this degree of freedom, we can use the sample variances to estimate the distribution.
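Putting the pieces together, here is a sketch of a two-sample CI with this degree-of-freedom formula (the summary statistics are hypothetical):

<syntaxhighlight lang="python">
from math import sqrt
from scipy.stats import t

# Hypothetical two-sample summary statistics.
y1, s1, n1 = 10.2, 3.1, 20
y2, s2, n2 = 8.7, 4.5, 25

v1, v2 = s1**2 / n1, s2**2 / n2

# Degrees of freedom from the formula above, rounded down
# (as when using the t-table).
df = int((v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1)))

t_crit = t.ppf(0.975, df)  # 95% CI
margin = t_crit * sqrt(v1 + v2)
print((y1 - y2 - margin, y1 - y2 + margin))
</syntaxhighlight>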