Latest revision as of 01:12, 19 March 2024

When we investigate a dataset, two things are commonly used to summarize it: the center and the spread.

Measuring Center

There are two ways to measure center: the mean and the median.

Mean

The mean is the average (expected) value of a variable. The sample mean is denoted as ${\bar {x}}$, whereas the population mean is denoted as $\mu _{X}$.

${\bar {x}}={\frac {1}{n}}\sum x_{i}$

where $n$ is the sample size.
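As a minimal sketch of the formula above, the sample mean can be computed directly from its definition. The function name `sample_mean` and the data values are made up for illustration.

```python
def sample_mean(xs):
    """Return the arithmetic mean (1/n) * sum(x_i) of a non-empty sequence."""
    n = len(xs)
    if n == 0:
        raise ValueError("the mean of an empty dataset is undefined")
    return sum(xs) / n


data = [2, 4, 4, 5, 10]      # hypothetical sample, n = 5
print(sample_mean(data))     # 5.0
```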

Median/Percentiles/Quartiles

The median tells us the literal center of the dataset: 50% of the observations lie to its left and 50% to its right. It is denoted with ${\widetilde {X}}$.

The quartiles work the same way, except that they sit at 25% for the first quartile, 50% for the second (which is also the median), and 75% for the third. They are denoted with $Q_{1},Q_{2},Q_{3}$.

Percentiles work the same way too, except at an arbitrary percentage. For example, the 80th percentile has 80% of the data before it.

To calculate the P-th percentile (and thereby all the other something-tiles), we first order the data and then find the position of the percentile in the ordered list:

$\left({\frac {P}{100}}\right)(n+1)$

where $n$ is the sample size.
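The position formula can be sketched in code as follows. Note one assumption that is not stated above: when the position is not a whole number, this sketch interpolates linearly between the two neighbouring ordered values, which is one common convention. The names `percentile_position` and `percentile` and the data are hypothetical.

```python
def percentile_position(p, n):
    """1-based position of the p-th percentile in ordered data: (p/100)(n+1)."""
    return (p / 100) * (n + 1)


def percentile(xs, p):
    """Return the p-th percentile of xs using the (n+1) position rule.

    Assumption: fractional positions are resolved by linear interpolation,
    and positions outside 1..n are clamped to the nearest end.
    """
    ordered = sorted(xs)
    n = len(ordered)
    if n == 0:
        raise ValueError("percentiles of an empty dataset are undefined")
    pos = min(max(percentile_position(p, n), 1), n)
    lower = int(pos)            # whole part of the position (1-based)
    frac = pos - lower          # fractional part used for interpolation
    if lower >= n:
        return ordered[-1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + frac * (above - below)


data = [1, 3, 5, 7, 9, 11, 13]          # hypothetical sample, n = 7
print(percentile(data, 25))             # Q1: position 2 -> 3.0
print(percentile(data, 50))             # median: position 4 -> 7.0
print(percentile(data, 75))             # Q3: position 6 -> 11.0
```

With $n=7$ the quartile positions are whole numbers, so no interpolation is needed in this example.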

Mode

The mode is the most frequently occurring value. It is used far less often than the mean or the median.

Measuring Spread

There are three main ways to measure the spread of the dataset: range, interquartile range, and variance.

Range

The range of a variable is the distance between the smallest and the largest observation (the first and last values after ordering).

Interquartile Range (IQR)

The interquartile range (IQR) is the width of the middle 50% of the dataset. It is just the distance from $Q_{1}$ to $Q_{3}$, that is, $IQR=Q_{3}-Q_{1}$.

Variance

The variance measures how much the dataset deviates from the mean. To be exact, it measures the average squared difference from the mean.

We square the differences so that observations smaller than the mean and observations greater than the mean both contribute to the variance instead of cancelling each other out.

The sample variance is denoted as $s^{2}$ and the population variance is denoted as $\sigma ^{2}$. We have

$s^{2}={\frac {1}{n-1}}\sum (x_{i}-{\bar {x}})^{2}={\frac {1}{n-1}}\left(\sum x_{i}^{2}-n{\bar {x}}^{2}\right)$
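To check that the two forms of the formula agree, here is a small sketch that computes both and compares them with the standard library's `statistics.variance`. The function names and the data are illustrative only.

```python
import statistics


def sample_variance(xs):
    """Definitional form: sum of squared deviations divided by n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def sample_variance_shortcut(xs):
    """Shortcut form: (sum of squares - n * mean^2) divided by n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    mean = sum(xs) / n
    return (sum(x * x for x in xs) - n * mean ** 2) / (n - 1)


data = [2, 4, 4, 5, 10]                   # hypothetical sample, mean = 5
print(sample_variance(data))              # 9.0
print(sample_variance_shortcut(data))     # 9.0
print(statistics.variance(data) == 9)     # True
```

The shortcut form needs only one pass over the data, but it can lose precision in floating point when the mean is large relative to the spread, so the definitional form is usually the safer default.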

Without going into too much detail, the reason we divide by $n-1$ is that dividing by $n$ systematically underestimates the variance when we use the sample mean instead of the true mean.
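The effect of the divisor can be seen exactly on a tiny example. The sketch below assumes a made-up population $\{1,2,3\}$, enumerates every ordered sample of size 2 drawn with replacement, and averages the two candidate estimators over all of them. It uses `Fraction` so that the comparison is exact.

```python
from fractions import Fraction
from itertools import product


def variance_with_divisor(sample, divisor):
    """Sum of squared deviations from the sample mean, divided by `divisor`."""
    mean = Fraction(sum(sample), len(sample))
    return sum((x - mean) ** 2 for x in sample) / divisor


population = [1, 2, 3]                       # hypothetical population
mu = Fraction(sum(population), len(population))
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)

n = 2                                        # sample size
samples = list(product(population, repeat=n))   # all 9 ordered samples

avg_n_minus_1 = sum(variance_with_divisor(s, n - 1) for s in samples) / len(samples)
avg_n = sum(variance_with_divisor(s, n) for s in samples) / len(samples)

print(sigma2)          # 2/3, the true population variance
print(avg_n_minus_1)   # 2/3, dividing by n - 1 is right on average
print(avg_n)           # 1/3, dividing by n is too small on average
```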

Standard Deviation

Variance is difficult to interpret because it is in squared units. Naturally then, we can take the square root of the variance and get the standard deviation: the typical deviation of a data point from the mean.
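A short sketch of this relationship, using the standard library and a made-up sample: the square root of the sample variance matches `statistics.stdev`.

```python
import math
import statistics

data = [2, 4, 4, 5, 10]                  # hypothetical sample

variance = statistics.variance(data)     # sample variance, divisor n - 1
std_dev = math.sqrt(variance)            # back in the original units

print(std_dev)                           # 3.0
print(statistics.stdev(data))            # 3.0
```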

Outliers

Outliers are observations that appear extreme relative to the rest of the dataset. We use the following cutoffs to flag outliers:

Upper cutoff: $>Q_{3}+1.5(IQR)$

Lower cutoff: $<Q_{1}-1.5(IQR)$
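The two cutoffs can be applied as in the sketch below. It relies on `statistics.quantiles` (Python 3.8+), whose default `"exclusive"` method places quartiles using an $(n+1)$-based position. The function name `iqr_outliers` and the data are hypothetical.

```python
import statistics


def iqr_outliers(xs):
    """Return (outliers, (lower_cutoff, upper_cutoff)) using the 1.5 * IQR rule."""
    q1, _, q3 = statistics.quantiles(xs, n=4)   # default method="exclusive"
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [x for x in xs if x < lower or x > upper]
    return outliers, (lower, upper)


data = [2, 4, 4, 5, 6, 7, 30]     # hypothetical sample: Q1 = 4, Q3 = 7, IQR = 3
outliers, cutoffs = iqr_outliers(data)
print(cutoffs)                    # (-0.5, 11.5)
print(outliers)                   # [30]
```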

Robustness

We call a statistic robust if it is not strongly affected by outliers.

Robust statistics include the median and the IQR.

Non-robust statistics include mean, range, and standard deviation.
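To make the distinction concrete, the sketch below summarizes a made-up dataset twice: once as is, and once with its largest value replaced by an extreme one. The helper name `summarize` is hypothetical, and quartiles again come from `statistics.quantiles` with its default method.

```python
import statistics


def summarize(xs):
    """Compute the center and spread statistics discussed on this page."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return {
        "mean": statistics.mean(xs),
        "median": statistics.median(xs),
        "range": max(xs) - min(xs),
        "iqr": q3 - q1,
        "stdev": statistics.stdev(xs),
    }


clean = [2, 4, 4, 5, 6, 7, 8]           # hypothetical data
contaminated = [2, 4, 4, 5, 6, 7, 80]   # same data with one extreme value

before = summarize(clean)
after = summarize(contaminated)
for name in before:
    print(name, before[name], after[name])
```

In this example the median stays at 5 and the IQR stays at 3, while the mean, range, and standard deviation all jump because of the single extreme value.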

Others

Besides those listed above, there are other statistics used to measure center and spread:

Spread:

Bivariate data: covariance, correlation
