Summary Statistics
Revision as of 00:01, 19 March 2024
When we investigate a variable in a dataset, two quantities summarize it well: its center and its spread.
Measuring Center
There are two main ways to measure center: the mean and the median.
Mean
The mean is the average (expected value) of a variable. The sample mean is denoted as <math>\bar{x}</math>, whereas the population mean is denoted as <math>\mu_X</math>.
<math>\bar{x} = \frac{1}{n} \sum x_i</math>
where <math>n</math> is the sample size.
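As a small illustration of the formula above, here is a minimal Python sketch. The function name `sample_mean` and the data are made up for this example; `statistics.fmean` from the standard library is shown only as a cross-check.

```python
from statistics import fmean


def sample_mean(values):
    """Return the sample mean: the sum of the observations divided by n."""
    n = len(values)
    if n == 0:
        raise ValueError("the sample mean is undefined for an empty sample")
    return sum(values) / n


sample = [2, 4, 4, 5, 10]  # hypothetical data, chosen only for illustration

print(sample_mean(sample))  # 5.0
print(fmean(sample))        # 5.0, the standard-library equivalent
```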
Median/Percentiles/Quartiles
The median tells us the literal center of the dataset: 50% of the observations lie to its left and 50% to its right. It is denoted with <math>\widetilde{X}</math>.
The quartiles work the same way, except at 25% for the first quartile, 50% for the second (which is also the median), and 75% for the third. They are denoted with <math>Q_1, Q_2, Q_3</math>.
Percentiles work the same way at any chosen percentage. For example, the 80th percentile has 80% of the data before it.
To locate the P-th percentile (and thereby the median and the quartiles as well), we first compute its position in the ordered data:
<math>\left( \frac{P}{100} \right) (n + 1)</math>
where <math>n</math> is the sample size. The percentile is the value at that position; a fractional position falls between two neighboring values.
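The position rule can be sketched in Python as follows. The names `percentile_position` and `percentile` are hypothetical, and two details are assumptions of this sketch rather than something the page specifies: fractional positions are resolved by linear interpolation between the two neighboring values, and positions outside the sample are clamped to its ends.

```python
def percentile_position(p, n):
    """1-based position of the p-th percentile in an ordered sample of size n."""
    return (p / 100) * (n + 1)


def percentile(values, p):
    """Value at the p-th percentile of `values`.

    Assumptions of this sketch: linear interpolation for fractional
    positions, and clamping of positions to the range [1, n].
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("percentiles are undefined for an empty sample")
    pos = min(max(percentile_position(p, n), 1), n)  # clamp to [1, n]
    lower = int(pos)       # position of the neighbor at or below pos
    frac = pos - lower     # how far pos lies toward the next value
    if lower == n:         # already at the last observation
        return ordered[-1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])


sample = [15, 3, 9, 12, 6, 21, 18]  # hypothetical data; ordered: 3 6 9 12 15 18 21

print(percentile_position(25, len(sample)))  # 2.0, so Q1 is the 2nd ordered value
print(percentile(sample, 25))  # 6.0
print(percentile(sample, 50))  # 12.0 (the median)
print(percentile(sample, 75))  # 18.0
```

With seven observations, n + 1 = 8, so the quartile positions come out as the whole numbers 2, 4, and 6 and no interpolation is needed.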
Mode
The mode is the most frequently occurring value. It is used far less often than the mean or the median.
Measuring Spread
There are three main ways to measure the spread of the dataset: range, interquartile range, and variance.
Range
The range of a variable is the interval between its smallest and largest observations (the first and last values after ordering). Its width is the maximum minus the minimum.
Interquartile Range (IQR)
The interquartile range (IQR) spans the middle 50% of the dataset: it is the distance between the third and first quartiles, <math>Q_3 - Q_1</math>.
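Both the range and the IQR can be computed in a few lines of Python. This is a sketch with hypothetical names (`value_range`, `iqr`) and made-up data. It relies on `statistics.quantiles` from the standard library, whose default `"exclusive"` method is documented as placing the i-th of m sorted points at i / (m + 1), which appears to match the (n + 1) position rule used on this page.

```python
from statistics import quantiles


def value_range(values):
    """Return the range as an interval: (smallest, largest) observation."""
    return min(values), max(values)


def iqr(values):
    """Interquartile range: the third quartile minus the first quartile."""
    q1, _q2, q3 = quantiles(values, n=4)  # default method is "exclusive"
    return q3 - q1


sample = [15, 3, 9, 12, 6, 21, 18]  # hypothetical data

low, high = value_range(sample)
print(low, high, high - low)  # 3 21 18
print(iqr(sample))            # 12.0
```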
Variance
The variance measures how much the dataset deviates from the mean. To be exact, it measures the average squared difference from the mean.
We take the squared difference because we want observations smaller than the mean and observations greater than the mean to both contribute to the variance; without squaring, negative and positive differences would offset each other.
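A short worked step shows why the unsquared differences are not usable on their own. Since the definition of the sample mean gives <math>\sum x_i = n \bar{x}</math>, the differences from the mean always add up to zero:

<math>\sum (x_i - \bar{x}) = \sum x_i - n \bar{x} = n \bar{x} - n \bar{x} = 0</math>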
The sample variance is denoted as <math>s^2</math> and the population variance is denoted as <math>\sigma^2</math>. We have
<math>s^2 = \frac{1}{n - 1} \sum (x_i - \bar{x})^2 = \frac{1}{n - 1} \left( \sum x_i^2 - n \bar{x}^2 \right)</math>
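The two forms of the sample variance can be checked against each other numerically. The sketch below uses hypothetical function names and made-up data; `statistics.variance` from the standard library (which also divides by n - 1) serves as a cross-check.

```python
from statistics import variance


def sample_variance(values):
    """Sample variance from the definition: squared deviations over n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("the sample variance needs at least two observations")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Sample variance from the shortcut form: (sum of squares - n * mean^2) / (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("the sample variance needs at least two observations")
    mean = sum(values) / n
    return (sum(x * x for x in values) - n * mean ** 2) / (n - 1)


sample = [2, 4, 4, 5, 10]  # hypothetical data with mean 5

print(sample_variance(sample))           # 9.0
print(sample_variance_shortcut(sample))  # 9.0
print(variance(sample))                  # 9
```

The shortcut form is convenient by hand, but in floating-point arithmetic it subtracts two large, nearly equal numbers when the mean is large relative to the spread, so the definitional form is usually the safer one to code.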
