Bivariate

[[Category:Distribution (Statistics)]][[Category:Statistics]]
'''Bivariate''' data consider two variables instead of the usual one; each value of one of the variables is paired with a value of the other variable. We will be using <math>X, Y</math> to denote the two random variables throughout this page.


= Summary Statistics =
To summarize bivariate data, we use covariance and correlation in addition to the statistics detailed in [[Summary Statistics]].


== Covariance ==
The '''covariance''' measures how two random variables vary together about their centers. It indicates the direction of the relationship between the two variables: whether one tends to increase or decrease as the other changes, and how much the two vary together.
 
We have ''sample covariance''
 
<math>s^2_{X, Y} = \hat{cov}(X, Y) = \frac{1}{n - 1} \sum(x_i - \bar{x}) (y_i - \bar{y}) = \frac{1}{n - 1} \left( \sum x_i y_i - n \bar{x} \bar{y} \right)</math>
 
A good way of thinking about covariance is by cases:
 
If x ''increases'' as y ''increases'', the two factors <math>(x_i - \bar{x})</math> and <math>(y_i - \bar{y})</math> tend to have the same sign, so their products are positive. Therefore, the covariance is ''positive''.

If x ''decreases'' as y ''increases'', the signs tend to differ, so the products are negative. Therefore, the covariance is ''negative''.

If x does not clearly vary with y, the signs are sometimes different and sometimes the same. Overall, the products should roughly cancel out to ''zero''.
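
As a quick numerical illustration, the sample covariance can be computed straight from the formula above. This is a minimal sketch in Python/NumPy; the paired data are made up for illustration.

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical paired observations (x_i, y_i)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(x)
x_bar, y_bar = x.mean(), y.mean()

# Sample covariance from the definition: sum of products of deviations over n - 1
cov_xy = np.sum((x - x_bar) * (y - y_bar)) / (n - 1)

# Cross-check against NumPy's estimate (np.cov also divides by n - 1 by default)
print(cov_xy, np.cov(x, y)[0, 1])
</syntaxhighlight>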
 
== Correlation ==
The '''Pearson correlation''' of two random variables measures the '''linear dependence''' between <math>X</math> and <math>Y</math>:


<math>
Cor(X, Y) = \rho = \frac{Cov(X,Y)}{sd(X) sd(Y)}
</math>
Correlation is ''normalized'': it has no units, and it is always between -1 and 1. When r = 1, the relationship between X and Y is '''perfect positive linear'''. When r = -1, it is '''perfect negative linear'''. If it is 0, there is no ''linear'' relationship; this does not mean that there is no relationship at all. Notably, any symmetric scatter plot has a correlation of 0.
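
For example, the sample version of this formula divides the sample covariance by the two sample standard deviations. A minimal NumPy sketch, reusing the hypothetical data from the covariance example above:

<syntaxhighlight lang="python">
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# r = cov(X, Y) / (sd(X) * sd(Y)); ddof=1 gives the sample (n - 1) versions
cov_xy = np.cov(x, y)[0, 1]
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

# Cross-check against np.corrcoef, which computes the same quantity
print(r, np.corrcoef(x, y)[0, 1])
</syntaxhighlight>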
When the relationship between the two is non-linear, we use '''Spearman's correlation''':
<math>\rho = 1 - \frac{6 \sum d^2_i}{n (n^2 - 1)}</math>
where d<sub>i</sub> is the difference between the two ranks of the i-th observation. For each of the two variables, we ''rank'' the observations by sorting from smallest to largest.
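
A minimal sketch of the rank-based formula (the data are hypothetical, chosen with no ties so that the formula above applies directly; scipy.stats is used for the ranking and for a cross-check):

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import rankdata, spearmanr

# Hypothetical paired data with no ties
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Rank each variable from smallest to largest and take the rank differences d_i
d = rankdata(x) - rankdata(y)
n = len(x)
rho = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

# Cross-check against SciPy's implementation
print(rho, spearmanr(x, y).correlation)  # both 0.8 here
</syntaxhighlight>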


= Bivariate Normal =
[[File:Bivariate Normal Example Scatterplot.png|thumb|Scatterplots of bivariate normal distribution]]
 
The '''bivariate normal''' (also known as the bivariate Gaussian) is a special type of continuous joint distribution for a pair of random variables.
 
<math>(X, Y)</math> is ''bivariate normal'' if
 
# The marginal PDFs of both X and Y are normal
# For any <math>x</math>, the conditional PDF of <math>Y</math> given <math>X = x</math> is normal
#* This works the other way around as well: a bivariate Gaussian satisfies both conditions (see the simulation sketch below)
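
A small simulation sketch of this definition (assuming NumPy; the means, standard deviations, correlation, and sample size are arbitrary choices): draw from a bivariate normal and check that the marginal of X behaves like a normal with the chosen parameters.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary parameters: means, standard deviations, and correlation rho
mean = [0.0, 0.0]
sd_x, sd_y, rho = 1.0, 2.0, 0.6
cov = [[sd_x ** 2,         rho * sd_x * sd_y],
       [rho * sd_x * sd_y, sd_y ** 2        ]]

# Draw paired samples (X, Y) from the bivariate normal
xy = rng.multivariate_normal(mean, cov, size=10_000)
x, y = xy[:, 0], xy[:, 1]

# The sample marginal of X should match N(0, 1), and the sample
# correlation should be close to rho = 0.6
print(x.mean(), x.std(), np.corrcoef(x, y)[0, 1])
</syntaxhighlight>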
 
== Predicting Y given X ==
 
Given a bivariate normal, we can predict one variable from the other.
Let us try estimating the expected value of <math>Y</math> given <math>X = x</math>:
 
<math>
E(Y| X = x)
</math>
 
There are three main methods:
* Scatter plot approximation
* Joint PDF
* 5 parameters
 
=== 5 Parameters ===
 
We need to know 5 parameters about <math>X</math> and <math>Y</math>
 
<math>E(X), sd(X), E(Y), sd(Y), \rho</math>
 
If <math>(X, Y)</math> follows a bivariate normal distribution, then we
have
 
<math>
\left( \frac{E(Y|X = x) - E(Y)}{sd(Y)} \right) = \rho \left( \frac{x -
E(X)}{sd(X)} \right)
</math>
 
The left side is the ''predicted Z-score for Y'', and the right side is
''the product of the correlation and the Z-score of <math>X = x</math>''.
 
The conditional variance is given by
 
<math>
Var(Y | X = x) = (1 - \rho^2) Var(Y)
</math>
 
Because <math>\rho^2 \le 1</math>, the variance of <math>Y</math> given
<math>X</math> is never larger than the unconditional variance of <math>Y</math>.
The conditional standard deviation is just the square root of that.
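
Putting the conditional mean and conditional variance formulas together, here is a worked sketch; the five parameter values and the observed x are hypothetical.

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical values of the five parameters
mean_x, sd_x = 70.0, 3.0    # E(X), sd(X)
mean_y, sd_y = 170.0, 10.0  # E(Y), sd(Y)
rho = 0.5                   # correlation

x = 76.0  # observed value of X

# Predicted Z-score for Y is rho times the Z-score of x
z_x = (x - mean_x) / sd_x
pred_y = mean_y + rho * z_x * sd_y        # E(Y | X = x)

# Conditional spread: Var(Y | X = x) = (1 - rho^2) Var(Y)
var_y_given_x = (1 - rho ** 2) * sd_y ** 2
sd_y_given_x = np.sqrt(var_y_given_x)

print(pred_y, sd_y_given_x)  # 180.0 and about 8.66
</syntaxhighlight>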
 
== Regression Effect ==
The '''regression effect''' is the phenomenon that the best prediction
of <math>Y</math> given <math>X = x</math> is less extreme (relative to the
distribution of <math>Y</math>) than <math>x</math> is (relative to the
distribution of <math>X</math>); future predictions regress toward mediocrity.
 
If the sample is random (or at least somewhat random), it is unlikely that a subject with an extreme score in ''x'' also got an extreme score in ''y''. Therefore, the expectation of ''Y'' given ''X'' is pulled toward the mean. This is barely elaborated on in class; see [[wikipedia:Regression_toward_the_mean|Wikipedia]] for more detail.
 
When you plot all the predicted <math>E(Y|X = x)</math>, you get the
'''linear regression line'''. The regression effect can be demonstrated
by also plotting the SD line (where the correlation is not applied).
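
A quick simulation sketch of the effect (assuming NumPy; the correlation and sample size are arbitrary): subjects with extreme x-scores have, on average, noticeably less extreme y-scores.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)

# Standardized bivariate normal with correlation 0.5
rho = 0.5
xy = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=100_000)
x, y = xy[:, 0], xy[:, 1]

# Among subjects with an extreme x (more than 2 SDs above the mean),
# the average y is only about rho times as far above the mean
extreme = x > 2
print(x[extreme].mean(), y[extreme].mean())  # roughly 2.4 and 1.2
</syntaxhighlight>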
 
= Linear Regression =
 
== Assumptions ==
 
# X and Y have a linear relationship
# A random sample of pairs was taken
# All pairs of data are independent
# The variance of the error is constant. <math>Var(\epsilon) = \sigma_\epsilon^2</math>
# The average of the errors is zero. <math>E(\epsilon) = 0</math>
# The errors are normally distributed.
 
<math>
\epsilon_i \overset{iid}{\sim} N(0, \sigma_\epsilon^2), \qquad Y_i \sim N(\beta_0 + \beta_1 x_i, \sigma_\epsilon^2)
</math>
 
== Procedure ==
 
<math>
y_i = \beta_0 + \beta_1 x_i + \epsilon_i
</math>
 
where <math>\beta_0, \beta_1</math> are the '''regression
coefficients''' (intercept and slope, respectively) based on the population, and
<math>\epsilon_i</math> is the error for the i-th subject.
 
We want to estimate the regression coefficients.
 
Let <math>\hat{y_i}</math> be an estimate of <math>y_i</math>; a
prediction at <math>X = x_i</math>, with
 
<math>
\hat{y_i} = \hat{\beta_0} + \hat{\beta_1} x_i
</math>
 
We can measure the vertical error <math>e_i = y_i - \hat{y_i}</math>
 
The overall error is the sum of squared errors <math>SSE = \sum_i^n
e_i^2</math>. The best fit line is the line minimizing SSE.
 
Using calculus, we can find that the best fit line has the following slope and
intercept:
 
<math>
\hat{\beta_1} = r \frac{s_y}{s_x}
</math>
 
where <math>r</math> is the sample correlation (the strength of the linear relationship), and
<math>s_x, s_y</math> are the sample standard deviations. They are
the sample versions of <math>\rho</math> and <math>\sigma</math>.
 
<math>
\hat{\beta_0} = \bar{Y} - \hat{\beta_1} \bar{X}
</math>
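
A minimal sketch of these two formulas on sample data (the paired data are hypothetical; np.polyfit is used only as a cross-check):

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical paired sample
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.3, 4.1, 5.8, 8.4, 9.9])

# Sample correlation and standard deviations (the sample versions of rho and sigma)
r = np.corrcoef(x, y)[0, 1]
s_x, s_y = x.std(ddof=1), y.std(ddof=1)

# Least-squares slope and intercept
beta1_hat = r * s_y / s_x
beta0_hat = y.mean() - beta1_hat * x.mean()

# Cross-check against NumPy's degree-1 least-squares fit ([slope, intercept])
print(beta1_hat, beta0_hat)
print(np.polyfit(x, y, 1))
</syntaxhighlight>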
 
== Interpretation ==
 
<math>\hat{\beta_1}</math> (the slope) is the estimated change in
<math>Y</math> when <math>X</math> increases by one unit.
 
<math>\hat{\beta_0}</math> (the intercept) is the estimated average of
<math>Y</math> when <math>X = 0</math>. If <math>X</math> cannot be 0,
this may not have a practical meaning.
 
<math>r^2</math> (the '''coefficient of determination''') measures how well
the line fits the data:
 
<math>
r^2 = \frac{\sum (\hat{y_i} - \bar{Y})^2 }{\sum (y_i - \bar{Y})^2}
</math>
 
The denominator is the total variation in <math>y</math>; the numerator is the
variation explained by the regression line. The value is the proportion of
variance in <math>y</math> that is explained by the linear relationship between
<math>X</math> and <math>Y</math>.
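
A short sketch computing <math>r^2</math> from the fitted values, continuing the hypothetical data from the sketch above; the result equals the squared Pearson correlation.

<syntaxhighlight lang="python">
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.3, 4.1, 5.8, 8.4, 9.9])

# Fit the least-squares line and compute the fitted values y_hat
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

# r^2 = explained variation / total variation
ss_reg = np.sum((y_hat - y.mean()) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = ss_reg / ss_tot

# Cross-check: r^2 equals the squared Pearson correlation
print(r_squared, np.corrcoef(x, y)[0, 1] ** 2)
</syntaxhighlight>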
