<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://ricefriedegg.com:80/mediawiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Admin</id>
	<title>Rice Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="http://ricefriedegg.com:80/mediawiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Admin"/>
	<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php/Special:Contributions/Admin"/>
	<updated>2026-04-09T15:20:35Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.41.0</generator>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Linear_first_order_ODE&amp;diff=498</id>
		<title>Linear first order ODE</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Linear_first_order_ODE&amp;diff=498"/>
		<updated>2024-04-08T21:15:25Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Differential Equations]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Linear first order [[Ordinary Differential Equation|ODE]]s&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Homogeneous if g(t) = 0&lt;br /&gt;
&lt;br /&gt;
Constant coefficient if p(t) = a is a constant. Otherwise, variable&lt;br /&gt;
coefficient.&lt;br /&gt;
&lt;br /&gt;
Examples:&lt;br /&gt;
y&#039; + ty = 2 nonhomogeneous, variable&lt;br /&gt;
y&#039; + 2y = 2 nonhomogeneous, constant&lt;br /&gt;
y&#039; + 2y = 0 homogeneous, constant&lt;br /&gt;
&lt;br /&gt;
Integrating factor method (2.1)&lt;br /&gt;
&lt;br /&gt;
y&#039; + p(t) y = g(t)&lt;br /&gt;
&lt;br /&gt;
Compute integrating factor&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\mu(t) = e^{\int p(t)\,dt}&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Ordinary_differential_equation&amp;diff=497</id>
		<title>Ordinary differential equation</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Ordinary_differential_equation&amp;diff=497"/>
		<updated>2024-04-08T21:14:24Z</updated>

		<summary type="html">&lt;p&gt;Admin: Admin moved page Ordinary Differential Equations to Ordinary Differential Equation without leaving a redirect&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Differential Equations]]&lt;br /&gt;
&lt;br /&gt;
An &#039;&#039;&#039;ordinary differential equation (ODE)&#039;&#039;&#039; relates a function and its&lt;br /&gt;
derivatives. We usually use &amp;lt;math&amp;gt;y&amp;lt;/math&amp;gt; to denote the function and&lt;br /&gt;
&amp;lt;math&amp;gt;t&amp;lt;/math&amp;gt; to denote the variable.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Ordinary&#039;&#039; means that the equation has one independent variable, as&lt;br /&gt;
opposed to a &#039;&#039;partial&#039;&#039; differential equation, which has several.&lt;br /&gt;
&lt;br /&gt;
There is &#039;&#039;no general solution method&#039;&#039; for ODEs. We separate them into&lt;br /&gt;
classes and solve each class individually.&lt;br /&gt;
&lt;br /&gt;
== Example ==&lt;br /&gt;
&lt;br /&gt;
An example of an ODE is the following&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
y&#039; = y&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The general solution of the above is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
y(t) = c e^t&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notably, the equation is &#039;&#039;homogeneous&#039;&#039;, meaning that &amp;lt;math&amp;gt;y = 0&amp;lt;/math&amp;gt; is&lt;br /&gt;
a solution. This will probably be covered later.&lt;br /&gt;
&lt;br /&gt;
To get a unique solution, we need to apply additional conditions, such&lt;br /&gt;
as specifying a particular value&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\begin{cases}&lt;br /&gt;
y&#039; = y \\&lt;br /&gt;
y(0) = y_0&lt;br /&gt;
\end{cases}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This is called an &#039;&#039;initial value problem&#039;&#039;, in which the ODE is paired&lt;br /&gt;
with a condition fixing the function&#039;s value at an initial point.&lt;br /&gt;
&lt;br /&gt;
== Usage ==&lt;br /&gt;
&lt;br /&gt;
Since the derivative can be described as the rate of change, and the&lt;br /&gt;
function itself as the state, ODEs arise as mathematical models of&lt;br /&gt;
systems whose &#039;&#039;rate of change depends on the state of the system&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
The following are brief descriptions of some applications of ODEs.&lt;br /&gt;
&lt;br /&gt;
# &#039;&#039;Radioactive decay&#039;&#039;, where the function is the (large) number of atoms.&lt;br /&gt;
#* Atoms decay at an average constant rate &amp;lt;math&amp;gt;r&amp;lt;/math&amp;gt;&lt;br /&gt;
#* &amp;lt;math&amp;gt;\frac{dN}{dt} = -rN&amp;lt;/math&amp;gt;&lt;br /&gt;
# &#039;&#039;Object falling under gravity&#039;&#039;, where the function is the velocity of the object&lt;br /&gt;
#* &amp;lt;math&amp;gt;\frac{dv}{dt} = g - \frac{\gamma v}{m}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Dimensions/Units ==&lt;br /&gt;
&lt;br /&gt;
The two sides of the equation must match in dimensions (i.e. units).&lt;br /&gt;
&lt;br /&gt;
Consider radioactive decay.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\begin{cases}&lt;br /&gt;
\frac{dN}{dt} = -rN \\&lt;br /&gt;
N(0) = N_0&lt;br /&gt;
\end{cases}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The solution comes to&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
N(t) = N_0 e^{-rt}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We use the &#039;&#039;&#039;time constant&#039;&#039;&#039; &amp;lt;math&amp;gt;\tau&amp;lt;/math&amp;gt; to get a sense of how fast&lt;br /&gt;
the quantity is decaying. Its unit is time.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\tau = \frac{1}{r}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
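The decay model above can be sanity-checked numerically. The sketch below integrates dN/dt = -rN with forward Euler and compares against the exact solution; the constants (N0 = 1000, r = 0.5, t = 2) are illustrative assumptions, not values from the notes.

```python
import math

def euler_decay(n0, r, t_end, steps):
    """Integrate dN/dt = -r*N with forward Euler from N(0) = n0."""
    dt = t_end / steps
    n = n0
    for _ in range(steps):
        n = n + dt * (-r * n)
    return n

# Compare against the exact solution N(t) = N0 * exp(-r*t)
approx = euler_decay(1000.0, 0.5, 2.0, 10000)
exact = 1000.0 * math.exp(-0.5 * 2.0)
print(abs(approx - exact))  # small discretization error
```

Shrinking the step size (raising `steps`) drives the error toward zero, which is a quick way to convince yourself the exponential solution is right.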
== Equilibrium Solution ==&lt;br /&gt;
&lt;br /&gt;
Consider an object falling under gravity&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\begin{cases}&lt;br /&gt;
\frac{dv}{dt} = g - \lambda v \\&lt;br /&gt;
v(0) = v_0&lt;br /&gt;
\end{cases}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We sometimes want the &#039;&#039;&#039;equilibrium solution&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
v(t) = v_*&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\frac{dv}{dt} = 0 = g - \lambda v_*&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Doing some algebra, we eventually get&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
v(t) = v_* + (v_0 - v_*) e^{-\lambda t}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
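The formula above can be verified directly: its derivative should equal g - lambda v at every time. A short numerical check (the constants g = 9.8, lambda = 0.5, v0 = 0 are illustrative assumptions):

```python
import math

g, lam, v0 = 9.8, 0.5, 0.0   # illustrative constants
v_star = g / lam             # equilibrium: g - lam * v_star = 0

def v(t):
    """The claimed solution v(t) = v_star plus (v0 - v_star) exp(-lam t)."""
    return v_star + (v0 - v_star) * math.exp(-lam * t)

# Check that dv/dt equals g - lam * v(t) at a sample time
t, h = 1.0, 1e-6
dvdt = (v(t + h) - v(t - h)) / (2 * h)   # centered difference
print(abs(dvdt - (g - lam * v(t))))      # approximately 0
```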
= Classification =&lt;br /&gt;
&lt;br /&gt;
An ODE is &#039;&#039;&#039;linear&#039;&#039;&#039; if all terms are proportional to &amp;lt;math&amp;gt;y, y&#039;,&lt;br /&gt;
y&#039;&#039;, \ldots&amp;lt;/math&amp;gt; or are given functions of &amp;lt;math&amp;gt;t&amp;lt;/math&amp;gt;. This&lt;br /&gt;
distinction is especially useful since linear combinations can be used to&lt;br /&gt;
construct solutions.&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;order&#039;&#039;&#039; of an ODE is the order of its highest derivative.&lt;br /&gt;
&lt;br /&gt;
In a &#039;&#039;&#039;scalar&#039;&#039;&#039; ODE, there is only one unknown function &amp;lt;math&amp;gt;y(t)&amp;lt;/math&amp;gt;.&lt;br /&gt;
In a &#039;&#039;&#039;system&#039;&#039;&#039;, there are several, and you have to solve them&lt;br /&gt;
simultaneously.&lt;br /&gt;
&lt;br /&gt;
Here is a list of ODEs we study, from simple to complex:&lt;br /&gt;
* [[Linear First Order ODE]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Information_Representation&amp;diff=466</id>
		<title>Information Representation</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Information_Representation&amp;diff=466"/>
		<updated>2024-04-02T22:09:39Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Arrays */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Inside a computer, everything is bits. The topic of &#039;&#039;&#039;information representation&#039;&#039;&#039; discusses what the different kinds of information inside a computer program actually look like behind the scenes.&lt;br /&gt;
&lt;br /&gt;
= Word =&lt;br /&gt;
Instead of one bit at a time, computers do operations on several at the same time. The &#039;&#039;&#039;word size&#039;&#039;&#039; of the computer is the number of bits representing each piece of data that it processes.&lt;br /&gt;
&lt;br /&gt;
Notably, since instructions are also &amp;quot;data&amp;quot; received by the CPU, each instruction is also constrained by the word size.&lt;br /&gt;
&lt;br /&gt;
Word size is determined by the CPU. For example, a 32-bit CPU has a word size of 32 bits (or 4 bytes).&lt;br /&gt;
&lt;br /&gt;
== Byte and Endianness ==&lt;br /&gt;
Most modern computers are &#039;&#039;&#039;byte-addressable&#039;&#039;&#039;: the byte is the smallest unit of data that has an address and can be accessed at once.&lt;br /&gt;
&lt;br /&gt;
Since we are storing the multiple bytes of a word across memory, the order in which we store them, the word&#039;s &#039;&#039;&#039;endianness&#039;&#039;&#039;, needs to be specified.&lt;br /&gt;
&lt;br /&gt;
There are two choices: Big Endian and Little Endian. Let&#039;s consider how to store 0x10203040 on a 32-bit machine.&lt;br /&gt;
&lt;br /&gt;
In &#039;&#039;&#039;Big Endian&#039;&#039;&#039;, the most significant byte is stored at the lowest address (i.e. big end first). From lowest address to highest, the bytes would be 0x10, 0x20, 0x30, 0x40. BE is used on the Internet.&lt;br /&gt;
&lt;br /&gt;
In &#039;&#039;&#039;Little Endian&#039;&#039;&#039;, the least significant byte is stored at the lowest address (i.e. little end first). From lowest address to highest, the bytes would be 0x40, 0x30, 0x20, 0x10. LE is used on Intel machines.&lt;br /&gt;
&lt;br /&gt;
Besides different CPUs using different endianness to run instructions, most file formats specify endianness to support different machines. For example, a Unicode text file may begin with a BOM (byte order mark) to denote whether the file is BE or LE.&lt;br /&gt;
&lt;br /&gt;
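The 0x10203040 example can be reproduced in a few lines using Python's built-in int.to_bytes (a sketch; byte values are listed in increasing address order):

```python
value = 0x10203040

be = value.to_bytes(4, "big")     # most significant byte first
le = value.to_bytes(4, "little")  # least significant byte first

print([hex(b) for b in be])  # ['0x10', '0x20', '0x30', '0x40']
print([hex(b) for b in le])  # ['0x40', '0x30', '0x20', '0x10']
```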
= Signed Integers =&lt;br /&gt;
While a series of bits naturally represents an unsigned binary number, negative numbers are a bit more complicated.&lt;br /&gt;
&lt;br /&gt;
The first method to represent signed integers with bits is &#039;&#039;&#039;signed magnitude&#039;&#039;&#039;. Under this system, the first bit (the &#039;&#039;sign bit&#039;&#039;) is used as a negative sign: if it is 0, the number is positive; otherwise it is negative.&lt;br /&gt;
&lt;br /&gt;
The second method is called &#039;&#039;&#039;2&#039;s complement&#039;&#039;&#039;, and it is more widely used due to several advantages covered later.&lt;br /&gt;
&lt;br /&gt;
In this system, think of the first bit in the bit pattern as carrying negative weight. For example, 101 would be 4 + 1 = 5 as an unsigned number, but -4 + 1 = -3 as a signed number in 2&#039;s complement.&lt;br /&gt;
&lt;br /&gt;
The advantage of using 2&#039;s complement over signed magnitude is twofold:&lt;br /&gt;
&lt;br /&gt;
* Normal binary arithmetic can be applied.&lt;br /&gt;
* There is no negative zero.&lt;br /&gt;
&lt;br /&gt;
To negate signed numbers, flip all bits and add 1. This comes from math: &amp;lt;math&amp;gt;-2 = -4 + ((4 - 1) - 2) + 1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
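The 101 example and the flip-and-add-one rule can be sketched as follows (bit widths are parameters; the arithmetic is written with integer division and xor to keep each step explicit):

```python
def to_signed(pattern, bits):
    """Interpret an unsigned bit pattern as a 2's complement value."""
    top = pattern // 2 ** (bits - 1)   # 1 if the sign bit is set, else 0
    return pattern - top * 2 ** bits

def negate(pattern, bits):
    """Negate: flip all bits, then add 1 (mod 2**bits)."""
    flipped = pattern ^ (2 ** bits - 1)
    return (flipped + 1) % 2 ** bits

print(to_signed(0b101, 3))         # -3: the -4 + 1 example from the text
print(to_signed(negate(2, 8), 8))  # -2
```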
= Floating Point =&lt;br /&gt;
To store decimal values, the &#039;&#039;&#039;floating point&#039;&#039;&#039; representation is used. Similar to scientific notation, it uses a series of bits called the &#039;&#039;&#039;mantissa&#039;&#039;&#039; to store significant binary digits, and another series of bits called the &#039;&#039;&#039;exponent&#039;&#039;&#039; to store the order of magnitude of the number. The final calculation is something like&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;1.m\times2^{e}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To handle negative numbers, the first bit is used as a sign bit.&lt;br /&gt;
&lt;br /&gt;
To handle negative exponents, a constant bias is subtracted from the stored exponent to center its range on 0. For example, when 8 bits are used to represent the exponent (as is usually the case), the stored values range from 0 to 255, so 127 is subtracted, giving an actual range of -127 to 128.&lt;br /&gt;
&lt;br /&gt;
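The sign/exponent/mantissa split can be inspected directly for a 32-bit float using the standard struct module (a sketch; 1.5 is an arbitrary example value):

```python
import struct

# Unpack the raw bit pattern of the 32-bit float 1.5 (network byte order)
(bits,) = struct.unpack("!I", struct.pack("!f", 1.5))

sign = bits // 2 ** 31              # top bit
exponent = (bits // 2 ** 23) % 256  # 8 stored exponent bits
mantissa = bits % 2 ** 23           # 23 mantissa bits

print(sign, exponent - 127, mantissa)  # 0 0 4194304: 1.5 is 1.1 binary times 2^0
```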
= Arrays =&lt;br /&gt;
The simplest structured group of values is an &#039;&#039;&#039;array&#039;&#039;&#039;, in which multiple values of the same type are stored consecutively in a block of memory. We then keep track of the address of the first element and the size of each element so that we can find the n-th element easily.&lt;br /&gt;
&lt;br /&gt;
For example, consider an array of 10 32-bit integers &amp;lt;c&amp;gt;arr&amp;lt;/c&amp;gt;. To access the 6th element, I can simply move 5 elements past the first one to get the address I want:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;sh&amp;gt;&lt;br /&gt;
*(arr + 5)  /* arr[5]: pointer arithmetic already scales by sizeof(int) */&lt;br /&gt;
&amp;lt;/sh&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notably, this is also a major reason why array indices start at 0 in most programming languages.&lt;br /&gt;
&lt;br /&gt;
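The base-plus-offset rule behind this can be illustrated with plain arithmetic (the base address 0x1000 is a made-up example, not a real pointer):

```python
# Base-plus-offset addressing for an array of 10 32-bit ints.
base = 0x1000               # made-up address of the first element
size = 4                    # sizeof a 32-bit int, in bytes
addresses = [base + i * size for i in range(10)]

print(hex(addresses[0]))    # 0x1000: index 0 sits exactly at the base
print(hex(addresses[5]))    # 0x1014: the 6th element, 5 elements past the base
```

With 0-based indexing the element at index i sits at base + i * size, with no off-by-one correction needed.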
= Struct =&lt;br /&gt;
A &#039;&#039;&#039;struct&#039;&#039;&#039; boils down to an array whose elements can have different types, each taking up a different size.&lt;br /&gt;
&lt;br /&gt;
[[Category:Computer Architecture]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Information_Representation&amp;diff=465</id>
		<title>Information Representation</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Information_Representation&amp;diff=465"/>
		<updated>2024-04-02T22:06:59Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Arrays */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Inside a computer, everything is bits. The topic of &#039;&#039;&#039;information representation&#039;&#039;&#039; discusses what the different kinds of information inside a computer program actually look like behind the scenes.&lt;br /&gt;
&lt;br /&gt;
= Word =&lt;br /&gt;
Instead of one bit at a time, computers do operations on several at the same time. The &#039;&#039;&#039;word size&#039;&#039;&#039; of the computer is the number of bits representing each piece of data that it processes.&lt;br /&gt;
&lt;br /&gt;
Notably, since instructions are also &amp;quot;data&amp;quot; received by the CPU, each instruction is also constrained by the word size.&lt;br /&gt;
&lt;br /&gt;
Word size is determined by the CPU. For example, a 32-bit CPU has a word size of 32 bits (or 4 bytes).&lt;br /&gt;
&lt;br /&gt;
== Byte and Endianness ==&lt;br /&gt;
Most modern computers are &#039;&#039;&#039;byte-addressable&#039;&#039;&#039;: the byte is the smallest unit of data that has an address and can be accessed at once.&lt;br /&gt;
&lt;br /&gt;
Since we are storing the multiple bytes of a word across memory, the order in which we store them, the word&#039;s &#039;&#039;&#039;endianness&#039;&#039;&#039;, needs to be specified.&lt;br /&gt;
&lt;br /&gt;
There are two choices: Big Endian and Little Endian. Let&#039;s consider how to store 0x10203040 on a 32-bit machine.&lt;br /&gt;
&lt;br /&gt;
In &#039;&#039;&#039;Big Endian&#039;&#039;&#039;, the most significant byte is stored at the lowest address (i.e. big end first). From lowest address to highest, the bytes would be 0x10, 0x20, 0x30, 0x40. BE is used on the Internet.&lt;br /&gt;
&lt;br /&gt;
In &#039;&#039;&#039;Little Endian&#039;&#039;&#039;, the least significant byte is stored at the lowest address (i.e. little end first). From lowest address to highest, the bytes would be 0x40, 0x30, 0x20, 0x10. LE is used on Intel machines.&lt;br /&gt;
&lt;br /&gt;
Besides different CPUs using different endianness to run instructions, most file formats specify endianness to support different machines. For example, a Unicode text file may begin with a BOM (byte order mark) to denote whether the file is BE or LE.&lt;br /&gt;
&lt;br /&gt;
= Signed Integers =&lt;br /&gt;
While a series of bits naturally represents an unsigned binary number, negative numbers are a bit more complicated.&lt;br /&gt;
&lt;br /&gt;
The first method to represent signed integers with bits is &#039;&#039;&#039;signed magnitude&#039;&#039;&#039;. Under this system, the first bit (the &#039;&#039;sign bit&#039;&#039;) is used as a negative sign: if it is 0, the number is positive; otherwise it is negative.&lt;br /&gt;
&lt;br /&gt;
The second method is called &#039;&#039;&#039;2&#039;s complement&#039;&#039;&#039;, and it is more widely used due to several advantages covered later.&lt;br /&gt;
&lt;br /&gt;
In this system, think of the first bit in the bit pattern as carrying negative weight. For example, 101 would be 4 + 1 = 5 as an unsigned number, but -4 + 1 = -3 as a signed number in 2&#039;s complement.&lt;br /&gt;
&lt;br /&gt;
The advantage of using 2&#039;s complement over signed magnitude is twofold:&lt;br /&gt;
&lt;br /&gt;
* Normal binary arithmetic can be applied.&lt;br /&gt;
* There is no negative zero&lt;br /&gt;
&lt;br /&gt;
To negate signed numbers, flip all bits and add 1. This comes from math: &amp;lt;math&amp;gt;-2 = -4 + ((4 - 1) - 2) + 1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Floating Point =&lt;br /&gt;
To store decimal values, the &#039;&#039;&#039;floating point&#039;&#039;&#039; representation is used. Similar to scientific notation, it uses a series of bits called the &#039;&#039;&#039;mantissa&#039;&#039;&#039; to store significant binary digits, and another series of bits called the &#039;&#039;&#039;exponent&#039;&#039;&#039; to store the order of magnitude of the number. The final calculation is something like&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;1.m\times2^{e}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To handle negative numbers, the first bit is used as a sign bit.&lt;br /&gt;
&lt;br /&gt;
To handle negative exponents, a constant bias is subtracted from the stored exponent to center its range on 0. For example, when 8 bits are used to represent the exponent (as is usually the case), the stored values range from 0 to 255, so 127 is subtracted, giving an actual range of -127 to 128.&lt;br /&gt;
&lt;br /&gt;
= Arrays =&lt;br /&gt;
The simplest structured group of values is an &#039;&#039;&#039;array&#039;&#039;&#039;, in which multiple values of the same type are stored consecutively in a block of memory. We then keep track of the address of the first element and the size of each element so that we can find the n-th element easily.&lt;br /&gt;
&lt;br /&gt;
For example, consider an array of 10 32-bit integers &amp;lt;c&amp;gt;arr&amp;lt;/c&amp;gt;. To access the 6th element, I can simply move 5 elements past the first one to get the address I want:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;sh&amp;gt;&lt;br /&gt;
*(arr + 5)  /* arr[5]: pointer arithmetic already scales by sizeof(int) */&lt;br /&gt;
&amp;lt;/sh&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notably, this is also a major reason why array indices start at 0.&lt;br /&gt;
&lt;br /&gt;
[[Category:Computer Architecture]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Information_Representation&amp;diff=464</id>
		<title>Information Representation</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Information_Representation&amp;diff=464"/>
		<updated>2024-04-02T21:59:27Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Floating Point */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Inside a computer, everything is bits. The topic of &#039;&#039;&#039;information representation&#039;&#039;&#039; discusses what the different kinds of information inside a computer program actually look like behind the scenes.&lt;br /&gt;
&lt;br /&gt;
= Word =&lt;br /&gt;
Instead of one bit at a time, computers do operations on several at the same time. The &#039;&#039;&#039;word size&#039;&#039;&#039; of the computer is the number of bits representing each piece of data that it processes.&lt;br /&gt;
&lt;br /&gt;
Notably, since instructions are also &amp;quot;data&amp;quot; received by the CPU, each instruction is also constrained by the word size.&lt;br /&gt;
&lt;br /&gt;
Word size is determined by the CPU. For example, a 32-bit CPU has a word size of 32 bits (or 4 bytes).&lt;br /&gt;
&lt;br /&gt;
== Byte and Endianness ==&lt;br /&gt;
Most modern computers are &#039;&#039;&#039;byte-addressable&#039;&#039;&#039;: the byte is the smallest unit of data that has an address and can be accessed at once.&lt;br /&gt;
&lt;br /&gt;
Since we are storing the multiple bytes of a word across memory, the order in which we store them, the word&#039;s &#039;&#039;&#039;endianness&#039;&#039;&#039;, needs to be specified.&lt;br /&gt;
&lt;br /&gt;
There are two choices: Big Endian and Little Endian. Let&#039;s consider how to store 0x10203040 on a 32-bit machine.&lt;br /&gt;
&lt;br /&gt;
In &#039;&#039;&#039;Big Endian&#039;&#039;&#039;, the most significant byte is stored at the lowest address (i.e. big end first). From lowest address to highest, the bytes would be 0x10, 0x20, 0x30, 0x40. BE is used on the Internet.&lt;br /&gt;
&lt;br /&gt;
In &#039;&#039;&#039;Little Endian&#039;&#039;&#039;, the least significant byte is stored at the lowest address (i.e. little end first). From lowest address to highest, the bytes would be 0x40, 0x30, 0x20, 0x10. LE is used on Intel machines.&lt;br /&gt;
&lt;br /&gt;
Besides different CPUs using different endianness to run instructions, most file formats specify endianness to support different machines. For example, a Unicode text file may begin with a BOM (byte order mark) to denote whether the file is BE or LE.&lt;br /&gt;
&lt;br /&gt;
= Signed Integers =&lt;br /&gt;
While a series of bits naturally represents an unsigned binary number, negative numbers are a bit more complicated.&lt;br /&gt;
&lt;br /&gt;
The first method to represent signed integers with bits is &#039;&#039;&#039;signed magnitude&#039;&#039;&#039;. Under this system, the first bit (the &#039;&#039;sign bit&#039;&#039;) is used as a negative sign: if it is 0, the number is positive; otherwise it is negative.&lt;br /&gt;
&lt;br /&gt;
The second method is called &#039;&#039;&#039;2&#039;s complement&#039;&#039;&#039;, and it is more widely used due to several advantages covered later.&lt;br /&gt;
&lt;br /&gt;
In this system, think of the first bit in the bit pattern as carrying negative weight. For example, 101 would be 4 + 1 = 5 as an unsigned number, but -4 + 1 = -3 as a signed number in 2&#039;s complement.&lt;br /&gt;
&lt;br /&gt;
The advantage of using 2&#039;s complement over signed magnitude is twofold:&lt;br /&gt;
&lt;br /&gt;
* Normal binary arithmetic can be applied.&lt;br /&gt;
* There is no negative zero&lt;br /&gt;
&lt;br /&gt;
To negate signed numbers, flip all bits and add 1. This comes from math: &amp;lt;math&amp;gt;-2 = -4 + ((4 - 1) - 2) + 1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Floating Point =&lt;br /&gt;
To store decimal values, the &#039;&#039;&#039;floating point&#039;&#039;&#039; representation is used. Similar to scientific notation, it uses a series of bits called the &#039;&#039;&#039;mantissa&#039;&#039;&#039; to store significant binary digits, and another series of bits called the &#039;&#039;&#039;exponent&#039;&#039;&#039; to store the order of magnitude of the number. The final calculation is something like&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;1.m\times2^{e}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To handle negative numbers, the first bit is used as a sign bit.&lt;br /&gt;
&lt;br /&gt;
To handle negative exponents, a constant bias is subtracted from the stored exponent to center its range on 0. For example, when 8 bits are used to represent the exponent (as is usually the case), the stored values range from 0 to 255, so 127 is subtracted, giving an actual range of -127 to 128.&lt;br /&gt;
&lt;br /&gt;
= Arrays =&lt;br /&gt;
The simplest structured group of values is an &#039;&#039;&#039;array&#039;&#039;&#039;, in which multiple values of the same type are stored consecutively in a block of memory. We then keep track of the address of the first element and the size of each element so that we can find the n-th element easily.&lt;br /&gt;
&lt;br /&gt;
For example, consider an array of 10 32-bit integers &#039;&#039;arr&#039;&#039;. To access the 6th element, I take the address &lt;br /&gt;
&lt;br /&gt;
[[Category:Computer Architecture]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Information_Representation&amp;diff=463</id>
		<title>Information Representation</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Information_Representation&amp;diff=463"/>
		<updated>2024-04-02T21:45:34Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Arrays */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Inside a computer, everything is bits. The topic of &#039;&#039;&#039;information representation&#039;&#039;&#039; discusses what the different kinds of information inside a computer program actually look like behind the scenes.&lt;br /&gt;
&lt;br /&gt;
= Word =&lt;br /&gt;
Instead of one bit at a time, computers do operations on several at the same time. The &#039;&#039;&#039;word size&#039;&#039;&#039; of the computer is the number of bits representing each piece of data that it processes.&lt;br /&gt;
&lt;br /&gt;
Notably, since instructions are also &amp;quot;data&amp;quot; received by the CPU, each instruction is also constrained by the word size.&lt;br /&gt;
&lt;br /&gt;
Word size is determined by the CPU. For example, a 32-bit CPU has a word size of 32 bits (or 4 bytes).&lt;br /&gt;
&lt;br /&gt;
== Byte and Endianness ==&lt;br /&gt;
Most modern computers are &#039;&#039;&#039;byte-addressable&#039;&#039;&#039;: the byte is the smallest unit of data that has an address and can be accessed at once.&lt;br /&gt;
&lt;br /&gt;
Since we are storing the multiple bytes of a word across memory, the order in which we store them, the word&#039;s &#039;&#039;&#039;endianness&#039;&#039;&#039;, needs to be specified.&lt;br /&gt;
&lt;br /&gt;
There are two choices: Big Endian and Little Endian. Let&#039;s consider how to store 0x10203040 on a 32-bit machine.&lt;br /&gt;
&lt;br /&gt;
In &#039;&#039;&#039;Big Endian&#039;&#039;&#039;, the most significant byte is stored at the lowest address (i.e. big end first). From lowest address to highest, the bytes would be 0x10, 0x20, 0x30, 0x40. BE is used on the Internet.&lt;br /&gt;
&lt;br /&gt;
In &#039;&#039;&#039;Little Endian&#039;&#039;&#039;, the least significant byte is stored at the lowest address (i.e. little end first). From lowest address to highest, the bytes would be 0x40, 0x30, 0x20, 0x10. LE is used on Intel machines.&lt;br /&gt;
&lt;br /&gt;
Besides different CPUs using different endianness to run instructions, most file formats specify endianness to support different machines. For example, a Unicode text file may begin with a BOM (byte order mark) to denote whether the file is BE or LE.&lt;br /&gt;
&lt;br /&gt;
= Signed Integers =&lt;br /&gt;
While a series of bits naturally represents an unsigned binary number, negative numbers are a bit more complicated.&lt;br /&gt;
&lt;br /&gt;
The first method to represent signed integers with bits is &#039;&#039;&#039;signed magnitude&#039;&#039;&#039;. Under this system, the first bit (the &#039;&#039;sign bit&#039;&#039;) is used as a negative sign: if it is 0, the number is positive; otherwise it is negative.&lt;br /&gt;
&lt;br /&gt;
The second method is called &#039;&#039;&#039;2&#039;s complement&#039;&#039;&#039;, and it is more widely used due to several advantages covered later.&lt;br /&gt;
&lt;br /&gt;
In this system, think of the first bit in the bit pattern as carrying negative weight. For example, 101 would be 4 + 1 = 5 as an unsigned number, but -4 + 1 = -3 as a signed number in 2&#039;s complement.&lt;br /&gt;
&lt;br /&gt;
The advantage of using 2&#039;s complement over signed magnitude is twofold:&lt;br /&gt;
&lt;br /&gt;
* Normal binary arithmetic can be applied.&lt;br /&gt;
* There is no negative zero&lt;br /&gt;
&lt;br /&gt;
To negate signed numbers, flip all bits and add 1. This comes from math: &amp;lt;math&amp;gt;-2 = -4 + ((4 - 1) - 2) + 1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Floating Point =&lt;br /&gt;
To store decimal values, the &#039;&#039;&#039;floating point&#039;&#039;&#039; representation is used. Similar to scientific notation, it uses a series of bits called the &#039;&#039;&#039;mantissa&#039;&#039;&#039; to store significant binary digits, and another series of bits called the &#039;&#039;&#039;exponent&#039;&#039;&#039; to store the order of magnitude of the number. The final calculation is something like&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;1.m\times2^{e}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To handle negative numbers, the first bit is used as a sign bit.&lt;br /&gt;
&lt;br /&gt;
To handle negative exponents, a constant bias is subtracted from the stored exponent to center its range on 0. For example, when 8 bits are used to represent the exponent (as is usually the case), the stored values range from 0 to 255, so 127 is subtracted, giving an actual range of -127 to 128.&lt;br /&gt;
&lt;br /&gt;
= Arrays =&lt;br /&gt;
The simplest structured group of values is an &#039;&#039;&#039;array&#039;&#039;&#039;, in which multiple values of the same type are stored consecutively in a block of memory. We then keep track of the address of the first element and the size of each element so that we can find the n-th element easily.&lt;br /&gt;
&lt;br /&gt;
For example, consider an array of 10 32-bit integers &#039;&#039;arr&#039;&#039;. To access the 6th element, I take the address &lt;br /&gt;
[[Category:Computer Architecture]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Information_Representation&amp;diff=462</id>
		<title>Information Representation</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Information_Representation&amp;diff=462"/>
		<updated>2024-04-02T21:38:34Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Byte and Endianess */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Inside a computer, everything is bits. The topic of &#039;&#039;&#039;information representation&#039;&#039;&#039; discusses what the different kinds of information inside a computer program actually look like behind the scenes.&lt;br /&gt;
&lt;br /&gt;
= Word =&lt;br /&gt;
Instead of one bit at a time, computers do operations on several at the same time. The &#039;&#039;&#039;word size&#039;&#039;&#039; of the computer is the number of bits representing each piece of data that it processes.&lt;br /&gt;
&lt;br /&gt;
Notably, since instructions are also &amp;quot;data&amp;quot; received by the CPU, each instruction is also constrained by the word size.&lt;br /&gt;
&lt;br /&gt;
Word size is determined by the CPU. For example, a 32-bit CPU has a word size of 32 bits (or 4 bytes).&lt;br /&gt;
&lt;br /&gt;
== Byte and Endianness ==&lt;br /&gt;
Most modern-day computers are &#039;&#039;&#039;byte-addressable&#039;&#039;&#039;: the byte is the smallest unit of data that has its own address and can be accessed in a single operation.&lt;br /&gt;
&lt;br /&gt;
Since the bytes of a word are stored across multiple memory locations, the order in which they are stored, the word&#039;s &#039;&#039;&#039;endianness&#039;&#039;&#039;, needs to be specified.&lt;br /&gt;
&lt;br /&gt;
There are two choices: Big Endian and Little Endian. Let&#039;s consider how to store 0x10203040 on a 32-bit machine.&lt;br /&gt;
&lt;br /&gt;
In &#039;&#039;&#039;Big Endian&#039;&#039;&#039;, the most significant byte is stored at the lowest address (i.e. big end first). Going up from the lowest address, memory would hold the bytes 0x10, 0x20, 0x30, 0x40. Big Endian is the standard byte order for data sent over the internet.&lt;br /&gt;
&lt;br /&gt;
In &#039;&#039;&#039;Little Endian&#039;&#039;&#039;, the least significant byte is stored at the lowest address (i.e. little end first). Going up from the lowest address, memory would hold the bytes 0x40, 0x30, 0x20, 0x10. Little Endian is used on Intel machines.&lt;br /&gt;
&lt;br /&gt;
Besides different CPUs using different endianness to run instructions, most file formats specify an endianness so that they can be read on different machines. For example, a Unicode text file can begin with a BOM (byte order mark) to denote whether the file is BE or LE.&lt;br /&gt;
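A quick way to see both byte orders is the struct module from the Python standard library, packing the example value 0x10203040 with each byte-order prefix:

```python
import struct

# Pack the same 32-bit value with each byte order and inspect the stored bytes.
value = 0x10203040
big = struct.pack(">I", value)     # big endian: most significant byte first
little = struct.pack("<I", value)  # little endian: least significant byte first

print(big.hex())     # 10203040
print(little.hex())  # 40302010
```

The two byte sequences are exact reverses of each other, which is why mixing up endianness scrambles multi-byte values rather than merely negating them.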
&lt;br /&gt;
= Signed Integers =&lt;br /&gt;
While a series of bits naturally represents a nonnegative binary number, representing negative numbers is a bit more complicated.&lt;br /&gt;
&lt;br /&gt;
The first method to represent signed integers with bits is &#039;&#039;&#039;signed magnitude&#039;&#039;&#039;. Under this system, the first bit (the &#039;&#039;sign bit&#039;&#039;) is used as a negative sign; if it is 0, the number is positive; otherwise, it is negative.&lt;br /&gt;
&lt;br /&gt;
The second method is called &#039;&#039;&#039;2&#039;s complement&#039;&#039;&#039;, and it is more widely used due to several advantages covered later.&lt;br /&gt;
&lt;br /&gt;
In this system, think of the most significant bit as carrying a &#039;&#039;negative&#039;&#039; weight. For example, 101 would be 4 + 1 = 5 as an unsigned number, but -4 + 1 = -3 as a signed number in 2&#039;s complement.&lt;br /&gt;
&lt;br /&gt;
The advantage of using 2&#039;s complement over signed magnitude is twofold:&lt;br /&gt;
&lt;br /&gt;
* Normal binary arithmetic can be applied.&lt;br /&gt;
* There is no negative zero.&lt;br /&gt;
&lt;br /&gt;
To negate a signed number, flip all bits and add 1. This works because flipping every bit of an n-bit number &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; gives &amp;lt;math&amp;gt;(2^n - 1) - x&amp;lt;/math&amp;gt;, so adding 1 yields &amp;lt;math&amp;gt;2^n - x&amp;lt;/math&amp;gt;, which is exactly the 2&#039;s complement pattern for &amp;lt;math&amp;gt;-x&amp;lt;/math&amp;gt;. For example, negating 010 (2) with 3 bits: flipping gives 101 (-4 + 1 = -3), and adding 1 gives 110 (-2).&lt;br /&gt;
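The flip-and-add-one rule can be checked with a small sketch; the word size and test patterns are just the 3-bit examples from the text:

```python
def negate(x, bits):
    """Two's-complement negation: flip all bits, add 1, keep `bits` bits."""
    mask = (1 << bits) - 1
    return ((x ^ mask) + 1) & mask

# 3-bit example from the text: 101 reads as -4 + 1 = -3; negating gives 011 = 3
print(format(negate(0b101, 3), "03b"))  # 011
# Negating 010 (2) gives 110, the pattern for -4 + 2 = -2
print(format(negate(0b010, 3), "03b"))  # 110
```

Negating twice returns the original pattern, which is what makes ordinary binary addition work unchanged for signed values.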
&lt;br /&gt;
= Floating Point =&lt;br /&gt;
To store non-integer values, the &#039;&#039;&#039;floating point&#039;&#039;&#039; representation is used. Similar to scientific notation, it uses a series of bits called the &#039;&#039;&#039;mantissa&#039;&#039;&#039; to store the significant binary digits, and another series of bits called the &#039;&#039;&#039;exponent&#039;&#039;&#039; to store the order of magnitude of the number. The final value is calculated as&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;1.m\times2^{e}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To handle negative numbers, the first bit is used as a sign bit.&lt;br /&gt;
&lt;br /&gt;
To handle negative exponents, a constant bias is subtracted from the stored exponent to center its range on 0. For example, when 8 bits are used for the exponent (as is usually the case), the stored value ranges from 0 to 255, so 127 is subtracted from it, giving an actual range of -127 to 128.&lt;br /&gt;
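As a sketch of these fields, one can unpack the bit pattern of a 32-bit float with the standard-library struct module. 1.5 is 1.1 in binary, i.e. 1.1 x 2^0, so its stored exponent should be exactly the bias 127:

```python
import struct

# Reinterpret the bytes of a 32-bit float as an unsigned integer.
bits = struct.unpack(">I", struct.pack(">f", 1.5))[0]

sign = bits >> 31               # 1 sign bit
exponent = (bits >> 23) & 0xFF  # 8 exponent bits, biased by 127
mantissa = bits & 0x7FFFFF      # 23 mantissa bits (the leading 1 is implicit)

print(sign, exponent, exponent - 127)  # 0 127 0
```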
&lt;br /&gt;
= Arrays =&lt;br /&gt;
[[Category:Computer Architecture]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Information_Representation&amp;diff=461</id>
		<title>Information Representation</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Information_Representation&amp;diff=461"/>
		<updated>2024-04-02T21:32:02Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Signed Integers */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Inside a computer, everything is bits. The topic of &#039;&#039;&#039;information representation&#039;&#039;&#039; discusses what the different kinds of information inside a computer program actually look like behind the scenes.&lt;br /&gt;
&lt;br /&gt;
= Word =&lt;br /&gt;
Rather than operating on one bit at a time, computers operate on several bits at once. The &#039;&#039;&#039;word size&#039;&#039;&#039; of a computer is the number of bits in each piece of data that it processes.&lt;br /&gt;
&lt;br /&gt;
Notably, since instructions are also &amp;quot;data&amp;quot; received by the CPU, each instruction is also constrained by the word size.&lt;br /&gt;
&lt;br /&gt;
Word size is determined by the CPU. For example, a 32-bit CPU has a word size of 32 bits (or 4 bytes).&lt;br /&gt;
&lt;br /&gt;
== Byte and Endianness ==&lt;br /&gt;
Most modern-day computers are &#039;&#039;&#039;byte-addressable&#039;&#039;&#039;: the byte is the smallest unit of data that has its own address and can be accessed in a single operation.&lt;br /&gt;
&lt;br /&gt;
Since the bytes of a word are stored across multiple memory locations, the order in which they are stored, the word&#039;s &#039;&#039;&#039;endianness&#039;&#039;&#039;, needs to be specified.&lt;br /&gt;
&lt;br /&gt;
There are two choices: Big Endian and Little Endian. Let&#039;s consider how to store 0x10203040 on a 32-bit machine.&lt;br /&gt;
&lt;br /&gt;
In &#039;&#039;&#039;Big Endian&#039;&#039;&#039;, the most significant byte is stored at the lowest address (i.e. big end first). Going up from the lowest address, memory would hold the bytes 0x10, 0x20, 0x30, 0x40. Big Endian is the standard byte order for data sent over the internet.&lt;br /&gt;
&lt;br /&gt;
In &#039;&#039;&#039;Little Endian&#039;&#039;&#039;, the least significant byte is stored at the lowest address (i.e. little end first). Going up from the lowest address, memory would hold the bytes 0x40, 0x30, 0x20, 0x10. Little Endian is used on Intel machines.&lt;br /&gt;
&lt;br /&gt;
Besides different CPUs using different endianness to run instructions, most file formats specify an endianness so that they can be read on different machines. For example, a Unicode text file can begin with a BOM (byte order mark) to denote whether the file is BE or LE.&lt;br /&gt;
&lt;br /&gt;
= Signed Integers =&lt;br /&gt;
While a series of bits naturally represents a nonnegative binary number, representing negative numbers is a bit more complicated.&lt;br /&gt;
&lt;br /&gt;
The first method to represent signed integers with bits is &#039;&#039;&#039;signed magnitude&#039;&#039;&#039;. Under this system, the first bit (the &#039;&#039;sign bit&#039;&#039;) is used as a negative sign; if it is 0, the number is positive; otherwise, it is negative.&lt;br /&gt;
&lt;br /&gt;
The second method is called &#039;&#039;&#039;2&#039;s complement&#039;&#039;&#039;, and it is more widely used due to several advantages covered later.&lt;br /&gt;
&lt;br /&gt;
In this system, think of the most significant bit as carrying a &#039;&#039;negative&#039;&#039; weight. For example, 101 would be 4 + 1 = 5 as an unsigned number, but -4 + 1 = -3 as a signed number in 2&#039;s complement.&lt;br /&gt;
&lt;br /&gt;
The advantage of using 2&#039;s complement over signed magnitude is twofold:&lt;br /&gt;
&lt;br /&gt;
* Normal binary arithmetic can be applied.&lt;br /&gt;
* There is no negative zero.&lt;br /&gt;
&lt;br /&gt;
To negate a signed number, flip all bits and add 1. This works because flipping every bit of an n-bit number &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; gives &amp;lt;math&amp;gt;(2^n - 1) - x&amp;lt;/math&amp;gt;, so adding 1 yields &amp;lt;math&amp;gt;2^n - x&amp;lt;/math&amp;gt;, which is exactly the 2&#039;s complement pattern for &amp;lt;math&amp;gt;-x&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
= Arrays =&lt;br /&gt;
[[Category:Computer Architecture]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Information_Representation&amp;diff=460</id>
		<title>Information Representation</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Information_Representation&amp;diff=460"/>
		<updated>2024-04-02T20:33:06Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Inside a computer, everything is bits. The topic of &#039;&#039;&#039;information representation&#039;&#039;&#039; discusses what the different kinds of information inside a computer program actually look like behind the scenes.&lt;br /&gt;
&lt;br /&gt;
= Signed Integers =&lt;br /&gt;
While a series of bits naturally represents a nonnegative binary number, representing negative numbers is a bit more complicated.&lt;br /&gt;
&lt;br /&gt;
The first method to represent signed integers with bits is &#039;&#039;&#039;signed magnitude&#039;&#039;&#039;. Under this system, the first bit (the &#039;&#039;sign bit&#039;&#039;) is used as a negative sign; if it is 0, the number is positive; otherwise, it is negative.&lt;br /&gt;
&lt;br /&gt;
The second method is called &#039;&#039;&#039;2&#039;s complement&#039;&#039;&#039;, and it is more widely used due to several advantages covered later.&lt;br /&gt;
&lt;br /&gt;
In this system, think of the most significant bit as carrying a &#039;&#039;negative&#039;&#039; weight. For example, 101 would be 4 + 1 = 5 as an unsigned number, but -4 + 1 = -3 as a signed number in 2&#039;s complement.&lt;br /&gt;
&lt;br /&gt;
The advantage of using 2&#039;s complement over signed magnitude is twofold:&lt;br /&gt;
&lt;br /&gt;
* Normal binary arithmetic can be applied.&lt;br /&gt;
* There is no negative zero.&lt;br /&gt;
&lt;br /&gt;
To negate a signed number, flip all bits and add 1. This works because flipping every bit of an n-bit number &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; gives &amp;lt;math&amp;gt;(2^n - 1) - x&amp;lt;/math&amp;gt;, so adding 1 yields &amp;lt;math&amp;gt;2^n - x&amp;lt;/math&amp;gt;, which is exactly the 2&#039;s complement pattern for &amp;lt;math&amp;gt;-x&amp;lt;/math&amp;gt;.&lt;br /&gt;
[[Category:Computer Architecture]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Information_Representation&amp;diff=459</id>
		<title>Information Representation</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Information_Representation&amp;diff=459"/>
		<updated>2024-04-02T20:23:24Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Inside a computer, everything is bits. The topic of &#039;&#039;&#039;information representation&#039;&#039;&#039; discusses what the different kinds of information inside a computer program actually look like behind the scenes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2&#039;s complement&#039;&#039;&#039; is a method of representing signed integers in computing.&lt;br /&gt;
&lt;br /&gt;
To calculate the value of a number represented in 2&#039;s complement, think of the most significant bit as carrying a &#039;&#039;negative&#039;&#039; weight. For example, 101 would be 4 + 1 = 5 as an unsigned number, but -4 + 1 = -3 as a signed number in 2&#039;s complement.&lt;br /&gt;
&lt;br /&gt;
The advantage of using 2&#039;s complement over signed magnitude is twofold:&lt;br /&gt;
&lt;br /&gt;
* Easier arithmetic operations&lt;br /&gt;
* Greater range of values (since there is no -0)&lt;br /&gt;
&lt;br /&gt;
To &#039;&#039;&#039;negate&#039;&#039;&#039; signed numbers, flip all bits and add 1. This works because flipping every bit of an n-bit number &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; gives &amp;lt;math&amp;gt;(2^n - 1) - x&amp;lt;/math&amp;gt;, so adding 1 yields &amp;lt;math&amp;gt;2^n - x&amp;lt;/math&amp;gt;, which represents &amp;lt;math&amp;gt;-x&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
[[Category:Computer Architecture]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Category:Computer_Architecture&amp;diff=458</id>
		<title>Category:Computer Architecture</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Category:Computer_Architecture&amp;diff=458"/>
		<updated>2024-04-02T20:22:38Z</updated>

		<summary type="html">&lt;p&gt;Admin: Created page with &amp;quot;Category:Computer Science&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Computer Science]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Information_Representation&amp;diff=457</id>
		<title>Information Representation</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Information_Representation&amp;diff=457"/>
		<updated>2024-04-02T20:22:24Z</updated>

		<summary type="html">&lt;p&gt;Admin: Created page with &amp;quot;Inside a computer, everything is bits. The topic of &amp;#039;&amp;#039;&amp;#039;information representation&amp;#039;&amp;#039;&amp;#039; discusses how different information inside a computer program actually looks like in the background. Category:Computer Architecture&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Inside a computer, everything is bits. The topic of &#039;&#039;&#039;information representation&#039;&#039;&#039; discusses how different information inside a computer program actually looks like in the background.&lt;br /&gt;
[[Category:Computer Architecture]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Bitwise_Operation&amp;diff=456</id>
		<title>Bitwise Operation</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Bitwise_Operation&amp;diff=456"/>
		<updated>2024-04-02T20:12:01Z</updated>

		<summary type="html">&lt;p&gt;Admin: Created page with &amp;quot;A &amp;#039;&amp;#039;&amp;#039;bitwise operation&amp;#039;&amp;#039;&amp;#039; is an operation that is done on a series of bits. There are many bitwise operators, most of them being pretty self explanatory. I&amp;#039;ll note down anything that seems interesting to me.  &amp;#039;&amp;#039;Exclusive or (XOR)&amp;#039;&amp;#039; operator can be used to flip singular bits, since against 1, XOR flips a bit, whereas against 0, XOR does nothing. Category:Computer Science&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;A &#039;&#039;&#039;bitwise operation&#039;&#039;&#039; is an operation performed on a series of bits. There are many bitwise operators, most of them pretty self-explanatory. I&#039;ll note down anything that seems interesting to me.&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;exclusive or (XOR)&#039;&#039; operator can be used to flip individual bits: XOR against 1 flips a bit, whereas XOR against 0 leaves it unchanged.&lt;br /&gt;
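A minimal Python sketch of this flip trick (the values are made up for illustration):

```python
x = 0b1010
mask = 0b0010   # 1-bits mark the positions to flip

# XOR with 1 flips a bit; XOR with 0 leaves it unchanged
flipped = x ^ mask
print(format(flipped, "04b"))  # 1000

# Flipping the same bits twice restores the original value
assert flipped ^ mask == x
```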
[[Category:Computer Science]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Topic:_Godot&amp;diff=437</id>
		<title>Topic: Godot</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Topic:_Godot&amp;diff=437"/>
		<updated>2024-03-26T20:20:37Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Godot&#039;&#039;&#039; is a FOSS game engine. This is the main page; subpages will detail my notes. At the time of writing, Godot is on version 4.2. Most information comes from the [https://docs.godotengine.org/en/stable/getting_started/introduction/key_concepts_overview.html official documentation].&lt;br /&gt;
&lt;br /&gt;
{{Special:PrefixIndex/Godot/}}&lt;br /&gt;
[[Category:Contents]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Imperative_Programming&amp;diff=436</id>
		<title>Imperative Programming</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Imperative_Programming&amp;diff=436"/>
		<updated>2024-03-23T21:37:19Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Programming Paradigm]]&lt;br /&gt;
&#039;&#039;&#039;Imperative programming&#039;&#039;&#039; is a [[Programming Paradigm|programming paradigm]] that uses statements to change a program&#039;s state. Like a recipe, it focuses on describing how a program operates step by step. This includes [[Procedural Programming|procedural programming]] and [[Object-Oriented Programming|object-oriented programming]]. It is the &#039;&#039;original&#039;&#039; form of programming in that at the most basic level, computers are machines that take in one instruction at a time and output results.&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Imperative_Programming&amp;diff=435</id>
		<title>Imperative Programming</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Imperative_Programming&amp;diff=435"/>
		<updated>2024-03-23T21:34:58Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Programming Paradigm]]&lt;br /&gt;
&#039;&#039;&#039;Imperative programming&#039;&#039;&#039; is a [[Programming Paradigm|programming paradigm]] that uses statements to change a program&#039;s state. Like a recipe, it focuses on describing how a program operates step by step. This includes [[Procedural Programming|procedural programming]] and [[Object-Oriented Programming|object-oriented programming]].&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Category:Programming_Paradigm&amp;diff=434</id>
		<title>Category:Programming Paradigm</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Category:Programming_Paradigm&amp;diff=434"/>
		<updated>2024-03-23T21:24:11Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Computer Science]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Imperative_Programming&amp;diff=433</id>
		<title>Imperative Programming</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Imperative_Programming&amp;diff=433"/>
		<updated>2024-03-23T21:23:54Z</updated>

		<summary type="html">&lt;p&gt;Admin: Created page with &amp;quot;Category:Programming Paradigm&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Programming Paradigm]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Programming_Paradigm&amp;diff=432</id>
		<title>Programming Paradigm</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Programming_Paradigm&amp;diff=432"/>
		<updated>2024-03-23T21:22:07Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;nowiki&amp;gt;*&amp;lt;/nowiki&amp;gt;This page comes primarily from self-research&lt;br /&gt;
&lt;br /&gt;
A &#039;&#039;&#039;programming paradigm&#039;&#039;&#039; is a relatively high-level abstract model for organizing and structuring the implementation of a computer program.&lt;br /&gt;
&lt;br /&gt;
At the most basic level of a computer program, we have simple instructions running. As a program gets more complex, practices and abstractions such as functions or objects provide a way to organize it so that it is easier to maintain and collaborate on. We group these practices and abstractions into paradigms to study and analyze them.&lt;br /&gt;
&lt;br /&gt;
* Each paradigm has strengths and weaknesses&lt;br /&gt;
* Tools (notably programming languages) can often be classified as supporting one or more paradigms&lt;br /&gt;
[[Category:Programming Paradigm]]&lt;br /&gt;
[[Category:Computer Science]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Category:Programming_Paradigm&amp;diff=431</id>
		<title>Category:Programming Paradigm</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Category:Programming_Paradigm&amp;diff=431"/>
		<updated>2024-03-23T21:21:29Z</updated>

		<summary type="html">&lt;p&gt;Admin: Created blank page&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Programming_Paradigm&amp;diff=430</id>
		<title>Programming Paradigm</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Programming_Paradigm&amp;diff=430"/>
		<updated>2024-03-23T21:21:17Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;nowiki&amp;gt;*&amp;lt;/nowiki&amp;gt;This page comes primarily from self-research&lt;br /&gt;
&lt;br /&gt;
A &#039;&#039;&#039;programming paradigm&#039;&#039;&#039; is a relatively high-level abstract model for organizing and structuring the implementation of a computer program.&lt;br /&gt;
&lt;br /&gt;
At the most basic level of a computer program, we have simple instructions running. As a program gets more complex, practices and abstractions such as functions or objects provide a way to organize it so that it is easier to maintain and collaborate on. We group these practices and abstractions into paradigms to study and analyze them.&lt;br /&gt;
&lt;br /&gt;
* Each paradigm has strengths and weaknesses&lt;br /&gt;
* Tools (notably programming languages) can often be classified as supporting one or more paradigms&lt;br /&gt;
[[Category:Programming Paradigm]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Programming_Paradigm&amp;diff=429</id>
		<title>Programming Paradigm</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Programming_Paradigm&amp;diff=429"/>
		<updated>2024-03-23T21:20:32Z</updated>

		<summary type="html">&lt;p&gt;Admin: Created page with &amp;quot;&amp;lt;nowiki&amp;gt;*&amp;lt;/nowiki&amp;gt;This page comes primarily from self-research  A &amp;#039;&amp;#039;&amp;#039;programming paradigm&amp;#039;&amp;#039;&amp;#039; is a relatively-highly abstracted model to organize/structure the implementation of a computer program.  At the most basic level of a computer program, we have simple instructions running. As a program get more complex, practices and abstractions such as functions or objects provide a way to organize a program such that it is easier to maintain and collaborate on. We group these...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;nowiki&amp;gt;*&amp;lt;/nowiki&amp;gt;This page comes primarily from self-research&lt;br /&gt;
&lt;br /&gt;
A &#039;&#039;&#039;programming paradigm&#039;&#039;&#039; is a relatively high-level abstract model for organizing and structuring the implementation of a computer program.&lt;br /&gt;
&lt;br /&gt;
At the most basic level of a computer program, we have simple instructions running. As a program gets more complex, practices and abstractions such as functions or objects provide a way to organize it so that it is easier to maintain and collaborate on. We group these practices and abstractions into paradigms to study and analyze them.&lt;br /&gt;
&lt;br /&gt;
* Each paradigm has strengths and weaknesses&lt;br /&gt;
* Tools (notably programming languages) can often be classified as supporting one or more paradigms&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Topic:_Godot/Key_Concepts&amp;diff=428</id>
		<title>Topic: Godot/Key Concepts</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Topic:_Godot/Key_Concepts&amp;diff=428"/>
		<updated>2024-03-23T21:05:27Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;In Godot, a game is a &#039;&#039;&#039;tree&#039;&#039;&#039; of &#039;&#039;&#039;nodes&#039;&#039;&#039;, grouped into &#039;&#039;&#039;scenes&#039;&#039;&#039;, that communicate with each other using &#039;&#039;&#039;signals&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
A &#039;&#039;&#039;node&#039;&#039;&#039; is the most basic element.&lt;br /&gt;
&lt;br /&gt;
A &#039;&#039;&#039;scene&#039;&#039;&#039; is a reusable element of the game; anything from an item to a map. They are made from groups of nodes and other scenes.&lt;br /&gt;
&lt;br /&gt;
All of a game&#039;s scenes create a &#039;&#039;&#039;scene tree&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Nodes emit a &#039;&#039;&#039;signal&#039;&#039;&#039; when some event happens. This is the primary way for nodes and scenes to communicate.&lt;br /&gt;
&lt;br /&gt;
= Node =&lt;br /&gt;
A &#039;&#039;&#039;node&#039;&#039;&#039; is the most basic element of a Godot game. All nodes share the following characteristics:&lt;br /&gt;
&lt;br /&gt;
* Has a name&lt;br /&gt;
* Has editable properties&lt;br /&gt;
* Receives callbacks to update every frame&lt;br /&gt;
* Extendable with new properties and functions&lt;br /&gt;
* Can contain other nodes&lt;br /&gt;
&lt;br /&gt;
= Sources =&lt;br /&gt;
&lt;br /&gt;
* https://docs.godotengine.org/en/stable/getting_started/step_by_step/nodes_and_scenes.html&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Topic:_Godot&amp;diff=427</id>
		<title>Topic: Godot</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Topic:_Godot&amp;diff=427"/>
		<updated>2024-03-23T20:50:40Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Godot&#039;&#039;&#039; is a FOSS game engine. This is the main page; subpages will detail my notes. At the time of writing, Godot is on version 4.2. Most information comes from the [https://docs.godotengine.org/en/stable/getting_started/introduction/key_concepts_overview.html official documentation].&lt;br /&gt;
&lt;br /&gt;
{{Special:PrefixIndex/Godot/}}&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Topic:_Godot&amp;diff=426</id>
		<title>Topic: Godot</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Topic:_Godot&amp;diff=426"/>
		<updated>2024-03-23T20:48:54Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Godot&#039;&#039;&#039; is a FOSS game engine. This is the main page; subpages will detail my notes.&lt;br /&gt;
&lt;br /&gt;
{{Special:PrefixIndex/Godot/}}&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Topic:_Godot/Key_Concepts&amp;diff=425</id>
		<title>Topic: Godot/Key Concepts</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Topic:_Godot/Key_Concepts&amp;diff=425"/>
		<updated>2024-03-23T20:48:16Z</updated>

		<summary type="html">&lt;p&gt;Admin: Created page with &amp;quot;Test&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Test&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Topic:_Godot&amp;diff=424</id>
		<title>Topic: Godot</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Topic:_Godot&amp;diff=424"/>
		<updated>2024-03-23T20:47:43Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Godot&#039;&#039;&#039; is a FOSS game engine. This is the main page; subpages will detail my notes.&lt;br /&gt;
&lt;br /&gt;
{{Special:PrefixIndex/Help:Godot/}}&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Topic:_Godot&amp;diff=423</id>
		<title>Topic: Godot</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Topic:_Godot&amp;diff=423"/>
		<updated>2024-03-23T20:43:23Z</updated>

		<summary type="html">&lt;p&gt;Admin: Created page with &amp;quot;&amp;#039;&amp;#039;&amp;#039;Godot&amp;#039;&amp;#039;&amp;#039; is a FOSS game engine. This is the main page; subpages will detail my notes.&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Godot&#039;&#039;&#039; is a FOSS game engine. This is the main page; subpages will detail my notes.&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Bivariate&amp;diff=402</id>
		<title>Bivariate</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Bivariate&amp;diff=402"/>
		<updated>2024-03-19T19:02:51Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Regression Effect */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Distribution (Statistics)]][[Category:Statistics]]&lt;br /&gt;
&#039;&#039;&#039;Bivariate&#039;&#039;&#039; data involve two variables instead of the usual one; each value of one variable is paired with a value of the other. We will use &amp;lt;math&amp;gt;X, Y&amp;lt;/math&amp;gt; to denote the two random variables throughout this page.&lt;br /&gt;
&lt;br /&gt;
= Summary Statistics =&lt;br /&gt;
To summarize bivariate data, we use covariance and correlation in addition to the statistics detailed in [[Summary Statistics]].&lt;br /&gt;
&lt;br /&gt;
== Covariance ==&lt;br /&gt;
The &#039;&#039;&#039;covariance&#039;&#039;&#039; measures how two RVs vary together around their respective centers: it is positive when the two tend to move in the same direction and negative when they tend to move in opposite directions.&lt;br /&gt;
&lt;br /&gt;
We have &#039;&#039;sample covariance&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;s^2_{X, Y} = \hat{cov}(X, Y) = \frac{1}{n - 1} \sum(x_i - \bar{x}) (y_i - \bar{y}) = \frac{1}{n - 1} \left( \sum x_i y_i - n \bar{x} \bar{y} \right)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A good way of thinking about covariance is by cases:&lt;br /&gt;
&lt;br /&gt;
If x &#039;&#039;increases&#039;&#039; as y &#039;&#039;increases&#039;&#039;, the two factors in each term of the covariance sum tend to have the same sign, so the terms are positive. Therefore, the covariance is &#039;&#039;positive&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
If x &#039;&#039;decreases&#039;&#039; as y &#039;&#039;increases&#039;&#039;, the two factors tend to have opposite signs. Therefore, the covariance is &#039;&#039;negative&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
If x does not clearly vary with y, the signs are sometimes the same and sometimes different. Overall, the terms should roughly cancel out to &#039;&#039;zero&#039;&#039;.&lt;br /&gt;
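A minimal sketch of the sample covariance formula above (the data lists are made up for illustration):

```python
def sample_cov(xs, ys):
    """Sample covariance: sum((x_i - mean_x) * (y_i - mean_y)) / (n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# x increases as y increases: every term is positive, and so is the covariance
print(sample_cov([1, 2, 3, 4], [2, 4, 6, 8]))
# x decreases as y increases: the covariance is negative
print(sample_cov([4, 3, 2, 1], [2, 4, 6, 8]))
```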
&lt;br /&gt;
== Correlation ==&lt;br /&gt;
The &#039;&#039;&#039;correlation&#039;&#039;&#039; of two random variables measures the &#039;&#039;&#039;linear dependence&#039;&#039;&#039; between &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;Cor(X, Y) = \rho = \frac{Cov(X,Y)}{sd(X) sd(Y)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Correlation is always between -1 and 1. When r = 1, the relationship between X and Y is &#039;&#039;&#039;perfect positive linear&#039;&#039;&#039;. When r = -1, it is &#039;&#039;&#039;perfect negative linear&#039;&#039;&#039;. If r = 0, there is no &#039;&#039;linear&#039;&#039; relationship; this doesn&#039;t mean that there is no relationship at all. Notably, any scatter plot that is symmetric about a vertical line (such as a parabola) has a correlation of 0.&lt;br /&gt;
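A sketch of this formula, using the statistics module from the Python standard library for the standard deviations (the data are made up for illustration):

```python
import statistics

def correlation(xs, ys):
    """Pearson correlation: Cov(X, Y) / (sd(X) * sd(Y)), always in [-1, 1]."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    return cov / (statistics.stdev(xs) * statistics.stdev(ys))

# A perfect positive linear relationship gives r = 1
print(correlation([1, 2, 3], [10, 20, 30]))
# A symmetric (parabola-like) pattern gives r = 0 despite a clear relationship
print(correlation([-1, 0, 1], [1, 0, 1]))
```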
&lt;br /&gt;
= Bivariate Normal =&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;bivariate normal&#039;&#039;&#039; (a.k.a. bivariate Gaussian) is a special type of joint distribution of two continuous random variables.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;(X, Y)&amp;lt;/math&amp;gt; is &#039;&#039;bivariate normal&#039;&#039; if&lt;br /&gt;
&lt;br /&gt;
# The marginal PDFs of both X and Y are normal&lt;br /&gt;
# For any &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;, the conditional PDF of &amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt; given &amp;lt;math&amp;gt;X = x&amp;lt;/math&amp;gt; is normal&lt;br /&gt;
#* This works the other way around as well: if &amp;lt;math&amp;gt;(X, Y)&amp;lt;/math&amp;gt; is bivariate Gaussian, the conditional PDF of &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; given &amp;lt;math&amp;gt;Y = y&amp;lt;/math&amp;gt; is also normal&lt;br /&gt;
&lt;br /&gt;
== Predicting Y given X ==&lt;br /&gt;
&lt;br /&gt;
Given bivariate normal, we can predict one variable given another.&lt;br /&gt;
Let us try estimating the expected Y given X is x&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
E(Y| X = x)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There are three main methods:&lt;br /&gt;
* Scatter plot approximation&lt;br /&gt;
* Joint PDF&lt;br /&gt;
* 5 parameters&lt;br /&gt;
&lt;br /&gt;
=== 5 Parameters ===&lt;br /&gt;
&lt;br /&gt;
We need to know 5 parameters about &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;E(X), sd(X), E(Y), sd(Y), \rho&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If &amp;lt;math&amp;gt;X, Y&amp;lt;/math&amp;gt; follow a bivariate normal distribution, then we&lt;br /&gt;
have&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\left( \frac{E(Y|X = x) - E(Y)}{sd(Y)} \right) = \rho \left( \frac{x -&lt;br /&gt;
E(X)}{sd(X)} \right)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The left side is the &#039;&#039;predicted Z-score for Y&#039;&#039;, and the right side is&lt;br /&gt;
&#039;&#039;the product of correlation and Z-score of X = x&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The variance is given by&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Var(Y | X = x) = (1 - \rho^2) Var(Y)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Due to the range of &amp;lt;math&amp;gt;\rho&amp;lt;/math&amp;gt;, the variance of Y given X is&lt;br /&gt;
never larger than the unconditional variance. The standard deviation is&lt;br /&gt;
just the square root of that.&lt;br /&gt;
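The prediction rule above can be sketched in a few lines of Python; the five parameters below are hypothetical values chosen for illustration.&lt;br /&gt;

```python
import math

# The five parameters (hypothetical values for illustration)
mean_x, sd_x = 100.0, 15.0
mean_y, sd_y = 50.0, 10.0
rho = 0.6

def predict_y(x):
    # Predicted Z-score of Y is rho times the Z-score of x
    z_x = (x - mean_x) / sd_x
    return mean_y + rho * z_x * sd_y

# Conditional spread shrinks by a factor sqrt(1 - rho^2)
sd_given_x = math.sqrt(1 - rho ** 2) * sd_y

print(predict_y(115.0))        # x is 1 SD above its mean, prints 56.0
print(round(sd_given_x, 2))    # prints 8.0
```

Note that the predicted Y is only 0.6 SD above its mean even though x was a full SD above its own mean; this is the regression effect described below in the source notes.&lt;br /&gt;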
&lt;br /&gt;
== Regression Effect ==&lt;br /&gt;
The &#039;&#039;&#039;regression effect&#039;&#039;&#039; is the phenomenon that the best prediction&lt;br /&gt;
of &amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt; given &amp;lt;math&amp;gt;X = x&amp;lt;/math&amp;gt; is less extreme (in Z-score terms) than&lt;br /&gt;
&amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; itself; future predictions regress toward&lt;br /&gt;
mediocrity.&lt;br /&gt;
&lt;br /&gt;
If the sample is random (or at least somewhat random), it is unlikely that a subject with an unlikely score in &#039;&#039;x&#039;&#039; also gets an unlikely score in &#039;&#039;y&#039;&#039;. Therefore, the expectation of &#039;&#039;Y&#039;&#039; given &#039;&#039;X&#039;&#039; is close to the mean. It&#039;s barely elaborated on in class; see [[wikipedia:Regression_toward_the_mean|Wikipedia]].&lt;br /&gt;
&lt;br /&gt;
When you plot all the predicted &amp;lt;math&amp;gt;E(Y|X = x)&amp;lt;/math&amp;gt;, you get the&lt;br /&gt;
&#039;&#039;&#039;linear regression line&#039;&#039;&#039;. The regression effect can be demonstrated&lt;br /&gt;
by also plotting the SD line (where the correlation is not applied).&lt;br /&gt;
&lt;br /&gt;
= Linear Regression =&lt;br /&gt;
&lt;br /&gt;
== Assumption ==&lt;br /&gt;
&lt;br /&gt;
# X and Y have a linear relationship&lt;br /&gt;
# A random sample of pairs was taken&lt;br /&gt;
# All pairs of data are independent&lt;br /&gt;
# The variance of the error is constant. &amp;lt;math&amp;gt;Var(\epsilon) = \sigma_\epsilon^2&amp;lt;/math&amp;gt;&lt;br /&gt;
# The average of the errors is zero. &amp;lt;math&amp;gt;E(\epsilon) = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
# The errors are normally distributed.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\epsilon_i \overset{iid}{\sim} N(0, \sigma_\epsilon^2), \quad Y_i \sim N(\beta_0&lt;br /&gt;
+ \beta_1 x_i, \sigma_\epsilon^2)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Procedure ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
y_i = \beta_0 + \beta_1 x_i + \epsilon_i&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where the &amp;lt;math&amp;gt;\beta_0, \beta_1&amp;lt;/math&amp;gt; are &#039;&#039;&#039;regression&lt;br /&gt;
coefficients&#039;&#039;&#039; (intercept and slope) based on the population, and&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_i&amp;lt;/math&amp;gt; is error for the i-th subject.&lt;br /&gt;
&lt;br /&gt;
We want to estimate the regression coefficients.&lt;br /&gt;
&lt;br /&gt;
Let &amp;lt;math&amp;gt;\hat{y_i}&amp;lt;/math&amp;gt; be an estimation of &amp;lt;math&amp;gt;y_i&amp;lt;/math&amp;gt;; a&lt;br /&gt;
prediction at &amp;lt;math&amp;gt;X = x&amp;lt;/math&amp;gt;, with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\hat{y_i} = \hat{\beta_0} + \hat{\beta_1} x_i&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We can measure the vertical error &amp;lt;math&amp;gt;e_i = y_i - \hat{y_i}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The overall error is the sum of squared errors &amp;lt;math&amp;gt;SSE = \sum_{i=1}^n&lt;br /&gt;
e_i^2&amp;lt;/math&amp;gt;. The best-fit line is the line minimizing SSE.&lt;br /&gt;
&lt;br /&gt;
Using calculus, we can find that the line has the following slope and&lt;br /&gt;
intercept:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\hat{\beta_1} = r \frac{s_y}{s_x}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;r&amp;lt;/math&amp;gt; is the sample correlation (the strength of the linear relationship), and&lt;br /&gt;
&amp;lt;math&amp;gt;s_x, s_y&amp;lt;/math&amp;gt; are the sample standard deviations. They are&lt;br /&gt;
basically the sample versions of &amp;lt;math&amp;gt;\rho, \sigma&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\hat{\beta_0} = \bar{Y} - \hat{\beta_1} \bar{X}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
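The two estimator formulas can be checked numerically; the sample below is hypothetical, with y lying close to a line of slope 2.&lt;br /&gt;

```python
import statistics

# Hypothetical sample, roughly following y = 2x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
xbar, ybar = statistics.mean(x), statistics.mean(y)

# Sample covariance and correlation
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)
r = sxy / (statistics.stdev(x) * statistics.stdev(y))

# Slope and intercept from the closed-form least-squares solution
b1 = r * statistics.stdev(y) / statistics.stdev(x)
b0 = ybar - b1 * xbar

print(round(b1, 3), round(b0, 3))  # slope near 2, small intercept
```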
&lt;br /&gt;
== Interpretation ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta_1}&amp;lt;/math&amp;gt; (the slope) is the estimated change in&lt;br /&gt;
&amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt; when &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; changes by one unit.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta_0}&amp;lt;/math&amp;gt; (the intercept) is the estimated average of&lt;br /&gt;
&amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt; when &amp;lt;math&amp;gt;X = 0&amp;lt;/math&amp;gt;. If &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; cannot be 0,&lt;br /&gt;
this may not have a practical meaning.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;r^2&amp;lt;/math&amp;gt; (&#039;&#039;&#039;coefficient of determination&#039;&#039;&#039;) measures how well&lt;br /&gt;
the line fits the data.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
r^2 = \frac{\sum (\hat{y_i} - \bar{Y})^2 }{\sum (y_i - \bar{Y})^2}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The denominator is the total variation in &amp;lt;math&amp;gt;y&amp;lt;/math&amp;gt;; the numerator is the variation&lt;br /&gt;
explained by the fit. The value is the proportion of variance in &amp;lt;math&amp;gt;y&amp;lt;/math&amp;gt; that is explained by the linear&lt;br /&gt;
relationship between &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt;.&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Sampling_Distribution&amp;diff=401</id>
		<title>Sampling Distribution</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Sampling_Distribution&amp;diff=401"/>
		<updated>2024-03-19T17:07:29Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* T-Distribution */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Let there be &amp;lt;math&amp;gt;Y_1, Y_2, \ldots, Y_n &amp;lt;/math&amp;gt;, where each&lt;br /&gt;
&amp;lt;math&amp;gt;Y_i&amp;lt;/math&amp;gt; is a random variable from the population.&lt;br /&gt;
&lt;br /&gt;
Every &amp;lt;math&amp;gt;Y_i&amp;lt;/math&amp;gt; has the same (unknown) mean and distribution.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
E(Y_i) = \mu, Var(Y_i) = \sigma^2&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We then have the sample mean&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\bar{Y} = \frac{1}{n} \sum_{i = 1}^n Y_i&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The sample mean is expected to be &amp;lt;math&amp;gt;\mu&amp;lt;/math&amp;gt; through a pretty easy&lt;br /&gt;
direct proof&lt;br /&gt;
&lt;br /&gt;
The variance of the sample mean is &amp;lt;math&amp;gt;\frac{\sigma^2}{n}&amp;lt;/math&amp;gt;, also&lt;br /&gt;
through a pretty easy direct proof.&lt;br /&gt;
&lt;br /&gt;
= Central limit theorem =&lt;br /&gt;
The &#039;&#039;&#039;central limit theorem&#039;&#039;&#039; states that the distribution of the sample mean follows normal distribution.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\bar{Y} \sim N(\mu, \frac{\sigma^2}{n})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As long as at least one of the following conditions is satisfied, the CLT applies, regardless of the population&#039;s distribution.&lt;br /&gt;
&lt;br /&gt;
# The population distribution of &amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt; is normal, &#039;&#039;or&#039;&#039;&lt;br /&gt;
# The sample size is large (&amp;lt;math&amp;gt;n&amp;gt;30&amp;lt;/math&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
By extension, we also have the distribution of the sum.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;S \sim N(\mu_S = n\mu, \sigma_S = \sqrt{n}\,\sigma)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;S = \sum Y_i &amp;lt;/math&amp;gt;&lt;br /&gt;
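A small simulation (hypothetical uniform population) illustrates the CLT claims about the mean and variance of the sample mean.&lt;br /&gt;

```python
import random
import statistics

random.seed(1)

# Population: uniform on [0, 1), so mu = 0.5 and sigma^2 = 1/12
n = 40          # size of one sample
trials = 2000   # number of repeated samples

means = [statistics.mean(random.random() for _ in range(n))
         for _ in range(trials)]

# CLT: the sample means cluster around mu with variance sigma^2 / n
print(round(statistics.mean(means), 2))          # close to 0.5
print(round(statistics.variance(means) * n, 3))  # close to 1/12 = 0.083
```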
&lt;br /&gt;
== Proportion Approximation ==&lt;br /&gt;
The sampling distribution of the proportion of success of &#039;&#039;n&#039;&#039; bernoulli random variables (&amp;lt;math&amp;gt;\hat{p}&amp;lt;/math&amp;gt;) can also be approximated to a normal distribution under the CLT.&lt;br /&gt;
&lt;br /&gt;
Consider [[Discrete Random Variable#Bernoulli|bernoulli]] random variables&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;Y_1, \ldots, Y_n&amp;lt;/math&amp;gt; with &amp;lt;math&amp;gt;E(Y_i) = p,\ Var(Y_i) = p(1 - p)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The proportion of success &amp;lt;math&amp;gt;\hat{p}&amp;lt;/math&amp;gt; is the sum over the count, so the expected probability is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;E(\hat{p}) = E \left(\frac{1}{n} \sum Y_i \right) = p&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and the variance of &amp;lt;math&amp;gt;\hat{p}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;Var \left(\frac{1}{n} \sum Y_i \right) = \frac{1}{n^2} Var \left(\sum Y_i \right)  = \frac{p (1 - p)}{n}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
With a large sample size &#039;&#039;n&#039;&#039;, we can approximate this to a normal distribution. Notably, the criteria for a large &#039;&#039;n&#039;&#039; is different from that of the continuous random variable.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;np &amp;gt; 5, n(1 - p) &amp;gt; 5&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
then we have&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{p} \sim N \left( p, \frac{p(1-p)}{n} \right)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The reasoning behind the weird criteria relates to the [[Discrete Random Variable|binomial distribution]]. It&#039;s not very elaborated on in the lecture, but &amp;lt;math&amp;gt;np&amp;lt;/math&amp;gt; is the mean of the binomial (i.e. the expected number of successes). The criteria essentially makes sure that no negative values are plausible in the approximation; with a small mean and a large variance, the left side of a normal approximation goes into the negative, but bernoulli/binomial must always be positive.&lt;br /&gt;
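A short sketch of the rule-of-thumb check and the resulting approximation, using hypothetical values n = 200 and p = 0.1.&lt;br /&gt;

```python
import math

# Hypothetical: n = 200 trials with success probability p = 0.1
n, p = 200, 0.1

# Rule of thumb before using the normal approximation:
# both expected counts, np and n(1 - p), should exceed 5
ok = min(n * p, n * (1 - p))
print(ok)  # 20.0, so the approximation is reasonable

# Normal approximation for the sample proportion p-hat
mean = p
sd = math.sqrt(p * (1 - p) / n)
print(round(sd, 4))  # 0.0212
```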
&lt;br /&gt;
== Binomial Approximation ==&lt;br /&gt;
I&#039;m short on time. This is based on the above section and a bit of math.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;Y \sim N(np, np(1 - p))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Confidence Interval =&lt;br /&gt;
&#039;&#039;&#039;Estimation&#039;&#039;&#039; is the guess for the unknown parameter. A &#039;&#039;&#039;point estimate&#039;&#039;&#039; is a &amp;quot;best guess&amp;quot; of the population parameter, whereas the &#039;&#039;&#039;confidence interval&#039;&#039;&#039; is the range of reasonable values that is intended to contain the &#039;&#039;&#039;parameter of interest&#039;&#039;&#039; with a certain &#039;&#039;&#039;degree of confidence&#039;&#039;&#039;, calculated with&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;(point estimate - margin of error, point estimate + margin of error)&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== Standard Error ==&lt;br /&gt;
The &#039;&#039;&#039;standard error&#039;&#039;&#039; measures how much error we expect to make when estimating &amp;lt;math&amp;gt;\mu_Y&amp;lt;/math&amp;gt; by &amp;lt;math&amp;gt;\bar{y}&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Constructing CIs ==&lt;br /&gt;
By CLT, &amp;lt;math&amp;gt;\bar{Y} \sim N(\mu, \frac{\sigma^2}{n} )&amp;lt;/math&amp;gt;. The&lt;br /&gt;
confidence interval is the range of plausible &amp;lt;math&amp;gt;\bar{Y}&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If we define the middle 90% to be plausible, then to find the&lt;br /&gt;
confidence interval we simply find the 5th and 95th percentiles.&lt;br /&gt;
&lt;br /&gt;
Generalized, if we want a confidence interval of the middle &amp;lt;math&amp;gt;(1 - \alpha) \times 100\%&amp;lt;/math&amp;gt;, we have a confidence interval of&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
 \bar{y} \pm Z_{\alpha / 2} \frac{\sigma}{ \sqrt{n} }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\bar{y}&amp;lt;/math&amp;gt; is the sample mean and &amp;lt;math&amp;gt;Z_{x}&amp;lt;/math&amp;gt; is the z score of the x-th percentile.&lt;br /&gt;
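A minimal sketch of the z-based interval, using Python&#039;s statistics.NormalDist for the percentile; the sample summary values are hypothetical.&lt;br /&gt;

```python
from statistics import NormalDist

# Hypothetical: sample mean 172, known sigma 8, n = 64, 95% confidence
ybar, sigma, n = 172.0, 8.0, 64
alpha = 0.05

# Z_{alpha/2}: the z-score cutting off the top alpha/2 tail
z = NormalDist().inv_cdf(1 - alpha / 2)

margin = z * sigma / n ** 0.5
print(round(ybar - margin, 2), round(ybar + margin, 2))  # 170.04 173.96
```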
&lt;br /&gt;
= T-Distribution =&lt;br /&gt;
[[File:T distribution table.png|thumb|T distribution table]]&lt;br /&gt;
&lt;br /&gt;
CLT is based on the population variance. Since we don&#039;t know the population variance &amp;lt;math&amp;gt;\sigma^2&amp;lt;/math&amp;gt;, we&lt;br /&gt;
have to use the sample standard deviation &amp;lt;math&amp;gt;s&amp;lt;/math&amp;gt; to estimate it. This&lt;br /&gt;
introduces more uncertainty, accounted for by the &#039;&#039;&#039;t-distribution.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
T-distribution is the distribution of sample mean based on population&lt;br /&gt;
mean, sample variance and &#039;&#039;degrees of freedom&#039;&#039; (covered later). It&lt;br /&gt;
looks very similar to normal distribution.&lt;br /&gt;
&lt;br /&gt;
When the sample size &amp;lt;math&amp;gt;n&amp;lt;/math&amp;gt; is small, there is greater&lt;br /&gt;
uncertainty in the estimates, so the t critical value is larger than the corresponding z critical value:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
t_{\alpha/2} &amp;gt; Z_{\alpha/2}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The spread of t-distribution depends on the &#039;&#039;&#039;degrees of freedom&#039;&#039;&#039;,&lt;br /&gt;
which is based on sample size. When looking up the table, round down df.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\upsilon = n - 1&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As the sample size increases, degrees of freedom increase, the spread of&lt;br /&gt;
t-distribution decreases, and t-distribution approaches normal&lt;br /&gt;
distribution.&lt;br /&gt;
&lt;br /&gt;
Based on CLT and normal distribution, we had the confidence interval&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
 \bar{Y} \pm Z_{\alpha / 2} \frac{\sigma}{ \sqrt{n} }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now, based on T-distribution, we have the CI&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
 \bar{Y} \pm t_{\alpha / 2} \frac{s}{ \sqrt{n} }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Find Sample Size ====&lt;br /&gt;
We can calculate the sample size needed for a desired margin of&lt;br /&gt;
error and sample variance by assuming that &amp;lt;math&amp;gt;\upsilon = \infty&amp;lt;/math&amp;gt; (so &amp;lt;math&amp;gt;t_{\alpha/2} \approx Z_{\alpha/2}&amp;lt;/math&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
n = \frac{Z^2_{\alpha/2} s^2}{E^2}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We want to always &#039;&#039;round up&#039;&#039; to stay within the error margin.&lt;br /&gt;
&lt;br /&gt;
Rounding up can only increase &amp;lt;math&amp;gt;n&amp;lt;/math&amp;gt;, which shrinks the margin of error, so the target margin is still met.&lt;br /&gt;
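A quick sketch of the sample-size formula with hypothetical inputs (pilot SD 12, desired margin 2, 95% confidence); the normal quantile stands in for t at &amp;lt;math&amp;gt;\upsilon = \infty&amp;lt;/math&amp;gt;.&lt;br /&gt;

```python
import math
from statistics import NormalDist

# Hypothetical pilot study: sample SD 12, desired margin of error 2,
# 95% confidence (alpha = 0.05)
s, E, alpha = 12.0, 2.0, 0.05

z = NormalDist().inv_cdf(1 - alpha / 2)
n_exact = (z * s / E) ** 2

# Round up: a larger n can only shrink the margin of error
n = math.ceil(n_exact)
print(n)  # 139
```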
&lt;br /&gt;
= Sampling Distribution of Difference =&lt;br /&gt;
&lt;br /&gt;
By linear combination of RVs, sampling distribution of &amp;lt;math&amp;gt;\bar{Y_1} - \bar{Y_2}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
 ( \bar{Y_1} - \bar{Y_2} ) \sim N(\mu_1 - \mu_2, \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2})&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, we do not know the population variances &amp;lt;math&amp;gt;\sigma_1^2, \sigma_2^2&amp;lt;/math&amp;gt;. If the CLT assumptions hold, then the degrees of freedom are approximated by&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\upsilon = \frac{ \left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2 }{ \frac{(s_1^2 / n_1)^2 }{n_1 - 1} + \frac{(s_2^2 / n_2)^2 }{n_2 - 1} }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This is the Welch-Satterthwaite approximation (given without proof in class). Remember to round down to use the t-table. With this degree of freedom, we can use the sample variances to estimate the distribution.&lt;br /&gt;
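The degrees-of-freedom formula above can be evaluated directly; the two-sample summary statistics below are hypothetical.&lt;br /&gt;

```python
# Welch-Satterthwaite degrees of freedom for two samples
# (hypothetical summary statistics)
s1_sq, n1 = 4.0, 15
s2_sq, n2 = 9.0, 20

a = s1_sq / n1
b = s2_sq / n2

df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

# Round down before looking the value up in a t-table
print(int(df))  # 32
```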
&lt;br /&gt;
&lt;br /&gt;
[[Category:Sample Statistics]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Sampling_Distribution&amp;diff=400</id>
		<title>Sampling Distribution</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Sampling_Distribution&amp;diff=400"/>
		<updated>2024-03-19T17:04:32Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Confidence Interval */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Let there be &amp;lt;math&amp;gt;Y_1, Y_2, \ldots, Y_n &amp;lt;/math&amp;gt;, where each&lt;br /&gt;
&amp;lt;math&amp;gt;Y_i&amp;lt;/math&amp;gt; is a random variable from the population.&lt;br /&gt;
&lt;br /&gt;
Every &amp;lt;math&amp;gt;Y_i&amp;lt;/math&amp;gt; has the same (unknown) mean and distribution.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
E(Y_i) = \mu, Var(Y_i) = \sigma^2&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We then have the sample mean&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\bar{Y} = \frac{1}{n} \sum_{i = 1}^n Y_i&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The sample mean is expected to be &amp;lt;math&amp;gt;\mu&amp;lt;/math&amp;gt; through a pretty easy&lt;br /&gt;
direct proof&lt;br /&gt;
&lt;br /&gt;
The variance of the sample mean is &amp;lt;math&amp;gt;\frac{\sigma^2}{n}&amp;lt;/math&amp;gt;, also&lt;br /&gt;
through a pretty easy direct proof.&lt;br /&gt;
&lt;br /&gt;
= Central limit theorem =&lt;br /&gt;
The &#039;&#039;&#039;central limit theorem&#039;&#039;&#039; states that the distribution of the sample mean follows normal distribution.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\bar{Y} \sim N(\mu, \frac{\sigma^2}{n})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As long as at least one of the following conditions is satisfied, the CLT applies, regardless of the population&#039;s distribution.&lt;br /&gt;
&lt;br /&gt;
# The population distribution of &amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt; is normal, &#039;&#039;or&#039;&#039;&lt;br /&gt;
# The sample size is large (&amp;lt;math&amp;gt;n&amp;gt;30&amp;lt;/math&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
By extension, we also have the distribution of the sum.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;S \sim N(\mu_S = n\mu, \sigma_S = \sqrt{n}\,\sigma)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;S = \sum Y_i &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Proportion Approximation ==&lt;br /&gt;
The sampling distribution of the proportion of success of &#039;&#039;n&#039;&#039; bernoulli random variables (&amp;lt;math&amp;gt;\hat{p}&amp;lt;/math&amp;gt;) can also be approximated to a normal distribution under the CLT.&lt;br /&gt;
&lt;br /&gt;
Consider [[Discrete Random Variable#Bernoulli|bernoulli]] random variables&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;Y_1, \ldots, Y_n&amp;lt;/math&amp;gt; with &amp;lt;math&amp;gt;E(Y_i) = p,\ Var(Y_i) = p(1 - p)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The proportion of success &amp;lt;math&amp;gt;\hat{p}&amp;lt;/math&amp;gt; is the sum over the count, so the expected probability is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;E(\hat{p}) = E \left(\frac{1}{n} \sum Y_i \right) = p&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and the variance of &amp;lt;math&amp;gt;\hat{p}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;Var \left(\frac{1}{n} \sum Y_i \right) = \frac{1}{n^2} Var \left(\sum Y_i \right)  = \frac{p (1 - p)}{n}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
With a large sample size &#039;&#039;n&#039;&#039;, we can approximate this to a normal distribution. Notably, the criteria for a large &#039;&#039;n&#039;&#039; is different from that of the continuous random variable.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;np &amp;gt; 5, n(1 - p) &amp;gt; 5&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
then we have&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{p} \sim N \left( p, \frac{p(1-p)}{n} \right)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The reasoning behind the weird criteria relates to the [[Discrete Random Variable|binomial distribution]]. It&#039;s not very elaborated on in the lecture, but &amp;lt;math&amp;gt;np&amp;lt;/math&amp;gt; is the mean of the binomial (i.e. the expected number of successes). The criteria essentially makes sure that no negative values are plausible in the approximation; with a small mean and a large variance, the left side of a normal approximation goes into the negative, but bernoulli/binomial must always be positive.&lt;br /&gt;
&lt;br /&gt;
== Binomial Approximation ==&lt;br /&gt;
I&#039;m short on time. This is based on the above section and a bit of math.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;Y \sim N(np, np(1 - p))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Confidence Interval =&lt;br /&gt;
&#039;&#039;&#039;Estimation&#039;&#039;&#039; is the guess for the unknown parameter. A &#039;&#039;&#039;point estimate&#039;&#039;&#039; is a &amp;quot;best guess&amp;quot; of the population parameter, whereas the &#039;&#039;&#039;confidence interval&#039;&#039;&#039; is the range of reasonable values that is intended to contain the &#039;&#039;&#039;parameter of interest&#039;&#039;&#039; with a certain &#039;&#039;&#039;degree of confidence&#039;&#039;&#039;, calculated with&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;(point estimate - margin of error, point estimate + margin of error)&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== Standard Error ==&lt;br /&gt;
The &#039;&#039;&#039;standard error&#039;&#039;&#039; measures how much error we expect to make when estimating &amp;lt;math&amp;gt;\mu_Y&amp;lt;/math&amp;gt; by &amp;lt;math&amp;gt;\bar{y}&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Constructing CIs ==&lt;br /&gt;
By CLT, &amp;lt;math&amp;gt;\bar{Y} \sim N(\mu, \frac{\sigma^2}{n} )&amp;lt;/math&amp;gt;. The&lt;br /&gt;
confidence interval is the range of plausible &amp;lt;math&amp;gt;\bar{Y}&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If we define the middle 90% to be plausible, then to find the&lt;br /&gt;
confidence interval we simply find the 5th and 95th percentiles.&lt;br /&gt;
&lt;br /&gt;
Generalized, if we want a confidence interval of the middle &amp;lt;math&amp;gt;(1 - \alpha) \times 100\%&amp;lt;/math&amp;gt;, we have a confidence interval of&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
 \bar{y} \pm Z_{\alpha / 2} \frac{\sigma}{ \sqrt{n} }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\bar{y}&amp;lt;/math&amp;gt; is the sample mean and &amp;lt;math&amp;gt;Z_{x}&amp;lt;/math&amp;gt; is the z score of the x-th percentile.&lt;br /&gt;
&lt;br /&gt;
= T-Distribution =&lt;br /&gt;
[[File:T distribution table.png|thumb|T distribution table]]&lt;br /&gt;
&lt;br /&gt;
CLT has several restrictions, the biggest one being a large sample size.&lt;br /&gt;
&lt;br /&gt;
Since we don&#039;t know the population variance &amp;lt;math&amp;gt;\sigma^2&amp;lt;/math&amp;gt;, we&lt;br /&gt;
have to use the sample standard deviation &amp;lt;math&amp;gt;s&amp;lt;/math&amp;gt; to estimate it. This&lt;br /&gt;
introduces more uncertainty, accounted for by the &#039;&#039;&#039;t-distribution.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
T-distribution is the distribution of sample mean based on population&lt;br /&gt;
mean, sample variance and &#039;&#039;degrees of freedom&#039;&#039; (covered later). It&lt;br /&gt;
looks very similar to normal distribution.&lt;br /&gt;
&lt;br /&gt;
When the sample size &amp;lt;math&amp;gt;n&amp;lt;/math&amp;gt; is small, there is greater&lt;br /&gt;
uncertainty in the estimates, so the t critical value is larger than the corresponding z critical value:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
t_{\alpha/2} &amp;gt; Z_{\alpha/2}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The spread of t-distribution depends on the &#039;&#039;&#039;degrees of freedom&#039;&#039;&#039;,&lt;br /&gt;
which is based on sample size. When looking up the table, round down df.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\upsilon = n - 1&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As the sample size increases, degrees of freedom increase, the spread of&lt;br /&gt;
t-distribution decreases, and t-distribution approaches normal&lt;br /&gt;
distribution.&lt;br /&gt;
&lt;br /&gt;
Based on CLT and normal distribution, we had the confidence interval&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
 \bar{Y} \pm Z_{\alpha / 2} \frac{\sigma}{ \sqrt{n} }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now, based on T-distribution, we have the CI&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
 \bar{Y} \pm t_{\alpha / 2} \frac{s}{ \sqrt{n} }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Find Sample Size ====&lt;br /&gt;
We can calculate the sample size needed for a desired margin of&lt;br /&gt;
error and sample variance by assuming that &amp;lt;math&amp;gt;\upsilon = \infty&amp;lt;/math&amp;gt; (so &amp;lt;math&amp;gt;t_{\alpha/2} \approx Z_{\alpha/2}&amp;lt;/math&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
n = \frac{Z^2_{\alpha/2} s^2}{E^2}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We want to always &#039;&#039;round up&#039;&#039; to stay within the error margin.&lt;br /&gt;
&lt;br /&gt;
Rounding up can only increase &amp;lt;math&amp;gt;n&amp;lt;/math&amp;gt;, which shrinks the margin of error, so the target margin is still met.&lt;br /&gt;
&lt;br /&gt;
= Sampling Distribution of Difference =&lt;br /&gt;
&lt;br /&gt;
By linear combination of RVs, sampling distribution of &amp;lt;math&amp;gt;\bar{Y_1} - \bar{Y_2}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
 ( \bar{Y_1} - \bar{Y_2} ) \sim N(\mu_1 - \mu_2, \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2})&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, we do not know the population variances &amp;lt;math&amp;gt;\sigma_1^2, \sigma_2^2&amp;lt;/math&amp;gt;. If the CLT assumptions hold, then the degrees of freedom are approximated by&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\upsilon = \frac{ \left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2 }{ \frac{(s_1^2 / n_1)^2 }{n_1 - 1} + \frac{(s_2^2 / n_2)^2 }{n_2 - 1} }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This is the Welch-Satterthwaite approximation (given without proof in class). Remember to round down to use the t-table. With this degree of freedom, we can use the sample variances to estimate the distribution.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Sample Statistics]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Sampling_Distribution&amp;diff=399</id>
		<title>Sampling Distribution</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Sampling_Distribution&amp;diff=399"/>
		<updated>2024-03-19T16:59:02Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Binomial Normal Approximation */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Let there be &amp;lt;math&amp;gt;Y_1, Y_2, \ldots, Y_n &amp;lt;/math&amp;gt;, where each&lt;br /&gt;
&amp;lt;math&amp;gt;Y_i&amp;lt;/math&amp;gt; is a random variable from the population.&lt;br /&gt;
&lt;br /&gt;
Every &amp;lt;math&amp;gt;Y_i&amp;lt;/math&amp;gt; has the same (unknown) mean and distribution.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
E(Y_i) = \mu, Var(Y_i) = \sigma^2&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We then have the sample mean&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\bar{Y} = \frac{1}{n} \sum_{i = 1}^n Y_i&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The sample mean is expected to be &amp;lt;math&amp;gt;\mu&amp;lt;/math&amp;gt; through a pretty easy&lt;br /&gt;
direct proof&lt;br /&gt;
&lt;br /&gt;
The variance of the sample mean is &amp;lt;math&amp;gt;\frac{\sigma^2}{n}&amp;lt;/math&amp;gt;, also&lt;br /&gt;
through a pretty easy direct proof.&lt;br /&gt;
&lt;br /&gt;
= Central limit theorem =&lt;br /&gt;
The &#039;&#039;&#039;central limit theorem&#039;&#039;&#039; states that the distribution of the sample mean follows normal distribution.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\bar{Y} \sim N(\mu, \frac{\sigma^2}{n})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As long as at least one of the following conditions is satisfied, the CLT applies, regardless of the population&#039;s distribution.&lt;br /&gt;
&lt;br /&gt;
# The population distribution of &amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt; is normal, &#039;&#039;or&#039;&#039;&lt;br /&gt;
# The sample size is large (&amp;lt;math&amp;gt;n&amp;gt;30&amp;lt;/math&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
By extension, we also have the distribution of the sum.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;S \sim N(\mu_S = n\mu, \sigma_S = \sqrt{n}\,\sigma)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;S = \sum Y_i &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Proportion Approximation ==&lt;br /&gt;
The sampling distribution of the proportion of success of &#039;&#039;n&#039;&#039; bernoulli random variables (&amp;lt;math&amp;gt;\hat{p}&amp;lt;/math&amp;gt;) can also be approximated to a normal distribution under the CLT.&lt;br /&gt;
&lt;br /&gt;
Consider [[Discrete Random Variable#Bernoulli|bernoulli]] random variables&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;Y_1, \ldots, Y_n&amp;lt;/math&amp;gt; with &amp;lt;math&amp;gt;E(Y_i) = p,\ Var(Y_i) = p(1 - p)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The proportion of success &amp;lt;math&amp;gt;\hat{p}&amp;lt;/math&amp;gt; is the sum over the count, so the expected probability is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;E(\hat{p}) = E \left(\frac{1}{n} \sum Y_i \right) = p&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and the variance of &amp;lt;math&amp;gt;\hat{p}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;Var \left(\frac{1}{n} \sum Y_i \right) = \frac{1}{n^2} Var \left(\sum Y_i \right)  = \frac{p (1 - p)}{n}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
With a large sample size &#039;&#039;n&#039;&#039;, we can approximate this to a normal distribution. Notably, the criteria for a large &#039;&#039;n&#039;&#039; is different from that of the continuous random variable.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;np &amp;gt; 5, n(1 - p) &amp;gt; 5&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
then we have&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{p} \sim N \left( p, \frac{p(1-p)}{n} \right)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The reasoning behind the weird criteria relates to the [[Discrete Random Variable|binomial distribution]]. It&#039;s not very elaborated on in the lecture, but &amp;lt;math&amp;gt;np&amp;lt;/math&amp;gt; is the mean of the binomial (i.e. the expected number of successes). The criteria essentially makes sure that no negative values are plausible in the approximation; with a small mean and a large variance, the left side of a normal approximation goes into the negative, but bernoulli/binomial must always be positive.&lt;br /&gt;
&lt;br /&gt;
== Binomial Approximation ==&lt;br /&gt;
This follows from the above section and a bit of math: since &amp;lt;math&amp;gt;Y = n\hat{p}&amp;lt;/math&amp;gt;, scaling the approximation of &amp;lt;math&amp;gt;\hat{p}&amp;lt;/math&amp;gt; by &#039;&#039;n&#039;&#039; gives&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;Y \sim N(np, np(1 - p))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Confidence Interval =&lt;br /&gt;
&#039;&#039;&#039;Estimation&#039;&#039;&#039; is the guess for the unknown parameter. A &#039;&#039;&#039;point estimate&#039;&#039;&#039; is a &amp;quot;best guess&amp;quot; of the population parameter, whereas the &#039;&#039;&#039;confidence interval&#039;&#039;&#039; is the range of reasonable values intended to contain the &#039;&#039;&#039;parameter of interest&#039;&#039;&#039; with a certain &#039;&#039;&#039;degree of confidence&#039;&#039;&#039;, calculated with&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;(point estimate - margin of error, point estimate + margin of error)&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==== Constructing CIs ====&lt;br /&gt;
By CLT, &amp;lt;math&amp;gt;\bar{Y} \sim N(\mu, \frac{\sigma^2}{n} )&amp;lt;/math&amp;gt;. The&lt;br /&gt;
confidence interval is the range of plausible &amp;lt;math&amp;gt;\bar{Y}&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If we define the middle 90% to be plausible, to find the&lt;br /&gt;
confidence interval, simply find the 5th and 95th percentile.&lt;br /&gt;
&lt;br /&gt;
Generalized, if we want a confidence interval covering the middle &amp;lt;math&amp;gt;(1 - \alpha) \cdot 100\%&amp;lt;/math&amp;gt;, the confidence interval is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
 \bar{y} \pm Z_{\alpha / 2} \frac{\sigma}{ \sqrt{n} }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\bar{y}&amp;lt;/math&amp;gt; is the sample mean and &amp;lt;math&amp;gt;Z_{x}&amp;lt;/math&amp;gt; is the z score of the x-th percentile.&lt;br /&gt;
&lt;br /&gt;
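For instance, a minimal Python sketch of this interval (the sample mean, &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt;, and &amp;lt;math&amp;gt;n&amp;lt;/math&amp;gt; below are assumed values, not from the notes):&lt;br /&gt;

```python
# 90% z-based confidence interval with hypothetical inputs:
# sample mean 50, known sigma 10, n = 25.
from math import sqrt
from statistics import NormalDist

y_bar, sigma, n, alpha = 50.0, 10.0, 25, 0.10
z = NormalDist().inv_cdf(1 - alpha / 2)   # Z_{alpha/2}, about 1.645
margin = z * sigma / sqrt(n)
ci = (y_bar - margin, y_bar + margin)
print(ci)                                 # roughly (46.71, 53.29)
```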
= T-Distribution =&lt;br /&gt;
[[File:T distribution table.png|thumb|T distribution table]]&lt;br /&gt;
&lt;br /&gt;
CLT has several restrictions, the biggest one being a large sample size; the &#039;&#039;&#039;t-distribution&#039;&#039;&#039; addresses the small-sample case.&lt;br /&gt;
&lt;br /&gt;
Since we don&#039;t know the population variance &amp;lt;math&amp;gt;\sigma^2&amp;lt;/math&amp;gt;, we&lt;br /&gt;
have to use the sample variance &amp;lt;math&amp;gt;s&amp;lt;/math&amp;gt; to estimate it. This&lt;br /&gt;
introduces more uncertainty, accounted for by the &#039;&#039;&#039;t-distribution.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
T-distribution is the distribution of sample mean based on population&lt;br /&gt;
mean, sample variance and &#039;&#039;degrees of freedom&#039;&#039; (covered later). It&lt;br /&gt;
looks very similar to normal distribution.&lt;br /&gt;
&lt;br /&gt;
When the sample size &amp;lt;math&amp;gt;n&amp;lt;/math&amp;gt; is small, there is greater&lt;br /&gt;
uncertainty in the estimates, so the t-distribution has heavier tails than the normal and its critical values are larger:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
t_{\alpha/2} &amp;gt; Z_{\alpha/2}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The spread of t-distribution depends on the &#039;&#039;&#039;degrees of freedom&#039;&#039;&#039;,&lt;br /&gt;
which is based on sample size. When looking up the table, round down df.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\upsilon = n - 1&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As the sample size increases, degrees of freedom increase, the spread of&lt;br /&gt;
t-distribution decreases, and t-distribution approaches normal&lt;br /&gt;
distribution.&lt;br /&gt;
&lt;br /&gt;
Based on CLT and normal distribution, we had the confidence interval&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
 \bar{Y} \pm Z_{\alpha / 2} \frac{\sigma}{ \sqrt{n} }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now, based on T-distribution, we have the CI&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
 \bar{Y} \pm t_{\alpha / 2} \frac{s}{ \sqrt{n} }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
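A sketch of the t-based interval; Python&#039;s standard library has no t quantile function, so the critical value is taken from a t-table (the sample statistics here are hypothetical):&lt;br /&gt;

```python
# t-based CI with hypothetical sample statistics (n = 10, so df = 9).
from math import sqrt

y_bar, s, n = 50.0, 10.0, 10
t_crit = 2.262                  # t_{alpha/2} for df = 9, alpha = 0.05, from a t-table
margin = t_crit * s / sqrt(n)
ci = (y_bar - margin, y_bar + margin)
print(ci)                       # wider than the corresponding z-based interval
```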
==== Find Sample Size ====&lt;br /&gt;
To calculate the sample size needed for a desired margin of error &amp;lt;math&amp;gt;E&amp;lt;/math&amp;gt; and sample variance &amp;lt;math&amp;gt;s^2&amp;lt;/math&amp;gt;, assume &amp;lt;math&amp;gt;\upsilon = \infty&amp;lt;/math&amp;gt; (so &amp;lt;math&amp;gt;t_{\alpha/2}&amp;lt;/math&amp;gt; reduces to &amp;lt;math&amp;gt;Z_{\alpha/2}&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
n = \frac{Z^2_{\alpha/2} s^2}{E^2}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We want to always &#039;&#039;round up&#039;&#039; to stay within the error margin: increasing &#039;&#039;n&#039;&#039; only shrinks the margin, while rounding down could push it past &amp;lt;math&amp;gt;E&amp;lt;/math&amp;gt;.&lt;br /&gt;
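A quick sketch with made-up numbers (desired margin &amp;lt;math&amp;gt;E = 2&amp;lt;/math&amp;gt;, pilot variance &amp;lt;math&amp;gt;s^2 = 100&amp;lt;/math&amp;gt;, 95% confidence):&lt;br /&gt;

```python
# Sample size for a desired margin of error, rounding up.
from math import ceil
from statistics import NormalDist

E, s_sq, alpha = 2.0, 100.0, 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96
n = ceil(z**2 * s_sq / E**2)              # round up to stay within E
print(n)                                  # 97
```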
&lt;br /&gt;
= Sampling Distribution of Difference =&lt;br /&gt;
&lt;br /&gt;
By linear combination of RVs, sampling distribution of &amp;lt;math&amp;gt;\bar{Y_1} - \bar{Y_2}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
 ( \bar{Y_1} - \bar{Y_2} ) \sim N(\mu_1 - \mu_2, \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2})&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, we do not know the population variances &amp;lt;math&amp;gt;\sigma_1^2, \sigma_2^2&amp;lt;/math&amp;gt;. If the CLT assumptions hold, we estimate them with the sample variances &amp;lt;math&amp;gt;s_1^2, s_2^2&amp;lt;/math&amp;gt;, using the degrees of freedom&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\upsilon = \frac{ \left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2 }{ \frac{(s_1^2 / n_1)^2 }{n_1 - 1} + \frac{(s_2^2 / n_2)^2 }{n_2 - 1} }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Trust me bro (this is the Welch-Satterthwaite approximation). Remember to round down to use the t-table. With this degree of freedom, we can use the sample variances to estimate the distribution.&lt;br /&gt;
&lt;br /&gt;
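A sketch of the formula with hypothetical sample statistics:&lt;br /&gt;

```python
# Welch-Satterthwaite degrees of freedom for two samples
# (the variances and sizes here are made up).
from math import floor

s1_sq, n1 = 4.0, 15
s2_sq, n2 = 9.0, 20

a, b = s1_sq / n1, s2_sq / n2
df = (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))
print(floor(df))   # round down before looking up the t-table
```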
&lt;br /&gt;
[[Category:Sample Statistics]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Sampling_Distribution&amp;diff=398</id>
		<title>Sampling Distribution</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Sampling_Distribution&amp;diff=398"/>
		<updated>2024-03-19T16:41:52Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Binomial Normal Approximation */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Let there be &amp;lt;math&amp;gt;Y_1, Y_2, \ldots, Y_n &amp;lt;/math&amp;gt;, where each&lt;br /&gt;
&amp;lt;math&amp;gt;Y_i&amp;lt;/math&amp;gt; is a random variable from the population.&lt;br /&gt;
&lt;br /&gt;
Every &amp;lt;math&amp;gt;Y_i&amp;lt;/math&amp;gt; has the same mean and distribution, which we don&#039;t know.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
E(Y_i) = \mu, Var(Y_i) = \sigma^2&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We then have the sample mean&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\bar{Y} = \frac{1}{n} \sum_{i = 1}^n Y_i&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The sample mean is expected to be &amp;lt;math&amp;gt;\mu&amp;lt;/math&amp;gt; through a pretty easy&lt;br /&gt;
direct proof.&lt;br /&gt;
&lt;br /&gt;
The variance of the sample mean is &amp;lt;math&amp;gt;\frac{\sigma^2}{n}&amp;lt;/math&amp;gt;, also&lt;br /&gt;
through a pretty easy direct proof.&lt;br /&gt;
&lt;br /&gt;
= Central limit theorem =&lt;br /&gt;
The &#039;&#039;&#039;central limit theorem&#039;&#039;&#039; states that the distribution of the sample mean follows normal distribution.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\bar{Y} \sim N(\mu, \frac{\sigma^2}{n})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As long as at least one of the following two conditions is satisfied, CLT applies, regardless of the population&#039;s distribution.&lt;br /&gt;
&lt;br /&gt;
# The population distribution of &amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt; is normal, &#039;&#039;or&#039;&#039;&lt;br /&gt;
# The sample size is large, &amp;lt;math&amp;gt;n &amp;gt; 30&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By extension, we also have the distribution of the sum.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;S \sim N(\mu_S = n\mu, \sigma_S = \sqrt{n}\,\sigma)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;S = \sum Y_i &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
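A quick simulation sketch of the statement above (the population is chosen as Uniform(0, 1) purely for illustration):&lt;br /&gt;

```python
# Simulate many sample means from a Uniform(0, 1) population and check
# that their mean is near mu and their variance near sigma^2 / n.
import random
from statistics import mean, variance

random.seed(0)
n, reps = 40, 2000
means = [mean(random.uniform(0, 1) for _ in range(n)) for _ in range(reps)]

mu, sigma_sq = 0.5, 1 / 12            # Uniform(0, 1): mu = 1/2, sigma^2 = 1/12
print(round(mean(means), 3))          # close to mu
print(round(variance(means), 5))      # close to sigma_sq / n
```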
= Confidence Interval =&lt;br /&gt;
&#039;&#039;&#039;Estimation&#039;&#039;&#039; is the guess for the unknown parameter. A &#039;&#039;&#039;point estimate&#039;&#039;&#039; is a &amp;quot;best guess&amp;quot; of the population parameter, whereas the &#039;&#039;&#039;confidence interval&#039;&#039;&#039; is the range of reasonable values intended to contain the &#039;&#039;&#039;parameter of interest&#039;&#039;&#039; with a certain &#039;&#039;&#039;degree of confidence&#039;&#039;&#039;, calculated with&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;(point estimate - margin of error, point estimate + margin of error)&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==== Constructing CIs ====&lt;br /&gt;
By CLT, &amp;lt;math&amp;gt;\bar{Y} \sim N(\mu, \frac{\sigma^2}{n} )&amp;lt;/math&amp;gt;. The&lt;br /&gt;
confidence interval is the range of plausible &amp;lt;math&amp;gt;\bar{Y}&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If we define the middle 90% to be plausible, to find the&lt;br /&gt;
confidence interval, simply find the 5th and 95th percentile.&lt;br /&gt;
&lt;br /&gt;
Generalized, if we want a confidence interval covering the middle &amp;lt;math&amp;gt;(1 - \alpha) \cdot 100\%&amp;lt;/math&amp;gt;, the confidence interval is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
 \bar{y} \pm Z_{\alpha / 2} \frac{\sigma}{ \sqrt{n} }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\bar{y}&amp;lt;/math&amp;gt; is the sample mean and &amp;lt;math&amp;gt;Z_{x}&amp;lt;/math&amp;gt; is the z score of the x-th percentile.&lt;br /&gt;
&lt;br /&gt;
= Binomial Normal Approximation =&lt;br /&gt;
The sampling distribution of the proportion of success of &#039;&#039;n&#039;&#039; bernoulli random variables (&amp;lt;math&amp;gt;\hat{p}&amp;lt;/math&amp;gt;) can also be approximated by a normal distribution under the CLT.&lt;br /&gt;
&lt;br /&gt;
Consider [[Discrete Random Variable#Bernoulli|bernoulli]] random variables&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;Y_1, \ldots, Y_n \sim \mathrm{Bernoulli}(p), \quad E(Y_i) = p, \quad Var(Y_i) = p(1 - p)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The proportion of success &amp;lt;math&amp;gt;\hat{p}&amp;lt;/math&amp;gt; is the sum over the count, so its expected value is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;E(\hat{p}) = E \left(\frac{1}{n} \sum Y_i \right) = p&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and the variance of &amp;lt;math&amp;gt;\hat{p}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;Var \left(\frac{1}{n} \sum Y_i \right) = \frac{1}{n^2} Var \left(\sum Y_i \right)  = \frac{p (1 - p)}{n}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
With a large sample size &#039;&#039;n&#039;&#039;, we can approximate this with a normal distribution. Notably, the criterion for a large &#039;&#039;n&#039;&#039; differs from that of a continuous random variable:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;np &amp;gt; 5, n(1 - p) &amp;gt; 5&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
then we have&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{p} \sim N \left( p, \frac{p(1-p)}{n} \right)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The reasoning behind this criterion relates to the [[Discrete Random Variable|binomial distribution]]. It&#039;s not elaborated on much in the lecture, but &amp;lt;math&amp;gt;np&amp;lt;/math&amp;gt; is the mean of the binomial (i.e. the expected number of successes). The criterion essentially ensures that no negative values are plausible in the approximation; with a small mean and a large variance, the left tail of a normal approximation extends into the negatives, but a bernoulli/binomial variable is always nonnegative.&lt;br /&gt;
&lt;br /&gt;
= T-Distribution =&lt;br /&gt;
[[File:T distribution table.png|thumb|T distribution table]]&lt;br /&gt;
&lt;br /&gt;
CLT has several restrictions, the biggest one being a large sample size; the &#039;&#039;&#039;t-distribution&#039;&#039;&#039; addresses the small-sample case.&lt;br /&gt;
&lt;br /&gt;
Since we don&#039;t know the population variance &amp;lt;math&amp;gt;\sigma^2&amp;lt;/math&amp;gt;, we&lt;br /&gt;
have to use the sample variance &amp;lt;math&amp;gt;s&amp;lt;/math&amp;gt; to estimate it. This&lt;br /&gt;
introduces more uncertainty, accounted for by the &#039;&#039;&#039;t-distribution.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
T-distribution is the distribution of sample mean based on population&lt;br /&gt;
mean, sample variance and &#039;&#039;degrees of freedom&#039;&#039; (covered later). It&lt;br /&gt;
looks very similar to normal distribution.&lt;br /&gt;
&lt;br /&gt;
When the sample size &amp;lt;math&amp;gt;n&amp;lt;/math&amp;gt; is small, there is greater&lt;br /&gt;
uncertainty in the estimates, so the t-distribution has heavier tails than the normal and its critical values are larger:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
t_{\alpha/2} &amp;gt; Z_{\alpha/2}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The spread of t-distribution depends on the &#039;&#039;&#039;degrees of freedom&#039;&#039;&#039;,&lt;br /&gt;
which is based on sample size. When looking up the table, round down df.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\upsilon = n - 1&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As the sample size increases, degrees of freedom increase, the spread of&lt;br /&gt;
t-distribution decreases, and t-distribution approaches normal&lt;br /&gt;
distribution.&lt;br /&gt;
&lt;br /&gt;
Based on CLT and normal distribution, we had the confidence interval&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
 \bar{Y} \pm Z_{\alpha / 2} \frac{\sigma}{ \sqrt{n} }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now, based on T-distribution, we have the CI&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
 \bar{Y} \pm t_{\alpha / 2} \frac{s}{ \sqrt{n} }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Find Sample Size ====&lt;br /&gt;
To calculate the sample size needed for a desired margin of error &amp;lt;math&amp;gt;E&amp;lt;/math&amp;gt; and sample variance &amp;lt;math&amp;gt;s^2&amp;lt;/math&amp;gt;, assume &amp;lt;math&amp;gt;\upsilon = \infty&amp;lt;/math&amp;gt; (so &amp;lt;math&amp;gt;t_{\alpha/2}&amp;lt;/math&amp;gt; reduces to &amp;lt;math&amp;gt;Z_{\alpha/2}&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
n = \frac{Z^2_{\alpha/2} s^2}{E^2}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We want to always &#039;&#039;round up&#039;&#039; to stay within the error margin: increasing &#039;&#039;n&#039;&#039; only shrinks the margin, while rounding down could push it past &amp;lt;math&amp;gt;E&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
= Sampling Distribution of Difference =&lt;br /&gt;
&lt;br /&gt;
By linear combination of RVs, sampling distribution of &amp;lt;math&amp;gt;\bar{Y_1} - \bar{Y_2}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
 ( \bar{Y_1} - \bar{Y_2} ) \sim N(\mu_1 - \mu_2, \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2})&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, we do not know the population variances &amp;lt;math&amp;gt;\sigma_1^2, \sigma_2^2&amp;lt;/math&amp;gt;. If the CLT assumptions hold, we estimate them with the sample variances &amp;lt;math&amp;gt;s_1^2, s_2^2&amp;lt;/math&amp;gt;, using the degrees of freedom&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\upsilon = \frac{ \left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2 }{ \frac{(s_1^2 / n_1)^2 }{n_1 - 1} + \frac{(s_2^2 / n_2)^2 }{n_2 - 1} }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Trust me bro (this is the Welch-Satterthwaite approximation). Remember to round down to use the t-table. With this degree of freedom, we can use the sample variances to estimate the distribution.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Sample Statistics]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Discrete_Random_Variable&amp;diff=397</id>
		<title>Discrete Random Variable</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Discrete_Random_Variable&amp;diff=397"/>
		<updated>2024-03-19T16:34:58Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Bernoulli */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Statistics]]&lt;br /&gt;
[[Category:Distribution (Statistics)]]&lt;br /&gt;
A random variable is &#039;&#039;&#039;discrete&#039;&#039;&#039; if the set of values it can take on within an interval is &#039;&#039;finite&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
= PMF and CDF =&lt;br /&gt;
The &#039;&#039;&#039;probability mass function (PMF)&#039;&#039;&#039; describes the probability distribution over a discrete random variable.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;p(x) = P(X = x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;cumulative distribution function (CDF)&#039;&#039;&#039; specifies the probability of an observation being equal to or less than a given value.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;F(x) = P(X \leq x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We usually have tables for these in the case of discrete random variables.&lt;br /&gt;
&lt;br /&gt;
= Statistics =&lt;br /&gt;
Expected value (mean):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\mu = E(X) = \sum x_i P(X = x_i)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Distributions =&lt;br /&gt;
&lt;br /&gt;
== Bernoulli ==&lt;br /&gt;
The &#039;&#039;&#039;bernoulli distribution&#039;&#039;&#039; describes the random variable of an experiment that has two outcomes and is performed once. The outcomes are either &#039;&#039;success&#039;&#039; or &#039;&#039;failure&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
X \sim Bernoulli(p)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== PMF ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
p(1) = p, p(0) = 1 - p&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Statistics ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\mu = p&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\sigma^2_X = p (1 - p)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Binomial ==&lt;br /&gt;
&lt;br /&gt;
Repeating a bernoulli experiment &amp;lt;math&amp;gt;n&amp;lt;/math&amp;gt; times gives a &#039;&#039;&#039;binomial random variable&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Consider an experiment with exactly two possible outcomes, conducted n times independently. The variable of interest &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; is the number of successful trials. The distribution relies on the number of trials and the probability of success.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
X \sim Binomial(n, p)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== PMF ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\binom{n}{x} p^x (1 - p)^{n - x}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
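The PMF maps directly to code; a toy check (the values are chosen arbitrarily):&lt;br /&gt;

```python
# Binomial PMF via math.comb; a toy example, not from the notes.
from math import comb

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

# P(X = 2) for X ~ Binomial(4, 0.5): 6 * 0.0625 = 0.375
print(binom_pmf(2, 4, 0.5))
print(sum(binom_pmf(x, 4, 0.5) for x in range(5)))   # the PMF sums to 1.0
```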
=== Statistics ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\mu = np&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\sigma^2 = np (1 - p)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Poisson ==&lt;br /&gt;
The &#039;&#039;&#039;poisson distribution&#039;&#039;&#039; is used when we know the &#039;&#039;average rate of occurrence for a particular event over a particular time period.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* There must be a &#039;&#039;&#039;fixed interval&#039;&#039;&#039; of time or space&lt;br /&gt;
* Events happen with a &#039;&#039;&#039;known average rate&#039;&#039;&#039; independent of time or the last event.&lt;br /&gt;
* The average rate of occurrence per unit of time/space is the &#039;&#039;&#039;rate parameter&#039;&#039;&#039; &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Poisson distribution approximates binomial distribution when &#039;&#039;n&#039;&#039; is large and &#039;&#039;p&#039;&#039; is small, used to model rare events. Normally it is used to measure the number of events in a unit time, whereas [[Continuous Random Variable#Exponential Distribution|exponential distribution]] models the amount of waiting time until an event.&lt;br /&gt;
&lt;br /&gt;
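A sketch of that approximation for a hypothetical large-&#039;&#039;n&#039;&#039;, small-&#039;&#039;p&#039;&#039; case:&lt;br /&gt;

```python
# Poisson approximation to the binomial: n large, p small, lambda = n * p.
from math import comb, exp, factorial

n, p = 1000, 0.002
lam = n * p                    # rate parameter, here 2.0

for k in range(4):
    binom = comb(n, k) * p**k * (1 - p)**(n - k)
    poisson = exp(-lam) * lam**k / factorial(k)
    print(k, round(binom, 4), round(poisson, 4))   # nearly identical
```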
I&#039;m sleepy I&#039;ll write the details later... zzz...&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Sampling_Distribution&amp;diff=396</id>
		<title>Sampling Distribution</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Sampling_Distribution&amp;diff=396"/>
		<updated>2024-03-19T16:26:15Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Confidence Interval */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Let there be &amp;lt;math&amp;gt;Y_1, Y_2, \ldots, Y_n &amp;lt;/math&amp;gt;, where each&lt;br /&gt;
&amp;lt;math&amp;gt;Y_i&amp;lt;/math&amp;gt; is a random variable from the population.&lt;br /&gt;
&lt;br /&gt;
Every &amp;lt;math&amp;gt;Y_i&amp;lt;/math&amp;gt; has the same mean and distribution, which we don&#039;t know.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
E(Y_i) = \mu, Var(Y_i) = \sigma^2&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We then have the sample mean&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\bar{Y} = \frac{1}{n} \sum_{i = 1}^n Y_i&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The sample mean is expected to be &amp;lt;math&amp;gt;\mu&amp;lt;/math&amp;gt; through a pretty easy&lt;br /&gt;
direct proof.&lt;br /&gt;
&lt;br /&gt;
The variance of the sample mean is &amp;lt;math&amp;gt;\frac{\sigma^2}{n}&amp;lt;/math&amp;gt;, also&lt;br /&gt;
through a pretty easy direct proof.&lt;br /&gt;
&lt;br /&gt;
= Central limit theorem =&lt;br /&gt;
The &#039;&#039;&#039;central limit theorem&#039;&#039;&#039; states that the distribution of the sample mean follows normal distribution.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\bar{Y} \sim N(\mu, \frac{\sigma^2}{n})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As long as at least one of the following two conditions is satisfied, CLT applies, regardless of the population&#039;s distribution.&lt;br /&gt;
&lt;br /&gt;
# The population distribution of &amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt; is normal, &#039;&#039;or&#039;&#039;&lt;br /&gt;
# The sample size is large, &amp;lt;math&amp;gt;n &amp;gt; 30&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By extension, we also have the distribution of the sum.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;S \sim N(\mu_S = n\mu, \sigma_S = \sqrt{n}\,\sigma)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;S = \sum Y_i &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Confidence Interval =&lt;br /&gt;
&#039;&#039;&#039;Estimation&#039;&#039;&#039; is the guess for the unknown parameter. A &#039;&#039;&#039;point estimate&#039;&#039;&#039; is a &amp;quot;best guess&amp;quot; of the population parameter, whereas the &#039;&#039;&#039;confidence interval&#039;&#039;&#039; is the range of reasonable values intended to contain the &#039;&#039;&#039;parameter of interest&#039;&#039;&#039; with a certain &#039;&#039;&#039;degree of confidence&#039;&#039;&#039;, calculated with&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;(point estimate - margin of error, point estimate + margin of error)&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==== Constructing CIs ====&lt;br /&gt;
By CLT, &amp;lt;math&amp;gt;\bar{Y} \sim N(\mu, \frac{\sigma^2}{n} )&amp;lt;/math&amp;gt;. The&lt;br /&gt;
confidence interval is the range of plausible &amp;lt;math&amp;gt;\bar{Y}&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If we define the middle 90% to be plausible, to find the&lt;br /&gt;
confidence interval, simply find the 5th and 95th percentile.&lt;br /&gt;
&lt;br /&gt;
Generalized, if we want a confidence interval covering the middle &amp;lt;math&amp;gt;(1 - \alpha) \cdot 100\%&amp;lt;/math&amp;gt;, the confidence interval is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
 \bar{y} \pm Z_{\alpha / 2} \frac{\sigma}{ \sqrt{n} }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\bar{y}&amp;lt;/math&amp;gt; is the sample mean and &amp;lt;math&amp;gt;Z_{x}&amp;lt;/math&amp;gt; is the z score of the x-th percentile.&lt;br /&gt;
&lt;br /&gt;
= Binomial Normal Approximation =&lt;br /&gt;
The sampling distribution of the mean of &#039;&#039;n&#039;&#039; bernoulli random variables (&amp;lt;math&amp;gt;\hat{p}&amp;lt;/math&amp;gt;) can also be approximated by a normal distribution under the CLT.&lt;br /&gt;
&lt;br /&gt;
Consider [[Discrete Random Variable#Bernoulli|bernoulli]] random variables&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;Y_1, \ldots, Y_n \sim \mathrm{Bernoulli}(p), \quad E(Y_i) = p, \quad Var(Y_i) = p(1 - p)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The proportion of success &amp;lt;math&amp;gt;\hat{p}&amp;lt;/math&amp;gt; is the sum over the count, so its expected value is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;E(\hat{p}) = E \left(\frac{1}{n} \sum Y_i \right) = p&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and the variance of &amp;lt;math&amp;gt;\hat{p}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;Var \left(\frac{1}{n} \sum Y_i \right) = \frac{1}{n^2} Var \left(\sum Y_i \right)  = \frac{p (1 - p)}{n}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
With a large sample size &#039;&#039;n&#039;&#039;, we can approximate this with a normal distribution. Notably, the criterion for a large &#039;&#039;n&#039;&#039; differs from that of a continuous random variable.&lt;br /&gt;
&lt;br /&gt;
= T-Distribution =&lt;br /&gt;
[[File:T distribution table.png|thumb|T distribution table]]&lt;br /&gt;
&lt;br /&gt;
CLT has several restrictions, the biggest one being a large sample size; the &#039;&#039;&#039;t-distribution&#039;&#039;&#039; addresses the small-sample case.&lt;br /&gt;
&lt;br /&gt;
Since we don&#039;t know the population variance &amp;lt;math&amp;gt;\sigma^2&amp;lt;/math&amp;gt;, we&lt;br /&gt;
have to use the sample variance &amp;lt;math&amp;gt;s&amp;lt;/math&amp;gt; to estimate it. This&lt;br /&gt;
introduces more uncertainty, accounted for by the &#039;&#039;&#039;t-distribution.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
T-distribution is the distribution of sample mean based on population&lt;br /&gt;
mean, sample variance and &#039;&#039;degrees of freedom&#039;&#039; (covered later). It&lt;br /&gt;
looks very similar to normal distribution.&lt;br /&gt;
&lt;br /&gt;
When the sample size &amp;lt;math&amp;gt;n&amp;lt;/math&amp;gt; is small, there is greater&lt;br /&gt;
uncertainty in the estimates, so the t-distribution has heavier tails than the normal and its critical values are larger:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
t_{\alpha/2} &amp;gt; Z_{\alpha/2}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The spread of t-distribution depends on the &#039;&#039;&#039;degrees of freedom&#039;&#039;&#039;,&lt;br /&gt;
which is based on sample size. When looking up the table, round down df.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\upsilon = n - 1&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As the sample size increases, degrees of freedom increase, the spread of&lt;br /&gt;
t-distribution decreases, and t-distribution approaches normal&lt;br /&gt;
distribution.&lt;br /&gt;
&lt;br /&gt;
Based on CLT and normal distribution, we had the confidence interval&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
 \bar{Y} \pm Z_{\alpha / 2} \frac{\sigma}{ \sqrt{n} }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now, based on T-distribution, we have the CI&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
 \bar{Y} \pm t_{\alpha / 2} \frac{s}{ \sqrt{n} }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Find Sample Size ====&lt;br /&gt;
To calculate the sample size needed for a desired margin of error &amp;lt;math&amp;gt;E&amp;lt;/math&amp;gt; and sample variance &amp;lt;math&amp;gt;s^2&amp;lt;/math&amp;gt;, assume &amp;lt;math&amp;gt;\upsilon = \infty&amp;lt;/math&amp;gt; (so &amp;lt;math&amp;gt;t_{\alpha/2}&amp;lt;/math&amp;gt; reduces to &amp;lt;math&amp;gt;Z_{\alpha/2}&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
n = \frac{Z^2_{\alpha/2} s^2}{E^2}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We want to always &#039;&#039;round up&#039;&#039; to stay within the error margin: increasing &#039;&#039;n&#039;&#039; only shrinks the margin, while rounding down could push it past &amp;lt;math&amp;gt;E&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
= Sampling Distribution of Difference =&lt;br /&gt;
&lt;br /&gt;
By linear combination of RVs, sampling distribution of &amp;lt;math&amp;gt;\bar{Y_1} - \bar{Y_2}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
 ( \bar{Y_1} - \bar{Y_2} ) \sim N(\mu_1 - \mu_2, \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2})&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, we do not know the population variances &amp;lt;math&amp;gt;\sigma_1^2, \sigma_2^2&amp;lt;/math&amp;gt;. If the CLT assumptions hold, we estimate them with the sample variances &amp;lt;math&amp;gt;s_1^2, s_2^2&amp;lt;/math&amp;gt;, using the degrees of freedom&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\upsilon = \frac{ \left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2 }{ \frac{(s_1^2 / n_1)^2 }{n_1 - 1} + \frac{(s_2^2 / n_2)^2 }{n_2 - 1} }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Trust me bro (this is the Welch-Satterthwaite approximation). Remember to round down to use the t-table. With this degree of freedom, we can use the sample variances to estimate the distribution.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Sample Statistics]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Discrete_Random_Variable&amp;diff=395</id>
		<title>Discrete Random Variable</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Discrete_Random_Variable&amp;diff=395"/>
		<updated>2024-03-19T16:14:18Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Binomial */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Statistics]]&lt;br /&gt;
[[Category:Distribution (Statistics)]]&lt;br /&gt;
A random variable is &#039;&#039;&#039;discrete&#039;&#039;&#039; if the set of values it can take on within an interval is &#039;&#039;finite&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
= PMF and CDF =&lt;br /&gt;
The &#039;&#039;&#039;probability mass function (PMF)&#039;&#039;&#039; describes the probability distribution over a discrete random variable.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;p(x) = P(X = x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;cumulative distribution function (CDF)&#039;&#039;&#039; specifies the probability of an observation being equal to or less than a given value.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;F(x) = P(X \leq x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We usually have tables for these in the case of discrete random variables.&lt;br /&gt;
&lt;br /&gt;
= Statistics =&lt;br /&gt;
Expected value (mean):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\mu = E(X) = \sum x_i P(X = x_i)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Distributions =&lt;br /&gt;
&lt;br /&gt;
== Bernoulli ==&lt;br /&gt;
The &#039;&#039;&#039;bernoulli distribution&#039;&#039;&#039; describes the random variable of an experiment that has two outcomes and is performed once. The outcomes are either &#039;&#039;success&#039;&#039; or &#039;&#039;failure&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
X \sim Bernoulli(p)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== PMF ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
p(1) = p, p(0) = 1 - p&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Statistics ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\mu = p&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\sigma^2_X = p (1 - p)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Binomial ==&lt;br /&gt;
&lt;br /&gt;
Repeating a bernoulli experiment &amp;lt;math&amp;gt;n&amp;lt;/math&amp;gt; times gives a &#039;&#039;&#039;binomial random variable&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Consider an experiment with exactly two possible outcomes, conducted n times independently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
X \sim Binomial(n, p)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
I&#039;m sleepy; I&#039;ll write the details later. It should be on the equation sheet.&lt;br /&gt;
&lt;br /&gt;
=== PMF ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
p(x) = \binom{n}{x} p^x (1 - p)^{n - x}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Statistics ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\mu = np&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\sigma^2 = np (1 - p)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
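A quick numeric check of the formulas above (a minimal Python sketch; the values n = 10, p = 0.3 are made up for illustration):&lt;br /&gt;

```python
import math

def binomial_pmf(x, n, p):
    # P(X = x) for X ~ Binomial(n, p): C(n, x) p^x (1 - p)^(n - x)
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.3
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

total = sum(pmf)                                          # 1.0
mean = sum(x * pmf[x] for x in range(n + 1))              # n*p = 3.0
var = sum(x**2 * pmf[x] for x in range(n + 1)) - mean**2  # n*p*(1-p) = 2.1

print(total, mean, var)
```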
&lt;br /&gt;
== Poisson ==&lt;br /&gt;
The &#039;&#039;&#039;Poisson distribution&#039;&#039;&#039; is used when we know the &#039;&#039;average rate of occurrence for a particular event over a particular time period.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* There must be a &#039;&#039;&#039;fixed interval&#039;&#039;&#039; of time or space.&lt;br /&gt;
* Events happen with a &#039;&#039;&#039;known average rate&#039;&#039;&#039;, independent of time and of the last event.&lt;br /&gt;
* The average rate of occurrence per unit of time/space is the &#039;&#039;&#039;rate parameter&#039;&#039;&#039; &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Poisson distribution approximates binomial distribution when &#039;&#039;n&#039;&#039; is large and &#039;&#039;p&#039;&#039; is small, used to model rare events. Normally it is used to measure the number of events in a unit time, whereas [[Continuous Random Variable#Exponential Distribution|exponential distribution]] models the amount of waiting time until an event.&lt;br /&gt;
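The approximation can be seen numerically (a sketch with assumed values n = 1000, p = 0.002, so lambda = np = 2):&lt;br /&gt;

```python
import math

def binomial_pmf(x, n, p):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, lam):
    # P(X = x) = lam^x e^(-lam) / x!
    return lam**x * math.exp(-lam) / math.factorial(x)

n, p = 1000, 0.002   # large n, small p: a rare event
lam = n * p          # rate parameter, lambda = 2.0

for x in range(6):
    # the two columns agree to several decimal places
    print(x, round(binomial_pmf(x, n, p), 5), round(poisson_pmf(x, lam), 5))
```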
&lt;br /&gt;
=== PMF ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
p(x) = \frac{ \lambda^x e^{-\lambda} }{ x! }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Statistics ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\mu = \lambda&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\sigma^2 = \lambda&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Sampling_Distribution&amp;diff=394</id>
		<title>Sampling Distribution</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Sampling_Distribution&amp;diff=394"/>
		<updated>2024-03-19T16:03:03Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Central limit theorem */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Let there be &amp;lt;math&amp;gt;Y_1, Y_2, \ldots, Y_n &amp;lt;/math&amp;gt;, where each&lt;br /&gt;
&amp;lt;math&amp;gt;Y_i&amp;lt;/math&amp;gt; is a random variable from the population.&lt;br /&gt;
&lt;br /&gt;
Every &amp;lt;math&amp;gt;Y_i&amp;lt;/math&amp;gt; has the same (unknown) mean and distribution.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
E(Y_i) = \mu, Var(Y_i) = \sigma^2&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We then have the sample mean&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\bar{Y} = \frac{1}{n} \sum_{i = 1}^n Y_i&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The sample mean has expected value &amp;lt;math&amp;gt;\mu&amp;lt;/math&amp;gt;, by&lt;br /&gt;
linearity of expectation.&lt;br /&gt;
&lt;br /&gt;
The variance of the sample mean is &amp;lt;math&amp;gt;\frac{\sigma^2}{n}&amp;lt;/math&amp;gt;, which&lt;br /&gt;
follows directly from the independence of the &amp;lt;math&amp;gt;Y_i&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
= Central limit theorem =&lt;br /&gt;
The &#039;&#039;&#039;central limit theorem&#039;&#039;&#039; states that the distribution of the sample mean is approximately normal.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\bar{Y} \sim N(\mu, \frac{\sigma^2}{n})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As long as at least one of the following conditions is satisfied, the CLT applies, regardless of the population&#039;s distribution.&lt;br /&gt;
&lt;br /&gt;
# The population distribution of &amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt; is normal, &#039;&#039;or&#039;&#039;&lt;br /&gt;
# The sample size is large: &amp;lt;math&amp;gt;n &amp;gt; 30&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By extension, we also have the distribution of the sum.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;S \sim N(\mu_S = n\mu, \sigma^2_S = n\sigma^2)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;S = \sum Y_i &amp;lt;/math&amp;gt;&lt;br /&gt;
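A small simulation illustrates the CLT for a decidedly non-normal population (a sketch assuming a Uniform(0, 1) population, so mu = 0.5 and sigma^2 = 1/12):&lt;br /&gt;

```python
import random
import statistics

random.seed(0)
n = 50           # sample size, comfortably larger than 30
trials = 20000   # number of repeated samples

# Population: Uniform(0, 1), so mu = 0.5 and sigma^2 = 1/12
means = [statistics.fmean(random.random() for _ in range(n)) for _ in range(trials)]

print(statistics.fmean(means))     # close to mu = 0.5
print(statistics.variance(means))  # close to sigma^2 / n = (1/12)/50
```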
&lt;br /&gt;
= Confidence Interval =&lt;br /&gt;
&#039;&#039;&#039;Estimation&#039;&#039;&#039; is a guess for the unknown parameter. A &#039;&#039;&#039;point estimate&#039;&#039;&#039; is a &amp;quot;best guess&amp;quot; of the population parameter, whereas the &#039;&#039;&#039;confidence interval&#039;&#039;&#039; is a range of reasonable values intended to contain the &#039;&#039;&#039;parameter of interest&#039;&#039;&#039; with a certain &#039;&#039;&#039;degree of confidence&#039;&#039;&#039;. It is calculated as&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;(point estimate - margin of error, point estimate + margin of error)&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==== Constructing CIs ====&lt;br /&gt;
By CLT, &amp;lt;math&amp;gt;\bar{Y} \sim N(\mu, \frac{\sigma^2}{n} )&amp;lt;/math&amp;gt;. The&lt;br /&gt;
confidence interval is the range of plausible &amp;lt;math&amp;gt;\bar{Y}&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If we define the middle 90% to be plausible, to find the&lt;br /&gt;
confidence interval, simply find the 5th and 95th percentile.&lt;br /&gt;
&lt;br /&gt;
In general, for a confidence interval covering the middle &amp;lt;math&amp;gt;(1 - \alpha) \times 100\%&amp;lt;/math&amp;gt;, the confidence interval is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
 \bar{y} \pm Z_{\alpha / 2} \frac{\sigma}{ \sqrt{n} }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\bar{y}&amp;lt;/math&amp;gt; is the sample mean and &amp;lt;math&amp;gt;Z_{x}&amp;lt;/math&amp;gt; is the z score of the x-th percentile.&lt;br /&gt;
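Putting the formula into code (a sketch; the sample summary numbers are hypothetical, and Z_{alpha/2} is computed from the standard normal inverse CDF):&lt;br /&gt;

```python
import math
from statistics import NormalDist

# Hypothetical sample summary (assumed numbers, for illustration only)
ybar = 12.6    # sample mean
sigma = 2.0    # known population standard deviation
n = 40         # sample size
alpha = 0.10   # for a 90% confidence interval

z = NormalDist().inv_cdf(1 - alpha / 2)   # Z_{alpha/2}, about 1.645
margin = z * sigma / math.sqrt(n)
print(ybar - margin, ybar + margin)       # the 90% confidence interval
```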
&lt;br /&gt;
= T-Distribution =&lt;br /&gt;
[[File:T distribution table.png|thumb|T distribution table]]&lt;br /&gt;
&lt;br /&gt;
CLT has several restrictions, the biggest one being the need for a&lt;br /&gt;
large sample size. The &#039;&#039;&#039;t-distribution&#039;&#039;&#039; addresses the small-sample case.&lt;br /&gt;
&lt;br /&gt;
Since we don&#039;t know the population variance &amp;lt;math&amp;gt;\sigma^2&amp;lt;/math&amp;gt;, we&lt;br /&gt;
have to use the sample variance &amp;lt;math&amp;gt;s&amp;lt;/math&amp;gt; to estimate it. This&lt;br /&gt;
introduces more uncertainty, accounted for by the &#039;&#039;&#039;t-distribution.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The t-distribution is the distribution of the sample mean based on the population&lt;br /&gt;
mean, the sample variance, and the &#039;&#039;degrees of freedom&#039;&#039; (covered later). It&lt;br /&gt;
looks very similar to the normal distribution.&lt;br /&gt;
&lt;br /&gt;
When the sample size &amp;lt;math&amp;gt;n&amp;lt;/math&amp;gt; is small, there is greater&lt;br /&gt;
uncertainty in the estimates, so the t-distribution has heavier tails:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
t_{\alpha/2} &amp;gt; Z_{\alpha/2}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The spread of t-distribution depends on the &#039;&#039;&#039;degrees of freedom&#039;&#039;&#039;,&lt;br /&gt;
which is based on sample size. When looking up the table, round down df.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\upsilon = n - 1&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As the sample size increases, degrees of freedom increase, the spread of&lt;br /&gt;
t-distribution decreases, and t-distribution approaches normal&lt;br /&gt;
distribution.&lt;br /&gt;
&lt;br /&gt;
Based on CLT and normal distribution, we had the confidence interval&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
 \bar{Y} \pm Z_{\alpha / 2} \frac{\sigma}{ \sqrt{n} }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now, based on T-distribution, we have the CI&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
 \bar{Y} \pm t_{\alpha / 2} \frac{s}{ \sqrt{n} }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Find Sample Size ====&lt;br /&gt;
To calculate sample size needed depending on desired&lt;br /&gt;
error margin and sample variance by assuming that &amp;lt;math&amp;gt;\upsilon = \infty&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
n = \frac{Z^2_{\alpha/2} s^2}{E^2}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We always &#039;&#039;round up&#039;&#039; to stay within the error margin: rounding down&lt;br /&gt;
would give a margin of error larger than &amp;lt;math&amp;gt;E&amp;lt;/math&amp;gt;.&lt;br /&gt;
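A sketch of the sample-size calculation (the planning values s^2 = 25, E = 1.5 are hypothetical):&lt;br /&gt;

```python
import math
from statistics import NormalDist

# Hypothetical planning values (assumptions for illustration)
s2 = 25.0     # sample variance from a pilot study
E = 1.5       # desired margin of error
alpha = 0.05  # for 95% confidence

z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96
n_exact = z**2 * s2 / E**2
n = math.ceil(n_exact)   # round UP so the margin of error stays within E
print(n_exact, n)
```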
&lt;br /&gt;
= Sampling Distribution of Difference =&lt;br /&gt;
&lt;br /&gt;
By linear combination of RVs, the sampling distribution of &amp;lt;math&amp;gt;\bar{Y_1} - \bar{Y_2}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
 ( \bar{Y_1} - \bar{Y_2} ) \sim N(\mu_1 - \mu_2, \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2})&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, we do not know the population variances &amp;lt;math&amp;gt;\sigma_1^2&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\sigma_2^2&amp;lt;/math&amp;gt;. If the CLT assumptions hold, we can use the t-distribution with degrees of freedom&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\upsilon = \frac{ \left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2 }{ \frac{(s_1^2 / n_1)^2 }{n_1 - 1} + \frac{(s_2^2 / n_2)^2 }{n_2 - 1} }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This is the Welch-Satterthwaite approximation; its derivation is beyond the scope of these notes. Remember to round down when using the t-table. With this degree of freedom, we can use the sample variances to estimate the distribution.&lt;br /&gt;
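The degrees-of-freedom formula above in code (the sample variances and sizes are hypothetical):&lt;br /&gt;

```python
# Degrees of freedom for the two-sample case, per the formula above
# (hypothetical sample variances and sizes, for illustration)
s1_sq, n1 = 4.0, 15
s2_sq, n2 = 9.0, 12

num = (s1_sq / n1 + s2_sq / n2) ** 2
den = (s1_sq / n1) ** 2 / (n1 - 1) + (s2_sq / n2) ** 2 / (n2 - 1)
df = num / den
print(df, int(df))   # round down before looking up the t-table
```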
&lt;br /&gt;
&lt;br /&gt;
[[Category:Sample Statistics]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Continuous_Random_Variable&amp;diff=393</id>
		<title>Continuous Random Variable</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Continuous_Random_Variable&amp;diff=393"/>
		<updated>2024-03-19T07:50:09Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Normal Random Variable */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Distribution (Statistics)]]&lt;br /&gt;
Continuous random variables can take infinitely many values in any&lt;br /&gt;
given interval. While the concepts are similar, the approach to analysis is very&lt;br /&gt;
different from that of discrete variables:&lt;br /&gt;
* Summation becomes integration&lt;br /&gt;
* Probability becomes area under a curve&lt;br /&gt;
&lt;br /&gt;
= Probability Density Function &amp;lt;math&amp;gt; f(x) &amp;lt;/math&amp;gt; =&lt;br /&gt;
&lt;br /&gt;
The probability density function (pdf) maps a continuous variable to a&lt;br /&gt;
probability density.&lt;br /&gt;
&lt;br /&gt;
As the name &amp;quot;density&amp;quot; suggests, the area under the pdf curve over a&lt;br /&gt;
range is the probability of the variable being in that range.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
P(c \leq x \leq d) = \int_c^d f(x) dx = F(d) - F(c)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The total area under the curve must be &amp;lt;math&amp;gt; 1 &amp;lt;/math&amp;gt;, as the chance of&lt;br /&gt;
some event happening is 100% if the range includes all possible events.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\int_{-\infty}^\infty f(x) dx = 1&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There is no area under a single point&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
P(X = a) = 0&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Mean and Variance =&lt;br /&gt;
&lt;br /&gt;
The mean and variance calculations are pretty much the same as that of&lt;br /&gt;
[[Discrete Random Variable|discrete random variables]], except the summations are swapped out for&lt;br /&gt;
integrals.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
E(X) = \mu_X = \int_{-\infty}^\infty x f(x) dx&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Var(X) = \sigma^2_X = \int_{-\infty}^\infty (x - \mu_X)^2 f(x) dx&lt;br /&gt;
= \int_{-\infty}^\infty x^2 f(x) dx - \mu_X^2&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Median and Percentile =&lt;br /&gt;
&lt;br /&gt;
The a-th percentile is the point at which a percent of the area under the&lt;br /&gt;
curve lies to its left. You want &amp;lt;math&amp;gt; P(X \leq x) &amp;lt;/math&amp;gt; to be a%,&lt;br /&gt;
calculated as described above.&lt;br /&gt;
&lt;br /&gt;
By the same logic, the quartiles are at 25%, 50%, and 75% respectively.&lt;br /&gt;
&lt;br /&gt;
= Uniform Distribution &amp;lt;math&amp;gt; X \sim Uniform(a, b) &amp;lt;/math&amp;gt; =&lt;br /&gt;
A uniform random variable is described by two parameters: &amp;lt;math&amp;gt; a &amp;lt;/math&amp;gt;&lt;br /&gt;
is the minimum, and &amp;lt;math&amp;gt; b &amp;lt;/math&amp;gt; is the maximum. It has a rectangular&lt;br /&gt;
distribution, where every point has the same probability density.&lt;br /&gt;
&lt;br /&gt;
==== PDF ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
f(x) = \begin{cases}&lt;br /&gt;
    \frac{ 1 }{ b - a } &amp;amp; a \leq x \leq b \\&lt;br /&gt;
    0 &amp;amp; \text{otherwise}&lt;br /&gt;
\end{cases}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== CDF ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
F(x) = \begin{cases}&lt;br /&gt;
    0 &amp;amp; x &amp;lt; a \\&lt;br /&gt;
    \frac{ x - a }{ b - a } &amp;amp; a \leq x \leq b \\&lt;br /&gt;
    1 &amp;amp; x &amp;gt; b&lt;br /&gt;
\end{cases}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Mean ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\mu_X = \frac{ a + b }{ 2 }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Variance ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\sigma^2 = \frac{ 1 }{ 12 } (b - a)^2&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
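The mean and variance formulas can be checked against the integral definitions with a midpoint Riemann sum (a sketch with assumed endpoints a = 2, b = 10):&lt;br /&gt;

```python
# Numerical check of the Uniform(a, b) mean and variance via a midpoint
# Riemann sum over the pdf f(x) = 1/(b - a) on [a, b]
a, b = 2.0, 10.0
N = 100000
dx = (b - a) / N
f = 1.0 / (b - a)   # constant probability density on [a, b]

xs = [a + (i + 0.5) * dx for i in range(N)]     # midpoints
mean = sum(x * f * dx for x in xs)              # (a + b)/2 = 6.0
var = sum(x**2 * f * dx for x in xs) - mean**2  # (b - a)^2/12 = 5.333...
print(mean, var)
```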
&lt;br /&gt;
= Exponential Distribution =&lt;br /&gt;
&lt;br /&gt;
The exponential distribution models events that occur&lt;br /&gt;
* Continuously&lt;br /&gt;
* Independently&lt;br /&gt;
* At a constant average rate&lt;br /&gt;
&lt;br /&gt;
It takes in one parameter: &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt;, the &#039;&#039;&#039;rate parameter.&#039;&#039;&#039; Defined by the mean below, it is the &#039;&#039;average rate per unit time/space.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The exponential distribution has the &#039;&#039;&#039;memoryless property&#039;&#039;&#039;: the&lt;br /&gt;
probability of waiting for an event does not change no matter how much time has&lt;br /&gt;
already passed.&lt;br /&gt;
&lt;br /&gt;
In probability terms, the probability that we must wait an&lt;br /&gt;
additional &amp;lt;math&amp;gt;t&amp;lt;/math&amp;gt; units, given that we have waited &amp;lt;math&amp;gt;s&amp;lt;/math&amp;gt;&lt;br /&gt;
units, is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
P(T &amp;gt; t + s | T &amp;gt; s) = P(T &amp;gt; t) = e^{-\lambda t}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notably, it models time until some event has happened, in contrast to [[Discrete Random Variable#Poisson|poisson distribution]], which measures the number of events in a unit time.&lt;br /&gt;
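The memoryless property can be verified numerically from the survival function (a sketch; lambda, s, and t are arbitrary assumed values):&lt;br /&gt;

```python
import math

lam = 0.5   # rate parameter

def survival(t):
    # P(T greater than t) = e^(-lam * t) for the exponential distribution
    return math.exp(-lam * t)

s, t = 3.0, 2.0
conditional = survival(t + s) / survival(s)   # P(T greater than t+s, given T greater than s)
print(conditional, survival(t))               # the two values agree: memoryless
```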
&lt;br /&gt;
==== PDF ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
f(x) = \begin{cases}&lt;br /&gt;
    \lambda e ^{ - \lambda x } &amp;amp; x \geq 0 \\&lt;br /&gt;
    0 &amp;amp; \text{otherwise}&lt;br /&gt;
\end{cases}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== CDF ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
F(x) = 1 - e^{- \lambda x}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Mean ====&lt;br /&gt;
&lt;br /&gt;
Derived by integration by parts:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\mu_X = \frac{1}{\lambda}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Variance ====&lt;br /&gt;
&lt;br /&gt;
Derived by integration by parts:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\sigma^2 = \frac{ 1 }{ \lambda^2 }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Exponential and Poisson ==&lt;br /&gt;
&lt;br /&gt;
Exponential distribution and poisson RVs are related:&lt;br /&gt;
* &amp;lt;math&amp;gt;X \sim Poisson(\lambda)&amp;lt;/math&amp;gt;: the number of events in a unit time&lt;br /&gt;
* &amp;lt;math&amp;gt;X \sim Exp(\lambda)&amp;lt;/math&amp;gt;: waiting time until an event&lt;br /&gt;
&lt;br /&gt;
= Normal Random Variable =&lt;br /&gt;
[[File:Z score table.png|thumb|Z score table]]&lt;br /&gt;
&#039;&#039;&#039;Normal random variables&#039;&#039;&#039; (aka Gaussian RVs) are the most widely used continuous RVs in&lt;br /&gt;
statistics, characterizing many natural phenomena. The normal distribution is the famous&lt;br /&gt;
bell curve.&lt;br /&gt;
&lt;br /&gt;
They are characterized by two parameters: mean and variance.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Y \sim N(\mu_Y, \sigma^2_Y)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Normal random variables are perfectly symmetric about the mean.&lt;br /&gt;
&lt;br /&gt;
==== Standardizing Normal Distribution ====&lt;br /&gt;
&lt;br /&gt;
Standardizing data means making its mean 0 and its standard&lt;br /&gt;
deviation 1. We do this by subtracting the mean and dividing by the&lt;br /&gt;
standard deviation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Z = \frac{Y - \mu}{\sigma}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Intuitively, this moves the dataset and changes the scale. We do this to&lt;br /&gt;
simplify probability calculations.&lt;br /&gt;
&lt;br /&gt;
==== Z score ====&lt;br /&gt;
&lt;br /&gt;
The z-score is the number of standard deviations above or below the&lt;br /&gt;
mean. A positive z score is above, and a negative is below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
z = \frac{y - \mu}{\sigma}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
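For example (a sketch; the population parameters mu = 100, sigma = 15 are hypothetical), the z-score and its lower-tail probability:&lt;br /&gt;

```python
from statistics import NormalDist

# Hypothetical population (assumed parameters, for illustration)
mu, sigma = 100.0, 15.0
y = 130.0

z = (y - mu) / sigma      # 2.0 standard deviations above the mean
p = NormalDist().cdf(z)   # lower-tail probability, as in the z table
print(z, round(p, 4))
```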
&lt;br /&gt;
==== PDF ====&lt;br /&gt;
&lt;br /&gt;
The pdf for normal random variable is the following.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
f(y) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{1}{2} \frac{(y -&lt;br /&gt;
\mu)^2}{\sigma^2}}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
After standardizing the normal RV, we can use the following instead.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
f(y) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{1}{2} z^2}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;z&amp;lt;/math&amp;gt; is the z-score covered in the last section. The standard normal pdf itself is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
f(z) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} z^2}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Quantiles ====&lt;br /&gt;
&lt;br /&gt;
Quantiles are points dividing the range of a probability&lt;br /&gt;
distribution. Quartiles and percentiles are types of quantiles.&lt;br /&gt;
&lt;br /&gt;
For normal distributions, there are special points (critical values)&lt;br /&gt;
that correspond to particular probabilities: &amp;lt;math&amp;gt;z_a&amp;lt;/math&amp;gt;, where&lt;br /&gt;
&amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt; is the probability in the right tail.&lt;br /&gt;
&lt;br /&gt;
==== Standard Normal Table ====&lt;br /&gt;
&lt;br /&gt;
The standard normal table gives lower-tail probabilities based on the&lt;br /&gt;
standard normal distribution (i.e. area under the curve left of the&lt;br /&gt;
point).&lt;br /&gt;
&lt;br /&gt;
==== Linear Combinations of Independent Normal RV ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
W = aX + bY&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
W \sim N(a\mu_X + b\mu_Y, a^2 \sigma^2_X + b^2 \sigma^2_Y)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
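The parameter formulas can be checked in code (a sketch; the means, variances, and coefficients are hypothetical), including a simulation sanity check:&lt;br /&gt;

```python
import math
import random
import statistics

# X ~ N(5, 4) and Y ~ N(2, 9), independent (hypothetical parameters); W = aX + bY
mu_x, var_x = 5.0, 4.0
mu_y, var_y = 2.0, 9.0
a, b = 3.0, -1.0

mu_w = a * mu_x + b * mu_y            # 3*5 - 1*2 = 13.0
var_w = a**2 * var_x + b**2 * var_y   # 9*4 + 1*9 = 45.0

# Simulation sanity check
random.seed(1)
w = [a * random.gauss(mu_x, math.sqrt(var_x)) + b * random.gauss(mu_y, math.sqrt(var_y))
     for _ in range(50000)]
print(mu_w, var_w)
print(statistics.fmean(w), statistics.variance(w))   # close to the values above
```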
&lt;br /&gt;
= Other distributions =&lt;br /&gt;
&lt;br /&gt;
[[Two Numerical RVs]]&lt;br /&gt;
&lt;br /&gt;
[[Category:Statistics]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Continuous_Random_Variable&amp;diff=392</id>
		<title>Continuous Random Variable</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Continuous_Random_Variable&amp;diff=392"/>
		<updated>2024-03-19T07:49:17Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Exponential Distribution */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Distribution (Statistics)]]&lt;br /&gt;
Continuous random variables can take infinitely many values in any&lt;br /&gt;
given interval. While the concepts are similar, the approach to analysis is very&lt;br /&gt;
different from that of discrete variables:&lt;br /&gt;
* Summation becomes integration&lt;br /&gt;
* Probability becomes area under a curve&lt;br /&gt;
&lt;br /&gt;
= Probability Density Function &amp;lt;math&amp;gt; f(x) &amp;lt;/math&amp;gt; =&lt;br /&gt;
&lt;br /&gt;
The probability density function (pdf) maps a continuous variable to a&lt;br /&gt;
probability density.&lt;br /&gt;
&lt;br /&gt;
As the name &amp;quot;density&amp;quot; suggests, the area under the pdf curve over a&lt;br /&gt;
range is the probability of the variable being in that range.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
P(c \leq x \leq d) = \int_c^d f(x) dx = F(d) - F(c)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The total area under the curve must be &amp;lt;math&amp;gt; 1 &amp;lt;/math&amp;gt;, as the chance of&lt;br /&gt;
some event happening is 100% if the range includes all possible events.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\int_{-\infty}^\infty f(x) dx = 1&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There is no area under a single point&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
P(X = a) = 0&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Mean and Variance =&lt;br /&gt;
&lt;br /&gt;
The mean and variance calculations are pretty much the same as that of&lt;br /&gt;
[[Discrete Random Variable|discrete random variables]], except the summations are swapped out for&lt;br /&gt;
integrals.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
E(X) = \mu_X = \int_{-\infty}^\infty x f(x) dx&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Var(X) = \sigma^2_X = \int_{-\infty}^\infty (x - \mu_X)^2 f(x) dx&lt;br /&gt;
= \int_{-\infty}^\infty x^2 f(x) dx - \mu_X^2&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Median and Percentile =&lt;br /&gt;
&lt;br /&gt;
The a-th percentile is the point at which a percent of the area under the&lt;br /&gt;
curve lies to its left. You want &amp;lt;math&amp;gt; P(X \leq x) &amp;lt;/math&amp;gt; to be a%,&lt;br /&gt;
calculated as described above.&lt;br /&gt;
&lt;br /&gt;
By the same logic, the quartiles are at 25%, 50%, and 75% respectively.&lt;br /&gt;
&lt;br /&gt;
= Uniform Distribution &amp;lt;math&amp;gt; X \sim Uniform(a, b) &amp;lt;/math&amp;gt; =&lt;br /&gt;
A uniform random variable is described by two parameters: &amp;lt;math&amp;gt; a &amp;lt;/math&amp;gt;&lt;br /&gt;
is the minimum, and &amp;lt;math&amp;gt; b &amp;lt;/math&amp;gt; is the maximum. It has a rectangular&lt;br /&gt;
distribution, where every point has the same probability density.&lt;br /&gt;
&lt;br /&gt;
==== PDF ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
f(x) = \begin{cases}&lt;br /&gt;
    \frac{ 1 }{ b - a } &amp;amp; a \leq x \leq b \\&lt;br /&gt;
    0 &amp;amp; \text{otherwise}&lt;br /&gt;
\end{cases}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== CDF ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
F(x) = \begin{cases}&lt;br /&gt;
    0 &amp;amp; x &amp;lt; a \\&lt;br /&gt;
    \frac{ x - a }{ b - a } &amp;amp; a \leq x \leq b \\&lt;br /&gt;
    1 &amp;amp; x &amp;gt; b&lt;br /&gt;
\end{cases}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Mean ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\mu_X = \frac{ a + b }{ 2 }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Variance ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\sigma^2 = \frac{ 1 }{ 12 } (b - a)^2&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Exponential Distribution =&lt;br /&gt;
&lt;br /&gt;
The exponential distribution models events that occur&lt;br /&gt;
* Continuously&lt;br /&gt;
* Independently&lt;br /&gt;
* At a constant average rate&lt;br /&gt;
&lt;br /&gt;
It takes in one parameter: &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt;, the &#039;&#039;&#039;rate parameter.&#039;&#039;&#039; Defined by the mean below, it is the &#039;&#039;average rate per unit time/space.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The exponential distribution has the &#039;&#039;&#039;memoryless property&#039;&#039;&#039;: the&lt;br /&gt;
probability of waiting for an event does not change no matter how much time has&lt;br /&gt;
already passed.&lt;br /&gt;
&lt;br /&gt;
In probability terms, the probability that we must wait an&lt;br /&gt;
additional &amp;lt;math&amp;gt;t&amp;lt;/math&amp;gt; units, given that we have waited &amp;lt;math&amp;gt;s&amp;lt;/math&amp;gt;&lt;br /&gt;
units, is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
P(T &amp;gt; t + s | T &amp;gt; s) = P(T &amp;gt; t) = e^{-\lambda t}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notably, it models time until some event has happened, in contrast to [[Discrete Random Variable#Poisson|poisson distribution]], which measures the number of events in a unit time.&lt;br /&gt;
&lt;br /&gt;
==== PDF ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
f(x) = \begin{cases}&lt;br /&gt;
    \lambda e ^{ - \lambda x } &amp;amp; x \geq 0 \\&lt;br /&gt;
    0 &amp;amp; \text{otherwise}&lt;br /&gt;
\end{cases}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== CDF ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
F(x) = 1 - e^{- \lambda x}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Mean ====&lt;br /&gt;
&lt;br /&gt;
Derived by integration by parts:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\mu_X = \frac{1}{\lambda}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Variance ====&lt;br /&gt;
&lt;br /&gt;
Derived by integration by parts:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\sigma^2 = \frac{ 1 }{ \lambda^2 }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Exponential and Poisson ==&lt;br /&gt;
&lt;br /&gt;
Exponential distribution and poisson RVs are related:&lt;br /&gt;
* &amp;lt;math&amp;gt;X \sim Poisson(\lambda)&amp;lt;/math&amp;gt;: the number of events in a unit time&lt;br /&gt;
* &amp;lt;math&amp;gt;X \sim Exp(\lambda)&amp;lt;/math&amp;gt;: waiting time until an event&lt;br /&gt;
&lt;br /&gt;
= Normal Random Variable =&lt;br /&gt;
[[File:Z score table.png|thumb|Z score table]]&lt;br /&gt;
Normal random variables are the most widely used continuous RVs in&lt;br /&gt;
statistics, characterizing many natural phenomena. The normal distribution is the famous&lt;br /&gt;
bell curve.&lt;br /&gt;
&lt;br /&gt;
They are characterized by two parameters: mean and variance.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Y \sim N(\mu_Y, \sigma^2_Y)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Normal random variables are perfectly symmetric about the mean.&lt;br /&gt;
&lt;br /&gt;
==== Standardizing Normal Distribution ====&lt;br /&gt;
&lt;br /&gt;
Standardizing data means making its mean 0 and its standard&lt;br /&gt;
deviation 1. We do this by subtracting the mean and dividing by the&lt;br /&gt;
standard deviation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Z = \frac{Y - \mu}{\sigma}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Intuitively, this moves the dataset and changes the scale. We do this to&lt;br /&gt;
simplify probability calculations.&lt;br /&gt;
&lt;br /&gt;
==== Z score ====&lt;br /&gt;
&lt;br /&gt;
The z-score is the number of standard deviations above or below the&lt;br /&gt;
mean. A positive z score is above, and a negative is below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
z = \frac{y - \mu}{\sigma}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== PDF ====&lt;br /&gt;
&lt;br /&gt;
The pdf for normal random variable is the following.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
f(y) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{1}{2} \frac{(y -&lt;br /&gt;
\mu)^2}{\sigma^2}}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
After standardizing the normal RV, we can use the following instead.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
f(y) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{1}{2} z^2}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;z&amp;lt;/math&amp;gt; is the z-score covered in the last section. The standard normal pdf itself is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
f(z) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} z^2}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Quantiles ====&lt;br /&gt;
&lt;br /&gt;
Quantiles are points dividing the range of a probability&lt;br /&gt;
distribution. Quartiles and percentiles are types of quantiles.&lt;br /&gt;
&lt;br /&gt;
For normal distributions, there are special points (critical values)&lt;br /&gt;
that correspond to particular probabilities: &amp;lt;math&amp;gt;z_a&amp;lt;/math&amp;gt;, where&lt;br /&gt;
&amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt; is the probability in the right tail.&lt;br /&gt;
&lt;br /&gt;
==== Standard Normal Table ====&lt;br /&gt;
&lt;br /&gt;
The standard normal table gives lower-tail probabilities based on the&lt;br /&gt;
standard normal distribution (i.e. area under the curve left of the&lt;br /&gt;
point).&lt;br /&gt;
&lt;br /&gt;
==== Linear Combinations of Independent Normal RV ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
W = aX + bY&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
W \sim N(a\mu_X + b\mu_Y, a^2 \sigma^2_X + b^2 \sigma^2_Y)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Other distributions =&lt;br /&gt;
&lt;br /&gt;
[[Two Numerical RVs]]&lt;br /&gt;
&lt;br /&gt;
[[Category:Statistics]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Discrete_Random_Variable&amp;diff=391</id>
		<title>Discrete Random Variable</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Discrete_Random_Variable&amp;diff=391"/>
		<updated>2024-03-19T07:46:34Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Poisson */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Statistics]]&lt;br /&gt;
[[Category:Distribution (Statistics)]]&lt;br /&gt;
A random variable is &#039;&#039;&#039;discrete&#039;&#039;&#039; if the values it can take on within an interval is &#039;&#039;finite&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
= PMF and CDF =&lt;br /&gt;
The &#039;&#039;&#039;probability mass function (PMF)&#039;&#039;&#039; describes the probability distribution over a discrete random variable.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;p(x) = P(X = x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;cumulative distribution function (CDF)&#039;&#039;&#039; specifies the probability of an observation being equal to or less than a given value.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;F(x) = P(X \leq x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We usually have tables for these in the case of discrete random variables.&lt;br /&gt;
&lt;br /&gt;
= Statistics =&lt;br /&gt;
Expected value (mean):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\mu = E(X) = \sum x_i P(X = x_i)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Distributions =&lt;br /&gt;
&lt;br /&gt;
== Bernoulli ==&lt;br /&gt;
The &#039;&#039;&#039;Bernoulli distribution&#039;&#039;&#039; describes a random variable for an experiment with exactly two outcomes, &#039;&#039;success&#039;&#039; and &#039;&#039;failure&#039;&#039;, performed once.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
X \sim Bernoulli(p)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== PMF ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
p(1) = p, p(0) = 1 - p&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Statistics ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\mu = p&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\sigma^2_X = p (1 - p)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Binomial ==&lt;br /&gt;
&lt;br /&gt;
Repeating a Bernoulli experiment &amp;lt;math&amp;gt;n&amp;lt;/math&amp;gt; times gives a &#039;&#039;&#039;binomial random variable&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Consider an experiment with exactly two possible outcomes, conducted n times independently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
X \sim Binomial(n, p)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
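The details are deferred to the equation sheet; as a hedged sketch, the binomial PMF is P(X = k) = C(n, k) p^k (1 - p)^(n - k), computable with the standard library (n = 10, p = 0.5 are arbitrary example values):&lt;br /&gt;

```python
import math

# Sketch of the binomial PMF: P(X = k) = C(n, k) * p**k * (1 - p)**(n - k).
def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.5
# The PMF sums to 1 over k = 0..n
total = sum(binom_pmf(k, n, p) for k in range(n + 1))
print(round(total, 6))        # prints 1.0
print(binom_pmf(5, 10, 0.5))  # prints 0.24609375
```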
I&#039;m sleepy; I&#039;ll write the details later. It should be on the equation sheet.&lt;br /&gt;
&lt;br /&gt;
= Poisson =&lt;br /&gt;
The &#039;&#039;&#039;Poisson distribution&#039;&#039;&#039; is used when we know the &#039;&#039;average rate of occurrence for a particular event over a particular time period&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
* There must be a &#039;&#039;&#039;fixed interval&#039;&#039;&#039; of time or space.&lt;br /&gt;
* Events happen with a &#039;&#039;&#039;known average rate&#039;&#039;&#039; independent of time or the last event.&lt;br /&gt;
* The average rate of occurrence per unit of time/space is the &#039;&#039;&#039;rate parameter&#039;&#039;&#039; &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The Poisson distribution approximates the binomial distribution when &#039;&#039;n&#039;&#039; is large and &#039;&#039;p&#039;&#039; is small, so it is often used to model rare events. Normally it measures the number of events in a unit time, whereas the [[Continuous Random Variable#Exponential Distribution|exponential distribution]] models the amount of waiting time until an event.&lt;br /&gt;
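A minimal sketch of the approximation (n = 1000 and p = 0.002 are arbitrary values chosen to make n large and p small):&lt;br /&gt;

```python
import math

# Sketch: Poisson(lam) approximates Binomial(n, p) with lam = n * p,
# when n is large and p is small.
def poisson_pmf(k, lam):
    return lam**k * math.exp(-lam) / math.factorial(k)

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 1000, 0.002  # large n, small p
lam = n * p         # rate parameter lambda = 2.0
for k in range(4):
    # the two PMFs agree to about three decimal places here
    print(k, round(binom_pmf(k, n, p), 4), round(poisson_pmf(k, lam), 4))
```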
&lt;br /&gt;
I&#039;m sleepy; I&#039;ll write the details later... zzz...&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Continuous_Random_Variable&amp;diff=390</id>
		<title>Continuous Random Variable</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Continuous_Random_Variable&amp;diff=390"/>
		<updated>2024-03-19T07:46:12Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Probability Distribution Function                         f         (         x         )                 {\displaystyle f(x)}     */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Distribution (Statistics)]]&lt;br /&gt;
Continuous random variables have an infinite number of values in any&lt;br /&gt;
given interval. While similar, the approach to analysis is very&lt;br /&gt;
different from that for discrete variables:&lt;br /&gt;
* Summation becomes integration&lt;br /&gt;
* Probability becomes area under a curve&lt;br /&gt;
&lt;br /&gt;
= Probability Density Function &amp;lt;math&amp;gt; f(x) &amp;lt;/math&amp;gt; =&lt;br /&gt;
&lt;br /&gt;
The probability density function (pdf) maps a continuous variable to a&lt;br /&gt;
probability density.&lt;br /&gt;
&lt;br /&gt;
As the name &amp;quot;density&amp;quot; suggests, the area under the pdf curve over a&lt;br /&gt;
range is the probability of the variable falling in that range.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
P(c \leq x \leq d) = \int_c^d f(x) dx = F(d) - F(c)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Total area under the curve must be &amp;lt;math&amp;gt; 1 &amp;lt;/math&amp;gt;, as the chance of&lt;br /&gt;
some event happening is 100% if the range includes all possible events.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\int_{-\infty}^\infty f(x) dx = 1&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There is no area under a single point:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
P(X = a) = 0&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
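These area facts can be checked numerically; the example pdf f(x) = 2x on [0, 1] below is arbitrary, not from these notes:&lt;br /&gt;

```python
# Sketch: the area under a pdf over its support must be 1, and the
# probability of an interval is the integral of f over that interval.
def f(x):
    return 2 * x  # example pdf on [0, 1]; zero elsewhere

def integrate(g, a, b, n=100000):
    # midpoint-rule numerical integration of g over [a, b]
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

print(round(integrate(f, 0, 1), 6))       # prints 1.0 (total area)
print(round(integrate(f, 0.25, 0.5), 6))  # prints 0.1875, i.e. F(0.5) - F(0.25)
```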
&lt;br /&gt;
= Mean and Variance =&lt;br /&gt;
&lt;br /&gt;
The mean and variance calculations are pretty much the same as that of&lt;br /&gt;
[[Discrete Random Variable|discrete random variables]], except the summations are swapped out for&lt;br /&gt;
integrals.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
E(X) = \mu_X = \int_{-\infty}^\infty x f(x) dx&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Var(X) = \sigma^2_X = \int_{-\infty}^\infty (x - \mu_X)^2 f(x) dx&lt;br /&gt;
= \int_{-\infty}^\infty x^2 f(x) dx - \mu_X^2&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Median and Percentile =&lt;br /&gt;
&lt;br /&gt;
The a-th percentile is the point at which a percent of the area under&lt;br /&gt;
the curve lies to its left. You want &amp;lt;math&amp;gt; P(X \leq x) &amp;lt;/math&amp;gt; to be a%;&lt;br /&gt;
the calculation is described above.&lt;br /&gt;
&lt;br /&gt;
By the same logic, the quartiles are at 25%, 50%, and 75% respectively.&lt;br /&gt;
&lt;br /&gt;
= Uniform Distribution &amp;lt;math&amp;gt; X \sim Uniform(a, b) &amp;lt;/math&amp;gt; =&lt;br /&gt;
A uniform random variable is described by two parameters: &amp;lt;math&amp;gt; a &amp;lt;/math&amp;gt;&lt;br /&gt;
is the minimum and &amp;lt;math&amp;gt; b &amp;lt;/math&amp;gt; is the maximum. It has a rectangular&lt;br /&gt;
distribution, where every point has the same probability density.&lt;br /&gt;
&lt;br /&gt;
==== PDF ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
f(x) = \begin{cases}&lt;br /&gt;
    \frac{ 1 }{ b - a } &amp;amp; a \leq x \leq b \\&lt;br /&gt;
    0 &amp;amp; \text{otherwise}&lt;br /&gt;
\end{cases}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== CDF ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
F(x) = \begin{cases}&lt;br /&gt;
    0 &amp;amp; x &amp;lt; a \\&lt;br /&gt;
    \frac{ x - a }{ b - a } &amp;amp; a \leq x \leq b \\&lt;br /&gt;
    1 &amp;amp; x &amp;gt; b&lt;br /&gt;
\end{cases}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Mean ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\mu_X = \frac{ a + b }{ 2 }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Variance ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\sigma^2 = \frac{ 1 }{ 12 } (b - a)^2&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
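A sketch checking the uniform mean and variance formulas against a simulation (a = 2 and b = 8 are arbitrary example values):&lt;br /&gt;

```python
import random

# Sketch: Uniform(a, b) formulas vs. simulation.
a, b = 2.0, 8.0
mean_formula = (a + b) / 2        # mu = (a + b) / 2 = 5.0
var_formula = (b - a) ** 2 / 12   # sigma^2 = (b - a)^2 / 12 = 3.0

random.seed(0)
samples = [random.uniform(a, b) for _ in range(100000)]
mean_sim = sum(samples) / len(samples)
var_sim = sum((x - mean_sim) ** 2 for x in samples) / len(samples)

print(mean_formula, var_formula)  # prints 5.0 3.0
print(round(mean_sim, 1), round(var_sim, 1))  # approximately 5.0 and 3.0
```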
&lt;br /&gt;
= Exponential Distribution =&lt;br /&gt;
&lt;br /&gt;
The exponential distribution models events that occur&lt;br /&gt;
* Continuously&lt;br /&gt;
* Independently&lt;br /&gt;
* At a constant average rate&lt;br /&gt;
&lt;br /&gt;
It takes one parameter: &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt;, the rate parameter; the&lt;br /&gt;
mean is its reciprocal (see below).&lt;br /&gt;
&lt;br /&gt;
The exponential distribution has the &#039;&#039;&#039;memoryless property&#039;&#039;&#039;: the&lt;br /&gt;
probability of an event does not change no matter how much time has&lt;br /&gt;
already passed.&lt;br /&gt;
&lt;br /&gt;
In probability terms, the probability that we must wait an&lt;br /&gt;
additional &amp;lt;math&amp;gt;t&amp;lt;/math&amp;gt; units, given that we have already waited &amp;lt;math&amp;gt;s&amp;lt;/math&amp;gt;&lt;br /&gt;
units, equals the unconditional probability of waiting &amp;lt;math&amp;gt;t&amp;lt;/math&amp;gt; units:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
P(T &amp;gt; t + s | T &amp;gt; s) = P(T &amp;gt; t) = e^{-\lambda t}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
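A tiny sketch of the memoryless property, using arbitrary example values for the rate and the waiting times:&lt;br /&gt;

```python
import math

# Sketch of the memoryless property: P(T > t + s | T > s) = P(T > t)
# for an exponential RV with rate lam.
lam, t, s = 0.5, 3.0, 4.0

def tail(x):
    return math.exp(-lam * x)  # P(T > x) = e^(-lambda x)

conditional = tail(t + s) / tail(s)  # P(T > t + s) / P(T > s)
print(round(conditional, 10) == round(tail(t), 10))  # prints True
```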
&lt;br /&gt;
Notably, it models the time until some event happens, in contrast to the [[Discrete Random Variable#Poisson|Poisson distribution]], which measures the number of events in a unit time.&lt;br /&gt;
&lt;br /&gt;
==== PDF ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
f(x) = \begin{cases}&lt;br /&gt;
    \lambda e ^{ - \lambda x } &amp;amp; x \geq 0 \\&lt;br /&gt;
    0 &amp;amp; \text{otherwise}&lt;br /&gt;
\end{cases}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== CDF ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
F(x) = 1 - e^{- \lambda x}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Mean ====&lt;br /&gt;
&lt;br /&gt;
By integration by parts:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\mu_X = \frac{1}{\lambda}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Variance ====&lt;br /&gt;
&lt;br /&gt;
By integration by parts:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\sigma^2 = \frac{ 1 }{ \lambda^2 }&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Exponential and Poisson ==&lt;br /&gt;
&lt;br /&gt;
The exponential and Poisson RVs are related:&lt;br /&gt;
* &amp;lt;math&amp;gt;X \sim Poisson(\lambda)&amp;lt;/math&amp;gt;: the number of events in a unit time&lt;br /&gt;
* &amp;lt;math&amp;gt;X \sim Exp(\lambda)&amp;lt;/math&amp;gt;: waiting time until an event&lt;br /&gt;
&lt;br /&gt;
= Normal Random Variable =&lt;br /&gt;
[[File:Z score table.png|thumb|Z score table]]&lt;br /&gt;
Normal random variables are the most widely used continuous RVs in&lt;br /&gt;
statistics, characterizing many natural phenomena. This is the famous&lt;br /&gt;
bell curve.&lt;br /&gt;
&lt;br /&gt;
They are characterized by two parameters: mean and variance.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Y \sim N(\mu_Y, \sigma^2_Y)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Normal random variables are perfectly symmetric about the mean.&lt;br /&gt;
&lt;br /&gt;
==== Standardizing Normal Distribution ====&lt;br /&gt;
&lt;br /&gt;
Standardizing data means making its mean 0 and its standard&lt;br /&gt;
deviation 1. We do this by subtracting the mean and dividing by the&lt;br /&gt;
standard deviation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Z = \frac{Y - \mu}{\sigma}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Intuitively, this shifts the dataset and rescales it. We do this to&lt;br /&gt;
simplify probability calculations.&lt;br /&gt;
&lt;br /&gt;
==== Z score ====&lt;br /&gt;
&lt;br /&gt;
The z-score is the number of standard deviations above or below the&lt;br /&gt;
mean. A positive z score is above, and a negative is below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
z = \frac{y - \mu}{\sigma}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== PDF ====&lt;br /&gt;
&lt;br /&gt;
The pdf for a normal random variable is the following.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
f(y) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{1}{2} \frac{(y -&lt;br /&gt;
\mu)^2}{\sigma^2}}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Substituting the z-score, we can write this as&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
f(y) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{1}{2} z^2}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;z&amp;lt;/math&amp;gt; is the z-score covered in the last section. The pdf of the standardized variable &amp;lt;math&amp;gt;Z&amp;lt;/math&amp;gt; itself is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
f(z) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} z^2}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Quantiles ====&lt;br /&gt;
&lt;br /&gt;
Quantiles are points dividing the range of a probability&lt;br /&gt;
distribution. Quartiles and percentiles are types of quantiles.&lt;br /&gt;
&lt;br /&gt;
For normal distributions, there are special points (critical values)&lt;br /&gt;
that correspond to particular probabilities: &amp;lt;math&amp;gt;z_a&amp;lt;/math&amp;gt;, where&lt;br /&gt;
&amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt; is the probability in the right tail.&lt;br /&gt;
&lt;br /&gt;
==== Standard Normal Table ====&lt;br /&gt;
&lt;br /&gt;
The standard normal table gives lower-tail probabilities based on the&lt;br /&gt;
standard normal distribution (i.e. the area under the curve to the left&lt;br /&gt;
of the point).&lt;br /&gt;
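Instead of a table lookup, the same lower-tail value can be computed with the error function, via the standard identity Phi(z) = (1 + erf(z / sqrt(2))) / 2:&lt;br /&gt;

```python
import math

# Sketch: the lower-tail probability a z-table gives, computed directly.
def phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(round(phi(0.0), 4))   # prints 0.5
print(round(phi(1.96), 4))  # prints 0.975
print(round(phi(-1.0), 4))  # prints 0.1587
```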
&lt;br /&gt;
==== Linear Combinations of Independent Normal RV ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
W = aX + bY&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
W \sim N(a\mu_X + b\mu_Y, a^2 \sigma^2_X + b^2 \sigma^2_Y)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
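A sketch of the resulting parameters for independent normal X and Y (all numbers below are arbitrary example values):&lt;br /&gt;

```python
# Sketch: parameters of W = aX + bY for independent normal X and Y.
a, b = 2.0, -1.0
mu_x, var_x = 3.0, 4.0
mu_y, var_y = 1.0, 9.0

mu_w = a * mu_x + b * mu_y           # 2*3 + (-1)*1 = 5.0
var_w = a**2 * var_x + b**2 * var_y  # 4*4 + 1*9 = 25.0

print(mu_w, var_w)  # prints 5.0 25.0
```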
&lt;br /&gt;
= Other distributions =&lt;br /&gt;
&lt;br /&gt;
[[Two Numerical RVs]]&lt;br /&gt;
&lt;br /&gt;
[[Category:Statistics]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Discrete_Random_Variable&amp;diff=389</id>
		<title>Discrete Random Variable</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Discrete_Random_Variable&amp;diff=389"/>
		<updated>2024-03-19T07:37:05Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Binomial */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Statistics]]&lt;br /&gt;
[[Category:Distribution (Statistics)]]&lt;br /&gt;
A random variable is &#039;&#039;&#039;discrete&#039;&#039;&#039; if the set of values it can take on within an interval is &#039;&#039;finite&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
= PMF and CDF =&lt;br /&gt;
The &#039;&#039;&#039;probability mass function (PMF)&#039;&#039;&#039; describes the probability distribution over a discrete random variable.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;p(x) = P(X = x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;cumulative distribution function (CDF)&#039;&#039;&#039; specifies the probability of an observation being equal to or less than a given value.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;F(x) = P(X \leq x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We usually have tables for these in the case of discrete random variables.&lt;br /&gt;
&lt;br /&gt;
= Statistics =&lt;br /&gt;
Expected value (mean):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\mu = E(X) = \sum x_i P(X = x_i)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Distributions =&lt;br /&gt;
&lt;br /&gt;
== Bernoulli ==&lt;br /&gt;
The &#039;&#039;&#039;Bernoulli distribution&#039;&#039;&#039; describes the random variable of an experiment that has exactly two outcomes, &#039;&#039;success&#039;&#039; or &#039;&#039;failure&#039;&#039;, and is performed once.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
X \sim Bernoulli(p)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== PMF ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
p(1) = p, p(0) = 1 - p&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Statistics ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\mu = p&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\sigma^2_X = p (1 - p)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Binomial ==&lt;br /&gt;
&lt;br /&gt;
Repeating a Bernoulli experiment &amp;lt;math&amp;gt;n&amp;lt;/math&amp;gt; times gives a &#039;&#039;&#039;binomial random variable&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Consider an experiment with exactly two possible outcomes, conducted n times independently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
X \sim Binomial(n, p)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
I&#039;m sleepy; I&#039;ll write the details later. It should be on the equation sheet.&lt;br /&gt;
&lt;br /&gt;
= Poisson =&lt;br /&gt;
The &#039;&#039;&#039;Poisson distribution&#039;&#039;&#039; is used when we know the &#039;&#039;average rate of occurrence for a particular event over a particular time period&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
* There must be a &#039;&#039;&#039;fixed interval&#039;&#039;&#039; of time or space.&lt;br /&gt;
* Events happen with a &#039;&#039;&#039;known average rate&#039;&#039;&#039; independent of time or the last event.&lt;br /&gt;
* The average rate of occurrence per unit of time/space is the &#039;&#039;&#039;rate parameter&#039;&#039;&#039; &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The Poisson distribution approximates the binomial distribution when &#039;&#039;n&#039;&#039; is large and &#039;&#039;p&#039;&#039; is small, so it is often used to model rare events.&lt;br /&gt;
&lt;br /&gt;
I&#039;m sleepy; I&#039;ll write the details later... zzz...&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Discrete_Random_Variable&amp;diff=388</id>
		<title>Discrete Random Variable</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Discrete_Random_Variable&amp;diff=388"/>
		<updated>2024-03-19T07:32:27Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Statistics]]&lt;br /&gt;
[[Category:Distribution (Statistics)]]&lt;br /&gt;
A random variable is &#039;&#039;&#039;discrete&#039;&#039;&#039; if the set of values it can take on within an interval is &#039;&#039;finite&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
= PMF and CDF =&lt;br /&gt;
The &#039;&#039;&#039;probability mass function (PMF)&#039;&#039;&#039; describes the probability distribution over a discrete random variable.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;p(x) = P(X = x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;cumulative distribution function (CDF)&#039;&#039;&#039; specifies the probability of an observation being equal to or less than a given value.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;F(x) = P(X \leq x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We usually have tables for these in the case of discrete random variables.&lt;br /&gt;
&lt;br /&gt;
= Statistics =&lt;br /&gt;
Expected value (mean):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\mu = E(X) = \sum x_i P(X = x_i)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Distributions =&lt;br /&gt;
&lt;br /&gt;
== Bernoulli ==&lt;br /&gt;
The &#039;&#039;&#039;Bernoulli distribution&#039;&#039;&#039; describes the random variable of an experiment that has exactly two outcomes, &#039;&#039;success&#039;&#039; or &#039;&#039;failure&#039;&#039;, and is performed once.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
X \sim Bernoulli(p)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== PMF ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
p(1) = p, p(0) = 1 - p&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Statistics ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\mu = p&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\sigma^2_X = p (1 - p)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Binomial ==&lt;br /&gt;
&lt;br /&gt;
Repeating a Bernoulli experiment &amp;lt;math&amp;gt;n&amp;lt;/math&amp;gt; times gives a &#039;&#039;&#039;binomial random variable&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Consider an experiment with exactly two possible outcomes, conducted n times independently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
X \sim Binomial(n, p)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
I&#039;m sleepy; I&#039;ll write the details later. It should be on the equation sheet.&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Random_Variable&amp;diff=387</id>
		<title>Random Variable</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Random_Variable&amp;diff=387"/>
		<updated>2024-03-19T06:39:01Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Linear Combinations */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;A &#039;&#039;&#039;random&#039;&#039;&#039; variable is a numerical variable whose outcome is the result of a random process (i.e. we don&#039;t know what will happen for certain). Notably, the numerical interpretation of the outcome of an [[Probability#Experiment and Events|experiment]] is a random variable. They come in two types, each deserving its own page.&lt;br /&gt;
&lt;br /&gt;
= Statistics =&lt;br /&gt;
Random variables have several statistics that we care about. See pages [[Continuous Random Variable]] and [[Discrete Random Variable]] for how to calculate these values.&lt;br /&gt;
&lt;br /&gt;
When we are interested in the average outcome of a random variable, we look at the &#039;&#039;&#039;expected value&#039;&#039;&#039; (mean): a weighted average of the possible outcomes.&lt;br /&gt;
&lt;br /&gt;
When we are interested in the variability of a random variable, we look at the &#039;&#039;&#039;variance&#039;&#039;&#039; and the &#039;&#039;&#039;standard deviation&#039;&#039;&#039;: the standard deviation is roughly the expected difference from the mean, and the variance is its square (not really used for interpretation, more for math).&lt;br /&gt;
&lt;br /&gt;
= Properties of Statistics =&lt;br /&gt;
I&#039;m just gonna start writing equations.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
E(c) = c&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
E(aX) = aE(X)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
E(aX + c) = aE(X) + c&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Var(X) = E((X - \mu)^2) = E(X^2) - E(X)^2&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This one has a cool name, the &#039;&#039;law of the unconscious statistician&#039;&#039;, for how obvious it seems:&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
E(g(X)) = \sum g(x_i) P(X = x_i)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Var(c) = 0&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Var(aX) = a^2 Var(X)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Var(aX + c) = a^2 Var(X)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
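A simulation sketch checking two of the identities above, E(aX + c) = aE(X) + c and Var(aX + c) = a^2 Var(X); the distribution of X and the constants are arbitrary example choices:&lt;br /&gt;

```python
import random

# Sketch: checking the linearity of expectation and the scaling of variance.
random.seed(1)
a, c = 3.0, 5.0
xs = [random.gauss(0, 1) for _ in range(200000)]
ys = [a * x + c for x in xs]

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((u - m) ** 2 for u in v) / len(v)

# These identities hold exactly for sample moments, up to float rounding.
print(round(abs(mean(ys) - (a * mean(xs) + c)), 3))  # prints 0.0
print(round(abs(var(ys) - a * a * var(xs)), 3))      # prints 0.0
```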
&lt;br /&gt;
= Linear Combinations =&lt;br /&gt;
A &#039;&#039;&#039;linear combination&#039;&#039;&#039; of random variables &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt; is given as&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;aX + bY&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &#039;&#039;a&#039;&#039; and &#039;&#039;b&#039;&#039; are constants. They are &#039;&#039;not&#039;&#039; [[bivariate]]&lt;br /&gt;
&lt;br /&gt;
The expectation is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
E(aX + bY) = aE(X) + bE(Y)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
whereas the variance is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Var(aX + bY) = a^2 Var(X) + 2ab Cov(X,Y) + b^2 Var(Y)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
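A simulation sketch of this identity, writing the cross term as 2ab Cov(X, Y); the distributions and constants are arbitrary example choices:&lt;br /&gt;

```python
import random

# Sketch: Var(aX + bY) = a^2 Var(X) + 2ab Cov(X,Y) + b^2 Var(Y),
# checked by simulation with correlated X and Y.
random.seed(2)
a, b = 2.0, 3.0
xs = [random.gauss(0, 1) for _ in range(200000)]
ys = [0.5 * x + random.gauss(0, 1) for x in xs]  # correlated with X

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((u - m) ** 2 for u in v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((x - mu) * (y - mv) for x, y in zip(u, v)) / len(u)

lhs = var([a * x + b * y for x, y in zip(xs, ys)])
rhs = a * a * var(xs) + 2 * a * b * cov(xs, ys) + b * b * var(ys)
print(round(abs(lhs - rhs), 6))  # prints 0.0 (holds for sample moments too)
```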
&lt;br /&gt;
I&#039;m sleepy so I&#039;m not gonna derive this one.&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Random_Variable&amp;diff=386</id>
		<title>Random Variable</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Random_Variable&amp;diff=386"/>
		<updated>2024-03-19T06:38:04Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Linear Combinations */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;A &#039;&#039;&#039;random&#039;&#039;&#039; variable is a numerical variable whose outcome is the result of a random process (i.e. we don&#039;t know what will happen for certain). Notably, the numerical interpretation of the outcome of an [[Probability#Experiment and Events|experiment]] is a random variable. They come in two types, each deserving its own page.&lt;br /&gt;
&lt;br /&gt;
= Statistics =&lt;br /&gt;
Random variables have several statistics that we care about. See pages [[Continuous Random Variable]] and [[Discrete Random Variable]] for how to calculate these values.&lt;br /&gt;
&lt;br /&gt;
When we are interested in the average outcome of a random variable, we look at the &#039;&#039;&#039;expected value&#039;&#039;&#039; (mean): a weighted average of the possible outcomes.&lt;br /&gt;
&lt;br /&gt;
When we are interested in the variability of a random variable, we look at the &#039;&#039;&#039;variance&#039;&#039;&#039; and the &#039;&#039;&#039;standard deviation&#039;&#039;&#039;: the standard deviation is roughly the expected difference from the mean, and the variance is its square (not really used for interpretation, more for math).&lt;br /&gt;
&lt;br /&gt;
= Properties of Statistics =&lt;br /&gt;
I&#039;m just gonna start writing equations.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
E(c) = c&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
E(aX) = aE(X)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
E(aX + c) = aE(X) + c&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Var(X) = E((X - \mu)^2) = E(X^2) - E(X)^2&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This one has a cool name, the &#039;&#039;law of the unconscious statistician&#039;&#039;, for how obvious it seems:&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
E(g(X)) = \sum g(x_i) P(X = x_i)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Var(c) = 0&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Var(aX) = a^2 Var(X)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Var(aX + c) = a^2 Var(X)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Linear Combinations =&lt;br /&gt;
A &#039;&#039;&#039;linear combination&#039;&#039;&#039; of random variables &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt; is given as&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;aX + bY&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &#039;&#039;a&#039;&#039; and &#039;&#039;b&#039;&#039; are constants. They are &#039;&#039;not&#039;&#039; [[bivariate]]&lt;br /&gt;
&lt;br /&gt;
The expectation is&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
E(aX + bY) = aE(X) + bE(Y)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Var(aX + bY) = a^2 Var(X) + 2ab Cov(X,Y) + b^2 Var(Y)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Random_Variable&amp;diff=385</id>
		<title>Random Variable</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Random_Variable&amp;diff=385"/>
		<updated>2024-03-19T06:36:49Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;A &#039;&#039;&#039;random&#039;&#039;&#039; variable is a numerical variable whose outcome is the result of a random process (i.e. we don&#039;t know what will happen for certain). Notably, the numerical interpretation of the outcome of an [[Probability#Experiment and Events|experiment]] is a random variable. They come in two types, each deserving its own page.&lt;br /&gt;
&lt;br /&gt;
= Statistics =&lt;br /&gt;
Random variables have several statistics that we care about. See pages [[Continuous Random Variable]] and [[Discrete Random Variable]] for how to calculate these values.&lt;br /&gt;
&lt;br /&gt;
When we are interested in the average outcome of a random variable, we look at the &#039;&#039;&#039;expected value&#039;&#039;&#039; (mean): a weighted average of the possible outcomes.&lt;br /&gt;
&lt;br /&gt;
When we are interested in the variability of a random variable, we look at the &#039;&#039;&#039;variance&#039;&#039;&#039; and the &#039;&#039;&#039;standard deviation&#039;&#039;&#039;: the standard deviation is roughly the expected difference from the mean, and the variance is its square (not really used for interpretation, more for math).&lt;br /&gt;
&lt;br /&gt;
= Properties of Statistics =&lt;br /&gt;
I&#039;m just gonna start writing equations.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
E(c) = c&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
E(aX) = aE(X)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
E(aX + c) = aE(X) + c&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Var(X) = E((X - \mu)^2) = E(X^2) - E(X)^2&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This one has a cool name, the &#039;&#039;law of the unconscious statistician&#039;&#039;, for how obvious it seems:&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
E(g(X)) = \sum g(x_i) P(X = x_i)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Var(c) = 0&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Var(aX) = a^2 Var(X)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Var(aX + c) = a^2 Var(X)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Linear Combinations =&lt;br /&gt;
A &#039;&#039;&#039;linear combination&#039;&#039;&#039; of random variables &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt; is given as&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;aX + bY&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &#039;&#039;a&#039;&#039; and &#039;&#039;b&#039;&#039; are constants. They are &#039;&#039;not&#039;&#039; [[bivariate]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Random_Variable&amp;diff=384</id>
		<title>Random Variable</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Random_Variable&amp;diff=384"/>
		<updated>2024-03-19T06:35:04Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Statistics */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;A &#039;&#039;&#039;random&#039;&#039;&#039; variable is a numerical variable whose outcome is the result of a random process (i.e. we don&#039;t know what will happen for certain). Notably, the numerical interpretation of the outcome of an [[Probability#Experiment and Events|experiment]] is a random variable. They come in two types, each deserving its own page.&lt;br /&gt;
&lt;br /&gt;
= Statistics =&lt;br /&gt;
Random variables have several statistics that we care about. See pages [[Continuous Random Variable]] and [[Discrete Random Variable]] for how to calculate these values.&lt;br /&gt;
&lt;br /&gt;
When we are interested in the average outcome of a random variable, we look at the &#039;&#039;&#039;expected value&#039;&#039;&#039; (mean): a weighted average of the possible outcomes.&lt;br /&gt;
&lt;br /&gt;
When we are interested in the variability of a random variable, we look at the &#039;&#039;&#039;variance&#039;&#039;&#039; and the &#039;&#039;&#039;standard deviation&#039;&#039;&#039;: the standard deviation is roughly the expected difference from the mean, and the variance is its square (not really used for interpretation, more for math).&lt;br /&gt;
&lt;br /&gt;
= Properties of Statistics =&lt;br /&gt;
I&#039;m just gonna start writing equations.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
E(c) = c&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
E(aX) = aE(X)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
E(aX + c) = aE(X) + c&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Var(X) = E((X - \mu)^2) = E(X^2) - E(X)^2&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
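These identities can be checked numerically for a small discrete RV. A minimal sketch; the outcomes and probabilities below are made-up example values, not from this page:

```python
import math

# A small made-up discrete RV (hypothetical example data).
outcomes = [1, 2, 3, 4]
probs = [0.1, 0.2, 0.3, 0.4]

def E(g):
    # Probability-weighted sum; for a function g of X this is exactly LOTUS:
    # E(g(X)) = sum over i of g(x_i) P(X = x_i)
    return sum(g(x) * p for x, p in zip(outcomes, probs))

mean = E(lambda x: x)                        # E(X)
var = E(lambda x: (x - mean) ** 2)           # definition of Var(X)
shortcut = E(lambda x: x ** 2) - mean ** 2   # E(X^2) - E(X)^2
assert math.isclose(var, shortcut)

# Linearity of expectation and the variance scaling rules:
a, c = 2.0, 5.0
assert math.isclose(E(lambda x: a * x + c), a * mean + c)
var_shifted = E(lambda x: (a * x + c - (a * mean + c)) ** 2)
assert math.isclose(var_shifted, a * a * var)
print(mean, var)
```

Note that E(c) = c and Var(c) = 0 also follow: a constant function of X is just a weighted sum of the same value.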
&lt;br /&gt;
Cool name for this one: the &#039;&#039;law of the unconscious statistician&#039;&#039;, so called for how obvious it seems&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
E(g(X)) = \sum g(x_i) P(X = x_i)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Var(c) = 0&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Var(aX) = a^2 Var(X)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Var(aX + c) = a^2 Var(X)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Random_Variable&amp;diff=383</id>
		<title>Random Variable</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Random_Variable&amp;diff=383"/>
		<updated>2024-03-19T06:30:05Z</updated>

		<summary type="html">&lt;p&gt;Admin: Created page with &amp;quot;A &amp;#039;&amp;#039;&amp;#039;random&amp;#039;&amp;#039;&amp;#039; variable is a numerical variable whose outcome is the result of a random process (i.e. we don&amp;#039;t know what will happen for certain). Notably, the numerical interpretation of the outcome of an experiment is a random variable. They come in two types, deserving their own pages. See   = Statistics = Random variables has several statistics that we care about. See pages Continuous Random Variable and Discrete Random Var...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;A &#039;&#039;&#039;random&#039;&#039;&#039; variable is a numerical variable whose outcome is the result of a random process (i.e. we don&#039;t know what will happen for certain). Notably, the numerical interpretation of the outcome of an [[Probability#Experiment and Events|experiment]] is a random variable. They come in two types, each deserving its own page: see [[Continuous Random Variable]] and [[Discrete Random Variable]].&lt;br /&gt;
&lt;br /&gt;
= Statistics =&lt;br /&gt;
Random variables have several statistics that we care about. See the pages [[Continuous Random Variable]] and [[Discrete Random Variable]] for how to calculate these values.&lt;br /&gt;
&lt;br /&gt;
When we are interested in the average outcome of a random variable, we look at the &#039;&#039;&#039;expected value&#039;&#039;&#039; (mean): a weighted average of the possible outcomes.&lt;br /&gt;
&lt;br /&gt;
When we are interested in the variability of a random variable, we look at the &#039;&#039;&#039;variance&#039;&#039;&#039; and the &#039;&#039;&#039;standard deviation&#039;&#039;&#039;: the latter is the typical deviation from the mean, and the former is the square of the latter (used more for the math than for interpretation).&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Variable_(Statistics)&amp;diff=382</id>
		<title>Variable (Statistics)</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Variable_(Statistics)&amp;diff=382"/>
		<updated>2024-03-19T06:25:29Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Random Variable */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;In statistics, a &#039;&#039;&#039;variable&#039;&#039;&#039; is a characteristic of a subject that varies in a &#039;&#039;non-random&#039;&#039; way.&lt;br /&gt;
&lt;br /&gt;
= Overview and Related Definitions =&lt;br /&gt;
&lt;br /&gt;
At the top level of statistics, we investigate a &#039;&#039;&#039;population&#039;&#039;&#039;: a set of units that we are interested in studying.&lt;br /&gt;
&lt;br /&gt;
Populations are almost always impossible to study due to their massive size and other constraints. Therefore, we take a &#039;&#039;&#039;sample&#039;&#039;&#039;: a subset of the population. &lt;br /&gt;
&lt;br /&gt;
A &#039;&#039;&#039;subject&#039;&#039;&#039; is a unit that we study in a population or a sample, and a &#039;&#039;&#039;variable&#039;&#039;&#039; is a particular characteristic of the subject that we are interested in studying.&lt;br /&gt;
&lt;br /&gt;
= Types of Variables =&lt;br /&gt;
&lt;br /&gt;
There are two types of variables, each with two sub-categories that have useful properties:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Quantitative/Numerical&#039;&#039;&#039; variables are measured with numbers. Their sum has meaning.&lt;br /&gt;
** &#039;&#039;&#039;Continuous&#039;&#039;&#039; numerical variables can theoretically take on any number within an interval, whereas&lt;br /&gt;
** &#039;&#039;&#039;Discrete&#039;&#039;&#039; numerical variables have natural gaps&lt;br /&gt;
* &#039;&#039;&#039;Qualitative/Categorical&#039;&#039;&#039; variables are measured as labels. Their sum does not have meaning.&lt;br /&gt;
** &#039;&#039;&#039;Ordinal&#039;&#039;&#039; categorical variables have a natural ordering, whereas&lt;br /&gt;
** &#039;&#039;&#039;Nominal&#039;&#039;&#039; categorical variables do not&lt;br /&gt;
&lt;br /&gt;
It is possible for a categorical variable to be denoted with numbers; a common example is an ID number. The biggest difference between categorical variables denoted as numbers and numerical variables is that the sum/mean of categorical variables has no meaning, whereas that of numerical variables does.&lt;br /&gt;
&lt;br /&gt;
= Notation =&lt;br /&gt;
&lt;br /&gt;
A capitalized character (usually &amp;lt;math&amp;gt;X, Y, Z&amp;lt;/math&amp;gt;) is used to denote &#039;&#039;all possible values&#039;&#039; of a variable. This is called a &#039;&#039;&#039;major&#039;&#039;&#039;. When we say &amp;quot;variable&amp;quot;, we usually mean this.&lt;br /&gt;
&lt;br /&gt;
A lower case character corresponding to the major (such as &amp;lt;math&amp;gt;x, y, z&amp;lt;/math&amp;gt;) is used to denote a specific value of that major. This is called a &#039;&#039;&#039;statistic&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
= Random Variable =&lt;br /&gt;
A &#039;&#039;&#039;random&#039;&#039;&#039; variable is a numerical variable whose outcome is the result of a random process (i.e. we don&#039;t know what will happen for certain). See [[Random Variable]].&lt;br /&gt;
[[Category:Statistics]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>http://ricefriedegg.com:80/mediawiki/index.php?title=Bivariate&amp;diff=381</id>
		<title>Bivariate</title>
		<link rel="alternate" type="text/html" href="http://ricefriedegg.com:80/mediawiki/index.php?title=Bivariate&amp;diff=381"/>
		<updated>2024-03-19T06:24:03Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Distribution (Statistics)]][[Category:Statistics]]&lt;br /&gt;
&#039;&#039;&#039;Bivariate&#039;&#039;&#039; data consider two variables instead of the usual one; each value of one of the variables is paired with a value of the other variable. We will be using &amp;lt;math&amp;gt;X, Y&amp;lt;/math&amp;gt; to denote the two random variables throughout this page.&lt;br /&gt;
&lt;br /&gt;
= Summary Statistics =&lt;br /&gt;
To summarize bivariate data, we use covariance and correlation in addition to the statistics detailed in [[Summary Statistics]].&lt;br /&gt;
&lt;br /&gt;
== Covariance ==&lt;br /&gt;
The &#039;&#039;&#039;covariance&#039;&#039;&#039; measures how two RVs vary together about their respective centers: it indicates how one variable tends to change when the other changes.&lt;br /&gt;
&lt;br /&gt;
We have &#039;&#039;sample covariance&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;s_{X, Y} = \hat{cov}(X, Y) = \frac{1}{n - 1} \sum(x_i - \bar{x}) (y_i - \bar{y}) = \frac{1}{n - 1} \left( \sum x_i y_i - n \bar{x} \bar{y} \right)&amp;lt;/math&amp;gt;&lt;br /&gt;
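A quick numerical check that the two forms of the sample covariance agree. The paired data below are hypothetical, chosen only for illustration:

```python
import math

# A small made-up paired sample (hypothetical illustration data).
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 7]
n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

# Definition: (1/(n-1)) * sum of (x_i - xbar)(y_i - ybar)
cov_def = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)

# Computational shortcut: (1/(n-1)) * (sum of x_i y_i - n xbar ybar)
cov_short = (sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar) / (n - 1)

assert math.isclose(cov_def, cov_short)
print(cov_def)
```

Here y increases with x, so the covariance comes out positive, matching the case analysis below.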
&lt;br /&gt;
A good way of thinking about covariance is by cases:&lt;br /&gt;
&lt;br /&gt;
If x &#039;&#039;increases&#039;&#039; as y &#039;&#039;increases&#039;&#039;, the signs of the two factors in each term of the covariance sum tend to be the same. Therefore, covariance is &#039;&#039;positive&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
If x &#039;&#039;decreases&#039;&#039; as y &#039;&#039;increases&#039;&#039;, the signs are different. Therefore, covariance is &#039;&#039;negative&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
If x does not clearly vary with y, the signs are sometimes different, sometimes the same. Overall, it should cancel out to &#039;&#039;zero.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== Correlation ==&lt;br /&gt;
The &#039;&#039;&#039;correlation&#039;&#039;&#039; of two random variables measures the &#039;&#039;&#039;linear&lt;br /&gt;
dependence&#039;&#039;&#039; between &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Cor(X, Y) = \rho = \frac{Cov(X,Y)}{sd(X) sd(Y)}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Correlation is always between -1 and 1. When &amp;lt;math&amp;gt;\rho = 1&amp;lt;/math&amp;gt;, the relationship between X and Y is &#039;&#039;&#039;perfect positive linear&#039;&#039;&#039;. When &amp;lt;math&amp;gt;\rho = -1&amp;lt;/math&amp;gt;, it is &#039;&#039;&#039;perfect negative linear&#039;&#039;&#039;. If it is 0, there is no &#039;&#039;linear&#039;&#039; relationship. This doesn&#039;t mean that there is no relationship at all: notably, any scatter plot that is symmetric about a vertical line has a correlation of 0.&lt;br /&gt;
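The formula can be sketched numerically with a small made-up paired sample (hypothetical data), checking that the result lands in [-1, 1]:

```python
import math

# A small made-up paired sample (hypothetical illustration data).
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 7]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

cov = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))

r = cov / (sx * sy)   # Cor(X, Y) = Cov(X, Y) / (sd(X) sd(Y))
assert abs(r) == min(abs(r), 1.0)   # correlation always lies in [-1, 1]
print(round(r, 4))   # 0.9923
```

The value is close to 1 because these points lie nearly on an increasing line.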
&lt;br /&gt;
= Bivariate Normal =&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;bivariate normal&#039;&#039;&#039; (aka. bivariate Gaussian) is one special type&lt;br /&gt;
of continuous joint distribution.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;(X, Y)&amp;lt;/math&amp;gt; is &#039;&#039;bivariate normal&#039;&#039; if&lt;br /&gt;
&lt;br /&gt;
# The marginal PDFs of both X and Y are normal&lt;br /&gt;
# For any &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;, the conditional PDF of &amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt; given &amp;lt;math&amp;gt;X = x&amp;lt;/math&amp;gt; is normal&lt;br /&gt;
#* This works the other way around as well: bivariate normal implies the condition is satisfied&lt;br /&gt;
&lt;br /&gt;
== Predicting Y given X ==&lt;br /&gt;
&lt;br /&gt;
Given bivariate normal, we can predict one variable given another.&lt;br /&gt;
Let us try estimating the expected value of Y given that X is x:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
E(Y| X = x)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There are three main methods&lt;br /&gt;
* Scatter plot approximation&lt;br /&gt;
* Joint PDF&lt;br /&gt;
* 5 parameters&lt;br /&gt;
&lt;br /&gt;
=== 5 Parameters ===&lt;br /&gt;
&lt;br /&gt;
We need to know 5 parameters about &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;E(X), sd(X), E(Y), sd(Y), \rho&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If &amp;lt;math&amp;gt;X, Y&amp;lt;/math&amp;gt; follow the bivariate normal distribution, then we&lt;br /&gt;
have&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\left( \frac{E(Y|X = x) - E(Y)}{sd(Y)} \right) = \rho \left( \frac{x -&lt;br /&gt;
E(X)}{sd(X)} \right)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The left side is the &#039;&#039;predicted Z-score for Y&#039;&#039;, and the right side is&lt;br /&gt;
&#039;&#039;the product of correlation and Z-score of X = x&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The variance is given by&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
Var(Y | X = x) = (1 - \rho^2) Var(Y)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Due to the range of &amp;lt;math&amp;gt;\rho&amp;lt;/math&amp;gt;, the variance of Y given X is&lt;br /&gt;
never larger than the unconditional variance. The standard deviation is just&lt;br /&gt;
the square root of that.&lt;br /&gt;
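The two formulas above can be sketched as a small prediction function. The 5 parameter values below are made-up, purely for illustration:

```python
import math

# Hypothetical 5 parameters for a bivariate normal pair (made-up values).
EX, sdX = 70.0, 3.0
EY, sdY = 170.0, 8.0
rho = 0.6

def predict_EY(x):
    # Z-score form: (E(Y|X=x) - E(Y)) / sd(Y) = rho * (x - E(X)) / sd(X)
    zx = (x - EX) / sdX
    return EY + rho * zx * sdY

def cond_var():
    # Var(Y | X = x) = (1 - rho^2) Var(Y), the same for every x
    return (1.0 - rho ** 2) * sdY ** 2

# Being one SD above the mean in X predicts only rho SDs above the mean in Y.
print(predict_EY(73.0), cond_var())
```

This already shows the regression effect: the predicted Z-score for Y is shrunk by the factor rho.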
&lt;br /&gt;
== Regression Effect ==&lt;br /&gt;
[[File:Regression Effect Scatter Plot.png|thumb|Regression effect demonstrated by SD line and Regression line]]&lt;br /&gt;
The &#039;&#039;&#039;regression effect&#039;&#039;&#039; is the phenomenon that the best prediction&lt;br /&gt;
of &amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt; given &amp;lt;math&amp;gt;X = x&amp;lt;/math&amp;gt; is less extreme (in Z-score terms) than&lt;br /&gt;
&amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; is for &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;; future predictions regress toward&lt;br /&gt;
mediocrity.&lt;br /&gt;
&lt;br /&gt;
When you plot all the predicted &amp;lt;math&amp;gt;E(Y|X = x)&amp;lt;/math&amp;gt;, you get the&lt;br /&gt;
&#039;&#039;&#039;linear regression line&#039;&#039;&#039;. The regression effect can be demonstrated&lt;br /&gt;
by also plotting the SD line (where the correlation is not applied).&lt;br /&gt;
&lt;br /&gt;
= Linear Regression =&lt;br /&gt;
&lt;br /&gt;
== Assumption ==&lt;br /&gt;
&lt;br /&gt;
# X and Y have a linear relationship&lt;br /&gt;
# A random sample of pairs was taken&lt;br /&gt;
# All pairs of data are independent&lt;br /&gt;
# The variance of the error is constant. &amp;lt;math&amp;gt;Var(\epsilon) = \sigma_\epsilon^2&amp;lt;/math&amp;gt;&lt;br /&gt;
# The average of the errors is zero. &amp;lt;math&amp;gt;E(\epsilon) = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
# The errors are normally distributed.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\epsilon_i \sim^{iid} N(0, \sigma_\epsilon^2), \quad Y_i \sim^{iid} N(\beta_0&lt;br /&gt;
+ \beta_1 x_i, \sigma_\epsilon^2)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
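The model under these assumptions can be sketched with a tiny simulation. The values of beta_0, beta_1, and sigma below are made-up, chosen only for illustration:

```python
import random

# Simulate Y_i = beta_0 + beta_1 x_i + eps_i with eps_i iid N(0, sigma^2).
# These parameter values are hypothetical.
random.seed(0)
beta0, beta1, sigma = 0.5, 1.6, 0.3
xs = [i / 10 for i in range(100)]
ys = [beta0 + beta1 * x + random.gauss(0.0, sigma) for x in xs]

# With many points, the sample mean of the errors should be near E(eps) = 0.
errors = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]
mean_err = sum(errors) / len(errors)
print(mean_err)
```

Each Y_i is then normal with mean beta_0 + beta_1 x_i and constant variance sigma^2, exactly as the display above states.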
&lt;br /&gt;
== Procedure ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
y_i = \beta_0 + \beta_1 x_i + \epsilon_i&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where the &amp;lt;math&amp;gt;\beta_0, \beta_1&amp;lt;/math&amp;gt; are the &#039;&#039;&#039;regression&lt;br /&gt;
coefficients&#039;&#039;&#039; (intercept and slope, respectively) based on the population, and&lt;br /&gt;
&amp;lt;math&amp;gt;\epsilon_i&amp;lt;/math&amp;gt; is the error for the i-th subject.&lt;br /&gt;
&lt;br /&gt;
We want to estimate the regression coefficients.&lt;br /&gt;
&lt;br /&gt;
Let &amp;lt;math&amp;gt;\hat{y_i}&amp;lt;/math&amp;gt; be an estimate of &amp;lt;math&amp;gt;y_i&amp;lt;/math&amp;gt;: a&lt;br /&gt;
prediction at &amp;lt;math&amp;gt;X = x_i&amp;lt;/math&amp;gt;, with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\hat{y_i} = \hat{\beta_0} + \hat{\beta_1} x_i&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We can measure the vertical error &amp;lt;math&amp;gt;e_i = y_i - \hat{y_i}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The overall error is the sum of squared errors &amp;lt;math&amp;gt;SSE = \sum_{i=1}^n&lt;br /&gt;
e_i^2&amp;lt;/math&amp;gt;. The best fit line is the line minimizing SSE.&lt;br /&gt;
&lt;br /&gt;
Using calculus, we can find that the line has the following slope and&lt;br /&gt;
intercept:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\hat{\beta_1} = r \frac{s_y}{s_x}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;r&amp;lt;/math&amp;gt; is the sample correlation (strength of the linear relationship), and&lt;br /&gt;
&amp;lt;math&amp;gt;s_x, s_y&amp;lt;/math&amp;gt; are the sample standard deviations. They are&lt;br /&gt;
basically the sample versions of &amp;lt;math&amp;gt;\rho, \sigma&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\hat{\beta_0} = \bar{Y} - \hat{\beta_1} \bar{X}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
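A minimal numerical sketch of these coefficient formulas, using a small made-up paired sample (hypothetical data); it also checks that nudging the fitted line does not reduce the SSE:

```python
import math

# A small made-up paired sample (hypothetical illustration data).
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 7]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

cov = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
r = cov / (sx * sy)

b1 = r * sy / sx        # hat(beta_1) = r * s_y / s_x
b0 = ybar - b1 * xbar   # hat(beta_0) = ybar - hat(beta_1) * xbar

def sse(beta0, beta1):
    # Sum of squared vertical errors for a candidate line.
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))

# The least-squares line should beat (or tie) small perturbations of itself.
best = sse(b0, b1)
others = [sse(b0 + 0.1, b1), sse(b0, b1 + 0.1), sse(b0 - 0.1, b1 - 0.1)]
assert best == min([best] + others)
print(round(b1, 10), round(b0, 10))   # 1.6 0.5
```

Note that r * s_y / s_x equals cov / s_x^2, another common way of writing the slope.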
&lt;br /&gt;
== Interpretation ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta_1}&amp;lt;/math&amp;gt; (the slope) is the estimated average change in&lt;br /&gt;
&amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt; when &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; increases by one unit.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{\beta_0}&amp;lt;/math&amp;gt; (the intercept) is the estimated average of&lt;br /&gt;
&amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt; when &amp;lt;math&amp;gt;X = 0&amp;lt;/math&amp;gt;. If &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; cannot be 0,&lt;br /&gt;
this may not have a practical meaning.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;r^2&amp;lt;/math&amp;gt; (the &#039;&#039;&#039;coefficient of determination&#039;&#039;&#039;) measures how well&lt;br /&gt;
the line fits the data.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
r^2 = \frac{\sum (\hat{y_i} - \bar{Y})^2 }{\sum (y_i - \bar{Y})^2}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The denominator is the total variation in &amp;lt;math&amp;gt;y&amp;lt;/math&amp;gt;; the numerator is the&lt;br /&gt;
variation explained by the regression line. The value is the&lt;br /&gt;
proportion of variance in &amp;lt;math&amp;gt;y&amp;lt;/math&amp;gt; that is explained by the linear&lt;br /&gt;
relationship between &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt;.&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
</feed>