Latest revision as of 19:32, 17 May 2024

Exploratory data analysis (EDA) is typically the first step in a machine learning pipeline. It lets us make informed decisions about which tools and models to use to analyze the data.

Dataset description

During the EDA phase, we can choose which model to use based on the attributes of a dataset.
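
As a rough illustration of what inspecting a dataset's attributes can look like, the sketch below summarizes a small in-memory table. The toy rows, the `describe` helper, and the attributes it reports (value types, missing counts, distinct counts) are assumptions made for this example and are not taken from the linked dataset page.

```python
# Hypothetical toy dataset: each row is a dict, None marks a missing value.
rows = [
    {"age": 34, "income": 52000.0, "city": "Lyon"},
    {"age": 29, "income": None, "city": "Paris"},
    {"age": 41, "income": 61000.0, "city": "Paris"},
]


def describe(rows):
    """Return, per column, the value types, missing count, and distinct count."""
    summary = {}
    for col in rows[0]:
        values = [row[col] for row in rows]
        present = [v for v in values if v is not None]
        summary[col] = {
            "types": sorted({type(v).__name__ for v in present}),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return summary


summary = describe(rows)
for col, info in summary.items():
    print(col, info)
```

A summary like this can hint at which columns are numeric, which are categorical, and which need cleaning before a model is chosen.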

Other tasks

EDA also helps detect unwanted values and noise that can lead to inaccurate predictions.

Outlier

Outliers are samples in a dataset that are abnormally extreme (i.e., far from the other data points). They are anomalies that can bias the data and reduce the accuracy of a model trained on it.

During the EDA phase, outliers can be detected using

Background knowledge, such as flagging impossible values like a negative age
Visualization, such as a scatter plot
Data analysis, such as a box plot
ML algorithms, such as One-Class SVM

Once detected, they are handled with one of the following strategies

Correct the values
Delete the values
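
The detection and deletion steps above can be sketched with the standard library alone. This is a minimal example, not a definitive procedure: the `ages` sample and the `iqr_bounds` helper are hypothetical, and the 1.5 × IQR fences are the usual box-plot whisker convention, assumed here because the text does not name a specific rule.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical sample of ages containing two suspicious entries.
ages = [23, 25, 27, 29, 31, 33, 35, 120, -4]

# Background knowledge: an age cannot be negative.
impossible = [a for a in ages if a < 0]
plausible = [a for a in ages if a >= 0]

# Data analysis: flag values outside the box-plot fences.
low, high = iqr_bounds(plausible)
outliers = [a for a in plausible if a < low or a > high]

# Handling strategy "delete": keep only values inside the fences.
cleaned = [a for a in plausible if low <= a <= high]

print(impossible, outliers, cleaned)
```

Correcting a value instead of deleting it requires knowing the true value (for example, from the original records), so it is not shown here.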

Skewed data tend to have outliers on the side of the longer tail, i.e. the direction of the skew.
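
A small check of this tendency, under stated assumptions: the `incomes` sample is made up to be right-skewed, `skewness` is a hypothetical helper computing the mean cubed z-score, and outliers are flagged with the 1.5 × IQR box-plot convention.

```python
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)


# Hypothetical right-skewed sample (long tail toward large values).
incomes = [20, 22, 23, 25, 26, 28, 30, 31, 33, 90, 150]

q1, _, q3 = statistics.quantiles(incomes, n=4)
fence_low = q1 - 1.5 * (q3 - q1)
fence_high = q3 + 1.5 * (q3 - q1)

skew = skewness(incomes)
high_side = [v for v in incomes if v > fence_high]
low_side = [v for v in incomes if v < fence_low]

print(f"skewness={skew:.2f}, high-side outliers={high_side}, low-side outliers={low_side}")
```

In this sample the skewness is positive and all flagged values sit above the upper fence, on the same side as the long tail.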

Data Normalization

Normalization is a technique for rescaling numeric features to a common scale. It

helps gradient-based training converge faster
prevents some features from dominating others due to scale
can improve the overall performance of the model
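
To see how one feature can dominate another purely because of its scale, consider the Euclidean distance between two samples. The two points and their units below are hypothetical values chosen for illustration.

```python
import math

# Hypothetical samples: (age in years, income in dollars).
a = (25, 50_000)
b = (60, 51_000)

age_gap = abs(a[0] - b[0])       # 35 years: a large difference in age
income_gap = abs(a[1] - b[1])    # 1000 dollars: a small difference in income

distance = math.dist(a, b)

# Fraction of the squared distance contributed by the income feature.
income_share = income_gap ** 2 / distance ** 2

print(f"distance={distance:.1f}, income share={income_share:.4f}")
```

Although the age gap is large in practical terms, nearly all of the distance comes from income, simply because income is measured in larger numbers. Rescaling both features to a common range removes this imbalance.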

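There are several strategies for normalizing data

Min-max scaling: $X_{\text{norm}} = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}$, which maps values to the range $[0, 1]$
Z-score scaling: $X_{\text{norm}} = \frac{X - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation of the feature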
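
Both formulas translate directly into code. The sketch below is a minimal version using only the standard library. The function names and the `heights_cm` sample are hypothetical, and the use of the population standard deviation for $\sigma$ is an assumption, since the text does not say whether the population or sample estimate is meant.

```python
import statistics


def min_max_scale(values):
    """Apply (x - min) / (max - min) to every value."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max scaling is undefined for a constant feature")
    return [(v - lo) / (hi - lo) for v in values]


def z_score_scale(values):
    """Apply (x - mean) / std to every value, using the population std."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("z-score scaling is undefined for a constant feature")
    return [(v - mu) / sigma for v in values]


# Hypothetical feature column.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]

scaled = min_max_scale(heights_cm)
standardized = z_score_scale(heights_cm)

print(scaled)
print([round(z, 3) for z in standardized])
```

After min-max scaling the smallest value becomes 0 and the largest becomes 1. After z-score scaling the feature has mean 0 and standard deviation 1. In practice the minimum, maximum, mean, and standard deviation are usually computed on the training data only and then reused for new data.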

