Exploratory data analysis (EDA) is the first step in the machine learning pipeline. It allows us to make informed decisions about which tools to use to analyze the data.

Look at features of data
Look at correlated features
Find trends and unusual characteristics

Dataset description

EDA also helps detect unwanted values and noise that can lead to inaccurate predictions.

During the EDA phase, we can also make choices about which model to use depending on the dataset, for example by visualizing the data and checking whether the relationships in it are approximately linear.
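As a rough numeric companion to a scatter plot, the Pearson correlation coefficient indicates how close a relationship is to a straight line. The sketch below is only an illustration: the function name `pearson_r` and the toy data are assumptions, not part of any particular library.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of centered products; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: one linear and one symmetric quadratic relationship.
x = [-2, -1, 0, 1, 2]
linear_y = [2 * v + 1 for v in x]
quadratic_y = [v ** 2 for v in x]

print("linear:", pearson_r(x, linear_y))
print("quadratic:", pearson_r(x, quadratic_y))
```

A coefficient near 1 or -1 suggests a linear model may fit well. A coefficient near 0 only rules out a linear relationship, not a relationship in general (the quadratic data above is fully determined by `x`), which is why visualizing the data remains useful.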

Outlier

Outliers in a dataset are samples that are abnormally extreme (i.e., far from the other data points). They are anomalies that can bias the data and reduce the overall accuracy of a model trained on it.

During the EDA phase, these outliers are detected using:

Background knowledge, e.g. spotting impossible values such as a negative age
Visualization, e.g. a scatter plot
Data analysis, e.g. a box plot
ML algorithms, e.g. One-Class SVM
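The box-plot approach can be sketched numerically with the common 1.5 × IQR rule, which flags points lying beyond the whiskers. This is a minimal illustration under assumptions: the function name `iqr_outliers`, the multiplier `k=1.5`, the quartile method, and the sample ages are all chosen for the example.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # "inclusive" treats the data as the whole population when interpolating;
    # other quartile conventions give slightly different bounds.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical sample of ages containing one extreme value.
ages = [22, 25, 27, 29, 31, 33, 35, 38, 41, 120]
print(iqr_outliers(ages))

# Background-knowledge check: an age can never be negative.
recorded = [34, -2, 51]
print([a for a in recorded if a < 0])
```

The IQR rule is purely statistical, so it complements rather than replaces domain checks: a value can be plausible statistically yet impossible in context, and vice versa.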

Once detected, outliers are dealt with using one of the following strategies:

Correct values
Delete values
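Both strategies can be sketched in a few lines. Note the assumption: truly correcting a value requires knowing what it should have been, so this example stands in for "correct" by replacing flagged values with the median of the remaining ones. The helper names and the age limits are hypothetical.

```python
from statistics import median


def delete_outliers(values, is_outlier):
    """Strategy 'delete': drop every value flagged by is_outlier."""
    return [v for v in values if not is_outlier(v)]


def replace_outliers_with_median(values, is_outlier):
    """Strategy 'correct' (assumed here to mean median replacement)."""
    clean = [v for v in values if not is_outlier(v)]
    if not clean:
        raise ValueError("every value was flagged as an outlier")
    fill = median(clean)
    return [fill if is_outlier(v) else v for v in values]


def impossible_age(age):
    # Hypothetical domain rule: ages outside 0..120 are treated as errors.
    return age < 0 or age > 120


ages = [23, 31, -4, 45, 38, 250]
print(delete_outliers(ages, impossible_age))
print(replace_outliers_with_median(ages, impossible_age))
```

Deleting keeps only trustworthy samples but shrinks the dataset; replacing keeps the sample count at the cost of introducing an estimated value.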

Skewed data tend to have outliers on the side toward which the distribution is skewed, i.e. the side of the longer tail.
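The direction of skew can be estimated with the moment-based skewness coefficient $m_3 / m_2^{3/2}$, where $m_2$ and $m_3$ are the second and third central moments. The sketch below is illustrative only; the function name `skewness` and the sample data are assumptions.

```python
from statistics import fmean


def skewness(values):
    """Moment-based skewness: positive means a longer right tail."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return m3 / m2 ** 1.5


# Hypothetical samples: one with a long right tail, one symmetric.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 50]
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed))
print(skewness(symmetric))
```

In the right-skewed sample the extreme value (50) sits on the right, the same side as the positive skew.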

Data Normalization

Normalization is a technique used to bring numeric features onto a common scale. It

helps in faster convergence
prevents some features from dominating others due to scale
improves the overall performance of the model

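There are several strategies to do this:

Min-max scaling: $X_{\text{norm}}={\frac {X-X_{\min}}{X_{\max}-X_{\min}}}$, which maps values to the range $[0, 1]$
Z-score scaling: $X_{\text{norm}}={\frac {X-\mu }{\sigma }}$, where $\mu$ is the mean and $\sigma$ the standard deviation of the feature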
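The two formulas above translate directly into code. This is a minimal sketch for a single feature: the function names and sample heights are made up for illustration, and $\sigma$ is assumed to be the population standard deviation.

```python
from statistics import fmean, pstdev


def min_max_scale(values):
    """Map values linearly so the minimum becomes 0 and the maximum 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max scaling is undefined for a constant feature")
    return [(v - lo) / (hi - lo) for v in values]


def z_score_scale(values):
    """Center values on the mean and divide by the standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("z-score scaling is undefined for a constant feature")
    return [(v - mu) / sigma for v in values]


# Hypothetical feature: heights in centimeters.
heights = [150, 160, 170, 180, 190]
print(min_max_scale(heights))
print(z_score_scale(heights))
```

After min-max scaling every value lies in $[0, 1]$; after z-score scaling the feature has mean 0 and standard deviation 1. In practice, the scaling parameters are usually computed on the training data only and then reused for new data.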
