Exploratory data analysis (EDA) is the first step in the machine learning pipeline. It allows us to make informed decisions about the tools used to analyze the data.
Dataset description
During the EDA phase, we can make choices regarding which model to use depending on the attributes of a dataset.
Other tasks
EDA also helps detect unwanted values (noise) that can lead to inaccurate predictions.
Outlier
Outliers in a dataset are samples that are abnormally extreme, i.e. far from the other data points. They are anomalies that can bias the model trained on the data, reduce its overall accuracy, or indicate incorrect data.
During the EDA phase, outliers can be detected using (a short code sketch follows the lists below):
- Background knowledge, e.g. impossible values such as a negative age
- Visualization, such as scatter plots
- Data analysis, such as box plots
- ML algorithms, such as One-Class SVM
Once detected, common strategies to deal with outliers include:
- Correcting the values
- Deleting the values or affected samples
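The following is a minimal sketch of box-plot-style detection using the interquartile range (IQR) rule. It assumes NumPy is available; the iqr_outliers helper and the sample ages array are hypothetical and purely illustrative.

import numpy as np

def iqr_outliers(x, k=1.5):
    # Flag values outside the box-plot whiskers [Q1 - k*IQR, Q3 + k*IQR]
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

ages = np.array([23, 25, 27, 31, 29, 30, 26, 120, -4])
print(ages[iqr_outliers(ages)])  # abnormally extreme values: 120 and -4
print(ages[ages < 0])            # impossible values caught by background knowledge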
Skewed data tends to have outliers on the side towards which it is skewed, as the brief check below illustrates.
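As a quick illustration (assuming NumPy and SciPy are available; the lognormal sample is made up for the example), a right-skewed sample has its IQR-rule outliers almost entirely above the upper whisker:

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # right-skewed data

q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1
print(skew(sample) > 0)                 # True: positive (right) skew
print(np.sum(sample < q1 - 1.5 * iqr))  # left-side outliers: typically none
print(np.sum(sample > q3 + 1.5 * iqr))  # right-side outliers: typically dozens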
Data Normalization
Normalization is a technique used to bring numeric features to a common scale. It
- helps in faster convergence
- prevents some features from dominating others due to scale
- improves the overall performance of the model
There are several strategies to do this:
- Min-max scaling: X_norm = (X - X_min) / (X_max - X_min)
- Z-score scaling: X_norm = (X - μ) / σ
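A minimal sketch of both scalings with NumPy follows; the two-feature matrix X is made up for the example, and libraries such as scikit-learn provide equivalent scalers (MinMaxScaler, StandardScaler).

import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Min-max scaling: each feature (column) rescaled to [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Z-score scaling: each feature centered at 0 with unit standard deviation
X_zscore = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_zscore)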