Exploratory data analysis
Revision as of 18:41, 3 April 2024
Exploratory data analysis (EDA) is the first step in the Machine Learning pipeline. It allows us to make informed decisions about tools used to analyze the data.
- Look at features of data
- Look at correlated features
- Find trends and unusual characteristics
Dataset description
EDA also detects unwanted values/noise that lead to inaccurate predictions.
During the EDA phase, we can also choose which model to use based on the dataset, for example by visualizing the data and checking whether the relationship between a feature and the target is linear.
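A numeric complement to a visual linearity check is the Pearson correlation coefficient. The sketch below is only an illustration: the helper name `pearson_r` and the toy data are assumptions, not part of the original notes. A value close to 1 or -1 suggests a linear relationship, while a value near 0 only rules out a linear one, so a plot is still needed.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: one linear target and one curved target.
x = [-3, -2, -1, 0, 1, 2, 3]
y_linear = [2 * v + 1 for v in x]   # exactly linear in x
y_curved = [v ** 2 for v in x]      # strongly related to x, but not linearly

r_linear = pearson_r(x, y_linear)
r_curved = pearson_r(x, y_curved)
print(r_linear, r_curved)
```

In this toy case the curved target has essentially zero correlation with `x` even though it depends on `x` completely, which is why the correlation is best read alongside a scatter plot rather than instead of one.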
Outlier
Outliers in a dataset are samples that are abnormally extreme (i.e. far from the other data points). They can reduce the accuracy of a model trained on the data. They are anomalies that can bias results or indicate incorrect data.
During the EDA phase, outliers can be detected using:
- Background knowledge such as impossible values like negative age
- Visualization such as scatter plot
- Data Analysis such as box plot
- ML algorithms such as One-Class-SVM
Common strategies for dealing with them are:
- Correct values
- Delete values
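The detection and handling steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the ages are made-up data, the 1.5 × IQR fences are the convention usually drawn as box plot whiskers, and replacing flagged values with the median of the remaining ones is just one possible way to "correct" them.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences used by the box plot rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample: ages with one impossible and one extreme value.
ages = [22, 25, 27, 29, 31, 33, 35, 38, 120, -4]

# Background knowledge: an age can never be negative.
impossible = [a for a in ages if a < 0]

# Data analysis: flag values outside the IQR fences.
lower, upper = iqr_bounds(ages)
flagged = [a for a in ages if a < lower or a > upper]

# Strategy 1: delete the flagged values.
deleted = [a for a in ages if lower <= a <= upper]

# Strategy 2 (assumed correction rule): replace flagged values
# with the median of the values that were kept.
replacement = statistics.median(deleted)
corrected = [a if lower <= a <= upper else replacement for a in ages]

print(impossible, flagged, deleted, corrected)
```

Deleting shortens the dataset, whereas correcting keeps every row at the cost of introducing values that were never observed; which is preferable depends on how much data is available and why the value is wrong.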
Skewed data tend to have outliers on the side toward which they are skewed, i.e. in the long tail.
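The direction of skew can be checked numerically with the moment-based skewness coefficient, the third central moment divided by the second central moment raised to the power 1.5. The sketch below is illustrative only: the function name `skewness` and the sample values are assumptions. A positive result points to a long right tail (extreme values on the high side), a negative result to a long left tail.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 using population moments."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 values")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return m3 / m2 ** 1.5

# Hypothetical samples.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 30]   # one extreme high value
left_skewed = [-v for v in right_skewed]          # mirror image
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed), skewness(left_skewed), skewness(symmetric))
```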