Dataset
Latest revision as of 06:37, 26 April 2024
In machine learning, a model operates on a dataset.
Performance attributes
Several attributes determine how well a dataset suits a given problem.
The completeness of a dataset is the extent to which it contains all relevant features necessary for a given task.
A dataset also needs a sufficient number of observations; this count is the size of the dataset.
The validity of a dataset is how accurate, clean, and relevant its data is.
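The three attributes above can be estimated with a few lines of code. The following is a minimal sketch, assuming the dataset is a list of dictionaries; the function name `dataset_report`, the `required_features` parameter, and the use of the share of non-missing values as a rough stand-in for validity are all illustrative assumptions, not a standard definition.

```python
def dataset_report(rows, required_features):
    """Summarise size, completeness, and a rough validity proxy.

    rows: list of dicts, one dict per observation (an assumed layout).
    required_features: names of the features the task is assumed to need.
    """
    required = list(required_features)
    if not required:
        raise ValueError("required_features must not be empty")

    # Size: the number of observations.
    size = len(rows)

    # Completeness: share of required features that appear in the dataset at all.
    present = set()
    for row in rows:
        present.update(row)
    missing_features = sorted(f for f in required if f not in present)
    completeness = 1 - len(missing_features) / len(required)

    # Validity proxy (assumption): share of required cells holding a non-None value.
    total_cells = size * len(required)
    filled = sum(1 for row in rows for f in required if row.get(f) is not None)
    filled_fraction = filled / total_cells if total_cells else 0.0

    return {
        "size": size,
        "missing_features": missing_features,
        "completeness": completeness,
        "filled_fraction": filled_fraction,
    }


rows = [
    {"age": 30, "income": 50000},
    {"age": None, "income": 42000},
]
report = dataset_report(rows, ["age", "income", "city"])
```

Here `report["missing_features"]` lists `city`, because no observation records it. A real validity check would also look at value ranges, duplicates, and relevance, which this sketch does not attempt.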
Usage attributes
Some attributes of a dataset determine how we use it.
A dataset can be high dimensional, meaning that it has a very large number of features, which can make computation difficult.
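One way to see why many features make computation difficult: any step that compares every feature with every other one, such as a pairwise correlation table, grows quadratically with the number of features. The sketch below only counts those pairs; the function name `feature_pair_count` is hypothetical.

```python
def feature_pair_count(num_features):
    """Number of distinct unordered feature pairs, d * (d - 1) / 2."""
    if num_features < 0:
        raise ValueError("num_features must be non-negative")
    return num_features * (num_features - 1) // 2


# Ten features give 45 pairs; a thousand features give 499500.
small = feature_pair_count(10)
large = feature_pair_count(1000)
```

Going from 10 to 1000 features multiplies the feature count by 100 but the pair count by roughly 11,000.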
The linearity of a dataset, that is, whether the relationship between the features and the target is approximately linear, is a major factor in choosing a model.
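A quick, rough check for linearity between one feature and the target is the Pearson correlation coefficient. The sketch below is one possible diagnostic, not the only one; `pearson_r` is a name chosen here for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# A perfectly linear relationship: y = 2x + 1.
linear = pearson_r([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])

# A perfectly quadratic relationship on symmetric inputs: y = x^2.
quadratic = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
```

The linear case gives a coefficient of about 1, while the quadratic case gives about 0 even though the target is fully determined by the feature. A value near 0 therefore signals a lack of linear structure, not a lack of structure, which is the situation where a non-linear model may be the better fit.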
Outliers are samples that lie at an abnormal distance from the other samples. They can reduce the accuracy of a model.
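What counts as an "abnormal distance" depends on the rule chosen. The sketch below uses one common convention, the 1.5 × IQR rule, as an assumption; the function name `iqr_outliers` and the multiplier `k` are illustrative, and other rules such as z-scores are equally valid choices.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


readings = [10, 12, 11, 13, 12, 11, 95]
flagged = iqr_outliers(readings)
```

For `readings`, the quartiles are 11 and 13, so the accepted range is 8 to 16 and only 95 is flagged.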
The skewness of a dataset measures how asymmetric its distribution is, and its sign indicates the side on which the long tail, and so the most extreme values, lies. It also impacts model accuracy.
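Skewness can be computed as the third standardized moment. The sketch below uses the population (biased) form; the function name `skewness` is chosen here for illustration, and statistical libraries may apply a sample-size correction that gives slightly different values.

```python
def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return m3 / m2 ** 1.5


symmetric = skewness([1, 2, 3, 4, 5])        # no tail on either side
right_tail = skewness([1, 1, 2, 2, 3, 10])   # extreme value on the high side
left_tail = skewness([-1, -1, -2, -2, -3, -10])  # mirror image
```

A positive result means the long tail is on the high side, a negative result means it is on the low side, and a symmetric sample gives zero.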