Dataset

From Rice Wiki

In machine learning, a model operates on a dataset.

Performance attributes

Several attributes determine how well suited a dataset is to a given problem.

The completeness of a dataset is the extent to which it contains all relevant features necessary for a given task.

A dataset also needs a sufficient number of observations. The number of observations is the size of the dataset.

The validity of the dataset is how accurate, clean, and relevant the data in the dataset is.
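
The three attributes above can be checked roughly in code. The following is a minimal sketch, not a standard procedure: the function name summarize_dataset, the list-of-dicts layout, and the feature names are all assumptions made for illustration. It reports the size, the required features that are absent (completeness), and the fraction of rows with no missing values (one narrow aspect of validity).

```python
def summarize_dataset(rows, required_features):
    """Return rough size, completeness, and validity indicators.

    rows: list of dicts, one per observation (hypothetical layout).
    required_features: feature names the task is assumed to need.
    """
    # A feature counts as present if it appears as a key in at least one row.
    present = set()
    for row in rows:
        present.update(row.keys())
    missing_features = sorted(set(required_features) - present)

    # Only check values for the required features that actually exist.
    checked = [name for name in required_features if name in present]
    clean_rows = sum(
        all(row.get(name) is not None for name in checked)
        for row in rows
    )
    return {
        "size": len(rows),
        "missing_features": missing_features,
        "clean_fraction": clean_rows / len(rows) if rows else 0.0,
    }


rows = [
    {"height": 1.70, "weight": 65.0},
    {"height": 1.82, "weight": None},  # missing value
    {"height": 1.65, "weight": 58.5},
    {"height": 1.75, "weight": 72.0},
]
report = summarize_dataset(rows, required_features=["height", "weight", "age"])
print(report)
```

Here the hypothetical "age" feature is reported as missing, and one of the four rows is flagged as not clean. Whether a size of four is sufficient depends entirely on the task.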

Usage attributes

Some attributes of a dataset determine how we use it.

A dataset can be high dimensional, meaning that it has a very large number of features, which can make computation expensive.

The linearity of a dataset, that is, whether the relationship between its features and the target is approximately linear, is a major factor in choosing a model.
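
One rough way to gauge linearity for a single feature is the Pearson correlation coefficient between that feature and the target. The sketch below is only a heuristic under that assumption: the helper pearson_r and the toy data are illustrative, and a value near zero does not rule out a strong non-linear relationship.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [-2, -1, 0, 1, 2]
linear_target = [2 * x + 1 for x in xs]   # exactly linear in x
quadratic_target = [x * x for x in xs]    # depends on x, but not linearly

r_linear = pearson_r(xs, linear_target)
r_quadratic = pearson_r(xs, quadratic_target)
print(r_linear, r_quadratic)
```

The linear target gives a coefficient of magnitude close to 1, which suggests a linear model may fit. The quadratic target gives a coefficient close to 0 even though it is fully determined by the feature, which is a hint that a non-linear model or a transformed feature may be needed.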

Outliers are samples that lie abnormally far from the other samples. They can distort the fitted model and reduce its accuracy.
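
What counts as an "abnormal distance" is a matter of convention. A minimal sketch using one common convention, the 1.5 × IQR rule, is shown below. The function name find_outliers, the factor 1.5, and the sample values are assumptions for illustration, not part of the definition above.

```python
import statistics


def find_outliers(values, factor=1.5):
    """Return the values lying outside the interquartile-range fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    return [v for v in values if v < lower or v > upper]


values = [10, 12, 11, 13, 12, 11, 10, 100]
outliers = find_outliers(values)
print(outliers)

# The single extreme value pulls the mean far away from the typical sample.
mean_with = statistics.mean(values)
mean_without = statistics.mean(v for v in values if v not in outliers)
print(mean_with, mean_without)
```

In this toy data the value 100 is flagged, and removing it moves the mean from above 22 to below 12, which illustrates how a single outlier can dominate a statistic that a model depends on.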

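The skewness of a dataset measures how asymmetric the distribution of its values is. Its sign indicates the side on which the longer tail, and therefore most of the extreme values, lies. It also impacts model accuracy.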
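
Skewness can be computed in several ways. The sketch below assumes the moment-based (Fisher-Pearson) coefficient, the third central moment divided by the second central moment raised to the power 3/2. The function name skewness and the sample data are illustrative.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


right_tailed = [1, 2, 2, 3, 3, 3, 20]       # one large value on the right
left_tailed = [-v for v in right_tailed]    # mirror image
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed))  # positive: longer tail on the right
print(skewness(left_tailed))   # negative: longer tail on the left
print(skewness(symmetric))     # zero for symmetric data
```

A positive value means the extreme values sit mostly above the bulk of the data, and a negative value means they sit mostly below it.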