Machine Learning: Difference between revisions
No edit summary |
No edit summary |
||
Line 8: | Line 8: | ||
First, ''preprocessing'' (such as data cleaning and sampling) is done to make the data useful. | First, ''preprocessing'' (such as data cleaning and sampling) is done to make the data useful. | ||
Then, ''exploratory data analysis (EDA)'' allows us to understand the data and determine what types of algorithm to employ. | Then, ''exploratory data analysis (EDA),'' such as data visualization, allows us to understand the data and determine what types of algorithm to employ. | ||
Thirdly, ''feature selection'' selects the important feature to reduce overfitting and improve accuracy of the model. | |||
= Constructing ML Models = | |||
The '''training dataset''' is the data used for training a model, containing a set of observations composed of a set of features. In a regression/classification model, we gain a hypothesis function <math>y = g(x)</math> such that it predicts the target variable ''y''. | |||
The '''test dataset''' is the data used to test the performance of the trained model. | |||
[[Category:Computer Science]] | [[Category:Computer Science]] |
Revision as of 18:43, 1 April 2024
Rule-based systems follows a set of pre-defined rules defined by experts to cover all scenarios to automate the decision making process. This is not sufficient for complex systems. Machine learning (ML) builds models to identify and predict patterns, make decisions, and automate processes.
Flow
Unstructured: Images, sentences
Unstructured data is not understood by machines without some algorithms to process it. Structured data is machine-readable.
First, preprocessing (such as data cleaning and sampling) is done to make the data useful.
Then, exploratory data analysis (EDA), such as data visualization, allows us to understand the data and determine what types of algorithm to employ.
Thirdly, feature selection selects the important feature to reduce overfitting and improve accuracy of the model.
Constructing ML Models
The training dataset is the data used for training a model, containing a set of observations composed of a set of features. In a regression/classification model, we gain a hypothesis function such that it predicts the target variable y.
The test dataset is the data used to test the performance of the trained model.