Machine learning algorithms operate on data. The nature of the data determines whether a given algorithm is suitable, and the quality of the data limits how well the algorithm can perform. It is therefore important to study and analyze data. This article, the first in a series on studying data, describes four of the most popular machine learning datasets.
Iris
Iris, also known as the Iris flower dataset, is a multivariate dataset. Four attributes (sepal length, sepal width, petal length, and petal width) are used to predict which of three Iris species a flower belongs to: setosa, versicolour, or virginica.
Dataset characteristics: | Multivariate |
Number of records: | 150 |
Subject area: | Life |
Attribute types: | Real |
Number of attributes: | 4 |
Date donated: | 1988-07-01 |
Associated tasks: | Classification |
Missing values? | None |
Website hits: | 563347 |
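As a rough illustration of how the four attributes relate to the species, the sketch below parses a handful of records and averages each attribute per species. The comma-separated layout, the `Iris-` label prefix, the column names in `FEATURES`, and the sample rows themselves are assumptions made for this example; check them against the actual file before relying on them.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Illustrative rows in the layout assumed for the Iris file:
# sepal length, sepal width, petal length, petal width (cm), species.
SAMPLE = """\
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
"""

# Hypothetical column names chosen for readability.
FEATURES = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def load_rows(text):
    """Parse CSV text into (feature_values, species) pairs, skipping blank lines."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue
        *values, species = record
        rows.append(([float(v) for v in values], species))
    return rows


def mean_by_species(rows):
    """Return {species: {feature_name: mean_value}}."""
    grouped = defaultdict(list)
    for values, species in rows:
        grouped[species].append(values)
    return {
        species: {name: mean(col) for name, col in zip(FEATURES, zip(*vals))}
        for species, vals in grouped.items()
    }


rows = load_rows(SAMPLE)
summary = mean_by_species(rows)
for species, stats in summary.items():
    print(species, round(stats["petal_length"], 2))
```

Even on this tiny sample, petal length separates the three species clearly, which is one reason the dataset is a common first classification exercise.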
Adult
This dataset was extracted from the 1994 U.S. Census database and can be used to predict whether a resident's income exceeds $50K per year. The class variable indicates whether annual income exceeds $50K. The attribute variables include age, type of work, education, occupation, race, and other relevant information. Notably, 8 of the 14 attribute variables are categorical.
Dataset characteristics: | Multivariate |
Number of records: | 48842 |
Subject area: | Social |
Attribute types: | Categorical, Integer |
Number of attributes: | 14 |
Date donated: | 1996-05-01 |
Associated tasks: | Classification |
Missing values? | Yes |
Website hits: | 393977 |
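Because this dataset has missing values and a two-valued class variable, a typical first step is to filter incomplete records and turn the income class into a 0/1 label. The sketch below shows one way to do that. The records are hypothetical and shortened to five fields, and the use of `?` as the missing-value marker and of `>50K` / `<=50K` as the class strings are assumptions to verify against the real file.

```python
# Hypothetical, shortened records: (age, workclass, education, occupation, income).
# The real dataset has 14 attributes; these rows only illustrate the idea.
RECORDS = [
    (39, "State-gov", "Bachelors", "Adm-clerical", "<=50K"),
    (52, "Self-emp-not-inc", "HS-grad", "Exec-managerial", ">50K"),
    (18, "?", "Some-college", "?", "<=50K"),
    (31, "Private", "Masters", "Prof-specialty", ">50K"),
]

MISSING = "?"  # assumed missing-value marker


def clean(value):
    """Return None for the missing-value marker, otherwise the value unchanged."""
    if isinstance(value, str) and value.strip() == MISSING:
        return None
    return value


def prepare(records):
    """Drop records with missing attributes and encode income as 0 or 1."""
    complete, dropped = [], 0
    for *attrs, income in records:
        attrs = [clean(a) for a in attrs]
        if any(a is None for a in attrs):
            dropped += 1
            continue
        label = 1 if income.strip() == ">50K" else 0
        complete.append((attrs, label))
    return complete, dropped


examples, dropped = prepare(RECORDS)
print(len(examples), "complete rows;", dropped, "dropped")
print([label for _, label in examples])
```

Dropping incomplete rows is only the simplest option; imputing a value or treating "missing" as its own category are alternatives that keep more data.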
Wine
This dataset contains 178 records of wines from 3 different origins. The 13 attributes are chemical constituents of the wine, so chemical analysis can be used to infer a wine's origin. Notably, all attribute variables are continuous.
Dataset characteristics: | Multivariate |
Number of records: | 178 |
Subject area: | Physical |
Attribute types: | Integer, Real |
Number of attributes: | 13 |
Date donated: | 1991-07-01 |
Associated tasks: | Classification |
Missing values? | None |
Website hits: | 337319 |
Car Evaluation
This dataset concerns automobile evaluation. The class variable is the evaluation of the car, with values (unacc, acc, good, vgood) standing for (unacceptable, acceptable, good, very good). The six attribute variables are "purchase price", "maintenance cost", "number of doors", "passenger capacity", "trunk size", and "safety". Notably, all six attribute variables are ordered categorical variables. For example, "passenger capacity" can take the values "2, 4, more", and "safety" can take the values "low, med, high".
Dataset characteristics: | Multivariate |
Number of records: | 1728 |
Subject area: | N/A |
Attribute types: | Categorical |
Number of attributes: | 6 |
Date donated: | 1997-06-01 |
Associated tasks: | Classification |
Missing values? | None |
Website hits: | 272901 |
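Since every attribute here is an ordered category, one simple conversion is to replace each value with its rank within that attribute. The sketch below is a minimal version of that idea. The level lists for passenger capacity and safety follow the description above; the remaining level lists and all column names (`buying`, `maint`, `doors`, `persons`, `lug_boot`, `safety`) are assumptions for illustration.

```python
# Ordered levels per attribute, lowest to highest.
# "persons" and "safety" follow the text; the rest are assumed.
LEVELS = {
    "buying": ["low", "med", "high", "vhigh"],
    "maint": ["low", "med", "high", "vhigh"],
    "doors": ["2", "3", "4", "5more"],
    "persons": ["2", "4", "more"],
    "lug_boot": ["small", "med", "big"],
    "safety": ["low", "med", "high"],
}

# Precompute value -> rank lookups, e.g. RANK["safety"]["med"] == 1.
RANK = {
    attr: {level: i for i, level in enumerate(levels)}
    for attr, levels in LEVELS.items()
}


def encode(record):
    """Map a dict of categorical values to a list of integer ranks."""
    return [RANK[attr][record[attr]] for attr in LEVELS]


# A hypothetical car record.
car = {
    "buying": "vhigh",
    "maint": "med",
    "doors": "4",
    "persons": "more",
    "lug_boot": "small",
    "safety": "high",
}
print(encode(car))  # [3, 1, 2, 2, 0, 2]
```

Rank encoding preserves the order of the levels but also implies equal spacing between them; one-hot encoding avoids that implication at the cost of discarding the order.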
Summary
Comparing the four datasets above, we can summarize briefly: when a large amount of data is needed for testing, consider "Adult"; to study correlations between variables, choose "Iris" or "Wine", whose attribute values are all integers or real numbers; to study logistic regression, choose "Adult", whose class variable takes only two values; to study the conversion of categorical variables, choose "Car Evaluation", whose attribute variables are ordered categories. Going further requires a closer look at these datasets.
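To make the correlation use case concrete, here is a small sketch that computes the Pearson correlation coefficient between two real-valued attributes. The two value lists are illustrative measurements in the style of Iris petal length and petal width, not a full column from the dataset, and `pearson` is a helper written for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Illustrative values only (cm).
petal_length = [1.4, 1.4, 4.7, 4.5, 6.0, 5.1]
petal_width = [0.2, 0.2, 1.4, 1.5, 2.5, 1.9]

r = pearson(petal_length, petal_width)
print(round(r, 3))
```

A coefficient close to 1 or -1 indicates a strong linear relationship; this only makes sense for numeric attributes, which is why "Iris" and "Wine" suit this kind of study.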
Source of the datasets above: http://archive.ics.uci.edu/ml/