Machine learning needs data to train on, and fortunately Sklearn provides a number of well-labeled datasets for training.
This section looks at which datasets Sklearn makes available.
They are in the datasets module, documented at: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets
House price data
Load the Boston house price data, which can be used for linear regression:
sklearn.datasets.load_boston: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html#sklearn.datasets.load_boston
It is loaded like this:
from sklearn.datasets import load_boston
boston = load_boston()
print(boston.data.shape)
The shape of this data set is:
(506, 13)
That is, 506 rows and 13 columns. The 13 columns are the 13 attributes that affect the house price; their names can be printed with the following code:
print(boston.feature_names)
The output is:
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']
As for what each one means, you can either guess or look it up online. I will not explain them all, but here are a couple of guesses: RM is probably the number of rooms in the house, and AGE is probably the age of the house. I am not sure these guesses are right, so check them in practice.
You may ask how I knew this dataset has a feature_names attribute. I did not; I simply printed the whole boston object above and saw that the attribute was there.
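Printing the whole object works, but the returned object can also be asked for its fields directly. The following is a minimal sketch of that idea. It uses the iris data as a stand-in, because load_boston is missing from newer scikit-learn releases, and the variable names are only illustrative. The same keys() call should work on the other bundled load_* datasets.

```python
from sklearn.datasets import load_iris

# load_iris is used as a stand-in dataset for this sketch
iris = load_iris()

# The returned object behaves like a dictionary, so its fields can be listed
field_names = sorted(iris.keys())
print(field_names)

# Fields are also reachable as attributes, for example feature_names
feature_names = list(iris.feature_names)
print(feature_names)
print(iris.data.shape)
```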
Price prediction example
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load the house price data
boston = load_boston()
data_X = boston.data
data_y = boston.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data_X, data_y, test_size=0.3)
# Create a linear regression model
model = LinearRegression()
# Train the model
model.fit(X_train, y_train)
# Print the first 5 predicted prices
print("First 5 predicted prices:")
print(model.predict(X_test)[:5])
# Print the first 5 actual prices in the test set
print("First 5 actual prices in the test set:")
print(y_test[:5])
Output:
First 5 predicted prices:
[17.44807408 27.78251433 18.8344117 17.85437188 34.47632703]
First 5 actual prices in the test set:
[14.3 22.3 22.6 20.6 34.9]
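Taking the first entry in this result as an example, the model predicts a price of about 17.4, while the actual price is 14.3. The target values in this dataset are in units of $1,000, so that is roughly $17,400 predicted against $14,300 actual.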
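Comparing only the first entry says little about the model as a whole. As a rough sketch, the five predicted and actual values printed above can be summarized with a mean absolute error. The two lists are copied from that output, and the variable names are only illustrative.

```python
# First 5 predicted and actual prices, copied from the output above
predicted = [17.44807408, 27.78251433, 18.8344117, 17.85437188, 34.47632703]
actual = [14.3, 22.3, 22.6, 20.6, 34.9]

# Absolute error for each of the 5 houses
abs_errors = [abs(p - a) for p, a in zip(predicted, actual)]

# Mean absolute error over the 5 samples
mae = sum(abs_errors) / len(abs_errors)
print(round(mae, 2))  # 3.11
```

On a full test set, sklearn.metrics.mean_absolute_error(y_test, model.predict(X_test)) computes the same quantity over every sample instead of only five.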
To be honest, though, this price data is only good for testing algorithms. When we really want to predict prices, the raw data is nowhere near this complete and regular, so data collection and cleaning are also very important work in machine learning. The dirty, tedious work has to be done too; an algorithm alone is of no use.
The iris flower data was already covered earlier, so it is not repeated here.
Handwritten digit recognition data
The handwritten digit recognition data is also very commonly used: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits
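A minimal loading sketch, mirroring the house price example: load_digits returns the flattened pixel data, the 8x8 images, and the digit labels. The variable names here are only illustrative.

```python
from sklearn.datasets import load_digits

digits = load_digits()

# Each row of data is one 8x8 image flattened into 64 pixel values
print(digits.data.shape)
# images keeps the same pixels in their 8x8 layout
print(digits.images.shape)
# target holds the digit that each image represents
labels = sorted(set(digits.target.tolist()))
print(labels)
```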
Create Sample Data
You can also generate synthetic data. The generators are listed in the Samples generator section of the API documentation on the official website.
The example source code is:
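from sklearn.datasets import make_regression
import matplotlib.pyplot as plt

# Create 100 samples with 1 feature and 1 target value, with some noise added
X, y = make_regression(n_samples=100, n_features=1, n_targets=1, noise=10)
print(X.shape)
print(y.shape)
# Draw a scatter plot of X and y to see what the data looks like
plt.scatter(X, y)
plt.show()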
The output is:
(100, 1)
(100,)
That is, X has 100 rows and 1 column, and y holds 100 values.
The output graph, a scatter plot of X against y, looks close to a straight line.
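One way to check the straight-line impression without a plot is to fit a linear model and compare its slope with the coefficient that make_regression actually used. This is only a sketch under assumptions: coef=True and random_state=0 are added here for reproducibility and are not part of the original example, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# coef=True also returns the true coefficient used to generate y;
# random_state=0 is an assumption added so the run is repeatable
X, y, true_coef = make_regression(
    n_samples=100, n_features=1, n_targets=1, noise=10,
    coef=True, random_state=0,
)

# Fit a straight line to the generated points
model = LinearRegression()
model.fit(X, y)

# Compare the fitted slope with the slope used to generate the data
fitted_slope = float(model.coef_[0])
true_slope = float(np.ravel(true_coef)[0])
print(fitted_slope, true_slope)
```

If the two slopes are close, the noise is small relative to the underlying linear trend, which matches what the scatter plot suggests.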
Sklearn datasets - "Old Fish Learns Sklearn"