In machine learning tasks, data is often preprocessed, for example by rescaling, standardization, binarization, or normalization. Which method is more effective depends on the distribution of the data and on the algorithm used. Different algorithms make different assumptions about the data and may require different transformations; sometimes untransformed data gives relatively better results. Therefore, it is recommended to use a variety of data transformation metho
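The four transformations named above can be sketched with scikit-learn's preprocessing module. This is a minimal illustration only: the toy matrix X, the threshold of 2.0, and the choice of the L2 norm are assumptions made for the example and do not come from the original article.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, MinMaxScaler, Normalizer, StandardScaler

# Hypothetical toy data: 3 samples, 2 features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Rescaling: map each column to the range [0, 1].
rescaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Standardization: each column gets zero mean and unit variance.
standardized = StandardScaler().fit_transform(X)

# Binarization: values above the (assumed) threshold become 1, others 0.
binarized = Binarizer(threshold=2.0).fit_transform(X)

# Normalization: scale each row to unit L2 length.
normalized = Normalizer(norm="l2").fit_transform(X)

print(rescaled)
print(standardized)
print(binarized)
print(normalized)
```

Comparing a model's score on each transformed copy of the data is one simple way to see which transformation suits a given algorithm.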
clarifying that this is a classification problem, we can select a classification model (or algorithm), train the model on the training data, and then predict the corresponding class for each test sample.
Machine learning classification algorithms are numerous. In the following study we will introduce classical classification algorithms such as K-nearest neighbors, decision trees, and naive Bayes, covering their principles and implementation.
Basic Classification Model:
K Ne
If needed, skip directly to the successful installation process.
Trial and Error Course
I recently came across the scikit-learn library and wanted to call it from Python for some tests, so I happily started configuring it. Since Python 2.7 was already installed, I intended to configure it on that version. I tried methods from various online posts, and the installer finally reported success, but running import sklearn always
think, because it provides three distributed data structures, ArrayRDD, SparseRDD, and DictRDD, and adapts scikit-learn so that it can be applied to the transformed RDDs.
GitHub - databricks/spark-sklearn: Scikit-learn integration package for Spark
Finally, let's talk about spark-sklearn, developed by Databricks. Its development is still immature, and its functionality is very limited. You can use grid search to perform cross-validation
data structures: Series and DataFrame. A Series is an object similar to a NumPy array, consisting of a set of data (of various NumPy data types) and a set of data labels (that is, an index) associated with it. The index and the values can be accessed separately through the index and values attributes. If no index is specified, an integer index from 0 to N-1 is created automatically. A DataFrame is a tabular structure that contains an ordered set of columns, each of which can hold a different data type. Both row and colum
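The behaviour described above can be checked with a short pandas sketch. The values, labels, and column names here are hypothetical and chosen only for illustration.

```python
import pandas as pd

# A Series without an explicit index gets integer labels 0..N-1.
s_default = pd.Series([4, 7, -5])

# A Series with an explicit index of labels.
s_labeled = pd.Series([4, 7, -5], index=["a", "b", "c"])

# The index and the values are available as separate attributes.
print(list(s_default.index))   # integer labels
print(list(s_labeled.index))   # custom labels
print(s_labeled.values)        # underlying NumPy array

# A DataFrame holds an ordered set of columns, each with its own dtype.
df = pd.DataFrame({"name": ["x", "y"], "score": [1.5, 2.0]})
print(df.dtypes)
```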
industry for image classification with KNN, SVM, and BP neural networks. Gain deep learning experience. Explore Google's machine learning framework, TensorFlow.
Below are the implementation details.
System Design
In this project, the five algorithms used in the experiments are KNN, SVM, BP neural network, CNN, and transfer learning. We experimented in the following three ways. KNN, SVM, and BP neural networks are what we can learn in school; they are powerful and easy to deploy. So in the first step, we mainly use
weighted, and the exponential function that controls how weights decay with distance
The InfoGainAttributeEval evaluator evaluates attributes by measuring their information gain with respect to the class.
The GainRatioAttributeEval evaluator evaluates attributes by measuring their gain ratio with respect to the class.
The SymmetricalUncertAttributeEval evaluator evaluates attributes by measuring their symmetrical uncertainty with respect to the class.
The OneRAttributeEval evaluator uses the accu
Some features may be continuous variables, such as a person's height or an object's length. These features can be converted to discrete values: for example, if the height is below 160cm, the feature value is 1; between 160cm and 170cm, the feature value is 2; above 170cm, the feature value is 3. It is also possible to convert the height into 3 features, namely F1, F2, F3: if the height is below 160cm, the values of these three features are 1, 0, 0; if the height is above 17
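Both encodings of height described above can be sketched with NumPy. The sample heights are hypothetical, and the handling of the exact boundary values (160cm and 170cm are placed in the higher bin here) is an assumption, since the text does not say which side they fall on.

```python
import numpy as np

# Hypothetical heights in centimetres.
heights_cm = np.array([155.0, 165.0, 180.0])

# Bin edges from the text: below 160, 160 to 170, above 170.
bin_edges = [160.0, 170.0]

# np.digitize returns 0, 1, 2 for the three bins; add 1 to get codes 1, 2, 3.
codes = np.digitize(heights_cm, bin_edges) + 1

# Alternative encoding: three indicator features F1, F2, F3 (one-hot).
one_hot = np.eye(3, dtype=int)[codes - 1]

print(codes)
print(one_hot)
```

The single-column code keeps an ordering between the bins, while the three-column form treats them as unrelated categories; which one works better depends on the model.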
data
Convert the pclass variable into three dummy variables
Convert sex into a 0-1 variable
subdf = df[['pclass', 'sex', 'age']]
y = df.survived
# sklearn's Imputer can also be used
age = subdf['age'].fillna(value=subdf.age.mean())
# sklearn's OneHotEncoder can also be used
pclass = pd.get_dummies(subdf['pclass'], prefix='pclass')
sex = (subdf['sex'] == 'male').astype('int')
X = pd.concat([
algorithm divides n sample points into K classes so that each point belongs to the class whose centroid (the mean of all the sample points in that class) is closest to it; this is used as the clustering criterion.
For the algorithm principles, see http://www.aboutyun.com/thread-18178-1-1.html#
K-means algorithm calculation steps
Obtain K initial centers: randomly draw K points from the data as the initial cluster centers, one representing each class
Assign each point to the corresponding cla
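The steps listed above are cut off in this excerpt, so the following NumPy sketch fills in the remainder with the standard K-means loop (recompute each centroid as the mean of its points, then repeat until the centers stop moving). That continuation, the function name kmeans, its parameters, and the toy data are all assumptions for illustration rather than the original author's code.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means sketch: returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct data points as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Assumed step 3: move each center to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Assumed step 4: stop once the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Hypothetical data: two well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, labels = kmeans(X, k=2)
print(centers)
print(labels)
```

In practice sklearn.cluster.KMeans does the same job with better initialization; the hand-written loop is only meant to make the steps visible.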
This series of tutorials is suitable for newcomers to machine learning, even liberal arts students. There will be no mathematical formulas, I promise! The tutorials are based on sklearn, the Python machine learning library.
Lifting the veil on machine learning: that's all there is to it.
The first thing is to have the right tools. As the saying goes, to do a good job, one must first sharpen one's tools. Look at my next article, download the software, and then we c
I recently began learning to use scikit-learn. Writing down what I learn each day not only reminds me of what I studied that day but also makes it convenient to review.
Install scikit-learn on my Ubuntu virtual machine; the installation process is simple. Since I often use Python, my Ubuntu virtual machine already has pip, NumPy, SciPy, matplotlib, and Cython installed as dependencies.
Source address: https://github.com/scikit-learn/scikit-learn
After installing the dependencies, download Scik
follows a mean of 0 and a standard deviation of 1
2- and 3-dimensional data features can be shown more clearly in graphs
Manifold learning method: sklearn.manifold.TSNE
The method is powerful, but for statistical analysis it is more difficult to control. Taking the digits data as an example, with 64-dimensional features, mapping to 2-D enables visualization:
# Fit and transform with a TSNE
>>> from sklearn.manifold import TSNE
>>> tsne = TSNE(n_components=2, random_state=0)
>>> X_2d = tsne.fit_transform(X)
>>> # Visualize the data
>>>
()
plt.show()
The image is then displayed as follows:
3. Start experimenting with various regression methods
To speed up testing, a function is written that takes an object of any regression class, then draws the plot and reports the score.
The function is basically as follows:
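import numpy as np
import matplotlib.pyplot as plt

# x_train, y_train, x_test, y_test are assumed to be defined earlier
def try_different_method(clf):
    clf.fit(x_train, y_train)
    score = clf.score(x_test, y_test)
    result = clf.predict(x_test)
    plt.figure()
    plt.plot(np.arange(len(result)), y_tes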
1. Confusion Matrix
This is a confusion matrix for a two-class problem, in which the outputs use different class labels.
Commonly used metrics to measure classification performance are:
Precision, which equals TP/(TP+FP), gives the proportion of true positive examples among the samples predicted to be positive.
Recall, which equals TP/(TP+FN), gives the correctly predicted positive examples as a proportion of al
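The two formulas above can be verified on a small example. The labels and predictions below are made up for illustration; the counts are read from scikit-learn's confusion_matrix and the results are cross-checked against precision_score and recall_score.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical ground truth and predictions for a two-class problem.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

# For labels {0, 1}, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # TP / (TP + FP)
recall = tp / (tp + fn)     # TP / (TP + FN)

print(tn, fp, fn, tp)
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```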
the positive case), and TPR is very high. Assuming threshold=0.1, FPR=TPR=1, and you can draw the first point in the coordinate system for that threshold. In the same way, as the threshold increases, FPR and TPR gradually decrease and finally reach 0. The shape of the curve is usually as follows:
If the classifier works well, the curve should be close to the upper-left corner (TPR large, FPR small).
The AUC (Area Under Curve) is defined as the size of the ROC
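The threshold sweep described above is what scikit-learn's roc_curve performs. The following sketch uses four made-up scores; it is meant only to show how the (FPR, TPR) points and the area under them are obtained, not to reproduce the original article's data.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical labels and classifier scores.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve lists thresholds from high to low, so FPR and TPR rise
# from 0 towards 1 as the threshold is lowered.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve, computed two equivalent ways.
area = auc(fpr, tpr)
area_direct = roc_auc_score(y_true, y_score)

for t, f, p in zip(thresholds, fpr, tpr):
    print(f"threshold={t:.2f}  FPR={f:.2f}  TPR={p:.2f}")
print(area, area_direct)
```

A perfect ranking gives an AUC of 1, while random scores give about 0.5.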
by CTR, accuracy, AUC and other methods. But the evaluation of marketing is more complex, with problems in the following four aspects:
1) The goal is to improve overall profit and customer satisfaction, and even to consider long-term benefits; it is not a simple binary classification or regression problem
2) Marketing usually includes multiple decisions and more complex policy processes than recommendation, with the ultimate effect depending on the optim
import fregata.spark.data.LibSvmReader
import fregata.spark.metrics.classification.{AreaUnderRoc, Accuracy}
import fregata.spark.model.classification.LogisticRegression
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by all on 2016/12/8.
 */
object FregataFirstTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Test").setMaster("local")
    val sc = new SparkContext(conf)
    // Read data via the Fregata API
    val (_, trainData) = LibSvmReader.read(sc, /fre
and whether the diagnosis needs further correction. When building a credit scorecard, its effectiveness must be examined through the following model validation measures:
Gini coefficient
Kolmogorov-Smirnov value (hereinafter referred to as the K-S value)
Area under the ROC curve (AUC)
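All three measures can be computed from one ROC curve, as the sketch below shows on made-up scores. Two conventions are assumed here and are not taken from the original article: the K-S value is computed as the largest gap between TPR and FPR, and the Gini coefficient is taken as 2 × AUC − 1, a relationship commonly used in scorecard validation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical data: 1 marks the event of interest, scores are model outputs.
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
y_score = np.array([0.05, 0.2, 0.3, 0.35, 0.5, 0.6, 0.8, 0.9])

fpr, tpr, _ = roc_curve(y_true, y_score)

# AUC: area under the ROC curve.
auc_value = roc_auc_score(y_true, y_score)

# K-S value (assumed definition): maximum distance between the two
# cumulative rates, i.e. max(TPR - FPR) over all thresholds.
ks_value = np.max(tpr - fpr)

# Gini coefficient (assumed convention): Gini = 2 * AUC - 1.
gini = 2.0 * auc_value - 1.0

print(auc_value, ks_value, gini)
```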
1. The Gini coefficient:
The curve that bends downward in the middle, known as the Lorenz curve, is a standard chart used