Clustering divides a set of objects into groups (subsets) so that members of the same group share similar attributes. Cluster analysis can be regarded as an unsupervised learning technique.
In Spark 2.0 (the DataFrame-based spark.ml API, as opposed to the RDD-based MLlib), there are four clustering methods:
(1) K-means
(2) Latent Dirichlet allocation (LDA)
(3) Bisecting k-means
(4) Gaussian Mixture Model (GMM)
There are six clustering methods in the RDD-based MLlib API:
(1) K-means
(2) Gaussian mixture
(3) Power iteration clustering (PIC)
(4) Latent Dirichlet allocation (LDA)
(5) Bisecting k-means
(6) Streaming k-means
Of these, Power iteration clustering (PIC) and Streaming k-means are available only in the RDD-based API.
The most commonly used is the K-means algorithm.
K-means is a partitioning clustering method. The idea of the algorithm is to iteratively search for cluster centers that minimize the sum of squared distances between each sample and the center of the cluster it is assigned to.
As an iterative algorithm of the partitioning type, K-means first creates K partitions and then repeatedly moves samples from one partition to another to improve the quality of the final clusters.
The K-means algorithm models clustering problems simply, is easy to understand, and can run in parallel in a distributed environment. Studying K-means also makes it easier to understand the strengths and weaknesses of clustering algorithms in general, and how efficient other algorithms are on specific data.
K in the K-means algorithm is the number of clusters, and it must be specified by the user. If you cluster news into top-level categories such as politics, economics, and culture, you can choose a K between 10 and 20, because the number of such top categories is small. If you want to classify the news in finer detail, choosing a number between 50 and 100 is also fine. The K-means algorithm can be divided into three steps.
The first step is to randomly select K samples from the points to be clustered as the initial cluster centers.
The second step is to compute the distance from each point to each cluster center, and assign each point to the cluster whose center is closest.
The third step is to compute the coordinate mean of all points in each cluster and use that mean as the new cluster center.
Repeat the second and third steps until the cluster centers no longer move significantly, or the maximum number of iterations is reached.
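The three steps above can be sketched in plain Python (a minimal illustration of the algorithm itself, not Spark's implementation; the function and variable names are chosen for this example):

```python
import random

def kmeans(points, k, max_iter=100, tol=1e-6):
    """Minimal k-means sketch. `points` is a list of numeric tuples."""
    # Step 1: randomly select K samples as the initial cluster centers.
    centers = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster with the nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Step 3: recompute each center as the mean of its cluster's points.
        new_centers = []
        for cluster, old in zip(clusters, centers):
            if cluster:
                new_centers.append(tuple(sum(dim) / len(cluster)
                                         for dim in zip(*cluster)))
            else:
                new_centers.append(old)  # keep the old center for empty clusters
        # Stop when centers no longer move significantly.
        shift = max(sum((a - b) ** 2 for a, b in zip(c1, c2))
                    for c1, c2 in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, clusters
```

In practice you would call Spark MLlib's distributed KMeans rather than a single-machine loop like this; the sketch only makes the three steps concrete.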
Usage scenarios for clustering algorithms
1. Commercial location based on user location information
With the rapid development of information technology, mobile devices and the mobile Internet have reached almost every household. When users use the mobile network, their location information is naturally left behind. With the continuous improvement of GIS (geographic information system) technology in recent years, combining user locations with GIS data enables innovative applications. For example, Baidu and Wanda cooperate to improve business performance by locating users and, together with Wanda's merchant information, pushing location-based marketing services to them.
The goal is to select a new restaurant location for a catering chain based on the location information of a large number of mobile-device users.
2. Chinese address standardization processing
An address is a variable that carries a wealth of information. However, because of the complexity of Chinese text processing and the long-standing lack of standardization in Chinese address naming, the rich information contained in addresses cannot be mined through deep analysis. Standardizing addresses makes multi-dimensional, address-based mining and analysis possible, providing richer methods and means for e-commerce mining applications in different scenarios, so it has important practical significance.
3. Non-human malicious traffic identification
In the first quarter of 2016, Facebook issued a report saying that six months of traffic-quality testing on its Atlas DSP platform showed that malicious traffic generated by bot simulation and blacklisted IPs was as high as 75%. In the first half of 2016, data from AdMaster's anti-cheat solution likewise confirmed that, on average, up to 28% of daily traffic can be fraudulent. Low-quality fake traffic has always existed, and it is a problem the digital marketing industry has been battling for the past decade. Based on AdMaster's massive monitoring data, more than 50% of projects show suspected cheating; across projects, cheating traffic accounts for 5% to 95% of ad impressions; among media types, vertical and network media have the highest proportion of cheating traffic, significantly higher than mobile and smart TV platforms. Ad-monitoring behavioral data is increasingly used for modeling and decision making, such as building user profiles and identifying the same user across devices. Cheating, malicious impressions, web crawlers, misleading clicks, and even actions triggered on users' devices without their awareness introduce huge noise into the data and strongly distort model training.
The goal is to build a model from the given data that identifies and flags cheating traffic and removes this noise, so that the data can be better used to maximize advertisers' interests.
Collaborative Filtering (CF) is defined as follows: the preferences of a group of people with similar interests and shared experience are used to recommend information of interest to a user. Through a cooperative mechanism, individuals record their responses (such as ratings) to information, and these records are used for filtering, which in turn helps others filter information. Responses are not necessarily limited to information of particular interest; records of particularly uninteresting information are also important.
Collaborative filtering is often applied to recommendation systems. These techniques are designed to complement the missing parts of the user-commodity correlation matrix.
MLlib currently supports model-based collaborative filtering, in which users and commodities are represented by a small set of latent factors, and these factors are used to predict missing entries. MLlib uses alternating least squares (ALS) to learn these latent factors.
The user's preference for the item or information may include the user's rating of the item, the user's view of the item, the user's purchase record, etc., depending on the application itself. In fact, the preference information of these users can be divided into two categories:
Explicit user feedback: information the user provides as explicit feedback beyond naturally browsing or using the website, such as ratings of items or reviews of items.
Implicit user feedback: data generated as the user uses the website, implicitly reflecting the user's preferences for items, such as purchasing an item or viewing an item's information.
Explicit feedback accurately reflects users' real preferences for items but requires extra effort from users. Implicit behavior, after some analysis and processing, can also reflect user preferences, but the data is less accurate, and the analysis of some behaviors carries a lot of noise. Still, as long as you choose the right behavioral features, implicit feedback can also produce good results, although the appropriate features may differ greatly between applications. For example, on an e-commerce website, purchase behavior is in fact implicit feedback that indicates user preference well.
Depending on its recommendation mechanism, a recommendation engine may use part of the data sources and, based on that data, analyze certain patterns or directly predict the user's preferences for other items, so that it can recommend items the user may be interested in.
In MLlib's model-based collaborative filtering, users and products are described by a set of latent factors that can be used to predict missing entries; in particular, MLlib implements the Alternating Least Squares (ALS) algorithm to learn these factors. The implementation in MLlib has the following parameters:
numBlocks is the number of blocks used to parallelize computation (automatically configured when set to -1);
rank is the number of latent factors in the model;
iterations is the number of iterations to run;
lambda is the regularization parameter in ALS;
implicitPrefs decides whether to use the explicit-feedback ALS variant or the variant adapted for implicit-feedback data;
alpha is a parameter of the implicit-feedback ALS variant that governs the baseline confidence in preference observations.
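The core idea of explicit-feedback ALS can be sketched on a small dense rating matrix with NumPy (a toy single-machine illustration, not Spark's implementation; in Spark you would use the ALS class with the parameters above). The sketch alternates between fixing the item factors and solving a regularized least-squares problem per user, and vice versa; `rank` and `lam` play the roles of the rank and lambda parameters described above:

```python
import numpy as np

def als_explicit(R, rank=2, iterations=10, lam=0.1, seed=0):
    """Toy explicit-feedback ALS. R is a dense user x item rating
    matrix in which 0 marks a missing entry."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.standard_normal((n_users, rank))  # user latent factors
    V = rng.standard_normal((n_items, rank))  # item latent factors
    mask = R > 0                              # observed entries only
    reg = lam * np.eye(rank)
    for _ in range(iterations):
        # Fix V; solve a regularized least-squares problem for each user.
        for u in range(n_users):
            Vu = V[mask[u]]                   # factors of items rated by user u
            U[u] = np.linalg.solve(Vu.T @ Vu + reg, Vu.T @ R[u, mask[u]])
        # Fix U; solve for each item symmetrically.
        for i in range(n_items):
            Ui = U[mask[:, i]]                # factors of users who rated item i
            V[i] = np.linalg.solve(Ui.T @ Ui + reg, Ui.T @ R[mask[:, i], i])
    return U, V  # U @ V.T predicts ratings, filling the missing entries
```

The product U @ V.T approximates the observed ratings and supplies predictions for the missing ones, which is exactly the "complete the user-commodity matrix" task described above.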
Usage scenarios for collaborative filtering
1. E-commerce features such as "customers who bought X also bought Y", bundle combinations, and "people also viewed".
2. Personalized recommendations in Toutiao ("Today's Headlines").
3. Groups with the same interests on Douban.
4. Movie recommendation systems.
5. Baidu Maps recommending nearby food based on the user's location.