Machine learning is a branch of artificial intelligence that studies computer algorithms which improve automatically through experience.
Machine learning is a multidisciplinary field that draws on computer science, information theory, mathematics, statistics, neuroscience, and more.
Machine learning is a core technology of big data; in essence, it processes data with algorithms that are refined by experience. Machine learning emphasizes three keywords: algorithm, experience, and performance. The process is shown in the following figure.
Starting from the data, a model is built by an algorithm and then evaluated. If the evaluated performance meets the requirements, the model is used on other data; if not, the algorithm is adjusted, the model is rebuilt, and it is evaluated again. This cycle repeats until a satisfactory model is obtained for handling new data.
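The build, evaluate, adjust cycle described above can be sketched in a few lines of plain Python. The sketch below (with invented data and an assumed target rule y = 2x) fits a line y = w*x by gradient descent, evaluates the mean squared error after each pass, and keeps adjusting the parameter until performance meets a preset requirement.

```python
# Minimal sketch of the build-evaluate-adjust cycle (toy data, y = 2x).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0                      # model parameter, improved by experience
for _ in range(200):         # each pass: build the model, then evaluate it
    mse = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    if mse < 1e-6:           # performance meets the requirement: stop
        break
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= 0.01 * grad         # adjust the model and re-evaluate
# w has now converged close to the underlying rule's slope of 2
```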
Machine learning techniques and methods have been successfully applied in many fields, such as Toutiao's personalized recommendation system, Ant Financial's financial anti-fraud, iFlytek's speech recognition, natural language processing and Google's machine translation, pattern recognition, intelligent control, spam filtering, and so on.
Classification of machine learning
Supervised learning
Supervised learning learns a model from a given labeled training data set and uses this model to make predictions; the predicted results are compared with the actual results, and the model is adjusted continuously until it reaches the expected accuracy.
Common algorithms include regression analysis and statistical classification. Supervised learning is often used to train neural networks and decision trees. Both are highly dependent on a predetermined classification system; examples include spam filtering and news content classification.
Unsupervised learning
The training set for unsupervised learning has no manually labeled results; the learning model is meant to infer some of the inherent structure of the data. Common application scenarios include learning association rules and clustering.
The goal of this type of learning is not to maximize a utility function but to find structure among similar points in the training data. Clustering often discovers groupings that match human intuition reasonably well; for example, clustering based on demographic data may split a population into a wealthier group and a poorer group.
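The wealthy/poor example above can be illustrated with a tiny 1-D k-means clustering sketch. The income figures below are made up; with k = 2 and no labels at all, the algorithm still separates the two groups.

```python
# Hypothetical sketch: 1-D k-means (k = 2) on made-up monthly incomes,
# showing how clustering alone can split a population into two groups.
incomes = [1200, 1500, 1800, 9000, 9500, 11000]

centers = [min(incomes), max(incomes)]       # crude initialization
for _ in range(10):                          # a few refinement passes
    clusters = [[], []]
    for x in incomes:
        nearest = min((0, 1), key=lambda i: abs(x - centers[i]))
        clusters[nearest].append(x)
    centers = [sum(c) / len(c) for c in clusters]

print(clusters)  # → [[1200, 1500, 1800], [9000, 9500, 11000]]
```

A real implementation would also handle empty clusters and random restarts; this sketch only shows the assign-then-update loop.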
Semi-supervised learning
Sitting between supervised and unsupervised learning, semi-supervised learning considers how to use a small number of labeled samples together with a large number of unlabeled samples for training and classification. The learning algorithm tries to model the unlabeled data and uses it, together with the labeled samples, to make predictions; examples include graph-based inference algorithms and the Laplacian support vector machine.
Common algorithms for machine learning
Regression algorithm
Least squares, logistic regression, stepwise regression, multivariate adaptive regression splines (MARS), and locally estimated scatterplot smoothing (LOESS).
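Of the algorithms listed, ordinary least squares is the simplest to show concretely. The sketch below fits a line y = a + b*x to made-up data using the closed-form slope and intercept formulas.

```python
# Sketch of ordinary least squares for a simple line y = a + b*x
# (the data points below are invented for illustration).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# slope b = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
a = mean_y - b * mean_x   # intercept

print(round(b, 2), round(a, 2))  # → 1.97 1.09
```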
Instance-based algorithm
Often referred to as "winner-takes-all" learning. Such models select a batch of sample data and then compare new data against the stored samples according to some similarity measure, in order to find the best match.
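The compare-and-match idea above is exactly what k-nearest neighbours does. The sketch below (with invented 2-D points and k = 3) stores the samples, compares a new point against all of them by squared distance, and lets the closest matches vote.

```python
# Minimal instance-based ("winner takes all") sketch: k-nearest neighbours
# with k = 3 on made-up 2-D points with two classes, "A" and "B".
samples = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
           ((5.0, 5.0), "B"), ((5.2, 4.8), "B"), ((4.9, 5.1), "B")]

def classify(point, k=3):
    nearest = sorted(samples,
                     key=lambda s: (s[0][0] - point[0]) ** 2 +
                                   (s[0][1] - point[1]) ** 2)
    votes = [label for _, label in nearest[:k]]
    return max(set(votes), key=votes.count)   # majority vote wins

print(classify((1.1, 0.9)))  # → A
print(classify((5.1, 5.0)))  # → B
```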
Decision tree learning
Based on the attributes of the data, a tree structure is used to establish a decision model, which is often used to solve classification and regression problems.
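The core decision-tree idea, splitting on an attribute value, can be shown with a one-level tree (a decision stump). The data and the `stump_threshold` helper below are invented: the stump tries the midpoint between each adjacent pair of values and keeps the split that misclassifies the fewest training samples.

```python
# Hypothetical decision-stump sketch: find the single attribute threshold
# that best separates the two labels (toy data).
heights = [150, 155, 160, 175, 180, 185]   # one numeric attribute, sorted
labels  = ["short", "short", "short", "tall", "tall", "tall"]

def stump_threshold(xs, ys):
    best_t, best_err = None, len(xs) + 1
    for i in range(len(xs) - 1):
        t = (xs[i] + xs[i + 1]) / 2        # candidate split point
        # count training samples the rule "x > t means tall" gets wrong
        err = sum((x > t) != (y == "tall") for x, y in zip(xs, ys))
        if err < best_err:
            best_t, best_err = t, err
    return best_t

t = stump_threshold(heights, labels)
print(t)  # → 167.5 (everything above is classified "tall")
```

A full decision tree applies this split search recursively to each resulting subset and across all attributes.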
Bayesian learning
Mainly used to solve classification and regression problems; a representative example is the naive Bayes algorithm.
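A small naive Bayes sketch makes the idea concrete. The toy corpus below is invented: each class gets a prior probability, each word gets an add-one-smoothed conditional probability, and the class with the highest log-posterior wins.

```python
# Sketch of naive Bayes text classification on an invented toy corpus.
from collections import Counter
import math

train = [("win money now", "spam"), ("cheap money offer", "spam"),
         ("meeting at noon", "ham"), ("lunch at noon today", "ham")]

word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
vocab = set()
for text, cls in train:
    class_counts[cls] += 1
    for w in text.split():
        word_counts[cls][w] += 1
        vocab.add(w)

def classify(text):
    best_cls, best_score = None, -math.inf
    for cls in class_counts:
        # log prior + sum of log likelihoods with add-one smoothing
        score = math.log(class_counts[cls] / len(train))
        total = sum(word_counts[cls].values())
        for w in text.split():
            score += math.log((word_counts[cls][w] + 1)
                              / (total + len(vocab)))
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls

print(classify("cheap money"))    # → spam
print(classify("lunch at noon"))  # → ham
```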
Clustering, classification algorithm
Clustering and classification are two commonly used kinds of algorithms in machine learning. Clustering separates data into different groups, while classification assigns new data to known categories. The two are introduced below.
(1) What is clustering?
Clustering groups data objects into multiple classes or clusters, so that objects within the same cluster are highly similar while objects in different clusters differ greatly.
In fact, clustering is common in everyday life; as the saying goes, "birds of a feather flock together." Its core idea is grouping: people continually refine the way they cluster things in order to learn how to distinguish between objects and people.
(2) What is classification?
Many data warehouses, databases, and other information repositories provide the knowledge needed for decisions in business, research, and other activities. Classification and prediction are two forms of data analysis that can be used to extract models describing important data sets or to predict future data trends.
Classification predicts the discrete category of a data object (its categorical label); prediction estimates a continuous value for a data object.
Classification process: new sample → feature selection → classification → evaluation
Training process: training set → feature selection → training → classifier
Initially, most machine-learning classification applications were built on these methods, with algorithms that operated entirely in memory. Today, data mining methods must be able to process large-scale data collections held on external storage, and must be scalable.
The machine learning library Spark MLlib
MLlib is Spark's machine learning library, designed to simplify the engineering practice of machine learning and to scale easily to larger data sets. Machine learning usually requires many iterations; with the Hadoop computing framework, each iteration incurs disk reads and writes, resulting in very large I/O and CPU consumption. Spark's in-memory computation gives it a natural advantage here. Moreover, its RDDs can seamlessly share data and operations with other Spark components and libraries such as Spark SQL, Spark Streaming, and GraphX; for example, MLlib can directly use data provided by Spark SQL, or join directly with GraphX graph computations.
MLlib's position in the Spark ecosystem
Spark MLlib architecture
It can be seen from the architecture diagram that MLlib mainly consists of three parts:
The underlying foundation: including Spark's runtime library, matrix library and vector library;
Algorithm Library: Algorithms that include generalized linear models, recommendation systems, clustering, decision trees, and evaluation;
Utilities: Includes functions such as generation of test data and reading of external data.
The figure below is the core content of the MLlib algorithm library.
MLlib consists of some common learning algorithms and tools, including classification, regression, clustering, collaborative filtering, dimensionality reduction, etc., as well as the underlying optimization primitives and high-level pipeline APIs.
Specifically, it mainly includes the following aspects:
1. Algorithm tools: commonly used learning algorithms such as classification, regression, clustering and collaborative filtering;
2. Featurization tools: feature extraction, transformation, dimensionality reduction, and selection tools;
3. Pipeline: A tool for building, evaluating, and adjusting machine learning pipelines;
4. Persistence: save and load algorithms, models and pipelines;
5. Utilities: tools for linear algebra, statistics, data processing, etc.
Spark divides the machine learning algorithm into two modules:
Training module: output model parameters through training samples;
Prediction module: Initialize with model parameters, predict test samples, and output predicted values.
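The two-module split above can be sketched with a nearest-class-mean classifier on made-up 1-D data. The `train` and `predict` functions below are hypothetical names: the training module outputs model parameters (one mean per class), and the prediction module is initialized with those parameters and scores test samples.

```python
# Sketch of the training/prediction module split (toy 1-D data).
def train(samples):
    """Training module: training samples -> model parameters (class means)."""
    sums, counts = {}, {}
    for x, label in samples:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(params, x):
    """Prediction module: model parameters + test sample -> predicted label."""
    return min(params, key=lambda label: abs(x - params[label]))

params = train([(1.0, "low"), (2.0, "low"), (10.0, "high"), (12.0, "high")])
print(params)                 # → {'low': 1.5, 'high': 11.0}
print(predict(params, 3.0))   # → low
```

Because `params` is just a dictionary of numbers, it could be saved and later used to re-initialize the prediction module, which is the point of separating the two stages.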
Analysis of classic algorithms in MLlib
Classification
Classification is an important machine learning and data mining technique. The purpose of classification is to construct a classification function or classification model (often referred to as a classifier) based on the characteristics of the data set, which can map a sample of unknown category to one of the given categories.
The specific rules for classification can be described as follows:
Given a training data set T (training set), each record of T contains a number of attributes (features) forming a feature vector, denoted x = (x1, x2, ..., xn). Each xi may have its own value range; when an attribute's value range is continuous, it is a numerical attribute, otherwise it is a discrete attribute. The category attribute is denoted C = {c1, c2, ..., ck}, i.e., the data set has k different categories. T therefore implies a mapping function from the feature vector X to the category attribute C: f(X) → C. The purpose of classification is to analyze the input data and, from the characteristics of the training data, find an accurate description or model for each class; this model is used to represent the implicit mapping function.
The process of constructing a classification model is generally divided into two stages, training and testing. Before the model is built, the data set is randomly divided into a training data set and a test data set. The training data set is first used to construct the classification model, and the test data set is then used to evaluate the model's classification accuracy. If the model's accuracy is considered acceptable, the model can be used to classify other data tuples. In general, the cost of the testing phase is much lower than that of the training phase.
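A minimal sketch of this two-stage process, with invented data and a deliberately trivial "model": randomly split the data set 70/30, fit a threshold rule on the training part, then estimate accuracy on the held-out test part.

```python
# Sketch of train/test splitting and accuracy evaluation (toy data:
# inputs 0..99, labeled "pos" above 50, with a trivial threshold model).
import random

random.seed(42)
data = [(x, "pos" if x > 50 else "neg") for x in range(100)]
random.shuffle(data)
split = int(0.7 * len(data))
train_set, test_set = data[:split], data[split:]

# "training": pick the threshold as the mean of the training inputs
threshold = sum(x for x, _ in train_set) / len(train_set)

# "testing": measure classification accuracy on the unseen test part
correct = sum(("pos" if x > threshold else "neg") == y for x, y in test_set)
accuracy = correct / len(test_set)
print(accuracy)
```

Evaluating on data the model never saw during training is what makes the accuracy estimate honest; measuring on the training set itself would be overly optimistic.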
MLlib's classification algorithms are based on different underlying ideas, such as support vector machines (SVM), decision trees, Bayesian methods, KNN, and so on. The spark.mllib package supports a variety of methods, including binary classification, multiclass classification, and regression analysis. The following table lists the algorithms supported for each type of problem.
Because there is too much material to cover, the details of each algorithm are not described here.
Classification algorithm usage scenario
1. Citizens travel by bus forecast
Based on massive public transport records, the goal is to uncover citizens' behavior patterns in public transportation. Taking bus route choice prediction as the task, historical bus card transaction data from some bus routes in Guangdong Province is analyzed to find the travel patterns of regular passengers and infer their habits and preferences, and a model is built to predict which bus lines each passenger will use in the coming week, so as to provide passengers with a convenient, safe, and comfortable travel environment and to use data to guide future smart urban mobility.
2. Personal credit evaluation based on operator data
As network service providers, operators have accumulated a large amount of basic user information and behavioral data, such as terminal data, package consumption data, and communication data. The real-name policy ensures that the operator's user data can be matched to users' true identities and can accurately and objectively reflect user behavior. The widely deployed network infrastructure makes it possible to accumulate large amounts of real-time data that reflect users' characteristics across many dimensions.
In China, personal credit evaluation relies mainly on the central bank's personal credit reports, but for the many users who have no personal credit record, the cost for financial institutions to learn their credit history is high, and traditional credit evaluation methods struggle to meet today's variety of emerging needs. Unlike other big data businesses, financial business demands high authenticity, credibility, and timeliness of data, and this is precisely where the value of operator data lies.
It is expected to use the operator's user data to provide a comprehensive personal credit assessment.
3. Product image classification
Jingdong (JD.com) hosts millions of product images, and applications such as "search by photo" and "find the same item" must classify the product images provided by users. At the same time, extracted product image features can feed recommendation and advertising systems to improve their effectiveness.
The goal is to learn from the image data so that product images can be automatically classified.
4. Advertising click behavior prediction
Users may be exposed to, or click on, advertisements while browsing. Predicting ad clicks can guide advertisers in targeting and optimizing ad delivery, maximizing advertising returns.
Based on the ad exposure and click logs of 1 million random users over a six-month period, including ad monitoring point data, the goal is to predict whether each user will click on each monitoring point within the following 8 days.
5. Spam SMS recognition based on text content
Spam SMS messages have increasingly become a problem plaguing operators and mobile phone users, seriously affecting people's daily lives, damaging operators' public image, and jeopardizing social stability. Spammers use technology to keep changing the format of spam messages, and the transmission channels are extremely broad; traditional filtering based on rules and keywords has limited effect, and many spam messages "escape" the filters and still reach users' handsets.
It is hoped to intelligently identify spam messages and their variants based on SMS text content, combined with machine learning algorithms and big data analysis mining.
6. Sogou user profiling for big data precision marketing
The old saying "birds of a feather flock together" not only reveals the self-organizing tendency of things and people, but also hints at the intrinsic relationship between "clustering" and "crowds". In modern digital advertising systems, multi-level user profiling is one of the foundational techniques for accurate ad delivery; among these, ad targeting based on demographic attributes is a key technique applicable both to brand display advertising and to precision bidding advertising. In a search advertising system, users obtain information by entering specific query terms into the search engine, so a user's historical query terms are closely related to the user's basic attributes and latent needs.
The hope is to use one month of each user's historical query terms, together with demographic labels (gender, age, education), as training data, and to build a classifier with machine learning and data mining techniques that determines the demographic attributes of newly added users.