Extracts from Big Data Algorithms

Source: Internet
Author: User
Tags: ID3

Preprocessing of big data
    1. Extraction
    2. Cleaning
Analysis methods
    1. Clustering: Clustering resembles classification, but unlike classification it has no predefined categories: it divides a data set into several groups by similarity, so that similarity within a group is high, similarity between different groups is low, and cross-group correlation is low.
    2. Classification: Classification finds the common characteristics of a set of data objects in a database and assigns them to classes according to a classification model; the aim is to map each data item in the database to one of the given categories.
    3. Regression analysis: Regression analysis models the relationships among attribute values in a database as functions, in order to discover dependencies between those values. It can be applied to predicting data series and to studying correlations.
    4. Association: Association rules are associations or interrelationships hidden between data items: the appearance of one data item lets the appearance of others be inferred. Mining association rules proceeds in two stages: the first stage finds all high-frequency itemsets in the raw data; the second stage generates association rules from those high-frequency itemsets.
Specific algorithms
  1. C4.5
The C4.5 algorithm is a classification decision-tree algorithm in machine learning; its core is the ID3 algorithm. C4.5 inherits the advantages of ID3 and improves on it in the following ways:

1. It selects attributes by information gain ratio, overcoming the bias that plain information gain shows toward attributes with many values.
    2. It prunes during tree construction.
    3. It can discretize continuous attributes.
    4. It can process incomplete data.

The advantages of C4.5 are that the resulting classification rules are easy to understand and the accuracy is high. The disadvantage is that the data set must be scanned and sorted repeatedly while the tree is constructed, which makes the algorithm inefficient.
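
    The gain-ratio criterion from improvement (1) can be sketched in a few lines of Python. The toy attributes and labels below are made-up illustrations, not data from any real study:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, attr, labels):
    """Information gain of splitting on `attr`, normalized by the
    split information (the entropy of the partition itself)."""
    n = len(rows)
    parts = {}  # attribute value -> labels of rows with that value
    for row, label in zip(rows, labels):
        parts.setdefault(row[attr], []).append(label)
    remainder = sum(len(p) / n * entropy(p) for p in parts.values())
    gain = entropy(labels) - remainder
    split_info = entropy([row[attr] for row in rows])
    return gain / split_info if split_info > 0 else 0.0

# Toy data: two attributes, binary class.
rows = [{"wind": "weak", "humid": "high"},
        {"wind": "strong", "humid": "high"},
        {"wind": "weak", "humid": "normal"},
        {"wind": "strong", "humid": "normal"}]
labels = ["no", "no", "yes", "yes"]
print(gain_ratio(rows, "humid", labels))  # 1.0: a perfect split
print(gain_ratio(rows, "wind", labels))   # 0.0: uninformative
```

    Real C4.5 additionally handles continuous attributes and missing values; this sketch covers only the attribute-selection criterion.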

  2. k-means
    The k-means algorithm is a clustering algorithm that divides n objects into k partitions (clusters) based on their attributes, assigning each object to the cluster whose mean is nearest.
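
    A minimal sketch of this partitioning (Lloyd's algorithm) on 2-D points; the data points and the explicit initial centroids are illustrative assumptions chosen so the run is reproducible:

```python
import random

def kmeans(points, k, iters=20, init=None):
    """Lloyd's algorithm: repeatedly assign each point to its nearest
    centroid, then move each centroid to the mean of its points."""
    centroids = list(init) if init else random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2
                                      + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        for i, c in enumerate(clusters):
            if c:  # keep an empty cluster's centroid in place
                centroids[i] = (sum(p[0] for p in c) / len(c),
                                sum(p[1] for p in c) / len(c))
    return centroids

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(kmeans(pts, 2, init=[(0, 0), (10, 10)]))  # one centroid per blob
```

    With random initialization the result can depend on the starting centroids, which is why production implementations usually run several restarts.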

  3. Support vector machines
    A support vector machine (SVM) is a supervised learning method widely used in statistical classification and regression analysis. An SVM maps input vectors into a higher-dimensional space and constructs a maximum-margin hyperplane there: on each side of the hyperplane that separates the data lies a parallel hyperplane, and the separating hyperplane is chosen to maximize the distance between the two parallel ones. The assumption is that the larger this distance, or margin, between the parallel hyperplanes, the smaller the classifier's total error. An excellent guide is C. J. C. Burges's "A Tutorial on Support Vector Machines for Pattern Recognition". Van der Walt and Barnard have compared support vector machines with other classifiers.
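
    The maximum-margin idea can be illustrated with a tiny linear SVM trained by sub-gradient descent on the hinge loss. This is a simplified stand-in for the SMO-style solvers real libraries use; the data, learning rate, and regularization constant below are illustrative assumptions:

```python
def train_linear_svm(data, lam=0.01, lr=0.1, epochs=200):
    """Sub-gradient descent on the primal SVM objective
        lam/2 * ||w||^2 + mean(max(0, 1 - y * (w.x + b))).
    Labels must be +1 or -1; each x is a 2-D point."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x0, x1), y in data:
            margin = y * (w[0] * x0 + w[1] * x1 + b)
            if margin < 1:
                # Point violates the margin: hinge sub-gradient is active.
                w[0] += lr * (y * x0 - lam * w[0])
                w[1] += lr * (y * x1 - lam * w[1])
                b += lr * y
            else:
                # Only the regularizer pulls on w.
                w[0] -= lr * lam * w[0]
                w[1] -= lr * lam * w[1]
    return w, b

def svm_predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1

# Two linearly separable blobs.
data = [((0, 0), -1), ((1, 0), -1), ((0, 1), -1),
        ((4, 4), 1), ((5, 4), 1), ((4, 5), 1)]
w, b = train_linear_svm(data)
print([svm_predict(w, b, x) for x, _ in data])
```

    Non-linear decision boundaries are obtained in practice with kernels, which this linear sketch omits.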

  4. Apriori
    The Apriori algorithm is one of the most influential algorithms for mining the frequent itemsets of Boolean association rules. Its core is a recursive, two-stage frequent-set method. In classification terms, the rules it produces are single-dimensional, single-level, Boolean association rules. All itemsets whose support is greater than the minimum support are called frequent itemsets, or frequency sets.
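
    The first stage, the level-wise search for frequent itemsets, can be sketched as below (rule generation, the second stage, is omitted). The shopping-basket transactions are made up for illustration:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Stage 1 of Apriori: a level-wise search for frequent itemsets.
    Candidates of size k are built only from frequent itemsets of size
    k-1, using the Apriori property: every subset of a frequent
    itemset must itself be frequent."""
    n = len(transactions)
    baskets = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for b in baskets if itemset <= b) / n

    items = {i for b in baskets for i in b}
    level = [frozenset([i]) for i in items
             if support(frozenset([i])) >= min_support]
    frequent = {}
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset)
        # Join step: merge pairs differing by one item, prune by support.
        level = list({a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1
                      and support(a | b) >= min_support})
    return frequent

trans = [["milk", "bread"], ["milk", "diaper", "beer"],
         ["bread", "diaper", "beer"], ["milk", "bread", "diaper", "beer"],
         ["milk", "bread", "diaper"]]
freq = apriori(trans, min_support=0.6)
print(freq[frozenset({"diaper", "beer"})])  # 0.6
```

    Real implementations avoid rescanning all transactions per candidate; this sketch favors clarity over efficiency.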

  5. Maximum expectation (EM) algorithm
In statistical computation, the expectation–maximization (EM) algorithm finds maximum-likelihood estimates of the parameters of a probabilistic model, where the model depends on unobserved latent variables. EM is often used for data clustering in machine learning and computer vision.
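
    A classic illustration is the two-coins problem, where the identity of the coin behind each trial is the latent variable. The trial counts and starting guesses below are illustrative assumptions:

```python
def em_two_coins(trials, theta_a, theta_b, iters=50):
    """EM for two biased coins. Each trial is (heads, tosses) from an
    unknown coin. E-step: the responsibility of coin A for each trial;
    M-step: re-estimate each coin's bias from its weighted counts."""
    for _ in range(iters):
        ha = ta = hb = tb = 0.0
        for heads, tosses in trials:
            # Likelihood of the trial under each coin (binomial kernel).
            la = theta_a ** heads * (1 - theta_a) ** (tosses - heads)
            lb = theta_b ** heads * (1 - theta_b) ** (tosses - heads)
            wa = la / (la + lb)                 # E-step responsibility
            ha += wa * heads; ta += wa * tosses
            hb += (1 - wa) * heads; tb += (1 - wa) * tosses
        theta_a, theta_b = ha / ta, hb / tb     # M-step re-estimates
    return theta_a, theta_b

# Five trials of 10 tosses each: a mixture of a heads-heavy coin
# (trials with 8-9 heads) and a tails-heavy one (2-3 heads).
trials = [(9, 10), (8, 10), (2, 10), (9, 10), (3, 10)]
print(em_two_coins(trials, 0.6, 0.4))
```

    The estimates converge to roughly 0.87 and 0.25, matching the two groups of trials; EM in general only guarantees a local maximum of the likelihood.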

  6. PageRank
    PageRank is an important part of Google's algorithm. It was granted a U.S. patent in September 2001; the patent holder is Larry Page, one of Google's founders. Thus the "Page" in PageRank refers not to a web page but to Page himself: the ranking method is named after him.
    PageRank measures the value of a website based on the number and quality of the site's external and internal links. The concept behind PageRank is that each link to a page is a vote for that page: the more links it receives, the more other sites are voting for it. This is called "link popularity", a measure of how many sites are willing to link to yours. The concept is borrowed from academic citation counting: the more often a paper is cited, the more authoritative it is judged to be.
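
    The voting idea can be sketched as power iteration over a toy link graph; the three-page graph and damping factor 0.85 below are illustrative (0.85 is the value commonly cited from the original paper):

```python
def pagerank(links, damping=0.85, iters=50):
    """Power iteration for PageRank. `links` maps each page to the
    pages it links to; every page starts with equal rank and then
    repeatedly passes its rank along its outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            share = rank[p] / len(outs) if outs else 0
            for q in outs:
                new[q] += damping * share
        rank = new
    return rank

# Toy web: "c" receives links from both other pages.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print({p: round(r, 3) for p, r in ranks.items()})  # c ranks highest
```

    A production implementation must also handle dangling pages (no outgoing links), which this toy graph does not contain.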

  7. AdaBoost
AdaBoost is an iterative algorithm whose core idea is to train different classifiers (weak classifiers) on the same training set and then assemble these weak classifiers into a stronger final classifier (a strong classifier). The algorithm works by changing the distribution of the data: each sample's weight is set according to whether it was classified correctly in the previous round and to the accuracy of the previous overall classifier. The re-weighted data set is passed to the next weak classifier for training, and finally the classifiers obtained from all rounds are fused into the final decision classifier.
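
    The re-weighting loop can be sketched with one-dimensional threshold stumps as the weak classifiers; the data below, which no single threshold can separate, is a made-up example:

```python
from math import log, exp

def adaboost(xs, ys, rounds=10):
    """AdaBoost with 1-D threshold stumps. Each round fits the stump
    minimizing weighted error, then re-weights: misclassified samples
    gain weight so the next stump focuses on them. Labels: +1 / -1."""
    n = len(xs)
    w = [1 / n] * n
    ensemble = []  # (alpha, threshold, polarity)
    for _ in range(rounds):
        best = None
        for thr in xs:
            for pol in (1, -1):
                preds = [pol if x > thr else -pol for x in xs]
                err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, thr, pol, preds)
        err, thr, pol, preds = best
        err = max(err, 1e-10)                 # avoid division by zero
        alpha = 0.5 * log((1 - err) / err)    # stump's vote weight
        ensemble.append((alpha, thr, pol))
        # Raise the weight of misclassified samples, then normalize.
        w = [wi * exp(-alpha * y * p) for wi, y, p in zip(w, ys, preds)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def ada_predict(ensemble, x):
    score = sum(a * (pol if x > thr else -pol) for a, thr, pol in ensemble)
    return 1 if score >= 0 else -1

xs = [0, 1, 2, 3, 4, 5]
ys = [1, 1, -1, -1, 1, 1]   # not separable by any single threshold
print([ada_predict(adaboost(xs, ys), x) for x in xs])
```

    After a few rounds the weighted vote of the stumps reproduces the labels that no individual stump could fit, which is the point of boosting.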

  8. KNN: k-nearest neighbor classification
    The k-nearest neighbor (KNN) classification algorithm is a theoretically mature method and one of the simplest machine learning algorithms. The idea is: if most of the k samples most similar to a given sample in feature space (that is, its nearest neighbors there) belong to a certain category, then the sample belongs to that category too.
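
    The whole method fits in a few lines; the two labeled point clouds below are illustrative:

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points, using Euclidean distance."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
         ((6, 6), "blue"), ((6, 7), "blue"), ((7, 6), "blue")]
print(knn_predict(train, (2, 2)))  # "red"
print(knn_predict(train, (6, 5)))  # "blue"
```

    The choice of k and of the distance metric are the main tuning knobs; an odd k avoids ties in two-class problems.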

  9. Naive Bayes
    Among the many classification models, the two most widely used are the decision tree model and the naive Bayesian model (NBC). The naive Bayesian model originates in classical mathematical theory, has a solid mathematical foundation, and classifies with stable efficiency. The NBC model also has few parameters to estimate, is not very sensitive to missing data, and is algorithmically simple. In theory, the NBC model has the smallest error rate of any classification method, but in practice this is not always so, because the model assumes that attributes are independent of one another, an assumption that often fails in practice and affects classification accuracy. When the number of attributes is large or the correlation between attributes is strong, the NBC model is less efficient than the decision tree model; when attribute correlation is small, its performance is best.
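
    The independence assumption makes training a matter of counting; a minimal sketch for categorical features, with made-up weather-style data and add-one (Laplace) smoothing against zero counts:

```python
from collections import Counter, defaultdict
from math import log

def train_nb(rows, labels):
    """Train naive Bayes for categorical features: store class priors
    and per-class value counts. The 'naive' assumption is that
    features are independent given the class."""
    priors = Counter(labels)
    counts = defaultdict(Counter)   # (class, feature index) -> value counts
    values = defaultdict(set)       # feature index -> values seen anywhere
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            counts[(y, i)][v] += 1
            values[i].add(v)
    return priors, counts, values, len(labels)

def predict_nb(model, row):
    priors, counts, values, n = model
    best, best_score = None, float("-inf")
    for y, cy in priors.items():
        # Log prior plus log likelihood of each feature value,
        # with Laplace smoothing so unseen values never zero out a class.
        score = log(cy / n)
        for i, v in enumerate(row):
            score += log((counts[(y, i)][v] + 1) / (cy + len(values[i])))
        if score > best_score:
            best, best_score = y, score
    return best

rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_nb(rows, labels)
print(predict_nb(model, ("rain", "mild")))  # "yes"
```

    Working in log space avoids numeric underflow when many features are multiplied together.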

  10. CART: Classification and regression tree
CART (Classification and Regression Trees). Two key ideas underlie classification trees: the first is recursively partitioning the space of the independent variables; the second is using validation data to prune the tree.
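
    One step of the first idea, recursive partitioning, can be sketched as picking the best split of a single numeric variable. CART conventionally scores splits with Gini impurity; the one-variable data below is a made-up example:

```python
def gini(labels):
    """Gini impurity: the probability that two random draws from the
    node disagree on class."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    """One step of CART's recursive partitioning: pick the threshold
    on a numeric variable that minimizes the weighted Gini impurity
    of the two child nodes."""
    best_thr, best_imp = None, float("inf")
    for thr in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if imp < best_imp:
            best_thr, best_imp = thr, imp
    return best_thr, best_imp

xs = [1, 2, 3, 10, 11, 12]
ys = ["a", "a", "a", "b", "b", "b"]
print(best_split(xs, ys))  # (3, 0.0): a pure split
```

    A full CART implementation applies this step recursively to each child node and then prunes the grown tree against validation data, which this sketch omits.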
