Spark Streaming and MLlib machine learning

Source: Internet
Author: User
Tags: pyspark


This article was originally planned for around May 15, but the past week was taken up by visa matters and work, so it kept getting pushed back. Now I finally have time to write up the last part of my Learning Spark notes.

Chapters 10-11 mainly cover Spark Streaming and MLlib. We know Spark does a good job with offline data, but how does it behave on real-time data? In production we often need to process data as it arrives, for example applying machine learning models in real time, detecting anomalies automatically, or tracking page-view statistics live. Spark Streaming is a good fit for these kinds of problems.

To get started with Spark Streaming, you only need to understand the following points:

  • DStream
    • Concept: a discretized stream, i.e. data over time, represented as a sequence of RDDs, one per time interval. A DStream can be created from many input sources, such as Flume, Kafka, or HDFS.
    • Operations: transformations and output operations. DStreams support the usual RDD operations and add time-related operations such as "sliding windows".

The following diagram illustrates the workflow of Spark Streaming:

As you can see, Spark Streaming treats a streaming computation as a continuous series of small batch jobs. It reads data from various input sources and groups the data into small batches, with new batches created at regular intervals. At the start of each time interval a new batch is created, data received during that interval is added to the batch, and at the end of the interval the batch stops growing.
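As a quick illustration, here is a minimal PySpark sketch of that setup. It assumes a local run; the socket host and port are placeholders.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingExample")   # at least 2 threads: one receiver, one for processing
ssc = StreamingContext(sc, 1)                       # batch interval of 1 second
lines = ssc.socketTextStream("localhost", 9999)     # DStream: one RDD of text lines per batch
lines.pprint()                                      # output operation: print a few elements of each batch
ssc.start()                                         # start receiving and processing data
ssc.awaitTermination()                              # block until the computation is stopped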

Conversion operations

  • Stateless conversion operations: simple RDD transformations applied to each batch individually; processing a batch does not depend on data from previous batches. These include map(), filter(), reduceByKey(), and so on.
  • Stateful conversion operations: data or intermediate results from previous batches are needed to compute the current batch. These include operations over sliding windows and operations that track state changes (updateStateByKey()).

Stateless conversion operations
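A minimal sketch of a few stateless operations, assuming the lines DStream from the sketch above; each batch is processed independently of the others.

words = lines.flatMap(lambda line: line.split(" "))      # split each line into words
errors = lines.filter(lambda line: "ERROR" in line)      # keep only error lines, per batch
wordCounts = words.map(lambda w: (w, 1)) \
                  .reduceByKey(lambda a, b: a + b)       # word counts within a single batch only
wordCounts.pprint()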

  Stateful conversion operations

Window mechanism (a picture is worth a thousand words)

This should be easy to follow from the diagram, and here is an example (the book gives it in Java):
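Since the Java code is not reproduced here, below is a minimal PySpark sketch of the same idea, a windowed word count. It assumes the words DStream from the stateless sketch above; the 30-second window and 10-second slide are illustrative values, and checkpointing is required because an inverse reduce function is supplied.

ssc.checkpoint("checkpoint")                       # windowing with an inverse function needs checkpointing
pairs = words.map(lambda w: (w, 1))                # (word, 1) pairs
windowedCounts = pairs.reduceByKeyAndWindow(
    lambda a, b: a + b,                            # add counts from batches entering the window
    lambda a, b: a - b,                            # subtract counts from batches leaving the window
    30,                                            # window length: 30 seconds
    10)                                            # slide interval: 10 seconds
windowedCounts.pprint()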

  

  updateStateByKey() conversion operation

It is mainly used to maintain state across batches for a DStream of key-value pairs. Given a DStream of (key, event) pairs and a function that specifies how new events update the state for each key, it builds a new DStream of (key, state) pairs. In plain terms, suppose we want to know the 10 pages each user visited most recently: we can set the key to the user ID, and updateStateByKey() will then track the 10 most recently visited pages for each user; that list is the "state" object. Specifically, updateStateByKey() takes an update(events, oldState) function that receives the events associated with a key in the current batch and that key's previous state, and returns the new state for the key:

  • events: the list of events received for the key in the current batch; it may be empty.
  • oldState: an optional state object, wrapped in an Option; it may be empty if the key had no previous state.
  • newState: returned by the function, also as an Option. Returning an empty Option means the state for that key should be deleted.

The result of updateStateByKey() is a new DStream whose internal RDD sequence consists of the (key, state) pairs for each time interval.
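A minimal sketch of updateStateByKey() for the page-tracking example above. It assumes a DStream named visits of (userID, pageID) pairs and a checkpointed StreamingContext; in the Python API, returning None from the update function deletes the state for that key.

def updateLastPages(newPages, lastPages):
    # newPages: pages this user visited in the current batch (may be empty)
    # lastPages: the previous state, or None if this key has no state yet
    pages = (lastPages or []) + newPages
    return pages[-10:]                     # keep only the 10 most recently visited pages

ssc.checkpoint("checkpoint")               # updateStateByKey() requires checkpointing
recentPages = visits.updateStateByKey(updateLastPages)
recentPages.pprint()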

Next, let's talk about input sources:

  • Core data sources: file streams, including text format and arbitrary Hadoop input formats (a file-stream sketch follows this list)
  • Additional data sources: Kafka and Flume are the most commonly used; Kafka input is covered below
  • Multiple data sources and cluster sizing
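A minimal sketch of the core file-stream source, assuming a directory that new text files keep being written into; the path is a placeholder.

logData = ssc.textFileStream("hdfs://namenode:8020/logs/")   # picks up newly created files in the directory
errorLines = logData.filter(lambda line: "ERROR" in line)
errorLines.pprint()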

The specific steps for reading from Kafka are as follows:
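A minimal sketch of a receiver-based Kafka stream. It assumes an older Spark release (1.x/2.x) where the pyspark.streaming.kafka module and the matching spark-streaming-kafka package are available (newer releases dropped them); the ZooKeeper address, consumer group, and topic name are placeholders.

from pyspark.streaming.kafka import KafkaUtils

kafkaStream = KafkaUtils.createStream(
    ssc,                       # existing StreamingContext
    "zk-host:2181",            # ZooKeeper quorum
    "my-consumer-group",       # consumer group id
    {"pandas": 1})             # topic -> number of threads consuming it
messages = kafkaStream.map(lambda kv: kv[1])   # keep only the message values, drop the keys
messages.pprint()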

Machine learning with MLlib

In general, the algorithms we commonly use run on a single machine, and they cannot simply be dropped onto a cluster as-is. First, the data format is different: locally we usually work with discrete or continuous data held in arrays, lists, or DataFrames and stored as TXT, CSV, and similar formats, but on Spark the data lives in RDDs, so converting, say, an ndarray into an RDD is its own problem (a sketch follows). Second, even after the data is in RDD form, the algorithm itself has to change. For example, suppose you have a pile of data stored as an RDD split into partitions, and each partition runs the algorithm on its share of the data; you can think of each partition as a separate single-machine run. What happens after all the partitions finish? How do you combine their results? Average them directly, or something else? So an algorithm that runs on a cluster must be written specifically as a distributed algorithm, and some algorithms cannot be distributed at all. MLlib therefore contains only parallel algorithms that run well on a cluster.
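A minimal sketch of moving local data onto the cluster, assuming an existing SparkContext sc; the sample values are made up. MLlib accepts NumPy arrays (or plain lists of floats) as dense vectors, so a local ndarray can be parallelized row by row into an RDD.

import numpy as np

localArray = np.array([[1.0, 0.0, 3.0],
                       [0.0, 2.0, 0.0]])          # local ndarray: two rows of features
rowRDD = sc.parallelize(localArray.tolist())      # RDD with one feature list per element
print(rowRDD.count())                             # the rows are now distributed across the cluster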

Data types in MLlib

  • Vector (mllib.linalg.Vectors): supports both dense and sparse vectors. The difference is that a dense vector stores every value, while a sparse vector stores only the nonzero values to save space (see the sketches after this list).
  • LabeledPoint (mllib.regression): a labeled data point, containing a feature vector together with a label; note that the label is stored as a floating-point value (string labels can be converted with StringIndexer).
  • Rating (mllib.recommendation): a user's rating of a product, used for product recommendation.
  • Various model classes: each model is the result of a training algorithm and generally has a predict() method that can be applied to a new data point or to an RDD of data points.
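Minimal sketches of the data types above; the numbers are illustrative.

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.recommendation import Rating

dense = Vectors.dense([1.0, 2.0, 3.0])            # dense vector: every value stored
sparse = Vectors.sparse(3, {0: 1.0, 2: 3.0})      # sparse vector: size plus the nonzero entries
point = LabeledPoint(1.0, dense)                  # a float label plus a feature vector
rating = Rating(1, 10, 5.0)                       # user 1 rates product 10 with 5.0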

  

In general, most algorithms operate directly on an RDD of Vector, LabeledPoint, or Rating objects, and typically we build that RDD with transformation operations after reading the data from an external source. The principles behind the specific clustering and classification algorithms are not covered here; see the MLlib online documentation. Here is an example: the process of spam classification.

Steps:

1. Convert the data to an RDD of strings

2. Feature extraction: convert the text data to numerical features, returning an RDD of vectors

3. Train the model on the training set, using a classification algorithm

4. Evaluate the model on the test set

Specific code:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.classification import LogisticRegressionWithSGD

spam = sc.textFile("spam.txt")
normal = sc.textFile("normal.txt")

# Create a HashingTF instance to map email text to vectors of 10,000 features
tf = HashingTF(numFeatures=10000)
# Each email is split into words, and each word is mapped to one feature
spamFeatures = spam.map(lambda email: tf.transform(email.split(" ")))
normalFeatures = normal.map(lambda email: tf.transform(email.split(" ")))

# Create LabeledPoint datasets for positive (spam) and negative (normal email) examples
positiveExamples = spamFeatures.map(lambda features: LabeledPoint(1, features))
negativeExamples = normalFeatures.map(lambda features: LabeledPoint(0, features))
trainingData = positiveExamples.union(negativeExamples)
trainingData.cache()   # Cache the RDD because logistic regression is an iterative algorithm

# Run logistic regression using the SGD algorithm
model = LogisticRegressionWithSGD.train(trainingData)

# Test on a positive (spam) and a negative (normal email) example
posTest = tf.transform("O M G GET cheap stuff by sending money to ...".split(" "))
negTest = tf.transform("Hi Dad, I started studying Spark the other ...".split(" "))
print("Prediction for positive test example: %g" % model.predict(posTest))
print("Prediction for negative test example: %g" % model.predict(negTest))
