I originally planned to finish this article around May 15, but the past week has been taken up by visa matters and work, so it kept getting pushed back. Now I finally have time to write up the last part of my Learning Spark notes.
Chapters 10 and 11 mainly cover Spark Streaming and MLlib. We already know that Spark handles offline (batch) data well, but how does it behave on real-time data? In production we often need to process data as it arrives, for example applying machine learning models in real time, automatic anomaly detection, or tracking page-view statistics live. Spark Streaming solves these kinds of problems well.
To learn Spark Streaming, you only need to understand the following points:
- DStream
- Concept: a discretized stream (DStream) represents data over time as a sequence of RDDs, one for each time interval. A DStream can be created from many input sources, such as Flume, Kafka, or HDFS.
- Operations: transformations and output operations. DStreams support most RDD operations, plus time-related operations such as sliding windows.
The following diagram illustrates the workflow of Spark Streaming:
As you can see, Spark Streaming treats the streaming computation as a continuous series of small batch jobs. It reads data from the various input sources and groups it into small batches; new batches are created at regular intervals. At the beginning of each time interval a new batch is created, and any data received within that interval is added to it. At the end of the interval, the batch stops growing.
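As a minimal sketch of what this looks like in PySpark (the one-second batch interval, the socket source on port 7777, and all variable names are illustrative assumptions):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingSketch")
ssc = StreamingContext(sc, 1)                    # a new batch is created every 1 second
lines = ssc.socketTextStream("localhost", 7777)  # each batch is an RDD of received lines

lines.pprint()          # print a few elements of every batch
ssc.start()             # start receiving and processing data
ssc.awaitTermination()  # wait for the streaming computation to finish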
Transformation operations
- Stateless transformations: simple RDD transformations are applied to each batch independently, so processing a batch does not depend on data from previous batches. Examples include map(), filter(), and reduceByKey().
- Stateful transformations: computing the current batch requires data or intermediate results from previous batches. These include window-based transformations and transformations that track state over time (updateStateByKey()).
Stateless transformations
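A quick sketch of the stateless case, assuming a DStream accessLogs of log lines (the variable names are illustrative): each batch's RDD is transformed on its own, with no dependence on earlier batches.

# Filter and count errors per host, independently within each batch
errorLines = accessLogs.filter(lambda line: "ERROR" in line)
errorCounts = errorLines.map(lambda line: (line.split(" ")[0], 1)) \
                        .reduceByKey(lambda a, b: a + b)
errorCounts.pprint()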
Stateful transformations
Window mechanism (a picture is worth a thousand words)
The picture should be easy to read; the book's example here is written in Java.
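A rough PySpark sketch of the same idea (the 30-second window, 10-second slide, and the lines DStream from the earlier sketch are illustrative assumptions):

# Keep the last 30 seconds of data, recomputed every 10 seconds
# (both durations must be multiples of the batch interval)
windowed = lines.window(30, 10)
windowedCounts = windowed.count()   # number of records in each window
windowedCounts.pprint()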
updateStateByKey() transformation
It is used to maintain state across batches for a DStream of key-value pairs. Given a DStream of (key, event) pairs and a function that specifies how each key's state is updated by new events, it builds a new DStream of (key, state) pairs. Put simply, suppose we want to know the last 10 pages each user visited: the key would be the user ID, and updateStateByKey() would track the 10 most recently visited pages for each user; that list is the "state" object. Concretely, updateStateByKey() takes an update(events, oldState) function that receives the events that arrived for a key in the current batch along with that key's previous state, and returns the new state for the key.
- events: the list of events received for the key in the current batch (may be empty).
- oldState: an optional state object, wrapped in an Option; it may be absent if the key had no previous state.
- newState: returned by the function, also as an Option; returning an empty Option means the state for that key should be deleted.
The result of updateStateByKey() is a new DStream whose internal RDD sequence consists of the (key, state) pairs for each time interval.
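A minimal PySpark sketch of updateStateByKey(), here simply counting events per key (the pageViews DStream of (userId, url) pairs and all names are illustrative assumptions):

def updateCount(newEvents, oldState):
    # newEvents: values that arrived for this key in the current batch (may be empty)
    # oldState:  previous state for this key, or None if the key has no state yet
    previous = oldState if oldState is not None else 0
    return previous + len(newEvents)

ssc.checkpoint("checkpoint")   # stateful transformations require a checkpoint directory
viewCounts = pageViews.updateStateByKey(updateCount)
viewCounts.pprint()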
Next, let's talk about input sources.
- Core data sources: file streams, including plain text and arbitrary Hadoop input formats (a short sketch follows this list)
- Additional data sources: Kafka and Flume are the most commonly used; Kafka input is covered below
- Multiple data sources and cluster sizing
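A minimal sketch of the core file-stream source, assuming new text files are dropped into an HDFS directory (the path and variable names are illustrative):

# Treat each file newly created in the directory as a batch of input
logData = ssc.textFileStream("hdfs:///user/logs/")
logData.filter(lambda line: "ERROR" in line).pprint()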
The specific steps for using Kafka as an input source:
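A minimal sketch using the Spark 1.x receiver-based Kafka API (it assumes the spark-streaming-kafka package is on the classpath; the ZooKeeper address, consumer group, and topic name are illustrative):

from pyspark.streaming.kafka import KafkaUtils

# Connect through ZooKeeper; {"pandas": 1} maps each topic name to a number of receiver threads
kafkaStream = KafkaUtils.createStream(ssc,
                                      "zk-host:2181",    # ZooKeeper quorum
                                      "streaming-app",   # consumer group id
                                      {"pandas": 1})     # topics to consume
kafkaStream.map(lambda kv: kv[1]).pprint()   # each element is a (key, message) pair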
Machine learning with MLlib
Generally, the algorithms we commonly use are written to run on a single machine, and they cannot simply be dropped onto a cluster as-is. First, the data format is different: locally our data is usually discrete or continuous values held in arrays, lists, or DataFrames and stored as TXT, CSV, and so on, but on Spark the data lives in RDDs, so converting something like an ndarray into an RDD is the first problem. Second, even after the data is in RDD form, the algorithm itself has to change. Suppose you have a pile of data stored as an RDD and split into partitions, with each partition running the algorithm on its share of the data: you can think of each partition as a separate single-machine program, but what happens after all the partitions finish? How do you combine the results: just average them, or something else? An algorithm that runs on a cluster therefore has to be written specifically as a distributed algorithm, and some algorithms simply cannot be distributed. MLlib contains only parallel algorithms that run well on a cluster.
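On the data-format point, a minimal sketch of turning local Python data into an RDD that MLlib can consume (it assumes an existing SparkContext sc; the variable names are illustrative):

import numpy as np
from pyspark.mllib.linalg import Vectors

localData = np.array([[1.0, 0.0, 3.0],
                      [0.0, 2.0, 0.0]])
# parallelize() distributes the local rows across the cluster as an RDD
vectorRDD = sc.parallelize([Vectors.dense(row) for row in localData])
print vectorRDD.count()   # 2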
Data types in MLlib
- Vector: vectors (mllib.linalg.Vectors) come in dense and sparse forms; a dense vector stores every value, while a sparse vector stores only the non-zero values to save space (a short construction sketch follows this list).
- LabeledPoint: (mllib.regression) a labeled data point, containing a feature vector and a label; note that the label must be converted to a floating-point value, e.g. via a StringIndexer transformation.
- Rating: (mllib.recommendation) a user's rating of a product, used for product recommendations.
- Various model classes: each model is the result of a training algorithm and generally has a predict() method that can be applied to a single new data point or to an RDD of data points.
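A short sketch constructing each of these types (the values are illustrative):

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.recommendation import Rating

dense = Vectors.dense([1.0, 2.0, 3.0])
sparse = Vectors.sparse(4, {0: 1.0, 2: 3.0})   # size 4, non-zero values at indices 0 and 2
point = LabeledPoint(1.0, dense)               # the label must be a float
rating = Rating(user=1, product=10, rating=5.0)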
In general, most algorithms operate directly on an RDD of Vectors, LabeledPoints, or Ratings, so after reading data from an external source we usually need some transformation operations to build that RDD. I won't go into the principles behind the specific clustering and classification algorithms; the MLlib online documentation covers them. Here's an example: the process of spam classification.
Steps:
1. Convert the data to an RDD of strings
2. Feature extraction: convert the text data to numeric features, returning an RDD of vectors
3. Train a model on the training set, using a classification algorithm
4. Evaluate the model on the test set
Specific code:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.classification import LogisticRegressionWithSGD

spam = sc.textFile("spam.txt")
normal = sc.textFile("normal.txt")

# Create a HashingTF instance to map message text to vectors of 10,000 features
tf = HashingTF(numFeatures=10000)
# Split each message into words, and map each word to one feature
spamFeatures = spam.map(lambda email: tf.transform(email.split(" ")))
normalFeatures = normal.map(lambda email: tf.transform(email.split(" ")))

# Build LabeledPoint datasets for the positive (spam) and negative (normal mail) examples
positiveExamples = spamFeatures.map(lambda features: LabeledPoint(1, features))
negativeExamples = normalFeatures.map(lambda features: LabeledPoint(0, features))
trainingData = positiveExamples.union(negativeExamples)
trainingData.cache()  # logistic regression is an iterative algorithm, so cache the training RDD

# Run logistic regression using the SGD algorithm
model = LogisticRegressionWithSGD.train(trainingData)

# Test on a positive (spam) and a negative (normal mail) example
posTest = tf.transform("O M G GET cheap stuff by sending ...".split(" "))
negTest = tf.transform("Hi Dad, I started studying Spark the other ...".split(" "))
print "Prediction for positive test example: %g" % model.predict(posTest)
print "Prediction for negative test example: %g" % model.predict(negTest)
This example is very simple and only scratches the surface; I suggest going directly to the MLlib official documentation as your needs dictate, since it covers clustering and classification in great detail.
"Original" Learning Spark (Python version) learning notes (iv)----spark sreaming and Mllib machine learning