Original address: http://blog.jobbole.com/?p=89446
I first heard of Spark at the end of 2013, when I became interested in Scala, the language Spark is written in. Some time later, I did a fun data science project that tried to predict survival on the Titanic. This turned out to be a great way to get further introduced to Spark concepts and programming, and I highly recommend it to any aspiring Spark developer wondering how to get started.
Today, Spark is used by major players such as Amazon, eBay, and Yahoo!, and many organizations run it on clusters with thousands of nodes. According to the Spark FAQ, the largest known Spark cluster has more than 8,000 nodes. Spark is truly a technology worth taking note of and learning about.
This article introduces Spark, including use cases and examples. The information comes from the Apache Spark website and from the book Learning Spark – Lightning-Fast Big Data Analysis.
What Is Apache Spark? A Brief Introduction
Spark is an Apache project that is advertised as "lightning-fast Cluster Computing". It has a thriving open source community and is currently the most active Apache project.
Spark provides a faster, more general-purpose data processing platform. Compared to Hadoop, Spark can make your programs run up to 100 times faster in memory, or 10 times faster on disk. Last year, in the Daytona GraySort contest, Spark beat Hadoop while using only one-tenth the number of machines and running 3 times faster. Spark has also become the fastest open source engine for sorting a petabyte of data.
Spark also makes it possible to write code more quickly, since it provides more than 80 high-level operators to work with. To illustrate this, let's look at the "Hello World!" of big data: the word count example. In MapReduce, we would need to write about 50 lines of code to do this, but in Spark (and Scala) it is as simple as this:
sparkContext.textFile("hdfs://...")
            .flatMap(line => line.split(" "))
            .map(word => (word, 1)).reduceByKey(_ + _)
            .saveAsTextFile("hdfs://...")
Another important part of learning how to use Apache Spark is the interactive shell (REPL), which comes out of the box. Using the REPL, we can test the output of each line of code without first having to write and execute an entire job. This gets you to working code faster, and it makes ad hoc data analysis possible.
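For example, here is a minimal sketch of what an interactive spark-shell session might look like; it assumes only the SparkContext named sc that the shell provides, and the numbers are purely illustrative:

val numbers = sc.parallelize(1 to 1000)   // distribute a local collection across the cluster
val evens = numbers.filter(_ % 2 == 0)    // define a transformation on the RDD
evens.count()                             // run an action and see the result immediately: 500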
Spark also offers some other key features:
- APIs are currently available for Scala, Java, and Python, with support for other languages (such as R) on the way.
- Integrates well with the Hadoop ecosystem and data sources (HDFS, Amazon S3, Hive, HBase, Cassandra, etc.).
- Can run on clusters managed by Hadoop YARN or Apache Mesos, and can also run standalone.
Spark Core is complemented by a set of powerful, higher-level libraries that can be used seamlessly in the same application. These libraries currently include Spark SQL, Spark Streaming, MLlib (for machine learning), and GraphX, each of which is described below. Additional Spark libraries and extensions are under development as well.
Spark Core
Spark Core is the base engine for large-scale parallel and distributed data processing. It is primarily responsible for:
- Memory management and failure recovery
- Scheduling, distributing, and monitoring jobs on a cluster
- Interacting with the storage system
Spark introduces the concept of a Resilient Distributed Dataset (RDD), an immutable, fault-tolerant, distributed collection of objects that can be operated on in parallel. An RDD can contain objects of any type, and it is created by loading an external dataset or by distributing a collection from the driver program.
The RDD supports two types of operations:
- A transformation is an operation (such as map, filter, join, union, and so on) that is performed on an RDD and yields a new RDD containing the result.
- An action is an operation (such as reduce, count, first, and so on) that runs a computation on an RDD and returns a value.
Transformations in Spark are "lazy", meaning they do not compute their results right away. Instead, they just "remember" the operation to be performed and the dataset (e.g., a file) it is to be performed on. The transformations are only actually computed when an action is called, and the result is returned to the driver program. This design allows Spark to run more efficiently. For example, if a big file is transformed in various ways and passed to the first action, Spark only needs to process and return the result for the first line, rather than doing the work for the entire file.
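As a minimal sketch of this laziness (assuming an existing SparkContext sc and a purely hypothetical HDFS path):

val lines = sc.textFile("hdfs://.../big_file.txt")   // hypothetical path; nothing is read yet
val words = lines.flatMap(line => line.split(" "))   // transformation: only remembered, not executed
val firstWord = words.first()                        // action: triggers just enough work to return one element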
By default, each transformed RDD may be recomputed every time you run an action on it. However, you can also persist an RDD in memory using the persist or cache methods, in which case Spark keeps the elements around on the cluster so that the next query over them is much faster.
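A small sketch of caching, under the same assumptions as above; the second action reuses the in-memory data instead of re-reading the file:

val words = sc.textFile("hdfs://.../big_file.txt")   // hypothetical path
              .flatMap(line => line.split(" "))
words.cache()                      // same as persist() with the default MEMORY_ONLY storage level
words.count()                      // first action: computes the RDD and caches it
words.distinct().count()           // second action: reuses the cached elements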
Spark SQL
Spark SQL is a Spark component that lets us query data using either SQL or the Hive Query Language. It originated as the Apache Hive port to run on top of Spark (in place of MapReduce) and is now integrated into the Spark stack. In addition to providing support for a wide variety of data sources, it makes it possible to weave code transformations together with SQL queries, which results in a very powerful tool. Below is an example of a Hive-compatible query:
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)
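To illustrate the "weaving" of SQL queries and code transformations, here is a hedged sketch continuing from the sqlContext and src table above; it assumes Spark 1.3 or later, where sql() returns a DataFrame:

val rows = sqlContext.sql("SELECT key, value FROM src")   // SQL query returns a DataFrame
val bigKeys = rows.rdd                                    // drop down to an ordinary RDD of Rows
                  .map(row => row.getInt(0))              // continue with plain code transformations
                  .filter(_ > 100)
bigKeys.take(5).foreach(println)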
Spark Streaming
Spark Streaming supports real-time processing of streaming data, such as production web server log files (e.g., via Apache Flume and HDFS/S3), social media such as Twitter, and various message queues like Kafka. Under the hood, Spark Streaming receives the input data streams and divides the data into batches; the batches are then processed by the Spark engine to generate the final stream of results, also in batches.
The Spark Streaming API closely matches the Spark Core API, making it easy for programmers to work with both batch data and streaming data.
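For instance, a minimal word-count sketch over micro-batches might look like the following; it assumes an existing SparkContext sc and a hypothetical text stream arriving on localhost port 9999:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))        // group incoming data into 10-second batches
val lines = ssc.socketTextStream("localhost", 9999)    // hypothetical source of a text stream
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)                   // word counts within each batch
counts.print()                                          // emit the result of every batch
ssc.start()
ssc.awaitTermination()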
MLlib
MLlib is a machine learning library that provides a wide variety of algorithms designed to scale out on a cluster for classification, regression, clustering, collaborative filtering, and so on (you can get more information from the Toptal article on machine learning). Some of these algorithms also work with streaming data, such as linear regression using ordinary least squares, or k-means clustering (with more on the way). Apache Mahout (a machine learning library for Hadoop) has already moved away from MapReduce and joined forces with Spark MLlib.
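As a small, hedged sketch of what using MLlib looks like in practice, here is k-means clustering; it assumes an existing SparkContext sc and a hypothetical file of space-separated numeric features:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("feature_vectors.txt")                             // hypothetical input file
               .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
               .cache()
val model = KMeans.train(points, 3, 20)                                     // k = 3 clusters, 20 iterations
points.take(5).foreach(p => println(s"$p belongs to cluster ${model.predict(p)}"))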
GraphX
GraphX is a library for manipulating graphs and performing graph-parallel operations. It provides a uniform tool for ETL, exploratory analysis, and iterative graph computation. Apart from built-in operations for graph manipulation, GraphX also provides a library of common graph algorithms such as PageRank.
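For example, here is a minimal sketch of running PageRank with GraphX; it assumes an existing SparkContext sc and a hypothetical edge-list file of "sourceId destinationId" pairs:

import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "followers_edges.txt")   // hypothetical edge-list file
val ranks = graph.pageRank(0.0001).vertices                       // run PageRank until it converges
ranks.sortBy(_._2, ascending = false)                             // highest-ranked vertices first
     .take(5)
     .foreach { case (id, rank) => println(s"vertex $id has rank $rank") }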
How to Use Apache Spark: An Event Detection Use Case
Now that we have answered the question "What is Apache Spark?", let's think about what kinds of problems or challenges it can be used for most effectively.
I recently stumbled across an article about detecting earthquakes by analyzing a Twitter stream. It showed that this technique could inform you of an earthquake in Japan more quickly than the Japan Meteorological Agency. Although the article used a different technology, I think it is a good example of how we could put Spark to use with simple code snippets and no glue code.
First of all, we would have to process the tweets and filter out those that seem relevant to "earthquake" or "shaking". We can easily achieve this using Spark Streaming, as follows:
TwitterUtils.createStream(...)
            .filter(status => status.getText.contains("earthquake") || status.getText.contains("shaking"))
Then we would have to run some semantic analysis on the tweets to determine whether they refer to an earthquake that is currently occurring. Tweets like "Earthquake!" or "Now it is shaking", for example, would be considered positive matches, whereas tweets like "Attending an earthquake conference" or "The earthquake yesterday was scary" would not. The author of that article used a support vector machine (SVM) for this purpose. We take the same approach here, but could also try a streaming version. A code example using MLlib would look like the following:
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils

// We would prepare some earthquake tweet data and load it in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "sample_earthquake_tweets.txt")

// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

// Run training algorithm to build the model
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)

// Clear the default threshold.
model.clearThreshold()

// Compute raw scores on the test set.
val scoreAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()

println("Area under ROC = " + auROC)
If we are satisfied with the prediction rate of this model, we can move on to the next stage and react whenever we discover an earthquake. To detect one, we need a certain number (i.e., a certain density) of positive tweets within a defined time window (as described in the article). Note that, for tweets with Twitter location services enabled, we would also be able to extract the location of the earthquake. Armed with this knowledge, we could use Spark SQL to query an existing Hive table (storing users interested in receiving earthquake notifications) to retrieve their email addresses and send them personalized warning emails, as follows:
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// sendEmail is a custom function
sqlContext.sql("FROM earthquake_warning_users SELECT firstName, lastName, city, email")
          .collect().foreach(sendEmail)
Other Apache Spark Use Cases
Of course, Spark's potential use cases go far beyond earthquake predictions.
Here is a sampling of other use cases (far from an exhaustive list, of course) that require fast processing of varied and voluminous big data, and for which Spark is a good fit:
- In the gaming industry, processing and discovering patterns in the firehose of real-time in-game events, and being able to respond to them immediately, is a capability that can yield a lucrative business, for purposes such as player retention, targeted advertising, auto-adjustment of complexity levels, and so on.
- In the e-commerce industry, real-time transaction information can be passed to a streaming clustering algorithm like k-means or a collaborative filtering algorithm like ALS. The results can then be combined with other unstructured data sources, such as customer comments or product reviews, and used to continually improve the system's recommendations over time.
- In the finance or security industry, the Spark stack can be applied to fraud or intrusion detection systems or to risk-based authentication. By analyzing huge amounts of archived logs, and combining them with external data sources such as information about data breaches and compromised accounts (see, for example, https://haveibeenpwned.com/), along with information from the connection/request such as IP address or time, we can achieve very good results.
Conclusion
In short, Spark helps simplify the challenging and computationally intensive task of processing high volumes of real-time or archived data, both structured and unstructured. It integrates seamlessly with related complex capabilities such as machine learning and graph algorithms. Spark brings big data processing down to earth. Give it a try!