Seven tools to fire up the Spark big data engine

Spark is taking the data processing world by storm. In this article, let's take a look at the key tools that make up Spark's big data platform.

The Spark ecosystem at a glance
Apache Spark not only makes big data processing faster; it also makes it easier, more powerful, and more convenient. Spark is not a single technology but a combination of many parts. New features and performance improvements are added continually, and each part keeps being refined.
This article describes each major part of the Spark ecosystem: what it does, why it matters, how it has developed, where it falls short, and where it may be headed.

Spark Core
Spark Core is, as the name suggests, the core of Spark. In addition to coordinating and scheduling jobs, Spark Core provides the basic abstraction for data processing in Spark, called the Resilient Distributed Dataset (RDD).
RDDs support two kinds of operations on data: transformations and actions. Transformations modify the data and return a new RDD; actions compute a result based on an existing RDD (such as the number of objects it contains).
Spark is fast because both transformations and actions are kept in memory. Transformations are evaluated lazily, meaning they run only when an action needs the resulting data; the downside is that lazy evaluation can make it hard to figure out what is running slowly.
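To make the transformation/action distinction concrete, here is a minimal Scala sketch, assuming only a local Spark installation and the standard Spark Core API (the app name rdd-demo is illustrative). The filter call is a transformation that merely describes a new RDD; nothing executes until the count action asks for a result.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddDemo {
  def main(args: Array[String]): Unit = {
    // Run locally on all cores; a cluster master URL would go here instead.
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))

    val numbers = sc.parallelize(1 to 1000)

    // Transformation: lazily describes a new RDD; no work happens yet.
    val evens = numbers.filter(_ % 2 == 0)

    // Action: triggers the actual computation and returns a result.
    println(evens.count())  // prints 500

    sc.stop()
  }
}
```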
Spark's speed is still improving. While Java's memory management often causes problems for Spark, Project Tungsten plans to sidestep the JVM's memory and garbage-collection subsystems to improve memory efficiency.

Spark APIs
Spark is written primarily in Scala, so Spark's main API has long supported Scala. But three other widely used languages are supported as well: Java (which Spark itself also relies on), Python, and R.
In general, you're best off choosing whichever language you know best, since the features you need will most likely be supported directly in it. There is one exception: machine learning support in SparkR is comparatively thin, with only a small set of algorithms currently available. That is bound to change over time.

Spark SQL
Never underestimate the power or convenience of executing SQL queries against bulk data. Spark SQL provides a common mechanism for executing SQL queries (and requesting results as columnar DataFrames) against data held by Spark, including queries piped in through ODBC/JDBC connectors. You don't even need a formal data source: Spark 1.6 added support for querying flat files in a supported format directly, much as Apache Drill does.
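As a rough illustration, here is a Scala sketch using the Spark 1.6-era SQLContext API; the input file events.json and its userId field are hypothetical stand-ins for whatever flat file is actually on hand.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SqlDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sql-demo").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Load a flat file straight into a columnar DataFrame.
    val events = sqlContext.read.json("events.json")  // hypothetical input
    events.registerTempTable("events")

    // Plain SQL over the file; the result comes back as a new DataFrame.
    sqlContext.sql(
      "SELECT userId, COUNT(*) AS hits FROM events GROUP BY userId").show()

    sc.stop()
  }
}
```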
Spark SQL is not really meant for updating data; that would run contrary to the whole point of Spark. Resulting data can be written back out as a new Spark data source (such as a new Parquet table), but UPDATE queries are not supported. Don't expect such a feature any time soon; improvements to Spark SQL are focused mostly on its performance, since it also underpins Spark Streaming.
Spark Streaming
Spark's design makes it possible to support many processing modes, including stream processing; hence the name Spark Streaming. The conventional wisdom about Spark Streaming is that it is still rough around the edges, so you're best off using it only if you don't need split-second latency, or if you haven't already invested in another streaming solution such as Apache Storm.
But Storm is slipping in popularity; Twitter, long a heavy Storm user, has since moved to its own project, Heron. What's more, Spark 2.0 promises a new "structured streaming" model for running interactive Spark SQL queries on live data, including queries that use Spark's machine learning library. Whether its performance will be high enough to beat rivals remains to be seen, but it deserves serious consideration.
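For a feel of the current (pre-2.0) DStream API, here is a minimal Scala sketch of the classic streaming word count; it assumes something is writing lines to localhost port 9999 (for example, nc -lk 9999), which is purely an illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingDemo {
  def main(args: Array[String]): Unit = {
    // At least two local threads: one for the receiver, one for processing.
    val conf = new SparkConf().setAppName("streaming-demo").setMaster("local[2]")
    // Slice the stream into micro-batches of 5 seconds each.
    val ssc = new StreamingContext(conf, Seconds(5))

    // Read text lines from a socket (assumes a writer on localhost:9999).
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .print()  // dump each batch's word counts to the console

    ssc.start()
    ssc.awaitTermination()
  }
}
```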

MLlib (machine learning)
Machine learning has a reputation for being both magical and hard. Spark lets you run many common machine learning algorithms against the data held in Spark, making those kinds of analyses far easier and more accessible to Spark users.
The list of algorithms available in MLlib grows with each revision of the framework. That said, some kinds of algorithms are still missing; anything involving deep learning, for example. Third parties are leveraging Spark's popularity to fill that gap: Yahoo, for instance, can perform deep learning with CaffeOnSpark, which harnesses the Caffe deep learning framework through Spark.
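As one example of what MLlib offers out of the box, here is a hedged Scala sketch of k-means clustering over a handful of toy 2-D points; the data is fabricated for illustration, and a real job would read from a file or table instead.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object MllibDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("mllib-demo").setMaster("local[*]"))

    // Toy 2-D points: two obvious clusters near (0, 0) and (9, 9).
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 8.9)))

    // Fit k-means with two clusters, capped at 20 iterations.
    val model = KMeans.train(points, k = 2, maxIterations = 20)
    model.clusterCenters.foreach(println)

    sc.stop()
  }
}
```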

GraphX (graph computation)
Mapping the relationships among millions of entities typically requires a graph, a data structure that describes how those entities interrelate. Spark's GraphX API lets you perform graph operations on your data using Spark's own methods, so the heavy lifting of constructing and transforming such graphs is offloaded to Spark. GraphX also includes several common algorithms for working with the data, such as PageRank and label propagation.
For now, one of GraphX's main limitations is that it is best suited to static graphs. Processing a graph that has new vertices added to it incurs a serious performance hit. Also, if you already use a full-blown graph database solution, GraphX is unlikely to replace it yet.
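To show the flavor of the API, here is a small Scala sketch that builds a static three-vertex graph and runs GraphX's built-in PageRank on it; the vertex names and "follows" edges are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphxDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("graphx-demo").setMaster("local[*]"))

    // A tiny static graph: vertices carry names, edges carry a relation label.
    val vertices = sc.parallelize(Seq(
      (1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))
    val graph = Graph(vertices, edges)

    // Run PageRank until the ranks converge within the given tolerance.
    graph.pageRank(0.0001).vertices
      .join(vertices)                 // re-attach names to the computed ranks
      .collect()
      .foreach { case (_, (rank, name)) => println(s"$name: $rank") }

    sc.stop()
  }
}
```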

SparkR (R on Spark)
The R language gives statisticians an environment for numerical analysis and machine learning work. Spark added R support in June 2015 to match its existing support for Python and Scala.
Besides giving potential Spark developers one more language option, SparkR lets R programmers do many things they couldn't do before, such as working with data sets that exceed a single machine's memory, or easily running analyses in multiple threads or on multiple machines at once.
SparkR also lets R programmers take full advantage of Spark's MLlib machine learning module to create generalized linear models. Unfortunately, not all MLlib features are supported in SparkR yet, although Spark is closing that gap with each successive revision.

Original title: 7 Tools to fire up Spark's big data engine
"51CTO translation, cooperation site reprint please specify the original translator and source for 51cto.com"
