The Features and Functions of Spark

Last Update:2020-05-27 Source: Internet

Author: User


Spark is popular mainly because it offers features that set it apart from other big data platforms. The main ones are as follows.

1. Lightweight and fast processing

Speed is often the first priority in big data processing. Compared with Hadoop MapReduce, Spark lets applications on a traditional Hadoop cluster run up to 100 times faster in memory, and up to 10 times faster even on disk. Spark achieves this improvement by reducing disk I/O: it keeps intermediate data in memory. Spark uses the RDD (Resilient Distributed Dataset) abstraction, which allows it to store data in memory and persist it to disk only when needed. This approach greatly reduces disk reads and writes during data processing and therefore greatly reduces run time.
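The benefit of keeping intermediate data in memory can be illustrated with a small toy model. This is a minimal sketch in plain Python, not Spark's API: `ToyDataset`, `load_numbers`, and the `compute_calls` counter are hypothetical names invented for this illustration. The sketch shows only the idea that a persisted intermediate result is computed once and then reused by later actions.

```python
# Toy model only: plain Python that imitates the idea of persisting an
# intermediate data set in memory. None of these names are Spark APIs.
class ToyDataset:
    """A lazily evaluated dataset that can optionally keep its result in memory."""

    def __init__(self, compute):
        self._compute = compute   # function that produces the full list of records
        self._cached = None       # in-memory copy, filled on first use after persist()
        self._persist = False
        self.compute_calls = 0    # how many times this dataset was actually recomputed

    def persist(self):
        # Mark the dataset so that its first computed result is kept in memory.
        self._persist = True
        return self

    def map(self, fn):
        # Build a new dataset whose records are derived from this one on demand.
        return ToyDataset(lambda: [fn(x) for x in self.collect()])

    def collect(self):
        # Serve from memory when a cached copy exists, otherwise recompute.
        if self._cached is not None:
            return self._cached
        self.compute_calls += 1
        result = self._compute()
        if self._persist:
            self._cached = result
        return result


def load_numbers():
    # Stands in for an expensive read from disk.
    return list(range(5))


squares = ToyDataset(load_numbers).map(lambda x: x * x).persist()
total = sum(squares.collect())     # first action: computes the data and caches it
largest = max(squares.collect())   # second action: served from the in-memory copy
```

Without the call to `persist()`, each `collect()` would recompute the mapped data from the source, which is the repeated work that in-memory storage avoids.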

2. Easy to use

Spark supports multiple languages: Java, Scala, Python, and R (R support was added in Spark 1.4). This lets more developers work in a language they already know and broadens the range of Spark's applications. Spark also comes with more than 80 high-level operators and allows interactive queries in the shell, and its multiple usage modes make applications more flexible.

3. Support complex queries

In addition to simple map and reduce operations, Spark supports more complex operations such as filter, foreach, reduceByKey, and aggregate, as well as SQL queries and streaming queries. What makes Spark more powerful is that users can seamlessly combine these functions in the same workflow. For example, Spark can obtain streaming data through Spark Streaming (Section 1.2.2 describes Spark Streaming in detail) and then run real-time SQL queries on that data or use the MLlib library to build a recommendation system. Integrating these complex services is not complicated, because they all operate on the RDD abstraction, so the cost of converting data between different business processes is small. This reflects the idea of a unified engine that handles different types of workloads. Streaming, the MLlib library, and RDDs are described in detail in the following chapters.
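To make the semantics of operators such as filter and reduceByKey concrete, the following is a minimal sketch in plain Python. It does not use Spark: `reduce_by_key` is a hypothetical helper written for this illustration, and the `sales` records are made-up sample data. In PySpark the corresponding steps would typically be expressed with `rdd.filter(...)` and `rdd.reduceByKey(...)`.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, fn):
    """Combine all values that share a key using the binary function fn.

    Hypothetical helper that mimics what a reduceByKey operator computes;
    it runs on a single machine and is not Spark code.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(fn, values) for key, values in grouped.items()}


# Made-up sample records: (product, quantity sold).
sales = [("book", 2), ("pen", 5), ("book", 3), ("pen", 0), ("lamp", 1)]

# filter step: drop orders with a quantity of zero.
non_empty = [(product, qty) for product, qty in sales if qty > 0]

# reduceByKey step: sum the quantities for each product.
totals = reduce_by_key(non_empty, lambda a, b: a + b)
```

Chaining the two steps on the same collection is a small-scale version of combining several operators in one workflow.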

4. Real-time stream processing

Unlike MapReduce, which can handle only offline data, Spark also supports real-time stream computing, mainly through Spark Streaming. Of course, since the introduction of YARN, Hadoop can also use other tools for stream computing. Cloudera, a well-known big data product company, once commented on Spark Streaming as follows:

1) Simple and lightweight, with a powerful API: Spark Streaming allows users to quickly develop streaming applications.

2) Strong fault tolerance: other streaming solutions, such as Storm, require additional configuration, whereas Spark does not require additional code or configuration. Spark Streaming, the application framework built on top of Spark, handles much of the recovery and delivery work itself, which makes Spark's stream computing suitable for a wider range of needs.

3) Good integration: the same code can be reused for stream processing and batch processing, and stream data can even be saved as historical data (for example, in HDFS).
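The point about reusing the same code for batch and stream processing can be sketched as follows. This is plain Python rather than Spark Streaming: `count_words` and `micro_batches` are hypothetical names, the input lines are made up, and batches are cut by record count here purely for simplicity, whereas Spark Streaming groups incoming records by time interval.

```python
from collections import Counter


def count_words(lines):
    """Shared business logic: count words in any iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def micro_batches(records, batch_size):
    """Split a sequence of records into small fixed-size batches (toy model)."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


# Made-up input lines.
lines = ["spark streaming", "spark sql", "streaming data", "spark"]

# Batch processing: run the logic once over the whole data set.
batch_result = count_words(lines)

# Stream processing: run the same logic on each micro-batch as it "arrives"
# and merge the partial counts into a running total.
stream_result = Counter()
for batch in micro_batches(lines, batch_size=2):
    stream_result += count_words(batch)
```

Because both paths call the same `count_words` function, the batch result and the accumulated stream result agree once all records have been seen.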

5. Integration with existing Hadoop data

Spark can run independently (in standalone mode) or in an existing cluster managed by YARN. It can also read any existing Hadoop data, which is a major advantage: it can run on any Hadoop data source, such as HBase or HDFS. Where appropriate, this feature allows users to migrate existing Hadoop applications easily.

6. Active and growing community

Spark originated in 2009, and 730 engineers from more than 50 organizations have contributed code. By 2015, the number of lines of code had increased nearly threefold compared with June 2014 (figures released at Spark Summit 2015), which is remarkable growth.

The Functions of Spark

Why is Spark used by so many companies at this stage? From the perspective of demand, the amount of data in the information industry continues to accumulate and expand. A traditional single machine cannot process it because of software and hardware limitations, so a system that can store and analyze large amounts of data is needed. In addition, because business data at large Internet companies such as Google and Yahoo is growing very fast, strong demand has driven the development of data storage and computing analysis technologies. At the same time, companies' requirements for efficient, real-time big data processing keep rising. Spark appeared against this demand-driven background. It is designed to handle big data problems quickly in a variety of scenarios and to mine the value in big data efficiently, so as to provide decision support for business development.

At present, Spark is widely used in e-commerce, telecommunications, video entertainment, retail, business analysis, and finance. The application section in Part 4 of this book shares how companies in these fields use Spark, giving readers a glimpse of Spark's powerful capabilities.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
