Spark is popular mainly because it offers features that set it apart from other
big data platforms. The main ones are as follows.
1. Lightweight and fast processing
Speed usually comes first in big data processing. Spark lets applications on traditional Hadoop clusters run up to 100 times faster in memory, and up to 10 times faster even on disk. Spark achieves this by reducing disk I/O: it keeps intermediate data in memory. Spark uses the RDD (Resilient Distributed Datasets) data abstraction, which allows it to store data in memory and persist it to disk only when needed. This approach greatly reduces disk reads and writes during data processing, and with them the running time.
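As a rough illustration of why keeping intermediate data in memory saves disk reads, here is a minimal plain-Python sketch. It does not use Spark: `LazyDataset`, `load_from_disk`, and `reads` are hypothetical names invented for this example, and the class only imitates the idea of an RDD that is computed on demand and optionally cached.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark's API):
    a recipe that is only computed when a result is requested."""

    def __init__(self, compute):
        self._compute = compute    # function returning a list of records
        self._use_cache = False    # set by cache()
        self._cached = None        # in-memory copy, filled on first use

    def map(self, fn):
        # Build a new recipe on top of this one; nothing runs yet.
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def cache(self):
        # Ask for the result to be kept in memory after the first computation.
        self._use_cache = True
        return self

    def collect(self):
        if not self._use_cache:
            return self._compute()          # recompute every time
        if self._cached is None:
            self._cached = self._compute()  # compute once, keep in memory
        return self._cached


reads = {"count": 0}  # how many times the "disk" was read

def load_from_disk():
    # Pretend this is an expensive disk read.
    reads["count"] += 1
    return [1, 2, 3, 4]

squares = LazyDataset(load_from_disk).map(lambda x: x * x).cache()
total = sum(squares.collect())     # first request: reads the "disk"
largest = max(squares.collect())   # second request: served from memory
print(total, largest, reads["count"])
```

Without the call to `cache()`, the second request would trigger a second read of the source data, which is the cost the in-memory approach is meant to avoid.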
2. Easy to use
Spark supports multiple languages: Java, Scala, Python, and R (R support was added in Spark 1.4). This lets more developers work in a language they already know and widens the range of Spark's applications. Spark also comes with more than 80 high-level operators and allows interactive queries in the shell, and its multiple usage modes make applications more flexible.
3. Support complex queries
In addition to simple map and reduce operations, Spark supports richer operations such as filter, foreach, reduceByKey, and aggregate, as well as SQL queries and streaming queries. What makes Spark more powerful is that users can combine these functions seamlessly in the same workflow. For example, Spark can obtain streaming data through Spark Streaming (Section 1.2.2 describes Spark Streaming in detail) and then run real-time SQL queries on the data or use the MLlib library to build recommendations. Integrating these complex services is not complicated: they all work on the RDD data abstraction, so converting data between different processing steps is cheap. This reflects the idea of one unified engine handling different kinds of workloads. Streaming, the MLlib library, and RDDs are covered in detail in the following chapters.
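To make the operations named above more concrete, the following plain-Python sketch mimics what filter, map, reduceByKey, and aggregate do to a small data set. It is not Spark code: `reduce_by_key` is a hypothetical helper written for this example, and everything runs on ordinary Python lists in a single process.

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, fn):
    """Hypothetical helper: combine the values of each key with fn,
    in the spirit of a reduceByKey operation."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(fn, values) for key, values in grouped.items()}

words = ["spark", "hadoop", "spark", "yarn", "spark", "hadoop"]

# filter: keep only the records of interest
kept = [w for w in words if w != "yarn"]

# map: turn each record into a (key, value) pair
pairs = [(w, 1) for w in kept]

# reduceByKey: sum the values that share a key
counts = reduce_by_key(pairs, lambda a, b: a + b)

# aggregate: fold all counts into one (sum, number of keys) summary
total, n = reduce(lambda acc, c: (acc[0] + c, acc[1] + 1),
                  counts.values(), (0, 0))

print(counts)
print(total / n)  # average occurrences per distinct word
```

Each step takes the output of the previous one, which is the same chaining style the text describes for combining operations in one workflow.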
4. Real-time stream processing
Unlike MapReduce, which can only process offline data, Spark also supports real-time stream computing, mainly through Spark Streaming. (Since the introduction of YARN, Hadoop can also use other tools for stream computing.) Cloudera, a well-known big data company, once described Spark Streaming as follows:
1) Simple and lightweight, with a powerful API. Spark Streaming lets users develop streaming applications quickly.
2) Strong fault tolerance. Other streaming solutions, such as Storm, require additional configuration, whereas Spark needs no extra code or configuration: Spark Streaming, the framework built on top of it, handles much of the recovery and delivery work itself, which makes Spark's stream computing suitable for a wider range of needs.
3) Good integration. The same code can be reused for stream processing and batch processing, and stream data can even be saved as historical data (for example, in HDFS).
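The point about reusing the same code for stream and batch processing can be sketched in plain Python. This is only an illustration of the idea, not Spark Streaming itself: `count_levels`, `micro_batches`, and the sample log lines are all hypothetical, and the "stream" is simulated by cutting a list into small batches.

```python
from collections import Counter

def count_levels(records):
    """Business logic written once: count log lines per level."""
    return Counter(line.split()[0] for line in records)

def micro_batches(records, size):
    """Cut a sequence into small batches of at most `size` records,
    standing in for data that arrives over time."""
    for start in range(0, len(records), size):
        yield records[start:start + size]

log = [
    "ERROR disk full",
    "INFO started",
    "ERROR timeout",
    "INFO done",
    "WARN slow",
]

# Batch mode: process the whole historical data set in one go.
batch_result = count_levels(log)

# Streaming mode: apply the same function to each small batch as it
# "arrives" and keep a running total.
running = Counter()
for chunk in micro_batches(log, size=2):
    running += count_levels(chunk)

print(batch_result == running)
```

Because `count_levels` knows nothing about where its records come from, the batch and streaming paths share the same logic and end up with the same totals.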
5. Integration with existing Hadoop data
Spark can run on its own (in standalone mode) or on an existing YARN-managed cluster. It can also read any existing Hadoop data, which is a major advantage: it works with any Hadoop data source, such as HBase or HDFS. Where appropriate, this makes it easy for users to migrate existing Hadoop applications.
6. Active and growing community
Spark originated in 2009, and 730 engineers from more than 50 organizations have contributed code. By 2015 the number of lines of code had grown nearly threefold compared with June 2014 (according to figures released at Spark Summit 2015), which is remarkable growth.
The role of Spark
Why are so many companies using Spark today? On the demand side, the amount of data in the information industry keeps accumulating and expanding. A traditional single machine cannot process it because of its software and hardware limits, so a system that can store and analyze large amounts of data is needed. At the same time, the business data of large Internet companies such as Google and Yahoo is growing very fast, and this strong demand has driven the development of data storage and computing and analysis technologies. Companies' requirements for efficient, real-time big data processing are also becoming higher and higher. Spark appeared against this demand-driven background. It is designed to handle big data problems quickly in a variety of scenarios and to mine the value in big data efficiently, so as to provide decision support for business development.
At present, Spark is widely used in e-commerce, telecommunications, video entertainment, retail, business analysis, and finance. The application section in Part 4 of this book presents experience shared by companies using Spark in these fields, giving readers a glimpse of Spark's capabilities.