Spark is an open-source, memory-based cluster computing system designed for faster data analysis. It was developed in Scala by Matei Zaharia at the AMPLab of the University of California, Berkeley. The core of the codebase is only 63 Scala files, making it very lightweight. Spark provides an open-source clustered computing environment similar to Hadoop, but its in-memory, iteration-optimized design lets it perform better on certain workloads.
In the first half of 2014, Spark's open-source ecosystem grew dramatically, and it has become one of the most active open-source projects in the big data space, now embraced by major big data companies such as Hortonworks, IBM, Cloudera, MapR, and Pivotal. So why has Spark attracted so much attention? Here we look at DZone's summary of its six features.
The following is the translation:
1. Lightweight, fast processing. In big data processing, speed is often the top priority, and we constantly look for tools that can process our data as quickly as possible. Spark allows applications in a Hadoop cluster to run up to 100 times faster in memory and up to 10 times faster on disk. Spark achieves these gains by reducing disk I/O: it keeps intermediate processing data in memory.
Spark uses the concept of the Resilient Distributed Dataset (RDD), which allows it to store data transparently in memory and persist it to disk only when needed. This approach greatly reduces disk reads and writes during data processing, significantly cutting the time required.
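As a rough sketch of the RDD idea (assuming a local Spark installation; the dataset here is made up for illustration), the following Scala snippet caches an intermediate RDD in memory and spills to disk only if it does not fit:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RddCacheExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RddCacheExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Build an RDD and keep it in memory; Spark spills to disk only if it does not fit.
    val numbers = sc.parallelize(1 to 1000000)
      .map(_ * 2)
      .persist(StorageLevel.MEMORY_AND_DISK)

    // Both actions reuse the cached partitions instead of recomputing from the source.
    println(numbers.count())
    println(numbers.reduce(_ + _))

    sc.stop()
  }
}
```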
2. Easy to use: Spark supports multiple languages. Spark offers APIs in Java, Scala, and Python, letting developers work in the language they are most familiar with. It comes with more than 80 high-level operators and allows interactive queries in the shell.
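For instance, in the Scala shell (spark-shell, where the SparkContext `sc` is already provided), a few of those operators can be chained interactively; the file path below is only a placeholder:

```scala
// In spark-shell, `sc` is already available.
// A handful of high-level operators, chained interactively:
val lines  = sc.textFile("README.md")          // placeholder path
val words  = lines.flatMap(_.split("\\s+"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
counts.filter(_._2 > 10).take(5).foreach(println)
```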
3. Support for complex queries. Beyond the simple "map" and "reduce" operations, Spark also supports SQL queries, streaming queries, and complex analytics such as out-of-the-box machine learning and graph algorithms. Users can combine these capabilities seamlessly within the same workflow.
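A minimal sketch of mixing styles in one workflow might look like the following, using the newer SparkSession entry point rather than the SQLContext of the 2014-era API; the sample data and column names are invented for illustration:

```scala
import org.apache.spark.sql.SparkSession

object MixedWorkflow {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MixedWorkflow")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A plain RDD-style transformation...
    val events = spark.sparkContext
      .parallelize(Seq(("alice", 3), ("bob", 5), ("alice", 7)))
      .toDF("user", "score")

    // ...queried with SQL in the same workflow.
    events.createOrReplaceTempView("events")
    spark.sql("SELECT user, SUM(score) AS total FROM events GROUP BY user").show()

    spark.stop()
  }
}
```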
4. Real-time stream processing. MapReduce can only handle offline data, while Spark supports real-time stream computation. Spark relies on Spark Streaming to process data in real time (a minimal example follows the points below); of course, with YARN, Hadoop can also use other tools for stream processing. For Spark Streaming, Cloudera's assessment is:
- Simple: with a lightweight yet powerful API, Spark Streaming lets you develop streaming applications quickly.
- Fault tolerant: unlike other streaming solutions such as Storm, Spark Streaming can recover lost work without additional code or configuration.
- Integrated: the same code can be reused for stream processing and batch processing, and streaming data can even be saved into historical data.
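To make the reuse of batch-style code concrete, here is a minimal Spark Streaming word count; the socket host, port, and checkpoint path are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    // Process the stream in 5-second micro-batches.
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("/tmp/streaming-checkpoint") // enables recovery of lost work

    // Listen on a TCP socket (e.g. fed by `nc -lk 9999`); host and port are placeholders.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split("\\s+"))
      .map(w => (w, 1))
      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note how the flatMap/map/reduceByKey chain is the same code one would write for a batch word count on an ordinary RDD.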
5. Integrates with Hadoop and existing Hadoop data. Spark can run standalone or on today's YARN cluster manager, and it can read any existing Hadoop data. This is a big advantage: it can run against any Hadoop data source, such as HBase and HDFS. This feature makes it easy for users to migrate existing Hadoop applications where appropriate.
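As a small illustration of that interoperability (the HDFS path is hypothetical), an existing dataset on HDFS can be read directly with the same RDD API, and the job would typically be submitted to YARN with spark-submit:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ReadFromHdfs {
  def main(args: Array[String]): Unit = {
    // On a Hadoop cluster this would typically run via `spark-submit --master yarn`.
    val conf = new SparkConf().setAppName("ReadFromHdfs")
    val sc = new SparkContext(conf)

    // The HDFS path is a placeholder; textFile works with any Hadoop-supported filesystem.
    val logs = sc.textFile("hdfs:///data/logs/*.log")
    val errors = logs.filter(_.contains("ERROR"))
    println(s"Error lines: ${errors.count()}")

    sc.stop()
  }
}
```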
6. An active, ever-growing community. Spark originated in 2009 and now has more than 250 contributors from over 50 organizations; compared with June of last year, its line count of code has almost tripled, an enviable rate of growth.
Original link: http://www.csdn.net/article/2014-08-07/2821098-6-sparkling-features-of-apache-spark