The Apache Software Foundation has officially announced the first production-ready release of Spark, analytics software that can greatly speed up jobs on the Hadoop data-processing platform.
Nicknamed the "Hadoop Swiss Army knife", Apache Spark lets users build data analysis jobs that run faster than they would on standard Apache Hadoop MapReduce.
Replacing MapReduce
MapReduce is now widely criticized as a performance bottleneck in Hadoop clusters: because it executes jobs in batch mode, real-time data analysis is not possible.
Spark provides a ready alternative to MapReduce. It executes jobs in short bursts of micro-batches that are five seconds or less apart. It also offers more robust performance than real-time, stream-oriented Hadoop frameworks such as Twitter Storm.
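The micro-batch idea can be illustrated with a minimal sketch in plain Python. It does not use Spark's API; the function name `micro_batches`, the event format, and the sample data are assumptions made for illustration. Incoming events are grouped into fixed five-second windows, and each window is then handled as one small batch job.

```python
from collections import defaultdict


def micro_batches(events, interval_seconds=5.0):
    """Group (timestamp, value) events into fixed-width time windows.

    Hypothetical helper for illustration only; not part of Spark.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        # Events whose timestamps fall in the same window share an index.
        batch_index = int(timestamp // interval_seconds)
        batches[batch_index].append(value)
    # Return batches in time order; windows without events are absent.
    return [batches[i] for i in sorted(batches)]


# Hypothetical stream of (seconds since start, payload) pairs.
events = [(0.5, "a"), (1.2, "b"), (4.9, "c"), (5.0, "d"), (11.3, "e")]

batches = micro_batches(events)
for batch in batches:
    # Each batch is processed as one small job, here a record count.
    print(len(batch), batch)
```

A real streaming engine forms these windows continuously as data arrives rather than from a finished list, but the grouping step has the same shape.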
Cloudera announces it is abandoning MapReduce and embracing Spark
On April 25, 2014, Cloudera formally announced that it was abandoning MapReduce in favor of Spark.
51CTO interviewed Liancheng, formerly of the Intel research institute and now a researcher at Databricks. Theory shows that the MapReduce model can simulate any distributed computation, although not necessarily efficiently, Liancheng said. Spark's RDD-based model can express the MapReduce model easily and completely, and it is more efficient than MapReduce because RDDs provide a more efficient abstraction for sharing distributed data.
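For readers unfamiliar with the MapReduce model discussed in the interview, here is a minimal word-count sketch in plain Python. It is not Spark or Hadoop code; the function names `map_phase`, `shuffle_phase`, and `reduce_phase` and the sample input are assumptions chosen to mirror the three stages of the model.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every line.
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    # Shuffle: collect all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Reduce: combine each key's values into a single result.
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


# Hypothetical input lines.
lines = ["Spark replaces MapReduce", "spark runs on Hadoop"]

word_counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(word_counts)
```

In Spark, a computation of this shape is typically written as a chain of RDD operators such as `flatMap`, `map`, and `reduceByKey`, so the MapReduce pattern is one of many pipelines the RDD abstraction can express.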
Spark can handle a variety of job types, including real-time data analysis and, through its software libraries, more computationally intensive work such as machine learning and graph processing.
With Spark, developers can write data analysis jobs in languages such as Java, Scala, or Python, using a set of more than 80 high-level operators.
Changes brought about by Spark 1.0
With version 1.0, Apache Spark now provides a stable API (application programming interface) that developers can use to integrate their applications with Spark. The existing standard library has also been greatly enhanced.
Another new feature in Spark 1.0 is the Spark SQL component for accessing structured data, which lets users query structured data alongside unstructured data in the same analysis.
Apache Spark is fully compatible with the Hadoop Distributed File System (HDFS) and works alongside other Hadoop components, including YARN (Yet Another Resource Negotiator) and the HBase distributed database.
What is Spark
Spark was originally developed by the AMPLab (Algorithms, Machines, and People Lab) at the University of California, Berkeley, and it became an Apache Incubator project in June 2013. Several IT vendors, including Cloudera, Pivotal, IBM, Intel, and MapR, have added Spark to their Hadoop stacks. Databricks, a company founded by some of Spark's developers, provides commercial support for the software.
In addition to the companies mentioned above, Yahoo and NASA also use the software for day-to-day data processing tasks.
Like all other Apache software, Apache Spark is licensed under the Apache License, Version 2.0.