In the big data field, Apache Spark (hereinafter "Spark") was undoubtedly the most-watched project of 2014. Born at UC Berkeley's AMPLab and currently shepherded by the commercial company Databricks, Spark became an Apache top-level project in February 2014 and has since been one of the ASF's most active projects, with broad support across the industry: the 1.2 release of December 2014 contained more than 1,000 commits from 172 contributors. Over 2014 Spark shipped nine releases in total (including the landmark 1.0 released at the end of May), and the community's vitality is plain to see. It is worth mentioning that in November 2014 Databricks completed the Daytona GraySort benchmark on AWS and set a new record for the test. This article reviews Spark's development over 2014.
Spark in 2014: A Single Spark Starts a Prairie Fire
First, consider Spark conferences and community exchanges. The most authoritative conference in the Spark field is undoubtedly Spark Summit, held successfully in both 2013 and 2014, with engineers from around the world sharing their Spark use cases. Given Spark's current momentum, there will be two Spark Summits in 2015: Spark Summit East and Spark Summit West. In China, the first China Spark Technology Summit (Spark Summit China) was held in Beijing in April 2014, and by most accounts nearly all of the country's major Internet companies attended; one can look forward to what this year's summit will bring. Beyond these larger conferences, Spark Meetups are held irregularly around the world: as of this writing, 33 cities across 13 countries have hosted one, four of them in China (Beijing, Hangzhou, Shanghai, and Shenzhen). For those unable to attend in person, online courses are organized as well. Clearly, Spark-related exchange activity in 2014 was very frequent, which has been a great help in promoting Spark's development.
Second, in 2014 major vendors announced cooperation with Databricks one after another. Cloudera had already announced at the end of 2013 that it would add Spark to its distribution, and more companies followed, including DataStax, MapR, Pivotal, and Hortonworks. This shows that Spark has been recognized by many big data companies, and these companies are integrating their own products tightly with it. For example, DataStax integrated Cassandra with Spark so that Spark can operate on data stored in Cassandra, and Elasticsearch has been integrated with Spark as well. More moves of this kind can be found in the relevant talks from Spark Summit 2014.
In addition, 2014 saw many more companies put Spark into production. Well-known adopters abroad include Yahoo!, eBay, Twitter, Amazon, SAP, Tableau, and MicroStrategy. Gratifyingly, domestic companies have been no less active: Taobao, Tencent, Baidu, Xiaomi, JD.com, Vipshop, iQiyi, Sohu, Qiniu, Huawei, AsiaInfo, and other well-known companies are all using Spark in production. This has also led more and more Chinese engineers to contribute code to Spark, particularly Spark SQL, where roughly half of the contributors are Chinese engineers. Adoption by major well-known companies has greatly boosted the whole industry's interest in and confidence about Spark, and we have reason to believe the number of companies using Spark will grow explosively in 2015. At the same time, a number of Spark-based startups have appeared, many of which are developing quite well, such as Adatao and Tuplejump.
With market demand for Spark engineers rising, Databricks launched the Spark developer certification program at just the right time; the first offline exam was held in Barcelona, Spain, in November 2014. As of this writing (January 2015), the certification does not yet support online testing, but an online test platform will be available soon.
To keep the Spark ecosystem developing sustainably and healthily, more and more companies and organizations are building applications and extension libraries on top of Spark. As these libraries multiplied, Databricks launched http://spark-packages.org on Christmas Eve 2014, a pip-like index for tracking them. A number of libraries have already settled in, several of them quite good, for example dibbhatt/kafka-spark-consumer, spark-jobserver/spark-jobserver, and mengxr/spark-als.
Spark in 2014: The Technical Evolution Under the Hood
As shown in Figure 1, Spark covers batch processing, stream processing, graph processing, machine learning, ad hoc queries, and relational queries, which means a single framework can satisfy all of these usage scenarios. In the past we might have had to prepare one framework per workload, for example Hadoop MapReduce for batch processing and Storm for streaming. That meant writing separate business code for two computing frameworks, with almost no reuse between them; and to keep the systems stable, we had to invest heavily in understanding the internals of both Hadoop MapReduce and Storm, a huge cost in manpower. With Spark we only need to understand Spark itself. Another attraction is that batch and streaming business code in Spark can be almost completely reused: essentially we write the logic once and run it as a batch job or as a streaming job (see the sketch after Figure 1). Finally, Spark can seamlessly use data stored on HDFS without any data migration.
Figure 1: The Spark stack
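To make the reuse point concrete, here is a minimal Scala sketch using the 1.x-era APIs (the HDFS paths and socket source are hypothetical): the same wordCount function runs unchanged as a batch job over an RDD and as a streaming job via DStream.transform.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits (needed before 1.3)
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReuseDemo {
  // Business logic written once, against RDDs.
  def wordCount(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("reuse").setMaster("local[2]"))

    // Batch: apply it to a file on HDFS (path is a placeholder).
    wordCount(sc.textFile("hdfs:///data/in")).saveAsTextFile("hdfs:///data/out")

    // Streaming: reuse it unchanged on each micro-batch via transform().
    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.socketTextStream("localhost", 9999).transform(wordCount _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```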
Meanwhile, because existing systems have to share and exchange data through distributed file systems such as HDFS, I/O overhead greatly reduces computational efficiency; on top of that, repeated serialization and deserialization across iterations is a cost that cannot be ignored. With this in mind, Spark abstracts the concept of the RDD and defines a rich set of operators on it, of which MapReduce is only a very small subset. At the same time, an RDD can be cached in memory, so iterative computation fully enjoys the speedup of in-memory processing. Unlike the process-based MapReduce model, Spark uses a multithreaded execution model, which keeps task scheduling latency at the sub-second level; when there are very many tasks this significantly reduces overall scheduling time, and it lays the foundation for micro-batch stream computation. Another feature of Spark is DAG-based task scheduling and optimization: Spark does not need to schedule a separate job for every step of the computation as MapReduce does; instead, its rich operators express computations naturally as a DAG. Within each stage, execution is pipelined, so even without caching data in memory, Spark is more efficient than Hadoop. Finally, Spark achieves fault tolerance through RDD lineage information: because RDDs are immutable, Spark does not need to record intermediate state, and when some partitions of an RDD are lost, Spark can recover them in parallel from the lineage. When the lineage grows long, however, users are advised to checkpoint in time to reduce recovery time.
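As a rough illustration of caching and checkpointing, here is a sketch with made-up paths and a toy update rule, not production code:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iterative").setMaster("local[2]"))
    sc.setCheckpointDir("hdfs:///tmp/ckpt") // reliable storage for truncating long lineages

    // Cache the working set so every iteration reads from memory, not HDFS.
    val points = sc.textFile("hdfs:///data/points")
      .map(_.split(",").map(_.toDouble))
      .cache()

    // Mark for checkpointing; the lineage is truncated once the first job materializes it.
    points.checkpoint()

    var w = 0.0
    for (_ <- 1 to 10) {
      // Each pass reuses the in-memory partitions; the lineage stays short.
      val grad = points.map(p => p(0) * w - p(1)).sum() / points.count()
      w -= 0.1 * grad
    }
  }
}
```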
Below, following the trajectory of each major release in 2014, we summarize the new features and stability improvements in Spark and its components (Spark Streaming, MLlib, GraphX, and Spark SQL).
Spark 0.9.x
In early February 2014, Databricks released Spark 0.9.0, the first version of the year. The most immediate change was the upgrade of Scala from 2.9.x to 2.10. Since Scala offered no binary backward compatibility at the time, everyone had to recompile their business code with Scala 2.10, which made for a small episode.
The biggest contribution of this version was the new configuration system, SparkConf. Before this, all kinds of configuration properties had to be passed in directly when constructing the SparkContext; with SparkConf, the properties are configured on a SparkConf object, which is then handed to the SparkContext, an arrangement that is especially convenient in testing. In addition, when submitting a job, the driver program is now allowed to run on a server inside the cluster; previously it could only run on a machine outside the cluster.
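A minimal sketch of the new configuration style (the values here are arbitrary):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Properties are gathered on a SparkConf and handed to SparkContext in one
// place, instead of being threaded through constructor parameters.
val conf = new SparkConf()
  .setMaster("local[4]")                 // easy to swap out in tests
  .setAppName("conf-demo")
  .set("spark.executor.memory", "2g")

val sc = new SparkContext(conf)
```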
In this version Spark Streaming finally, and confidently, shed its alpha label, and an HA mode was added; as we now know, HA alone does not guarantee zero data loss, a point we will return to with 1.2. As Spark Streaming graduated from alpha, a new alpha component arrived: GraphX, a distributed graph computation framework providing standard algorithms such as PageRank, connected components, strongly connected components, and triangle counting, though its stability still needed strengthening. MLlib added the common naive Bayes algorithm in this version, but more notably it finally began to support a Python API (which requires NumPy).
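For a taste of GraphX's built-in algorithms, a sketch assuming an existing SparkContext sc and a hypothetical edge-list file:

```scala
import org.apache.spark.graphx.GraphLoader

// Each line of the input file is a "srcId dstId" pair.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")

// Run PageRank until the per-iteration change falls below the tolerance,
// then show the ten highest-ranked vertices.
val ranks = graph.pageRank(0.0001).vertices
ranks.sortBy(_._2, ascending = false).take(10).foreach(println)
```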
The community released two maintenance versions, 0.9.1 in April and 0.9.2 in July, which fixed bugs without adding new features; 0.9.1 was the first release after Spark became an Apache top-level project.
Spark 1.0.x
"Long-awaited" to describe Spark1.0 is not too much, as a landmark release, the spark community is also very cautious, after the release of a number of RC versions, finally at the end of May, the official release of the 1.0 version. This version has more than 110 contributor, after 4 months of joint efforts, and the 1.0 version has become the largest release of spark since the birth. As the beginning version of 1.x, the spark community also guarantees the API's compatibility on all subsequent 1.x versions. On the other hand, the Java API for Spark 1.0 begins to support the lambda expression of Java 8, which makes it much easier for some users who have to write Spark programs in Java.
The high-profile Spark SQL finally made its debut in this version. Although only alpha, Spark users around the world were eager to try it, and the momentum has continued: Spark SQL is now the single most active component in Spark. Speaking of Spark SQL, we have to mention Shark. At Spark Summit 2014, Databricks announced that Shark had completed its academic mission: its overall design was too dependent on Hive to support long-term development, so Shark's development was terminated and the effort shifted entirely to Spark SQL. Spark SQL supports operating on structured data with SQL and also supports using HiveContext to manipulate data in Hive. The industry's strong demand for SQL on Hadoop means Spark SQL is bound to grow rapidly over the long run. It is worth mentioning that the Hive community has also launched a Hive on Spark project, replacing Hive's execution engine with Spark. From the standpoint of goals, however, Hive on Spark focuses on full backward compatibility with Hive, while Spark SQL focuses more on interoperating with the other Spark components and on diversified data processing.
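A small sketch of the Spark SQL programming model as it looked in the 1.x line (registerTempTable is the 1.1+ name; the file path and schema are made up, and sc is an existing SparkContext):

```scala
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD // implicit RDD -> SchemaRDD conversion

// Give an ordinary RDD a schema and register it as a table.
val people = sc.textFile("hdfs:///data/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerTempTable("people")

// Query it with plain SQL; swap SQLContext for HiveContext to reach Hive tables.
sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
  .collect()
  .foreach(println)
```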
MLlib also improved considerably: 1.0 began to support sparse data, definitely a heartening feature for MLlib users, and on the algorithm side it added decision trees, SVD, and PCA. The performance of both Spark Streaming and GraphX was enhanced in this release.
In addition, Spark provided a new job submission tool called spark-submit, which can submit applications whether they run in standalone mode or on YARN; with it, Spark unified the entry point for job submission.
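The invocation looks the same regardless of cluster manager; only --master changes (the class and jar names below are hypothetical):

```
# Standalone mode (host and jar are placeholders):
./bin/spark-submit --class com.example.MyApp \
  --master spark://master:7077 myapp.jar

# The same command targets YARN just by changing --master:
./bin/spark-submit --class com.example.MyApp \
  --master yarn-cluster myapp.jar
```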
Finally, the community released two maintenance versions, 1.0.1 in July and 1.0.2 in August.
Spark 1.1.x
Spark 1.1.0 arrived in September. This version added a sort-based shuffle implementation. The existing hash-based shuffle needs to open a file for every reducer, resulting in heavy buffer overhead and inefficient I/O; the new sort-based implementation solves these problems well, and its advantage is especially obvious when the shuffle data volume is very large. Note that, unlike MapReduce, which sorts key-value pairs, the sort-based shuffle sorts records only by partition number and leaves records within a partition unsorted. The default shuffle in 1.1 is still hash-based; sort-based becomes the default in 1.2.
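In 1.1 the new implementation has to be opted into explicitly via configuration, for example:

```scala
import org.apache.spark.SparkConf

// "hash" is still the default in 1.1; from 1.2 onward "sort" is the default.
val conf = new SparkConf().set("spark.shuffle.manager", "sort")
```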
Spark SQL gained many new features in this version. Most noteworthy is the addition of a JDBC server, which means users can enjoy Spark SQL's capabilities by writing nothing more than JDBC code.
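Since the server speaks the HiveServer2 protocol, any HiveServer2 JDBC client can talk to it; a sketch (the host, port, and table name are assumptions, and the Hive JDBC driver must be on the classpath):

```scala
import java.sql.DriverManager

// Load the HiveServer2 JDBC driver and connect to the Thrift JDBC server.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")

val rs = conn.createStatement().executeQuery("SELECT count(*) FROM logs")
while (rs.next()) println(rs.getLong(1))
conn.close()
```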
MLlib introduced a statistics library for tasks such as sampling, correlation, estimation, and hypothesis testing, and the long-requested feature extraction tools Word2Vec and TF-IDF were added as well. Beyond the new algorithms, MLlib's performance also improved substantially in this version. GraphX, by contrast, saw no particular changes in this release.
Spark Streaming added Amazon Kinesis as a data source in this version; domestic users are less interested in it, and it matters more to users abroad. More significant is the change to how data is obtained from Flume: previously Flume pushed data into an executor/worker, but in that mode, if the executor/worker died, Flume could no longer deliver data. Push has now been changed to pull (see the sketch below), meaning that even if one receiver dies, a receiver relaunched on another worker is guaranteed to continue receiving data. Another important improvement is rate limiting: Spark Streaming would often OOM when catching up on a backlogged Kafka topic, for example, which no longer happens once a rate limit is set. The combination of Spark Streaming and MLlib is one more new feature worth mentioning: models can now be trained online from a live stream, though the current implementation is still fairly elementary.
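A sketch of both improvements together; the Flume sink address is hypothetical and the rate cap value is arbitrary:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

// Cap ingestion per receiver so a backlogged source cannot OOM the executors.
val conf = new SparkConf()
  .setAppName("flume-pull")
  .set("spark.streaming.receiver.maxRate", "10000") // records per second

val ssc = new StreamingContext(conf, Seconds(5))

// 1.1's pull model: Spark polls a Flume sink instead of Flume pushing
// into an executor that may no longer be alive.
val events = FlumeUtils.createPollingStream(ssc, "flume-sink-host", 9988)
events.count().print()

ssc.start()
ssc.awaitTermination()
```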
The maintenance release 1.1.1 at the end of November fixed a fairly serious problem: the external data structures (ExternalAppendOnlyMap and ExternalSorter) produced huge numbers of very small intermediate files, which not only caused "too many open files" exceptions but also hurt performance badly. Version 1.1.1 fixed this.
Spark 1.2.0
1.2 was released in mid-December, and it has to be said the Spark community did a remarkable job keeping to its release schedule. The first thing to note in this release is that the sort-based shuffle became the default shuffle policy. In addition, for large data transfers, the connection manager was finally replaced by a Netty-based implementation. The previous implementation was very slow because every transfer had to read from disk into kernel space, then into user space, then back into kernel space and out to the network card; the new implementation uses zero-copy and is much more efficient.
For Spark Streaming, this release is a small milestone: it finally begins to support a fully H/A mode. Previously, a small amount of data could be lost when the driver died. Now a WAL (write ahead log) layer has been added: every time a receiver receives data, it is persisted to HDFS, so that even if the driver dies, processing can continue after it restarts. Note also the difference between unreliable and reliable receivers: only with reliable receivers can zero data loss be guaranteed.
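Enabling the WAL is a matter of configuration plus a checkpoint directory; a sketch with placeholder paths:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Persist every received block to the checkpoint directory before processing.
val conf = new SparkConf()
  .setAppName("wal-demo")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs:///checkpoints/wal-demo") // WAL segments are stored here
```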
MLlib's biggest change is the introduction of the new pipeline API, which makes it much easier to build a complete machine learning pipeline; it includes a dataset API based on Spark SQL's SchemaRDD.
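A sketch of the new spark.ml pipeline style in 1.2, where trainingData stands in for a SchemaRDD of labeled text rows:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Tokenize text, hash the tokens into feature vectors, then fit a classifier;
// the three stages form a single reusable pipeline.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(trainingData) // trainingData: SchemaRDD with "label" and "text"
```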
GraphX graduated from alpha and was officially released with a stable API, meaning users no longer need to worry that existing code will break as the API changes later. Also note that the new core API aggregateMessages replaces mapReduceTriplets, a change to watch when upgrading.
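For example, computing in-degrees with the new API (graph is any existing Graph):

```scala
import org.apache.spark.graphx._

// Map side: send a 1 along every edge to its destination vertex.
// Reduce side: sum the messages arriving at each vertex.
val inDegrees: VertexRDD[Int] =
  graph.aggregateMessages[Int](ctx => ctx.sendToDst(1), _ + _)
```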
The most important Spark SQL feature is undoubtedly the external data sources API, which makes it much easier for developers to build Spark connectors to external data sources and to operate on all of them uniformly with SQL. It can also push predicates down to the data source: for example, to filter some data out of HBase, we previously had to pull the data out of HBase and then filter it inside the Spark engine; now the filtering step can be pushed down to the data source, so the data is filtered as it is fetched. Also worth mentioning: cacheTable and the native cache now have unified semantics, with clear gains in performance and stability. In-memory tables support predicate pushdown and can skip batches of data based on statistics, and in-memory buffers are built incrementally, so caching larger tables no longer OOMs.
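A sketch of the 1.2 data source and caching syntax using the built-in Parquet source (paths and table names are made up, and sqlContext is an existing SQLContext):

```scala
// Register an external source as a table; filters on it can be pushed down.
sqlContext.sql("""
  CREATE TEMPORARY TABLE users
  USING org.apache.spark.sql.parquet
  OPTIONS (path "hdfs:///data/users.parquet")
""")

// CACHE TABLE now has eager, unified semantics, and the in-memory columnar
// format supports predicate pushdown and statistics-based batch skipping.
sqlContext.sql("CACHE TABLE users")
sqlContext.sql("SELECT name FROM users WHERE age > 30").collect()
```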
For reasons of space we have only briefly summarized the more important features of each 2014 release, but one line of enhancement ran through all of them: YARN. Many companies now run several computing frameworks on YARN, so Spark's support for YARN keeps getting better, and indeed Spark invested a great deal of work here.
Conclusion
2014 was a very important year for Spark, not only because the landmark version 1.0 was released, but more importantly because, through the efforts of the whole community, Spark became more and more stable and efficient and was adopted by more and more enterprises. In 2015, with the community's continued efforts, I believe Spark will reach a new height and play a still more important role in more companies.
Thanks to Reynold Xin and Cheng Lian of Databricks for reviewing this article and offering valuable suggestions.
About the author: Chen Chao, technical director at Qiniu. Weibo: @CrazyJvm.
OpenCloud 2015 will be held in Beijing on April 16-18, 2015. The conference comprises three technical summits, the 2015 OpenStack Technical Conference, the 2015 Spark Technology Summit, and the 2015 Container Technology Summit, plus several in-depth industry trainings. The themes focus on technology innovation and application practice, with leading cloud computing speakers from home and abroad delivering practical, front-line content on products, technologies, services, and platforms. OpenCloud 2015: those in the know are here!
For more speakers and schedule information, see the OpenCloud 2015 introduction and official website.