The present situation and future development of Spark

Source: Internet
Author: User

The development of Spark

For a platform with such a high technical threshold and complexity, Spark went from birth to a mature official release in a surprisingly short time. Spark was born in 2009 as a research project at the AMPLab of the University of California, Berkeley. It was open-sourced in 2010, became an Apache Software Foundation project in 2013, and became an Apache top-level project in 2014. The whole process took less than five years.

Because Spark came out of UC Berkeley, its development has carried the mark of academic research from the start. For a platform in the field of data science this is only fitting, and it has even shaped Spark's development momentum. Spark's core abstraction, the RDD (resilient distributed dataset), as well as its streaming, SQL analytics, machine learning and other capabilities, all derive from academic research papers, as follows:

Discretized Streams: Fault-Tolerant Streaming Computation at Scale. Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica. SOSP 2013. November 2013.

Shark: SQL and Rich Analytics at Scale. Reynold Xin, Joshua Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, Ion Stoica. SIGMOD 2013. June 2013.

Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters. Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, Ion Stoica. HotCloud 2012. June 2012.

Shark: Fast Data Analysis Using Coarse-grained Distributed Memory (demo). Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Haoyuan Li, Scott Shenker, Ion Stoica. SIGMOD 2012. May 2012. Best Demo Award.

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI 2012. April 2012. Best Paper Award and Honorable Mention for Community Award.

Spark: Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. HotCloud 2010. June 2010.

In big data, only by digging deep into data science and staying at the academic forefront can a platform lead in underlying algorithms and models, and thus take a leading position. This academic DNA gave Spark an advantage in big data from the outset. Whether in performance or in the unity of its design, its advantages over traditional Hadoop are clear. Spark offers an integrated solution built on RDDs that unifies models such as MapReduce, streaming, SQL, machine learning and graph processing on a single platform, exposes them through a consistent API, and provides the same deployment scheme for all of them, which broadens the range of engineering applications for Spark.

Spark Code activity

The evolution of Spark's releases is enough to show the vitality of the platform and the activity of its community. Since 2013 in particular, Spark has been in a period of rapid development, with code commits and community activity increasing significantly. In terms of activity, Spark ranks among the top three of all Apache Software Foundation open source projects, and its code base is the most active among big data platforms and frameworks, as shown in the following illustration:

  

From June 2013 to June 2014, the number of contributing developers grew from 68 to 255, and the number of participating companies rose from 17 to 50. These 50 companies include Chinese companies such as Alibaba, Baidu, NetEase, Tencent and Sohu. The code base also grew from 63,000 lines to 175,000 lines. The following figure shows the monthly growth in Spark code contributors through 2014:

  

The following illustration shows that, since Spark moved its code to GitHub, there have been 8,471 commits, 11 branches, 25 releases and 326 contributors.

  

The current Spark version is 1.1.0. The contributor list for this release includes dozens of programmers from China. Most of their contributions are bug fixes, some of them even fixes to the example code. Because version 1.1.0 greatly enhances Spark SQL and MLlib, some contributions focus on implementing SQL and MLlib features. The following figure shows the most recent pull requests that are still open against the Spark master branch:

  

As can be seen, because Spark is still relatively young, some small defects may surface when it is applied in production. Keeping the code clean also requires ongoing attention. For example, the Taobao technology department began trying to run Spark on YARN in production environments in 2013. While performing data analysis, they discovered defects such as a DAGScheduler memory leak and mismatched job completion status, and contributed several important pull requests to Spark as a result. For details, see the Taobao technology department blog article "Spark on Yarn: several key pull requests" (http://rdc.taobao.org/?p=525).

Spark Community Activities

Spark attaches great importance to community activities, which are well organized. Spark-related meetings are held both on a regular schedule and ad hoc, and they are of two kinds. The first is Spark Summit, a highly influential gathering of the world's top Spark practitioners. So far, two summits have been held in San Francisco, in 2013 and 2014. In 2015, Spark Summit will be held in both New York and San Francisco. Its official website is http://spark-summit.org/.

At Spark Summit 2014, in addition to Berkeley and Databricks, the speakers came from companies that were among the first to use and experiment with Spark for big data analysis. These included the recently very popular music site Spotify, the native advertising company Sharethrough, the data platform vendors MapR and Cloudera, the cloud leader Amazon, and global giants such as IBM, Intel and SAP.

The second kind is the meetup. Alongside the high-profile Spark Summit, the Spark community regularly holds smaller meetup events around the world. Spark meetup groups have spread across North America, Europe, Asia and Oceania. In China, the Beijing Spark Meetup has been held twice and will hold its third event on October 26, with talks by engineers from Intel China Research Institute, Taobao, TalkingData, Microsoft and Databricks. The following figure is a world map of Spark meetup groups:

  

The present and future of Spark

Spark's distinguishing feature is that it was the first to provide a unified platform for big data applications. From the data-processing point of view, workloads can be divided into batch, interactive and stream processing. From the platform point of view, there are mature options such as Hadoop, Cassandra and Mesos, as well as cloud vendors. Spark integrates the main data-processing models and integrates well with the current mainstream big data platforms. The following illustration shows this feature of Spark:

  

The advantages of such a unified platform are obvious. Developers need to learn only one platform, which flattens the learning curve. For users, Spark applications run easily on platforms such as Hadoop and Mesos, which gives good portability. A unified approach to data processing also simplifies the development model and reduces the difficulty of platform maintenance.

Spark provides standard libraries for common big data workloads, including MapReduce, SQL, streaming, machine learning and graph processing. It also supports the Scala, Python, Java (including Java 8) and R languages:

  

In the latest release, version 1.1.0, Spark SQL and the machine learning library were enhanced. Spark SQL can load and query structured data in Spark more efficiently, supports working with JSON data, and provides a more user-friendly API. On the machine learning side, more than 15 algorithms are included, among them decision trees, SVD, PCA and L-BFGS. The following figure shows the current Spark technology stack:

  

At Spark Summit 2014, Patrick Wendell of Databricks looked ahead to the future of Spark. In his talk he listed Spark's goals, including:

Empower Data scientists and engineers

Expressive, clean APIs

Unified runtime across many environments

Powerful Standard Libraries

In his talk, he said that the most important core component in recent Spark versions is Spark SQL. The next few releases will bring better performance (including code generation and faster join operations) as well as extended and better-integrated SQL support (using SchemaRDD to integrate with Hadoop, NoSQL and RDBMS sources). Future releases will add more algorithms to MLlib, covering not only traditional statistical algorithms but also learning algorithms, and will integrate better with the R language, so that data scientists can choose between Spark and R depending on the scenario.

The development of Spark will follow hardware trends. First, memory will become cheaper and cheaper, machines with 256 GB of memory will become increasingly common, and SSDs will gradually become standard in servers. Because Spark is a memory-based big data processing platform, data that is stored on hard disk causes performance bottlenecks during processing. As machine memory capacity grows, disk-backed distributed file systems such as HDFS will slowly be replaced by distributed storage systems built on shared memory, such as Tachyon from Berkeley's AMPLab, which offers performance far beyond that of HDFS. As a result, future versions of Spark will change their internal storage interfaces significantly to better support SSDs and shared-memory systems such as Tachyon. In fact, recent versions of Spark already support Tachyon.

According to the Spark roadmap, Databricks will release versions 1.2.0 and 1.3.0 within the next three months. Version 1.2.0 will refactor the storage-side API, and version 1.3.0 will introduce SparkR, the combination of Spark and R. In addition to the SQL and MLlib work mentioned earlier, future versions of Spark will bring varying degrees of enhancement to Streaming and GraphX, along with better support for YARN.

The application of Spark

Currently, Spark is officially supported by some of the main Hadoop vendors. The Hadoop distributions released by the following companies and platforms include Spark:

  

This shows that the industry has embraced Spark. Spark has also been widely used in commercial projects by many companies, especially Internet companies. According to Spark's official statistics, about 80 companies (https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark) contribute to Spark and use it in commercial projects. In China, companies that have joined the Spark camp include Alibaba, Baidu, Tencent, NetEase and Sohu. At Spark Summit 2014 in San Francisco, speakers shared application-level topics such as data analysis (Sharethrough), music recommendation (Spotify), real-time auditing (Cassandra), text analysis (IBM) and real-time intelligent customer recommendation (Graphflow), which is enough to show how widely Spark is applied.

Overall, however, the companies currently using Spark are mainly concentrated in the Internet sector. Three main factors hold back the adoption of Spark by traditional enterprises. The first is the maturity of the platform. Traditional enterprises are steady, one might say conservative, in selecting technology. When a technology involves the choice of a major platform, they are especially cautious: without broad validation and proven success stories from the industry, they will not select it lightly. The second is support for SQL. Data processing in traditional enterprises is concentrated in relational databases, and there are many legacy systems in which most data processing is done through SQL or even stored procedures. If a big data platform does not support relational-database SQL well, the cost of migrating the business logic of data analysis becomes too high. The third is the learning curve for the team and the technology. Without team members who are familiar with the platform and its related technologies, the organization will worry about development progress, cost and potential risk.

Spark is trying to solve these three problems. With the release of version 1.0.2, Spark has been validated by more commercial cases. Although Spark retains its youthful vigor, it already has the features of a mature platform. As for SQL support, Spark takes it very seriously. Before the 1.0.2 release, the team recognized the limitations of the Hive-based Shark, decided to drop Shark in the new version, and introduced a new SQL module. Today, in Spark 1.1.0, Spark SQL is mature enough to support the SQL migration needs of enterprise applications. As for the learning curve, the main thing to learn is the RDD. Because Spark provides a unified programming model and deployment mode for a variety of algorithms and forms an integrated big data solution, an enterprise whose big data analysis has to handle a variety of scenarios will find that Spark's architecture lowers the learning curve and reduces deployment costs. Spark integrates well with platforms such as Hadoop and Cassandra and can be deployed on YARN. If the enterprise already has big data analysis capability, its existing experience carries over to Spark. Although Spark is written in Scala and the project recommends the Scala API, it also provides Java and Python interfaces, which is considerate toward Java enterprise users and non-JVM users. If you find Java verbose, Java 8 support in the new version makes the Java API as concise and powerful as the Scala API, as in this Java 8 example, which sums the lengths of all lines in a text file:

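// sc is an existing JavaSparkContext; JavaRDD is in org.apache.spark.api.java
JavaRDD<String> lines = sc.textFile("data.txt");            // one element per line of the file
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());  // length of each line
int totalLength = lineLengths.reduce((a, b) -> a + b);      // sum of all line lengths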
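A word count follows the same lambda-based pattern of transforming elements and then aggregating them. The sketch below is only an illustration of that pattern: it uses the JDK's java.util.stream API on a local list as a stand-in for Spark's distributed collections, so it is not Spark code, and the class name WordCountSketch and the method countWords are hypothetical names chosen for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration class; it depends only on the JDK, not on Spark.
public class WordCountSketch {

    // Splits every line on whitespace and counts how often each word occurs.
    // A TreeMap is used so that the result is ordered by word.
    public static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Assumed sample input, standing in for the lines of a text file.
        List<String> lines = Arrays.asList(
                "spark makes big data simple",
                "big data big ideas");
        System.out.println(countWords(lines));
    }
}
```

The flatMap step plays the role of splitting lines into words, and the grouping collector plays the role of the per-key reduction. In a real Spark job the same two steps would run in parallel across a cluster rather than on a single in-memory list.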

Clearly, as Spark matures and its active community continues to drive it forward, its powerful capabilities will win the favor of more technical teams and enterprises. It is reasonable to expect that more traditional enterprises will begin to try Spark in the near future.
