Preface to Learning Spark: Lightning-Fast Big Data Analytics

A Friendly Preface

I recently heard that a book on Spark is about to be published, and it stirred up a lot of feelings. I thought I should write something, which can be taken either as friendly support or as my own personal summary. The opinions aim to be as neutral as possible, and the content as heartfelt as possible.

 

Following Newton's maxim of "standing on the shoulders of giants", before praising Spark we should first praise its predecessors. The core of a big data system is its distributed processing framework, which is responsible for every aspect of job execution: job decomposition, task scheduling and execution, fault tolerance, and data flow. The earliest and most important contribution is Google's MapReduce framework. It brought functional programming into distributed data processing and, with only two functions (map and reduce), solved a large class of big data batch processing problems, so that users no longer had to worry about the many system-level issues that distributed systems bring. MapReduce also has obvious disadvantages: the processing pipeline is too rigid, iteration is not supported, and job execution times are generally long. Hadoop, the open-source implementation of MapReduce, won the TeraSort championship and began to occupy the market; today its whole ecosystem has become somewhat bloated.

After a few years of silence, Microsoft launched Dryad as a direct competitor to MapReduce, replacing MapReduce's fixed data flow with a DAG-style data flow, which is more complex but more powerful. The following year, DryadLINQ introduced more functional ideas (many of them, in fact, from C#'s LINQ) into the distributed programming model, and it still looks very advanced today.

Then Spark arrived. Spark's biggest contribution lies in its data model, the RDD. Earlier frameworks all treated data as raw key-value (KV) pairs, with no explicit relationship between the data at different stages of processing. An RDD makes the logical relationships between data explicit: it establishes the data dependencies and the data model, and the model is then converted into concrete processing tasks. Spark's programming paradigm actually draws on DryadLINQ's.
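
To make the idea of explicit data dependencies concrete, here is a minimal pure-Python sketch. It is not Spark's API: the class `LazyDataset` and its methods are hypothetical names invented for illustration. Each dataset records its parent and the function that derives it, so nothing runs until `collect()` walks the dependency chain and turns it into actual work.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark's API).

    Each instance remembers its parent and how it is derived from it,
    and computes its records only when collect() is called.
    """

    def __init__(self, compute, parent=None, name="source"):
        self._compute = compute  # function: parent's records -> this dataset's records
        self.parent = parent     # explicit dependency on the previous stage
        self.name = name

    @classmethod
    def from_list(cls, records):
        return cls(lambda _: list(records))

    def map(self, fn):
        return LazyDataset(lambda rows: [fn(r) for r in rows], self, "map")

    def flat_map(self, fn):
        return LazyDataset(
            lambda rows: [x for r in rows for x in fn(r)], self, "flat_map"
        )

    def reduce_by_key(self, fn):
        def run(rows):
            acc = {}
            for key, value in rows:
                acc[key] = fn(acc[key], value) if key in acc else value
            return sorted(acc.items())
        return LazyDataset(run, self, "reduce_by_key")

    def lineage(self):
        """Return the chain of stage names from the source to this dataset."""
        chain = []
        node = self
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return list(reversed(chain))

    def collect(self):
        """Materialize this dataset by first materializing its parent."""
        rows = self.parent.collect() if self.parent is not None else None
        return self._compute(rows)


# Word count expressed as a chain of dependent datasets.
lines = LazyDataset.from_list(["spark hadoop", "spark dryad"])
counts = (
    lines.flat_map(str.split)
    .map(lambda word: (word, 1))
    .reduce_by_key(lambda a, b: a + b)
)

print(counts.lineage())  # the recorded dependencies, source first
print(counts.collect())  # computation happens only here
```

The word count is still written with map-style and reduce-style functions, but the intermediate datasets are now named objects with a known relationship to one another, rather than anonymous key-value streams between fixed stages.
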
As for the in-memory cache and iterative computing features, they could also be obtained by changing the implementation of MapReduce or Dryad. Of course, these features matter a great deal for reducing job execution time, and they are one of Spark's selling points. Incidentally, EPFL's Scala language is also indispensable, and its very name sounds scalable. Its keywords are academic, functional, statically typed, object-oriented, and Java-compatible. Most people think Scala is the successor of Java. If Spark is compared to Sun Wukong, Scala is his golden-hooped staff.
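
The effect of in-memory caching on iterative jobs, mentioned above, can be illustrated with a small pure-Python sketch. The names here (`Dataset`, `cache`, `expensive_load`, `load_calls`) are hypothetical and are not Spark's API. The point is only that, without caching, every iteration re-reads the input, while a cached dataset is loaded once and then reused.

```python
class Dataset:
    """Toy dataset (hypothetical): reloads on every collect() unless cached."""

    def __init__(self, loader):
        self._loader = loader
        self._cached = None
        self._use_cache = False

    def cache(self):
        """Mark the dataset so its records are kept in memory after first use."""
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache:
            if self._cached is None:
                self._cached = self._loader()
            return self._cached
        return self._loader()


def run_iterations(dataset, steps):
    """Simple iterative job: repeatedly move an estimate toward the data mean."""
    estimate = 0.0
    for _ in range(steps):
        points = dataset.collect()  # each iteration needs the full input
        mean = sum(points) / len(points)
        estimate += 0.5 * (mean - estimate)
    return estimate


load_calls = {"n": 0}


def expensive_load():
    load_calls["n"] += 1  # stands in for a slow disk or network read
    return [1.0, 2.0, 3.0, 4.0]


# Without caching: one load per iteration.
run_iterations(Dataset(expensive_load), steps=5)
uncached_reads = load_calls["n"]

# With caching: a single load shared by all iterations.
load_calls["n"] = 0
run_iterations(Dataset(expensive_load).cache(), steps=5)
cached_reads = load_calls["n"]

print(uncached_reads, cached_reads)
```

In a real cluster the saved work is a disk or network read per iteration rather than a counter increment, which is why this matters so much for iterative algorithms.
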

 

Enough about the academic side; now let me rant about book publishing. I have to say that once something has been written up as a book, it is a mature thing, and it has also left the academic world behind. When Hadoop was popular, most people (including me) started from Hadoop: The Definitive Guide. Whether it is Programming Pig from 2011 or Programming Hive from 2012, these books came out before the systems had reached version 1.0; one has to admire O'Reilly's speed. This book, too, is being written while Spark is still at version 0.7 (0.8/0.9 may be out by the time it is published). It will probably lean toward usage instructions, but that should be enough for the majority of users. O'Reilly's books are known for being timely rather than deep. For those who want to understand the internal mechanisms, reading the code is the right way. :)

 

Now let's talk about the geeks at Berkeley. They carry on Bill Joy's legacy and insist on doing work that is cutting-edge, practical, and solid. This is why their paper titles always take the form "SystemName: A System ***". From the underlying Tachyon, to the Spark framework, to Mesos for resource management, to the upper-layer Shark/MLbase/Bagel/Streaming and so on, it is very systematic work. One really loves and hates them. One loves them because their systems are open source and advanced techniques can be learned from them. One hates them because whatever idea you come up with, they have already finished it.

 

Matei started out working on schedulers for Hadoop: LATE, delay scheduling, and the fair scheduler. The LATE paper is highly cited, and I seem to have contributed one citation myself. However, a scheduler's contribution is generally invisible to users, and that was clearly not what Matei went to Berkeley for. Matei then built on the work of his predecessors, combined it with brainstorming and prototype implementation, and in 2010 published a placeholder paper at HotCloud. Over the next two years he continued to design, implement, and optimize the system and promoted its use at Internet companies. Finally, in 2012, the system took shape and won the NSDI best paper award. Shark's story is similar. Evidently, building systems requires not only hard work but also persistence and endurance. AMPLab has all three of what the old saying calls "the right time, the right place, and the right people". But the most important thing is that these geeks have pressed bravely forward on the right road, which deserves our study and emulation. They are also nice people: they answer every question on the mailing list, say all they know, and hold nothing back. Matei is about to graduate this year and become a professor. I really want to know how he will mentor students.

 

I was lucky to have met Shark author Reynold Xin at Tsinghua in May. I asked him a lot of questions and congratulated him on Shark being accepted to SIGMOD 2013. Dr. Xin's promotional activities in China have been fairly successful, and I hope he gives more talks. There are also many teams in China, such as the one at Intel, that have contributed a great deal to Spark. They are a passionate and capable group.

 

Enough rambling. If you have time, think about how to contribute something to this project.

Also, I strongly recommend changing the cover to a starfish; otherwise, how does it evoke Spark? When Shark gets its own book, that is the time for a fish on the cover, namely a shark.

 

@ Jerrylead

Csxulijie@gmail.com

PhD candidate @ ISCAS

Intern @ System Research Group, MSRA

2013-04-27
