Big data why Spark is chosen

Source: Internet
Author: User


Spark is an in-memory, open-source cluster computing system designed for fast data analysis. It was created by a small team led by Matei Zaharia at the University of California, Berkeley's AMPLab; its core is written in Scala (reportedly in only 63 Scala files), making it very lightweight. Spark provides an open-source cluster computing environment similar to Hadoop's, but thanks to its in-memory, iteration-optimized design, it performs better on some workloads.

In the first half of 2014, the Spark open-source ecosystem grew dramatically. It has become one of the most active open-source projects in big data and is backed by many well-known big data companies, such as Hortonworks, IBM, Cloudera, MapR, and Pivotal. So what makes Spark attract so much attention? Here are six reasons summarized on DZone.

1. Lightweight and fast processing. In big data processing, speed often comes first: we look for tools that can process our data as quickly as possible. Spark allows applications in a Hadoop cluster to run up to 100 times faster in memory, and up to 10 times faster even when running on disk. Spark achieves these gains by reducing disk I/O, keeping intermediate processing data in memory. Spark is built around the RDD (Resilient Distributed Dataset), which lets it store data transparently in memory and persist it to disk only when needed. This approach greatly reduces the time spent reading and writing disk during data processing.
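Spark's actual RDD implementation lives in its Scala core; as a rough conceptual sketch of the idea described above, the toy plain-Python class below (a hypothetical `MiniRDD`, not any real Spark API) records transformations lazily as a lineage and only executes them over in-memory data when an action is called:

```python
from functools import reduce

class MiniRDD:
    """Toy stand-in for a Spark RDD: transformations are recorded
    lazily and only executed when an action is called."""

    def __init__(self, data, pipeline=None):
        self._data = data
        self._pipeline = pipeline or []  # recorded transformations (the "lineage")

    def map(self, fn):
        # Lazy: nothing runs yet; we just extend the lineage.
        return MiniRDD(self._data, self._pipeline + [("map", fn)])

    def filter(self, pred):
        return MiniRDD(self._data, self._pipeline + [("filter", pred)])

    def collect(self):
        # Action: replay the recorded lineage over the in-memory data.
        out = list(self._data)
        for kind, fn in self._pipeline:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

rdd = MiniRDD(range(1, 6)).map(lambda x: x * x).filter(lambda x: x % 2 == 1)
print(rdd.collect())  # [1, 9, 25]
```

Because the lineage is kept rather than the intermediate results, a real RDD can also be recomputed from it after a node failure, which is where the "resilient" in the name comes from.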

2. Easy to use: Spark supports multiple languages. Spark offers APIs in Java, Scala, and Python, so developers can work in a familiar language environment. It ships with more than 80 high-level operators and allows interactive queries in its shell.
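The classic illustration of these high-level operators is word count, which in Spark chains flatMap, map, and reduceByKey. The plain-Python sketch below mirrors that operator chain without needing a cluster (the input `lines` list is made up for illustration):

```python
from collections import Counter
from itertools import chain

lines = ["to be or not to be", "to do is to be"]

# flatMap: split each line into words and flatten into one stream.
words = chain.from_iterable(line.split() for line in lines)

# map: pair each word with a count of 1.
pairs = ((w, 1) for w in words)

# reduceByKey(_ + _): sum the counts per word.
counts = Counter()
for w, n in pairs:
    counts[w] += n

print(counts["to"])  # 4
```

In the Spark shell the same pipeline is a single short expression over an RDD, which is what makes interactive exploration practical.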

3. Support for complex queries. Beyond simple "map" and "reduce" operations, Spark supports SQL queries, streaming queries, and complex analytics such as out-of-the-box machine learning and graph algorithms. Users can combine these capabilities seamlessly in a single workflow.

4. Real-time stream processing. Unlike MapReduce, which only handles offline data, Spark supports real-time stream computing through Spark Streaming (Hadoop can also do streaming through other tools running on YARN). Cloudera's assessment of Spark Streaming is:

Simple: built on lightweight yet powerful APIs, Spark Streaming lets you develop streaming applications quickly.
Fault-tolerant: unlike other streaming solutions such as Storm, Spark Streaming handles much of the recovery and redelivery work without extra code or configuration.
Integrated: the same code can be reused for stream processing and batch processing, and streaming data can even be joined with historical data.
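The code reuse described above follows from Spark Streaming's micro-batch model: the stream is sliced into small batches and the same batch logic is applied to each. As a hedged plain-Python sketch of that idea (no Spark APIs; `word_count` is a hypothetical helper):

```python
def word_count(batch):
    """One piece of batch logic, reusable for streaming input too."""
    counts = {}
    for line in batch:
        for w in line.split():
            counts[w] = counts.get(w, 0) + 1
    return counts

# Batch mode: process the whole dataset at once.
total = word_count(["spark streaming", "spark batch"])

# Streaming mode: slice the stream into micro-batches, apply the
# same function to each, and merge results incrementally.
running = {}
for micro_batch in [["spark streaming"], ["spark batch"]]:
    for w, n in word_count(micro_batch).items():
        running[w] = running.get(w, 0) + n

print(total == running)  # True
```

In real Spark the batch version runs over an RDD and the streaming version over a DStream, but the transformation logic is written once.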

5. Integration with Hadoop and existing Hadoop data. Spark can run standalone, and it can also run under an existing YARN cluster manager and read any existing Hadoop data. This is a big advantage: it works with any Hadoop data source, such as HBase and HDFS, so users can easily migrate existing Hadoop applications where appropriate.

6. An active, fast-growing community. Spark originated in 2009; more than 250 engineers from over 50 organizations have now contributed code, and the code base has nearly tripled in size since last June, an enviable rate of growth.

The Management House (formerly the NPC Economic Forum) has launched a full-time CDA big data analyst training class (http://cda.pinggu.org/bigdata-jy.html) aimed at producing big data analysts. Starting from data analysis fundamentals and introductions to the Java language and the Linux operating system, it systematically covers the theory of Hadoop, HDFS, MapReduce, and HBase and the Hadoop ecosystem, details the installation and configuration of Hadoop's three deployment modes, and uses case studies to focus on clustering, classification, and topic recommendation with the Mahout project. The course emphasizes big data analysis and architecture design on the Hadoop stack, demonstrating real big data analysis cases so that students can quickly understand the real value of big data analysis and master the Hadoop-based analysis process. The goal is to turn learners into big data analysts with both theory and hands-on experience, well placed to meet the strong demand for big data analysts in today's Internet economy.

Beijing Live & Remote Live

Schedule (time / courses / synopsis):

First stage: system foundation, 15 days

1. Opening ceremony and big data overview, 1 day

2. Linux operating system, 2 days

3. Ubuntu system introduction, 1 day

5. Python basics, 4 days

6. Hadoop standalone, pseudo-distributed, and cluster setup, 2 days

Second stage: Hadoop ecosystem in practice, 15 days

1. HDFS in depth, 1 day

3. Pig principles, deployment, the Pig Latin language, and an application case, 1 day

4. Hive architecture, installation, HiveQL, and Hive application cases, 3 days

5. ZooKeeper and distributed system development, 1 day

6. HBase architecture, cluster deployment, and management, 2 days

7. HBase data model, with a real-case modeling walkthrough, 3 days

8. Storm introduction and deployment, 1 day

Third stage: data analysis theory, 15 days

1. SPSS software, 1 day

2. Statistical foundations of data analysis (using SPSS), 4 days

3. R software operation, 1 day

4. Clustering in data mining (using R), 3 days

5. Classification in data mining (using R), 4 days

6. Association rules in data mining (using R), 2 days

Fourth stage: big data analysis cases, 15 days

1. Introduction to big data visualization techniques and tools, 1 day

2. Privacy protection and related technologies in the big data context, 1 day

3. Big data analysis methodology: introduction to the SMART model, 1 day

4. Eight big data analysis cases based on Hadoop + Mahout, 2 days

5. Spark fundamentals; installing a cluster and running Spark, 2 days

6. Spark SQL principles and data integration applications, 2 days

7. Spark GraphX graph computation in practice, 1 day

8. Spark recommendation applications (ALS, FP-growth), 2 days

9. Spark data modeling process (logistic regression, decision trees, naive Bayes), 3 days

Fifth stage: graduation, 6 days

1. Graduation design, 5 days

2. Graduation ceremony, 1 day
