Big data why Spark is chosen

Source: Internet
Author: User


Spark is an in-memory, open-source cluster computing system designed for fast data analysis. It was created by a small team led by Matei Zaharia at the University of California, Berkeley's AMPLab; its core is written in Scala (reportedly in only 63 Scala files), making it very lightweight. Spark provides an open-source cluster computing environment similar to Hadoop's, but thanks to its in-memory, iteration-optimized design, it performs better on some workloads.

In the first half of 2014, the Spark open-source ecosystem grew dramatically. It has become one of the most active open-source projects in big data and is backed by many well-known big data companies, such as Hortonworks, IBM, Cloudera, MapR, and Pivotal. So what makes Spark attract so much attention? Here are six reasons summarized on DZone.

1. Lightweight and fast processing. In big data processing, speed often comes first: we look for tools that can process our data as quickly as possible. Spark allows applications in a Hadoop cluster to run up to 100 times faster in memory, and up to 10 times faster even when running on disk. Spark achieves these gains by reducing disk I/O, keeping intermediate processing data in memory. Spark is built around the RDD (Resilient Distributed Dataset), which lets it store data transparently in memory and persist it to disk only when needed. This approach greatly reduces the time spent reading and writing disk during data processing.
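Spark's actual RDD implementation lives in its Scala core; as a rough conceptual sketch of the idea described above, the toy plain-Python class below (a hypothetical `MiniRDD`, not any real Spark API) records transformations lazily as a lineage and only executes them over in-memory data when an action is called:

```python
from functools import reduce

class MiniRDD:
    """Toy stand-in for a Spark RDD: transformations are recorded
    lazily and only executed when an action is called."""

    def __init__(self, data, pipeline=None):
        self._data = data
        self._pipeline = pipeline or []  # recorded transformations (the "lineage")

    def map(self, fn):
        # Lazy: nothing runs yet; we just extend the lineage.
        return MiniRDD(self._data, self._pipeline + [("map", fn)])

    def filter(self, pred):
        return MiniRDD(self._data, self._pipeline + [("filter", pred)])

    def collect(self):
        # Action: replay the recorded lineage over the in-memory data.
        out = list(self._data)
        for kind, fn in self._pipeline:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

rdd = MiniRDD(range(1, 6)).map(lambda x: x * x).filter(lambda x: x % 2 == 1)
print(rdd.collect())  # [1, 9, 25]
```

Because the lineage is kept rather than the intermediate results, a real RDD can also be recomputed from it after a node failure, which is where the "resilient" in the name comes from.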

2. Easy to use: Spark supports multiple languages. Spark offers APIs in Java, Scala, and Python, so developers can work in a familiar language environment. It ships with more than 80 high-level operators and allows interactive queries in its shell.
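The classic illustration of these high-level operators is word count, which in Spark chains flatMap, map, and reduceByKey. The plain-Python sketch below mirrors that operator chain without needing a cluster (the input `lines` list is made up for illustration):

```python
from collections import Counter
from itertools import chain

lines = ["to be or not to be", "to do is to be"]

# flatMap: split each line into words and flatten into one stream.
words = chain.from_iterable(line.split() for line in lines)

# map: pair each word with a count of 1.
pairs = ((w, 1) for w in words)

# reduceByKey(_ + _): sum the counts per word.
counts = Counter()
for w, n in pairs:
    counts[w] += n

print(counts["to"])  # 4
```

In the Spark shell the same pipeline is a single short expression over an RDD, which is what makes interactive exploration practical.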

3. Support for complex queries. Beyond simple "map" and "reduce" operations, Spark supports SQL queries, streaming queries, and complex analytics such as out-of-the-box machine learning and graph algorithms. Users can combine these capabilities seamlessly in a single workflow.

4. Real-time stream processing. Unlike MapReduce, which only handles offline data, Spark supports real-time stream computing through Spark Streaming (Hadoop can also do streaming through other tools running on YARN). Cloudera's assessment of Spark Streaming is:

Simple: built on lightweight yet powerful APIs, Spark Streaming lets you develop streaming applications quickly.
Fault-tolerant: unlike other streaming solutions such as Storm, Spark Streaming handles much of the recovery and redelivery work without extra code or configuration.
Integrated: the same code can be reused for stream processing and batch processing, and streaming data can even be joined with historical data.
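The code reuse described above follows from Spark Streaming's micro-batch model: the stream is sliced into small batches and the same batch logic is applied to each. As a hedged plain-Python sketch of that idea (no Spark APIs; `word_count` is a hypothetical helper):

```python
def word_count(batch):
    """One piece of batch logic, reusable for streaming input too."""
    counts = {}
    for line in batch:
        for w in line.split():
            counts[w] = counts.get(w, 0) + 1
    return counts

# Batch mode: process the whole dataset at once.
total = word_count(["spark streaming", "spark batch"])

# Streaming mode: slice the stream into micro-batches, apply the
# same function to each, and merge results incrementally.
running = {}
for micro_batch in [["spark streaming"], ["spark batch"]]:
    for w, n in word_count(micro_batch).items():
        running[w] = running.get(w, 0) + n

print(total == running)  # True
```

In real Spark the batch version runs over an RDD and the streaming version over a DStream, but the transformation logic is written once.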

5. Integration with Hadoop and existing Hadoop data. Spark can run standalone, and it can also run under an existing YARN cluster manager and read any existing Hadoop data. This is a big advantage: it works with any Hadoop data source, such as HBase and HDFS, so users can easily migrate existing Hadoop applications where appropriate.

6. An active, fast-growing community. Spark originated in 2009; more than 250 engineers from over 50 organizations have now contributed code, and the code base has nearly tripled in size since last June, an enviable rate of growth.

The Management House (formerly the NPC Economic Forum) has launched a full-time CDA big data analyst training class (http://cda.pinggu.org/bigdata-jy.html) aimed at producing big data analysts. Starting from data analysis fundamentals and introductions to the Java language and the Linux operating system, it systematically covers the theory of Hadoop, HDFS, MapReduce, and HBase and the Hadoop ecosystem, details the installation and configuration of Hadoop's three deployment modes, and uses case studies to focus on clustering, classification, and topic recommendation with the Mahout project. The course emphasizes big data analysis and architecture design on the Hadoop stack, demonstrating real big data analysis cases so that students can quickly understand the real value of big data analysis and master the Hadoop-based analysis process. The goal is to turn learners into big data analysts with both theory and hands-on experience, well placed to meet the strong demand for big data analysts in today's Internet economy.

Beijing Live & Remote Live

Schedule (time / courses / synopsis):

First stage: system foundation, 15 days

1. Opening ceremony and big data overview, 1 day

2. Linux operating system, 2 days

3. Ubuntu system introduction, 1 day

5. Python basics, 4 days

6. Hadoop standalone, pseudo-distributed, and cluster setup, 2 days

Second stage: Hadoop ecosystem in practice, 15 days

1. HDFS in depth, 1 day

3. Pig principles, deployment, the Pig Latin language, and an application case, 1 day

4. Hive architecture, installation, HiveQL, and Hive application cases, 3 days

5. ZooKeeper and distributed system development, 1 day

6. HBase architecture, cluster deployment, and management, 2 days

7. HBase data model, with a real-case modeling walkthrough, 3 days

8. Storm introduction and deployment, 1 day

Third stage: data analysis theory, 15 days

1. SPSS software, 1 day

2. Statistical foundations of data analysis (using SPSS), 4 days

3. R software operation, 1 day

4. Clustering in data mining (using R), 3 days

5. Classification in data mining (using R), 4 days

6. Association rules in data mining (using R), 2 days

Fourth stage: big data analysis cases, 15 days

1. Introduction to big data visualization techniques and tools, 1 day

2. Privacy protection and related technologies in the big data context, 1 day

3. Big data analysis methodology: introduction to the SMART model, 1 day

4. Eight big data analysis cases based on Hadoop + Mahout, 2 days

5. Spark fundamentals; installing a cluster and running Spark, 2 days

6. Spark SQL principles and data integration applications, 2 days

7. Spark GraphX graph computation in practice, 1 day

8. Spark recommendation applications (ALS, FP-growth), 2 days

9. Spark data modeling process (logistic regression, decision trees, naive Bayes), 3 days

Fifth stage: graduation, 6 days

1. Graduation design, 5 days

2. Graduation ceremony, 1 day
