A first look at Spark 1.6.0

Source: Internet
Author: User
Tags: join, json

1. Spark development background

Spark was developed in Scala by a small team led by Matei Zaharia at the University of California, Berkeley AMPLab (Algorithms, Machines, and People Lab). The team later founded the commercial company Databricks, with Ali Ghodsi as CEO and Matei Zaharia as CTO; its vision is Databricks Cloud. Spark is a new-generation open-source, distributed, parallel computing framework built on in-memory iterative computation; by avoiding tedious disk I/O for intermediate results, it makes data analysis faster.

2. Differences between Spark and MapReduce

1) MapReduce jobs have their resources managed by YARN. Spark can use YARN for resource management, but it can also run without it by combining other components; using YARN is still recommended.
2) Spark computes in memory and keeps intermediate results there, so they can be reused across iterations. MapReduce writes intermediate results to disk, so a job repeatedly reads and writes the disk; this is the main reason its performance falls short of Spark's (a minimal caching sketch follows this list).
3) Each MapReduce task corresponds to a container, and starting a container takes considerable time, while Spark runs tasks in a thread pool, so resources are allocated much faster.
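As a rough illustration of point 2, here is a minimal sketch of an iterative job that keeps its input cached in memory. The object name, application name, input path, and the computation itself are placeholders of my own, not anything from the original article.

    import org.apache.spark.{SparkConf, SparkContext}

    object IterativeSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("iterative-sketch"))

        // cache() keeps the parsed data in memory, so each iteration below
        // reuses it instead of re-reading and re-parsing the file from disk.
        // Assumes a placeholder file with one number per line.
        val values = sc.textFile("hdfs:///tmp/values.txt").map(_.toDouble).cache()

        var total = 0.0
        for (i <- 1 to 10) {
          // each pass is a separate job over the same in-memory RDD
          total += values.map(v => v * i).sum()
        }
        println(s"total after 10 passes: $total")
        sc.stop()
      }
    }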

3. Spark system architecture diagram

[Figure: Spark system architecture]

Basic components in the Spark architecture:

ClusterManager: in standalone mode this is the Master node, which controls the whole cluster and monitors the Workers; in YARN mode this role is played by the ResourceManager.

Worker: a slave node, responsible for controlling a compute node and starting an Executor or the Driver; in YARN mode this role is played by the NodeManager.

Driver: runs the application's main() function and creates the SparkContext.

Executor: the component that runs tasks on a Worker node, executing them in a thread pool. Each application has its own independent set of Executors.

SparkContext: the context of the entire application, controlling its life cycle.

RDD: Spark's basic unit of computation; a group of RDDs forms a directed acyclic graph (the RDD graph) that gets executed.

DAGScheduler: builds a stage-based DAG for each job and submits the stages to the TaskScheduler.

TaskScheduler: distributes tasks to Executors for execution.

SparkEnv: a thread-level context that stores references to important runtime components.

SparkEnv creates and holds references to the following important components:

MapOutputTracker: stores shuffle metadata.

BroadcastManager: controls broadcast variables and stores their metadata.

BlockManager: manages storage, including creating and looking up blocks.

MetricsSystem: monitors runtime performance metrics.

SparkConf: stores configuration information.

Spark's overall flow is: the client submits an application; the Master finds a Worker to start the Driver; the Driver requests resources from the Master or the ResourceManager, then converts the application into an RDD graph; the DAGScheduler converts the RDD graph into stages and hands them to the TaskScheduler; the TaskScheduler submits tasks to the Executors for execution. While the tasks run, the other components cooperate to ensure the whole application executes smoothly.
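To make the Driver's role concrete, here is a minimal sketch of a driver program; the object name, application name, and input path are placeholders of my own, not taken from the article.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCountDriver {
      def main(args: Array[String]): Unit = {
        // main() runs in the Driver and creates the SparkContext.
        val conf = new SparkConf().setAppName("wordcount-sketch")
        val sc   = new SparkContext(conf)

        val counts = sc.textFile("hdfs:///tmp/input.txt")   // placeholder input path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)                               // shuffle boundary => new stage

        // The action triggers the DAGScheduler to build stages and the
        // TaskScheduler to ship tasks to the Executors.
        counts.take(10).foreach(println)
        sc.stop()
      }
    }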

4. Spark task scheduling process

Job Execution Process Description:


The client submits a job to the Master. The Master asks one Worker to start the Driver, i.e. the SchedulerBackend: the Worker creates a DriverRunner thread, and DriverRunner starts the SchedulerBackend process. The Master also asks the remaining Workers to start Executors, i.e. ExecutorBackends: each of those Workers creates an ExecutorRunner thread, and ExecutorRunner starts the ExecutorBackend process. After starting, each ExecutorBackend registers with the Driver's SchedulerBackend. The SchedulerBackend process contains the DAGScheduler, which generates an execution plan from the user's program and schedules its execution. The tasks of each stage are held in the TaskScheduler; when an ExecutorBackend reports in to the SchedulerBackend, the tasks in the TaskScheduler are dispatched to that ExecutorBackend for execution. The job ends once all stages are complete.

5. What is an RDD?

RDD, short for Resilient Distributed Dataset, is a read-only, fault-tolerant, parallel, distributed collection of data. An RDD can be cached entirely in memory and computed over iteratively; RDDs are the core of Spark. RDDs also provide a rich set of operations for manipulating data. Among them, transformations such as map, flatMap, and filter follow the monad pattern and fit naturally with Scala's collection operations. In addition, RDDs provide more convenient operations such as join, groupBy, and reduceByKey (note that reduceByKey is a transformation, not an action) to support common data processing.
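A small sketch of these operations, distinguishing lazy transformations from the action that actually runs them; the sample data and variable names are mine, and sc is the SparkContext from the driver sketch above.

    // transformations only describe new RDDs; nothing runs yet
    val words = sc.parallelize(Seq("spark rdd", "spark dag"))
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)   // still a transformation

    // the action triggers execution of the whole lineage
    counts.collect().foreach(println)   // e.g. (spark,2), (rdd,1), (dag,1)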

Generally speaking, there are several common models for data processing: iterative algorithms, relational queries, MapReduce, and stream processing. For example, Hadoop MapReduce uses the MapReduce model and Storm uses the stream processing model. RDDs blend these four models, so Spark can be applied to a wide range of big data processing scenarios.

As a data structure, an RDD is essentially a read-only collection of partitioned records. An RDD can contain multiple partitions, each of which is a fragment of the dataset. RDDs can depend on one another. If each partition of the parent RDD is used by at most one partition of the child RDD, the dependency is called a narrow dependency; if a partition can be used by multiple child partitions, it is called a wide dependency. Different operations have different dependency characteristics: for example, a map produces a narrow dependency, while a join produces a wide dependency.
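A sketch of what each kind of dependency looks like in code, again reusing sc; groupByKey stands in here as a typical shuffle-producing operation, and the numbers are made up.

    // narrow dependency: each output partition depends on exactly one input partition
    val nums    = sc.parallelize(1 to 1000, numSlices = 4)
    val doubled = nums.map(_ * 2)                              // no shuffle, pipelined

    // wide dependency: an output partition may need data from many input partitions
    val grouped = doubled.map(n => (n % 10, n)).groupByKey()   // shuffle across the cluster

    println(doubled.toDebugString)   // one stage
    println(grouped.toDebugString)   // shows a ShuffledRDD, i.e. a stage boundary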

Spark distinguishes narrow from wide dependencies for two reasons:

First, narrow dependencies allow several operations to be pipelined on the same cluster node, for example a filter executed immediately after a map. Wide dependencies, by contrast, require all parent partitions to be available and may need a MapReduce-like shuffle to move data across nodes.

Second, consider failure recovery. Recovery under narrow dependencies is more efficient because only the lost parent partitions need to be recomputed, and they can be recomputed in parallel on different nodes. Under wide dependencies, a single lost partition may depend on many parent partitions of the RDD. The following figure illustrates the difference between narrow and wide dependencies:

[Figure: narrow vs. wide dependencies]
In the figure, a box represents an RDD and a shaded rectangle represents a partition.

The RDD is the core of Spark and the architectural foundation of the whole system. Its characteristics can be summed up as follows:
1) it is an immutable data structure;
2) it is a distributed data structure that spans the cluster;
3) it offers coarse-grained operations and can be partitioned by the key of the data records, with those operations respecting the partitioning;
4) it stores data in memory, providing low latency.

6. What is a DAG?

A DAG (directed acyclic graph) gives Spark a simple way to implement fault tolerance. When part of a job fails, it is easy to backtrack through the graph and, even in the middle of a computation, rerun any failed task. The execution order imposed by the graph always lets you move from any node in it to the end node.

7. Why Scala?

1) Spark and Scala are a genuinely good match. Many of the ideas behind RDDs mirror Scala: higher-order operators such as map and filter work just as they do on Scala's List, so very short code achieves what would take far more Java; and functional-programming traits such as immutability and lazy evaluation make it possible to implement the distributed in-memory RDD abstraction and to pipeline its operations (see the short sketch after this list).
2) Scala is good at leveraging existing ecosystems: it was designed to run on the JVM, so it can take full advantage of the Java ecosystem. Spark works the same way: rather than building everything itself, it reuses what exists, for example deploying directly on YARN, Mesos, or EC2, reading from HDFS or S3, and borrowing the SQL parsing layer from Hive.
3) Scala also makes it convenient to develop efficient network communication.
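The sketch below, with made-up sample data, shows the parallel between Scala collection operators and RDD operators; sc is the SparkContext from the earlier driver sketch.

    // the same higher-order operators on a local Scala List...
    val localSquares = List(1, 2, 3, 4).filter(_ % 2 == 0).map(n => n * n)   // List(4, 16)

    // ...and on a distributed RDD
    val rddSquares = sc.parallelize(1 to 4)
      .filter(_ % 2 == 0)
      .map(n => n * n)
      .collect()                                                             // Array(4, 16)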

8. New features in Spark release 1.6.0

This is the seventh release in the Spark 1.x line, and its main new features include the following:

1) Dataset API

Spark currently has two major classes of API: the RDD API (Spark Core) and the DataFrame API (Spark SQL).

The RDD API is very flexible, but in some cases execution plans are difficult to optimize.

The DataFrame API is easier to optimize, but working with UDFs (user-defined functions) is cumbersome.

The Dataset API was created in this context. It combines the advantages of both: users can express UDF-style logic concisely and clearly, while the underlying Catalyst optimizer can still optimize the execution plan.
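A minimal sketch of the 1.6 Dataset API, assuming a SQLContext built from the earlier SparkContext and a JSON file at a placeholder path; the Person case class and field names are mine.

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    case class Person(name: String, age: Long)

    // DataFrame -> typed Dataset: the lambdas below act like lightweight UDFs,
    // yet Catalyst can still optimize the underlying plan.
    val people = sqlContext.read.json("hdfs:///tmp/people.json").as[Person]   // placeholder path
    val adultNames = people.filter(_.age >= 18).map(_.name)
    adultNames.show()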

2) Session management

A cluster can be shared by multiple users, each with their own configuration and temporary tables, and a single SparkContext can now hold multiple Spark SQL sessions. When I used 1.4, I found that much of the SQL ultimately executed sequentially, and scheduling SQL onto the SQLContext meant writing my own thread pool to handle it.

Adding session management makes this far more convenient, especially when handling many concurrent queries.
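A sketch of isolated sessions via SQLContext.newSession(), reusing the sqlContext from the Dataset sketch; the configuration values, table name, and path are placeholders.

    // two isolated Spark SQL sessions sharing one SparkContext
    val sessionA = sqlContext.newSession()
    val sessionB = sqlContext.newSession()

    sessionA.sql("SET spark.sql.shuffle.partitions=10")
    sessionA.read.json("hdfs:///tmp/a.json").registerTempTable("t")   // placeholder path

    // sessionB has its own configuration and does not see sessionA's temp table "t"
    sessionB.sql("SET spark.sql.shuffle.partitions=400")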

3) Per-operator metrics for SQL execution

Spark's metrics were already detailed down to the task and stage level: you could see many indicators of how tasks ran, such as task time and the amount of data processed, and that information helped explain why a task ran long. The new feature adds, for each operator in a Spark SQL plan, statistics on the memory used and the amount spilled (that is, data moved between memory and disk) while the query runs, which helps developers understand resource usage, find problems, and tune accordingly.

4) SQL queries on files

SQL can now be run directly against a file in a supported format; it is no longer necessary to first register the file as a temporary table.
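A sketch of the format.`path` syntax in the FROM clause, here using the json data source; the column names and path are placeholders of my own.

    // query a JSON file directly, without registering a temporary table first
    val df = sqlContext.sql(
      "SELECT name, age FROM json.`hdfs:///tmp/people.json` WHERE age > 21")
    df.show()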

5) Reading non-standard JSON files

Spark uses the Jackson library to process JSON; this release exposes the following Jackson parser options to the user:

ALLOW_COMMENTS

ALLOW_UNQUOTED_FIELD_NAMES

ALLOW_SINGLE_QUOTES

ALLOW_NUMERIC_LEADING_ZEROS

ALLOW_NON_NUMERIC_NUMBERS
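On the DataFrameReader these surface as camelCase options; a sketch with a placeholder path and a subset of the options above.

    // read JSON that contains comments, single quotes, and unquoted field names
    val relaxed = sqlContext.read
      .option("allowComments", "true")
      .option("allowSingleQuotes", "true")
      .option("allowUnquotedFieldNames", "true")
      .json("hdfs:///tmp/relaxed.json")   // placeholder path
    relaxed.printSchema()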

6) Advanced layout of cached data

The partitioning and sort order of cached data are now taken into account when scanning in-memory tables, and DataFrame gains APIs for distributing rows by specified columns and for local (per-partition) ordering.
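A sketch of the two new DataFrame calls, reusing the people Dataset and the implicits import from the Dataset sketch above; the column choices are mine.

    // distribute rows by a column, then sort locally within each partition
    val laidOut = people.toDF()
      .repartition($"age")
      .sortWithinPartitions($"name")
    laidOut.persist()   // the cached layout can then be exploited when scanning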

7) All primitives in JSON can be inferred as strings

By setting primitivesAsString to true, all primitive values in the JSON are inferred as String columns in the DataFrame; that is, the JSON data is treated uniformly as strings.
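A sketch, again with a placeholder path:

    // infer every JSON primitive (numbers, booleans) as a string column
    val asStrings = sqlContext.read
      .option("primitivesAsString", "true")
      .json("hdfs:///tmp/people.json")   // placeholder path
    asStrings.printSchema()              // all leaf fields come back as string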

