A First Look at Spark 1.6.0


1. Spark Development Background

Spark was developed at UC Berkeley's AMPLab (Algorithms, Machines, and People Lab) by a small team led by Matei Zaharia, using the Scala language. The team later founded the commercial company Databricks, with Ali Ghodsi as CEO and Matei Zaharia as CTO; its longer-term vision is Databricks Cloud. Spark is a new-generation, open-source, distributed, parallel computing framework built around memory-based iterative computation. By avoiding cumbersome disk I/O between steps, it aims to make data analysis faster.

2. Differences Between Spark and MapReduce

1) MapReduce's resource management goes through YARN. Spark can manage resources through YARN or run without it, but when several components share a cluster, YARN is recommended.
2) Spark computes in memory: intermediate results stay in memory and can be reused across iterations, while MapReduce writes intermediate results to disk, so a single job involves repeated disk reads and writes. This is the main reason its performance falls behind Spark's (see the sketch after this list).
3) Each MapReduce task has to start its own container, which costs time, while Spark runs tasks in a thread pool inside long-lived executors, so resources are allocated much faster.
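
To make point 2) concrete, below is a minimal sketch of memory-based iteration in Scala. It assumes a local master and a hypothetical input file (data/points.txt); the important part is that cache() keeps the parsed data in memory, so each iteration reuses it instead of re-reading the disk the way a chain of MapReduce jobs would.

import org.apache.spark.{SparkConf, SparkContext}

object IterativeCacheExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("IterativeCacheExample").setMaster("local[2]"))

    // Parse once and keep the result in memory.
    val points = sc.textFile("data/points.txt")            // hypothetical input path
      .map(_.split(",").map(_.toDouble))
      .cache()

    var threshold = 10.0
    for (i <- 1 to 5) {
      // Each iteration reuses the cached partitions instead of re-reading from disk.
      val above = points.filter(p => p.sum > threshold).count()
      println(s"iteration $i: $above points above $threshold")
      threshold += 1.0
    }

    sc.stop()
  }
}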

3. Spark System Architecture Diagram

(Figure: Spark system architecture diagram)

Basic components in the Spark architecture:

ClusterManager: In standalone mode this is the Master node, which controls the whole cluster and monitors the Workers. In YARN mode it is the ResourceManager.

Worker: A slave node that manages a compute node and starts Executors or the Driver. In YARN mode this role is played by the NodeManager.

Driver: Runs the application's main() function and creates the SparkContext.

Executor: The component that runs tasks on a Worker node; it uses a thread pool to execute them. Each application has its own set of Executors.

SparkContext: The context of the entire application; it controls the application's life cycle.

RDD: Spark's basic unit of computation; a group of RDDs forms the RDD graph, a directed acyclic graph describing the execution.

DAGScheduler: Builds a DAG of stages from each job and submits the stages to the TaskScheduler.

TaskScheduler: Dispatches tasks to Executors for execution.

SparkEnv: A thread-level context that stores references to important runtime components.

SparkEnv creates and holds references to the following important components:

MapOutputTracker: Stores shuffle metadata.

BroadcastManager: Controls broadcast variables and stores their metadata.

BlockManager: Responsible for storing, creating, and looking up blocks.

MetricsSystem: Monitors runtime performance metrics.

SparkConf: Stores configuration information.

The overall flow in Spark is as follows: the client submits an application; the Master finds a Worker to start the Driver; the Driver requests resources from the Master or the ResourceManager and converts the application into an RDD graph; the DAGScheduler turns the RDD graph into a directed acyclic graph of stages and submits it to the TaskScheduler; the TaskScheduler submits tasks to the Executors for execution. While tasks run, the other components cooperate to ensure the whole application executes smoothly.
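
To tie the components together, here is a minimal sketch of a driver program, with a hypothetical HDFS input path: main() creates the SparkContext, the transformations only build the RDD graph, and the final action is what triggers the DAGScheduler and TaskScheduler described above.

import org.apache.spark.{SparkConf, SparkContext}

object MinimalDriver {
  def main(args: Array[String]): Unit = {
    // The driver runs main() and creates the SparkContext,
    // which controls the life cycle of the whole application.
    val sc = new SparkContext(new SparkConf().setAppName("MinimalDriver"))

    // Transformations only describe the RDD graph; nothing runs yet.
    val counts = sc.textFile("hdfs:///tmp/input.txt")       // hypothetical path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // The action triggers the DAGScheduler to build stages and the
    // TaskScheduler to dispatch tasks to the executors.
    counts.take(10).foreach(println)

    sc.stop()
  }
}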

4. Spark Task Scheduling Process

Job Execution Process Description:


The client submits a job to the Master. The Master asks one Worker to start the Driver, i.e. the SchedulerBackend: the Worker creates a DriverRunner thread, and DriverRunner starts the SchedulerBackend process. The Master also asks the remaining Workers to start Executors, i.e. ExecutorBackends: each Worker creates an ExecutorRunner thread, and ExecutorRunner starts the ExecutorBackend process. After starting, each ExecutorBackend registers with the Driver's SchedulerBackend. The SchedulerBackend process contains the DAGScheduler, which generates an execution plan from the user program and drives its execution. The tasks of each stage are held in the TaskScheduler; when the ExecutorBackends report to the SchedulerBackend, the tasks in the TaskScheduler are handed to the ExecutorBackends for execution. The job ends when all stages are complete.

5. What is an RDD?

RDD, short for Resilient Distributed Dataset, is a read-only, fault-tolerant, parallel, distributed collection of data. An RDD can be cached in memory and reused across iterations; the RDD is the core of Spark. RDDs also provide a rich set of operations for manipulating data. Among these, transformations such as map, flatMap, and filter implement the Monad pattern and fit nicely with Scala's collection operations. Beyond that, RDDs provide more convenient operations such as join, groupBy, and reduceByKey to support common data processing (note: reduceByKey is a transformation, not an action).
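
As a small illustration of these operations, the sketch below runs them on an in-memory collection so no external data is needed; collect() is the action that actually triggers the job.

import org.apache.spark.{SparkConf, SparkContext}

object RddOpsExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RddOpsExample").setMaster("local[2]"))

    val lines = sc.parallelize(Seq(
      "spark makes data analysis faster",
      "rdds are the core of spark"))

    // Transformations: lazily describe the computation, much like
    // map/filter on an ordinary Scala collection.
    val wordCounts = lines
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)          // a transformation that produces a new RDD

    // collect() is an action: it runs the job and returns the results.
    wordCounts.collect().foreach(println)

    sc.stop()
  }
}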

Generally speaking, there are several common models for data processing: iterative algorithms, relational queries, MapReduce, and stream processing. For example, Hadoop MapReduce uses the MapReduce model, and Storm uses the stream-processing model. The RDD blends these four models, so Spark can be used in a wide variety of big-data processing scenarios.

The RDD, as a data structure, is essentially a read-only collection of partitioned records. An RDD can contain multiple partitions, each of which is a fragment of the dataset. RDDs can depend on one another. If each partition of the parent RDD is used by at most one partition of the child RDD, the dependency is called a narrow dependency; if a parent partition can be relied upon by multiple child partitions, it is called a wide dependency. Different operations produce different dependencies depending on their characteristics: for example, map produces a narrow dependency, while join produces a wide dependency.

Spark distinguishes narrow and wide dependencies for two reasons:

First, narrow dependencies allow multiple operations to be executed as a pipeline on the same cluster node, for example a map followed immediately by a filter. Wide dependencies, by contrast, require all parent partitions to be available and may need a MapReduce-style shuffle to pass data across nodes.

Second, from the perspective of failure recovery: recovering a narrow dependency is more efficient, because only the lost parent partitions need to be recomputed, and they can be recomputed in parallel on different nodes. With a wide dependency, the loss of a single partition may involve recomputing many parent partitions across RDDs. The following figure illustrates the difference between narrow and wide dependencies:


In the diagram, a box represents an RDD and a shaded rectangle represents a partition.
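
The same contrast can be seen without a diagram by printing an RDD's lineage. The sketch below builds a narrow dependency with map and a wide dependency with reduceByKey; toDebugString shows the ShuffledRDD, i.e. the stage boundary introduced by the wide dependency.

import org.apache.spark.{SparkConf, SparkContext}

object DependencyExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DependencyExample").setMaster("local[2]"))

    // map: narrow dependency, each child partition depends on one parent partition.
    val pairs = sc.parallelize(1 to 100).map(i => (i % 10, i))

    // reduceByKey: wide dependency, a child partition may depend on many parent partitions.
    val summed = pairs.reduceByKey(_ + _)

    // The lineage shows a ShuffledRDD, marking the boundary between stages.
    println(summed.toDebugString)

    sc.stop()
  }
}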

The RDD is the heart of Spark and the architectural foundation of the entire framework. Its features can be summed up as follows:

1) it is an immutable data structure;
2) it is a distributed data structure that can span a cluster;
3) the structure can be partitioned by the key of each data record;
4) it provides coarse-grained operations, and these operations all support partitioning;
5) it stores data in memory, providing low latency.

6. What Is a DAG?

A DAG (Directed Acyclic Graph) is a directed graph with no cycles, and it gives Spark a simple way to implement fault tolerance. When a job fails, it is easy to backtrack through the graph, even from the middle of the computation, and re-run only the failed tasks. The graph's execution order guarantees that from any node you can always proceed forward to the final nodes.

7. Why Scala?

1) Spark and Scala are a natural pairing: many of the RDD's ideas mirror Scala's collections, such as map on a list and higher-order operators like filter, and a short piece of Scala can do what takes many lines of Java. Functional-programming traits such as immutability and lazy evaluation are what make a distributed in-memory object like the RDD feasible, and they also enable pipelined execution.
2) Scala is good at borrowing: it was designed from the start to run on the JVM, so it can draw fully on the Java ecosystem. Spark follows the same philosophy and reuses rather than reinvents, for example by deploying on YARN, Mesos, or EC2, using HDFS and S3 for storage, and borrowing Hive's SQL parsing.
3) Scala makes it easy to develop efficient network communication.

8. New Features in Spark Release 1.6.0

This is the seventh release of Spark; its main new features include the following:

1). Dataset API

Spark currently has two major classes of API: the RDD API (Spark Core) and the DataFrame API (Spark SQL).

The RDD API is very flexible, but in some cases its execution plans are difficult to optimize.

The DataFrame API is easy to optimize, but working with UDFs (user-defined functions) is cumbersome.

The Dataset API was born in this context. It aims to combine the advantages of both: users can express UDF-style logic concisely and clearly, while the underlying Catalyst optimizer can still optimize the execution plan.
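
A minimal sketch of the 1.6 Dataset API is shown below, using a hypothetical Person case class: the lambdas are as concise as the RDD API, while the plan still goes through Catalyst like a DataFrame.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)   // hypothetical example class

object DatasetExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DatasetExample").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Build a typed Dataset from a local collection.
    val people = Seq(Person("Ann", 32), Person("Bob", 17)).toDS()

    // Typed, UDF-like lambdas that Catalyst can still optimize.
    val adults = people.filter(_.age >= 18).map(_.name)
    adults.collect().foreach(println)

    sc.stop()
  }
}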

2). Session Management

A cluster can now be shared by multiple users, each with their own configuration and temporary tables, and a single SparkContext can host more than one Spark SQL session. In 1.4 and earlier, many SQL statements submitted to one SQLContext ultimately executed sequentially, and scheduling SQL onto the SQLContext had to be handled with a hand-written thread pool.

Adding session management makes this much more convenient, especially when dealing with many concurrent users.
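
A small sketch of the new session API, assuming Spark 1.6's SQLContext.newSession(): each session gets its own configuration and temporary tables while sharing one SparkContext.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SessionExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SessionExample").setMaster("local[2]"))
    val rootContext = new SQLContext(sc)

    // Two isolated sessions sharing the same SparkContext and cached data.
    val sessionA = rootContext.newSession()
    val sessionB = rootContext.newSession()

    sessionA.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "tag")
            .registerTempTable("t")                      // visible only in sessionA

    println(sessionA.sql("SELECT COUNT(*) FROM t").collect().mkString)
    println(sessionB.tableNames().contains("t"))         // false: temp tables are per session

    sc.stop()
  }
}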

3). Per-operator Metrics for SQL Execution

Spark's metrics were already detailed at the task and stage level: you could see many runtime metrics such as task duration and the amount of data processed, and use that information to work out why a task ran slowly. The new feature additionally shows, for each physical operator, the size of the data it processed when it ran, along with memory usage and spilling (which should refer to data exchanged between memory and disk). This helps developers understand resource usage, find problems, and tune Spark SQL.

4). SQL Queries on Files

SQL can now query a file directly, as long as the file conforms to the data source's rules; there is no longer any need to register the file as a temporary table first.
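
For example, in the spark-shell (where sc is predefined), a JSON file can be queried in place; the path and column names below are hypothetical.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Query the file directly; no registerTempTable needed.
val df = sqlContext.sql("SELECT name, age FROM json.`/data/people.json`")
df.show()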

5). Reading Non-standard JSON Files

Spark uses the Jackson package when working with JSON. This release exposes the following Jackson parser options to the user (see the sketch after the list):

ALLOW_COMMENTS

ALLOW_UNQUOTED_FIELD_NAMES

ALLOW_SINGLE_QUOTES

ALLOW_NUMERIC_LEADING_ZEROS

ALLOW_NON_NUMERIC_NUMBERS
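
A sketch of how these options are passed to the JSON reader in 1.6 (spark-shell style, sc predefined; the input path is hypothetical):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read
  .option("allowComments", "true")               // ALLOW_COMMENTS
  .option("allowUnquotedFieldNames", "true")     // ALLOW_UNQUOTED_FIELD_NAMES
  .option("allowSingleQuotes", "true")           // ALLOW_SINGLE_QUOTES
  .option("allowNumericLeadingZeros", "true")    // ALLOW_NUMERIC_LEADING_ZEROS
  .option("allowNonNumericNumbers", "true")      // ALLOW_NON_NUMERIC_NUMBERS
  .json("/data/messy.json")
df.printSchema()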

6). Advanced Layout of Cached Data

The partitioning and sort order of cached data are now stored and used when scanning in-memory tables, and the DataFrame API gained operations for distributing data by specified columns and sorting locally within partitions.
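
A sketch using the repartition and sortWithinPartitions calls added to DataFrame in 1.6 (spark-shell style, sc predefined; the input path and column names are hypothetical):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val events = sqlContext.read.json("/data/events.json")     // hypothetical input
val laidOut = events.repartition($"userId")                // distribute by column
                    .sortWithinPartitions($"timestamp")    // sort locally within each partition
laidOut.cache()                                            // cached with this layout
laidOut.count()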

7). All Primitives in JSON Can Be Inferred as Strings

By setting primitivesAsString to true, all primitive values in the JSON can be read into the DataFrame as string types; that is, JSON data becomes uniformly string-typed in the DataFrame.
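
A sketch of the option (spark-shell style, sc predefined; the path is hypothetical):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read
  .option("primitivesAsString", "true")
  .json("/data/people.json")
df.printSchema()   // every numeric and boolean field is now typed as string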

