Spark Knowledge Points
The Spark knowledge points from the IT 18 Palm course system are as follows:
1. Definition
Spark is a MapReduce-like cluster computing framework designed for low-latency iterative computation and interactive use.
2. Architecture
[Figure: Spark architecture diagram. http://s3.51cto.com/wyfs02/M01/7F/B1/wKiom1cpYurgKynvAABoDAybMaY766.png]
3. Analysis of some important concepts
(1) RDD (Resilient Distributed Dataset)
A Resilient Distributed Dataset (RDD) is a read-only, partitioned, distributed collection of data that can be partially or fully cached in memory (when data overflows, an LRU policy decides which data stays in memory and which is stored on disk). Caching reduces disk-I/O and network-I/O read/write overhead, which lowers the cost of the whole computational framework. RDDs support two kinds of operations: transformations, such as filter, map, join, and union, and actions, such as reduce, count, save, and collect. A transformation creates a new dataset from an existing one, while an action computes over the transformed dataset and passes the result to the driver. To improve efficiency, Spark evaluates transformations lazily: it only records the conversion steps for the moment, and executes them only when a result actually needs to be returned to the driver.
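A minimal sketch of this transformation/action split, assuming a local Spark setup (the dataset and variable names are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-basics").setMaster("local[2]"))

    val nums = sc.parallelize(1 to 10)

    // Transformations: nothing is computed yet, Spark only records them.
    val evens   = nums.filter(_ % 2 == 0)
    val doubled = evens.map(_ * 2)

    // Actions: trigger execution of the recorded transformations
    // and return results to the driver.
    println(doubled.count())                 // 5
    println(doubled.collect().mkString(",")) // 4,8,12,16,20
    println(doubled.reduce(_ + _))           // 60

    sc.stop()
  }
}
```

Note that declaring filter and map causes no work; each of the three actions at the end launches a job over the recorded steps.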
(2) Lineage
Lineage, also called descent, records how an RDD dataset is derived from other RDD datasets. When some partitions of an RDD are lost, the system has enough information to recompute and recover just the lost partitions through the lineage. This is the coarse-grained fault-tolerance mechanism that Spark designed to improve system performance. Compared with backup-based mechanisms or fine-grained log-based fault tolerance, this coarse-grained mechanism reduces data redundancy and the overhead of reading and writing disks.
RDD lineage dependencies are divided into two kinds: narrow dependencies and wide dependencies.
[Figure: narrow vs. wide dependencies. http://s1.51cto.com/wyfs02/M02/7F/B1/wKiom1cpYvrwwdHUAACfOgA7DrI100.png]
A narrow dependency means that each partition of the parent RDD is used by at most one partition of the child RDD, while a single child partition may use one or more parent partitions. A wide dependency means that each partition of the parent RDD may be used by multiple child partitions, and each child partition may read from multiple parent partitions. When a node fails, recomputing lost data under a wide dependency is clearly more expensive than under a narrow one.
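A short sketch contrasting the two dependency types under the same hypothetical local setup; toDebugString prints the lineage with shuffle boundaries marked:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object Dependencies {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("deps").setMaster("local[2]"))

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 2)

    // map preserves partitioning: each child partition depends on
    // exactly one parent partition -> narrow dependency.
    val mapped = pairs.map { case (k, v) => (k, v * 10) }

    // reduceByKey must bring equal keys together across partitions:
    // a child partition may read many parent partitions -> wide
    // dependency, implemented as a shuffle.
    val reduced = mapped.reduceByKey(_ + _)

    println(reduced.toDebugString)   // lineage, with the shuffle visible
    println(reduced.collect().toMap) // Map(a -> 40, b -> 20)

    sc.stop()
  }
}
```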
(3) DAG (Directed Acyclic Graph): a directed acyclic graph that captures the dependencies between RDDs.
4. Ecosystem
[Figure: Spark ecosystem. http://s4.51cto.com/wyfs02/M00/7F/B1/wKiom1cpYwuxtdw-AADvVY4ksTg899.png]
The main components that Spark supports are:
(1) Shark, a component for large-scale data query and analysis. Shark's role for Spark is similar to Hive's role in the Hadoop ecosystem: Shark provides a set of command interfaces and configuration parameters that let you cache a specific RDD in Spark and retrieve its data. In addition, Shark can invoke user-defined functions and combine data analysis with SQL queries, reusing data to increase computational speed.
(2) Spark Streaming, a component for stream computation. The basic idea is to split the incoming data stream into very small batches, encapsulate each batch as an RDD, and then process these small datasets in a batch-like manner. Spark's in-memory execution keeps computation latency low, the same algorithms can serve both batch and real-time processing, and fault tolerance is again provided through lineage.
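A minimal sketch of this micro-batch idea, the classic Spark Streaming socket word count (assumes text arriving on a local socket, e.g. from `nc -lk 9999`):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
    // Cut the input stream into 1-second micro-batches; each batch
    // becomes an RDD processed with ordinary batch operators.
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```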
(3) GraphX, a component for graph computation. Spark's GraphX provides APIs for graph operations; representing a graph as RDDs keeps communication requirements low for operations such as edge reversal and adjacency computation, and makes the resulting RDD-based graph simpler. Many graph algorithms can be implemented conveniently on top of the GraphX framework.
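A small GraphX sketch along these lines; the vertices and edges are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("graph-sketch").setMaster("local[2]"))

    // A GraphX graph is just two RDDs: one of vertices, one of edges.
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
    val graph    = Graph(vertices, edges)

    // Edge reversal and adjacency-style computations are built in:
    // the reversed graph's out-degrees equal the original in-degrees.
    val reversed = graph.reverse
    println(graph.inDegrees.collect().toMap)     // Map(2 -> 1, 3 -> 1)
    println(reversed.outDegrees.collect().toMap) // Map(2 -> 1, 3 -> 1)

    // Standard graph algorithms, e.g. PageRank, run on the same RDDs.
    graph.pageRank(tol = 0.001).vertices.collect().foreach(println)

    sc.stop()
  }
}
```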
(4) MLlib, a component for machine learning. MLlib provides a library of machine-learning algorithms that currently supports clustering, binary classification, regression, and collaborative filtering; related tests and data generators are also available.
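A minimal MLlib sketch, running k-means clustering on toy data (the points are invented, forming two obvious clusters):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object ClusteringSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kmeans-sketch").setMaster("local[2]"))

    // Toy data: two clusters around (0, 0) and (10, 10).
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.1), Vectors.dense(0.2, 0.0),
      Vectors.dense(9.8, 10.0), Vectors.dense(10.1, 9.9)
    )).cache()

    // Train k-means with k = 2 clusters and up to 20 iterations.
    val model = KMeans.train(points, k = 2, maxIterations = 20)

    model.clusterCenters.foreach(println)
    println(model.predict(Vectors.dense(9.9, 10.2))) // cluster of a new point

    sc.stop()
  }
}
```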
Spark can run on a single local node (useful for debugging) or on a cluster, where a cluster manager such as Mesos or YARN distributes computing tasks to the worker nodes of the distributed system. Spark can read its input data from HDFS (or other similar file systems).
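A sketch of the deployment choice described above; the master URL, host names, and HDFS path are placeholders, not real addresses:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DeploySketch {
  def main(args: Array[String]): Unit = {
    // Local debugging: run on a single node with two worker threads.
    val conf = new SparkConf().setAppName("deploy-sketch").setMaster("local[2]")
    // On a cluster, the master URL would instead point at the cluster
    // manager (e.g. "mesos://host:5050"), or be supplied externally
    // via spark-submit --master yarn.
    val sc = new SparkContext(conf)

    // Input can come from HDFS or another Hadoop-compatible file
    // system (hypothetical path).
    val lines = sc.textFile("hdfs://namenode:9000/data/input.txt")
    println(lines.count())

    sc.stop()
  }
}
```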
5. Programming model
All Spark operations are based on RDDs, and the set of RDD operators is much richer than Hadoop's. Some transformation operators treat the elements of an RDD as simple elements and fall into the following categories:
Input/output one-to-one (element-wise) operators that leave the partition structure of the resulting RDD unchanged, mainly map and flatMap (map, then flatten to a one-dimensional RDD);
Input/output one-to-one operators whose resulting RDD has a different partition structure, such as union (combines two RDDs into one) and coalesce (reduces the number of partitions);
Operators that select some of the input elements, such as filter, distinct (removes duplicate elements), subtract (keeps the elements present in this RDD but not in the other), and sample (takes a sample).
The remaining transformation operators work on key-value collections and are likewise divided into:
Element-wise operations on a single RDD, such as mapValues (preserves the source RDD's partitioning, unlike map);
Reordering of a single RDD, such as sort and partitionBy (repartitions for consistent partitioning, which is important for data-locality optimization, discussed later);
Reorganizing and reducing a single RDD by key, such as groupByKey and reduceByKey;
Joining and reorganizing two RDDs by key, such as join and cogroup.
The latter three types of operations all involve repartitioning and are called shuffle operations; the sketch below exercises several of them.
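The sketch below runs several of the key-value operators from the list above on toy data (the sales/prices datasets are invented):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PairOps {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pair-ops").setMaster("local[2]"))

    val sales  = sc.parallelize(Seq(("apple", 3), ("pear", 2), ("apple", 5)))
    val prices = sc.parallelize(Seq(("apple", 1.5), ("pear", 2.0)))

    // Element-wise on a single RDD, partitioning preserved:
    val doubled = sales.mapValues(_ * 2)

    // Reorganize and reduce a single RDD by key (shuffle):
    val totals = sales.reduceByKey(_ + _) // (apple,8), (pear,2)
    val groups = sales.groupByKey()       // (apple,[3,5]), (pear,[2])

    // Join two RDDs by key (shuffle):
    val joined = totals.join(prices)      // (apple,(8,1.5)), (pear,(2,2.0))

    println(joined.collect().toSeq)
    sc.stop()
  }
}
```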
A sequence of transformation operators keeps producing new RDDs from existing ones, all within RDD space. The important design point here is lazy evaluation: no computation actually takes place; each step is simply recorded in the metadata. The structure of this metadata is a DAG (directed acyclic graph), in which each "vertex" is an RDD (including the operator that produced it) and each "edge" from a parent RDD to a child RDD represents the dependency between them.
Spark gives this metadata DAG a cool name: lineage. This lineage is also the log-style update described in the fault-tolerance design above.
The lineage keeps growing until an action operator triggers evaluation, at which point all of the accumulated operators are executed at once. The input to an action is an RDD (together with all the RDDs it depends on along the lineage); the output is native data produced by the execution, possibly a Scala scalar, a collection-type value, or data written to storage. Whenever an operator's output is of one of these types, that operator must be an action, and its effect is to return from RDD space to the space of native data.
There are several kinds of action operators: those that produce a scalar, such as count (returns the number of elements in the RDD), reduce, and fold/aggregate (see the Scala operators of the same name); those that return a few values, such as take (returns the first few elements); those that produce a Scala collection type, such as collect (pours all the elements of the RDD into a Scala collection) and lookup (finds all values for a given key); and those that write to storage, such as saveAsTextFile, which corresponds to the textFile mentioned earlier. There is also a checkpoint operator, checkpoint. When the lineage grows very long (a common situation in graph computation), re-executing the whole sequence after an error takes a long time, so checkpoint can be invoked proactively to write the current data to stable storage as a checkpoint.
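A short sketch of proactive checkpointing to cut a long lineage; the checkpoint directory is a placeholder (on a cluster it would normally live on HDFS):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("checkpoint-sketch").setMaster("local[2]"))
    sc.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical path

    // A lineage that keeps growing, as in an iterative computation.
    var rdd = sc.parallelize(1 to 1000)
    for (_ <- 1 to 50) rdd = rdd.map(_ + 1)

    // Cut the lineage: persist this RDD to stable storage so a failure
    // does not force re-execution of the whole operator sequence.
    rdd.checkpoint()
    println(rdd.count()) // action: triggers both the job and the checkpoint

    sc.stop()
  }
}
```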
Two design points stand out here. The first is lazy evaluation. Anyone familiar with compilers knows that the larger the scope a compiler can see, the more opportunities it has to optimize. Spark does not compile code in that sense, but its scheduler does optimize the DAG, and with linear complexity. This matters especially when multiple computational paradigms are mixed on Spark: the scheduler can break the boundaries between code written in different paradigms and schedule and optimize globally. For example, Shark's SQL code can be mixed with Spark's machine-learning code; once each part has been translated into underlying RDD operations, everything fuses into one large DAG, allowing more global optimization opportunities.
This concludes the summary of the Spark knowledge points from the IT 18 Palm course system.