The data flow of an iterative machine learning algorithm in Spark is shown in Figure 2.3. Compare it with the iterative machine learning data flow of Hadoop MR in Figure 2.1: in Hadoop MR, each iteration involves reading from and writing to HDFS, whereas Spark is much simpler. Spark requires only a single read of the distributed shared object space from HDFS, creating an RDD from the HDFS file. The RDD can be reused and resides in memory across every iteration of the machine learning algorithm, which can significantly improve performance. When the check for the end condition determines that iteration is complete, the RDD is persisted and the data is written back to HDFS. The following sections describe the internals of Spark in detail, including its design, RDDs, and lineage.
Figure 2.3 Data sharing for iterative computations in Spark
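To make the contrast concrete, here is a minimal sketch of the pattern Figure 2.3 describes, assuming hypothetical HDFS paths and a placeholder update rule (the "gradient" computation stands in for a real machine learning update): the input is read from HDFS once, cached, and then reused by every iteration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iterative-sketch"))

    // One read from HDFS; the resulting RDD is cached and reused every iteration.
    val points = sc.textFile("hdfs://namenode/data/points.txt")
      .map(_.split(",").map(_.toDouble))
      .cache()

    var weights = Array.fill(3)(0.0) // hypothetical model parameters
    var converged = false
    var iter = 0
    while (!converged && iter < 100) {
      // Each iteration computes over the in-memory RDD; no HDFS round trip.
      val gradient = points
        .map(p => p.zip(weights).map { case (x, w) => x * w }) // placeholder update rule
        .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
      weights = weights.zip(gradient).map { case (w, g) => w - 0.01 * g }
      iter += 1
      converged = gradient.map(math.abs).sum < 1e-6 // check end condition
    }

    // Iteration done: persist the result back to HDFS.
    sc.parallelize(weights.toSeq).saveAsTextFile("hdfs://namenode/out/weights")
    sc.stop()
  }
}
```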
Spark's resilient distributed datasets
The concept of the RDD is closely tied to Spark's motivation: letting users manipulate Scala-style collections on a distributed system. This central collection abstraction in Spark is the RDD. An RDD can be created by performing deterministic operations on data in other RDDs or in stable storage (for example, files in HDFS). Another way to create an RDD is to parallelize a Scala collection. Creating an RDD from another RDD is a transformation in Spark; besides transformations, there are also actions on RDDs. Common transformations include map, filter, and join. The interesting property of an RDD is that it stores its lineage: the sequence of transformations needed to create it, together with the actions applied to it. This means a Spark program can hold only a reference to an RDD and still know its lineage, including how it was created and which operations were performed on it. Lineage provides fault tolerance for the RDD: even if it is lost, the entire RDD can be rebuilt, as long as the lineage itself is persisted or replicated. The persistence and partitioning of an RDD can be specified by the programmer; for example, records can be partitioned by their primary key.
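A short sketch of the two creation paths described above, with illustrative paths and values, showing how transformations build up a lineage:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-creation-sketch"))

// 1. Parallelize an existing Scala collection into an RDD.
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. Create an RDD from stable storage (a file in HDFS).
val lines = sc.textFile("hdfs://namenode/data/input.txt")

// Transformations are deterministic operations producing new RDDs;
// each step is recorded in the lineage rather than computed eagerly.
val doubled = numbers.map(_ * 2)         // lineage: parallelize -> map
val evens   = doubled.filter(_ % 4 == 0) // lineage: parallelize -> map -> filter

// An action forces evaluation of the lineage.
println(evens.count())
```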
Many operations can be performed on an RDD. These include count, collect, and save, which can be used to count the number of elements, return the records, and save the data to disk or HDFS, respectively. The lineage graph records the transformations and actions applied to an RDD. A list of common transformations and actions appears in Table 2.1.
Table 2.1 Transformations and actions on RDDs

| Transformation | Description |
| --- | --- |
| map(f) | Passes each element of the RDD through the function f in parallel and returns the resulting RDD |
| filter(f) | Selects the elements of the RDD for which the function f returns true |
| flatMap(f) | Similar to map, but f returns a sequence, so a single input can be mapped to multiple outputs |
| union(otherRdd) | Returns the union of the RDD and otherRdd |
| sample(withReplacement, p, seed) | Returns a random sample of a fraction p of the RDD, using the random seed seed |
| groupByKey(numTasks) | Can only be called on key-value data; returns the data grouped by key. The number of parallel tasks is given by numTasks (default is 8) |
| reduceByKey(f, numTasks) | Aggregates the results of applying the function f to elements with the same key; numTasks sets the number of parallel tasks |
| join(otherRdd, numTasks) | Joins otherRdd with the RDD itself, computing all pairs of elements with matching keys |
| groupWith(otherRdd, numTasks) | Joins otherRdd with the RDD itself and groups the result by key |
| sortByKey(flag) | Sorts the RDD by key, in ascending or descending order depending on the flag |

| Action | Description |
| --- | --- |
| reduce(f) | Aggregates all elements of the RDD using the function f |
| collect() | Returns all elements of the RDD as an array |
| count() | Returns the total number of elements in the RDD |
| take(n) | Returns the first n elements of the RDD |
| first() | Equivalent to take(1) |
| saveAsTextFile(path) | Persists the RDD as a text file at the given path in HDFS or another Hadoop-supported file system |
| saveAsSequenceFile(path) | Persists the RDD as a Hadoop sequence file; can only be called on RDDs of key-value pairs whose types implement Hadoop's Writable interface or are convertible to it |
| foreach(f) | Runs the function f on each element of the RDD in parallel |
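As an illustration, the following fragment (with made-up data, assuming a SparkContext named sc as in the earlier examples) exercises a few of the operations from Table 2.1:

```scala
// Assumes an existing SparkContext sc.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val sums    = pairs.reduceByKey(_ + _) // transformation: ("a", 4), ("b", 2)
val sorted  = sums.sortByKey(true)     // transformation: ascending by key
val asArray = sorted.collect()         // action: materializes the result locally
println(asArray.mkString(", "))
println(sorted.count())                // action: number of distinct keys
```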
Below is an example of how to program with RDDs in a Spark environment. The application is an influence analysis based on call data records (CDRs): it builds a user graph from the CDRs and identifies the K most influential users. The CDR structure includes the ID, caller, receiver, plan type, call type, duration, time, and date. The program reads the CDR file from HDFS, creates an RDD, filters the records, and then performs operations on them, such as extracting specific fields in queries or running aggregations such as count. The Spark code eventually written is as follows:

```scala
val spark = new SparkContext()
val call_record_lines = spark.textFile("hdfs://...")
// The filter operation on the RDD
val plan_a_users = call_record_lines.filter(_.contains("plana"))
// Tell Spark to cache the RDD in memory if there is enough space
plan_a_users.cache()
plan_a_users.count()
// Further processing on the data set follows here
```
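The final step, identifying the K most influential users, is not shown in the original listing. One simple, hedged realization would rank callers by the number of calls they place; the field position and the value of K below are assumptions:

```scala
val k = 10 // hypothetical K
// Assume the caller is the second comma-separated field of each CDR line.
val topCallers = plan_a_users
  .map(line => (line.split(",")(1), 1))
  .reduceByKey(_ + _)               // total calls placed per user
  .map { case (user, n) => (n, user) }
  .sortByKey(false)                 // descending by call count
  .take(k)                          // the K most active (influential) users
topCallers.foreach(println)
```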
An RDD can be represented as a graph, which makes it easy to track how its lineage changes across the various transformations and actions. The RDD interface consists of five pieces of information, detailed in Table 2.2.
Table 2.2 The RDD interface

| Information | HadoopRDD | FilteredRDD | JoinedRDD |
| --- | --- | --- | --- |
| Partition type | One partition per HDFS block | Same as the parent RDD | One partition per reduce task |
| Dependency type | None | One-to-one with the parent RDD | Shuffle dependency on each parent RDD |
| Function for computing the dataset from the parent RDDs | Reads data from the corresponding block | Computes the parent RDD and filters it | Reads the shuffled data and joins it |
| Location metadata (preferredLocations) | The HDFS block locations, read from the name node | None (obtained from the parent RDD) | None |
| Partitioning metadata (partitioningScheme) | None | None | HashPartitioner |
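The five pieces of information in Table 2.2 can be summarized as a simplified Scala interface. This is an illustrative sketch following the table's vocabulary, not Spark's actual source code:

```scala
// Illustrative only: a simplified view of the RDD interface from Table 2.2.
trait Partition { def index: Int }
trait Dependency[T]

abstract class SimpleRDD[T] {
  // 1. The set of partitions the dataset is split into.
  def partitions: Array[Partition]
  // 2. The dependencies on parent RDDs (none, one-to-one, or shuffle).
  def dependencies: Seq[Dependency[_]]
  // 3. A function for computing a partition from the parent RDDs.
  def compute(split: Partition): Iterator[T]
  // 4. Optional location metadata: preferred nodes for each partition.
  def preferredLocations(split: Partition): Seq[String] = Nil
  // 5. Optional partitioning metadata (e.g., a hash partitioner).
  def partitioningScheme: Option[String] = None
}
```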
The implementation of Spark
Spark is written in roughly 20,000 lines of Scala code, of which the core is about 14,000 lines. Spark can run on cluster managers such as Mesos, Nimbus, or YARN, and it uses an unmodified Scala interpreter. When an action on an RDD is triggered, a Spark component called the directed acyclic graph (DAG) scheduler examines the RDD's lineage graph and builds a DAG of stages. Each stage contains only narrow dependencies; the shuffle operations required for wide dependencies form the stage boundaries. The scheduler launches tasks for the different stages of the DAG to compute missing partitions and so reconstruct the full RDD. It submits the task objects of each stage to the task scheduler (TS). A task object is a self-contained entity consisting of code and transformations plus the required metadata. The scheduler is also responsible for resubmitting stages whose outputs have been lost. The task scheduler assigns tasks to nodes using an algorithm called delay scheduling (Zaharia et al. 2010). If preferred locations are specified for the RDD, a task is routed to those nodes; otherwise it is assigned to a node that holds the partitions the task needs in memory. For wide dependencies, intermediate records are materialized on the nodes holding the parent partitions. This simplifies fault recovery, much as materializing map outputs does in Hadoop MR.
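The stage-boundary rule can be seen in a small example (hypothetical data, assuming a SparkContext named sc): the flatMap and map steps involve only narrow dependencies and are pipelined into one stage, while reduceByKey requires a shuffle and therefore starts a new stage. The toDebugString method prints the lineage so the boundary can be inspected:

```scala
// Assumes an existing SparkContext sc.
val words = sc.parallelize(Seq("a b a", "b c"))
  .flatMap(_.split(" "))   // narrow dependency: same stage
  .map(word => (word, 1))  // narrow dependency: same stage
val counts = words.reduceByKey(_ + _) // wide dependency: shuffle, new stage

// Print the lineage graph; the shuffle marks the stage boundary.
println(counts.toDebugString)
counts.collect()           // the action that triggers the DAG scheduler
```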
The worker component in Spark receives task objects and invokes their run methods in a thread pool. It reports exceptions and errors to the TaskSetManager (TSM). The TSM is an entity managed by the task scheduler, one per task set, which tracks the execution of its tasks. The TS polls the set of TSMs in FIFO order; there is room for optimization here by plugging in different policies or algorithms. The executor interacts with other components such as the block manager (BM), the communication manager (CM), and the map output tracker (MOT). The block manager is the component a node uses to cache RDDs and to receive shuffle data; it can also be thought of as a write-once key-value store on each worker. The block manager talks to the communication manager to fetch block data from remote nodes; the communication manager is an asynchronous networking library. The MOT is responsible for tracking where each map task runs and returns this information to the workers, which cache it. When a mapper's output is lost, a "generation id" is used to invalidate the cached information. The interaction of the components in Spark is shown in Figure 2.4.
Figure 2.4 Components in a Spark cluster
An RDD can be stored in one of the following three ways (see the sketch after this list):
- As deserialized Java objects in the Java virtual machine: this gives the best performance, since the objects live directly in JVM memory.
- As serialized Java objects in memory: this representation is more memory-efficient, but it sacrifices access speed.
- Stored on disk: this performs worst, but it is the only option when the RDD is too large to keep in memory.
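These three options correspond to storage levels in Spark's persist API; a minimal sketch with hypothetical input files:

```scala
import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkContext sc; an RDD's storage level can be set only once.
val hot  = sc.textFile("hdfs://namenode/data/hot.txt")
val warm = sc.textFile("hdfs://namenode/data/warm.txt")
val cold = sc.textFile("hdfs://namenode/data/cold.txt")

hot.persist(StorageLevel.MEMORY_ONLY)      // deserialized Java objects in the JVM
warm.persist(StorageLevel.MEMORY_ONLY_SER) // serialized objects in memory
cold.persist(StorageLevel.DISK_ONLY)       // stored on disk
```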
Once memory is full, Spark's memory management evicts RDD partitions using a least recently used (LRU) policy. However, partitions belonging to the same RDD as the one currently being computed are exempt from eviction, because a program typically iterates over a large RDD, and evicting partitions of that RDD while it is in use would cause thrashing.
The lineage graph holds enough information to reconstruct lost partitions of an RDD. For efficiency, however (rebuilding an entire RDD may require a large amount of computation), checkpointing is still needed, and the user controls which RDDs to checkpoint. Checkpointing pays off for RDDs with wide dependencies, because recomputing missing partitions there requires significant communication and computation. For RDDs with only narrow dependencies, checkpointing is rarely worthwhile.
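A minimal sketch of user-controlled checkpointing, assuming a hypothetical HDFS checkpoint directory; checkpoint() must be called before the action that materializes the RDD:

```scala
// Assumes an existing SparkContext sc.
sc.setCheckpointDir("hdfs://namenode/checkpoints")

val edges = sc.textFile("hdfs://namenode/data/edges.txt")
  .map(_.split("\t"))
  .map(parts => (parts(0), parts(1)))
val byKey = edges.groupByKey() // wide dependency: a good checkpoint candidate

byKey.checkpoint() // mark for checkpointing before any action runs
byKey.count()      // the action both computes and checkpoints the RDD
```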