Problem
How does Spark's computational model work in parallel? Imagine a box of bananas that three people want to take home to eat. If the box stays packed, only one person can carry it away. Anyone with common sense opens the box, pours out the bananas, repacks them into three smaller boxes, and each person carries one home to eat at leisure. Spark, like many other distributed computing systems, borrows this idea to achieve parallelism: a large dataset is cut into N smaller pieces, M executors (M < N) each work through one or more pieces at their own pace, and the partial results are then collected together, at which point the job is done. So Spark's first step for any job is the same: whatever data it is asked to process is first turned into a dataset made up of multiple pieces. That dataset is called an RDD.
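As a concrete (if toy-sized) illustration of this idea, the sketch below splits a small local collection into partitions. It assumes a Spark shell where `sc` (the SparkContext) is already available; the numbers are only illustrative.

```scala
// A minimal sketch of "cut one big dataset into pieces and work on them in parallel",
// assuming a Spark shell where `sc` (SparkContext) already exists.
val data = sc.parallelize(1 to 12, numSlices = 3)  // one logical dataset, split into 3 partitions
println(data.getNumPartitions)                     // 3

// glom() groups each partition into an array so we can see the split
data.glom().collect().foreach(p => println(p.mkString(",")))

// each partition can be processed by a different executor; the action combines the results
println(data.map(_ * 2).sum())
```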
RDD
An RDD (Resilient Distributed Dataset) is a read-only, partitioned collection of records. An RDD can only be created through deterministic operations on data in stable storage or on other RDDs. We call these operations transformations, to distinguish them from other kinds of operations; examples are map, filter, and join.
An RDD does not need to be materialized at all times (that is, actually computed and written to stable storage). Instead, an RDD carries enough information to describe how it was derived from data in stable storage. This is a powerful property: a program cannot reference an RDD that it could not rebuild after a failure. Users can control two other aspects of an RDD: persistence and partitioning. They can choose which RDDs will be reused and pick a storage strategy for them (e.g., in-memory storage), and they can ask for an RDD's records to be distributed across the machines of the cluster according to a key. The latter is useful for placement optimizations, such as ensuring that two datasets to be joined are partitioned with the same hash function.
In the Spark programming interface, programmers apply transformations (such as map and filter) to data in stable storage and obtain one or more RDDs. They can then invoke actions on these RDDs; an action either returns a value to the program or exports data to a storage system. Examples of actions are count (return the number of elements in the dataset), collect (return the elements themselves), and save (write the dataset to the storage system). Spark does not actually compute an RDD until the first action is invoked on it.
You can also call an RDD's persist method to indicate that the RDD will be reused in later operations. By default, Spark keeps persisted RDDs in memory, spilling them to disk when memory runs short. By passing arguments to persist, users can request other persistence strategies (such as storing with Tachyon) via flags: for example, store only on disk, or replicate the data across machines. Finally, users can set a persistence priority on each RDD to indicate which in-memory data should be spilled to disk first. Caching is handled by a cache manager, called the BlockManager in Spark. Note a common misconception: calling cache or persist does not cache the data at that moment; the actual caching is triggered by an action.
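To make that last point concrete, here is a minimal sketch, reusing the HDFS file used in the example below, showing that persist only marks the RDD and the cache is filled by the first action:

```scala
val raw = sc.textFile("/data/spark/bank/bank.csv")  // the same file used in the example below
raw.persist()  // equivalent to raw.cache(): only *marks* the RDD, nothing is cached yet
raw.count()    // first action: the file is read and the computed partitions are cached
raw.count()    // second action: served from the in-memory cache, no re-read from HDFS
```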
Suppose the data is stored on HDFS, with the fields of each row separated by ";":
"Age";"Job";"Marital";"Education";"Default";"Balance";"Housing";"Loan" -;"Unemployed";"Married";"PRIMARY";"No";1787;"No";"No" -;"Services";"Married";"Secondary";"No";4789;"Yes";"Yes"
```scala
// 1. Define an RDD based on an HDFS file (consisting of several lines of text)
val lines = sc.textFile("/data/spark/bank/bank.csv")
// 2. The first line is the file's header, so remove it and return a new RDD, withoutTitleLines
val withoutTitleLines = lines.filter(!_.contains("Age"))
// 3. Split each row on ";" and return a new RDD named lineOfData
val lineOfData = withoutTitleLines.map(_.split(";"))
// 4. Cache lineOfData in memory and set its name to "lineOfData"
lineOfData.setName("lineOfData")
lineOfData.persist()
// 5. Select the records whose age is greater than 30 and return a new RDD, gtThirtyYearsData
val gtThirtyYearsData = lineOfData.filter(line => line(0).toInt > 30)
// Up to this point, no work has been performed on the cluster.
// Now the user invokes an action on the RDD:
// count how many people are older than 30
gtThirtyYearsData.count
// returned result: 3027
```
- Lineage
In the over-30 query above, we started from the RDD lines, removed the title line to get withoutTitleLines, applied map to withoutTitleLines to split each row, then filtered for people older than 30, and finally counted all the records. The Spark scheduler pipelines the last two transformations and sends a set of tasks to the nodes that hold the cached partitions of lineOfData. In addition, if a partition of lineOfData is lost, Spark restores it by re-running the original split only on the rows corresponding to that partition.
More generally, when an RDD's data is unavailable during a Spark computation, it can be recomputed from its parent RDD; if the parent is also unavailable, that parent can in turn be recomputed from its own parent, and so on up the lineage.
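Continuing the bank example, the lineage that recovery follows can be inspected with toDebugString; the output sketched in the comments is only indicative and varies across Spark versions:

```scala
println(gtThirtyYearsData.toDebugString)
// Prints the chain of derivations, roughly:
//   MapPartitionsRDD (filter: age > 30)
//    |- MapPartitionsRDD (map: split on ";")   <- lineOfData, cached
//    |- MapPartitionsRDD (filter: drop header)
//    |- /data/spark/bank/bank.csv HadoopRDD
// If a cached partition of lineOfData is lost, Spark re-runs only the steps above it,
// and only for the rows belonging to the lost partition.
```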
- Transformations and actions
Transformations are lazy operations: they only define a new RDD rather than computing it immediately. Actions, by contrast, are evaluated immediately; they return a result to the program or write data to external storage.
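A short sketch of the difference, using the RDDs defined in the bank example above:

```scala
val splits = withoutTitleLines.map(_.split(";")) // transformation: returns a new RDD immediately, nothing is computed
val n      = splits.count()                      // action: runs a job and returns a Long to the driver
val sample = splits.take(5)                      // action: runs a (partial) job and returns a local Array of rows
```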
An RDD can be briefly summarized by five pieces of information: a set of partitions, the smallest shards of the dataset; a set of dependencies pointing to its parent RDDs; a function for computing its data from the parent RDDs; and metadata about its partitioning scheme and the preferred locations of its data. For example, an RDD representing an HDFS file has one partition per file block and knows the location of each block. Likewise, the RDD produced by a map has the same partitions as its parent; when its elements are computed, the map function is applied to the parent RDD's data.
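The sketch below spells out these five pieces of information as a simplified interface. It is modeled on the description in the RDD paper; the names resemble, but are not identical to, Spark's internal org.apache.spark.rdd.RDD API.

```scala
// Placeholder types so the sketch is self-contained; Spark has its own
// Partition, Dependency and Partitioner classes.
trait Partition
trait Dependency
trait Partitioner

trait SimpleRDD[T] {
  def partitions: Seq[Partition]                         // the smallest shards of the dataset
  def dependencies: Seq[Dependency]                      // pointers to the parent RDD(s)
  def compute(split: Partition): Iterator[T]             // derive this RDD's data from its parents
  def partitioner: Option[Partitioner]                   // partitioning metadata (e.g. hash by key), if any
  def preferredLocations(split: Partition): Seq[String]  // where a partition is cheapest to read (e.g. HDFS block hosts)
}
```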
RDD dependencies: Spark divides the dependencies between RDDs into two categories. Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD (a one-to-one dependency). Wide dependency: a single partition of the parent RDD is depended on by multiple partitions of the child RDD (a one-to-many dependency). For example, a map is a narrow dependency, while a join is a wide dependency (unless the parent RDDs are already hash-partitioned).
Detailed Description:
First, narrow dependencies allow pipelined execution on a single cluster node, which can compute all of the parent partitions it needs. For example, a filter followed by a map can be executed element by element. With a wide dependency, by contrast, the data of all parent partitions must be available and shuffled across nodes with a MapReduce-like operation.
Second, recovery after a node failure is more efficient with narrow dependencies: only the lost parent partitions need to be recomputed, and they can be recomputed in parallel on different nodes. With wide dependencies, by contrast, a single failed node may cause the loss of some partitions of every ancestor RDD, forcing a complete recomputation.
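A small sketch of the two kinds of dependency just described; the class names in the comments are what rdd.dependencies typically reports, though the exact printed form varies across Spark versions:

```scala
val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val mapped  = pairs.mapValues(_ + 1)    // narrow: each child partition reads exactly one parent partition
val reduced = pairs.reduceByKey(_ + _)  // wide: a child partition may need data from every parent partition

println(mapped.dependencies)   // a OneToOneDependency (narrow)
println(reduced.dependencies)  // a ShuffleDependency (wide)
```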
For HDFS files as the input RDD: partitions returns one partition per block of the file (with the block's offset stored in each Partition object), preferredLocations returns the nodes holding each block, and iterator reads the block.
For map: calling map on any RDD returns a MappedRDD object. This object has the same partitions and preferred locations as its parent, but its iterator method applies the function passed to map to each record of the parent.
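A minimal sketch of the map case, reusing the bank file path from the earlier example:

```scala
val file  = sc.textFile("/data/spark/bank/bank.csv")  // roughly one partition per HDFS block
val upper = file.map(_.toUpperCase)                   // same partitions and preferred locations as `file`

println(file.partitions.length == upper.partitions.length)  // true
// upper's iterator simply applies _.toUpperCase to each record of the corresponding parent partition
```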
- Job scheduling
When a user runs an action (such as count or save) on an RDD, the scheduler examines the RDD's lineage and builds a DAG (directed acyclic graph) of stages to execute.
Each stage contains as many consecutive narrow-dependency transformations as possible. Stage boundaries are the shuffle operations required by wide dependencies, or any already computed partitions that let the scheduler short-circuit the computation of a parent RDD. The scheduler then launches tasks to compute the missing partitions of each stage until the target RDD has been produced.
The scheduler uses delay scheduling to assign tasks to machines, based on data locality. If the partition a task needs to process is cached in the memory of some node, the task is sent to that node; otherwise, if the RDD's partition has preferred locations (for example, the nodes holding an HDFS block), the task is sent to one of them.
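As an illustration, the job below contains one wide dependency and therefore splits into two stages; the stage breakdown in the comments describes how the scheduler would typically cut it, not literal Spark output:

```scala
val words  = sc.textFile("/data/spark/bank/bank.csv").flatMap(_.split(";"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)  // wide dependency => shuffle
counts.collect()
// Stage 1: textFile -> flatMap -> map          (all narrow dependencies, pipelined per partition)
// ---- shuffle boundary (reduceByKey) ----
// Stage 2: shuffle read -> reduce -> collect   (results returned to the driver)
// Stage 1 tasks are preferentially placed on the nodes holding the corresponding HDFS blocks.
```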
"Actions that correspond to a wide dependency class, such as shuffle dependencies, will be physically stored on the node that holds the parent partition." This is similar to the output of the MapReduce materialized map, which simplifies the data recovery process.
If a task fails, it is re-run on another node, as long as the parent data of its stage is still available. If some stages have become unavailable (for example, because a shuffle's map-side output was lost), the corresponding tasks are resubmitted to recompute the missing partitions in parallel.
If a task runs slowly (a "straggler"), the system launches a speculative copy of it on another node, similar to MapReduce's approach, and takes whichever result finishes first.
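Speculative execution is controlled through configuration. A sketch for a standalone program follows; the configuration keys exist in current Spark releases, but the values chosen here are only illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("speculation-example")           // master is assumed to be supplied via spark-submit
  .set("spark.speculation", "true")            // launch backup copies of slow tasks
  .set("spark.speculation.multiplier", "1.5")  // how many times slower than the median counts as "slow"
val sc = new SparkContext(conf)
```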
- Memory management
Spark provides three storage strategies for persisted RDDs: in-memory storage as deserialized Java objects, in-memory storage as serialized data, and on-disk storage. The first option gives the best performance, because the Java virtual machine can access the RDD's objects in memory directly. The second lets users choose a more memory-efficient representation than the Java object graph when space is limited, at the cost of lower performance. The third suits RDDs that are too large to keep in memory but expensive to recompute each time they are used.
To manage limited memory, Spark uses an LRU eviction policy at the level of RDDs. When a new RDD partition is computed but there is not enough space to store it, Spark evicts a partition from the least recently used RDD, unless that RDD is the same one the new partition belongs to. In that case Spark keeps the old partition in memory, to prevent partitions of the same RDD from cycling in and out. This matters because most operations run over all partitions of an RDD, so partitions already in memory are very likely to be needed again.
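The three strategies map onto the StorageLevel argument of persist; a brief sketch (an RDD's storage level can only be set once, so in practice you pick one of these for a given RDD):

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("/data/spark/bank/bank.csv").map(_.split(";"))
rdd.persist(StorageLevel.MEMORY_ONLY)         // deserialized Java objects in memory (fastest access)
// rdd.persist(StorageLevel.MEMORY_ONLY_SER)  // serialized bytes in memory (more compact, extra CPU cost)
// rdd.persist(StorageLevel.MEMORY_AND_DISK)  // spill partitions that do not fit in memory to disk
rdd.count()                                   // as before, the cache is filled by the first action
```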
- Checkpoint support (checkpoint)
Although lineage can always be used to recover an RDD after a failure, this recovery can take a long time for RDDs with long lineage chains. It is therefore helpful to checkpoint certain RDDs to stable storage.
In general, checkpointing is useful for RDDs with long lineage chains that contain wide dependencies, where the failure of a single node in the cluster may lose some data from every parent RDD and force a full recomputation. Conversely, for RDDs with narrow dependencies on data in stable storage, checkpointing is not worthwhile: if a node fails, the partitions the RDD lost on that node can be recomputed in parallel on other nodes, at a fraction of the cost of replicating the whole RDD.
Spark currently provides an API for checkpointing RDDs (persist with a REPLICATE flag), leaving it to the user to decide which data to checkpoint.
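A minimal checkpointing sketch, reusing the HDFS file from the earlier example (the checkpoint directory is illustrative):

```scala
sc.setCheckpointDir("hdfs:///data/spark/checkpoints")  // illustrative directory for checkpoint files

val longLineage = sc.textFile("/data/spark/bank/bank.csv")
  .filter(!_.contains("Age"))
  .map(_.split(";"))

longLineage.persist()     // recommended, so the data is not computed twice
longLineage.checkpoint()  // mark for checkpointing; the write happens when an action runs
longLineage.count()       // triggers both the job and the checkpoint write, truncating the lineage
```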
Finally, the read-only nature of RDDs makes checkpointing simpler than for general shared memory: because consistency is not a concern, an RDD can be written out in the background, without pausing the program or taking a distributed snapshot.
Summary
In short, an RDD's key characteristics are:
1. The data structure is immutable
2. It supports distributed operations on data across a cluster
3. Records can be partitioned by key
4. It provides coarse-grained transformation operations
5. Data can be kept in memory, giving low latency
Spark: Understanding RDD