Spark Learning Notes Summary - Super Classic Summary


About Spark

Spark integrates easily with YARN and can directly read data from HDFS and HBase, working alongside Hadoop. Configuration is straightforward.

Spark is evolving quickly, and its framework is more flexible and practical than Hadoop MapReduce. It reduces processing latency, which improves both performance and flexibility, and in practice it can be combined with Hadoop.

Spark Core is built around the RDD. Components such as Spark SQL, Spark Streaming, MLlib, GraphX, and SparkR solve many big data problems, and this well-rounded framework is becoming increasingly popular. Its surrounding ecosystem, including visualization tools such as Zeppelin, keeps growing, and large companies are rushing to replace the corresponding Hadoop modules with Spark. Spark's read/write path is memory-based and, unlike Hadoop MapReduce, does not always spill to disk, which is why it is fast. In addition, the DAG-based job scheduling system, with its handling of narrow and wide dependencies, further improves Spark's speed.

Spark Core composition

1. RDD

An RDD (Resilient Distributed Dataset) is fully resilient: if part of the data is lost, it can be rebuilt. RDDs provide automatic fault tolerance, location-aware scheduling, and scalability, achieving fault tolerance through data checkpoints and by recording the lineage of data updates. SparkContext.textFile() loads a file into an RDD; transformations then build new RDDs, and actions write an RDD out to an external system.
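
To make the load → transform → action flow concrete, here is a minimal Scala sketch; the local SparkContext setup and the HDFS paths are assumptions for illustration only.

    import org.apache.spark.{SparkConf, SparkContext}

    object RddBasics {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("rdd-basics").setMaster("local[*]"))

        // Load a text file into an RDD (hypothetical path).
        val lines = sc.textFile("hdfs:///data/input.txt")

        // Transformations only describe new RDDs; nothing runs yet.
        val lengths = lines.map(_.length).filter(_ > 0)

        // An action triggers the computation and writes to an external system.
        lengths.saveAsTextFile("hdfs:///data/output")

        sc.stop()
      }
    }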

RDDs are lazily evaluated: data is loaded only when it is actually needed. Materializing and storing every intermediate step would waste space, so evaluation is deferred. Once Spark sees the whole transformation chain, it computes only the result data that is actually required; if a later function does not need some of the data, that data is never loaded. Transformed RDDs are inert and are only evaluated when an action is applied.
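
A small sketch of this lazy behaviour, assuming an existing SparkContext named sc; the println side effect is only there to show when work actually happens.

    val nums = sc.parallelize(1 to 5)

    // Nothing is printed here: map merely records the transformation.
    val doubled = nums.map { n => println(s"processing $n"); n * 2 }

    // The action triggers the whole chain; only now do the printlns run
    // (on the executors, or on the driver console in local mode).
    val total = doubled.reduce(_ + _)   // total == 30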

Spark is divided into a driver and executors: the driver submits jobs, executors are the application's processes on worker nodes that run tasks, and the driver corresponds to the SparkContext. RDD operations come in two kinds, transformations and actions. A transformation is a dependency wrapper around an RDD; the RDD dependencies are built up and saved as a DAG, so that after a worker node goes down the lost data can be recomputed from the saved dependency metadata. When a job is submitted (runJob is called), Spark builds a DAG from the RDDs and hands it to the DAGScheduler, which is initialized when the SparkContext is created and which schedules the job. Once the dependency graph is built, parsing starts from the action: each operation becomes a task, and the graph is cut into a new TaskSet (stage) every time a shuffle is encountered, with shuffle data written to disk; data that is not shuffled stays in memory. The traversal keeps moving backwards until there are no more operators, and execution then runs from the front; if no action operator is ever reached, nothing runs at all, which is what gives Spark its lazy evaluation. Each TaskSet is submitted to the TaskScheduler, which creates a TaskSetManager and sends it to the executors to run; when a TaskSet finishes, the DAGScheduler is notified and submits the next one, and when a TaskSet fails it is returned to the DAGScheduler and resubmitted. A job may contain more than one TaskSet, and an application may contain more than one job.
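
The stage cut at a shuffle described above can be observed with RDD.toDebugString. A minimal sketch, assuming an existing SparkContext sc and a hypothetical input path:

    val words = sc.textFile("hdfs:///data/input.txt")
      .flatMap(_.split(" "))
      .map((_, 1))

    // reduceByKey introduces a shuffle, so the job is cut into two stages.
    val counts = words.reduceByKey(_ + _)

    // Print the lineage/DAG that the DAGScheduler will turn into stages.
    println(counts.toDebugString)

    counts.count()   // the action that actually submits the job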

2. Spark Streaming

Spark Streaming reads data (for example from Kafka) and divides the stream into small time slices (a few seconds), which are then processed in a batch-like way. Each time slice produces an RDD, which gives efficient fault tolerance, and the small batches are compatible with batch processing logic used in near real time, so historical data can be combined with real-time data for analysis (for example, with a classification algorithm). MapReduce-style transformations, joins, and other operations can also be performed on the small batches while keeping processing near real time. It is suitable for engineering problems that do not require millisecond-level latency.
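
A minimal micro-batch sketch with a 5-second batch interval. A socket source keeps it self-contained; in practice a Kafka input stream would be plugged in instead. The app name, master, host, and port are illustrative assumptions.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
    // Every 5-second slice of the stream becomes one RDD in the DStream.
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()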

Spark Streaming also has a StreamingContext, whose core abstraction is the DStream: a sequence of RDDs over a time series, essentially a structure keyed by time interval (duration) with an RDD as the value. Each RDD holds the data for one specific interval and can be persisted with persist(). As data streams in, a queue is maintained in the BlockGenerator; incoming stream data is placed in the queue, and when the batch interval arrives all of it is merged into a single RDD (the data for that interval). Job submission is similar to ordinary Spark, except that at submission time the RDDs inside the DStream are obtained and a job is generated; once an action is triggered, the job is placed in the JobManager's job queue and dispatched by the JobScheduler. The JobScheduler submits the job to Spark's own job scheduler, which turns it into many tasks distributed to the cluster for execution. Jobs are generated from the output streams, which then trigger a reverse traversal of the DStream DAG. Handling node failures during stream processing is more complicated than for offline data. Since 1.3, Spark Streaming can periodically write DStream checkpoints to HDFS and store offsets there as well, avoiding writes to ZooKeeper; if the driver node fails, the earlier state is read back from the checkpoint. If a worker node fails and the input source is HDFS or a file, Spark recomputes the lost data from its dependencies; for network sources such as Kafka or Flume, Spark replicates the received data across different nodes in the cluster, so after a worker failure the system can recompute from the surviving copies. If the receiver node itself fails, some data may be lost; the receiving thread is then restarted and receives data on another node.
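
A sketch of the checkpoint-based driver recovery mentioned above, using StreamingContext.getOrCreate; the checkpoint directory is a hypothetical HDFS path and the sources and transformations are elided.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/streaming-app"   // hypothetical path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("checkpointed-stream")
      val ssc = new StreamingContext(conf, Seconds(5))
      ssc.checkpoint(checkpointDir)   // periodically write DStream state to HDFS
      // ... define input sources and transformations here ...
      ssc
    }

    // After a driver failure, state is rebuilt from the checkpoint;
    // otherwise a fresh context is created.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()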

3. GraphX

GraphX is mainly used for graph computation. Its core algorithms include PageRank, SVD (singular value decomposition), triangle counting, and so on.
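
A short PageRank sketch with GraphX, assuming an existing SparkContext sc and a hypothetical edge-list file of "srcId dstId" pairs.

    import org.apache.spark.graphx.GraphLoader

    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")

    // Iterate until the ranks change by less than the given tolerance.
    val ranks = graph.pageRank(0.0001).vertices

    // Ten highest-ranked vertices.
    ranks.sortBy(-_._2).take(10).foreach(println)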

4. Spark SQL

Spark SQL is Spark's interactive big data SQL technology. It translates SQL statements into RDD operations on Spark and supports data sources such as Hive tables, JSON, and so on.
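
A minimal sketch of that SQL-to-Spark translation, assuming the Spark 2.x SparkSession API (older versions use an SQLContext/HiveContext in the same role) and a hypothetical JSON file.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("sql-sketch").getOrCreate()

    // Read JSON into a DataFrame (hypothetical path).
    val people = spark.read.json("hdfs:///data/people.json")

    // Register a temporary view and query it with SQL; the query is
    // translated into Spark operations under the hood.
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age >= 18").show()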

5. SparkR

SparkR lets you call Spark from the R language. It does not yet offer as wide an API as Scala or Java; it exposes the Spark API through an RDD class and lets users run tasks on the cluster interactively from R. The MLlib machine learning library is also integrated.

6. MLbase

From top to bottom, MLbase consists of the ML Optimizer (for end users), MLI (for algorithm users), MLlib (for algorithm developers), and Spark itself; MLlib can also be used directly. The ML Optimizer is the module that selects the most suitable machine learning algorithm and its parameters; MLI provides feature extraction and high-level ML programming abstractions as an API platform; MLlib is the distributed machine learning library, which keeps gaining new algorithms. MLRuntime is based on the Spark computing framework and applies Spark's distributed computation to machine learning. MLbase offers a simple declarative way to specify machine learning tasks and dynamically selects the optimal learning algorithm.
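
A small MLlib sketch (k-means clustering), assuming an existing SparkContext sc and a hypothetical input file of space-separated numeric features.

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Parse each line into a dense feature vector.
    val points = sc.textFile("hdfs:///data/points.txt")
      .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
      .cache()

    // Train a k-means model with 3 clusters and 20 iterations.
    val model = KMeans.train(points, 3, 20)

    model.clusterCenters.foreach(println)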

7. Tachyon

Tachyon is a highly fault-tolerant distributed file system that claims performance up to 3,000 times that of HDFS. It provides a Java-like interface and implements the HDFS interface, so Spark and MapReduce programs can run on it without any modification. It currently supports HDFS, S3, and other underlying storage.

8. Spark Operators

1. map. Processes each element of the source data, similar to a traversal, producing a MappedRDD; the original partitioning is unchanged.

2. flatMap. Each element of the source RDD is converted by a function into zero or more new elements, and the resulting collections are flattened into a single collection. For example, elements spread across multiple lists are merged into one large list. The classic case is word count, where each line becomes individual words: line.flatMap(_.split(" ")).map((_, 1)). Using map instead would turn each line into a list of words rather than flattening them.

3. mapPartitions. Applies a function to each partition as a whole (iterating within the partition), producing a MapPartitionsRDD.

4. union. Combines two RDDs into one. The element types of the two RDDs must be the same, and the returned RDD has the same element type as the RDDs being merged.

5. filter. Filters elements: the function f is called on every element, and only elements for which it returns true remain in the RDD.

6. distinct. Removes duplicate elements from the RDD.

7. subtract. Removes from RDD1 the elements that also appear in RDD2 (rdd1.subtract(rdd2) keeps only the elements of RDD1 not present in RDD2).

8. sample. Samples elements from an RDD. The first parameter, withReplacement, is true for sampling with replacement and false for sampling without replacement; the second parameter is the sampling fraction; the third is a random seed. For example: data.sample(true, 0.3, new Random().nextInt()).

9. takeSample. Same usage as sample, except the second parameter is the number of elements to return. The result is not an RDD but a collected array on the driver.

10. cache. Caches the RDD in memory; equivalent to persist(MEMORY_ONLY). The ratio between cache memory and execution memory can be tuned through configuration parameters; if the data is larger than the available cache memory, cached partitions are dropped.

11. persist. Takes a storage level such as DISK_ONLY, MEMORY_ONLY, or MEMORY_AND_DISK; with MEMORY_AND_DISK, data automatically spills to disk when the cache space is full.

12. mapValues. For key-value data, applies a map operation to the values only; the keys are left untouched.

13. reduceByKey. For key-value data, aggregates the values of the same key. Unlike groupByKey, it performs a combine step on each node (similar to MapReduce's combiner), which reduces the data IO of the shuffle and speeds things up. For operations that are not simple aggregations, you can concatenate the values of the same key into a string (or another format) and later iterate over the combined data to take it apart.

14. partitionBy. Repartitions the RDD, producing a ShuffledRDD via a shuffle; doing this once up front can speed up later operations that would otherwise shuffle repeatedly.

15. randomSplit. Randomly splits an RDD, e.g. data.randomSplit(Array(0.7, 0.3)), which returns an array of RDDs.

16. cogroup. For key-value elements in two RDDs, gathers the elements with the same key from each RDD into collections. Unlike reduceByKey, it combines the elements for the same key across two RDDs rather than reducing them.

17. join. Equivalent to an inner join: the two RDDs to be joined are first cogrouped, then a Cartesian product is taken over the two collections under each key, and each matching pair is output as a value. Equivalent to WHERE a.key = b.key in SQL.

18. leftOuterJoin, rightOuterJoin. A left join, as in a database, lists all rows from the left table and fills with null where no match exists on the right. Here, on top of the join above, the element from the right-hand RDD is checked: if it is missing, it is filled in as empty. The right outer join is the mirror image.

19. saveAsTextFile. Writes the data to the specified directory (for example on HDFS).

20. saveAsObjectFile. Writes the data to HDFS in SequenceFile format.

21. collect, collectAsMap. Turn the RDD into a local collection on the driver; the result is returned as a list or a HashMap.

22. count. Counts the elements of the RDD and returns the number.

23. top(k). Returns the k largest elements as a list.

24. take. Returns the first k elements of the data.

25. takeOrdered. Returns the k smallest elements of the data, preserving their order in the result. (A short sketch combining several of the operators above follows this list.)
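
A minimal sketch that strings several of the operators above together; it assumes an existing SparkContext sc and a hypothetical input path.

    import org.apache.spark.storage.StorageLevel

    val lines = sc.textFile("hdfs:///data/input.txt")

    val counts = lines
      .flatMap(_.split(" "))                    // flatMap: lines -> words
      .filter(_.nonEmpty)                       // filter: drop empty tokens
      .map((_, 1))                              // map: word -> (word, 1)
      .reduceByKey(_ + _)                       // reduceByKey: combine per key before shuffling
      .persist(StorageLevel.MEMORY_AND_DISK)    // persist: spill to disk when memory is full

    println(counts.count())                     // count: an action triggers the job
    counts.sortBy(_._2, ascending = false)      // sort by frequency, descending
      .take(5)                                  // take: the five most frequent words
      .foreach(println)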

9. Tips

1. rdd.repartition(n) repartitions the RDD up front. This is itself a shuffle and can be time-consuming, but if many actions follow it can reduce their running time. The value of n should relate to the number of CPU cores, generally two or more times the core count but less than about 1000 (a sketch combining this with tips 7 and 9 follows the list).

2. Do not use too many actions: each action turns the preceding TaskSets into a separate job, so as jobs accumulate and their tasks are not released, more memory is occupied and GC drags efficiency down.

3. Put a filter before a shuffle to reduce the amount of shuffled data, and filter out null and empty values.

4. Replace groupByKey with reduceByKey wherever possible. reduceByKey performs a reduce on each worker node before the overall reduce, equivalent to Hadoop's combine step (and the combine logic matches the reduce logic); groupByKey gives no such guarantee.

5. When joining, prefer joining a small RDD against a large RDD, and a large RDD against an extra-large RDD.

6. Avoid collect. With a large dataset, collect gathers data from all workers to the driver, increasing IO and reducing performance; save large results to HDFS instead.

7. If an RDD is reused in later iterations, cache it, but estimate the data size first to avoid caching more than fits in memory: if the data is larger than memory, previously cached blocks are evicted and must be recomputed, which hurts the computation. If you need everything stored, use persist(MEMORY_AND_DISK), since cache is just persist(MEMORY_ONLY).

8. Set spark.cleaner.ttl to clean up old tasks periodically: jobs can leave many stale cached objects behind, and periodic cleanup avoids a large concentrated GC that drags performance down.

9. Pre-partition appropriately by setting partitionBy(); each partiti
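
A small sketch of tips 1, 7, and 9, assuming an existing SparkContext sc, a hypothetical comma-separated input file with two fields per line, and an illustrative partition count.

    import org.apache.spark.HashPartitioner
    import org.apache.spark.storage.StorageLevel

    val pairs = sc.textFile("hdfs:///data/pairs.txt")
      .map { line => val Array(k, v) = line.split(","); (k, v) }   // assumes two fields per line

    // Pre-partition once (tip 9) so later key-based operations reuse the layout,
    // and persist with spill-to-disk (tip 7) because the result is reused below.
    val partitioned = pairs
      .partitionBy(new HashPartitioner(8))      // ~2x the cores of a hypothetical 4-core node (tip 1)
      .persist(StorageLevel.MEMORY_AND_DISK)

    // Both of these reuse the existing partitioning instead of reshuffling.
    val concatenated = partitioned.reduceByKey(_ + _)
    val nonEmpty = partitioned.filter { case (_, v) => v.nonEmpty }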

Original source: http://www.cnblogs.com/hellochennan/p/5372946.html
