Spark Core Operator Optimization

Operator Optimization: mapPartitions
In Spark, the most basic principle is that each task processes one partition of an RDD.

Advantages of the mapPartitions operation:

With a normal map, if a partition contains 10,000 records, your function is executed 10,000 times, once per record.

With mapPartitions, however, the task executes the function only once, and that single invocation receives all the data in the partition. Since the function runs once per partition rather than once per record, performance is relatively high.

The shortcomings of mapPartitions (there are bound to be some):

With a normal map operation, each execution of the function processes one record. If memory runs low, say after 1,000 records have been processed, those already-processed records can be reclaimed by garbage collection (or freed in other ways) to make room for the rest.

Therefore, an ordinary map operation does not usually cause an OOM (out-of-memory) exception.

But with mapPartitions, for a large data volume, say a single partition with 1 million records, everything is passed to the function at once. Memory may suddenly be insufficient, and since there is no way to free that memory until the function finishes, an OOM (memory overflow) may occur.

When is the mapPartitions family of operations more suitable? When the data volume is not especially large, these operations perform very well and bring a real improvement: for example, a job that originally took 15 minutes dropped to 12 after one round of performance tuning, or from 10 minutes to 9.

But there have also been cases where, as soon as mapPartitions was used, the job went straight to OOM, a memory overflow, and crashed.

So in a project, first estimate the RDD's data volume, the volume of each partition, and the memory resources you allocate to each executor. Judge whether memory can hold all of a partition's data at once. If it looks feasible, try it; if it runs through, good, there will be some improvement in performance.

But if, after a try, you find it hits an OOM, give up on it.
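To make the contrast concrete, here is a minimal sketch, assuming a toy dataset and a local SparkContext (both hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("MapPartitionsDemo").setMaster("local[*]"))
val rdd = sc.parallelize(1 to 10000, numSlices = 4)

// map: the function is invoked once per record (10,000 invocations here).
val doubledByMap = rdd.map(_ * 2)

// mapPartitions: the function is invoked once per partition (4 invocations),
// receiving an iterator over all of that partition's records. Collecting the
// iterator into a list, as below, materializes the whole partition in memory
// at once, which is exactly where the OOM risk described above comes from.
val doubledByPartitions = rdd.mapPartitions { iter =>
  iter.toList.map(_ * 2).iterator
}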

Operator Optimization: reduceByKey

reduceByKey is a transformation operation, similar in spirit to the combiner in MapReduce.


val lines = sc.textFile("hdfs://")
val words = lines.flatMap(_.split(" "))
val pairs = words.map((_, 1))
val counts = pairs.reduceByKey(_ + _)
counts.collect()

Compared with ordinary shuffle operations (such as groupByKey), one feature of reduceByKey is that it performs local aggregation on the map side.

In the output file that each map-side task creates for the next stage, a local combine is performed before the data is written: for each key, the corresponding values are merged using your operator function (_ + _).
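As a concrete illustration, a minimal sketch with a hypothetical set of word pairs:

val wordPairs = sc.parallelize(Seq(("hello", 1), ("hello", 1), ("hello", 1), ("world", 1)))
// With reduceByKey(_ + _), each map-side task first merges its own values
// locally: three ("hello", 1) records become one ("hello", 3), so the shuffle
// transfers two records here instead of four.
val wordCounts = wordPairs.reduceByKey(_ + _)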

Performance improvements from reduceByKey:

1. After local aggregation, less data remains on the map side, which reduces disk I/O and also disk space consumption.
2. The next stage pulls less data, which reduces the performance cost of network transfer.
3. Less memory is needed for caching the data on the reduce side.
4. On the reduce side, the amount of data left to aggregate is also smaller.

Summary:

Under what circumstances should reduceByKey be used?

1. Very commonly, to implement something like the WordCount program: for the values corresponding to each key, perform some formula-based or algorithmic calculation (for example, accumulation).
2. For somewhat more complex operations, such as concatenating the strings for each key, you can weigh it yourself; sometimes reduceByKey can express these too, though not always elegantly (a hypothetical sketch follows this list). If you can do it, it is definitely helpful for performance. (Shuffle accounts for upwards of 90% of the performance cost of an entire Spark job, so anything that tunes the shuffle is valuable.)
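For instance, a minimal sketch of the string-concatenation case mentioned in point 2; the dataset and separator are hypothetical:

val names = sc.parallelize(Seq(("dept1", "alice"), ("dept1", "bob"), ("dept2", "carol")))
// Values for the same key are concatenated pairwise, partly on the map side,
// so less raw data crosses the shuffle than with groupByKey.
val joined = names.reduceByKey(_ + "," + _)
// joined contains ("dept1", "alice,bob") and ("dept2", "carol")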

Our program does not currently do this. But take it as a study question: see whether our session-aggregation step could be rewritten with reduceByKey to improve performance.

Operator Optimization: repartition


Tuning with the repartition operator solves the performance problem of Spark SQL stages having a low degree of parallelism.
spark.sql.shuffle.partitions adjusts the shuffle parallelism of DataFrames.
spark.default.parallelism adjusts the shuffle parallelism of RDDs.

Degree of parallelism: as mentioned before, the degree of parallelism can be adjusted or set.
1. spark.default.parallelism
2. textFile(), passing a second parameter to specify the number of partitions (used relatively rarely)

Our project code does not set a degree of parallelism; in a production environment, it is best to set it yourself. The official site recommends an approach: your spark-submit script specifies how many executors the application starts in total, say 100, and how many CPU cores each executor gets, say 2-3, for a total of roughly 200 CPU cores for the application.

The official recommendation is to manually set the spark.default.parallelism parameter to 2-3 times the application's total CPU core count (the 200 specified in spark-submit), i.e., a degree of parallelism of 400-600, say 600.
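A minimal sketch of setting both parameters in code; the application name and the figure of 600 are just the example values from above:

import org.apache.spark.SparkConf

// Assuming roughly 200 total CPU cores, set parallelism to 2-3x that figure.
val conf = new SparkConf()
  .setAppName("ParallelismDemo")
  .set("spark.default.parallelism", "600")    // shuffle parallelism for RDDs
  .set("spark.sql.shuffle.partitions", "600") // shuffle parallelism for DataFrames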

Here is the catch.

Under what circumstances does the parallelism you set take effect, and under what circumstances does it not? If you never use Spark SQL (DataFrames), then by default every stage of your entire Spark application has the parallelism you set (unless you use the coalesce operator to reduce the partition count).

Here is the problem once Spark SQL is used: you cannot specify the parallelism of the stage that contains the Spark SQL query. By default, Spark SQL automatically sets that stage's parallelism from the HDFS file blocks backing the Hive table being queried. The parallelism you specify via the spark.default.parallelism parameter only takes effect in stages that contain no Spark SQL.

For example, your first stage uses Spark SQL to query some data out of a Hive table and then performs some transformation operations, ending with a shuffle operation (groupByKey); the next stage, after the shuffle, performs further transformation operations. Suppose the Hive table corresponds to an HDFS file with 20 blocks, and you set the spark.default.parallelism parameter to 100.

Then the parallelism of your first stage is not under your control: it has only 20 tasks. The second stage gets the parallelism you set yourself, 100.

So where is the problem?


By default, the parallelism of the Spark SQL stage cannot be set by us. That may or may not cause trouble. The transformation operations that follow Spark SQL within the same stage may involve very complex business logic, even complex algorithms. If Spark SQL defaults the task count to something small, say 20, then each of those tasks handles a large amount of data while executing a particularly complex algorithm.

When that happens, the first stage is especially slow, while the second stage, with its 1,000 tasks, whooshes through very quickly.

So what is the way around Spark SQL's fixed parallelism and task count?

The repartition operator. There is no way to change the parallelism and task count of the stage that runs the Spark SQL query itself. But you can take the RDD queried out by Spark SQL and apply the repartition operator to it, repartitioning it into more partitions, say from 20 partitions to 100.

From the repartitioned RDD onward, parallelism and task counts will match your expectations, and you avoid having the operators bound into the Spark SQL stage process large data volumes and complex algorithmic logic with only a handful of tasks.


This situation is quite likely to occur. For example, Spark SQL defaults the first stage to 20 tasks, but given your data volume and algorithm complexity you actually need 1,000 tasks executing in parallel.

In that case, you can perform a repartition operation on the RDD that Spark SQL has just queried, as sketched below.
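A minimal sketch of that pattern, assuming an available sqlContext (a HiveContext) and a hypothetical table name:

// The table name and target partition count are hypothetical.
val queried = sqlContext.sql("SELECT * FROM some_hive_table").rdd

// The stage that runs the SQL query keeps its block-derived parallelism
// (e.g. 20 tasks), but every stage after the repartition runs with 100 tasks.
val repartitioned = queried.repartition(100)

val processed = repartitioned.map { row =>
  // complex per-record business logic would go here
  row
}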


Operator Optimization: filter


By default, after a filter, the amount of data in each partition of the RDD is likely to differ (before the filter, each partition may have held roughly the same amount).

Problems:

1. Each partition holds less data, yet later processing still uses as many tasks as there are partitions, which somewhat wastes task computing resources.

2. Each partition holds a different amount of data, so the subsequent tasks processing those partitions face different workloads. This very easily causes data skew. For example, the second partition has only 100 records while the third has 900; with identical task-processing logic, different tasks may face a 9x or even 10x+ difference in data volume, and therefore a 9x or even 10x+ difference in speed. Some tasks then run very fast while others run very slowly. That is data skew.

We want to deal with both of the problems above.

1. For the first problem, we want to compress the partitions. Since there is less data, the partition count can shrink correspondingly: for example, 4 partitions can perfectly well become 2, so only 2 downstream tasks are needed and no task computing resources are wasted. (It is unnecessary to launch a whole task just to compute a partition holding only a trickle of data.)

2. For the second problem, the solution is the same as for the first: compress the partitions so that each holds roughly the same amount of data. Then the downstream tasks are assigned roughly equal data volumes, no task runs especially slowly while others run especially fast, and the data skew problem is avoided.

With the idea in hand, how do we implement it? With the coalesce operator, described next.


Operator Optimization: coalesce


coalesce is mainly used after a filter operation, when partition data volumes have become uneven, to compress the number of partitions: it reduces the partition count and makes each partition's data volume as even and compact as possible, which to some extent improves the performance of subsequent task computation.


Explanation:

Here, we filter the complete dataset down to the click-behavior records, and click behavior is only a small fraction of the total data (for example, 20%). So after the filter, each partition's data volume is, as discussed above, likely to be very uneven, and the total volume will certainly have become much smaller.

So in this case it is quite appropriate to use the coalesce operator after the filter to reduce the partition count, for example coalesce(100). This compresses the filtered data into a more compact layout of 100 data fragments, that is, 100 partitions, as sketched below.
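A minimal sketch of the filter-then-coalesce pattern; the click-detection predicate and the helper function are hypothetical:

import org.apache.spark.rdd.RDD

// Hypothetical input: an RDD of raw log lines loaded elsewhere.
def compactClicks(rawLogs: RDD[String]): RDD[String] = {
  // Keep only click events, assumed to be roughly 20% of the data.
  val clickLogs = rawLogs.filter(line => line.contains("click"))
  // Partitions are now smaller and uneven; coalesce compacts them into 100.
  // With the default shuffle = false, existing partitions are merged locally
  // rather than fully reshuffled.
  clickLogs.coalesce(100)
}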

One note about this coalesce operation:

If the run mode is local mode, used mainly for testing, there is no need to set partition counts or degrees of parallelism. Local mode simulates cluster execution inside a single process and performs very well on its own, and it also applies some internal optimizations to parallelism and partition counts. Setting them ourselves there is a bit superfluous.

But this, at least, demonstrates how the coalesce operator is used.


Operator Optimization: foreachPartition


The database-writing behavior of foreach:

Where is the performance flaw of the default foreach?

First, the function is called once for every single record: the task executes the function once per record. With 1 million records (in one partition), that is 1 million invocations. Performance is relatively poor.

Another very, very important point: if a database connection is created for each record, then 1 million database connections get created. Note that creating and destroying database connections is very, very performance-intensive, even though we have previously used a database connection pool, which only creates a fixed number of connections.

You still have to send a SQL statement over a database connection once per record, and MySQL has to execute each of those statements. With 1 million records, that is 1 million SQL statements sent.

Both of the points above (database connections, and sending SQL statements many times) are very performance-intensive.

foreachPartition: in a production environment, you typically write to the database using foreachPartition.

Using batch operations (one SQL statement with multiple sets of parameters), you send the SQL statement only once, and the 1 million records are bulk-inserted in one shot.

After switching to the foreachPartition operator, where is the benefit? (A sketch follows this list.)

1. The function we write is called once, receiving all the data of a partition.
2. Only one database connection needs to be created or obtained.
3. Only one SQL statement, with multiple sets of parameters, needs to be sent to the database.
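A minimal sketch of this pattern, assuming a hypothetical sessionRdd of (id, value) pairs, a hypothetical table test_table, and a plain JDBC connection (a real job would use the connection pool mentioned above):

import java.sql.DriverManager

sessionRdd.foreachPartition { iter =>
  // One connection per partition rather than per record; a real job would
  // borrow from a connection pool instead of calling DriverManager directly.
  val conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "user", "pass")
  try {
    conn.setAutoCommit(false)
    // One SQL statement with many sets of parameters, sent as a single batch.
    val stmt = conn.prepareStatement("INSERT INTO test_table(id, value) VALUES (?, ?)")
    iter.foreach { case (id, value) =>
      stmt.setInt(1, id)
      stmt.setString(2, value)
      stmt.addBatch()
    }
    stmt.executeBatch()
    conn.commit()
  } finally {
    conn.close()
  }
}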

In an actual production environment, foreachPartition is essentially always what is used. But it has the same problem as mapPartitions: if a partition's record count is truly enormous, say really 1 million, it is basically not very reliable. Passing everything in at once is very likely to cause an OOM, a memory overflow.

A data point for comparison, from a production environment: with roughly 1,000 records per partition, switching from foreach to foreachPartition brought a performance improvement on the order of 2-3 minutes.

Actual project steps:

First, a bulk-insert operation was encapsulated inside JdbcHelper.

Bulk-insert the session detail:

The only difference is that ISessionDetailDao needs to expose a bulk-insert method taking a List<SessionDetail> sessionDetails.







