- 1. Introduction
- 2. Experiment Notes
- 2.1 Experimental environment
- 2.2 Experimental methods
- 2.3 Experimental Load
- 3. MapReduce on Alluxio
- 3.1 Reading 10G files (1G split)
- 3.2 Reading 20G files (1G split)
- 3.3 Reading 60G files (1G split)
- 3.4 Reading 60G files (512MB split)
- 4. Spark on Alluxio
- 5. Points to note about using Alluxio to improve performance
- 5.1 Does Alluxio read and write at memory speed?
- 5.2 How do I use Alluxio to boost MR job performance?
- 5.3 How do I use Alluxio to boost Spark job performance?
- 6. Summary
1. Introduction
I believe many people who use Alluxio are drawn to it by its promise of memory-speed acceleration. I also assumed that integrating Alluxio with Spark and Hadoop would easily boost the performance of existing jobs several times over. The reality, however, was not so smooth.
Here I summarize the problems and challenges encountered when using Alluxio to improve MR job and Spark job performance.
2. Experiment Notes
2.1 Experimental environment
Some experimental results will be included later when explaining the problems. To eliminate the impact of network IO, I deployed Hadoop, Spark, and Alluxio together on a single machine with 120 GB of memory and 40 cores.
2.2 Experimental methods
The approach is a controlled comparison: run the same job once with Alluxio and once without it, and compare the results.
2.3 Experimental Load
To make the results of the comparison as pronounced as possible, the experimental load was designed to maximize the advantage of Alluxio's in-memory storage.
The load is a job that does nothing but read text files. Its main performance bottleneck is IO, which gives in-memory storage the greatest possible effect.
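As a sketch of this kind of IO-bound load and the A/B timing method (a minimal Python stand-in, not the actual MR job; in the real experiment the same job would be pointed at an HDFS-backed path and at the same file loaded into Alluxio):

```python
import os
import tempfile
import time

def count_lines(path):
    """IO-bound load: stream the file and count its lines."""
    n = 0
    with open(path, "rb") as f:
        for _ in f:
            n += 1
    return n

def timed(job, path):
    """Run a job against one path and report (result, seconds)."""
    start = time.perf_counter()
    result = job(path)
    return result, time.perf_counter() - start

# Demo on a small generated file; the real comparison would time the
# same job against the with-Alluxio and without-Alluxio paths.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "data.txt")
    with open(path, "w") as f:
        for i in range(100_000):
            f.write(f"line {i}\n")
    lines, secs = timed(count_lines, path)
    print(lines)  # 100000
```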
3. MapReduce on Alluxio
3.1 Reading 10G files (1G split)
3.2 Reading 20G files (1G split)
3.3 Reading 60G files (1G split)
3.4 Reading 60G files (512MB split)
4. Spark on Alluxio
This can be tested with the spark-shell. The result: Alluxio does not provide a multiple-fold performance boost; in practice, performance with Alluxio is similar to performance without it.
5. Points to note about using Alluxio to improve performance
5.1 Does Alluxio read and write at memory speed?
Does Alluxio read and write at memory speed? The answer is: yes, provided there is no other interference (such as the overhead that comes with integrating distributed computing engines like Hadoop); many factors affect performance. The key difference is between reading and writing through the plain file-system API and reading and writing in the form of an MR job or Spark job. The conclusions:
The difference between testing against the pure file system and testing in job form:
A. Reading text files through the pure file-system API gives the expected performance improvement (roughly 8x). This is a relatively clean test with no interference.
B. Testing in job form: a production MR job was modified for the test, and performance was almost unchanged, whether it read text or sequence files. The reasons for this need to be analyzed.
Job-form results depend on the configuration and on the job's workload profile (different jobs have their performance bottlenecks in different places; to see a benefit, the job must be IO-intensive).
PS: There is one more pitfall. When operating through the file-system API on data in sequence-file format, reads and writes do not reach memory speed. I suspect the overhead is in decompression, since the sequence-file data in my experiment was compressed. Feedback is welcome.
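As a rough illustration of the decompression suspicion (plain Python gzip standing in for sequence-file compression; the actual codec and costs in the experiment may differ): even when the compressed bytes are already sitting in memory, reading them back pays a CPU cost that can hide memory-speed IO.

```python
import gzip
import time

# Hypothetical illustration: the compressed bytes are fully in memory,
# yet restoring the records still costs pure CPU time for decompression.
raw = b"some repeated record\n" * 500_000
compressed = gzip.compress(raw)

start = time.perf_counter()
restored = gzip.decompress(compressed)
cpu_cost = time.perf_counter() - start

print(len(raw), len(compressed), restored == raw)
```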
5.2 How do I use Alluxio to boost MR job performance?
It is possible to use Alluxio to improve MR job performance, provided the following requirements are met:
- The job is IO-intensive
- The parameter configuration meets the requirements
- The split size is large (that is, fewer map tasks); with a split size of at least 1G, a purely IO-intensive job can see a performance improvement
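The relationship between split size and map-task count behind the last bullet can be sketched as follows (a simplification: real InputFormats also respect block boundaries and minimum/maximum split settings):

```python
import math

def num_map_tasks(file_size_bytes, split_size_bytes):
    """Each map task processes one input split, so the map count is
    the file size divided by the split size, rounded up."""
    return math.ceil(file_size_bytes / split_size_bytes)

GB = 1024 ** 3
MB = 1024 ** 2

print(num_map_tasks(60 * GB, 1 * GB))    # 60 map tasks (Section 3.3)
print(num_map_tasks(60 * GB, 512 * MB))  # 120 map tasks (Section 3.4)
```

Halving the split size doubles the number of map tasks, which is why a larger split size means fewer, longer-running maps.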
These conclusions follow from the experimental results in Section 3. In practice, however, you will run into many problems: making the split size very large, for example, causes all sorts of issues when adjusting the configuration. And even though a larger split size speeds up reads within a single map task, at the cluster level it reduces map-task concurrency.
Even under ideal conditions, I am afraid a multiple-fold improvement is not achievable (the Section 3 experiments used a purely IO-bound job and as large a split size as practical, yet still gained only about 50%). I did not adopt an even larger split size because that configuration raised other problems that kept the cluster from working properly.
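Why even a purely IO-bound job gains only about 50% can be framed with Amdahl's law: only the IO fraction of the runtime is accelerated, so even a large IO speedup yields a bounded overall speedup. A small sketch with hypothetical numbers (the 8x IO speedup echoes the file-API result above; the IO fractions are assumptions, not measurements):

```python
def overall_speedup(io_fraction, io_speedup):
    """Amdahl's law: only the IO fraction of total runtime is sped up."""
    return 1.0 / ((1.0 - io_fraction) + io_fraction / io_speedup)

# If 50% of a map task's wall time is actually IO and Alluxio makes
# that IO 8x faster, the task overall gets only ~1.78x faster.
print(round(overall_speedup(0.5, 8.0), 2))  # 1.78
# Even at 80% IO time, the job caps out well below 8x.
print(round(overall_speedup(0.8, 8.0), 2))  # 3.33
```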
5.3 How do I use Alluxio to boost Spark job performance?
The experiment ran a line-count job, counting the number of lines in the file. But performance was similar with and without Alluxio; the time is probably spent on task sharding. So using Spark on Alluxio to improve performance also requires the following:
- The Spark job is IO-intensive
- Ideally there are many Spark jobs, with RDDs shared between the different jobs (the in-memory data management provided by Alluxio handles hot data very well, raising the memory hit rate and thus improving performance)
- Scenarios like large-table join queries in Spark SQL look like a better fit, because such queries often share intermediate result sets; so Spark jobs doing large-table join queries can be expected to benefit from Alluxio
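The sharing pattern in the last two bullets, compute an intermediate result once and let later jobs reuse it through the storage layer, can be sketched in plain Python (a temporary directory stands in for an alluxio:// path; all function names here are hypothetical):

```python
import os
import tempfile

def expensive_intermediate():
    # Stands in for, e.g., a filtered or joined dataset.
    return [i * i for i in range(10)]

def materialize(result, path):
    # One job writes the intermediate result to shared storage once.
    with open(path, "w") as f:
        f.write("\n".join(map(str, result)))

def load(path):
    # Later jobs read the materialized result instead of recomputing it.
    with open(path) as f:
        return [int(line) for line in f]

with tempfile.TemporaryDirectory() as shared:
    path = os.path.join(shared, "intermediate.txt")
    materialize(expensive_intermediate(), path)  # computed once
    job_a = load(path)                           # consumers reuse it
    job_b = load(path)
    print(job_a == job_b == expensive_intermediate())  # True
```

The win comes from keeping the shared data hot in Alluxio's memory tier, so each consumer pays a memory read instead of a recomputation or a disk read.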
6. Summary
Using Alluxio to improve the performance of a single job is basically difficult, and I do not recommend it. If your scenario has many jobs and relatively limited memory resources, then managing hot data sensibly through Alluxio can effectively improve performance. Large-table join queries in Spark or Hive are also expected to benefit. In other scenarios it is hard to see much performance gain.
Regarding Alluxio's application scenarios, Alluxio's author has also given a summary, in the interview "An interview with Bin Fan: Alluxio three years after going open source":
Bin Fan: Alluxio, as a memory-speed virtual distributed storage system, has several common usage scenarios:
- The computing layer needs repeated access to remote data (for example, data in the cloud or in another data center);
- The computing layer needs to access multiple independent persistent data sources at the same time (for example, data in both S3 and HDFS);
- Multiple independent big-data applications (for example, different Spark jobs) need to share data efficiently at high speed;
- When the computing layer suffers from heavy memory pressure, JVM GC pressure, or a high task failure rate, using Alluxio's off-heap storage for input and output data can greatly relieve that pressure and make computation time and resource usage more controllable and predictable.
Practical Examples:
Alluxio's goal is to let the compute and storage layers travel light again, so that each can be optimized and developed independently without worrying about breaking the dependencies between the two. Concretely, Alluxio presents a file-system abstraction to the computing layer. Computation above this abstraction only needs to interact with Alluxio to access data, and the abstraction can front multiple different persistent stores at once (for example, one S3 deployment plus one HDFS deployment), while Alluxio itself provides a memory-speed storage system deployed close to the computation. A typical scenario is Baidu: Spark does not need to care whether data is in the local machine room or in a remote data center; it simply reads and writes data through Alluxio, and Alluxio intelligently synchronizes data from the remote site for the application when needed.
The main reason Baidu saw performance improvements with Alluxio: after digging deeper, they found where the problem lay. Because the data was distributed across multiple data centers, a query often had to fetch data from a remote data center, and that was the biggest source of latency when users ran queries.
See the article: "To cope with the data explosion, Baidu turned to this new open-source project".
I think the reason Alluxio falls short as an acceleration layer may be that no Alluxio-native acceleration layer was implemented specifically for MapReduce and Spark. If what you want is plug-and-play memory acceleration for compute jobs, consider Ignite. I have also written an article comparing Ignite and Alluxio, which you may find of interest.
One more reminder: before using Alluxio, look carefully at whether the scenarios above match your own usage scenario. If you want to integrate with MR and Spark and use memory to speed up job processing, consider Apache Ignite, which I plan to try out soon.
Using Alluxio to improve the performance of MR jobs and Spark jobs