Reposted from: http://kaimingwan.com/post/alluxio/spark-on-alluxiohe-mr-on-alluxioce-shi-gai-jin-ban
- 1. Introduction
- 2. Preparing the data
- 2.1 Emptying the system cache
- 3. MR Test
- 3.1 MR without Alluxio
- 3.2 MR with Alluxio
- 3.3 Supplementary Questions
- 4. Spark Test
- 4.1 Spark without Alluxio
- 4.2 Spark with Alluxio
- 5. First-stage Experimental summary
- 6. IO Experiment
- 6.1 Task Load
- 6.2 Reading 10G files from HDFS
- 6.3 Reading 10G files from Alluxio
- 7. Further attempts
- 7.1 Storage Equalization Processing
- 7.2 Count rows using MR Job
- 7.3 Count rows using MR on Alluxio
- 7.4 Summary
1. Introduction
We ran this test once before; see the article "Alluxio and Spark and MapReduce performance comparison". However, due to hardware limitations, the effect of Alluxio did not show.
Note: for the latest MR on Alluxio test, please refer to the article "MapReduce on Alluxio performance test".
This time we test again. The hardware configuration we use is as follows:
| IP | CPU | Number of cores | Memory | Roles |
|---|---|---|---|---|
| 10.8.12.16 | Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz | 40 cores | 128GB | namenode, alluxio-master, datanode, alluxio-worker, YARN manager |
| 10.8.12.17 | Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz | 40 cores | 128GB | standby namenode, standby alluxio-master, datanode, alluxio-worker |
| 10.8.12.18 | Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz | 40 cores | 128GB | alluxio-master, datanode, alluxio-worker |
2. Preparing the data
Prepare a 10G data file so that both MR and Spark can count its rows.

    # Create a new directory on Alluxio to store the test data
    alluxio fs mkdir /linecount
    # Generate the 10G test file
    dd if=… bs=… count=… of=10g.txt
    # Put the test file into Alluxio memory space
    alluxio fs copyFromLocal 10g.txt /linecount
    # Then persist it to HDFS; persisted files are stored under the /alluxio directory of HDFS
    alluxio fs persist /linecount/10g.txt
2.1 Emptying the system cache
To keep the experiment accurate, we need to clear the system cache after loading the data into HDFS and Alluxio. Run the following on each node:

    free && sync && echo 3 > /proc/sys/vm/drop_caches && free
3. MR Test
3.1 MR without Alluxio
Because we already wrote the program for the earlier test, we only change the access path and reuse it. If you do not have it, you can refer to my article: "Alluxio and Spark and MapReduce performance comparison".
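For readers without the earlier article at hand, the following is a minimal single-process sketch of what the line-count task computes. It is not the author's MR program: the class name `LineCountSketch` and the method `countLines` are hypothetical, and it reads from a generic `InputStream` rather than from HDFS or Alluxio.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the line-count job; not the author's MR program.
class LineCountSketch {

    // Counts '\n' bytes in the stream, reading it in chunks of bufferSize bytes.
    static long countLines(InputStream in, int bufferSize) {
        byte[] buffer = new byte[bufferSize];
        long lines = 0;
        try {
            int len;
            while ((len = in.read(buffer)) != -1) {
                for (int i = 0; i < len; i++) {
                    if (buffer[i] == '\n') {
                        lines++;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "a\nbb\nccc\n".getBytes(StandardCharsets.UTF_8);
        long lines = countLines(new ByteArrayInputStream(data), 4);
        System.out.println("lines = " + lines); // prints "lines = 3"
    }
}
```

In this sketch a last line without a trailing newline is not counted, which may differ from how a record-based MR input format treats it.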
Test results:
It takes 59 seconds to count the rows of the 10G file (the machines are quite powerful).
3.2 MR with Alluxio
Run the test:
It takes 59 seconds (the average over several runs is about 1 minute).
3.3 Supplementary Questions
The following error may appear during the experiment:
Diagnostics: org.apache.commons.codec.binary.Base64.encodeBase64String([B)Ljava/lang/String;
This problem is mainly caused by how the Alluxio source code was compiled. For details, please refer to my article: "alluxio1.2.0 for Hadoop 2.7.2 installation".
PS: After I reported this problem to the authors, it was fixed in version 1.3, so you can download the latest version.
4. Spark Test
4.1 Spark without Alluxio
Note that when we ran the Spark job, it actually ran slower than MR. With no tuning of Spark, running the same line-count task in spark-shell with the default settings took 6 minutes.
For a description of this problem, see my question on SF: Why line count job runs slower in spark shell than a mapreduce job
After simple tuning, counting the rows of the 10G file takes 1 minute and 27 seconds (this is only simple tuning; it is still not as fast as MR, but much better than the original 6 minutes).
4.2 Spark with Alluxio
It takes 1 minute and 21 seconds.
5. First-stage Experimental summary
Alluxio did not achieve the desired effect. What is the reason? Let's analyze it.
First, look at the official Spark on Alluxio versus Spark without Alluxio experiment: Using Alluxio to Improve the Performance and Consistency of HDFS Clusters
The official experiment simulates a load close to a real environment: there is a weekly analysis job and a monthly analysis job, and the tasks are also divided into I/O-intensive and CPU-intensive ones. We experimented with line count and did not achieve the desired result, possibly for the following reason:
We are sure that we cleared the cached data, yet across many runs the time spent with Alluxio and without Alluxio is basically the same. This suggests that, even when using Alluxio, the program may still be reading from HDFS, so there is no essential difference in time. What exactly the problem is remains to be explored.
6. IO Experiment
Alluxio's main role is IO acceleration. Had the previous experiments behaved normally, the IO time for reading the data and writing the results would have dropped sharply with Alluxio, because the data can be fetched directly from memory.
To focus on the crux of the problem, we now only call the IO interfaces of HDFS and Alluxio to read the file, and check whether Alluxio actually improves IO performance.
6.1 Task Load
The data to read is still a randomly generated 10G file. Whether reading from HDFS or from Alluxio, we use a single thread for the file read. The experiment is split into two large groups for comparison: reading the file from HDFS and reading the file from Alluxio. The experiments within each large group are divided into 2 task types.
6.2 Reading 10G files from HDFS
Note: be sure to clear the system cache and confirm it with free.
Using the code below, we read the file in batches of 64MB.
    import java.io.IOException;
    import java.net.URI;
    import java.util.Date;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RWHDFSTest {
        public static void main(String[] args) throws IOException {
            // Timing
            System.out.println("Program start timestamp information: " + new Date());
            final long startTime = System.currentTimeMillis();
            // initialConfig is the author's own helper; its body is not shown here
            Configuration conf = initialConfig("kaimingwan", "hdfs://ns", "10.8.12.16");
            // Set the URL of the file to be accessed on HDFS
            String uri = "/alluxio/linecount/10g.txt";
            Path path = new Path(uri);
            normalFileReader(conf, path, uri);
            final long endTime = System.currentTimeMillis();
            float excTime = (float) (endTime - startTime) / 1000;
            System.out.println("Execution time: " + excTime + "s");
            System.out.println("The current time is: " + new Date());
        }

        // Read data of any format
        public static void normalFileReader(Configuration conf, Path path, String uri) throws IOException {
            FileSystem fileSystem = FileSystem.get(URI.create(uri), conf);
            // try-with-resources closes the stream once the file has been read
            try (FSDataInputStream fsDataInputStream = fileSystem.open(path)) {
                // Read up to 64MB per call
                byte[] buffer = new byte[67108864];
                // Record the length read
                int len = fsDataInputStream.read(buffer);
                while (len != -1) {
                    // System.out.write(buffer, 0, len);
                    len = fsDataInputStream.read(buffer);
                }
            }
        }
    }
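One detail of the read loop above is worth making explicit: `read(buffer)` may return fewer bytes than the buffer holds, so a 64MB buffer does not guarantee 64MB per call. The sketch below shows the same loop against a plain `java.io.InputStream`, summing the bytes actually read and turning them into a throughput figure. The class `ReadThroughputSketch` and its methods are hypothetical names, and the in-memory stream only stands in for an HDFS or Alluxio stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical sketch; an in-memory stream stands in for the HDFS/Alluxio stream.
class ReadThroughputSketch {

    // Drains the stream in chunks of at most bufferSize bytes and returns the total bytes read.
    static long drain(InputStream in, int bufferSize) {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        try {
            int len;
            while ((len = in.read(buffer)) != -1) {
                total += len; // read() may return fewer bytes than the buffer holds
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    // Converts a byte count and an elapsed time in nanoseconds to MB/s.
    static double mbPerSecond(long bytes, long elapsedNanos) {
        return (bytes / (1024.0 * 1024.0)) / (elapsedNanos / 1e9);
    }

    public static void main(String[] args) {
        byte[] data = new byte[10 * 1024 * 1024];
        long start = System.nanoTime();
        long total = drain(new ByteArrayInputStream(data), 64 * 1024);
        long elapsed = Math.max(1, System.nanoTime() - start);
        System.out.printf("read %d bytes, %.1f MB/s%n", total, mbPerSecond(total, elapsed));
    }
}
```

Reporting bytes per second rather than only wall-clock time makes runs against HDFS and Alluxio easier to compare, and confirms that the whole file was actually read.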
6.3 Reading 10G files from Alluxio
Change the address above to alluxio://10.8.12.16:19998 and run again.
This run also takes a long time.
7. Further attempts
The poor results may have the following causes, which we test further:
- Submitting the MR job code remotely, or operating on HDFS remotely, wastes a lot of IO time
- The data blocks of the 10G file may all sit on a single worker, which causes network IO overhead across workers
7.1 Storage Equalization Processing
To make the experimental results valid, we first ensure that the data blocks of the 10G file are evenly distributed among all workers.

    # If you copy the file from the local file system, you can use:
    alluxio fs -Dalluxio.user.file.write.location.policy.class=alluxio.client.file.policy.RoundRobinPolicy copyFromLocal 10g.txt /linecount/10g.txt
    # If you load the file from the UFS, you can use:
    alluxio fs -Dalluxio.user.file.write.location.policy.class=alluxio.client.file.policy.RoundRobinPolicy load /linecount/10g.txt
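To illustrate what an even, round-robin placement looks like on this three-worker cluster, here is a small sketch. It is not Alluxio's `RoundRobinPolicy` implementation, which may also take worker capacity into account. The class `RoundRobinSketch` is a hypothetical name, and the 512MB block size is an assumption that the article does not state.

```java
import java.util.Arrays;

// Illustration only; not Alluxio's RoundRobinPolicy implementation.
class RoundRobinSketch {

    // Returns how many blocks each worker receives when block i goes to worker (i % workers).
    static int[] distribute(int blocks, int workers) {
        int[] perWorker = new int[workers];
        for (int block = 0; block < blocks; block++) {
            perWorker[block % workers]++;
        }
        return perWorker;
    }

    public static void main(String[] args) {
        long fileBytes = 10L * 1024 * 1024 * 1024; // the 10G test file
        long blockBytes = 512L * 1024 * 1024;      // assumed block size, not stated in the article
        int blocks = (int) ((fileBytes + blockBytes - 1) / blockBytes);
        // prints "20 blocks -> [7, 7, 6]"
        System.out.println(blocks + " blocks -> " + Arrays.toString(distribute(blocks, 3)));
    }
}
```

Under these assumptions each worker holds roughly a third of the file, so a task scheduled on any node has a fair chance of finding some of its blocks locally.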
7.2 Count rows using MR Job
Again, make sure the system cache is cleared on each node. Package the program as a jar and submit it on the master node. It takes 1 minute and 03 seconds.
7.3 Count rows using MR on Alluxio
It takes 34 seconds.
7.4 Summary
After many twists and turns, we finally got the expected result: MR on Alluxio can bring a performance improvement. During this process I also used iostat and nload to monitor disk IO and network IO. The monitoring shows that MR on Alluxio really does read its data from memory. Although the load-balanced placement of the data reduces network IO overhead, the network monitoring still shows a significant amount of it. As a result, the improvement brought by Alluxio is relatively small; ideally it should be a several-fold improvement.
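To put a number on "relatively small", the timings from sections 7.2 and 7.3 (1 minute 03 seconds versus 34 seconds) can be turned into a speedup factor. The sketch below does only this arithmetic; the class and method names are hypothetical.

```java
import java.util.Locale;

// Arithmetic on the timings reported in sections 7.2 and 7.3.
class SpeedupSketch {

    // Ratio of baseline time to improved time.
    static double speedup(double baselineSeconds, double improvedSeconds) {
        return baselineSeconds / improvedSeconds;
    }

    // Share of the baseline time that was saved, in percent.
    static double reductionPercent(double baselineSeconds, double improvedSeconds) {
        return 100.0 * (baselineSeconds - improvedSeconds) / baselineSeconds;
    }

    public static void main(String[] args) {
        double mr = 63.0;          // 7.2: MR job, 1 minute 03 seconds
        double mrOnAlluxio = 34.0; // 7.3: MR on Alluxio, 34 seconds
        // prints "speedup: 1.85x"
        System.out.printf(Locale.ROOT, "speedup: %.2fx%n", speedup(mr, mrOnAlluxio));
        // prints "time saved: 46.0%"
        System.out.printf(Locale.ROOT, "time saved: %.1f%%%n", reductionPercent(mr, mrOnAlluxio));
    }
}
```

A speedup of about 1.85x is a real gain, but it is well short of a several-fold improvement.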
It can be seen that Alluxio only delivers its performance benefit with sensible design and configuration; otherwise using Alluxio may even make the results worse. The problem of excessive network IO traffic still needs to be explored. If you want to know more, follow my later articles.
Spark on Alluxio and MR on Alluxio test (improved version) (repost)