Spark on Alluxio and MR on Alluxio Test (Improved Version) [Repost]

Source: Internet
Author: User

Reposted from: http://kaimingwan.com/post/alluxio/spark-on-alluxiohe-mr-on-alluxioce-shi-gai-jin-ban

    • 1. Introduction
    • 2. Preparing the data
      • 2.1 Emptying the system cache
    • 3. MR Test
      • 3.1 MR without Alluxio
      • 3.2 MR with Alluxio
      • 3.3 Supplementary note
    • 4. Spark Test
      • 4.1 Spark without Alluxio
      • 4.2 Spark with Alluxio
    • 5. First-stage Experimental summary
    • 6. IO Experiment
      • 6.1 Task Load
      • 6.2 Reading the 10G file from HDFS
      • 6.3 Reading the 10G file from Alluxio
    • 7. Further attempts
      • 7.1 Storage balancing
      • 7.2 Counting rows with an MR job
      • 7.3 Counting rows with MR on Alluxio
      • 7.4 Summary
1. Introduction

We ran a test before; see the article Alluxio vs. Spark and MapReduce performance comparison. However, due to hardware limitations, the benefit of Alluxio did not show up.

This time we test again. The hardware configuration we use is as follows:

Note: for the latest MR on Alluxio test, please refer to the article MapReduce on Alluxio performance test.

IP           CPU                                         Cores   Memory   Roles
10.8.12.16   Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz   40      128GB    NameNode, Alluxio master, DataNode, Alluxio worker, YARN ResourceManager
10.8.12.17   Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz   40      128GB    Standby NameNode, standby Alluxio master, DataNode, Alluxio worker
10.8.12.18   Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz   40      128GB    Alluxio master, DataNode, Alluxio worker
2. Preparing the data

Prepare a 10GB file that both MR and Spark will use to count the number of rows.

    # Create a directory on Alluxio to hold the test data
    alluxio fs mkdir /linecount
    # Generate the 10G test file
    dd if=... bs=... count=... of=10g.txt
    # Put the test file into Alluxio memory space
    alluxio fs copyFromLocal 10g.txt /linecount
    # Then persist it to HDFS; the persisted file is stored under the /alluxio directory on HDFS
    alluxio fs persist /linecount/10g.txt
2.1 Emptying the system cache

To ensure the accuracy of the experiment, we clear the system cache after loading the data into HDFS and Alluxio. Perform the following on every node:

    free && sync && echo 3 > /proc/sys/vm/drop_caches && free
3. MR Test

3.1 MR without Alluxio

Because the program was already written for the previous test, we simply change the access path and reuse it. If you have not written it yet, you can refer to my article: Alluxio vs. Spark and MapReduce performance comparison.
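
For reference, the job itself is just a line count. Below is a minimal sketch of such a MapReduce job in Java; it is illustrative only, since the author's original program is not shown in the post, and the class and method names are my own. The input path is taken from the command line, so pointing the same jar at hdfs://... or, later, at alluxio://... requires no code change.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LineCount {

        // Emit a 1 for every input line under a single key so the reducer can sum them.
        public static class LineMapper
                extends Mapper<LongWritable, Text, NullWritable, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                context.write(NullWritable.get(), ONE);
            }
        }

        // Sum the per-line counts into a single total.
        public static class SumReducer
                extends Reducer<NullWritable, LongWritable, NullWritable, LongWritable> {
            @Override
            protected void reduce(NullWritable key, Iterable<LongWritable> values, Context context)
                    throws IOException, InterruptedException {
                long total = 0;
                for (LongWritable v : values) {
                    total += v.get();
                }
                context.write(key, new LongWritable(total));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "line-count");
            job.setJarByClass(LineCount.class);
            job.setMapperClass(LineMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(LongWritable.class);
            // e.g. hdfs://ns/alluxio/linecount/10g.txt or alluxio://10.8.12.16:19998/linecount/10g.txt
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }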

Test results:

It takes 59 seconds to count the number of rows in the 10G file (the machine configuration is quite beefy).

3.2 MR with Alluxio

Run the test:

Takes 59 seconds (the average over several runs is about 1 minute).

3.3 Supplementary note

The following error may appear during the experiment:

    Diagnostics: org.apache.commons.codec.binary.Base64.encodeBase64String([B)Ljava/lang/String;

This problem is mainly caused by how the Alluxio source was compiled. For details, please refer to my article: Alluxio 1.2.0 for Hadoop 2.7.2 installation.

PS: after this issue was reported upstream, it was fixed in version 1.3, so you can simply download the latest version.

4. Spark Test

4.1 Spark without Alluxio

It is important to note that the Spark job actually runs slower than the MR job. With no tuning at all, running the same line-count task in spark-shell with the default settings took 6 minutes.
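
For reference, the task itself is trivial. The post runs it interactively in spark-shell, so the sketch below in Java is only an illustrative equivalent (the class name and the way the input path is passed in are my own assumptions):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SparkLineCount {
        public static void main(String[] args) {
            // args[0] is the input path, e.g. hdfs://ns/alluxio/linecount/10g.txt,
            // or alluxio://10.8.12.16:19998/linecount/10g.txt for the Alluxio run
            SparkConf conf = new SparkConf().setAppName("line-count");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                long lines = sc.textFile(args[0]).count();
                System.out.println("Line count: " + lines);
            }
        }
    }

How fast this runs depends heavily on parallelism and executor settings, which is presumably what the "simple tuning" mentioned below addresses.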

For a discussion of why the Spark job runs slower, see my question on SF: Why line count job runs slower in spark shell than a mapreduce job.

After simple tuning, counting the rows of the 10G file takes 1 minute 27 seconds (only simple tuning was done; it is still not as fast as MR, but far better than the original 6 minutes).

4.2 Spark with Alluxio

1 minute 21 seconds.

5. First-stage Experimental summary

Alluxio did not deliver the expected benefit. What is the reason? Let's analyze it.

First, we can take a look at the official Spark on Alluxio vs. Spark without Alluxio experiment: Using Alluxio to Improve the Performance and Consistency of HDFS Clusters.

Notice that the official experiment simulates a load that is close to a real environment: there is a weekly analysis job and a monthly analysis job, and the tasks are divided into IO-intensive and CPU-intensive ones. We experimented with a simple LineCount and did not get the expected result, possibly for the following reasons:

The first thing to confirm is that we did clear the cached data. Yet after many runs we found that the time with Alluxio and without Alluxio is basically the same. This suggests that even when using Alluxio the program may still be reading from HDFS, so there is no essential difference in time. Where exactly the problem lies is what we need to keep investigating.

6. IO Experiment

Alluxio's biggest benefit is IO acceleration. In the previous experiment, if everything had worked as intended, the IO time spent reading the data and writing the results should have dropped sharply once Alluxio was used, because the data can be served directly from memory.

To study the crux of the problem in a more focused way, we now simply call the IO interfaces of HDFS and Alluxio to read a file, and verify whether Alluxio's IO performance can actually deliver an improvement.

6.1 Task Load

The data to read is still the randomly generated 10G file. Whether reading from HDFS or from Alluxio, we use a single thread. The experiment is divided into two groups for comparison: reading the file from HDFS and reading the file from Alluxio. The experiments within each group are further divided into two task types.

6.2 Reading the 10G file from HDFS

Note: be sure to clear the system cache first and confirm with free.

Using the code below, we read the file in batches of 64MB.

    import java.io.IOException;
    import java.net.URI;
    import java.util.Date;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RWHdfsTest {

        public static void main(String[] args) throws IOException {
            // Record the start time
            System.out.println("Program start time: " + new Date());
            final long startTime = System.currentTimeMillis();
            // initialConfig is the author's helper for building the HDFS Configuration
            // (its definition is not shown in the post)
            Configuration conf = initialConfig("kaimingwan", "hdfs://ns", "10.8.12.16");
            // Set the URI of the file to read from HDFS
            String uri = "/alluxio/linecount/10g.txt";
            Path path = new Path(uri);
            normalFileReader(conf, path, uri);
            final long endTime = System.currentTimeMillis();
            float execTime = (float) (endTime - startTime) / 1000;
            System.out.println("Execution time: " + execTime + "s");
            System.out.println("Current time: " + new Date());
        }

        // Read data of any format
        public static void normalFileReader(Configuration conf, Path path, String uri) throws IOException {
            FileSystem fileSystem = FileSystem.get(URI.create(uri), conf);
            FSDataInputStream fsDataInputStream = fileSystem.open(path);
            // Read in 64MB batches
            byte[] buffer = new byte[67108864];
            // Track how many bytes each read returns
            int len = fsDataInputStream.read(buffer);
            while (len != -1) {
                // System.out.write(buffer, 0, len);
                len = fsDataInputStream.read(buffer);
            }
        }
    }

6.3 Reading the 10G file from Alluxio

Change the address above to alluxio://10.8.12.16:19998 and run again.
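
Concretely, only the URI in the reader changes. A sketch of the modified lines, assuming the Alluxio client jar is on the classpath and the alluxio:// scheme is registered in the Hadoop configuration:

    // Read the same file through Alluxio instead of HDFS.
    // The file was copied into Alluxio under /linecount in section 2.
    String uri = "alluxio://10.8.12.16:19998/linecount/10g.txt";
    Path path = new Path(uri);
    normalFileReader(conf, path, uri);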

It still takes a long time.

7. Further attempts

It is possible that the following factors are the cause, so we test further:

    1. Submitting the MR job remotely, or operating on HDFS remotely, wastes a lot of IO time
    2. The data blocks of the 10G file are all concentrated on a single worker, which causes cross-worker network IO overhead
7.1 Storage balancing

To ensure the validity of the experimental results, we first make sure that the blocks of the 10G file are evenly distributed across all the workers.

    # If copying the file in from the local file system, use:
    alluxio fs -Dalluxio.user.file.write.location.policy.class=alluxio.client.file.policy.RoundRobinPolicy copyFromLocal 10g.txt /linecount/10g.txt
    # If loading it from the UFS, use:
    alluxio fs -Dalluxio.user.file.write.location.policy.class=alluxio.client.file.policy.RoundRobinPolicy load /linecount/10g.txt
7.2 Counting rows with an MR job

Again, make sure the system cache is emptied on every node. Package the program as a jar and submit it on the master node; it takes 1 minute 03 seconds.

7.3 Counting rows with MR on Alluxio

The result: it takes 34 seconds.

7.4 Summary

After a lot of "tribulations", we finally got the expected result: MR on Alluxio does bring a performance improvement. Along the way I also used iostat and nload to monitor disk IO and network IO, and the monitoring shows that MR on Alluxio is indeed reading data from memory. Although the load-balanced placement of the data reduces the network IO overhead, the monitoring still shows a significant amount of network IO. As a result, the performance improvement brought by Alluxio is relatively small; in principle it should be a several-fold improvement.

It can be seen that Alluxio's performance gains only come through reasonable design and configuration; used carelessly, Alluxio may even make things worse. The problem of excessive network IO traffic still needs to be explored; if you want to know more, watch for my follow-up articles.
