- Based on Spark-0.4 and Hadoop-0.20.2
1. kmeans
Data: self-generated 3D points, clustered around the eight vertices of a cube with side length 10:
{0, 0, 0}, {0, 10, 0}, {0, 0, 10}, {0, 10, 10},
{10, 0, 0}, {10, 0, 10}, {10, 10, 0}, {10, 10, 10}
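The generation code is not shown in the original; below is a minimal sketch of how such a dataset could be produced. The Gaussian noise around each vertex, the random seed, and the output file name are assumptions, not the original generator:

```scala
import java.io.PrintWriter
import java.util.Random

// Hypothetical generator: pick a random cube vertex, add unit Gaussian noise per axis.
object GenCubeData {
  def main(args: Array[String]) {
    val vertices = for (x <- Seq(0.0, 10.0); y <- Seq(0.0, 10.0); z <- Seq(0.0, 10.0))
                   yield Array(x, y, z)
    val rand = new Random(42)                       // seed is an arbitrary choice
    val out = new PrintWriter("Square-10GB.txt")    // assumed output name
    for (_ <- 0L until 189918082L) {                // point count from the table below
      val v = vertices(rand.nextInt(vertices.size))
      out.println(v.map(_ + rand.nextGaussian()).mkString(" "))
    }
    out.close()
  }
}
```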
Point number: 189,918,082 (≈190 million 3D points)
Capacity: 10 GB
HDFS location: /user/lijiexu/kmeans/Square-10GB.txt
Program logic:
Read the blocks on HDFS into memory; each block becomes an RDD whose records are vectors (points). Map over the RDD, computing the class number of each point and emitting the (K, V) pair (class, (point, 1)) to form a new RDD. Before the reduce, combine within each partition, accumulating the per-class sum of points and the per-class count, so each partition emits at most K key-value pairs. Finally, reduce to obtain a new RDD whose key is the class and whose value is (sum, count), then map to get the final centers.
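A minimal sketch of one such iteration in Scala, modeled on the description above. The names `points` and `centers` and both helper functions are illustrative, and the usual `import SparkContext._` implicits are assumed; this is not the original SparkKMeans source:

```scala
// points: RDD[Array[Double]] parsed from the HDFS text file
// centers: Array[Array[Double]], the current K centers
def squaredDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def closestCenter(p: Array[Double], centers: Array[Array[Double]]): Int =
  centers.indices.minBy(i => squaredDist(p, centers(i)))

// Map: emit (class, (point, 1)). reduceByKey combines map-side first,
// so each partition is intended to send at most K pairs over the network.
val stats = points.map(p => (closestCenter(p, centers), (p, 1)))
  .reduceByKey { case ((s1, c1), (s2, c2)) =>
    (s1.zip(s2).map { case (x, y) => x + y }, c1 + c2)
  }

// New center = per-class sum / count.
val newCenters = stats.map { case (k, (sum, cnt)) => (k, sum.map(_ / cnt)) }
  .collect()
  .toMap
```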
First upload the data to HDFS, then run it on the master.
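The upload can be done with the standard HDFS put command; the local source path and Hadoop install directory shown here are assumptions:

```
root@master:/opt/hadoop-0.20.2# bin/hadoop fs -put /tmp/Square-10GB.txt /user/lijiexu/kmeans/Square-10GB.txt
```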
Run on the master:

    root@master:/opt/spark# ./run spark.examples.SparkKMeans master@master:5050 hdfs://master:9000/user/lijiexu/kmeans/Square-10GB.txt 8 2.0
The kmeans algorithm is iterative.
There are 160 tasks in total (160 × 64 MB = 10 GB).
32 CPU cores are used in total.
Memory consumption is about 4.5 GB per machine (40 GB in total): the point data itself takes 10 GB × 2, plus about 10 GB of intermediate (K, V) data after the map, i.e. (Int, (Vector, 1)).
Final result:
0.505246194 s
Final centers: Map(5 -> (13.997101228817169, 9.208875044622895, -2.494072457488311), 8 -> (-2.33522333047955, 9.128892414676326, 1.7923150585737604), 7 -> (8.658031587043952, 2.162306996983008, 17.670646829079146), 3 -> (11.530154433698268, 0.17834347219956842, 9.224352885937776), 4 -> (12.722903153986868, 8.812883284216143, 0.6564509961064319), 1 -> (6.458644369071984, 11.345681702383024, 7.041924994173552), 6 -> (12.887793408866614, -1.5189406469928937, 9.526393664105957), 2 -> (2.3345459304412164, 2.0173098597285533, 1.4772489989976143))
At 50 MB/s, scanning 10 GB takes about 3.5 min (10,240 MB ÷ 50 MB/s ≈ 205 s).
At 10 MB/s, scanning 10 GB takes about 15 min.
Test on 20 GB of data
Point number: 377,370,313 (≈380 million 3D points)
Capacity: 20 GB
HDFS location: /user/lijiexu/kmeans/Square-20GB.txt
Run the test command:
    root@master:/opt/spark# ./run spark.examples.SparkKMeans master@master:5050 hdfs://master:9000/user/lijiexu/kmeans/Square-20GB.txt 8 2.0 | tee mylogs/sqaure-20GB-kmeans.log
Clustering result:
Final centers: Map(5 -> (-0.47785701742763115, -1.5901830956323306, -0.18453046159033773), 8 -> (1.1073911553593858, 9.051671594514225, -0.44722211311446924), 7 -> (1.4960397239284795, 10.173412443492643, -1.7932911100570954), 3 -> (-1.4771114031182642, 9.046878176063172, -2.4747981387714444), 4 -> (-0.2796747780312184, 0.06910629855122015, 10.268115903887612), 1 -> (10.467618592186486, -1.168580362309453, -1.0462842137817263), 6 -> (0.7569895433952736, 0.8615441990490469, 9.552726007309518), 2 -> (10.807948500515304, -0.5368803187391366, 0.04258123037074164))
Basically eight centers are obtained.
Memory consumption: about 5.8 GB per node, about 50 GB in total.
Memory analysis:
20 GB of raw data and 20 GB of map output
| Iteration | Time   |
|-----------|--------|
| 1         | 108 s  |
| 2         | 0.93 s |
12/06/05 11:11:08 INFO spark.CacheTracker: Looking for RDD partition
12/06/05 11:11:08 INFO spark.CacheTracker: Found partition in cache!
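These log lines explain why the second iteration drops to under a second: after the first pass, the parsed points are served from the in-memory cache. A minimal sketch of the pattern (the inline parsing lambda stands in for the example's parse helper):

```scala
// Cache the parsed points: the first action pays the full HDFS scan + parse cost;
// every later iteration reads the partitions straight from memory.
val points = sc.textFile("hdfs://master:9000/user/lijiexu/kmeans/Square-20GB.txt")
               .map(line => line.split(" ").map(_.toDouble))
               .cache()
```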
Test on 20 GB of data (more iterations)
    root@master:/opt/spark# ./run spark.examples.SparkKMeans master@master:5050 hdfs://master:9000/user/lijiexu/kmeans/Square-20GB.txt 8 0.8
Number of tasks: 320
Time:
| Iteration | Time    |
|-----------|---------|
| 1         | 100.9 s |
| 2         | 0.93 s  |
| 3         | 4.6 s   |
| 4         | 3.9 s   |
| 5         | 3.9 s   |
| 6         | 3.9 s   |
Impact of the number of iterations on memory capacity:
Basically none; the main memory consumption is the 20 GB input-data RDD plus 20 GB of intermediate data.
Final centers: Map(5 -> (…e-5, 3.17334874733142e-5, -2.0605806380414582e-4), 8 -> (1.1841686358289191e-4, 10.000062966002101, 9.999933240005394), 7 -> (9.999976672588097, 10.000199556926772, -2.0695123602840933e-4), 3 -> (-1.3506815993198176e-4, 9.999948270638338, …e-5), 4 -> (3.2493629851483764e-4, -7.892413981250518e-5, 10.00002515017671), 1 -> (10.00004313126956, 7.431996896171192e-6, 7.590402882208648e-5), 6 -> (9.999982611661382, 10.000144597573051, 10.000037734639696), 2 -> (9.999958673426654, -1.1917651103354863e-4, 9.99990217533504))
Result Visualization
2. hdfstest
Test logic:

    package spark.examples

    import spark._

    object HdfsTest {
      def main(args: Array[String]) {
        val sc = new SparkContext(args(0), "HdfsTest")
        val file = sc.textFile(args(1))
        val mapped = file.map(s => s.length).cache()   // cache the per-line lengths in memory
        for (iter <- 1 to 10) {
          val start = System.currentTimeMillis()
          for (x <- mapped) { x + 2 }                  // force a full pass over the RDD
          val end = System.currentTimeMillis()
          println("Iteration " + iter + " took " + (end - start) + " ms")
        }
      }
    }
First, a text file is read from HDFS into `file`.
Then the length of each line is computed and saved in the in-memory RDD `mapped`.
Each iteration then reads every length in `mapped`, adds 2 to it, and measures the read + add time.
There is only a map, no reduce.
Test on the 10 GB Wikipedia data
This tests the read performance of the RDD.

    root@master:/opt/spark# ./run spark.examples.HdfsTest master@master:5050 hdfs://master:9000/user/lijiexu/Wikipedia/TXT/enwiki-20110405.txt
Test results:
Iteration 1 took 12900 ms (≈13 s)
Iteration 2 took 388 ms
Iteration 3 took 472 ms
Iteration 4 took 490 ms
Iteration 5 took 459 ms
Iteration 6 took 492 ms
Iteration 7 took 480 ms
Iteration 8 took 501 ms
Iteration 9 took 479 ms
Iteration 10 took 432 ms
Memory consumption is about 2.7 GB per node (9.4 GB × 3 in total).
Test on 90 GB of RandomText data

    root@master:/opt/spark# ./run spark.examples.HdfsTest master@master:5050 hdfs://master:9000/user/lijiexu/randomtext90gb/randomtext90gb
Time consumed:
| Iteration | Time consumed   |
|-----------|-----------------|
| 1         | 111.905310882 s |
| 2         | 4.681715228 s   |
| 3         | 4.469296148 s   |
| 4         | 4.441203887 s   |
| 5         | 1.999792125 s   |
| 6         | 2.151376037 s   |
| 7         | 1.889345699 s   |
| 8         | 1.847487668 s   |
| 9         | 1.827241743 s   |
| 10        | 1.747547323 s   |
The total memory consumption is about 30 GB.
Resource consumption of a single node:
3. Test wordcount
Write the program:

    import spark.SparkContext
    import SparkContext._

    object WordCount {
      def main(args: Array[String]) {
        if (args.length < 2) {
          System.err.println("Usage: WordCount <master> <jar>")
          System.exit(1)
        }
        val sp = new SparkContext(args(0), "WordCount", "/opt/spark", List(args(1)))
        val file = sp.textFile("hdfs://master:9000/user/lijiexu/Wikipedia/TXT/enwiki-20110405.txt")
        val counts = file.flatMap(line => line.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _)
        counts.saveAsTextFile("hdfs://master:9000/user/output/wikiresult3")
      }
    }
Package it into myspark.jar and upload it to /opt/spark/newprogram on the master.
Run the program:
    root@master:/opt/spark# ./run -cp newprogram/myspark.jar WordCount master@master:5050 newprogram/myspark.jar
Mesos automatically copies the JAR file to the execution nodes and then runs the job.
Memory consumption: 10 GB input file + 10 GB of flatMap output + 15 GB of intermediate map results (word, 1).
Where some of the memory is allocated is unclear.
Time consumed: 50 s (not sorted).
Hadoop wordcount elapsed time: 120 s to 140 s.
The results are unordered.
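For reference, the Hadoop wordcount run would have been along these lines; the examples jar ships with Hadoop 0.20.2, but the install directory and the input/output paths here are assumptions:

```
root@master:/opt/hadoop-0.20.2# bin/hadoop jar hadoop-0.20.2-examples.jar wordcount \
    /user/lijiexu/Wikipedia/TXT/enwiki-20110405.txt /user/output/wiki-wordcount
```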
Single Node:
Testing kmeans on Hadoop
Run kmeans in Mahout:

    root@master:/opt/mahout-distribution-0.6# bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job -Dmapred.reduce.tasks=36 -i /user/lijiexu/kmeans/Square-20GB.txt -o output -t1 3 -t2 1.5 -cd 0.8 -k 8 -x 6
Running (320 maps and 1 reduce):
Canopy Driver running buildClusters over input: output/data
Slave Resource Consumption
Completed jobs
| Jobid | Name | Map total | Reduce total | Time |
|---|---|---|---|---|
| job_201206050916_0029 | Input Driver running over input: /user/lijiexu/kmeans/Square-10GB.txt | 160 | 0 | 1 minute 2 seconds |
| job_201206050916_0030 | KMeans Driver running runIteration over clustersIn: output/clusters-0/part-randomSeed | 160 | 1 | 1 minute 6 seconds |
| job_201206050916_0031 | KMeans Driver running runIteration over clustersIn: output/clusters-1 | 160 | 1 | 1 minute 7 seconds |
| job_201206050916_0032 | KMeans Driver running runIteration over clustersIn: output/clusters-2 | 160 | 1 | 1 minute 7 seconds |
| job_201206050916_0033 | KMeans Driver running runIteration over clustersIn: output/clusters-3 | 160 | 1 | 1 minute 6 seconds |
| job_201206050916_0034 | KMeans Driver running runIteration over clustersIn: output/clusters-4 | 160 | 1 | 1 minute 6 seconds |
| job_201206050916_0035 | KMeans Driver running runIteration over clustersIn: output/clusters-5 | 160 | 1 | 1 minute 5 seconds |
| job_201206050916_0036 | KMeans Driver running clusterData over input: output/data | 160 | 0 | 55 seconds |
| job_201206050916_0037 | Input Driver running over input: /user/lijiexu/kmeans/Square-20GB.txt | 320 | 0 | 1 minute 31 seconds |
| job_201206050916_0038 | KMeans Driver running runIteration over clustersIn: output/clusters-0/part-randomSeed | 320 | 36 | 1 minute 46 seconds |
| job_201206050916_0039 | KMeans Driver running runIteration over clustersIn: output/clusters-1 | 320 | 36 | 1 minute 46 seconds |
| job_201206050916_0040 | KMeans Driver running runIteration over clustersIn: output/clusters-2 | 320 | 36 | 1 minute 46 seconds |
| job_201206050916_0041 | KMeans Driver running runIteration over clustersIn: output/clusters-3 | 320 | 36 | 1 minute 47 seconds |
| job_201206050916_0042 | KMeans Driver running clusterData over input: output/data | 320 | 0 | 1 minute 34 seconds |
Resource consumption when running kmeans multiple times on the 10 GB and 20 GB datasets:
Hadoop wordcount Test
Spark Interactive Operation
Go to /opt/spark on the master and run:

    MASTER=master@master:5050 ./spark-shell
This launches the Mesos version of Spark. At master:8080 you can see the framework:
Active frameworks:

| ID | User | Name | Running tasks | CPUs | Mem | Max Share | Connected |
|---|---|---|---|---|---|---|---|
| 201206050924-0-0018 | root | Spark Shell | 0 | 0 | 0.0 MB | 0.00 | 21:12:56 |
    scala> val file = sc.textFile("hdfs://master:9000/user/lijiexu/Wikipedia/TXT/enwiki-20110405.txt")
    scala> file.first
    scala> val words = file.map(_.split(' ')).filter(_.size < 100)   // gives RDD[Array[String]]
    scala> words.cache
    scala> words.filter(_.contains("Beijing")).count
    12/06/06 22:12:33 INFO SparkContext: Job finished in 10.862765819 s
    res1: Long = 855
    scala> words.filter(_.contains("Beijing")).count
    12/06/06 22:12:52 INFO SparkContext: Job finished in 0.71051464 s
    res2: Long = 855
    scala> words.filter(_.contains("Shanghai")).count
    12/06/06 22:13:23 INFO SparkContext: Job finished in 0.667734427 s
    res3: Long = 614
    scala> words.filter(_.contains("Guangzhou")).count
    12/06/06 22:13:42 INFO SparkContext: Job finished in 0.800617719 s
    res4: Long = 134
Due to GC problems, large datasets cannot be cached.