Hadoop vs Spark Performance Comparison

Based on Spark-0.4 and Hadoop-0.20.2

1. kmeans

Data: self-generated 3D points, centered around the eight vertices of a cube:

{0, 0, 0}, {0, 10, 0}, {0, 0, 10}, {0, 10, 10},

{10, 0, 0}, {10, 0, 10}, {10, 10, 0}, {10, 10, 10}

Point number: 189,918,082 (about 190 million 3D points)
Capacity: 10 GB
HDFS location: /user/lijiexu/kmeans/Square-10GB.txt

Program logic:

Read the blocks on HDFS into memory; the file becomes an RDD whose elements are vectors (one per point).

Then map over the RDD: for each vector (point), find the class it currently belongs to and emit the (K, V) pair (class, (point, 1)), forming a new RDD.

Before the reduce, combine within each partition, accumulating the per-class sum of points and the per-class count; each partition then emits at most K key-value pairs.

Finally, reduce to obtain a new RDD whose keys are the classes and whose values are the per-class sums and counts, then map once more to compute the new centers.
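The program run below is spark.examples.SparkKMeans. As a rough illustration of this per-iteration logic, here is a simplified sketch against the old spark 0.x Scala API; the object name KMeansSketch, the closestCenter helper, the plain Array[Double] vectors, the random initial centers and the fixed iteration count are assumptions of this sketch, not the actual example source:

import spark._
import SparkContext._

object KMeansSketch {
  // hypothetical helper: index of the nearest center (squared Euclidean distance)
  def closestCenter(p: Array[Double], centers: Array[Array[Double]]): Int =
    centers.indices.minBy { i =>
      centers(i).zip(p).map { case (c, x) => (c - x) * (c - x) }.sum
    }

  def main(args: Array[String]) {
    val sc = new SparkContext(args(0), "KMeansSketch")
    // parse the points once and cache them, so later iterations read from memory
    val points = sc.textFile(args(1)).map(_.split(' ').map(_.toDouble)).cache()
    var centers = Array.fill(8)(Array.fill(3)(scala.util.Random.nextDouble * 10))
    for (iter <- 1 to 10) {  // the real example loops until the centers stop moving
      val perClass = points
        .map(p => (closestCenter(p, centers), (p, 1)))                               // (class, (point, 1))
        .reduceByKey((a, b) => (a._1.zip(b._1).map(t => t._1 + t._2), a._2 + b._2))  // per-class sum and count, combined map-side
        .collect()
      // new center = per-class sum / count (empty classes are simply dropped here)
      centers = perClass.map { case (_, (sum, n)) => sum.map(_ / n) }
    }
    centers.foreach(c => println(c.mkString("(", ", ", ")")))
  }
}

In the run commands below, the trailing arguments 8 and 2.0 (or 0.8) passed to SparkKMeans are the number of clusters k and the convergence threshold: the example keeps iterating until the centers move less than that threshold, so a smaller threshold means more iterations.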

First upload the data to HDFS, then run on the master:

root@master:/opt/spark# ./run spark.examples.SparkKMeans master@master:5050 hdfs://master:9000/user/lijiexu/kmeans/Square-10GB.txt 8 2.0

The k-means algorithm is iterative.

A total of 160 tasks (160 × 64 MB = 10 GB).

32 CPU cores and GB memory are used.

The memory consumption is about 4.5 GB per machine (40 GB in total): the point data itself is 10 GB × 2, plus about 10 GB of intermediate (K, V) data after the map, i.e. (Int, (Vector, 1)).
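As a rough check on this breakdown: the cached points account for about 10 GB × 2 = 20 GB and the map-side (K, V) pairs for roughly another 10 GB, i.e. about 30 GB of the 40 GB observed; the remaining ~10 GB is presumably JVM object and framework overhead.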

Final result:

0.505246194 s

Final centers: Map (5-> (13.997101228817169, 9.208875044622895,-2.494072457488311), 8-> (-2.33522333047955, 9.128892414676326, 1.7923150585737604), 7-> (8.658031587043952, 2.162306996983008, 17.670646829079146 ), 3-> (11.530154433698268, 0.17834347219956842, 9.224352885937776), 4-> (12.722903153986868, 8.812883284216143, 0.6564509961064319), 1-> (6.458644369071984, 11.345681702383024, 7.041924994173552), 6-> (12.887793408866614, -1.5189406469928937, 9.526393664105957), 2-> (2.3345459304412164, 2.0173098597285533, 1.4772489989976143 ))

50 MB/s, 10 GB => 3.5 min

10 MB/s, 10 GB => 15 min
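As a rough sanity check of these estimates, assuming the first pass is bound by scan throughput: 10 GB ≈ 10,240 MB, so at 50 MB/s a full scan takes about 10,240 / 50 ≈ 205 s ≈ 3.4 minutes, and at 10 MB/s about 10,240 / 10 ≈ 1,024 s ≈ 17 minutes, roughly in line with the figures above.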

Test on 20 GB of data

Point number: 377,370,313 (about 380 million 3D points)
Capacity: 20 GB
HDFS location: /user/lijiexu/kmeans/Square-20GB.txt

Run the test command:

root@master:/opt/spark# ./run spark.examples.SparkKMeans master@master:5050 hdfs://master:9000/user/lijiexu/kmeans/Square-20GB.txt 8 2.0 | tee mylogs/sqaure-20GB-kmeans.log

Obtain the clustering result:

Final centers: Map (5-> (-0.47785701742763115,-1.5901830956323306,-0.18453046159033773), 8-> (1.1073911553593858, 9.051671594514225,-0.44722211311446924), 7-> (1.4960397239284795, 10.173412443492643, -1.7932911100570954), 3-> (-1.4771114031182642, 9.046878176063172,-2.4747981387714444), 4-> (-0.2796747780312184, 0.06910629855122015, 10.268115903887612), 1-> (10.467618592186486,-1.168580362309453, -1.0462842137817263), 6-> (0.7569895433952736, 0.8615441990490469, 9.552726007309518), 2-> (10.807948500515304,-0.5368803187391366, 0.04258123037074164 ))

The eight centers are basically recovered.

Memory consumption: about 5.8 GB per node, about 50 GB in total.

Memory analysis:

20 GB of raw data and 20 GB of map output

Iteration | Time
1 | 108 s
2 | 0.93 s

12/06/05 11:11:08 INFO spark.CacheTracker: Looking for RDD partition
12/06/05 11:11:08 INFO spark.CacheTracker: Found partition in cache!

Test on 20 GB of data (more iterations)

root@master:/opt/spark# ./run spark.examples.SparkKMeans master@master:5050 hdfs://master:9000/user/lijiexu/kmeans/Square-20GB.txt 8 0.8

Number of tasks: 320

Time:

Iteration | Time
1 | 100.9 s
2 | 0.93 s
3 | 4.6 s
4 | 3.9 s
5 | 3.9 s
6 | 3.9 s

Impact of the number of iterations on memory usage:

There is basically no impact; the main memory consumers are the 20 GB input-data RDD and about 20 GB of intermediate data.

Final centers: Map (5-> (-bytes-5, 3.17334874733142e-5,-2.0605806380414582e-4), 8-> (1.1841686358289191e-4, 10.000062966002101, 9.999933240005394 ), 7-> (9.999976672588097, 10.000199556926772,-2.0695123602840933e-4), 3-> (-1.3506815993198176e-4, 9.999948270638338, Percentile-5), 4-> (3.2493629851483764e-4, -7.892413981250518e-5, 10.00002515017671), 1-> (10.00004313126956, 7.431996896171192e-6, 7.590402882208648e-5), 6-> (9.999982611661382, 10.000144597573051, 10.000037734639696), 2-> (9.999958673426654, -1.1917651103354863e-4, 9.99990217533504 ))

Result Visualization

2. HdfsTest

Test logic:

package spark.examples

import spark._

object HdfsTest {
  def main(args: Array[String]) {
    val sc = new SparkContext(args(0), "HdfsTest")
    val file = sc.textFile(args(1))
    val mapped = file.map(s => s.length).cache()   // length of each line, cached in memory
    for (iter <- 1 to 10) {
      val start = System.currentTimeMillis()
      for (x <- mapped) { x + 2 }
      // println("processing: " + x)
      val end = System.currentTimeMillis()
      println("Iteration " + iter + " took " + (end - start) + " ms")
    }
  }
}

First, a text file is read from HDFS into file.

Then the length of each line is computed and cached in the in-memory RDD mapped.

Each iteration then reads every value in mapped, adds 2 to it, and the read-plus-add time is measured.

There is only a map, no reduce.
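Note that it is the inner loop that actually triggers each job: a Scala for-loop over an RDD desugars to foreach, which is an action, so every timed iteration forces one full pass over the cached mapped RDD. A minimal equivalent of the loop body, under the same old-Spark-API assumptions as above:

for (x <- mapped) { x + 2 }
// desugars to
mapped.foreach(x => x + 2)   // an action: scans every cached partition once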

Test on 10 GB of Wiki data

This tests the read performance of the cached RDD.

root@master:/opt/spark# ./run spark.examples.HdfsTest master@master:5050 hdfs://master:9000/user/lijiexu/Wikipedia/TXT/enwiki-20110405.txt

Test results:

Iteration 1 took 12900 ms (≈ 12.9 s)
Iteration 2 took 388 ms
Iteration 3 took 472 ms
Iteration 4 took 490 ms
Iteration 5 took 459 ms
Iteration 6 took 492 ms
Iteration 7 took 480 ms
Iteration 8 took 501 ms
Iteration 9 took 479 ms
Iteration 10 took 432 ms

Memory consumption is about 2.7 GB per node (9.4 GB × 3 in total).


Test on 90 GB of randomtext data

root@master:/opt/spark# ./run spark.examples.HdfsTest master@master:5050 hdfs://master:9000/user/lijiexu/randomtext90gb/randomtext90gb

Time consumed:

Iteration | Time consumed
1 | 111.905310882 s
2 | 4.681715228 s
3 | 4.469296148 s
4 | 4.441203887 s
5 | 1.999792125 s
6 | 2.151376037 s
7 | 1.889345699 s
8 | 1.847487668 s
9 | 1.827241743 s
10 | 1.747547323 s

The total memory consumption is about 30 GB.

Resource consumption of a single node:

3. WordCount test

Write Program:

import spark.SparkContext
import SparkContext._

object WordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: WordCount <master> <jar>")
      System.exit(1)
    }
    // the jar in args(1) is shipped by Mesos to the executor nodes
    val sp = new SparkContext(args(0), "WordCount", "/opt/spark", List(args(1)))
    val file = sp.textFile("hdfs://master:9000/user/lijiexu/Wikipedia/TXT/enwiki-20110405.txt")
    val counts = file.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://master:9000/user/output/wikiresult3")
  }
}

Package it into myspark.jar and upload it to /opt/spark/newprogram on the master.

Run the program:

root@master:/opt/spark# ./run -cp newprogram/myspark.jar WordCount master@master:5050 newprogram/myspark.jar

Mesos automatically copies the jar file to the executor nodes and then runs the program.

Memory consumption: 10 GB input file + 10 GB after flatMap + 15 GB of intermediate map results (word, 1).

It is unclear where some of the remaining memory is allocated.

Time consumed: 50 s (results not sorted)

Hadoop WordCount elapsed time: 120 to 140 s

Unordered results

Single Node:

k-means test on Hadoop

Run k-means in Mahout:

root@master:/opt/mahout-distribution-0.6# bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job -Dmapred.reduce.tasks=36 -i /user/lijiexu/kmeans/Square-20GB.txt -o output -t1 3 -t2 1.5 -cd 0.8 -k 8 -x 6

Running (320 maps and 1 reduce)

Canopy Driver running buildClusters over input: output/data

Slave Resource Consumption

Completed jobs:

JobId | Name | Map total | Reduce total | Time
Job_201206050916_0029 | Input Driver running over input: /user/lijiexu/kmeans/Square-10GB.txt | 160 | 0 | 1 minute 2 seconds
Job_201206050916_0030 | KMeans Driver running runIteration over clustersIn: output/clusters-0/part-randomSeed | 160 | 1 | 1 minute 6 seconds
Job_201206050916_0031 | KMeans Driver running runIteration over clustersIn: output/clusters-1 | 160 | 1 | 1 minute 7 seconds
Job_201206050916_0032 | KMeans Driver running runIteration over clustersIn: output/clusters-2 | 160 | 1 | 1 minute 7 seconds
Job_201206050916_0033 | KMeans Driver running runIteration over clustersIn: output/clusters-3 | 160 | 1 | 1 minute 6 seconds
Job_201206050916_0034 | KMeans Driver running runIteration over clustersIn: output/clusters-4 | 160 | 1 | 1 minute 6 seconds
Job_201206050916_0035 | KMeans Driver running runIteration over clustersIn: output/clusters-5 | 160 | 1 | 1 minute 5 seconds
Job_201206050916_0036 | KMeans Driver running clusterData over input: output/data | 160 | 0 | 55 seconds
Job_201206050916_0037 | Input Driver running over input: /user/lijiexu/kmeans/Square-20GB.txt | 320 | 0 | 1 minute 31 seconds
Job_201206050916_0038 | KMeans Driver running runIteration over clustersIn: output/clusters-0/part-randomSeed | 320 | 36 | 1 minute 46 seconds
Job_201206050916_0039 | KMeans Driver running runIteration over clustersIn: output/clusters-1 | 320 | 36 | 1 minute 46 seconds
Job_201206050916_0040 | KMeans Driver running runIteration over clustersIn: output/clusters-2 | 320 | 36 | 1 minute 46 seconds
Job_201206050916_0041 | KMeans Driver running runIteration over clustersIn: output/clusters-3 | 320 | 36 | 1 minute 47 seconds
Job_201206050916_0042 | KMeans Driver running clusterData over input: output/data | 320 | 0 | 1 minute 34 seconds

Resource consumption when running k-means on the 10 GB and 20 GB datasets several times:

Hadoop WordCount test

Spark Interactive Operation

Go to /opt/spark on the master and run:

MASTER=master@master:5050 ./spark-shell

This starts the Spark shell against the Mesos master.

On master:8080 you can see the framework:

Active frameworks:

ID | User | Name | Running tasks | CPUs | Mem | Max share | Connected
201206050924-0-0018 | root | Spark Shell | 0 | 0 | 0.0 MB | 0.00 | 21:12:56

scala> val file = sc.textFile("hdfs://master:9000/user/lijiexu/Wikipedia/TXT/enwiki-20110405.txt")

scala> file.first

scala> val words = file.map(_.split(' ')).filter(_.size < 100)   // obtain RDD[Array[String]]

scala> words.cache

scala> words.filter(_.contains("Beijing")).count
12/06/06 22:12:33 INFO SparkContext: Job finished in 10.862765819 s
res1: Long = 855

scala> words.filter(_.contains("Beijing")).count
12/06/06 22:12:52 INFO SparkContext: Job finished in 0.71051464 s
res2: Long = 855

scala> words.filter(_.contains("Shanghai")).count
12/06/06 22:13:23 INFO SparkContext: Job finished in 0.667734427 s
res3: Long = 614

scala> words.filter(_.contains("Guangzhou")).count
12/06/06 22:13:42 INFO SparkContext: Job finished in 0.800617719 s
res4: Long = 134

Due to GC problems, large datasets cannot be cached.
