- Based on Spark-0.4 and Hadoop-0.20.2
1. kmeans
Data: self-generated 3D points, clustered around the eight vertices of a cube with side length 10:
{0, 0, 0}, {0, 10, 0}, {0, 0, 10}, {0, 10, 10},
{10, 0, 0}, {10, 0, 10}, {10, 10, 0}, {10, 10, 10}
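The generation code is not shown in the original; below is a minimal sketch of how such a dataset could be produced. The Gaussian noise around each vertex, the random seed, and the output file name are assumptions, not the original generator:

```scala
import java.io.PrintWriter
import java.util.Random

// Hypothetical generator: pick a random cube vertex, add unit Gaussian noise per axis.
object GenCubeData {
  def main(args: Array[String]) {
    val vertices = for (x <- Seq(0.0, 10.0); y <- Seq(0.0, 10.0); z <- Seq(0.0, 10.0))
                   yield Array(x, y, z)
    val rand = new Random(42)                       // seed is an arbitrary choice
    val out = new PrintWriter("Square-10GB.txt")    // assumed output name
    for (_ <- 0L until 189918082L) {                // point count from the table below
      val v = vertices(rand.nextInt(vertices.size))
      out.println(v.map(_ + rand.nextGaussian()).mkString(" "))
    }
    out.close()
  }
}
```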
Point number: 189,918,082 (≈190 million 3D points)
Capacity: 10 GB
HDFS location: /user/lijiexu/kmeans/Square-10GB.txt
Program logic:
Read the blocks on HDFS into memory; each block becomes an RDD whose records are vectors (points). Map over the RDD, computing the class number of each point and emitting the (K, V) pair (class, (point, 1)) to form a new RDD. Before the reduce, combine within each partition, accumulating the per-class sum of points and the per-class count, so each partition emits at most K key-value pairs. Finally, reduce to obtain a new RDD whose key is the class and whose value is (sum, count), then map to get the final centers.
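A minimal sketch of one such iteration in Scala, modeled on the description above. The names `points` and `centers` and both helper functions are illustrative, and the usual `import SparkContext._` implicits are assumed; this is not the original SparkKMeans source:

```scala
// points: RDD[Array[Double]] parsed from the HDFS text file
// centers: Array[Array[Double]], the current K centers
def squaredDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def closestCenter(p: Array[Double], centers: Array[Array[Double]]): Int =
  centers.indices.minBy(i => squaredDist(p, centers(i)))

// Map: emit (class, (point, 1)). reduceByKey combines map-side first,
// so each partition is intended to send at most K pairs over the network.
val stats = points.map(p => (closestCenter(p, centers), (p, 1)))
  .reduceByKey { case ((s1, c1), (s2, c2)) =>
    (s1.zip(s2).map { case (x, y) => x + y }, c1 + c2)
  }

// New center = per-class sum / count.
val newCenters = stats.map { case (k, (sum, cnt)) => (k, sum.map(_ / cnt)) }
  .collect()
  .toMap
```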
First upload the data to HDFS, then run it on the master.
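The upload can be done with the standard HDFS put command; the local source path and Hadoop install directory shown here are assumptions:

```
root@master:/opt/hadoop-0.20.2# bin/hadoop fs -put /tmp/Square-10GB.txt /user/lijiexu/kmeans/Square-10GB.txt
```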
Run on the master:

    root@master:/opt/spark# ./run spark.examples.SparkKMeans master@master:5050 hdfs://master:9000/user/lijiexu/kmeans/Square-10GB.txt 8 2.0
The kmeans algorithm is iterative.
There are 160 tasks in total (160 × 64 MB = 10 GB).
32 CPU cores are used in total.
Memory consumption is about 4.5 GB per machine (40 GB in total): the point data itself takes 10 GB × 2, plus about 10 GB of intermediate (K, V) data after the map, i.e. (Int, (Vector, 1)).
Final result:
0.505246194 s
Final centers: Map(5 -> (13.997101228817169, 9.208875044622895, -2.494072457488311), 8 -> (-2.33522333047955, 9.128892414676326, 1.7923150585737604), 7 -> (8.658031587043952, 2.162306996983008, 17.670646829079146), 3 -> (11.530154433698268, 0.17834347219956842, 9.224352885937776), 4 -> (12.722903153986868, 8.812883284216143, 0.6564509961064319), 1 -> (6.458644369071984, 11.345681702383024, 7.041924994173552), 6 -> (12.887793408866614, -1.5189406469928937, 9.526393664105957), 2 -> (2.3345459304412164, 2.0173098597285533, 1.4772489989976143))
At 50 MB/s, scanning 10 GB takes about 3.5 min (10,240 MB ÷ 50 MB/s ≈ 205 s).
At 10 MB/s, scanning 10 GB takes about 15 min.
Test on 20 GB of data
Point number: 377,370,313 (≈380 million 3D points)
Capacity: 20 GB
HDFS location: /user/lijiexu/kmeans/Square-20GB.txt
Run the test command:
    root@master:/opt/spark# ./run spark.examples.SparkKMeans master@master:5050 hdfs://master:9000/user/lijiexu/kmeans/Square-20GB.txt 8 2.0 | tee mylogs/sqaure-20GB-kmeans.log
Clustering result:
Final centers: Map(5 -> (-0.47785701742763115, -1.5901830956323306, -0.18453046159033773), 8 -> (1.1073911553593858, 9.051671594514225, -0.44722211311446924), 7 -> (1.4960397239284795, 10.173412443492643, -1.7932911100570954), 3 -> (-1.4771114031182642, 9.046878176063172, -2.4747981387714444), 4 -> (-0.2796747780312184, 0.06910629855122015, 10.268115903887612), 1 -> (10.467618592186486, -1.168580362309453, -1.0462842137817263), 6 -> (0.7569895433952736, 0.8615441990490469, 9.552726007309518), 2 -> (10.807948500515304, -0.5368803187391366, 0.04258123037074164))
Basically eight centers are obtained.
Memory consumption: about 5.8 GB per node, about 50 GB in total.
Memory analysis:
20 GB of raw data and 20 GB of map output
| Iteration | Time   |
|-----------|--------|
| 1         | 108 s  |
| 2         | 0.93 s |
12/06/05 11:11:08 INFO spark.CacheTracker: Looking for RDD partition
12/06/05 11:11:08 INFO spark.CacheTracker: Found partition in cache!
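These log lines explain why the second iteration drops to under a second: after the first pass, the parsed points are served from the in-memory cache. A minimal sketch of the pattern (the inline parsing lambda stands in for the example's parse helper):

```scala
// Cache the parsed points: the first action pays the full HDFS scan + parse cost;
// every later iteration reads the partitions straight from memory.
val points = sc.textFile("hdfs://master:9000/user/lijiexu/kmeans/Square-20GB.txt")
               .map(line => line.split(" ").map(_.toDouble))
               .cache()
```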
Test on 20 GB of data (more iterations)
    root@master:/opt/spark# ./run spark.examples.SparkKMeans master@master:5050 hdfs://master:9000/user/lijiexu/kmeans/Square-20GB.txt 8 0.8
Number of tasks: 320
Time:
| Iteration | Time    |
|-----------|---------|
| 1         | 100.9 s |
| 2         | 0.93 s  |
| 3         | 4.6 s   |
| 4         | 3.9 s   |
| 5         | 3.9 s   |
| 6         | 3.9 s   |
Impact of the number of iterations on memory capacity:
Basically none; the main memory consumption is the 20 GB input-data RDD plus 20 GB of intermediate data.
Final centers: Map(5 -> (…e-5, 3.17334874733142e-5, -2.0605806380414582e-4), 8 -> (1.1841686358289191e-4, 10.000062966002101, 9.999933240005394), 7 -> (9.999976672588097, 10.000199556926772, -2.0695123602840933e-4), 3 -> (-1.3506815993198176e-4, 9.999948270638338, …e-5), 4 -> (3.2493629851483764e-4, -7.892413981250518e-5, 10.00002515017671), 1 -> (10.00004313126956, 7.431996896171192e-6, 7.590402882208648e-5), 6 -> (9.999982611661382, 10.000144597573051, 10.000037734639696), 2 -> (9.999958673426654, -1.1917651103354863e-4, 9.99990217533504))
Result Visualization
2. hdfstest
Test logic:

    package spark.examples

    import spark._

    object HdfsTest {
      def main(args: Array[String]) {
        val sc = new SparkContext(args(0), "HdfsTest")
        val file = sc.textFile(args(1))
        val mapped = file.map(s => s.length).cache()   // cache the per-line lengths in memory
        for (iter <- 1 to 10) {
          val start = System.currentTimeMillis()
          for (x <- mapped) { x + 2 }                  // force a full pass over the RDD
          val end = System.currentTimeMillis()
          println("Iteration " + iter + " took " + (end - start) + " ms")
        }
      }
    }
First, a text file is read from HDFS into `file`.
Then the length of each line is computed and saved in the in-memory RDD `mapped`.
Each iteration then reads every length in `mapped`, adds 2 to it, and measures the read + add time.
There is only a map, no reduce.
Test on the 10 GB Wikipedia data
This tests the read performance of the RDD.

    root@master:/opt/spark# ./run spark.examples.HdfsTest master@master:5050 hdfs://master:9000/user/lijiexu/Wikipedia/TXT/enwiki-20110405.txt
Test results:
Iteration 1 took 12900 ms (≈13 s)
Iteration 2 took 388 ms
Iteration 3 took 472 ms
Iteration 4 took 490 ms
Iteration 5 took 459 ms
Iteration 6 took 492 ms
Iteration 7 took 480 ms
Iteration 8 took 501 ms
Iteration 9 took 479 ms
Iteration 10 took 432 ms
Memory consumption is about 2.7 GB per node (9.4 GB × 3 in total).
Test on 90 GB of RandomText data

    root@master:/opt/spark# ./run spark.examples.HdfsTest master@master:5050 hdfs://master:9000/user/lijiexu/randomtext90gb/randomtext90gb
Time consumed:
| Iteration | Time consumed   |
|-----------|-----------------|
| 1         | 111.905310882 s |
| 2         | 4.681715228 s   |
| 3         | 4.469296148 s   |
| 4         | 4.441203887 s   |
| 5         | 1.999792125 s   |
| 6         | 2.151376037 s   |
| 7         | 1.889345699 s   |
| 8         | 1.847487668 s   |
| 9         | 1.827241743 s   |
| 10        | 1.747547323 s   |
The total memory consumption is about 30 GB.
Resource consumption of a single node:
3. Test wordcount
Write the program:

    import spark.SparkContext
    import SparkContext._

    object WordCount {
      def main(args: Array[String]) {
        if (args.length < 2) {
          System.err.println("Usage: WordCount <master> <jar>")
          System.exit(1)
        }
        val sp = new SparkContext(args(0), "WordCount", "/opt/spark", List(args(1)))
        val file = sp.textFile("hdfs://master:9000/user/lijiexu/Wikipedia/TXT/enwiki-20110405.txt")
        val counts = file.flatMap(line => line.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _)
        counts.saveAsTextFile("hdfs://master:9000/user/output/wikiresult3")
      }
    }
Package it into myspark.jar and upload it to /opt/spark/newprogram on the master.
Run the program:
    root@master:/opt/spark# ./run -cp newprogram/myspark.jar WordCount master@master:5050 newprogram/myspark.jar
Mesos automatically copies the JAR file to the execution nodes and then runs the job.
Memory consumption: 10 GB input file + 10 GB of flatMap output + 15 GB of intermediate map results (word, 1).
Where some of the memory is allocated is unclear.
Time consumed: 50 s (not sorted).
Hadoop wordcount elapsed time: 120 s to 140 s.
The results are unordered.
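For reference, the Hadoop wordcount run would have been along these lines; the examples jar ships with Hadoop 0.20.2, but the install directory and the input/output paths here are assumptions:

```
root@master:/opt/hadoop-0.20.2# bin/hadoop jar hadoop-0.20.2-examples.jar wordcount \
    /user/lijiexu/Wikipedia/TXT/enwiki-20110405.txt /user/output/wiki-wordcount
```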
Single Node:
Testing kmeans on Hadoop
Run kmeans in Mahout:

    root@master:/opt/mahout-distribution-0.6# bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job -Dmapred.reduce.tasks=36 -i /user/lijiexu/kmeans/Square-20GB.txt -o output -t1 3 -t2 1.5 -cd 0.8 -k 8 -x 6
Running (320 maps and 1 reduce):
Canopy Driver running buildClusters over input: output/data
Slave Resource Consumption
Completed jobs
| Jobid | Name | Map total | Reduce total | Time |
|---|---|---|---|---|
| job_201206050916_0029 | Input Driver running over input: /user/lijiexu/kmeans/Square-10GB.txt | 160 | 0 | 1 minute 2 seconds |
| job_201206050916_0030 | KMeans Driver running runIteration over clustersIn: output/clusters-0/part-randomSeed | 160 | 1 | 1 minute 6 seconds |
| job_201206050916_0031 | KMeans Driver running runIteration over clustersIn: output/clusters-1 | 160 | 1 | 1 minute 7 seconds |
| job_201206050916_0032 | KMeans Driver running runIteration over clustersIn: output/clusters-2 | 160 | 1 | 1 minute 7 seconds |
| job_201206050916_0033 | KMeans Driver running runIteration over clustersIn: output/clusters-3 | 160 | 1 | 1 minute 6 seconds |
| job_201206050916_0034 | KMeans Driver running runIteration over clustersIn: output/clusters-4 | 160 | 1 | 1 minute 6 seconds |
| job_201206050916_0035 | KMeans Driver running runIteration over clustersIn: output/clusters-5 | 160 | 1 | 1 minute 5 seconds |
| job_201206050916_0036 | KMeans Driver running clusterData over input: output/data | 160 | 0 | 55 seconds |
| job_201206050916_0037 | Input Driver running over input: /user/lijiexu/kmeans/Square-20GB.txt | 320 | 0 | 1 minute 31 seconds |
| job_201206050916_0038 | KMeans Driver running runIteration over clustersIn: output/clusters-0/part-randomSeed | 320 | 36 | 1 minute 46 seconds |
| job_201206050916_0039 | KMeans Driver running runIteration over clustersIn: output/clusters-1 | 320 | 36 | 1 minute 46 seconds |
| job_201206050916_0040 | KMeans Driver running runIteration over clustersIn: output/clusters-2 | 320 | 36 | 1 minute 46 seconds |
| job_201206050916_0041 | KMeans Driver running runIteration over clustersIn: output/clusters-3 | 320 | 36 | 1 minute 47 seconds |
| job_201206050916_0042 | KMeans Driver running clusterData over input: output/data | 320 | 0 | 1 minute 34 seconds |
Resource consumption when running kmeans multiple times on the 10 GB and 20 GB datasets:
Hadoop wordcount Test
Spark Interactive Operation
Go to /opt/spark on the master and run:

    MASTER=master@master:5050 ./spark-shell
This launches the Mesos version of Spark. At master:8080 you can see the framework:
Active frameworks:

| ID | User | Name | Running tasks | CPUs | Mem | Max Share | Connected |
|---|---|---|---|---|---|---|---|
| 201206050924-0-0018 | root | Spark Shell | 0 | 0 | 0.0 MB | 0.00 | 21:12:56 |
    scala> val file = sc.textFile("hdfs://master:9000/user/lijiexu/Wikipedia/TXT/enwiki-20110405.txt")
    scala> file.first
    scala> val words = file.map(_.split(' ')).filter(_.size < 100)   // gives RDD[Array[String]]
    scala> words.cache
    scala> words.filter(_.contains("Beijing")).count
    12/06/06 22:12:33 INFO SparkContext: Job finished in 10.862765819 s
    res1: Long = 855
    scala> words.filter(_.contains("Beijing")).count
    12/06/06 22:12:52 INFO SparkContext: Job finished in 0.71051464 s
    res2: Long = 855
    scala> words.filter(_.contains("Shanghai")).count
    12/06/06 22:13:23 INFO SparkContext: Job finished in 0.667734427 s
    res3: Long = 614
    scala> words.filter(_.contains("Guangzhou")).count
    12/06/06 22:13:42 INFO SparkContext: Job finished in 0.800617719 s
    res4: Long = 134
Due to GC problems, large datasets cannot be cached.