1. KMeans
Data: self-generated three-dimensional points, scattered around the 8 vertices of a cube:
{0, 0, 0}, {0, 10, 0}, {0, 0, 10}, {0, 10, 10},
{10, 0, 0}, {10, 0, 10}, {10, 10, 0}, {10, 10, 10}
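The generator itself is not shown in these notes; below is a minimal sketch of how such data could be produced (hypothetical names; Gaussian noise around each cube vertex is an assumption):

import java.io.PrintWriter
import scala.util.Random

// Hypothetical generator: draws points around the 8 cube vertices.
object GenerateCubePoints {
  def main(args: Array[String]): Unit = {
    val vertices = for (x <- Seq(0, 10); y <- Seq(0, 10); z <- Seq(0, 10))
      yield (x, y, z)
    val rand = new Random(42)
    val out = new PrintWriter("square-10gb.txt")
    for (_ <- 1L to 1000000L) {           // raise the count to reach ~10GB
      val (x, y, z) = vertices(rand.nextInt(vertices.size))
      // one point = chosen vertex + Gaussian noise in each dimension
      out.println(s"${x + rand.nextGaussian()} ${y + rand.nextGaussian()} ${z + rand.nextGaussian()}")
    }
    out.close()
  }
}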
Point number: 189,918,082 (about 190 million three-dimensional points)
Capacity: 10GB
HDFS location: /user/lijiexu/kmeans/square-10gb.txt
Program logic:
Read the blocks from HDFS into memory; each block becomes a partition of an RDD containing the vectors.
A map operation is then performed on the RDD: for each vector (point), the number of its closest class is computed, and a (K, V) pair of the form (class, (point, 1)) is emitted, forming a new RDD.
Before the reduce, a combine is performed within each partition, accumulating the partial center (sum and count) of each class, so that each partition outputs at most k (K, V) pairs.
Finally, the reduce merges these pairs into a new RDD whose key is the class and whose value is the new center of that class.
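A minimal sketch of this map/combine/reduce step, in the style of the spark.examples.SparkKMeans example (closestCenter and the Vector arithmetic are assumed helpers, not the exact source):

// points: RDD of vectors cached in memory; centers: current array of k centers.
// Map: emit (class, (point, 1)) for each point.
val closest = points.map(p => (closestCenter(p, centers), (p, 1)))
// reduceByKey combines inside each partition first, so at most k pairs
// leave a partition; the final reduce yields one (sum, count) per class.
val stats = closest.reduceByKey { case ((p1, c1), (p2, c2)) => (p1 + p2, c1 + c2) }
// New center of a class = sum of its points / number of its points.
val newCenters = stats.map { case (cls, (sum, count)) => (cls, sum / count) }.collectAsMap()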
Upload the data to HDFS first, then run on the master (the last two arguments are k = 8 and the convergence threshold 2.0):
root@master:/opt/spark# ./run spark.examples.SparkKMeans master@master:5050 hdfs://master:9000/user/lijiexu/kmeans/square-10gb.txt 8 2.0
The KMeans algorithm is executed iteratively.
Altogether 160 tasks (160 × 64MB = 10GB).
Uses 32 CPU cores and 18.9GB of memory.
Memory consumption is about 4.5GB per machine (40GB in total): the point data (10GB × 2) plus the map output (K, V) => (Int, (Vector, 1)) (approximately 10GB).
Final results:
0.505246194 s
Final centers: Map(5 -> (13.997101228817169, 9.208875044622895, -2.494072457488311), 8 -> (-2.33522333047955, 9.128892414676326, 1.7923150585737604), 7 -> (8.658031587043952, 2.162306996983008, 17.670646829079146), 3 -> (11.53054433698268, 0.17834347219956842, 9.224352885937776), 4 -> (12.722903153986868, 8.812883284216143, 0.6564509961064319), 1 -> (6.458644369071984, 11.345681702383024, 7.041924994173552), 6 -> (12.887793408866614, -1.5189406469928937, 9.526393664105957), 2 -> (2.3345459304412164, 2.0173098597285533, 1.4772489989976143))
Estimated time to scan the 10GB input at different read throughputs:
50MB/s: 10GB => 3.5 min
10MB/s: 10GB => 15 min
Test on 20GB data
Point number: 377,370,313 (about 377 million three-dimensional points)
Capacity: 20GB
HDFS location: /user/lijiexu/kmeans/square-20gb.txt
To run the test command:
root@master:/opt/spark# ./run spark.examples.SparkKMeans master@master:5050 hdfs://master:9000/user/lijiexu/kmeans/square-20gb.txt 8 2.0 | tee mylogs/square-20gb-kmeans.log
Clustering result:
Final centers: Map(5 -> (-0.47785701742763115, -1.5901830956323306, -0.18453046159033773), 8 -> (1.1073911553593858, 9.051671594514225, -0.44722211311446924), 7 -> (1.4960397239284795, 10.173412443492643, -1.7932911100570954), 3 -> (-1.4771114031182642, 9.046878176063172, -2.4747981387714444), 4 -> (-0.2796747780312184, 0.06910629855122015, 10.268115903887612), 1 -> (10.467618592186486, -1.168580362309453, -1.0462842137817263), 6 -> (0.7569895433952736, 0.8615441990490469, 9.552726007309518), 2 -> (10.807948500515304, -0.5368803187391366, 0.04258123037074164))
Basically, 8 centers are obtained.
Memory consumption: approximately 5.8GB per node, about 50GB in total.
Memory Analysis:
20GB raw data, 20GB map output
Number of iterations    Time
1                       108 s
2                       0.93 s
12/06/05 11:11:08 INFO spark.CacheTracker: Looking for RDD partition 2:302
12/06/05 11:11:08 INFO spark.CacheTracker: Found partition in cache!
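The drop from 108 s in iteration 1 to under a second afterwards comes from RDD caching: the first iteration reads and parses the 20GB from HDFS, while later iterations are served from the in-memory partitions, which is exactly what the CacheTracker lines above show. A minimal sketch of the pattern (path as above; the parsing step is an assumption):

// Parse once and cache; only the first action pays the HDFS read cost.
val points = sc.textFile("hdfs://master:9000/user/lijiexu/kmeans/square-20gb.txt")
  .map(line => line.split(' ').map(_.toDouble))
  .cache()

points.count()  // iteration 1: reads 20GB from HDFS and fills the cache
points.count()  // iteration 2: hits cached partitions ("Found partition in cache!")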