RDDs are well suited to batch analytics over parallel data, including data mining, machine learning, and graph algorithms, because such programs apply the same operation to many records. RDDs are not suitable for applications that asynchronously update shared state, such as a parallel web crawler. The goal, therefore, is to provide an effective programming model for most analytical applications, while leaving other kinds of applications to specialized systems.
For details about RDDs, see:
Resilient Distributed Datasets: Memory-Based Cluster Computing (2): Resilient Distributed Datasets (RDD)
Hardware environment:
Three machines, each with a dual-core Intel(R) Xeon(R) E5440 CPU @ 2.83 GHz and 4 GB of memory
Operating System:
Red Hat Enterprise Linux Server release 5.7 (Tikanga)
Spark Configuration:
Three nodes with 2 GB of memory each; 14 dimensions, 100 categories, 10 iterations. Sample files of different sizes are used for the analysis.
Conclusion 1: A three-node threshold of 0.8 is defined on the ratio data size / (2048 × 3), i.e. data size relative to total cluster memory. While the data stays within the threshold, throughput increases monotonically. Once the threshold is exceeded, performance drops sharply: 2% past the threshold, throughput falls by 53.11937%; 34.01326% past it, throughput falls by 70.80896%.
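The threshold rule from Conclusion 1 can be sketched as a quick check in Python. This is a minimal illustration, assuming 2048 MB of usable memory per node; the function and variable names are my own, not part of any test harness:

```python
def memory_ratio(data_mb: float, mem_mb: float = 2048, nodes: int = 3) -> float:
    """Data size relative to total cluster memory
    (the data/memory/node-count column in the table below)."""
    return data_mb / (mem_mb * nodes)

def within_threshold(data_mb: float, threshold: float = 0.8,
                     mem_mb: float = 2048, nodes: int = 3) -> bool:
    """True while the data fits under the threshold, i.e. while
    throughput is expected to keep increasing monotonically."""
    return memory_ratio(data_mb, mem_mb, nodes) <= threshold

# 1,536 MB sits well inside the threshold; 5,066.68 MB is past it.
print(within_threshold(1536.00))   # True
print(within_threshold(5066.68))   # False
```

Ratios computed this way match the data/memory/node-count column of the test data below (e.g. 5,120.01 MB gives about 0.833335).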
Test data:
| Serial number | Data file size (MB) | Number of records | Time consumed (s) | Data file / time (MB/s) | Data / memory | Data / memory / node count |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 33.33 | 147,106 | 10 | 3.333344 | 0.016274 | 0.005425 |
| 1 | 100.00 | 441,319 | 13 | 7.692317 | 0.048828 | 0.016276 |
| 2 | 166.67 | 735,533 | 15 | 11.11118 | 0.081382 | 0.027127 |
| 3 | 233.33 | 1,029,746 | 20 | 11.66652 | 0.113931 | 0.037977 |
| 4 | 341.33 | 1,506,371 | 23 | 14.8406 | 0.166665 | 0.055555 |
| 5 | 512.00 | 2,259,557 | 30 | 17.06666 | 0.25 | 0.083333 |
| 6 | 682.67 | 3,012,743 | 42 | 16.25402 | 0.333335 | 0.111112 |
| 7 | 853.33 | 3,765,929 | 45 | 18.96291 | 0.416665 | 0.138888 |
| 8 | 1,024.00 | 4,519,115 | 57 | 17.96494 | 0.5 | 0.166667 |
| 9 | 1,194.67 | 5,272,301 | 65 | 18.37953 | 0.583335 | 0.194445 |
| 10 | 1,365.33 | 6,025,487 | 73 | 18.70316 | 0.666665 | 0.222222 |
| 11 | 1,536.00 | 6,778,673 | 80 | 19.20001 | 0.75 | 0.25 |
| 12 | 1,706.67 | 7,531,859 | 95 | 17.96491 | 0.833335 | 0.277778 |
| 13 | 1,877.33 | 8,285,044 | 147 | 12.77097 | 0.916665 | 0.305555 |
| 14 | 2,048.00 | 9,038,230 | 104 | 19.6923 | 1 | 0.333333 |
| 15 | 2,218.66 | 9,791,416 | 113 | 19.63417 | 1.08333 | 0.36111 |
| 16 | 2,389.33 | 10,544,602 | 124 | 19.26881 | 1.166665 | 0.388888 |
| 17 | 2,560.01 | 11,297,788 | 175 | 14.62861 | 1.250005 | 0.416668 |
| 18 | 2,730.66 | 12,050,974 | 184 | 14.84056 | 1.33333 | 0.444443 |
| 19 | 2,901.34 | 12,804,160 | 164 | 17.69109 | 1.41667 | 0.472223 |
| 20 | 3,072.00 | 13,557,346 | 155 | 19.81934 | 1.5 | 0.5 |
| 21 | 3,242.67 | 14,310,532 | 162 | 20.01647 | 1.583335 | 0.527778 |
| 22 | 3,413.34 | 15,063,718 | 166 | 20.56231 | 1.66667 | 0.555557 |
| 23 | 3,754.68 | 16,570,089 | 179 | 20.97585 | 1.83334 | 0.611113 |
| 24 | 4,266.68 | 18,829,646 | 189 | 22.57501 | 2.08334 | 0.694447 |
| 25 | 4,500.01 | 19,859,392 | 209 | 21.53114 | 2.197271 | 0.732424 |
| 26 | 4,666.68 | 20,594,925 | 202 | 23.10235 | 2.278652 | 0.759551 |
| 27 | 4,766.68 | 21,036,244 | 202 | 23.5974 | 2.32748 | 0.775827 |
| 28 | 4,866.68 | 21,477,563 | 226 | 21.53396 | 2.376309 | 0.792103 |
| 29 | 4,966.68 | 21,918,882 | 220 | 22.5758 | 2.425137 | 0.808379 |
| 30 | 5,066.68 | 22,360,201 | 458 | 11.06261 | 2.473965 | 0.824655 |
| 31 | 5,120.01 | 22,595,577 | 463 | 11.05834 | 2.500005 | 0.833335 |
| 32 | 6,656.01 | 29,374,250 | 1010 | 6.59011 | 3.250005 | 1.083335 |
Performance trend chart:
Spark Configuration:
One node, 2 GB memory, 14 dimensions, 100 categories, 10 iterations.
Conclusion 2: A single-node threshold of 0.9 is defined on the ratio data size / 2048, i.e. data size relative to node memory. While the data stays within the threshold, throughput increases monotonically. Once the threshold is exceeded, performance drops sharply: 8.3334961% past the threshold, throughput falls by 57.61797318%; 18.18167291% past it, by 66.4701143%; and 36.36441116% past it, by 94.14757913%.
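The drop percentages above appear to be measured against the last throughput reading inside the threshold (row 13 of the single-node table, 8.41853 MB/s). The sketch below reproduces them under that assumption; the baseline choice and the function name are mine, not stated in the original test:

```python
# Baseline: throughput of the last in-threshold row of the single-node
# table (row 13, 1,877.33 MB processed at 8.41853 MB/s).
BASELINE_MB_S = 8.41853

def drop_pct(throughput_mb_s: float, baseline_mb_s: float = BASELINE_MB_S) -> float:
    """Percentage throughput loss relative to the in-threshold baseline."""
    return (1 - throughput_mb_s / baseline_mb_s) * 100

print(round(drop_pct(3.567944), 5))  # row 14 -> ~57.61797
print(round(drop_pct(2.822724), 5))  # row 15 -> ~66.47011
print(round(drop_pct(0.492688), 5))  # row 17 -> ~94.14758
```

The three printed values match the 57.6%, 66.5%, and 94.1% drops quoted in Conclusion 2.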
| Serial number | Data file size (MB) | Number of records | Time consumed (s) | Data file / time (MB/s) | Data / memory |
| --- | --- | --- | --- | --- | --- |
| 0 | 33.33 | 147,106 | 10 | 3.333344 | 0.016274 |
| 1 | 100.00 | 441,319 | 20 | 5.000006 | 0.048828 |
| 2 | 166.67 | 735,533 | 27 | 6.17288 | 0.081382 |
| 3 | 233.33 | 1,029,746 | 34 | 6.862657 | 0.113931 |
| 4 | 341.33 | 1,506,371 | 45 | 7.585197 | 0.166665 |
| 5 | 512.00 | 2,259,557 | 64 | 7.999997 | 0.25 |
| 6 | 682.67 | 3,012,743 | 85 | 8.031401 | 0.333335 |
| 7 | 853.33 | 3,765,929 | 102 | 8.365989 | 0.416665 |
| 8 | 1,024.00 | 4,519,115 | 118 | 8.67798 | 0.5 |
| 9 | 1,194.67 | 5,272,301 | 137 | 8.720216 | 0.583335 |
| 10 | 1,365.33 | 6,025,487 | 153 | 8.923729 | 0.666665 |
| 11 | 1,536.00 | 6,778,673 | 176 | 8.727279 | 0.75 |
| 12 | 1,706.67 | 7,531,859 | 193 | 8.84283 | 0.833335 |
| 13 | 1,877.33 | 8,285,044 | 223 | 8.41853 | 0.916665 |
| 14 | 2,048.00 | 9,038,230 | 574 | 3.567944 | 1 |
| 15 | 2,218.66 | 9,791,416 | 786 | 2.822724 | 1.08333 |
| 16 | 2,389.33 | 10,544,602 | 1134 | 2.106995 | 1.166665 |
| 17 | 2,560.01 | 11,297,788 | 5196 | 0.492688 | 1.250005 |
Performance trend chart:
Summary: with these thresholds, the memory required to run a Spark program can be estimated in advance; once the data-to-memory ratio exceeds the threshold, performance drops sharply. Corrections for any errors or omissions are welcome.
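The estimate in the summary can be turned around: given a data size, the threshold implies a minimum per-node memory. This is a sketch under the article's own thresholds (0.8 for three nodes, 0.9 for one); the helper name is my own:

```python
def min_memory_per_node_mb(data_mb: float, nodes: int = 3,
                           threshold: float = 0.8) -> float:
    """Smallest per-node memory (MB) that keeps
    data / (memory * nodes) at or under the threshold."""
    return data_mb / (threshold * nodes)

# 6,656 MB of input on three nodes needs roughly 2,773 MB per node;
# the 2,048 MB nodes in this test were well past the threshold there.
print(round(min_memory_per_node_mb(6656.01), 1))

# Single-node case with the 0.9 threshold from Conclusion 2:
print(round(min_memory_per_node_mb(2048.00, nodes=1, threshold=0.9), 1))
```

With 2 GB nodes, this matches the tables: both runs fell off a cliff shortly after their respective thresholds were crossed.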
Copyright statement: original work. Reposts must preserve the integrity of the document and must credit the original publication, the author, and this statement in the form of hyperlinks; otherwise legal liability may be pursued.
Blog: http://www.ninqing.net/
Weibo address: http://weibo.com/ninqing
Posted in spark. Tagged spark, performance test.