RDDs are well suited to batch analytics over parallel data, including data mining, machine learning, and graph algorithms, because such programs apply the same operation to many records. RDDs are not suitable for applications that asynchronously update shared state, such as a parallel web crawler. The goal, therefore, is to provide an effective programming model for most analytical applications, while leaving other kinds of applications to specialized systems.
For details about RDDs, see:
Resilient Distributed Datasets: Memory-Based Cluster Computing (2): Resilient Distributed Datasets (RDD)
Hardware environment:
Three machines, each with a dual-core Intel(R) Xeon(R) E5440 CPU @ 2.83 GHz and 4 GB of memory
Operating System:
Red Hat Enterprise Linux Server release 5.7 (Tikanga)
Spark Configuration:
Three nodes with 2 GB of memory each; 14 dimensions, 100 categories, 10 iterations. Sample files of different sizes are used for the analysis.
Conclusion 1: A three-node threshold of 0.8 is defined on the ratio data size / (2048 × 3), i.e. data size relative to total cluster memory. While the data stays within the threshold, throughput increases monotonically. Once the threshold is exceeded, performance drops sharply: 2% past the threshold, throughput falls by 53.11937%; 34.01326% past it, throughput falls by 70.80896%.
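The threshold rule from Conclusion 1 can be sketched as a quick check in Python. This is a minimal illustration, assuming 2048 MB of usable memory per node; the function and variable names are my own, not part of any test harness:

```python
def memory_ratio(data_mb: float, mem_mb: float = 2048, nodes: int = 3) -> float:
    """Data size relative to total cluster memory
    (the data/memory/node-count column in the table below)."""
    return data_mb / (mem_mb * nodes)

def within_threshold(data_mb: float, threshold: float = 0.8,
                     mem_mb: float = 2048, nodes: int = 3) -> bool:
    """True while the data fits under the threshold, i.e. while
    throughput is expected to keep increasing monotonically."""
    return memory_ratio(data_mb, mem_mb, nodes) <= threshold

# 1,536 MB sits well inside the threshold; 5,066.68 MB is past it.
print(within_threshold(1536.00))   # True
print(within_threshold(5066.68))   # False
```

Ratios computed this way match the data/memory/node-count column of the test data below (e.g. 5,120.01 MB gives about 0.833335).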
Test data:
| Serial number | Data file size (MB) | Number of records | Time consumed (s) | Data file / time (MB/s) | Data / memory | Data / memory / node count |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 33.33 | 147,106 | 10 | 3.333344 | 0.016274 | 0.005425 |
| 1 | 100.00 | 441,319 | 13 | 7.692317 | 0.048828 | 0.016276 |
| 2 | 166.67 | 735,533 | 15 | 11.11118 | 0.081382 | 0.027127 |
| 3 | 233.33 | 1,029,746 | 20 | 11.66652 | 0.113931 | 0.037977 |
| 4 | 341.33 | 1,506,371 | 23 | 14.8406 | 0.166665 | 0.055555 |
| 5 | 512.00 | 2,259,557 | 30 | 17.06666 | 0.25 | 0.083333 |
| 6 | 682.67 | 3,012,743 | 42 | 16.25402 | 0.333335 | 0.111112 |
| 7 | 853.33 | 3,765,929 | 45 | 18.96291 | 0.416665 | 0.138888 |
| 8 | 1,024.00 | 4,519,115 | 57 | 17.96494 | 0.5 | 0.166667 |
| 9 | 1,194.67 | 5,272,301 | 65 | 18.37953 | 0.583335 | 0.194445 |
| 10 | 1,365.33 | 6,025,487 | 73 | 18.70316 | 0.666665 | 0.222222 |
| 11 | 1,536.00 | 6,778,673 | 80 | 19.20001 | 0.75 | 0.25 |
| 12 | 1,706.67 | 7,531,859 | 95 | 17.96491 | 0.833335 | 0.277778 |
| 13 | 1,877.33 | 8,285,044 | 147 | 12.77097 | 0.916665 | 0.305555 |
| 14 | 2,048.00 | 9,038,230 | 104 | 19.6923 | 1 | 0.333333 |
| 15 | 2,218.66 | 9,791,416 | 113 | 19.63417 | 1.08333 | 0.36111 |
| 16 | 2,389.33 | 10,544,602 | 124 | 19.26881 | 1.166665 | 0.388888 |
| 17 | 2,560.01 | 11,297,788 | 175 | 14.62861 | 1.250005 | 0.416668 |
| 18 | 2,730.66 | 12,050,974 | 184 | 14.84056 | 1.33333 | 0.444443 |
| 19 | 2,901.34 | 12,804,160 | 164 | 17.69109 | 1.41667 | 0.472223 |
| 20 | 3,072.00 | 13,557,346 | 155 | 19.81934 | 1.5 | 0.5 |
| 21 | 3,242.67 | 14,310,532 | 162 | 20.01647 | 1.583335 | 0.527778 |
| 22 | 3,413.34 | 15,063,718 | 166 | 20.56231 | 1.66667 | 0.555557 |
| 23 | 3,754.68 | 16,570,089 | 179 | 20.97585 | 1.83334 | 0.611113 |
| 24 | 4,266.68 | 18,829,646 | 189 | 22.57501 | 2.08334 | 0.694447 |
| 25 | 4,500.01 | 19,859,392 | 209 | 21.53114 | 2.197271 | 0.732424 |
| 26 | 4,666.68 | 20,594,925 | 202 | 23.10235 | 2.278652 | 0.759551 |
| 27 | 4,766.68 | 21,036,244 | 202 | 23.5974 | 2.32748 | 0.775827 |
| 28 | 4,866.68 | 21,477,563 | 226 | 21.53396 | 2.376309 | 0.792103 |
| 29 | 4,966.68 | 21,918,882 | 220 | 22.5758 | 2.425137 | 0.808379 |
| 30 | 5,066.68 | 22,360,201 | 458 | 11.06261 | 2.473965 | 0.824655 |
| 31 | 5,120.01 | 22,595,577 | 463 | 11.05834 | 2.500005 | 0.833335 |
| 32 | 6,656.01 | 29,374,250 | 1010 | 6.59011 | 3.250005 | 1.083335 |
Performance trend chart:
Spark Configuration:
One node, 2 GB memory, 14 dimensions, 100 categories, 10 iterations.
Conclusion 2: A single-node threshold of 0.9 is defined on the ratio data size / 2048, i.e. data size relative to node memory. While the data stays within the threshold, throughput increases monotonically. Once the threshold is exceeded, performance drops sharply: 8.3334961% past the threshold, throughput falls by 57.61797318%; 18.18167291% past it, by 66.4701143%; and 36.36441116% past it, by 94.14757913%.
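The drop percentages above appear to be measured against the last throughput reading inside the threshold (row 13 of the single-node table, 8.41853 MB/s). The sketch below reproduces them under that assumption; the baseline choice and the function name are mine, not stated in the original test:

```python
# Baseline: throughput of the last in-threshold row of the single-node
# table (row 13, 1,877.33 MB processed at 8.41853 MB/s).
BASELINE_MB_S = 8.41853

def drop_pct(throughput_mb_s: float, baseline_mb_s: float = BASELINE_MB_S) -> float:
    """Percentage throughput loss relative to the in-threshold baseline."""
    return (1 - throughput_mb_s / baseline_mb_s) * 100

print(round(drop_pct(3.567944), 5))  # row 14 -> ~57.61797
print(round(drop_pct(2.822724), 5))  # row 15 -> ~66.47011
print(round(drop_pct(0.492688), 5))  # row 17 -> ~94.14758
```

The three printed values match the 57.6%, 66.5%, and 94.1% drops quoted in Conclusion 2.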
| Serial number | Data file size (MB) | Number of records | Time consumed (s) | Data file / time (MB/s) | Data / memory |
| --- | --- | --- | --- | --- | --- |
| 0 | 33.33 | 147,106 | 10 | 3.333344 | 0.016274 |
| 1 | 100.00 | 441,319 | 20 | 5.000006 | 0.048828 |
| 2 | 166.67 | 735,533 | 27 | 6.17288 | 0.081382 |
| 3 | 233.33 | 1,029,746 | 34 | 6.862657 | 0.113931 |
| 4 | 341.33 | 1,506,371 | 45 | 7.585197 | 0.166665 |
| 5 | 512.00 | 2,259,557 | 64 | 7.999997 | 0.25 |
| 6 | 682.67 | 3,012,743 | 85 | 8.031401 | 0.333335 |
| 7 | 853.33 | 3,765,929 | 102 | 8.365989 | 0.416665 |
| 8 | 1,024.00 | 4,519,115 | 118 | 8.67798 | 0.5 |
| 9 | 1,194.67 | 5,272,301 | 137 | 8.720216 | 0.583335 |
| 10 | 1,365.33 | 6,025,487 | 153 | 8.923729 | 0.666665 |
| 11 | 1,536.00 | 6,778,673 | 176 | 8.727279 | 0.75 |
| 12 | 1,706.67 | 7,531,859 | 193 | 8.84283 | 0.833335 |
| 13 | 1,877.33 | 8,285,044 | 223 | 8.41853 | 0.916665 |
| 14 | 2,048.00 | 9,038,230 | 574 | 3.567944 | 1 |
| 15 | 2,218.66 | 9,791,416 | 786 | 2.822724 | 1.08333 |
| 16 | 2,389.33 | 10,544,602 | 1134 | 2.106995 | 1.166665 |
| 17 | 2,560.01 | 11,297,788 | 5196 | 0.492688 | 1.250005 |
Performance trend chart:
Summary: with these thresholds, the memory required to run a Spark program can be estimated in advance; once the data-to-memory ratio exceeds the threshold, performance drops sharply. Corrections for any errors or omissions are welcome.
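The estimate in the summary can be turned around: given a data size, the threshold implies a minimum per-node memory. This is a sketch under the article's own thresholds (0.8 for three nodes, 0.9 for one); the helper name is my own:

```python
def min_memory_per_node_mb(data_mb: float, nodes: int = 3,
                           threshold: float = 0.8) -> float:
    """Smallest per-node memory (MB) that keeps
    data / (memory * nodes) at or under the threshold."""
    return data_mb / (threshold * nodes)

# 6,656 MB of input on three nodes needs roughly 2,773 MB per node;
# the 2,048 MB nodes in this test were well past the threshold there.
print(round(min_memory_per_node_mb(6656.01), 1))

# Single-node case with the 0.9 threshold from Conclusion 2:
print(round(min_memory_per_node_mb(2048.00, nodes=1, threshold=0.9), 1))
```

With 2 GB nodes, this matches the tables: both runs fell off a cliff shortly after their respective thresholds were crossed.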
Copyright statement: original work. Reposts must preserve the integrity of the document and must credit the original publication, the author, and this statement in the form of hyperlinks; otherwise legal liability may be pursued.
Blog: http://www.ninqing.net/
Weibo address: http://weibo.com/ninqing
Posted in spark. Tagged spark, performance test.