Calculating the value of PI with Hadoop

Source: Internet
Author: User
First, the method and principle behind calculating PI

Searching online turns up many ways to calculate PI, but the comments in the Hadoop examples code say that a Quasi-Monte Carlo algorithm is used to estimate the value of PI.

Wikipedia's description of Quasi-Monte Carlo is rather theoretical, with a lot of formulas that are hard to understand.

Fortunately, a Google search turned up an article on Stanford University's website: "Can you get the value of PI by throwing darts?" The article is short, illustrated, and easy to understand.

Here is the important part of that article:

[Figure omitted: Figure 1 shows a unit square with an inscribed circle; Figure 2 is the top-right quadrant of Figure 1, containing a quarter circle.]
A little explanation of the figure above:
1. Figure 2 is the top-right quadrant of Figure 1.
2. Darts are thrown at Figure 2 many times (a very large number), each time landing at a different point.
3. If the number of throws is very large, Figure 2 ends up stabbed "full of holes."
4. At that point, "the number of throws landing inside the circle" divided by "the total number of throws," multiplied by 4, is the value of PI! (See the original article for the derivation.)
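The dart-throwing steps above can be sketched in a few lines of Python. This is an illustrative simulation only, not the Hadoop examples' actual code: it uses a pseudo-random generator in place of the Halton sequence that Hadoop uses, and the function name `estimate_pi` is my own.

```python
import random

def estimate_pi(throws: int, seed: int = 42) -> float:
    rng = random.Random(seed)
    hits = 0
    for _ in range(throws):
        # Throw a dart at the unit square (Figure 2): x, y uniform in [0, 1).
        x, y = rng.random(), rng.random()
        # Count the dart if it lands inside the quarter circle of radius 1.
        if x * x + y * y <= 1.0:
            hits += 1
    # hits / throws estimates the quarter circle's area, PI / 4,
    # so multiplying by 4 estimates PI.
    return 4.0 * hits / throws
```

With 200,000 throws the estimate typically lands within a few thousandths of PI; accuracy improves slowly, in proportion to the square root of the number of throws.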

In this algorithm, the key is knowing how to "throw darts at Figure 2 at random," that is, how to make every point in Figure 2 equally likely to be hit.

In the Hadoop examples code, a Halton sequence is used to guarantee this. For details on the Halton sequence, see Wikipedia.

Let me summarize the role of the Halton sequence here: it generates non-repeating, uniformly distributed points inside a 1-by-1 square, with each point's horizontal and vertical coordinates between 0 and 1. That is exactly what is needed to "throw darts at Figure 2 at random."
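A minimal sketch of how a Halton sequence produces such points. The radical-inverse function `halton` below is my own illustration of the standard construction, not the Hadoop examples' implementation; pairing bases 2 and 3 gives low-discrepancy 2-D points in the unit square.

```python
def halton(index: int, base: int) -> float:
    """Radical inverse of `index` in the given base (van der Corput sequence).

    Writes `index` in the base, then mirrors its digits across the
    radix point, yielding a value in [0, 1).
    """
    result = 0.0
    fraction = 1.0 / base
    i = index
    while i > 0:
        result += fraction * (i % base)
        i //= base
        fraction /= base
    return result

# 2-D Halton points in the unit square: base 2 for x, base 3 for y.
# The points never repeat and fill the square roughly uniformly.
points = [(halton(i, 2), halton(i, 3)) for i in range(1, 9)]
```

For example, in base 2 the sequence begins 1/2, 1/4, 3/4, 1/8, 5/8, ..., which is why consecutive points spread out across the square instead of clustering.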

To put it another way, this is the Monte Carlo method: take a unit square (1 × 1) and draw its inscribed circle. Then the square's area is to the inscribed circle's area as the total number of darts is to the number of darts landing inside the circle. From the dart counts we estimate the circle's area, and from that area we compute PI. Note that accuracy grows with the number of darts thrown.

Second, running the Hadoop PI example command

The two numeric parameters passed to the pi example mean:
The first number (here, 100) is how many map tasks to run.
The second number is how many darts each map task throws.

The product of the two parameters is the total number of throws.

The result of my run:

Third, summary

The method used to calculate PI in the Hadoop examples is a statistical one that relies on a very large number of samples; in other words, it is data-intensive.

If reproduced, please credit the source: http://www.ming-yue.cn/hadoop-pi/

How to calculate the exact value of PI?

Take the unit circle and inscribe a regular N-gon in it, where N is a positive integer power of 2. First compute the perimeter of the inscribed N-gon with the formula C = N * 2R * sin(180°/N), where R, the radius of the unit circle, is 1. Because N is a power of 2, sin(180°/N) can be computed by repeatedly applying the half-angle formula. Finally, C / 2 gives an approximation of PI: as N grows without bound, C / 2 approaches PI.
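The half-angle doubling described above can be sketched as follows. This is my own illustration (the name `pi_by_polygon` is invented): starting from the inscribed square (N = 4, side √2), the half-angle formula gives the side of the inscribed 2N-gon from the N-gon's side via the recurrence s' = √(2 − √(4 − s²)), so PI is approached without ever evaluating sin at the target angle.

```python
import math

def pi_by_polygon(doublings: int) -> float:
    # Start with a square (N = 4) inscribed in the unit circle (R = 1).
    n = 4
    side = math.sqrt(2.0)  # side length of the inscribed square
    for _ in range(doublings):
        # Half-angle step: side of the inscribed 2N-gon from the N-gon's.
        # Since side = 2*sin(pi/N), this is the half-angle formula for sin.
        side = math.sqrt(2.0 - math.sqrt(4.0 - side * side))
        n *= 2
    # Perimeter C = n * side; C / 2 approximates PI.
    return n * side / 2.0
```

After 10 doublings (N = 4096) the result already agrees with PI to about six decimal places; note that in floating point the subtraction 2 − √(4 − s²) eventually loses precision, so doubling indefinitely does not keep improving the answer.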

What kinds of computation is Hadoop suited for?

Hadoop is mainly suited to large data files, ideally at the GB or TB scale; it splits the data in bulk and stores it in a distributed fashion. For small data, system overhead and other factors mean it is not necessarily faster than a single serial program.

In addition, in Hadoop's MapReduce computing model, map tasks generate intermediate result files, which reduce tasks then process to produce and output the final result. Because intermediate result files live in local memory or on the disks of each compute node, if they are huge the reduce phase must fetch them via remote procedure calls, increasing network transfer overhead; such workloads are not a good fit for Hadoop.

So both points must be weighed when deciding whether to process data with Hadoop. Large-scale statistical analysis, such as computing a variance, or distributed data queries are good fits. Haha, I hope that answers the question clearly.
