Http://blog.sina.com.cn/s/blog_537b7f1a0100m0xc.html
Google's
Sawzall, Pig and Microsoft Dryad
Greg recently wrote a blog about the distributed architecture of Google, Yahoo, and Microsoft. This is: Google's Sawzall, Yahoo's pig
Pig and Microsoft Dryad.
This is really an information explosion era. In this context, the computing that consumes the most CPU will gradually shift from "Improving the performance of the software itself" to the information processing process. Moore's law describing the increase in computing speed is said to be still valid, but the famous saying "Andy giveth, and Bill taketh away" should seem to be changed to: "Andy giveth, andgoogle (...) taketh away "now.
Let's get down to the truth. During the May Day, the Yahoo swine holiday was released: pig.
. (Pig) Yahoo pig is
In February March, cutting joined Yahoo's parallel processing architecture. With pig, common programmers were able to analyze and process gigantic datasets. With hadoop, you have basically entered the practical stage. Amazon EC2
And S3 are already using hadoop.
Yahoo pig has the following features:
1. Focus on massive data set analysis (Ad-Hoc analysis), Ad-hoc representative:A solution that has been custom designed for a specific problem );
2. Run on the cluster computing architecture. Yahoo pig provides multi-layer abstraction to simplify parallel computing for common users; these abstractions automatically translate user requests to queries into effective parallel evaluation plans, and then execute these plans on the physical cluster;
3. provides SQL-like Operation syntax;
4. Open source code;
From the perspective of Yahoo pig, we recommend that you use Google sawzall and Microsoft Dryad.
Google sawzall was released by Google Labs very early. Although both of them are based on the distributed parallel computing architecture, their implementation methods are quite different.
Sawzall is based on mapreduce and becomes a syntax similar to Java and C.
The following is an example of the Sawzall code:
proto "querylog.proto"
static RESOLUTION: int = 5; # minutes; must be divisor of 60
log_record: QueryLogProto = input;
queries_per_degree: table sum[t: time][lat: int][lon: int] of int;
loc: Location = locationinfo(log_record.ip);
if (def(loc)) {
t: time = log_record.time_usec;
m: int = minuteof(t); # within the hour
m = m - m % RESOLUTION;
t = trunctohour(t) + time(m * int(MINUTE));
emit queries_per_degree[t][int(loc.lat)][int(loc.lon)] <- 1;
}
The following is an example of pig code:
a = COGROUP QueryResults BY url, Pages BY url;
b = FOREACH a GENERATE FLATTEN(QueryResults.(query, position)), FLATTEN(Pages.pagerank);
c = GROUP b BY query;
d = FILTER c BY checkTop5(*);
Obviously, if you need to analyze and process structured (semi-structured) data, the pig SQL syntax is easier to grasp.
For details, refer to other yahoo pig examples:
Pig Latin examples:
Example 1: Word Count
Example 2: map/reduce
Example 3: pages and queries
Example 4: PageRank
Coincidentally, Microsoft's Dryad integrated LINQ (with the official release of. NET 2.0) is called:
Dryadlinq. From a personal point of view, I have always been optimistic about the LINQ product. I came from aders and said that the combination of programming language and data processing is one pair of simple insert, update, delete, and query. That's why I like rails.
Currently, MicrosoftDryad already has
Microsoft's adcenter is in use.
I would like to use Yahoo pig for log analysis.
References:
Sourcelab
Dryad: distributed data-parallel programs from sequential building blocks
Mapreduce BBS