How to implement Hadoop's WordCount in 5 lines of code?


Every novice programmer knows what Hello World means: the first time you print Hello World to the console, you have stepped into the world of programming. It is a bit like being the first person to eat a crab, though that may not be the best metaphor.

If Hello World is the gateway to single-machine programming, then WordCount in a distributed environment is the gateway to distributed programming. Imagine your program running on a cluster of hundreds or even thousands of machines; isn't that something? Whether you start with Hadoop or Spark, the first example you meet in either open-source framework is invariably WordCount, and once our WordCount runs successfully, we can confidently dig deeper.


Enough digression; let's get to the topic: how to implement Hadoop's WordCount in 5 lines of code. In Hadoop, writing WordCount in Java takes at least a few dozen lines of code; writing it in Python, PHP, or C++ via Hadoop Streaming still takes around 10 lines. With Spark reading from HDFS, you can write WordCount in about 5 lines of Scala, but the Spark Java API version turns out bloated and can easily run to dozens of lines as well.
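
For reference, the roughly 5-line Spark version alluded to above looks something like the following. This is a minimal sketch, assuming it is typed into the spark-shell, where a SparkContext named sc already exists; the HDFS paths are just placeholders:

  // read the input text file from HDFS
  val lines = sc.textFile("hdfs:///input/words.txt")
  // split each line on spaces, pair every word with 1, and sum the pairs per word
  val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
  // write the (word, count) pairs back to HDFS
  counts.saveAsTextFile("hdfs:///output/wordcount")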


Today we will use neither Spark's Scala nor Hadoop Streaming's Python; instead, let's see how a Pig script gets the job done. The test data is as follows:


  I AM HADOOP
  i am hadoop
  i am lucene
  I AM HBASE
  I AM HIVE
  i am hive sql
  i am pig


The full Pig script is as follows:


  -- load the text file; each line becomes one record with a single chararray field
  a = load '$in' as (f1:chararray);
  -- split each line on the space delimiter and flatten the bag of tokens into rows
  b = foreach a generate flatten(TOKENIZE(f1, ' '));
  -- group the tokens by word
  c = group b by $0;
  -- count the occurrences of each word
  d = foreach c generate group, COUNT($1);
  -- store the result data
  store d into '$out';
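
The script takes the input and output paths as parameters ($in and $out). Assuming it is saved as wordcount.pig, it could be run with something like: pig -param in=/user/test/words.txt -param out=/user/test/wc_out wordcount.pig (the paths here are just placeholders).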


The processing results are as follows:  


  (I,7)
  (AM,7)
  (Pig,1)
  (SQL,1)
  (Hive,2)
  (HBase,1)
  (Hadoop,2)
  (Lucene,1)


Yes, you read that right: just 5 lines of code cover reading the data, splitting it, transforming it, grouping, counting, and storing the results. Simple and convenient!

This is every bit as concise as Spark, and it is still just a single job. If the requirements grow, say we also need to sort the results and take the top N, that remains very simple in Pig: just add 2 lines of code. In Spark, it may take several more lines.
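
For comparison, a hypothetical continuation of the earlier Scala sketch that sorts the counts and keeps the top 2 might look like this (again only a sketch, reusing the counts RDD from above):

  // sort by the count field in descending order and keep the two most frequent words
  val top2 = counts.sortBy(_._2, ascending = false).take(2)
  // print the result on the driver
  top2.foreach(println)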

Let's look at the modified Pig code, which adds sorting and takes the top N:


  -- load the text file; each line becomes one record with a single chararray field
  a = load '$in' as (f1:chararray);
  -- split each line on the space delimiter and flatten the bag of tokens into rows
  b = foreach a generate flatten(TOKENIZE(f1, ' '));
  -- group the tokens by word
  c = group b by $0;
  -- count the occurrences of each word
  d = foreach c generate group, COUNT($1);
  -- sort by count in descending order
  e = order d by $1 desc;
  -- take the top 2
  f = limit e 2;
  -- store the result data
  store f into '$out';


The output results are as follows:  


  (I,7)
  (AM,7)



If you wrote this as a MapReduce job in Java, the follow-up sorting and top-N step would require writing and running another job, because MapReduce is deliberately simple: one job handles one function. Pig, on the other hand, automatically analyzes the syntax tree and builds multiple dependent MapReduce jobs for us, so we don't have to worry about the underlying implementation and can focus on our business logic.
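
(As a side note, if you are curious about the plan Pig builds, the EXPLAIN statement, for example explain f;, prints the logical, physical, and MapReduce plans for a relation, which shows how the script is compiled into dependent jobs.)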

In addition, Pig is a very flexible batch-processing framework: through custom UDFs we can make Pig do a great many things. Friends who read my previous article will know that Yahoo originally used Pig not only to analyze logs, search content, and PageRank rankings, but also to build its web inverted index. Extensions like these can be implemented through Pig UDFs, which decouple our business logic from the concrete MapReduce implementation and are highly reusable: any utility class we write can easily and stably be run on a large-scale Hadoop cluster through Pig.






Please credit the original address when reproducing this article: http://qindongliang.iteye.com/. Thank you for your cooperation!

This article is from the "7936494" blog; original post: http://7946494.blog.51cto.com/7936494/1602633
