How to implement Hadoop's WordCount in 5 lines of code?


Every novice programmer knows what Hello World means: the first time you print Hello World to the console, you have stepped into the world of programming. It is a bit like being the first person to eat a crab, though that may not be the best metaphor.

If Hello World is the gateway to single-machine programming, then WordCount in a distributed environment is the gateway to distributed programming. Imagine your program running on a cluster of hundreds or even thousands of machines; isn't that something? Whether you start with Hadoop or Spark, the first example you meet in either open-source framework is invariably WordCount, and once our WordCount runs successfully, we can confidently dig deeper.


Enough digression; let's get to the topic: how to implement Hadoop's WordCount in 5 lines of code. In Hadoop, writing WordCount in Java takes at least a few dozen lines of code; writing it in Python, PHP, or C++ via Hadoop Streaming still takes around 10 lines. With Spark reading from HDFS, you can write WordCount in about 5 lines of Scala, but the Spark Java API version turns out bloated and can easily run to dozens of lines as well.
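
For reference, the roughly 5-line Spark version alluded to above looks something like the following. This is a minimal sketch, assuming it is typed into the spark-shell, where a SparkContext named sc already exists; the HDFS paths are just placeholders:

  // read the input text file from HDFS
  val lines = sc.textFile("hdfs:///input/words.txt")
  // split each line on spaces, pair every word with 1, and sum the pairs per word
  val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
  // write the (word, count) pairs back to HDFS
  counts.saveAsTextFile("hdfs:///output/wordcount")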


Today we will use neither Spark's Scala nor Hadoop Streaming's Python; instead, let's see how a Pig script gets the job done. The test data is as follows:


  I AM HADOOP
  i am hadoop
  i am lucene
  I AM HBASE
  I AM HIVE
  i am hive sql
  i am pig


The full Pig script is as follows:


  -- load the text file; each line becomes one record with a single chararray field
  a = load '$in' as (f1:chararray);
  -- split each line on the space delimiter and flatten the bag of tokens into rows
  b = foreach a generate flatten(TOKENIZE(f1, ' '));
  -- group the tokens by word
  c = group b by $0;
  -- count the occurrences of each word
  d = foreach c generate group, COUNT($1);
  -- store the result data
  store d into '$out';
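
The script takes the input and output paths as parameters ($in and $out). Assuming it is saved as wordcount.pig, it could be run with something like: pig -param in=/user/test/words.txt -param out=/user/test/wc_out wordcount.pig (the paths here are just placeholders).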


The processing results are as follows:  


  (I,7)
  (AM,7)
  (Pig,1)
  (SQL,1)
  (Hive,2)
  (HBase,1)
  (Hadoop,2)
  (Lucene,1)


Yes, you read that right: just 5 lines of code cover reading the data, splitting it, transforming it, grouping, counting, and storing the results. Simple and convenient!

This is every bit as concise as Spark, and it is still just a single job. If the requirements grow, say we also need to sort the results and take the top N, that remains very simple in Pig: just add 2 lines of code. In Spark, it may take several more lines.
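
For comparison, a hypothetical continuation of the earlier Scala sketch that sorts the counts and keeps the top 2 might look like this (again only a sketch, reusing the counts RDD from above):

  // sort by the count field in descending order and keep the two most frequent words
  val top2 = counts.sortBy(_._2, ascending = false).take(2)
  // print the result on the driver
  top2.foreach(println)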

Let's look at the modified Pig code, which adds sorting and takes the top N:


  -- load the text file; each line becomes one record with a single chararray field
  a = load '$in' as (f1:chararray);
  -- split each line on the space delimiter and flatten the bag of tokens into rows
  b = foreach a generate flatten(TOKENIZE(f1, ' '));
  -- group the tokens by word
  c = group b by $0;
  -- count the occurrences of each word
  d = foreach c generate group, COUNT($1);
  -- sort by count in descending order
  e = order d by $1 desc;
  -- take the top 2
  f = limit e 2;
  -- store the result data
  store f into '$out';


The output results are as follows:  


  (I,7)
  (AM,7)



If you wrote this as a MapReduce job in Java, the follow-up sorting and top-N step would require writing and running another job, because MapReduce is deliberately simple: one job handles one function. Pig, on the other hand, automatically analyzes the syntax tree and builds multiple dependent MapReduce jobs for us, so we don't have to worry about the underlying implementation and can focus on our business logic.
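
(As a side note, if you are curious about the plan Pig builds, the EXPLAIN statement, for example explain f;, prints the logical, physical, and MapReduce plans for a relation, which shows how the script is compiled into dependent jobs.)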

In addition, Pig is a very flexible batch-processing framework: through custom UDFs we can make Pig do a great many things. Friends who read my previous article will know that Yahoo originally used Pig not only to analyze logs, search content, and PageRank rankings, but also to build its web inverted index. Extensions like these can be implemented through Pig UDFs, which decouple our business logic from the concrete MapReduce implementation and are highly reusable: any utility class we write can easily and stably be run on a large-scale Hadoop cluster through Pig.






Please credit the original address when reproducing this article: http://qindongliang.iteye.com/. Thank you for your cooperation!

This article is from the "7936494" blog; original post: http://7946494.blog.51cto.com/7936494/1602633
