Cloud computing is in great demand

Keywords: Hadoop, distributed applications

At the Techonomy conference a few years ago, Google CEO Eric Schmidt memorably observed that we now create as much information every two days as we did from the dawn of civilization up until 2003. This proliferation of information has driven a series of technological breakthroughs, but it has also pushed organizations' data stores to hundreds of billions of bytes and beyond. Google's contributions in this area are particularly prominent, among them MapReduce, a large-scale distributed data-processing approach that Google uses to record where keywords and phrases reside within indexed resources and then return lists of those locations to the user. Map and reduce operations can also be applied to pattern recognition, graph analysis, predictive modeling, and risk management.

While Google's MapReduce implementation is proprietary, there are many open-source implementations of the MapReduce concept, including Apache Hadoop. In fact, Hadoop has become the de facto solution for distributed data processing, with dozens of companies around the world investing heavily in its implementation and development. Adobe, Amazon, AOL, Baidu, eBay, Facebook, Hulu, IBM, Last.fm, LinkedIn, Ning, Twitter, and Yahoo are all users, as are many universities, hospitals, and research centers.

Hadoop project introduction

Like many of the Apache Software Foundation's ("ASF") projects, Hadoop is an umbrella term covering all of the foundation's efforts to produce "open-source software for reliable, scalable, distributed computing." These efforts currently consist of four subprojects:

Hadoop Common: Hadoop Common forms the core of the Hadoop project, providing the "plumbing" needed by the sibling subprojects that follow.

HDFS: The Hadoop Distributed File System (HDFS) is the storage system responsible for replicating and distributing data throughout the compute cluster.

MapReduce: MapReduce is the software framework developers use to write applications that process data stored in HDFS.

ZooKeeper: ZooKeeper coordinates the configuration data, process synchronization, and other network-related services that distributed applications need to operate effectively.

So, while you download Hadoop as a single archive, keep in mind that you are actually downloading four subprojects that work together to implement mapping and reduction.
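To make the mapper and reducer roles concrete, here is a minimal sketch of a WordCount-style job written against Hadoop's org.apache.hadoop.mapreduce API. The class names and the whitespace-splitting logic are my own simplifications, not the code of the bundled example:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: for each line of input text, emit a (word, 1) pair per word.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Naive whitespace split; Hadoop's own example uses a StringTokenizer.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reducer: sum the counts emitted for each distinct word.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}

The framework handles everything between the two phases: it shuffles and sorts the mapper output so that each reducer receives all of the counts for a given word together.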

Experiment with Hadoop

Although the problems Hadoop sets out to solve are inherently complex, getting started with the project is easy. As an example, I thought it would be interesting to use Hadoop to perform a word-frequency analysis of my book "Easy PayPal with PHP." The task examines the entire book (about 130 pages in length) and produces a sorted list of the words appearing in it, along with the frequency of each word.

After installing Hadoop, I used Calibre to convert the book from PDF to a plain-text document. The Hadoop wiki contains similar setup instructions, although some of its pages are slightly dated owing to recent changes in Hadoop's configuration process.

Next, I used the following command to copy the book from a temporary location into the Hadoop Distributed File System:

$ ./bin/hadoop dfs -copyFromLocal /tmp/easypaypalwithphp easypaypalwithphp
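If you would rather perform this step programmatically, HDFS exposes the same operation through its Java API. Here is a minimal sketch, assuming your Hadoop configuration files are on the classpath so that the default file system resolves to your cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml and friends from the classpath, so
        // FileSystem.get() returns the configured HDFS cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: hadoop dfs -copyFromLocal /tmp/easypaypalwithphp easypaypalwithphp
        fs.copyFromLocalFile(new Path("/tmp/easypaypalwithphp"),
                             new Path("easypaypalwithphp"));
        fs.close();
    }
}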

You can confirm the successful copy by using the following command:

$ ./bin/hadoop dfs -ls
drwxr-xr-x   - hadoop supergroup          0 2011-01-04 12:48 /user/hadoop/easypaypalwithphp

Next, perform the word-frequency analysis using the sample WordCount program packaged with Hadoop:

$ ./bin/hadoop jar hadoop-mapred-examples-0.21.0.jar wordcount \
> easypaypalwithphp easypaypalwithphp-output
...
11/01/04 12:51:38 INFO mapreduce.Job: map 0% reduce 0%
11/01/04 12:51:48 INFO mapreduce.Job: map 100% reduce 0%
11/01/04 12:51:57 INFO mapreduce.Job: map 100% reduce 100%
11/01/04 12:51:59 INFO mapreduce.Job: Job complete: job_201101041237_0002
11/01/04 12:51:59 INFO mapreduce.Job: Counters: 33
  FileInputFormatCounters
    BYTES_READ=274440
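Incidentally, the wordcount entry point invoked above is just a small Java driver class. A comparable driver, sketched here assuming mapper and reducer classes like the ones shown earlier, wires the job together like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // The 0.21-era API uses the Job constructor directly;
        // later releases prefer Job.getInstance(conf, "word count").
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        // Summing is associative, so the reducer doubles as a combiner.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. easypaypalwithphp
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. easypaypalwithphp-output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}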

Finally, you can view the output with the following command:

$ ./bin/hadoop dfs -cat easypaypalwithphp-output/part-r-00000
...
Next 21
Next, 8
No 5
NoAutoBill 1
Norwegian 1
Not 2
Notably, 2
Note 5
Notice 6
Notification 13
...

The sample WordCount program is very basic, assigning equal weight to every word in the book's text, code included. Modifying it to parse a DocBook-formatted file and ignore the code listings would take some work. Regardless, consider the case where you want to build a service like the Google Books Ngram Viewer, which examines keyword usage across more than 5.2 million books.
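One purely illustrative way to begin that DocBook modification is a mapper that tracks whether the current line sits inside a <programlisting> element and counts words only outside it. The class name and the single-element filtering rule here are my own assumptions, not part of the bundled example:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper that skips DocBook <programlisting> blocks so code
// samples do not skew the word counts. Tracking state across lines like
// this assumes each code block falls within a single input split.
public class ProseOnlyMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    private boolean insideCode = false;

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String text = line.toString();
        if (text.contains("<programlisting")) insideCode = true;
        if (!insideCode) {
            // Split on non-word characters and normalize case, so that
            // entries like "Next" and "Next," collapse into one count.
            for (String token : text.split("\\W+")) {
                if (!token.isEmpty()) {
                    word.set(token.toLowerCase());
                    context.write(word, ONE);
                }
            }
        }
        if (text.contains("</programlisting>")) insideCode = false;
    }
}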
