Cloud computing is in great demand

Keywords: Hadoop, distributed applications

At the Techonomy conference a few years ago, Google CEO Eric Schmidt memorably observed that we now create as much information every two days as we did from the dawn of civilization up until 2003. This proliferation of information has driven a series of technological breakthroughs, but it has also pushed organizations' data stores to hundreds of billions of bytes and beyond. Google's contributions in this area are particularly prominent, among them MapReduce, a large-scale distributed data-processing approach that Google uses to record where keywords and phrases reside within indexed resources and then return lists of those locations to the user. Map and reduce operations can also be applied to pattern recognition, graph analysis, predictive modeling, and risk management.

While Google's MapReduce implementation is proprietary, there are many open-source implementations of the MapReduce concept, including Apache Hadoop. In fact, Hadoop has become the de facto solution for distributed data processing, with dozens of companies around the world investing heavily in its implementation and development. Adobe, Amazon, AOL, Baidu, eBay, Facebook, Hulu, IBM, Last.fm, LinkedIn, Ning, Twitter, and Yahoo are all users, as are many universities, hospitals, and research centers.

Hadoop project introduction

Like many of the Apache Software Foundation's ("ASF") projects, Hadoop is an umbrella term covering all of the foundation's efforts to produce "open-source software for reliable, scalable, distributed computing." These efforts currently consist of four subprojects:

Hadoop Common: Hadoop Common forms the core of the Hadoop project, providing the "plumbing" needed by the sibling subprojects that follow.

HDFS: The Hadoop Distributed File System (HDFS) is the storage system responsible for replicating and distributing data throughout the compute cluster.

MapReduce: MapReduce is the software framework developers use to write applications that process data stored in HDFS.

ZooKeeper: ZooKeeper coordinates the configuration data, process synchronization, and other network-related services that distributed applications need to operate effectively.

So, while you download Hadoop as a single archive, keep in mind that you are actually downloading four subprojects that work together to implement mapping and reduction.
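To make the mapper and reducer roles concrete, here is a minimal sketch of a WordCount-style job written against Hadoop's org.apache.hadoop.mapreduce API. The class names and the whitespace-splitting logic are my own simplifications, not the code of the bundled example:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: for each line of input text, emit a (word, 1) pair per word.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Naive whitespace split; Hadoop's own example uses a StringTokenizer.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reducer: sum the counts emitted for each distinct word.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}

The framework handles everything between the two phases: it shuffles and sorts the mapper output so that each reducer receives all of the counts for a given word together.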

Experiment with Hadoop

Although the problems Hadoop sets out to solve are inherently complex, getting started with the project is easy. As an example, I thought it would be interesting to use Hadoop to perform a word-frequency analysis of my book "Easy PayPal with PHP." The task examines the entire book (about 130 pages in length) and produces a sorted list of the words appearing in it, along with the frequency of each word.

After installing Hadoop, I used Calibre to convert the book from PDF to a plain-text document. The Hadoop wiki contains similar setup instructions, although some of its pages are slightly dated owing to recent changes in Hadoop's configuration process.

Next, I used the following command to copy the book from a temporary location into the Hadoop Distributed File System:

$ ./bin/hadoop dfs -copyFromLocal /tmp/easypaypalwithphp easypaypalwithphp
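If you would rather perform this step programmatically, HDFS exposes the same operation through its Java API. Here is a minimal sketch, assuming your Hadoop configuration files are on the classpath so that the default file system resolves to your cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml and friends from the classpath, so
        // FileSystem.get() returns the configured HDFS cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: hadoop dfs -copyFromLocal /tmp/easypaypalwithphp easypaypalwithphp
        fs.copyFromLocalFile(new Path("/tmp/easypaypalwithphp"),
                             new Path("easypaypalwithphp"));
        fs.close();
    }
}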

You can confirm the successful copy by using the following command:

$ ./bin/hadoop dfs -ls
drwxr-xr-x   - hadoop supergroup          0 2011-01-04 12:48 /user/hadoop/easypaypalwithphp

Next, perform the word-frequency analysis using the sample WordCount program packaged with Hadoop:

$ ./bin/hadoop jar hadoop-mapred-examples-0.21.0.jar wordcount \
> easypaypalwithphp easypaypalwithphp-output
...
11/01/04 12:51:38 INFO mapreduce.Job: map 0% reduce 0%
11/01/04 12:51:48 INFO mapreduce.Job: map 100% reduce 0%
11/01/04 12:51:57 INFO mapreduce.Job: map 100% reduce 100%
11/01/04 12:51:59 INFO mapreduce.Job: Job complete: job_201101041237_0002
11/01/04 12:51:59 INFO mapreduce.Job: Counters: 33
  FileInputFormatCounters
    BYTES_READ=274440
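Incidentally, the wordcount entry point invoked above is just a small Java driver class. A comparable driver, sketched here assuming mapper and reducer classes like the ones shown earlier, wires the job together like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // The 0.21-era API uses the Job constructor directly;
        // later releases prefer Job.getInstance(conf, "word count").
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        // Summing is associative, so the reducer doubles as a combiner.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. easypaypalwithphp
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. easypaypalwithphp-output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}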

Finally, you can view the output with the following command:

$ ./bin/hadoop dfs -cat easypaypalwithphp-output/part-r-00000
...
Next 21
Next, 8
No 5
NoAutoBill 1
Norwegian 1
Not 2
Notably, 2
Note 5
Notice 6
Notification 13
...

The sample WordCount program is very basic, assigning equal weight to every word in the book's text, code included. Modifying it to parse a DocBook-formatted file and ignore the code listings would take some work. Regardless, consider the case where you want to build a service like the Google Books Ngram Viewer, which examines keyword usage across more than 5.2 million books.
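One purely illustrative way to begin that DocBook modification is a mapper that tracks whether the current line sits inside a <programlisting> element and counts words only outside it. The class name and the single-element filtering rule here are my own assumptions, not part of the bundled example:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper that skips DocBook <programlisting> blocks so code
// samples do not skew the word counts. Tracking state across lines like
// this assumes each code block falls within a single input split.
public class ProseOnlyMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    private boolean insideCode = false;

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String text = line.toString();
        if (text.contains("<programlisting")) insideCode = true;
        if (!insideCode) {
            // Split on non-word characters and normalize case, so that
            // entries like "Next" and "Next," collapse into one count.
            for (String token : text.split("\\W+")) {
                if (!token.isEmpty()) {
                    word.set(token.toLowerCase());
                    context.write(word, ONE);
                }
            }
        }
        if (text.contains("</programlisting>")) insideCode = false;
    }
}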
