Hadoop is an open source distributed parallel programming framework that implements the MapReduce computing model. With Hadoop, programmers can easily write distributed parallel programs and run them on computer clusters to process massive amounts of data. This article introduces the basic concepts of the MapReduce computing model and distributed parallel computing, and then describes how to install, deploy, and operate Hadoop.
Introduction to Hadoop
Hadoop is an open source distributed parallel programming framework that runs on large clusters. Because distributed storage is essential for distributed programming, the framework also includes a distributed file system, HDFS (Hadoop Distributed File System). Hadoop itself may not be widely known yet; its latest release is only version 0.16, and 1.0 still seems a long way off. But two other open source projects closely related to Hadoop, Nutch and Lucene (both founded by Doug Cutting), are definitely famous. Lucene is an open source, high-performance full-text search toolkit written in Java. It is not a complete application but a set of easy-to-use APIs, and countless software systems and web sites around the world use Lucene to implement full-text search. Doug Cutting later created Nutch (http://www.nutch.org), the first open source web search engine, which builds on Lucene by adding a web crawler, other web-related functionality, and plug-ins for parsing various document formats; Nutch also included a distributed file system for storing data. After Nutch version 0.8.0, Doug Cutting split the distributed file system and the MapReduce implementation out of Nutch to form the new open source project Hadoop, and Nutch itself evolved into an open source search engine built on Lucene full-text search and the Hadoop distributed computing platform.
Based on Hadoop, you can easily write distributed parallel programs that handle massive amounts of data and run them on large clusters of hundreds of computing nodes. Hadoop seems destined for a bright future: "cloud computing" is currently a red-hot term, the world's major IT companies are investing in and promoting this new generation of computing model, and Hadoop is an important piece of foundational software in several of their cloud computing environments. For example, Yahoo is using the open source Hadoop platform in its competition with Google; besides funding the Hadoop development team, it is also developing Pig, a Hadoop-based open source project focused on the analysis of massive data sets. Amazon offers its Simple Storage Service (Amazon S3) for reliable, fast, scalable network storage, along with the commercial cloud computing platform Amazon EC2 (Amazon Elastic Compute Cloud), and Hadoop is widely run on this infrastructure. Hadoop is also an important piece of underlying software in the "Blue Cloud" project of IBM's cloud computing initiative, and Google is working with IBM to promote cloud computing based on Hadoop.
Embracing a Change in the Way We Program
Thanks to Moore's Law, programmers in the past did not have to worry about computers failing to keep pace with software: roughly every 18 months, CPU clock frequencies rose and performance doubled, so software got faster without a single line of code being changed. However, as transistor circuits gradually approach their physical limits, Moore's Law began to break down around 2005, and we can no longer expect a single CPU to double in performance every 18 months. Chip makers such as Intel, AMD, and IBM have instead turned to multi-core designs to tap the remaining performance potential of the CPU. The arrival of the multi-core era and the Internet era will profoundly change how software is written: multi-threaded concurrent programming on multi-core machines and distributed parallel programming on large computer clusters are the main ways software performance will be improved in the future.
Many people believe this major change in programming will trigger a software concurrency crisis, because our traditional approach to software is essentially the sequential execution of a single instruction stream over a single stream of data. Sequential execution fits human thinking habits well, but it is at odds with concurrent and parallel programming. Cluster-based distributed parallel programming lets a program and its data run simultaneously on many computers connected by a network, where each computer can be an ordinary PC. The biggest advantage of such a distributed parallel environment is that it is easy to add computers as new compute nodes and thereby obtain enormous computing power; it is also highly fault tolerant, since the failure of some compute nodes does not prevent the computation from proceeding or affect the correctness of its results. Google does exactly this, using a parallel programming model called MapReduce running on top of a distributed file system called GFS to provide search services to hundreds of millions of users around the world.
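To make the model concrete before turning to Hadoop itself, here is a minimal, single-machine sketch of the MapReduce idea applied to counting words. It is only a conceptual illustration, not Hadoop code; the class name WordCountConcept and the hand-written grouping step are invented for this sketch, whereas a real framework distributes the map and reduce calls across the cluster and performs the grouping itself.

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Conceptual, single-machine illustration of the MapReduce model (not Hadoop code).
public class WordCountConcept {

    // map: one line of text -> list of intermediate (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<Map.Entry<String, Integer>>();
        for (String word : line.split("\\s+")) {
            if (word.length() > 0) {
                pairs.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
            }
        }
        return pairs;
    }

    // reduce: (word, list of counts) -> total count for that word
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    public static void main(String[] args) {
        String[] lines = { "hello hadoop", "hello mapreduce" };

        // A real framework shuffles and groups the intermediate pairs by key across the cluster;
        // here we group them by hand in an in-memory map of word -> list of counts.
        Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                List<Integer> counts = grouped.get(pair.getKey());
                if (counts == null) {
                    counts = new ArrayList<Integer>();
                    grouped.put(pair.getKey(), counts);
                }
                counts.add(pair.getValue());
            }
        }
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            System.out.println(entry.getKey() + "\t" + reduce(entry.getKey(), entry.getValue()));
        }
    }
}

The programmer only supplies the two small functions, map and reduce; everything between them, the splitting of input, the shuffling of intermediate pairs, and the recovery from node failures, is the framework's job.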
Hadoop implements Google's MapReduce programming model, provides an easy-to-use programming interface, and supplies its own distributed file system, HDFS. Unlike Google's system, Hadoop is open source, so anyone can use it for parallel programming. If the difficulty of distributed parallel programming has been enough to intimidate ordinary programmers, the advent of open source Hadoop dramatically lowers the bar. After reading this series you will find that programming on Hadoop is simple enough that, even without any parallel development experience, you can easily develop distributed parallel programs, have them run on hundreds of machines at once, and complete massive data computations in a short time. You might think you will never have hundreds of machines on which to run your parallel programs, but with the spread of "cloud computing", anyone can readily obtain that kind of computing power. Amazon's cloud computing platform Amazon EC2, for example, already offers this kind of on-demand rental service; interested readers can learn more in Part 3 of this series.
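As a taste of how little code this takes, here is the same word-count logic sketched against Hadoop's org.apache.hadoop.mapred API, the interface current in the 0.16-era releases discussed here. Exact signatures vary a little between Hadoop versions, so treat this as a sketch under those assumptions rather than copy-paste-ready code. The mapper emits a (word, 1) pair for every word it sees, and the reducer sums the counts for each word.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Mapper: for each line of input, emit (word, 1) for every word on the line.
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, ONE);
        }
    }
}

// Reducer: Hadoop groups the (word, 1) pairs by word; sum the counts for each word.
class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

A small driver would then configure a JobConf with these two classes and the input and output paths and submit the job; data distribution across HDFS, scheduling of tasks, and recovery from failed nodes are all handled by the framework.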
Mastering distributed parallel programming is essential for the programmer of the future, and Hadoop makes it so simple and approachable that it is well worth trying. Perhaps you are impatient to write something based on Hadoop right away, but this programming model differs greatly from traditional sequential programs, and a little background knowledge makes it much easier to understand how Hadoop-based distributed parallel programs are written and run. This article therefore first introduces the MapReduce computing model, the distributed file system HDFS, and how Hadoop achieves parallel computing, and then describes how to install and deploy the Hadoop framework and how to run Hadoop programs.