"Book pick" large data development of the first knowledge of Hadoop

Source: Internet
Author: User
Keywords: cloud computing, big data, Hadoop
This article is excerpted from Hadoop: The Definitive Guide by Tom White, published in Chinese by Tsinghua University Press and translated by the School of Data Science and Engineering, East China Normal University. The book starts from the origins of Hadoop and combines theory with practice to present Hadoop as an ideal tool for high-performance processing of massive datasets. It consists of 16 chapters and 3 appendices, covering: an introduction to Hadoop; MapReduce; the Hadoop Distributed File System; Hadoop I/O; MapReduce application development; how MapReduce works; MapReduce types and formats; MapReduce features; how to build and manage a Hadoop cluster; Pig; HBase; Hive; ZooKeeper; the open-source tool Sqoop; and, finally, a rich set of case studies. It is an authoritative Hadoop reference from which programmers can learn how to analyze massive datasets and administrators can learn how to install and run a Hadoop cluster.

The following is the full content of Chapter 1:

1.1 Data! Data!

1.2 Data storage and analysis

1.3 Advantages over other systems

1.4 A brief history of Hadoop

1.5 Apache Hadoop and the Hadoop ecosystem

1.6 Hadoop Releases

Chapter 1: Meet Hadoop

In pioneer days, people used oxen for heavy pulling. When one ox couldn't budge a log, nobody tried to raise a larger ox. Similarly, we shouldn't be trying to build bigger supercomputers, but should do everything possible to harness more computers to solve the problem.

--Grace Hopper

1.1 Data! Data!

We live in an era of data explosion, and it is difficult to estimate the total volume of data stored in the world's electronic devices. International Data Corporation (IDC) published a report in which its Digital Universe project put the total amount of global data in 2006 at 0.18 ZB, and predicted it would reach 1.8 ZB by 2011. 1 ZB equals 10^21 bytes, that is 1,000 EB (exabytes), or 1,000,000 PB (petabytes), or the more familiar 1 billion TB (terabytes). That is roughly the equivalent of one hard drive's worth of data for every person in the world!

This flood of data comes from many sources. Consider the examples below:

The New York Stock Exchange generates about 1 TB of new trade data each day. Facebook stores roughly 10 billion photos, taking up about 1 PB of storage. The genealogy site Ancestry.com stores about 2.5 PB of data. The Internet Archive stores about 2 PB of data and grows by terabytes each month. The Large Hadron Collider near Geneva, Switzerland, produces petabytes of data each year.

There is plenty of other data out there. But how might it affect you? Most of it, as everyone knows, is locked up inside the largest internet companies, such as search engine providers, or inside scientific and financial institutions. Does the advent of so-called "big data" affect smaller organizations or individuals at all?

I believe it does. Take photos, for example. My wife's grandfather was a dyed-in-the-wool photography enthusiast who took pictures throughout his adult life. His entire collection of ordinary film, slides, and 35mm film, scanned in at high resolution, comes to about 10 GB. By contrast, the digital photos my family took in 2008 alone take up about 5 GB. My family is producing photographic data at 35 times his rate, and the rate keeps growing, because it is getting ever easier to take pictures.

The more general trend is that data generated by individuals is growing rapidly. Microsoft Research's MyLifeBits project (http://research.microsoft.com/enus/projects/mylifebits/default.aspx) gives a glimpse of the personal information archives that may become commonplace in the near future. The MyLifeBits experiment captured and stored an individual's interactions, including phone calls, emails, and documents, for later access. The data gathered included a photo taken every minute, which amounted to about 1 GB per month. As storage costs fall far enough to make saving continuous audio and video feasible, a future MyLifeBits-style service will store many times that volume of data.

Keeping all the data a person generates over a lifetime may be becoming mainstream, but more significantly, machines may produce far more data than we do personally. Machine logs, RFID readers, sensor networks, vehicle GPS traces, retail transaction records: all of these contribute to the growing mountain of data.

The volume of data published on the Internet is also increasing every year. To succeed in the future, organizations and enterprises will need not only to manage their own data, but also to extract valuable information from the data of other organizations.

The vanguard here are Amazon Web Services (http://aws.amazon.com/publicdatasets), Infochimps.org (http://infochimps.org/), and theinfo.org (http://theinfo.org), whose published shared datasets are fostering an information commons that anyone can download and analyze freely (or, in the case of the AWS platform, share for a reasonable price). Information from different sources, mashed up and processed together, can yield applications and effects that we cannot imagine today.

Take Astrometry.net (http://astrometry.net) as an example. It watches the Astrometry group on Flickr for new photos of the night sky, analyzes each image, and identifies which part of the sky it comes from and which celestial bodies (such as stars and galaxies) it contains. Although the project is still experimental, it shows what becomes possible when enough data (here, tagged photo data) is available: follow-on uses of the data may go well beyond anything the people who took the photos originally had in mind (image analysis, in this case).

There is a well-known saying: "More data usually beats better algorithms." It means that for some applications (such as recommending movies or music based on past preferences), however clever your algorithm is, recommendations based on a small amount of data are often worse than recommendations based on a large amount of available data.

The good news, then, is that we have a lot of data. The bad news is that we have to find ways to store and analyze it well.

1.2 Data storage and analysis

The problem we face is simple: while hard drive storage capacity has increased enormously over the years, access speed (the rate at which data can be read from a drive) has not kept pace. A typical drive from 1990 could store 1,370 MB of data and transfer it at 4.4 MB/s, so you could read the whole drive in about five minutes. Twenty years later, 1 TB drives are the norm, but the transfer rate is around 100 MB/s, so reading all the data off the drive takes at least 2.5 hours.

Reading all the data from a whole drive takes a long time, and writing is even slower. An obvious way to reduce the read time is to read from multiple drives at once. Imagine 100 drives, each holding one hundredth of the data: working in parallel, we could read all the data in under two minutes.
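As a rough back-of-the-envelope check (using only the transfer rates quoted above and ignoring seek time and any coordination overhead):

Reading 1 TB from a single drive: 1,000,000 MB / 100 MB/s = 10,000 s, roughly 2.8 hours.
Reading 10 GB from each of 100 drives in parallel: 10,000 MB / 100 MB/s = 100 s, well under two minutes.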

Using only one hundredth of each disk may seem wasteful. But we can store one hundred datasets, each of 1 TB, and provide shared access to them. Users of such a system would presumably be happy to trade shared access for shorter analysis times, and, statistically speaking, their analysis jobs would be spread over time, so they would not interfere with each other too much.

However, reading and writing data in parallel across many disks raises further problems. The first is hardware failure: as soon as you use many pieces of hardware, the chance that one of them fails becomes quite high. The most common way to avoid losing data is replication: the system keeps redundant copies (replicas) of the data, so that if one unit fails, another copy is still available. This is how redundant arrays of independent disks (RAID) work, and Hadoop's filesystem, HDFS (Hadoop Distributed File System), belongs to the same family, although it takes a slightly different approach, as described later.

The second problem is that most analysis tasks need to combine the data in some way: data read from one disk may need to be combined with data read from any of the other 99 disks. Various distributed systems allow data from different sources to be combined, but doing so correctly is notoriously challenging. MapReduce offers a programming model that abstracts away the underlying disk reads and writes, turning them into a computation over a dataset made up of key-value pairs. The model, discussed in detail later in the book, has two parts, map and reduce, and it is at the interface between the two that the combining takes place. Like HDFS, MapReduce has reliability built in.

In short, Hadoop provides a reliable system for shared storage and analysis: HDFS provides the storage, and MapReduce provides the analysis and processing. Hadoop has other parts as well, but HDFS and MapReduce are its core.

1.3 Advantages over other systems

MapReduce may seem like a brute-force approach: each query processes the entire dataset, or at least a good portion of it. But that is precisely its power. MapReduce is a batch query processor that can run an ad hoc query against the whole dataset and return results in a reasonable time. It changes the way we think about data and frees data that was previously locked away on tape or disk. It gives us the opportunity to innovate with data. Questions that used to take too long to answer can now be answered quickly, which in turn leads to new questions and new insights.

For example, Mailtrust, Rackspace's mail division, uses Hadoop to process email logs. One ad hoc query they wrote finds the geographic distribution of their users. In their words: "This data is very useful; we run the MapReduce job once a month to help us decide which Rackspace data centers need to add new mail servers."

By bringing several hundred gigabytes of data together and analyzing it with MapReduce, Rackspace's engineers gained insights into the data that they had never had before, and were even able to use what they learned to improve the existing service. Chapter 16 describes in detail how Hadoop is used inside Rackspace.

1.3.1 Relational database management system

Why can't we use databases with lots of hard drives to do large-scale data analysis? Why do we need MapReduce?

The answer to these questions comes from another trend in hard drives: seek time is improving much more slowly than transfer rate. Seeking is the process of moving the disk head to a particular location on the disk to read or write data. It is the main cause of latency in disk operations, whereas the transfer rate corresponds to the disk's bandwidth.

If the data access pattern is dominated by seeks, reading or writing a large portion of the dataset takes much longer than streaming through it, which operates at the transfer rate. On the other hand, for updating a small proportion of the records, a traditional B-tree (the data structure used in relational databases, which is limited by the rate at which it can perform seeks) works well. For updating the majority of a database, however, a B-tree is significantly less efficient than MapReduce, which uses sort/merge to rebuild the database.

In many cases, MapReduce can be viewed as a complement to a relational database management system. The differences between the two systems are shown in table 1-1.

MapReduce is a good fit for problems that need to analyze the whole dataset in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries and updates: once the dataset has been indexed, it can deliver low-latency retrieval and fast updates of small amounts of data. MapReduce suits applications where the data is written once and read many times, whereas a relational database is better for datasets that are continually updated.

Table 1-1. Comparison of relational database and MapReduce

                     Traditional relational database    MapReduce
Data size            Gigabytes                          Petabytes
Access               Interactive and batch              Batch
Updates              Read and write many times          Write once, read many times
Structure            Static schema                      Dynamic schema
Integrity            High                               Low
Scaling              Nonlinear                          Linear

Another difference between MapReduce and relational databases is the degree of structure in the datasets they operate on. Structured data is data organized into entities with a defined format, such as XML documents or database tables that conform to a particular predefined schema. This is the realm of the RDBMS. Semi-structured data, on the other hand, is looser: although there may be a schema, it is often ignored, so it serves only as a general guide to the structure of the data. A spreadsheet is an example: its structure is the grid of cells, but each cell may hold any form of data. Unstructured data has no particular internal structure at all, for example plain text or image data. MapReduce works very well on unstructured or semi-structured data, because it interprets the data at processing time. In other words, the keys and values MapReduce takes as input are not an intrinsic property of the data; they are chosen by the person analyzing the data.

Relational data is often normalized to preserve its integrity and remove redundancy. Normalization poses a problem for MapReduce because it makes reading a record a non-local operation, and one of MapReduce's central assumptions is that (high-speed) streaming reads and writes are possible.

A web server log is a typical example of non-normalized data (for example, the client hostname is recorded in full every time, so the same client's hostname may appear many times), and this is one reason why MapReduce is particularly well suited to analyzing all kinds of log files.

MapReduce is a linearly scalable programming model. The programmer writes two functions, a map function and a reduce function, each of which defines a mapping from one set of key-value pairs to another. These functions are oblivious to the size of the data or the cluster they are operating on, so they can be applied unchanged to small datasets or very large ones. More importantly, if you double the size of the input data, a job will take twice as long to run; but if you also double the size of the cluster, the job will run as fast as the original one. SQL queries generally do not behave this way.
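To make the model concrete, here is a minimal word-count sketch against Hadoop's newer Java MapReduce API (the org.apache.hadoop.mapreduce package). It is an illustrative example rather than the book's own code, and class names such as WordCountMapper and WordCountReducer are ours: the map function turns each line of text into (word, 1) pairs, and the reduce function sums the counts for each word.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative class names; the map emits (word, 1) for every word in a line of input.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // key is the byte offset of the line within the file; value is the line itself.
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }
}

// The reduce function receives every count emitted for a given word and adds them up.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : values) {
      sum += count.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

Neither function knows or cares whether the input is a single file or many terabytes spread across a cluster, which is exactly the property the paragraph above describes.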

Over time, however, the differences between relational database systems and MapReduce systems are likely to blur. Relational databases have started to incorporate some of MapReduce's ideas (Aster Data and Greenplum, for example), and, from the other direction, higher-level query languages built on MapReduce (such as Pig and Hive) make MapReduce systems more approachable for programmers used to traditional databases.

1.3.2 Grid Computing

The High-Performance Computing (HPC) and grid computing communities have been doing large-scale data processing for many years, typically using APIs such as the Message Passing Interface (MPI). Broadly speaking, the HPC approach is to distribute the work across a cluster of machines that access a shared filesystem hosted on a storage area network (SAN). This works well for predominantly compute-intensive jobs, but when the nodes need to access large volumes of data (hundreds of gigabytes, the point at which MapReduce really starts to shine), many compute nodes end up sitting idle, waiting on the network bandwidth bottleneck.

MapReduce tries, as far as possible, to store the data on the compute nodes, so that data access is fast because it is local. This data locality property is at the heart of MapReduce and is the reason for its good performance. Recognizing that network bandwidth is the most precious resource in a data center environment (it is easy to saturate network links by copying data around), MapReduce conserves it by explicitly modelling the network topology. Note that this arrangement does not prevent MapReduce from running compute-intensive analyses.

Although MPI gives programmers a great deal of control, it also requires them to explicitly handle the mechanics of the data flow, via low-level C routines and constructs such as sockets, as well as the higher-level analysis algorithm. MapReduce, by contrast, operates only at the higher level: the programmer thinks in terms of functions over key-value pairs, and the data flow is implicit.

Coordinating the processes in a large-scale distributed computation is a big challenge. The hardest part is gracefully handling partial failure of the system, when you do not know whether a remote process has died, while still making progress with the overall computation. With MapReduce, the programmer does not have to worry about partial failure, because the implementation itself detects failed map or reduce tasks and reruns them. MapReduce can do this because it is a shared-nothing architecture: tasks have no dependence on one another, so from the programmer's point of view the order in which tasks run does not matter. MPI programs, by contrast, must explicitly manage their own checkpointing and recovery, which gives the programmer more control but makes the programs harder to write.

MapReduce may sound like quite a restrictive programming model, and in a sense it is: users are limited to key and value types that are related in specified ways, and mappers and reducers run with very limited coordination between them (each mapper passes key-value pairs on to the reducers). A natural question to ask is: can you do anything useful or nontrivial with such a model?

The answer is yes. MapReduce was developed by Google's engineers to build their production search index, because they found themselves solving the same kind of problem over and over again (MapReduce drew its inspiration from older ideas in the functional programming, distributed computing, and database communities), and since then the model has found many other applications in many other industries. It is pleasantly surprising how many algorithms can be expressed in MapReduce, from image analysis, to graph-based problems, to machine learning algorithms. It is not a panacea and cannot solve every problem, but it is a genuinely general-purpose data-processing tool.

Chapter 16 introduces some typical applications of Hadoop.

1.3.3 Volunteer Computing

When people first hear about Hadoop and MapReduce, they often ask: "How are they different from SETI@home?" SETI, the Search for Extra-Terrestrial Intelligence, runs a project called SETI@home (http://setiathome.berkeley.edu) in which volunteers donate the idle CPU time of their computers to analyze radio telescope data in the search for signals of intelligent life beyond Earth. SETI@home is famous for its large volunteer base, as are the Great Internet Mersenne Prime Search (looking for large prime numbers) and Folding@home (understanding protein folding and its relationship to disease).

Volunteer computing projects split the problem they are trying to solve into chunks called work units, which are sent to computers around the world to be analyzed. A SETI@home work unit, for example, is about 0.35 MB of radio telescope data, and takes an ordinary computer hours or days to analyze. When the analysis is complete, the result is sent back to the server and the client fetches another work unit. To guard against cheating, each work unit is sent to three different machines, and a result is accepted only when at least two of the returned results agree.

On the surface, SETI@home may look much like MapReduce (both break a problem into independent chunks that are then worked on in parallel), but there are significant differences. The SETI@home problem is highly CPU-intensive, which makes it well suited to running on tens of thousands of computers around the world, because the time to transfer a work unit is dwarfed by the time needed to compute on it. In other words, volunteers are donating CPU cycles, not network bandwidth.

MapReduce has three major design assumptions: (1) jobs complete in minutes or hours; (2) jobs run inside a single data center with high-speed network interconnects; (3) the computers in that data center are reliable, dedicated hardware. SETI@home, by contrast, runs its perpetual computation on untrusted machines connected over the Internet, with highly variable network bandwidth and no possibility of data locality.

1.4 A brief history of Hadoop

Hadoop was created by Doug Cutting, the founder of Apache Lucene, a widely used text search library. Hadoop grew out of Apache Nutch, an open-source web search engine that was itself part of the Lucene project.

The name Hadoop is not an acronym; it is a made-up word. Doug Cutting, the father of Hadoop, explains how the name came about: "The name is what my kid gave a stuffed yellow elephant toy. Short, easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at coming up with such names. Googol is a kid's term." Hadoop's subprojects and later modules also tend to have names that are unrelated to their function, often with an elephant or other animal theme (Pig, for example). Smaller components usually get more descriptive (and therefore more mundane) names. This is a good principle, because it means you can generally work out what something does from its name. For example, the jobtracker [in this book we use the lowercase "jobtracker" to refer to the entity in general, and the CamelCase "JobTracker" to refer to the Java class that implements it] keeps track of MapReduce jobs.

Building a web search engine from scratch is an ambitious goal, not only because writing the crawler and indexer software is complex, but also because such a system is hard to run without a dedicated operations team, since it has so many moving parts that need attention at any time. It is also expensive: Mike Cafarella and Doug Cutting estimated that a system indexing 1 billion pages would cost around $500,000 in hardware alone, with monthly running costs of about $30,000. [Mike Cafarella and Doug Cutting, "Building Nutch: Open Source Search," ACM Queue, April 2004, http://queue.acm.org/detail.cfm?id=988408.] Nevertheless, they believed the work was worth investing in, because it would create a platform on which search engine algorithms could be improved.

In 2004, Google published a paper that introduced its MapReduce system to the world. Early in 2005, the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year all of Nutch's major algorithms had been ported to run using MapReduce and NDFS (the Nutch Distributed File System).

Nutch's NDFS and MapReduce implementations were applicable beyond the realm of search, so in February 2006 the developers moved them out of Nutch to form an independent subproject of Lucene called Hadoop. At around the same time, Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to develop Hadoop into a system that could process web-scale data (see the sidebar "Hadoop at Yahoo!" later in this chapter). In February 2008, Yahoo! announced that its production search index was built on a Hadoop cluster with 10,000 cores. [See the February 19, 2008 post "Yahoo! Launches World's Largest Hadoop Production Application," http://developer.yahoo.com/blogs/hadoop/posts/2008/02/yahoo-worlds-largest-production-hadoop/.]

In January 2008, Hadoop became a top-level project at Apache, confirming its success, its diversity, and its active community. By that time Hadoop was being used by many companies besides Yahoo!, such as Last.fm, Facebook, and the New York Times. Chapter 16 and the Hadoop wiki (in English) describe some of these cases (http://wiki.apache.org/hadoop/PoweredBy).

One widely publicized example came from the New York Times, which used Amazon's EC2 compute cloud to crunch through 4 TB of scanned archives from 1851 to 1980, converting them to PDF files for online sharing. [See Derek Gottfrid's post of November 1, 2007, "Self-Service, Prorated Super Computing Fun!", http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/.] The whole run used 100 machines and took less than 24 hours. Without Amazon's pay-by-the-hour model, which gave the New York Times short-term access to a large number of machines, and Hadoop's easy-to-use parallel programming model, the project would probably never have been started, let alone finished so quickly.

In April 2008, Hadoop broke a world record to become the fastest system to sort a terabyte of data: running on a 910-node cluster, Hadoop sorted 1 TB in 209 seconds (just under 3.5 minutes), beating the previous year's winning time of 297 seconds (see the supplementary material in Section 15.5, "TB-level data processing in Apache Hadoop"). In November of the same year, Google reported that its MapReduce implementation had sorted 1 TB in just 68 seconds. [See the November 21, 2008 post "Sorting 1PB with MapReduce," http://googleblog.blogspot.com/2008/11/sorting-1pb-with-mapreduce.html.] Around the time the first edition of this book was published in May 2009, it was reported that a team at Yahoo! had used Hadoop to sort 1 TB in 62 seconds.

Since then, Hadoop has become a mainstream enterprise deployment platform. In industry it is now recognized as the standard platform for big data storage and analysis, as shown by the large number of products that use Hadoop directly or build on it. Hadoop distributions are offered both by large established companies, including EMC, IBM, Microsoft, and Oracle, and by companies that specialize in Hadoop, such as Cloudera, Hortonworks [Editor's note: Hortonworks was founded by several of Yahoo!'s core Hadoop developers and mainly provides Hadoop support and consulting services; in 2011 it formed a strategic partnership with Microsoft to help port Hadoop to Windows Server and Azure.], and MapR.

Hadoop at Yahoo!





Author: Owen O'Malley





Building an Internet-scale search engine requires huge amounts of data, and therefore large numbers of machines to process it. Yahoo! Search consists of four primary components: the Crawler, which downloads pages from web servers; the WebMap, which builds a link graph of the known web; the Indexer, which builds a reverse index to the best pages; and the Runtime, which answers users' queries. The link graph produced by the WebMap is very large, with roughly 1 trillion (10^12) edges (each representing a web link) and 100 billion (10^11) nodes (each representing a distinct URL). Creating and analyzing such a large graph requires a large number of computers running for a long time. By early 2005, the WebMap's underlying infrastructure, Dreadnaught, needed to be redesigned so that it could later scale out to more nodes.





Dreadnaught had successfully scaled from 20 to 600 nodes, but needed a complete redesign to scale out further. Dreadnaught is similar to MapReduce in many respects, but it is more flexible and less structured. In particular, each fragment (also called a "chunk") of a Dreadnaught job could send output to any of the fragments in the next stage of the job, and sorting was done in library code. In practice, though, most of the WebMap phases came in pairs that corresponded to MapReduce, so the WebMap applications would not need extensive refactoring to fit into MapReduce.





Eric Baldeschwieler (Eric14) put together a small team, and we started designing and prototyping a new framework, written in C++ and modeled on GFS and MapReduce, with the intention of replacing Dreadnaught. Although the immediate need was a new framework for the WebMap, it was even clearer that standardizing the batch platform across Yahoo! Search was more important to us, and that by making the platform general enough to support other users we could get better leverage from the investment in it.





At the same time, we were watching the progress of Hadoop, which was then still part of Nutch. In January 2006, Yahoo! hired Doug Cutting, and a month later we decided to abandon our prototype and adopt Hadoop instead. The advantage of Hadoop over our prototype and designs was that it was already working with a real application (Nutch) on 20 nodes. That allowed us to bring up a research cluster two months later and start helping real customers use the new framework much sooner than we otherwise could have. Another notable advantage was that Hadoop was already open source, which made it easier (though far from easy, as you might imagine!) to get permission from Yahoo!'s legal department to work on an open-source system. So we set up a 200-node research cluster in early 2006, put the WebMap plans on hold, and instead provided Hadoop support and improvements for the research users.





Hadoop milestones





2004: Doug Cutting and Mike Cafarella implement the first version of HDFS and MapReduce.
December 2005: Nutch is ported to the new framework; Hadoop runs reliably on 20 nodes.
January 2006: Doug Cutting joins Yahoo!.
February 2006: The Apache Hadoop project is officially launched, supporting the standalone development of MapReduce and HDFS.
February 2006: Yahoo!'s grid computing team adopts Hadoop.
April 2006: The sort benchmark (10 GB per node) runs on 188 nodes in 47.9 hours.
May 2006: Yahoo! sets up a 300-node Hadoop research cluster.
May 2006: The sort benchmark runs on 500 nodes in 42 hours (on better hardware than in April).
November 2006: The research cluster grows to 600 nodes.
December 2006: The sort benchmark runs in 1.8 hours on 20 nodes, 3.3 hours on 100 nodes, 5.2 hours on 500 nodes, and 7.8 hours on 900 nodes.
January 2007: The research cluster grows to 900 nodes.
April 2007: The research cluster grows to two clusters of 1,000 nodes each.
April 2008: The 1 TB sort benchmark runs in just 209 seconds on 900 nodes, a world record.
October 2008: The research cluster loads 10 TB of data per day.
March 2009: There are 17 clusters with a total of 24,000 nodes.
April 2009: Hadoop wins the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes), and sorts 100 TB of data in 173 minutes (on 3,400 nodes).


1.5 Apache Hadoop and the Hadoop ecosystem

Although Hadoop is best known for MapReduce and its distributed filesystem (HDFS, which evolved from NDFS), the name Hadoop is also used for a family of related projects that build on this infrastructure for distributed computing and large-scale data processing.

Most of the core projects covered in this book are hosted by the Apache Software Foundation (http://hadoop.apache.org/), which supports a community of open-source projects, including the original HTTP Server from which it takes its name. As the Hadoop ecosystem grows, more and more new projects are appearing, not all of them hosted at Apache, that complement Hadoop or build on it to provide higher-level abstractions.

Here is a brief description of the Hadoop projects mentioned in this book:

Common: a set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, and persistent data structures).
Avro: a serialization system for efficient, cross-language RPC and persistent data storage.
MapReduce: a distributed data processing model and execution environment that runs on large clusters of commodity machines.
HDFS: a distributed filesystem that runs on large clusters of commodity machines.
Pig: a dataflow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
Hive: a distributed data warehouse. Hive manages data stored in HDFS and provides an SQL-based query language (translated by the runtime engine into MapReduce jobs) for querying it.
HBase: a distributed, column-oriented database. HBase uses HDFS for its underlying storage and supports both MapReduce-style batch computation and point queries (random reads).
ZooKeeper: a distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks for building distributed applications.
Sqoop: a tool for efficiently moving data in bulk between structured data stores (such as relational databases) and HDFS.
Oozie: a service for running and scheduling workflows of Hadoop jobs (including MapReduce, Pig, Hive, and Sqoop jobs).

1.6 Hadoop Releases

Which version of Hadoop should you use? The answer changes over time, of course, and also depends on the features you need. This section summarizes the features of the current Hadoop release series.

There are a few active release series. The 1.x series is a continuation of the 0.20 series and contains the most stable versions of Hadoop currently available. This series includes secure Kerberos authentication, which prevents unauthorized users from accessing Hadoop data (see the security material in Chapter 9). Almost all production clusters run these releases or versions derived from them (commercial distributions, for example).

The 0.22 and 2.x release series [as this book went to press, the Hadoop community voted to rename the 0.23 release series to 2.x; the shorthand "releases after 1.x" used in this book refers to the 0.22 and 2.x (formerly 0.23) series] are not yet stable (as of early 2012), but this is likely to have changed by the time you read this, as these versions undergo more and more real-world testing (consult the Apache Hadoop releases page for the latest status). The 2.x series includes the following major new features.

A new MapReduce runtime, called MapReduce 2, built on a new system called YARN (Yet Another Resource Negotiator), a general resource manager for running distributed applications. MapReduce 2 replaces the "classic" runtime of previous releases; see Section 6.1.2 for details.
HDFS federation, which partitions the HDFS namespace across multiple namenodes to support clusters with very large numbers of files; see Section 3.2.3.
HDFS high availability, which provides a standby namenode to take over when the active namenode fails, removing the namenode as a single point of failure; see Section 3.2.4.

Table 1-2 covers features of HDFS and MapReduce only. Other projects in the Hadoop ecosystem are evolving continually as well, and picking a combination of components that work well together can be a challenge. Fortunately, we don't have to do this work ourselves. The Apache Bigtop project (http://incubator.apache.org/bigtop/) runs interoperability tests on stacks of Hadoop components and provides Linux packages (RPMs and Debian packages) for easy installation. There are also vendors that offer distributions containing suites of compatible Hadoop components.

Table 1-2. Features supported by the Hadoop release series

Feature                           1.x                              0.22          2.x
Secure authentication             Yes                              No            Yes
Old configuration names           Yes                              Deprecated    Deprecated
New configuration names           No                               Yes           Yes
Old MapReduce API                 Yes                              Yes           Yes
New MapReduce API                 Yes (a few libraries missing)    Yes           Yes
MapReduce 1 runtime (Classic)     Yes                              Yes           No
MapReduce 2 runtime (YARN)        No                               No            Yes
HDFS federation                   No                               No            Yes
HDFS high availability            No                               No            Yes

1.6.1 What this book covers


This book covers all the releases in Table 1-2. Features that are available only in particular releases are called out as such.

The code in this book runs on any of these releases, except for a small number of examples that we explicitly note as release-specific. The sample code available from the book's companion website has been tested against all the releases listed in the table.

1. Configuration names

Configuration property names have changed in the releases after 1.x so that they follow a more regular naming structure. For example, the HDFS properties relating to the namenode now carry a dfs.namenode prefix, so dfs.name.dir has become dfs.namenode.name.dir. Similarly, MapReduce properties now use a mapreduce prefix rather than the old mapred prefix, so mapred.job.name has become mapreduce.job.name.

This book uses the older (now deprecated) property names for properties that already exist in the 1.x releases, since those names still work in all the Hadoop releases listed in the table. If you are using a release after 1.x, you may wish to use the new property names in your configuration files to avoid deprecation warnings. The Hadoop website (http://hadoop.apache.org/common/docs/r0.23.0/hadoop-project-dist/hadoop-common/DeprecatedProperties.html) lists the deprecated property names and their replacements.
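As an illustration of the renaming (a minimal sketch, with hypothetical paths and job names that are not taken from the book), the same properties can also be set programmatically through Hadoop's Configuration class:

import org.apache.hadoop.conf.Configuration;

public class ConfigNames {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Post-1.x names carry a component prefix; the values here are made up for illustration.
    conf.set("dfs.namenode.name.dir", "/data/dfs/name");  // formerly dfs.name.dir
    conf.set("mapreduce.job.name", "log-analysis");       // formerly mapred.job.name
    // On releases after 1.x, reading via a deprecated name still works but logs a warning.
    System.out.println(conf.get("mapreduce.job.name"));
  }
}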

2. MapReduce API

As described in the supplementary material in Section 2.4.1, "The old and the new Java MapReduce APIs," Hadoop provides two Java MapReduce APIs. This book uses the new API for its sample code, although a few of the examples rely on libraries that are not available in the 1.x releases; the rest run on all the releases covered in this book. Versions of the sample code written against the old API (in the oldapi package) can be downloaded from the book's companion website.
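For completeness, a minimal new-API driver sketch for the two hypothetical classes shown earlier (WordCountMapper and WordCountReducer) might look like the following; Job.getInstance is the 2.x-style factory, while on 1.x the older, now-deprecated new Job(conf) constructor plays the same role:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output paths are supplied on the command line.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}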

There are essential differences between the two APIs, which are described in detail later in the book.

1.6.2 Compatibility

When moving from one Hadoop release to another, you need to think carefully about the upgrade steps involved. There are several aspects to consider: API compatibility, data compatibility, and wire (connection) compatibility.

API compatibility concerns the contract between user code and the published Hadoop APIs, such as the Java MapReduce API. Major releases (for example, from 1.x.y to 2.0.0) are allowed to break API compatibility, so user programs may need to be modified and recompiled. Minor releases (for example, from 1.0.x to 1.1.0) and point releases (for example, from 1.0.1 to 1.0.2) should not break compatibility.

Hadoop uses a classification scheme for API elements to indicate their stability. The compatibility rules above apply to elements marked InterfaceStability.Stable. Some elements of the public Hadoop API, however, are marked InterfaceStability.Evolving or InterfaceStability.Unstable (all of these annotations live in the org.apache.hadoop.classification package), which means they are allowed to break compatibility on minor and point releases, respectively.
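As a small illustration (a hypothetical class of our own, not Hadoop code), this is how such stability markers appear on an API element:

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;

// Stable elements may only break compatibility at major releases; Evolving and Unstable
// elements may change at minor and point releases, respectively.
@InterfaceAudience.Public
@InterfaceStability.Stable
public class ExampleClientApi {
  // ... methods that user code may rely on across minor and point releases ...
}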

Data compatibility concerns the formats of persistent data and metadata, such as the format in which the HDFS namenode stores its persistent data. These formats may change between major or minor releases, but such changes are transparent to users because the data is migrated automatically when the system is upgraded. There may be some restrictions on the upgrade path, which are covered in the release notes; for example, it may be necessary to upgrade via an intermediate release rather than going directly to the latest version in one step. Upgrades are discussed in detail in Section 10.3.3, "Upgrades."

Wire (connection) compatibility concerns the interoperability of clients and servers over wire protocols such as RPC and HTTP. There are two kinds of client: external clients (run by users) and internal clients (run on the cluster as part of the system, such as the datanode and tasktracker daemons). In general, internal clients must be upgraded in lockstep; an older tasktracker will not work with a newer jobtracker, for example. Rolling upgrades may be supported in the future, in which case the cluster daemons could be upgraded in phases so that the cluster remained available to external clients during the upgrade.

For external clients run by the user (for example, a program that reads or writes files in HDFS, or the MapReduce job submission client), the client must have the same major version number as the server, but may have a lower minor or point release number (for example, client version 1.0.1 will work with server 1.0.2 or 1.1.0, but not necessarily with server 2.0.0). Any exceptions to this rule are described in the release notes.
