Getting Started with Hadoop learning notes---part1

Tags: hadoop, mapreduce

With the graduation project, four years of university have officially come to an end. The topic of this last assignment of my university career has finally been settled after much back and forth. Regardless of which topic was chosen, and regardless of how it eventually turns out, I hope to do my best on this final piece of university work. It is because of this topic that I came into contact with Hadoop, and I have now formally started learning it. I am organizing my notes here, and I hope to exchange ideas with like-minded friends.

itred    Email: [email protected]    Personal blog: http://www.cnblogs.com/itred

Enough chatter; let's get to the point and start learning Hadoop.

The main tasks of Part 1 are to get an overview of the Hadoop project, understand the terminology associated with Hadoop, understand Hadoop's deployment modes, and prepare the environment for a pseudo-distributed cluster.

One. Introduction to the Hadoop project:

This is the Hadoop logo, and the Hadoop mascot is a little elephant, which was actually a child's stuffed toy. If you are interested in the origin of the name Hadoop, you can look it up on Baidu.

An introduction like this is only a starting point; I am sure no write-up can compare with the official Hadoop documentation, which is published on the Hadoop website: http://hadoop.apache.org/, so reading the official documentation is the best way to really understand the technology. Here I will just give my own understanding. Hadoop is a distributed system infrastructure that lets developers write distributed programs and applications without having to understand the underlying distributed details, while making full use of the cluster's capacity for high-speed computation and storage. In a word, it is a distributed storage and computing platform suited to big data.

The core of Hadoop's design is HDFS (Hadoop Distributed File System) and MapReduce; I will explain these terms below. HDFS provides storage for massive amounts of data, and MapReduce provides computation over massive amounts of data.

The main modules included in the Hadoop project are:

Hadoop Common: the common utilities that support the other Hadoop modules;

HDFS: a distributed file system that provides high-throughput access to application data;

Hadoop YARN: a framework for job scheduling and cluster resource management;

Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.

Two. Understanding key Hadoop terminology:

HDFS (Hadoop Distributed File System): its most basic storage unit is the data block, which defaults to 64 MB. In HDFS, if a file is smaller than one data block, it does not occupy the entire block's storage space.
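Once a cluster is actually running (installation comes in part 2), one way to see this for yourself is to upload a small file and inspect its blocks; the path /user/itred/sample.txt below is only a hypothetical example:

hadoop fs -put sample.txt /user/itred/sample.txt     # upload a small local file into HDFS
hadoop fsck /user/itred/sample.txt -files -blocks    # list the blocks backing the file

A file smaller than 64 MB shows up as a single block whose size equals the file size, not the full 64 MB.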

MapReduce: a programming model for parallel computation over large data sets (larger than 1 TB). The concepts of "Map" and "Reduce", and their main ideas, are borrowed from functional programming languages, along with features borrowed from vector programming languages. It makes it much easier for programmers to run their own programs on a distributed system without having to know distributed parallel programming. In the current software implementation, you specify a Map function that maps a set of key-value pairs to a new set of intermediate key-value pairs, and a concurrent Reduce function that merges all of the intermediate values that share the same key.
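To make the map/shuffle/reduce idea concrete, here is a minimal word-count sketch written as an ordinary shell pipeline; it is only an analogy for the programming model (Hadoop runs the equivalent phases in parallel across the cluster), and input.txt is just a hypothetical local file of space-separated words:

cat input.txt | tr -s ' ' '\n' | awk '{print $0 "\t1"}' > mapped.txt                    # map: emit one (word, 1) pair per word
sort mapped.txt > shuffled.txt                                                          # shuffle: group the pairs by key
awk -F '\t' '{count[$1] += $2} END {for (w in count) print w, count[w]}' shuffled.txt   # reduce: sum the counts for each key

In real Hadoop the same three roles are played by the user's Map function, the framework's shuffle/sort phase, and the user's Reduce function, with each phase spread over many machines.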

Three. Hadoop's deployment modes fall into three main types:

Local (standalone) mode: Hadoop uses the local file system instead of a distributed file system and does not start any Hadoop daemons; the map and reduce tasks run as parts of a single process. As a result, Hadoop in local mode runs only on the local machine. This mode is used only to develop or debug a MapReduce application, while avoiding the extra complexity that comes later.

Pseudo-distributed mode: Hadoop runs all of its processes on the same host, but it now uses the distributed file system, and jobs are separate processes managed by the JobTracker service. Because a pseudo-distributed Hadoop cluster has only one node, HDFS block replication is limited to a single copy, and its secondary master and slave daemons also run on the local host. Apart from not being truly distributed, this mode's program execution logic is entirely similar to that of a fully distributed cluster, so it is often used by developers to test program execution (a configuration sketch follows the three modes below).

Cluster (fully distributed) mode: the Hadoop daemons run on a cluster of machines.
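As a preview of the setup in part 2, the sketch below shows the kind of minimal configuration a pseudo-distributed Hadoop 1.x cluster typically uses; the $HADOOP_HOME/conf path, the localhost addresses and the port numbers are assumptions for illustration, not values taken from these notes:

# Assumed Hadoop 1.x layout: configuration files live under $HADOOP_HOME/conf.
# core-site.xml   usually sets fs.default.name   = hdfs://localhost:9000  (NameNode address)
# mapred-site.xml usually sets mapred.job.tracker = localhost:9001        (JobTracker address)
# hdfs-site.xml sets the single-copy block replication described above:
cat > $HADOOP_HOME/conf/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF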

Four. Preparing the environment for installing Hadoop (pseudo-distributed mode): build the Hadoop cluster environment in pseudo-distributed mode.

Prepare the Software:

Install a Linux system in a virtual machine;

JDK (version selected according to your actual situation)

When installing the JDK on Linux, you first need to give the JDK installer execute permission: chmod o+x jdk... (according to your actual version);

Then run ./jdk... (your version) to install it;

After installation the path is /usr/java.
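Putting those steps together, a rough sketch of the sequence might look like this; jdk-xxx-linux.bin is only a placeholder for whatever JDK package you actually downloaded, and the commands assume you run them from the directory that should hold the JDK (these notes use /usr/java):

chmod +x jdk-xxx-linux.bin    # make the installer executable (the notes use o+x; +x also covers the file owner)
./jdk-xxx-linux.bin           # run the self-extracting installer
ls /usr/java                  # the extracted JDK directory should now be visible here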

Then add Java to the environment variables: # vi /etc/profile

JAVA_HOME=/usr/java/jdk...
PATH=$PATH:/usr/java/jdk.../bin
CLASSPATH=.:/usr/java/jdk.../jre/lib/rt.jar
export JAVA_HOME PATH CLASSPATH
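After saving /etc/profile, a quick way to check that the variables took effect (a small addition to the notes) is:

source /etc/profile    # reload the profile in the current shell
echo $JAVA_HOME        # should print /usr/java/jdk...
java -version          # should print the installed JDK version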

To turn off the Linux firewall:

#service iptables status    // view the firewall status
#service iptables stop      // turn off the firewall
#chkconfig iptables off     // set the firewall's auto-start to off, i.e. disable it at boot

  Modify the IP: set the virtual machine's network connection mode to Host-only, and change the IP according to your actual host's IP address.

#vi /etc/sysconfig/network-scripts/ifcfg-eth0    // modify with vi
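For reference, after editing, the file might look roughly like the sketch below; the address and netmask are only example values for a typical Host-only network, so use whatever matches your own host:

# Example ifcfg-eth0 for a static Host-only address (values are illustrative only)
DEVICE=eth0
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.56.101
NETMASK=255.255.255.0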

  Modify hostname:

Command: #vi /etc/sysconfig/network    // takes effect after the machine restarts
HOSTNAME=hadoop
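If you prefer not to reboot right away, a common companion step (my addition, not from the original notes) is to set the hostname for the current session as well:

hostname hadoop    # applies the new hostname immediately for the running system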

  SSH automatic (passwordless) login: note that this is not the "SSH" of the three big Java frameworks (Struts, Spring, Hibernate)!

~/.ssh # ssh-keygen -t rsa    // then keep pressing Enter to accept the defaults
#cp id_rsa.pub authorized_keys
#ssh localhost    // use this command to verify

When you are finished, run ssh localhost to complete the verification; it should log in without asking for a password.

Today's notes focus only on the Hadoop project itself; the installation will follow in Getting Started with Hadoop learning notes --- part2.

itred    Email: [email protected]    Blog: http://www.cnblogs.com/itred
Copyright: the copyright of this article belongs to the author and the blog park (cnblogs). You are welcome to reprint it, but please clearly mark the source in a prominent position in the article. I reserve the right to pursue legal responsibility for any use made without my written consent.
