Getting started with Hadoop
Hadoop is a big data platform that provides big data storage (HDFS) and big data processing (MapReduce). This article first introduces some Hadoop background, then shows how to install and configure Hadoop on a Mac, and finally uses Streaming to write a MapReduce job in Python.
Motivation
As a representative big data platform, Hadoop is worth learning for every big data developer. What I want to do after getting started is a project related to a big data platform, so I need to learn about Hadoop in advance, including the use of Hive and MapReduce.
Target
Look up Hadoop material on my own, set up the environment, and write a wordcount using Streaming and Python.
Hadoop Introduction
Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with the fundamental assumption that hardware failures (of individual machines, or racks of machines) are commonplace and should be automatically handled in software by the framework.
The term "Hadoop" has come to refer not just to the base modules, but also to the "ecosystem", or collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Spark, and others.
HDFS (Hadoop Distributed File System)
The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. A Hadoop cluster nominally has a single namenode plus a cluster of datanodes, although redundancy options are available for the namenode because of its criticality. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses TCP/IP sockets for communication, and clients use remote procedure calls (RPC) to communicate with each other.
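As a small illustration of a client talking to the namenode over the network, here is a hedged sketch (my own addition, not part of the original article) that lists an HDFS directory through the WebHDFS REST interface. It assumes WebHDFS is enabled (dfs.webhdfs.enabled, on by default in Hadoop 2.x) and that the namenode web UI runs on the default port 50070 of a local single-node cluster:

#!/usr/bin/env python
# Hypothetical sketch: list the /user directory via the namenode's WebHDFS
# REST API. Assumes a local single-node cluster with WebHDFS enabled.
import json
import urllib2

url = "http://localhost:50070/webhdfs/v1/user?op=LISTSTATUS"
response = json.load(urllib2.urlopen(url))

# each entry describes one file or directory under /user
for entry in response["FileStatuses"]["FileStatus"]:
    print entry["type"], entry["pathSuffix"]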
MapReduce
Above the file system comes the MapReduce engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible.
The process is as follows:
Map(k1, v1) → list(k2, v2)
Reduce(k2, list(v2)) → list(v3)
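To make this contract concrete, here is a minimal local Python sketch (my own illustration, no Hadoop involved) that expresses a word count as exactly these two functions, with a grouping step in between standing in for the shuffle:

# Minimal local illustration of the Map/Reduce contract above (no Hadoop):
#   map(k1, v1)          -> list of (k2, v2)
#   reduce(k2, list(v2)) -> list of v3
from collections import defaultdict

def map_fn(doc_id, text):
    # k1 = document id, v1 = its text; emit (word, 1) pairs
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # k2 = word, list(v2) = the 1s emitted for it; v3 = the total count
    return [(word, sum(counts))]

documents = {"doc1": "hello hadoop", "doc2": "hello world"}

# "shuffle": group intermediate values by key
grouped = defaultdict(list)
for doc_id, text in documents.items():
    for word, count in map_fn(doc_id, text):
        grouped[word].append(count)

for word, counts in grouped.items():
    print reduce_fn(word, counts)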
Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix. Amazon maintains a software fork of Apache Hive that is supported in Amazon Elastic MapReduce on Amazon Web Services.
It provides an SQL-like language called HiveQL with schema on read and transparently converts queries to map/reduce, Apache Tez and Spark jobs.
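As a hedged illustration only (this article does not install Hive, and the table name docs with a single column line is made up for the example), a HiveQL word count could be submitted from Python through the hive command-line client, which compiles the query into MapReduce (or Tez/Spark) jobs behind the scenes:

#!/usr/bin/env python
# Hypothetical sketch: run a HiveQL word count via the `hive -e` CLI.
# Assumes Hive is installed and a table docs(line STRING) already exists.
import subprocess

query = """
SELECT word, count(1) AS cnt
FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w
GROUP BY word
ORDER BY cnt DESC
LIMIT 10;
"""

# hive -e runs a query string and prints the result to stdout
print subprocess.check_output(["hive", "-e", query])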
Hadoop Installation
This setup uses OS X Yosemite (10.10.3).
brew install hadoop
$ hadoop version
Hadoop 2.7.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r d4c8d4d4d203c934e8074b31289a28724c0842cf
Compiled by jenkins on 2015-04-10T18:40Z
Compiled with protoc 2.5.0
From source with checksum a9e90912c37a35c3195d23951fd18f
This command was run using /usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/common/hadoop-common-2.7.0.jar
Configure JAVA_HOME in Hadoop
Add the JAVA_HOME setting to .bashrc (or .zshrc):
# set java home
[ -f /usr/libexec/java_home ] && export JAVA_HOME=$(/usr/libexec/java_home)
Make the settings take effect:
source ~/.bashrc   # or: source ~/.zshrc
Configure ssh
1. Generate a key pair (skip this step if you already have one)
ssh-keygen -t rsa
2. Set the Mac to allow remote login
Go to "System Preferences" -> "Sharing" and check "Remote Login".
3. Set up passwordless login
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
4. Test logging in locally
$ ssh localhost
Last login: Fri Jun 19 16:30:53 2015
$ exit
Configure a single node for Hadoop
Here we use a single-node setup for learning. The configuration files live under /usr/local/Cellar/hadoop/2.7.0/libexec/etc/hadoop.
Configure hdfs-site.xml
Set the number of replicas to 1:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
Configure core-site.xml
Set the port for file system access:
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
Configure mapred-site.xml
Set the framework used by MapReduce:
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
Configure yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
Alias
alias hstart="start-dfs.sh ; start-yarn.sh"
alias hstop="stop-yarn.sh ; stop-dfs.sh"
Format the file system
$ hdfs namenode -format
Start Hadoop
hstart
Create a user directory
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/$(whoami)   # the current user's directory
View the processes started by Hadoop
jps
The normal situation is as follows:
$ jps
24610 NameNode
24806 SecondaryNameNode
24696 DataNode
25018 NodeManager
24927 ResourceManager
25071 Jps
Each line shows a process ID followed by the process name.
Stop Hadoop
hstop
A Hadoop example
When running the example, the current directory is set to /usr/local/Cellar/hadoop/2.7.0/libexec.
1. Upload the test file to HDFS.
hdfs dfs -put etc/hadoop input
This uploads the local etc/hadoop directory into an HDFS directory named input. You can see the uploaded files under the user directory you just created: /user/$(whoami)/input.
2. Run the example provided by Hadoop on the uploaded data
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+'
This runs the grep example as a MapReduce job over the uploaded data, counts how often each word starting with dfs appears, and saves the result to output.
3. View the result
$ hdfs dfs -cat output/part-r-00000   # the file name can be found via Browse Directory: http://localhost:50070/explorer.html#/
4	dfs.class
4	dfs.audit.logger
3	dfs.server.namenode.
2	dfs.period
2	dfs.audit.log.maxfilesize
2	dfs.audit.log.maxbackupindex
1	dfsmetrics.log
1	dfsadmin
1	dfs.servers
1	dfs.replication
1	dfs.file
4. Delete the generated file.
hdfs dfs -rm -r /user/$(whoami)/input
hdfs dfs -rm -r /user/$(whoami)/output
Use Python and Streaming to write a wordcount
Although Hadoop itself is written in Java, it supports writing MapReduce programs in other languages:
- Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer.
- Hadoop Pipes is a SWIG-compatible C++ API to implement MapReduce applications (not JNI based).
Set the Streaming variable (for later use)
The Streaming jar installed by brew is at /usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/tools/lib/hadoop-streaming-2.7.0.jar; you can locate it with:
find ./ -type f -name "*streaming*"
Set it to a variable for later use:
export STREAM="/usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/tools/lib/hadoop-streaming-2.7.0.jar"
Write Map and Reduce programs
By default, the programs read from standard input and write to standard output, and Hadoop exchanges data with them through these streams. That is what "Streaming" refers to, and it also means you can debug your programs locally with shell pipes.
mapper.py
#!/usr/bin/env python
# filename: mapper.py
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print '%s\t%s' % (word, 1)
Add executable permissions to the program:
chmod +x mapper.py
Test:
$ echo "this is a test" | ./mapper.py
this	1
is	1
a	1
test	1
reducer.py
#!/usr/bin/env python
# filename: reducer.py
import sys

current_word = None
current_count = 0
word = None

# the input arrives sorted by word, so equal words are adjacent
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # count was not a number, silently skip the line
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# emit the last word
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
Add executable permissions to the program:
chmod +x reducer.py
Test:
$ echo "this is a a a test test" | ./mapper.py | sort -k1,1 | ./reducer.py
a	3
is	1
test	2
this	1
Run it on Hadoop
1. Prepare the data: put a few plain-text files in the current directory. The example below uses pg5000.txt, pg4300.txt, and pg20417.txt, plain-text ebooks from Project Gutenberg.
2. Upload the files to HDFS
Upload the text files so that Hadoop MapReduce can read them:
$ hdfs dfs -mkdir /user/$(whoami)/input
$ hdfs dfs -put ./*.txt /user/$(whoami)/input
3. Run MapReduce
$ hadoop jar $STREAM \
    -files ./mapper.py,./reducer.py \
    -mapper ./mapper.py \
    -reducer ./reducer.py \
    -input /user/$(whoami)/input/pg5000.txt,/user/$(whoami)/input/pg4300.txt,/user/$(whoami)/input/pg20417.txt \
    -output /user/$(whoami)/output
4. View the results
$ hdfs dfs -cat /user/$(whoami)/output/part-00000 | sort -nk 2 | tail
with	4686
it	4981
that	6109
is	7401
in	11867
to	12017
a	12064
and	16904
of	23935
the	42074
This shows that ordinary books use a lot of prepositions and articles; in many applications these stop words would be filtered out.
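If you do want to filter such words out, the mapper is a natural place to do it. Below is a minimal sketch (my own addition; the file name and the tiny stop-word list are only illustrative) of a mapper that skips them:

#!/usr/bin/env python
# filename: mapper_stopwords.py (hypothetical)
# Same idea as mapper.py, but skips a small, illustrative stop-word list.
import sys

STOP_WORDS = set(['the', 'of', 'and', 'a', 'to', 'in', 'is', 'it', 'that', 'with'])

for line in sys.stdin:
    for word in line.strip().lower().split():
        if word in STOP_WORDS:
            continue
        print '%s\t%s' % (word, 1)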
5. Improvement (use iterators and generators)
With yield, data is produced lazily as it is needed, which keeps memory usage low when processing large inputs.
Improved mapper.py:
#!/usr/bin/env python
"""A more advanced Mapper, using Python iterators and generators."""
import sys

def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        for word in words:
            print '%s%s%d' % (word, separator, 1)

if __name__ == "__main__":
    main()
Improved reducer.py:
#!/usr/bin/env python
"""A more advanced Reducer, using Python iterators and generators."""
from itertools import groupby
from operator import itemgetter
import sys

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_mapper_output(sys.stdin, separator=separator)
    # groupby groups multiple word-count pairs by word,
    # and creates an iterator that returns consecutive keys and their group:
    #   current_word - string containing a word (the key)
    #   group - iterator yielding all ["<current_word>", "<count>"] items
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for current_word, count in group)
            print "%s%s%d" % (current_word, separator, total_count)
        except ValueError:
            # count was not a number, so silently discard this item
            pass

if __name__ == "__main__":
    main()
UI for viewing system status
- NameNode (HDFS): http://localhost:50070
- ResourceManager (YARN): http://localhost:8088
- NodeManager (specific node information): http://localhost:8042