Big data: From Getting Started to XX (iii)

Source: Internet
Author: User
Tags: shuffle, zookeeper, hadoop, mapreduce

After a brief survey of Apache's open-source projects, the next step is to take a first real look at Hadoop itself. Go straight to the Hadoop website, which is the official channel for learning Hadoop. The following is excerpted from the official site:

What is Apache Hadoop?

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.

The project includes these modules:

Hadoop Common: The common utilities that support the other Hadoop modules.

Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.

Hadoop YARN: A framework for job scheduling and cluster resource management.

Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

The official site describes each module clearly, so no more needs to be said here. Two release lines of the Hadoop project are currently active, 2.6.x and 2.7.x; the latest version at the time of writing is 2.7.2, which can be downloaded from the official releases page.

From the homepage, click "Learn about" to reach the documentation page; you can also open the hadoop-2.7.2/share/doc/hadoop/index.html file included in the binary distribution.

Hadoop can be deployed in three modes: local (standalone) mode, pseudo-distributed mode, and fully-distributed mode. We start with the simplest, the standalone version, which has three characteristics: it runs against the local file system, it runs in a single Java process, and it is convenient for debugging programs.
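A quick way to confirm these characteristics, once the installation described below is complete, is the following sketch: in standalone mode no Hadoop daemons are started, and file system commands operate on the local disk because the default fs.defaultFS is file:///.

[hadoop@localhost ~]$ jps            # should list no NameNode/DataNode/ResourceManager processes
[hadoop@localhost ~]$ hadoop fs -ls /   # with the default file:/// file system, this lists the local root directory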

Hadoop supports GNU/Linux and Windows as operating systems; the Windows platform is not explored here. I am using Red Hat Linux 6.3. For a virtual machine used for learning, I generally select every software package that can be installed, so as not to create extra trouble for myself later.

Hadoop 2.7.2 Standalone Installation process:

1. Determine the operating system version

[root@localhost ~]# lsb_release -a
LSB Version:    :core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: RedHatEnterpriseServer
Description:    Red Hat Enterprise Linux Server release 6.3 (Santiago)
Release:        6.3
Codename:       Santiago
[root@localhost ~]# uname -a
Linux localhost.localdomain 2.6.32-279.el6.x86_64 #1 SMP Wed Jun 13 18:24:36 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

2. Determine the required Java version and download the JDK

Hadoop 2.7.2 requires Java 1.7 or higher. The current available JDK release is 1.8.0_92, which can be downloaded from Oracle's website.
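To see which Java version, if any, is currently active on the system, a quick check is:

[root@localhost ~]# java -version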

3. Check for an existing JDK on Linux and remove it

[root@localhost ~]# rpm -qa | grep openjdk
java-1.6.0-openjdk-javadoc-1.6.0.0-1.45.1.11.1.el6.x86_64
java-1.6.0-openjdk-devel-1.6.0.0-1.45.1.11.1.el6.x86_64
java-1.6.0-openjdk-1.6.0.0-1.45.1.11.1.el6.x86_64

[root@localhost ~]# rpm -e --nodeps java-1.6.0-openjdk-1.6.0.0-1.45.1.11.1.el6.x86_64
[root@localhost ~]# rpm -e --nodeps java-1.6.0-openjdk-devel-1.6.0.0-1.45.1.11.1.el6.x86_64
[root@localhost ~]# rpm -e --nodeps java-1.6.0-openjdk-javadoc-1.6.0.0-1.45.1.11.1.el6.x86_64
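To confirm the removal succeeded, repeat the query; it should now return nothing:

[root@localhost ~]# rpm -qa | grep openjdk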

4. Install JDK 1.8.0_92

[root@localhost local]# rpm -ivh jdk-8u92-linux-x64.rpm
Preparing...                ########################################### [100%]
   1:jdk1.8.0_92            ########################################### [100%]
Unpacking JAR files...
        tools.jar...
        plugin.jar...
        javaws.jar...
        deploy.jar...
        rt.jar...
        jsse.jar...
        charsets.jar...
        localedata.jar...
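The RPM installs the JDK under /usr/java/jdk1.8.0_92, the same path used for JAVA_HOME in the next step. Before the environment variables are configured, you can verify the installation by calling the binary directly:

[root@localhost local]# /usr/java/jdk1.8.0_92/bin/java -version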

5. Modify the /etc/profile file, adding the following lines at the end of the file

[root@localhost etc]# vi /etc/profile

JAVA_HOME=/usr/java/jdk1.8.0_92
JRE_HOME=/usr/java/jdk1.8.0_92/jre
PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib

export JAVA_HOME JRE_HOME PATH CLASSPATH

6. Make the modified environment variables take effect

[root@localhost etc]# source /etc/profile
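A quick sanity check that the new environment is in effect; the reported Java version should be 1.8.0_92:

[root@localhost etc]# echo $JAVA_HOME
[root@localhost etc]# java -version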

7. Add a hadoop group, create a hadoop user, and set the hadoop user's password

[root@localhost ~]# groupadd hadoop
[root@localhost ~]# useradd -m -g hadoop hadoop
[root@localhost ~]# passwd hadoop
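You can confirm the new account with the following command, which should show the hadoop user belonging to the hadoop group:

[root@localhost ~]# id hadoop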

8. Extract the Hadoop installation package into the hadoop user's home directory

[hadoop@localhost ~]$ tar -zxvf hadoop-2.7.2.tar.gz
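After extraction, the ~/hadoop-2.7.2 directory should contain the bin, sbin, etc, and share subdirectories that the following steps rely on:

[hadoop@localhost ~]$ ls ~/hadoop-2.7.2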

9. Set the environment variables for the hadoop user, adding the following two lines at the end of the file

[hadoop@localhost ~]$ vi .bash_profile

export HADOOP_COMMON_HOME=~/hadoop-2.7.2
export PATH=$PATH:~/hadoop-2.7.2/bin:~/hadoop-2.7.2/sbin

10. Make the environment variables take effect

[hadoop@localhost ~]$ source .bash_profile
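At this point the hadoop command should be on the PATH; a quick sanity check is to print the version, which should report 2.7.2:

[hadoop@localhost ~]$ hadoop version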

11. Run a test job: count the occurrences, in the files under the input directory, of strings matching a regular expression

[hadoop@localhost ~]$ mkdir input
[hadoop@localhost ~]$ cp ./hadoop-2.7.2/etc/hadoop/*.xml input
[hadoop@localhost ~]$ hadoop jar ./hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'der[a-z.]+'
16/03/11 11:11:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/03/11 11:11:40 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
16/03/11 11:11:40 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
16/03/11 11:11:40 INFO input.FileInputFormat: Total input paths to process : 8
16/03/11 11:11:40 INFO mapreduce.JobSubmitter: number of splits:8
......
......
16/03/11 11:14:35 INFO mapreduce.Job: Counters: 30
        File System Counters
                FILE: Number of bytes read=1159876
                FILE: Number of bytes written=2227372
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
        Map-Reduce Framework
                Map input records=8
                Map output records=8
                Map output bytes=228
                Map output materialized bytes=250
                Input split bytes=116
                Combine input records=0
                Combine output records=0
                Reduce input groups=2
                Reduce shuffle bytes=250
                Reduce input records=8
                Reduce output records=8
                Spilled Records=16
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=59
                Total committed heap usage (bytes)=265175040
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=390
        File Output Format Counters
                Bytes Written=192

12. View the results; the content displayed is as expected

[hadoop@localhost ~]$ cat output/*
2       der.
1       der.zookeeper.path
1       der.zookeeper.kerberos.principal
1       der.zookeeper.kerberos.keytab
1       der.zookeeper.connection.string
1       der.zookeeper.auth.type
1       der.uri
1       der.password

13. To run the test again, remove the output directory first

[hadoop@localhost ~]$ rm -rf output/
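For example, a complete re-run simply repeats the commands from steps 11 and 12 after clearing the old output (using the same regular expression as above):

[hadoop@localhost ~]$ rm -rf output/
[hadoop@localhost ~]$ hadoop jar ./hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'der[a-z.]+'
[hadoop@localhost ~]$ cat output/*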


This article is from the "Shen Jinqun" blog; reprinting is declined.
