Write scalable, distributed, data-intensive programs and learn the basics
Understand Hadoop and MapReduce
Write and run a basic MapReduce program
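To make the last objective concrete, here is a minimal WordCount sketch against the Hadoop 2.x org.apache.hadoop.mapreduce API; the class name and the paths in the closing note are this example's own choices, not anything prescribed above.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Mapper: emit (word, 1) for every token in the input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum the counts gathered for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // combiner reuses the reducer to pre-aggregate on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Package it into a jar and submit it with: hadoop jar wordcount.jar WordCount /user/hadoop/input /user/hadoop/output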
1. What is Hadoop? Hadoop is an open-source framework for writing and running distributed applications that process large-scale data. What makes Hadoop unique are the following points:
Convenient: Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon EC2.
Environment:
[root@vm8028 soft]# cat /etc/issue
CentOS release 6.5 (Final)
Kernel \r on an \m
[root@vm8028 soft]# uname -a
Linux vm8028 2.6.32-431.el6.x86_64 #1 SMP Fri Nov 22 03:15:09 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
[root@vm8028 soft]# hadoop version
Hadoop 2.7.1
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 15ecc87ccf4a0228f35af08fc56de536e6ce657a
Compiled by jenkins on 2015-06-29T06:04Z
Compiled with protoc 2.5.0
From source with checksum ...
Issue 1: With Hadoop installed in a virtual machine, Windows cannot reach the Hadoop web page http://master:50070/ by host name, and pinging master from Windows also fails. Method: add the host name and IP address of the Hadoop machine to the Windows hosts file at C:\Windows\System32\drivers\etc\hosts.
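For example, an entry of the following form (the IP address here is hypothetical; use the virtual machine's actual LAN address):

192.168.1.100  master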
Issue 2: Windows Eclipse runni...
monitoring the Apache Hadoop cluster, which includes Hadoop HDFS, Hadoop Hive, Hadoop MapReduce, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for monitoring the health of...
Network design components for a given scenario
4. Hadoop Cluster Installation and Administration (25%)
Given a scenario, identify how the cluster would handle disk and machine failures
Analyze a logging configuration and the logging configuration file format
Understand the basics of Hadoop metrics and cluster health monitoring
Identify the function and purpose of available tools for cluster monitoring
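To make the last two objectives concrete, the standard commands below (a minimal sketch; all three ship with Hadoop 2.x, though output formats vary by version) surface the basic health signals:

hdfs dfsadmin -report   # capacity plus live/dead DataNodes and per-node usage
hdfs fsck /             # block-level integrity of the file system
yarn node -list         # NodeManager liveness and container counts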
First, download Hadoop from the website (http://hadoop.apache.org, or https://archive.apache.org/dist/hadoop/common/hadoop-2.6.0) and decompress it as Administrator to D:\Hadoop\hadoop-2.6.0. Second, download winutils: you also need winutils.exe, in a version corresponding to your Hadoop release.
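A common arrangement (the install path below is simply the one from this example) is to place winutils.exe in the bin folder of the unpacked distribution and point HADOOP_HOME at it from a Windows command prompt:

set HADOOP_HOME=D:\Hadoop\hadoop-2.6.0
set PATH=%PATH%;%HADOOP_HOME%\bin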
Part 1: the hadoop bin directory. The following hadoop bin scripts are covered according to the actual needs of the project: the hadoop shell, and hadoop-config.sh, which is used to assign values to some variables: HADOOP_HOME (the Hadoop installation directory), HADOOP_CONF_DIR (the Hadoop configuration file directory), and HADOOP_SLAVES (the address of the file specified by --...
Ubuntu installation (I won't include screenshots here, just a URL; I trust everyone's ability). Ubuntu installation reference tutorial: http://jingyan.baidu.com/article/14bd256e0ca52ebb6d26129c.html. Note the following points: 1. Set the virtual machine's IP: click the network connection icon in the bottom-right corner of the virtual machine and select "Bridged mode" so that it is assigned an IP on your LAN. This is very important, because later Hadoop will use th...
The previous several articles mainly covered Spark RDD fundamentals and used textFile to operate on files on the local machine. In practical applications there are few opportunities to manipulate ordinary local files; far more often you operate on Kafka streams and on files stored in Hadoop.
Let's build a Hadoop environment on the local machine. 1. Installing and configuring Hadoop
Preparation: 1. Create a Hadoop-related directory (easier to manage). 2. Give the hadoop user and group ownership of the /opt/* directories: sudo chown -R hadoop:hadoop /opt/*. 3. Install and configure the JDK. Configuring HDFS/YARN/MapReduce: 1. Decompress Hadoop: tar -zxf hadoop-2.5.0.tar.gz -C /opt/modules/ (delete the doc help documents to save space: rm -rf /opt/module...
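After unpacking, a pseudo-distributed setup needs at least the following two properties in the files under etc/hadoop (a minimal sketch; localhost:8020 and a replication factor of 1 are common single-node choices, not values taken from this article):

core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>

hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>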
-P '' -f /home/u/.ssh/id_dsa. ssh-keygen indicates that a key is being generated; -t specifies the type of key to generate; dsa means DSA key authentication, i.e. the key type; -P provides a passphrase; -f specifies the generated key file. (4) # cat /home/u/.ssh/id_dsa.pub >> /home/u/.ssh/authorized_keys: appends the public key to the public key file used for authentication; authorized_keys is the public key file for authentication. (5) # ssh -version: verifies that the SSH installation is complete and correct in...
additional openssh-clients. (3) # mkdir -p ~/.ssh: if these folders are not generated automatically after you install SSH, create them yourself. (4) # ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa. ssh-keygen indicates that a key is being generated; -t specifies the type of key to generate; dsa means DSA key authentication, i.e. the key type; -P provides a passphrase; -f specifies the generated key file. (5) # cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys: appends the public key to the pub...
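If the key setup succeeded, running ssh localhost should now log you in without prompting for a password.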
Compiling the Hadoop 2.x hadoop-eclipse-plugin on Windows and using it with Eclipse and Hadoop. I. Introduction
For Hadoop 2.x, without the Eclipse plug-in tool we cannot debug code in Eclipse; we have to package the written Java MapReduce code into a jar and then run it on Linux, which makes debugging inconvenient. Therefore, we compile an Eclipse plug-in so that we can debug locally. Afte...
"hadoopclusternetwork" created by the author ".
Open the following ports for the virtual machine; that is, set the following endpoints in the virtual machine configuration.
Ports to enable for the virtual machine:
7180 (Cloudera Manager web UI)
8020, 50010, 50020, 50070, 50075 (HDFS NameNode and DataNode)
8021 (MapReduce JobTracker)
8888 (Hue web UI)
9083 (Hive/HCatalog metastore)
41415 (Flume agent)
11000 (Oozie server)
21050 (Impala)
Apache Flume: a distributed, reliable system for high-volume log aggregation, which can be used for log data collection, log processing, and log transfer.
Apache Giraph: a scalable, distributed, iterative graph processing system built on the Hadoop platform, inspired by BSP (bulk synchronous parallel) and Google's Pregel.
Apache Oozie: a workflow engine server that manages and coordinates the tasks that run on the Hadoop platform.
To know and learn about Hadoop, we have to understand its composition. Based on my own experience, I will introduce three aspects: the Hadoop components, the big-data processing flow, and the Hadoop core:
Hadoop Components
Full-text indexing with Lucene, Solr, Nutch, and Hadoop: Lucene; Full-text indexing with Lucene, Solr, Nutch, and Hadoop: Solr. Last year I wanted to give a detailed introduction to Lucene, Solr, Nutch, and Hadoop, but because of time constraints I only wrote two articles, introducing Lucene and Solr respectively, and then stopped writing; my heart is still loo...
Hadoop Streaming provides a toolkit for MapReduce programming that lets the Mapper and Reducer be implemented as executable commands, scripts, or programs in other languages, so they can take advantage of the benefits and capabilities of the Hadoop parallel computing framework to handle big data. All right, I admit the above is copied. What follows is original, practical material. The first deployment of the...
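As an illustration of the idea (the input and output paths are placeholders, and the streaming jar's exact name varies by release), the classic invocation from the Hadoop documentation wires ordinary Unix executables in as mapper and reducer:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /user/hadoop/input \
  -output /user/hadoop/output \
  -mapper /bin/cat \
  -reducer /usr/bin/wc

Each line of every input file is fed to the mapper on stdin; the reducer receives the sorted map output the same way, so any program that reads stdin and writes stdout can play either role.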
Today we look at HDFS, the core of Hadoop; it is very important. It is a distributed file system. Why does Hadoop support massive data storage? In fact, it depends mainly on HDFS, chiefly on HDFS's ability to store massive amounts of data.
1. Why can HDFS store massive data?
To begin, let's think about this question. There is no need to rehash the basic concepts of HDFS ~ we focus on usage rather than "re...
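The short answer: HDFS splits every file into large blocks (128 MB by default in Hadoop 2.x) and spreads replicated copies of those blocks across many DataNodes, so total capacity grows simply by adding machines. On the usage side, a typical session looks like this (the paths are examples only):

hdfs dfs -mkdir -p /user/hadoop
hdfs dfs -put biglog.txt /user/hadoop/   # the file is split into blocks and replicated across DataNodes
hdfs dfs -ls /user/hadoop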