Getting started with Hadoop
Hadoop is a big data platform that provides big data storage (HDFS) and big data processing (MapReduce). This article first introduces some Hadoop background, then shows how to install and configure Hadoop on a Mac, and finally uses Streaming to write a MapReduce job in Python.
Motivation
As a representative big data platform, Hadoop is worth learning for every big data developer. What I want to do after getting started is a project related to a big data platform, so I need to learn about Hadoop in advance, including the use of Hive and MapReduce.
Target
Look up Hadoop material on my own, set up the environment, and write a wordcount using Streaming and Python.
Hadoop Introduction
Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with the fundamental assumption that hardware failures (of individual machines, or racks of machines) are commonplace and should be automatically handled in software by the framework.
The term "Hadoop" has come to refer not just to the base modules, but also to the "ecosystem", or collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Spark, and others.
HDFS (Hadoop Distributed File System)
The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. A Hadoop cluster nominally has a single namenode plus a cluster of datanodes, although redundancy options are available for the namenode because of its criticality. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses TCP/IP sockets for communication, and clients use remote procedure calls (RPC) to communicate with each other.
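As a small illustration of a client talking to the namenode over the network, here is a hedged sketch (my own addition, not part of the original article) that lists an HDFS directory through the WebHDFS REST interface. It assumes WebHDFS is enabled (dfs.webhdfs.enabled, on by default in Hadoop 2.x) and that the namenode web UI runs on the default port 50070 of a local single-node cluster:

#!/usr/bin/env python
# Hypothetical sketch: list the /user directory via the namenode's WebHDFS
# REST API. Assumes a local single-node cluster with WebHDFS enabled.
import json
import urllib2

url = "http://localhost:50070/webhdfs/v1/user?op=LISTSTATUS"
response = json.load(urllib2.urlopen(url))

# each entry describes one file or directory under /user
for entry in response["FileStatuses"]["FileStatus"]:
    print entry["type"], entry["pathSuffix"]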
MapReduce
Above the file system comes the MapReduce engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible.
The process is as follows:
Map(k1, v1) → list(k2, v2)
Reduce(k2, list(v2)) → list(v3)
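To make this contract concrete, here is a minimal local Python sketch (my own illustration, no Hadoop involved) that expresses a word count as exactly these two functions, with a grouping step in between standing in for the shuffle:

# Minimal local illustration of the Map/Reduce contract above (no Hadoop):
#   map(k1, v1)          -> list of (k2, v2)
#   reduce(k2, list(v2)) -> list of v3
from collections import defaultdict

def map_fn(doc_id, text):
    # k1 = document id, v1 = its text; emit (word, 1) pairs
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # k2 = word, list(v2) = the 1s emitted for it; v3 = the total count
    return [(word, sum(counts))]

documents = {"doc1": "hello hadoop", "doc2": "hello world"}

# "shuffle": group intermediate values by key
grouped = defaultdict(list)
for doc_id, text in documents.items():
    for word, count in map_fn(doc_id, text):
        grouped[word].append(count)

for word, counts in grouped.items():
    print reduce_fn(word, counts)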
Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix. Amazon maintains a software fork of Apache Hive that is supported in Amazon Elastic MapReduce on Amazon Web Services.
It provides an SQL-like language called HiveQL with schema on read and transparently converts queries to map/reduce, Apache Tez and Spark jobs.
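As a hedged illustration only (this article does not install Hive, and the table name docs with a single column line is made up for the example), a HiveQL word count could be submitted from Python through the hive command-line client, which compiles the query into MapReduce (or Tez/Spark) jobs behind the scenes:

#!/usr/bin/env python
# Hypothetical sketch: run a HiveQL word count via the `hive -e` CLI.
# Assumes Hive is installed and a table docs(line STRING) already exists.
import subprocess

query = """
SELECT word, count(1) AS cnt
FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w
GROUP BY word
ORDER BY cnt DESC
LIMIT 10;
"""

# hive -e runs a query string and prints the result to stdout
print subprocess.check_output(["hive", "-e", query])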
Hadoop Installation
This setup uses OS X Yosemite (10.10.3).
brew install hadoop
$ hadoop version
Hadoop 2.7.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r d4c8d4d4d203c934e8074b31289a28724c0842cf
Compiled by jenkins on 2015-04-10T18:40Z
Compiled with protoc 2.5.0
From source with checksum a9e90912c37a35c3195d23951fd18f
This command was run using /usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/common/hadoop-common-2.7.0.jar
Configure JAVA_HOME in Hadoop
Add the JAVA_HOME setting to .bashrc (or .zshrc):
# set java home
[ -f /usr/libexec/java_home ] && export JAVA_HOME=$(/usr/libexec/java_home)
Make the settings take effect:
source ~/.bashrc   # or: source ~/.zshrc
Configure ssh
1. Generate a key pair (skip this step if you already have one)
ssh-keygen -t rsa
2. Set the Mac to allow remote login
Go to "System Preferences" -> "Sharing" and check "Remote Login".
3. Set up passwordless login
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
4. Test logging in locally
$ ssh localhost
Last login: Fri Jun 19 16:30:53 2015
$ exit
Configure a single node for Hadoop
Here we use a single-node setup for learning. The configuration files live under /usr/local/Cellar/hadoop/2.7.0/libexec/etc/hadoop.
Configure hdfs-site.xml
Set the number of replicas to 1:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
Configure core-site.xml
Set the port for file system access:
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
Configure mapred-site.xml
Set the framework used by MapReduce:
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
Configure yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
Alias
alias hstart="start-dfs.sh ; start-yarn.sh"
alias hstop="stop-yarn.sh ; stop-dfs.sh"
Format the file system
$ hdfs namenode -format
Start Hadoop
hstart
Create a user directory
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/$(whoami)   # the current user's directory
View the processes started by Hadoop
jps
The normal situation is as follows:
$ jps
24610 NameNode
24806 SecondaryNameNode
24696 DataNode
25018 NodeManager
24927 ResourceManager
25071 Jps
Each line shows a process ID followed by the process name.
Stop Hadoop
hstop
A Hadoop example
When running the example, the current directory is set to /usr/local/Cellar/hadoop/2.7.0/libexec.
1. Upload the test file to HDFS.
hdfs dfs -put etc/hadoop input
This uploads the local etc/hadoop directory into an HDFS directory named input. You can see the uploaded files under the user directory you just created: /user/$(whoami)/input.
2. Run the example provided by Hadoop on the uploaded data
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+'
This runs the grep example as a MapReduce job over the uploaded data, counts how often each word starting with dfs appears, and saves the result to output.
3. View the result
$ hdfs dfs -cat output/part-r-00000   # the file name can be found via Browse Directory: http://localhost:50070/explorer.html#/
4	dfs.class
4	dfs.audit.logger
3	dfs.server.namenode.
2	dfs.period
2	dfs.audit.log.maxfilesize
2	dfs.audit.log.maxbackupindex
1	dfsmetrics.log
1	dfsadmin
1	dfs.servers
1	dfs.replication
1	dfs.file
4. Delete the generated file.
hdfs dfs -rm -r /user/$(whoami)/input
hdfs dfs -rm -r /user/$(whoami)/output
Use Python and Streaming to write a wordcount
Although Hadoop itself is written in Java, it supports writing MapReduce programs in other languages:
- Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer.
- Hadoop Pipes is a SWIG-compatible C++ API to implement MapReduce applications (not JNI based).
Set the Streaming variable (for later use)
The Streaming jar installed by brew is at /usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/tools/lib/hadoop-streaming-2.7.0.jar; you can locate it with:
find ./ -type f -name "*streaming*"
Set it to a variable for later use:
export STREAM="/usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/tools/lib/hadoop-streaming-2.7.0.jar"
Write Map and Reduce programs
By default, the programs read from standard input and write to standard output, and Hadoop exchanges data with them through these streams. That is what "Streaming" refers to, and it also means you can debug your programs locally with shell pipes.
mapper.py
#!/usr/bin/env python
# filename: mapper.py
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print '%s\t%s' % (word, 1)
Add executable permissions to the program:
chmod +x mapper.py
Test:
$ echo "this is a test" | ./mapper.py
this	1
is	1
a	1
test	1
reducer.py
#!/usr/bin/env python
# filename: reducer.py
import sys

current_word = None
current_count = 0
word = None

# the input arrives sorted by word, so equal words are adjacent
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # count was not a number, silently skip the line
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# emit the last word
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
Add executable permissions to the program:
chmod +x reducer.py
Test:
$ echo "this is a a a test test" | ./mapper.py | sort -k1,1 | ./reducer.py
a	3
is	1
test	2
this	1
Run it on Hadoop
1. Prepare the data: put a few plain-text files in the current directory. The example below uses pg5000.txt, pg4300.txt, and pg20417.txt, plain-text ebooks from Project Gutenberg.
2. Upload the files to HDFS
Upload the text files so that Hadoop MapReduce can read them:
$ hdfs dfs -mkdir /user/$(whoami)/input
$ hdfs dfs -put ./*.txt /user/$(whoami)/input
3. Run MapReduce
$ hadoop jar $STREAM \
    -files ./mapper.py,./reducer.py \
    -mapper ./mapper.py \
    -reducer ./reducer.py \
    -input /user/$(whoami)/input/pg5000.txt,/user/$(whoami)/input/pg4300.txt,/user/$(whoami)/input/pg20417.txt \
    -output /user/$(whoami)/output
4. View the results
$ hdfs dfs -cat /user/$(whoami)/output/part-00000 | sort -nk 2 | tail
with	4686
it	4981
that	6109
is	7401
in	11867
to	12017
a	12064
and	16904
of	23935
the	42074
This shows that ordinary books use a lot of prepositions and articles; in many applications these stop words would be filtered out.
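If you do want to filter such words out, the mapper is a natural place to do it. Below is a minimal sketch (my own addition; the file name and the tiny stop-word list are only illustrative) of a mapper that skips them:

#!/usr/bin/env python
# filename: mapper_stopwords.py (hypothetical)
# Same idea as mapper.py, but skips a small, illustrative stop-word list.
import sys

STOP_WORDS = set(['the', 'of', 'and', 'a', 'to', 'in', 'is', 'it', 'that', 'with'])

for line in sys.stdin:
    for word in line.strip().lower().split():
        if word in STOP_WORDS:
            continue
        print '%s\t%s' % (word, 1)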
5. Improvement (use iterators and generators)
With yield, data is produced lazily as it is needed, which keeps memory usage low when processing large inputs.
Improved mapper.py:
#!/usr/bin/env python
"""A more advanced Mapper, using Python iterators and generators."""
import sys

def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        for word in words:
            print '%s%s%d' % (word, separator, 1)

if __name__ == "__main__":
    main()
Improved reducer.py:
#!/usr/bin/env python
"""A more advanced Reducer, using Python iterators and generators."""
from itertools import groupby
from operator import itemgetter
import sys

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_mapper_output(sys.stdin, separator=separator)
    # groupby groups multiple word-count pairs by word,
    # and creates an iterator that returns consecutive keys and their group:
    #   current_word - string containing a word (the key)
    #   group - iterator yielding all ["<current_word>", "<count>"] items
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for current_word, count in group)
            print "%s%s%d" % (current_word, separator, total_count)
        except ValueError:
            # count was not a number, so silently discard this item
            pass

if __name__ == "__main__":
    main()
UI for viewing system status
- NameNode (HDFS): http://localhost:50070
- ResourceManager (YARN): http://localhost:8088
- NodeManager (specific node information): http://localhost:8042