Introduction to MapReduce and HDFS: What is Hadoop?
Google proposed a programming model (MapReduce) and a distributed file system (Google File System) for its business needs, and published the relevant papers (available on Google Research's web site: GFS, MapReduce). Doug Cutting and Mike Cafarella drew on these two papers while developing the search engine Nutch, implementing MapReduce and an HDFS of the same design, which together became Hadoop.
If the executable files, scripts, or configuration files a program needs do not exist on the compute nodes of the Hadoop cluster, you first need to distribute those files to the cluster for the computation to succeed. Hadoop provides a mechanism for automatically distributing files and compressed packages: simply configure the appropriate parameters when you start the job.
The streaming framework allows programs implemented in any programming language to be used in Hadoop MapReduce, which makes it easy to migrate existing programs to the Hadoop platform; in this sense Hadoop's extensibility is significant. Next we use the C++, PHP, and Python languages to implement Hadoop WordCount. Practice one: implementing WordCount in C++.
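Before the per-language implementations, the core WordCount logic under the streaming model can be sketched in Python. This is a minimal in-memory sketch; a real streaming job would read lines from sys.stdin and print to stdout, and the shuffle phase is modeled here by sorting the mapper output.

```python
from itertools import groupby

def mapper(lines):
    # Map step: emit "word\t1" for every word, exactly what a
    # streaming mapper would print to stdout, one pair per line.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_pairs):
    # Reduce step: streaming delivers keys in sorted order, so
    # consecutive identical words can be summed with groupby.
    keyed = (pair.split("\t") for pair in sorted_pairs)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

# The sort stands in for the shuffle between map and reduce.
counts = list(reducer(sorted(mapper(["hello world", "hello hadoop"]))))
```

Each language version in the sections that follow implements these same two stdin-to-stdout stages.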
The previous article described the various streaming parameters.
Example of submitting a Hadoop streaming task:

$HADOOP_HOME/bin/hadoop streaming \
    -input /user/test/input -output /user/test/output \
    -mapper "mymapper.sh" -reducer "myreducer.sh" \
    -file /home/work/mymapper.sh
Large files and archives in Hadoop streaming
Tasks use the -cacheFile and -cacheArchive options to distribute files and archives across the cluster; the option arguments are the URIs of files or archives the user has already uploaded to HDFS. These files and archives are cached between different jobs. The user can configure fs.default.name to set the host and port of the file system the URIs refer to.
Put the command first:

hadoop jar /usr/hadoop-1.2.1/contrib/streaming/hadoop-streaming-1.2.1.jar -mapper mapper.py -file mapper.py -reducer reduce.py -file reduce.py -file params.txt -file params2.txt -input /data/* -output /output

The output directory must not already exist. The output of mapper.py is passed directly to reduce.py.
We know that the Hadoop streaming framework uses '\t' as the separator by default: the part of each line before the first '\t' becomes the key, and the remaining content becomes the value; if no '\t' separator exists, the entire line is used as the key and the value is empty. These key\tvalue pairs then serve as the reduce input. Hadoop provides configuration options to change the separator, such as -D stream.map.output.field.separator.
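The default split can be illustrated with a small Python helper. `split_key_value` is a hypothetical name, mimicking what the framework does when `stream.map.output.field.separator` is left at its default tab:

```python
def split_key_value(line, sep="\t"):
    # Text before the first separator is the key, the rest is the
    # value; with no separator, the whole line becomes the key and
    # the value is empty.
    if sep in line:
        key, value = line.split(sep, 1)
        return key, value
    return line, ""

examples = [split_key_value("k\tv1\tv2"), split_key_value("whole-line")]
```

Note that only the first tab splits: "k\tv1\tv2" yields key "k" with value "v1\tv2", which is why multi-field values survive the shuffle intact.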
Streaming supports using scripts as the map and reduce programs. The following describes a program that computes the total number of lines across all files in a distributed manner.
1. Put the data to be processed into HDFS:

$ hadoop fs -put localfile /user/hadoop/hadoopfile
2. Write the map and reduce scripts. Remember to add executable permissions to the scripts.
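The original scripts are not reproduced in this copy; a Python sketch of the pair (hypothetical mapper.py/reducer.py bodies, which would read sys.stdin in a real job) could look like this:

```python
TAB = "\t"

def line_count_mapper(lines):
    # Each map task emits the line count of its own input split,
    # under one constant key so a single reducer receives them all.
    yield "total" + TAB + str(sum(1 for _ in lines))

def line_count_reducer(pairs):
    # Sum the per-split counts into the global total.
    total = sum(int(p.split(TAB)[1]) for p in pairs)
    yield "total" + TAB + str(total)

# Two map tasks over two splits, then one reduce over their output.
mapped = list(line_count_mapper(["a", "b", "c"])) + list(line_count_mapper(["d"]))
result = list(line_count_reducer(mapped))
```

Using a single constant key funnels every partial count to one reducer, which is exactly what a distributed total requires.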
/samples/cachefile/input.txt
cache/file  (cache is the extracted directory name, given a redefined alias with #; see below)
cache/file2
HADOOP_HOME=/home/hadoop/hadoop-2.3.0-cdh5.1.3
$HADOOP_HOME/bin/hadoop fs -rmr /cacheout/
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/
Look at the submit job script, which is also important:
#!/bin/bash
export HADOOP_HOME=/home/q/hadoop-2.2.0
sudo -u flightdev hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
    -D mapred.job.
Recently I wanted to briefly study streaming, mainly with Python. Python + Hadoop also appeared in a previous blog post; it's interesting. If there's an opportunity I'll try C++ as well.
Recording some web pages I came across, as memos:
http://hadoop.apache.org/docs/r0.19.2/cn/streaming.html#Hadoop+Streaming (Chinese)
Reprinted from http://www.cnblogs.com/shapherd/archive/2012/12/21/2827860.html. Hadoop supports multiple outputs from reduce: a single reduce can write to multiple part-xxxxx-X files, where X is one of the letters A-Z, and the program appends a "#X" suffix to the value on output. How to use it: in the startup script, specify -outputformat org.apache.hadoop.mapred.lib.SuffixMultipleTextOutputFormat or -outputformat org.apache.hadoop.mapred.lib.SuffixMultipleSequenceFileOutputFormat.
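Under that scheme the reducer only has to append the routing tag to each value. A hypothetical Python reducer (the routing rule here is invented for illustration) might look like:

```python
def route_by_parity(records):
    # Hypothetical routing rule: even counts go to part-xxxxx-A,
    # odd counts to part-xxxxx-B. The "#X" tag is consumed by the
    # Suffix*OutputFormat to pick the file; it is not written out.
    for key, count in records:
        suffix = "A" if count % 2 == 0 else "B"
        yield f"{key}\t{count}#{suffix}"

routed = list(route_by_parity([("apple", 2), ("pear", 3)]))
```

In a real job each yielded line would be printed to stdout, and the output format would strip the "#X" suffix while splitting the records across the part files.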
How the output is divided (I could not find a relevant document for this, nor the original text). Example 1 output: because -D stream.num.map.output.key.fields=4 specifies that the first 4 fields of each map output line form the key, the rest is the value.

Map output lines:
11.12.1.2
11.14.2.3
11.11.4.1
11.12.1.1
11.14.2.2

These are divided among 3 reducers (the first 2 fields serve as the partition key):

11.11.4.1
-----------
11.12.1.2
11.12.1.1
-----------
11.14.2.3
11.14.2.2

Within each partition the reducer input is sorted (all 4 fields are used for sorting at the same time).
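The partition-then-sort behavior in that example can be simulated in plain Python. This is a sketch of what KeyFieldBasedPartitioner plus the framework sort accomplish, not Hadoop code (real partitioning hashes the partition key, so partitions are not necessarily in order as they are here):

```python
from itertools import groupby

def partition_and_sort(lines, partition_fields=2):
    # Records that share their first `partition_fields` dot-separated
    # fields land with the same reducer; each reducer's input arrives
    # sorted on the full key.
    def pkey(line):
        return line.split(".")[:partition_fields]
    ordered = sorted(lines)  # full-key sort also groups partition keys
    return [list(group) for _, group in groupby(ordered, key=pkey)]

partitions = partition_and_sort(
    ["11.12.1.2", "11.14.2.3", "11.11.4.1", "11.12.1.1", "11.14.2.2"]
)
```

Running this on the example's five records reproduces the three reducer inputs shown above, each internally sorted on all four fields.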
Recently I have also been studying stream processing with Spark Streaming. This article gives a simple example of Spark Streaming programming: a streaming word count. 1. Dependent jar packages: refer to the article "Using Eclipse and IDEA to build a Scala + Spark development environment", which specifies them.