Hadoop Basics: Hello World

In the previous chapter we downloaded, installed, and ran Hadoop, and finally executed a Hello World program and looked at the results. Now let's read through that Hello World.

OK, let's take a look at what we entered at the command line:

$ mkdir input
$ cd input
$ echo "hello world" > test1.txt
$ echo "hello hadoop" > test2.txt
$ cd ..
$ bin/hadoop dfs -put input in
$ bin/hadoop jar build/hadoop-0.20.2-examples.jar wordcount in out
$ bin/hadoop dfs -cat out/*

Line 1 is easy to understand: we create an input subfolder under the Hadoop folder;

Line 2 changes into the input folder;

Line 3: echo prints its argument back (think of it as a print statement), and the greater-than sign (>) redirects the output. Normally echo writes to the screen; with redirection the content goes into the file test1.txt instead. So this line creates a test1.txt file whose content is "hello world". Line 4 is similar;

Line 5 returns to the parent directory;

Line 6 runs a Hadoop command. The arguments dfs -put input in mean: upload the local input folder into the Hadoop file system and store it there under the directory in.

Line 7 is also a Hadoop command. The arguments jar xxx.jar wordcount in out mean: run the wordcount class inside the examples JAR and pass it the arguments in and out, where in is the input directory and out is the output directory. Both directories live in the Hadoop file system, not in the local operating system. After line 7 runs you will see the screen scroll with progress output while the job is computing.

Line 8: cat is a common Linux command that prints the text content of the specified file. Here cat out/* prints the content of every file under the out folder. Note that this runs against dfs, the Hadoop file system, and out is the directory the program wrote to in step 7. After entering this command we see a result like the following:
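For the two small files we created, that result should look like this (WordCount prints each word and its count; the exact file names inside out can vary with the Hadoop version, but the counts will be the same):

hadoop    1
hello     2
world     1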

What this simple program does is easy to see: it counts how often each word appears across the files and merges the results into one listing.

Some people may think: so what? We could implement this program in a few lines of C# or Java; what is special about it? At first glance that is true. But look a little closer. The core design of the Hadoop framework is two things: HDFS and MapReduce.

HDFS is distributed data storage, and that is what makes the difference: the files whose words I need to count may not all sit on one machine; they may be spread across different machines. We do not have to manage any of that ourselves; Hadoop takes care of it automatically, and we only need one unified interface (bin/hadoop dfs) to access the data.

MapReduce, of course, is responsible for the computation. Looking back, the program is not so simple after all: counting word frequencies in one file is easy, but when the files are spread over different machines and the partial results still have to be merged correctly, it is no longer something a few lines of code can do. That is exactly the part MapReduce takes care of.
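To make that division of labor concrete, here is a simplified sketch of a word-count job written against the org.apache.hadoop.mapreduce API that ships with the 0.20 line. It is written from scratch for illustration; the bundled example is organized along the same lines, but this is not its exact source.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs next to whichever block of input a node holds,
  // and emits a (word, 1) pair for every token in every line it is given.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: the framework groups every (word, 1) pair with the same
  // word, no matter which machine produced it, and hands the group to one
  // reduce call, which simply sums the ones.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: wires the two phases together and points them at the in/out
  // directories passed on the command line (the HDFS paths, as above).
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-sums counts on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Everything between the two phases is Hadoop's job: the mappers run on the nodes that hold the data blocks, and all counts for the same word are routed to a single reduce call before being summed; the combiner line merely pre-merges counts on the map side to cut down network traffic.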

Having read this far, we understand the Hello World above, but a question immediately follows: what are Hadoop's application scenarios? In other words, why is it so popular right now?

We are now in a big data era. The first problem is storage, and Hadoop provides a good distributed file system that lets us store massive amounts of data behind a unified interface.

Secondly, having big data does not by itself generate value; value comes from computing on it (querying, analyzing, and so on). The traditional approach deploys the program on one or a few machines and pulls the data through an interface to the program for analysis; this is called moving the data. Hadoop does the opposite: the program is automatically distributed to the Hadoop nodes that hold the data, the computation runs there, and the framework gathers the results and returns them; this is called moving the computation. Obviously, moving the computation costs far less than moving the data.

Application scenario one: search engines. The internet now holds a huge amount of data, and storing and searching it is the hard part. Hadoop's two core pieces match this use exactly: the web crawler stores the huge volume of pages it fetches in the distributed store, and at search time the query runs concurrently on the individual nodes and the returned data is merged for display. Hadoop's origin is, in fact, Google's "troika": between 2003 and 2006 Google published three technical papers on GFS, MapReduce, and BigTable. Hadoop's HDFS corresponds to Google's GFS, Hadoop's MapReduce to Google's MapReduce, and Hadoop's HBase to Google's BigTable. (Note: HBase is the Hadoop project's software for that same kind of data access.)

Application scenario two: biomedicine. Huge volumes of DNA data need to be stored and, at the same time, compared against each other; Hadoop fits here as well.

Of course there are countless other application scenarios...

At this point Hadoop's core value is finally clear: first, distributed storage; second, moving the computation to the data.

To support these functions there are naturally quite a few processes running behind the scenes, so next let's get to know these processes and the corresponding commands.

We know that running $ bin/start-all.sh starts all of Hadoop. Running jps (a JDK tool) afterwards lists the running Java processes:
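On a single-machine (pseudo-distributed) setup the output looks roughly like this; the process IDs are of course just illustrative:

$ jps
2287 NameNode
2398 DataNode
2512 SecondaryNameNode
2601 JobTracker
2715 TaskTracker
2820 Jps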

These processes all run on the same machine for now; in a real distributed deployment they are spread across different nodes. Here is what each of them does:

NameNode: the HDFS master daemon. It records how files are split into blocks and which DataNodes those blocks are assigned to, so that memory and I/O can be managed centrally. There is only one NameNode in a cluster.

DataNode: the data node, responsible for reading and writing the actual data blocks to disk. When a client needs to read or write data, it first asks the NameNode which DataNode holds (or should hold) the data, and then communicates with that DataNode directly.

SecondaryNameNode: a helper daemon that monitors the state of HDFS. Unlike the NameNode, it does not receive or record real-time changes; it only communicates with the NameNode periodically to save a snapshot of the HDFS metadata. Because the NameNode is a single point, the SecondaryNameNode's snapshots help minimize downtime and data loss when the NameNode goes down, and it can be brought in as a standby NameNode when problems occur.

JobTracker: the link between an application and Hadoop. Once code is submitted to the cluster, the JobTracker determines the execution plan: which files to process and which nodes get which tasks, and it monitors all running tasks. If a task fails, the JobTracker restarts it automatically, possibly on a different node.

TaskTracker: executes the individual tasks assigned by the JobTracker. Although there is only one TaskTracker per node, it can use multiple JVMs (Java virtual machines) to run several map or reduce tasks in parallel.

Now that we know the processes, let's look at the file-manipulation commands.

bin/hadoop is a shell script (similar to a .bat file on Windows) and expects a subcommand at run time; running it without arguments prints the list of available subcommands.
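Abbreviated, that usage message looks something like the following; the exact set of subcommands depends on the version, so treat this as illustrative rather than exhaustive:

$ bin/hadoop
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  namenode, secondarynamenode, datanode, dfsadmin, fsck,
  fs, jar, job, jobtracker, tasktracker, version, ...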

That listing describes each subcommand clearly. The one we use most is fs, which in turn has many subcommands of its own, for example bin/hadoop fs -ls to list the contents of a directory; a few usage examples follow the table below.

The fs parameters and what each one does:

[-ls <path>]  List the contents of a directory
[-lsr <path>]  Recursively list the contents of a directory
[-du <path>]  Show the disk space used by the files under the path
[-mv <src> <dst>]  Move files
[-cp <src> <dst>]  Copy files
[-rm [-skipTrash] <path>]  Delete a file
[-rmr [-skipTrash] <path>]  Delete a directory and its contents
[-put <localsrc> ... <dst>]  Upload local files to the server
[-copyFromLocal <localsrc> ... <dst>]  Copy local files to the server (same as -put)
[-moveFromLocal <localsrc> ... <dst>]  Move local files to the server (the local copies are removed)
[-get [-ignoreCrc] [-crc] <src> <localdst>]  Download a server file to the local machine
[-getmerge <src> <localdst> [addnl]]  Merge the files under a server directory and download the result as one local file
[-cat <src>]  Display the text content of a file
[-text <src>]  Output the content of a file as text
[-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>]  Copy files (or directories) to the local machine
[-moveToLocal [-crc] <src> <localdst>]  Move files (or directories) to the local machine
[-mkdir <path>]  Create a directory
[-setrep [-R] [-w] <rep> <path/file>]  Set the replication factor of a file
[-touchz <path>]  Create an empty (zero-length) file at the given path
[-test -[ezd] <path>]  Test the path (-e: exists, -z: zero length, -d: is a directory)
[-stat [format] <path>]  Print status information about the path
[-tail [-f] <file>]  Display the last 1 KB of the file
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]  Change the permissions of files (or directories)
[-chown [-R] [OWNER][:[GROUP]] PATH...]  Change the owner of files (or directories)
[-chgrp [-R] GROUP PATH...]  Change the group of files (or directories)
[-help [cmd]]  Show help for a command
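To tie a few of these commands back to the Hello World run above, here are some typical invocations. The in and out directories are the ones created earlier; result.txt is just a hypothetical local file name, and we merge the whole out directory because the names of the files inside it depend on the Hadoop version:

$ bin/hadoop fs -ls in                     # list the two test files we uploaded
$ bin/hadoop fs -cat in/test1.txt          # prints "hello world"
$ bin/hadoop fs -getmerge out result.txt   # merge the job output into one local file
$ bin/hadoop fs -rmr out                   # remove the output directory before re-running the job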
