Hadoop Streaming: practical experience and problem-solving summary


I came across a good article on hands-on Hadoop Streaming experience; most of the scenarios it covers are ones I have also run into in practice. I am reposting it here, with thanks to the author for the conscientious summary.

Directory

1. Join operations: distinguishing the type of join is important
2. Setting the key and partition fields in the launch script
3. Controlling the memory use of a Hadoop program
4. The sorting problem for numeric keys
5. Obtaining the map_input_file environment variable in the mapper
6. Recording data while the job is running
7. Checking for success when running Hadoop multiple times
8. Preprocessing lines read from stdin
9. String concatenation methods in Python
10. How to view the output of the mapper program
11. Naming variables in shell scripts
12. Designing the process in advance can simplify a lot of repetitive work
13. Some other practical experience

1. Join operations: distinguishing the type of join is important

Join operations are a very common requirement in Hadoop computations: records from two different data sources are connected on one or more key fields into a single merged output. Because of how keys repeat in the data, joins fall into three types, each handled differently. If a key can appear multiple times in a data source, that source is of type N; if it appears at most once, it is of type 1.

1) Join of type 1-1

For example, joining (student ID, name) and (student ID, class) on the student ID field: since each ID points to exactly one name and exactly one class, this is a 1-1 join. After the map phase tags each record with its source, every group received in the reduce phase contains exactly two rows. In this case you only need to read the first row, then append the non-key fields of the second row to it and output the result.

Output rows per student ID: 1*1=1
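
One way the tagging mapper might look, as a rough sketch (this is not the original author's code; the file names name.txt and class.txt and the tab-separated single-key layout are assumptions for illustration). The same tagging mapper also serves the 1-n and m-n cases below:

# mapper.py -- tag each record with its data source so the reducer can tell
# which side of the join a row came from (sketch; adapt file names to yours)
import os
import sys

filename = os.path.split(os.environ["map_input_file"])[-1]
tag = "1" if filename == "name.txt" else "2"

for line in sys.stdin:
    fields = line.strip().split('\t')
    key, rest = fields[0], fields[1:]
    # emit: key, tag, non-key fields; the tag makes source-1 rows sort ahead
    # of source-2 rows inside each key group
    print('\t'.join([key, tag] + rest))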

2) Join of type 1-n or n-1

For example, joining (student ID, name) and (student ID, elective course) on the student ID field: each ID in the second data source corresponds to many courses, so this is a 1-n join. The approach is for the map phase to add a tag of 1 to records from the first data source (the type-1 side) and a tag of 2 to records from the second. The data received in the reduce phase is then grouped by key, with the row tagged 1 sorting first in each group, and each group has more than two rows. The join method is to read the row tagged 1 first and remember its non-key field value, then keep iterating; every time a row tagged 2 is encountered, append the remembered value to the end of that row and output it.

Output rows per student ID: 1*n = n*1 = n
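
A minimal reducer sketch of this 1-n logic, assuming the key\ttag\tvalue mapper output sketched above and a single non-key field on the type-1 side (hypothetical code, not from the source article):

# reducer.py -- 1-n join: remember the single type-1 value per key,
# then append it to every type-2 row with the same key (sketch)
import sys

current_key = None
side1_value = None

for line in sys.stdin:
    fields = line.strip().split('\t')
    key, tag, rest = fields[0], fields[1], fields[2:]
    if key != current_key:
        current_key = key
        side1_value = None
    if tag == "1":
        side1_value = rest[0]      # the tag-1 row sorts first in its group
    elif side1_value is not None:
        print('\t'.join([key] + rest + [side1_value]))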

3) Join of type m-n

For example, joining (student ID, elective course) and (student ID, favorite fruit) on the student ID field: in each data source a single ID corresponds to multiple rows, so this is an m*n join. The approach is for the map phase to add tag 1 to the smaller data source (in order to save memory in the reduce phase) and tag 2 to the larger one; within each key group all rows tagged 1 sort before the rows tagged 2, and each group produces m*n output rows. The join method is to initialize an empty array, record the non-key data of every row tagged 1 in the array, and then, for every row tagged 2, append each element of the array to that row and output it.

Output rows per student ID: m*n
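
A sketch of the m-n case under the same assumed input format (again hypothetical, not the author's code):

# reducer.py -- m-n join: buffer the (smaller) tag-1 side per key,
# then cross it with every tag-2 row (sketch)
import sys

current_key = None
side1_values = []

for line in sys.stdin:
    fields = line.strip().split('\t')
    key, tag, rest = fields[0], fields[1], fields[2:]
    if key != current_key:
        current_key = key
        side1_values = []          # release the previous key's buffer
    if tag == "1":
        side1_values.append(rest)
    else:
        for buffered in side1_values:
            print('\t'.join([key] + buffered + rest))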

The blogger has also written quite a few articles about joins.
http://blog.csdn.net/bitcarmanlee/article/details/51694101 introduces data skew in Hive joins.
http://blog.csdn.net/bitcarmanlee/article/details/51863358 describes how to implement a join operation with MapReduce yourself; interested readers can take a look.

2. Setting the key and partition fields in the launch script

During a join computation there are two settings that are very important and need to be understood: the sort key field and the partition field.

num.key.fields.for.partition: used for partitioning; it only affects which reduce machine a record is sent to, not how it is sorted.

stream.num.map.output.key.fields: defines the key (primary key); it controls how many leading columns the data is sorted on.

org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner: if you need both sorting and partitioning on key fields, this partitioner must be specified explicitly; it is not used by default.

Among these three settings, the configurations that particularly affect join computations are:

1) For a join on a single key, a tag field is added for sorting, so set key=2 and partition=1, i.e. partition on the first field only, to guarantee that rows with the same key reach the same machine;

2) For a join on n combined key fields, a tag field again has to be added, so set key=n+1 for sorting, and set partition=n so that records are partitioned on the n key fields.
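
As an illustration of the single-key case, the launch command might look roughly like this (the streaming jar path, input/output paths and script names are placeholders, and option spellings can differ between Hadoop versions):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -D stream.num.map.output.key.fields=2 \
    -D num.key.fields.for.partition=1 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -input /path/to/source1 -input /path/to/source2 \
    -output /path/to/joined \
    -mapper "python mapper.py" -reducer "python reducer.py" \
    -file mapper.py -file reducer.py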

The blogger has also written a related article on how these parameters are used:
http://blog.csdn.net/bitcarmanlee/article/details/51881699
which readers can refer to.

3. Controlling the memory use of a Hadoop program

Hadoop programs deal with massive amounts of data, so any operation that holds data in a variable multiplies its memory footprint n-fold. If you try to record a single field from every row, or whole rows, in an array, the Hadoop platform will run out of memory and kill the job with error 137.

The way to control memory is to use as few variables as possible, especially arrays, to record data, so that memory use ends up independent of the current row and of the total data size. In summary: the m*n join has to record historical data, but it should be released as soon as it has been used, and wherever possible a result should be kept in a single variable rather than an array. For example, an aggregation can keep a running total each time instead of storing all elements and summing them at the end.
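
A small sketch of the running-total idea, assuming key\tnumber input lines (a hypothetical reducer, not from the article):

# reducer.py -- keep one accumulator per key instead of a list of all values
import sys

current_key = None
total = 0

for line in sys.stdin:
    key, value = line.strip().split('\t')
    if key != current_key:
        if current_key is not None:
            print('%s\t%d' % (current_key, total))
        current_key = key
        total = 0
    total += int(value)    # accumulate; never hold all values in memory

if current_key is not None:
    print('%s\t%d' % (current_key, total))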

Note: this technique is very practical. In fact it is not only relevant to Hadoop Streaming; you need to pay attention to the same point when writing native MapReduce in Java.

4. The sorting problem for numeric keys

If nothing is done, the number 1 sorts after 10, because the comparison is lexicographic. The fix is to left-pad the numbers with zeros: for example, if every number has at most 2 digits, pad single-digit numbers with one zero, so that 01 is compared with 10. In the final reduce output, convert back again; this requires knowing the maximum number of digits in advance.
In mapper.py:

print '%010d\t%s' % (int(key), value)

Here key is a number; the %010d format outputs a 10-character string, left-padded with zeros if the number has fewer than 10 digits.
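
A quick interactive illustration of why the padding matters (added here for clarity, not part of the original article):

>>> '2' < '10'                    # lexicographic comparison: wrong order
False
>>> '%010d' % 2 < '%010d' % 10    # zero-padded: compares correctly
True
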
In reducer.py, for the final output, convert the key with int to strip the leading zeros:

print '%d\t%s' % (int(key), value)

5. Obtaining the map_input_file environment variable in the mapper

In the mapper, you sometimes need to distinguish between different data file sources. The map_input_file environment variable records the path of the input file currently being processed. Here are two ways to discriminate:

a) Judge by the file name

import os

filepath = os.environ["map_input_file"]
filename = os.path.split(filepath)[-1]

if filename == "filename1":
    pass    # process 1
elif filename == "filename2":
    pass    # process 2

b) Judge by whether the file path contains a given string

import os
import sys

filepath = os.environ["map_input_file"]

if filepath.find(sys.argv[2]) != -1:
    pass    # process

The blogger has also written an article illustrating this point: http://blog.csdn.net/bitcarmanlee/article/details/51735053.

6. Recording data while the job is running

Debugging a Hadoop program is different from debugging a local program: you can use the error log to inspect error messages. Before submitting a job you can also filter out basic errors locally with cat input | python mapper.py | sort | python reducer.py > output. While the job runs, information can be recorded in the following ways:

1) Write the information directly to standard output. After the job has run you then have to filter the recorded data out by hand, or inspect it with awk, but this pollutes the result data.

2) More commonly, write to standard error, so that after the run you can find your own output in the stderr logs: sys.stderr.write('filename: %s\t' % (filename))

7. Checking for success when running Hadoop multiple times

If you want to run several Hadoop computations in a row, with each computation taking the previous one's output as input, then when the previous computation fails there is obviously no need to start the next one. In the shell script you can therefore use $? to check whether the previous run succeeded. Sample code:

if [ $? -ne 0 ]; then
   exit 1
fi
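
A hedged sketch of chaining two jobs this way; the $STREAMING_JAR variable, script names and paths are placeholders:

# run the first job; stop the whole pipeline if it fails
hadoop jar $STREAMING_JAR -input /data/step1_in -output /data/step1_out \
    -mapper "python mapper1.py" -reducer "python reducer1.py" \
    -file mapper1.py -file reducer1.py
if [ $? -ne 0 ]; then
    echo "step 1 failed" >&2
    exit 1
fi

# only reached when step 1 succeeded; its output is step 2's input
hadoop jar $STREAMING_JAR -input /data/step1_out -output /data/step2_out \
    -mapper "python mapper2.py" -reducer "python reducer2.py" \
    -file mapper2.py -file reducer2.py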

A very common and practical technique; it needs no further explanation.

8. Preprocessing lines read from stdin

Mapper and reducer programs read data from standard input, but if you split each line directly, you will find that the last field carries a trailing '\n'. There are two ways to handle it:

1)  datas = line[:-1].split('\t')

2)  datas = line.strip().split('\t')

The first method simply removes the last character '\n' and then splits; the second removes whitespace on both sides of the line (including the newline) and then splits. I prefer the second, because I am not sure that every line ends with '\n'; on the other hand, some data legitimately has spaces at either end, and strip() would damage that data, so choose according to the situation.
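
A quick illustration of the difference between the two (added for clarity):

>>> line = ' a\tb \n'
>>> line[:-1].split('\t')      # keeps the surrounding spaces
[' a', 'b ']
>>> line.strip().split('\t')   # also strips the leading/trailing spaces
['a', 'b']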

You can see this handling in basically every piece of streaming code; it needs no further explanation.

9. String concatenation methods in Python

The output of the mapper and reducer, and intermediate processing steps, often need to combine strings of different types. The ways to concatenate strings in Python are formatted output, string concatenation with +, and the join operation (which requires converting each field to a string first).

Using formatted output: '%d\t%s' % (i, s)
Using + to concatenate strings: '%d\t' % i + '\t'.join(lst)
Writing it all as a '\t' join of a tuple: '\t'.join(('%d' % i, '\t'.join(lst)))
10. How to view the output of the mapper program

In general, after the mapper finishes processing, its output is sorted and then partitioned to the different reducers for the next step. During development, however, you often need to check whether the current mapper output matches expectations. There are two needs when viewing it.

Need one: view the mapper's direct output.

In the run script, do not set the -reducer parameter (i.e. no reducer program) and set -D mapred.reduce.tasks=0, so no reduce processing is done at all; the -output option still has to be set. The output directory will then contain one file per mapper machine, which is that mapper's direct output.

Need two: view the mapper output after it has been partitioned and sorted, i.e. exactly what the reducer receives as input. In the run script, do not set the -reducer parameter (no reducer program of your own), and set -D mapred.reduce.tasks=1 or a larger value. Because reduce machines are requested, Hadoop assumes a reducer exists and therefore still runs the shuffle and sort over the mapper output. In this case the files under the output directory are the reducer input files, and their number equals the number of reduce tasks set.
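
As a hedged example, the two run-script variants might look roughly like this (the jar and paths are placeholders; exact option names can differ between Hadoop versions):

# Need one: raw mapper output, no shuffle or sort at all
hadoop jar $STREAMING_JAR -D mapred.reduce.tasks=0 \
    -input /data/in -output /data/mapper_out \
    -mapper "python mapper.py" -file mapper.py

# Need two: mapper output after partitioning and sorting, i.e. the reducer's input
hadoop jar $STREAMING_JAR -D mapred.reduce.tasks=1 \
    -input /data/in -output /data/reducer_in \
    -mapper "python mapper.py" -file mapper.py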

This technique is also particularly common and useful, especially in the debugging phase: when you cannot find where a problem is, try inspecting the mapper-stage output; it often works wonders.

11. Naming variables in shell scripts

If you have many input data sources and many intermediate results, where each Hadoop output is used as the next step's input and also by other steps, it is best to configure all file path names in one shared shell configuration file, and to avoid names like INPUTDIR1, INPUTDIR2. Variable naming is a skill; practice making names intuitive and self-explanatory, so that the program does not become more and more chaotic as it grows.

12. Designing the process in advance can simplify a lot of repetitive work

I recently took on a fairly complex Hadoop data-processing flow; all the large and small steps together were estimated at more than ten Hadoop jobs. Fortunately I did not start writing code right away but first organized the tasks as a whole, and found that many problems could be merged and handled by one class of code. In the process, the whole task was split into a sequence of small tasks, and the small tasks were then solved very quickly. If the jobs in a Hadoop workflow are complex and depend on one another's results, design the processing flow before starting to write anything.

13. Some other practical experience

1) Write the mapper and reducer scripts in the same Python file, for easy comparison and review;
2) Keep the mapping between data-source field names and column positions in a dictionary, so they are not easily confused;
3) Extract commonly used operations, such as outputting data and reading data, into independent functions;
4) Keep test scripts and data, run scripts, and map-reduce programs in well-organized directories;

Original link: http://www.crazyant.net/1122.html
This post makes slight changes on the basis of the original.
