Hadoop Streaming: practical experience and solutions

Directory

1. For join operations, it is important to distinguish the join type...

2. Set the key and partition fields in the launch script...

3. How to control Hadoop program memory...

4. Sorting of numeric keys...

5. How to obtain the map_input_file environment variable in the mapper...

6. How to record data while a job is running...

7. Judging whether multiple consecutive Hadoop runs succeeded...

8. Preprocessing lines read from stdin...

9. How to concatenate strings in Python...

10. How to view mapper program output...

11. Naming variables in shell scripts...

12. Designing the process in advance saves a lot of repetitive work...

13. Other practical experience...

1. For join operations, it is important to distinguish the join type

The join operation is a very common requirement in Hadoop computing: data from two different sources must be connected on one or more key fields into a single merged output. Depending on how keys repeat within each source, joins fall into three types, and each type is handled differently. If a key can appear multiple times in a data source, that source is of type N; if a key can appear at most once, the source is of type 1.

1) Join of type 1-1

For example, join the two datasets (student ID, name) and (student ID, class) on the student ID field. Because each student ID maps to exactly one name and exactly one class, this is a 1-1 join. The processing method is to mark each source in the map stage; the reduce stage then receives exactly two rows per key (the row from the first source sorting first), so it only needs to read the first row and append its non-key fields to the end of the second row.

Data output for each student ID: 1x1 = 1

2) Join of type 1-N or N-1

For example, join the datasets (student ID, name) and (student ID, elective course) on the student ID field. Because each student ID in the second source corresponds to many courses, this is a 1-N join. The processing method is to mark rows from the first source (the type-1 side) with 1 and rows from the second source with 2 in the map stage. The reduce stage then receives groups of two or more rows in which the row marked 1 comes first. The join reads the row marked 1 first, records its non-key value, and then traverses downward, appending that value to the end of every row marked 2 before outputting it (a mapper sketch that does this marking follows below).

Data output for each student ID: 1 * N = N * 1 = N
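
As an illustration of the marking step, here is a minimal mapper sketch for the 1-N join above, written in the Python 2 style used throughout this article. It assumes two input files whose names contain the (hypothetical) substrings "name" and "course", and it uses the map_input_file environment variable described in section 5 to tell them apart.

import os
import sys

# Hypothetical file-name test: rows from the (student ID, name) source get mark 1,
# rows from the (student ID, elective course) source get mark 2.
filename = os.path.split(os.environ["map_input_file"])[-1]
mark = "1" if "name" in filename else "2"

for line in sys.stdin:
    datas = line.strip().split('\t')
    student_id, value = datas[0], datas[1]
    # Emit key, mark, value: sort on (key, mark), partition on key only (see section 2).
    print '%s\t%s\t%s' % (student_id, mark, value)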

3) Join of type M-N

For example, join (student ID, elective course) and (student ID, favorite fruit) on the student ID field. Because each student ID in each source corresponds to multiple rows, this is an M-N join. The processing method is to mark the smaller data source with 1 in the map stage (so that the reduce stage uses less memory) and the larger one with 2. In the reduce stage each group then contains M * N output combinations, with all rows marked 1 arriving before the rows marked 2. The join first initializes an empty array; whenever a row marked 1 is encountered, its non-key data is appended to the array, and whenever a row marked 2 is encountered, the buffered data is attached to that row and output (see the reducer sketch below).

Output data per student ID: M x N
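
A minimal reducer sketch for this M-N join, assuming mapper output of the form student_id, mark, value separated by tabs, sorted so that within each student ID all rows marked 1 (the smaller source) arrive before the rows marked 2:

import sys

current_key = None
small_values = []                  # non-key values from the source marked 1

for line in sys.stdin:
    key, mark, value = line.strip().split('\t')
    if key != current_key:
        current_key = key
        small_values = []          # release memory held for the previous key
    if mark == "1":
        small_values.append(value)
    else:
        # every row marked 2 is joined with all buffered rows marked 1
        for small_value in small_values:
            print '%s\t%s\t%s' % (key, small_value, value)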

2. Set the key and partition fields in the launch script

In join computations, two settings are very important and must be understood: the key (sorting) fields and the partition fields.

Setting and what it does:

num.key.fields.for.partition
Used for partitioning. It only determines which reduce machine a row is sent to; it does not affect sorting.

stream.num.map.output.key.fields
Specifies how many leading fields make up the key, which determines how many columns the data is sorted on.

org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
The partitioner class that must be specified whenever you want to control the sort and partition fields as above.

The three settings above particularly affect the configuration used during join computations:

1) If the join is on a single key field: because the mark field must also take part in sorting, set the key to 2 fields, and at the same time set partition to 1 so that partitioning uses only the first field; this guarantees that all data with the same key ends up on the same machine.

2) If the join is on N combined key fields: the mark field is appended after them, so set the key to N + 1 fields for sorting, and set partition to N so that partitioning still uses the real key (a launch-script sketch for the single-key case follows below).
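
For concreteness, a minimal sketch of a launch script for the single-key join case, written in Python for consistency with the other examples here; the streaming jar path, input/output paths and script names are placeholders, not taken from the original article.

import subprocess

cmd = [
    "hadoop", "jar", "/path/to/hadoop-streaming.jar",   # placeholder jar path
    "-D", "stream.num.map.output.key.fields=2",          # sort on the first 2 fields (key + mark)
    "-D", "num.key.fields.for.partition=1",              # partition on the first field only
    "-input", "/data/source1",                           # placeholder input paths
    "-input", "/data/source2",
    "-output", "/data/join_output",                      # placeholder output path
    "-mapper", "mapper.py",
    "-reducer", "reducer.py",
    "-partitioner", "org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner",
    "-file", "mapper.py",
    "-file", "reducer.py",
]
subprocess.check_call(cmd)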

3. How to control Hadoop program memory

Hadoop programs are aimed at massive volumes of data, so any operation that stores values in variables multiplies memory use by the size of the data. If you try to record a single field of every row, or certain whole rows, in an array, the program may not even finish running: the Hadoop platform will kill it with a 137 out-of-memory error.

The way to control memory is to use as few variables as possible, especially arrays, to record data, so that processing the current row does not depend on the total data size. Summaries, M * N joins, and similar computations do have to keep some history; for these, release it as soon as possible and record it in single scalar variables rather than arrays. For example, in a summary computation, accumulate each record's value into a running total instead of collecting all the elements first (see the sketch below).
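
As a concrete example, a minimal summing reducer sketch that keeps one running total per key instead of collecting every value into an array, so memory use stays flat no matter how many rows a key has; input is assumed to be key and number separated by a tab, already sorted by key.

import sys

current_key = None
total = 0

for line in sys.stdin:
    key, value = line.strip().split('\t')
    if current_key is not None and key != current_key:
        print '%s\t%d' % (current_key, total)   # emit the finished key
        total = 0
    current_key = key
    total += int(value)

if current_key is not None:
    print '%s\t%d' % (current_key, total)       # emit the last key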

4. Sorting of numeric keys

If nothing is done, the number 1 will be placed after 10 during sorting, because keys are compared as strings. The fix is to left-pad numbers with zeros: for example, pad to two digits so that 01 and 10 are compared, which gives the right order. The padding width has to be estimated in advance, and the leading zeros are removed again when the reducer writes its final output.

In mapper.py:

print '%010d\t%s' % (int(key), value)

Because the key is a number, format it with %010d to produce a 10-character string; if the number is shorter than 10 digits, zeros are added in front.

In reducer.py, when writing the final output, use int() to strip the leading zeros:

print '%d\t%s' % (int(key), value)

5. How to obtain the map_input_file environment variable in the mapper

In the mapper, the map_input_file environment variable records the path of the file currently being processed by the script, which makes it possible to distinguish different data file sources. There are two ways to identify the source:

a) Identify the source by file name

import os

filepath = os.environ["map_input_file"]
filename = os.path.split(filepath)[-1]

if filename == "filename1":
    # process source 1
    pass
elif filename == "filename2":
    # process source 2
    pass

b) Check whether the file path contains a specified string

import os, sys

filepath = os.environ["map_input_file"]
if filepath.find(sys.argv[2]) != -1:
    # process this source
    pass

6. How to record data while a job is running

Debugging a Hadoop program is different from debugging a local program. You can read error messages from the task logs, and before submitting a job you can filter out basic errors locally with a pipeline such as cat input | mapper.py | sort | reducer.py > output. To record information while the job is actually running, there are two approaches:

1) Write the information directly to standard output. After the job finishes you have to filter the recorded data out by hand or pull it out with awk, and the result data gets contaminated.

2) More commonly, write to standard error. After the job runs, the information can be read in the stderr logs: sys.stderr.write('filename: %s\t' % (filename)) (a small helper sketch follows below).
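
A small sketch of such a debug helper; the DEBUG prefix is just an assumption to make the lines easy to grep out of the stderr logs.

import sys

def log(msg):
    # goes to the task's stderr log, not into the job's result data
    sys.stderr.write('DEBUG: %s\n' % msg)

# e.g. inside the mapper loop:
# log('processing file: %s' % filename)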

7. Judging whether multiple consecutive Hadoop runs succeeded

When several Hadoop computations run in sequence and each one's result is the next one's input, there is no point starting the next computation if the previous one failed. You can therefore use $? in the launch script to check whether the last command succeeded. Sample code:

if [ $? -ne 0 ]; then
    exit 1
fi

8. Preprocessing lines read from stdin

Mapper and reducer programs read data from standard input. If you split each line directly, you will find that the last field still carries a trailing '\n'. There are two solutions:

1) datas = line[:-1].split('\t')

2) datas = line.strip().split('\t')

The first method removes the last character, \n, and then splits; the second strips whitespace (including the newline) from both ends of the line and then splits. I personally prefer the second, because I am not sure every line ends with \n. On the other hand, some data carries meaningful spaces at either end, and strip() would damage it, so choose according to the situation.

9. How to concatenate strings in Python

Mapper and reducer output, as well as intermediate processing, often need to combine strings of different types. Python offers formatted output, concatenation with the + operator, and join() (for which every field must first be converted to a string).

Formatted output: '%d\t%s' % (i, s)

Concatenation with the + operator: '%d\t' % i + '\t'.join(datas)

join() on a tuple: '\t'.join(('%d' % i, '\t'.join(datas)))

10. How to view mapper program output

Normally the mapper output is sorted and partitioned out to the different reducers for the next step. During development, however, you often need to check whether the current mapper output is what you expect. There are two kinds of inspection.

Requirement 1: View mapper's direct output:

In the launch script, do not set the -reducer parameter, i.e. run with no reducer program at all, and set -D mapred.reduce.tasks=0 so that no reduce processing is performed; the -output option must still be set. The output directory will then contain one file per mapper machine, and these files are the mapper program's direct output.

Requirement 2: view the mapper output after it has been partitioned and sorted, i.e. what the reducer input looks like. In the launch script, again do not set the -reducer parameter (no reducer program of your own), but set -D mapred.reduce.tasks=1 or a larger value so that reduce machines are allocated. Even without a reducer program, Hadoop assumes a reduce stage exists, so it still shuffles the mapper output to partition and sort it. The files under the output directory are then exactly the reducer input, and their number equals the number of reduce tasks that was set.

11. Naming variables in shell scripts

When there are many input data sources and many intermediate outputs, with each Hadoop step's output feeding the next step's input (and sometimes other steps as well), it is best to configure all file path names in a single shell configuration file and to avoid names like InputDir1 and InputDir2. Variable naming is a skill: practice making names intuitive and clear, so the program does not become more and more muddled as it grows.

12. Designing the process in advance saves a lot of repetitive work

I recently received a complicated Hadoop data processing flow that needed more than a dozen Hadoop tasks to complete, covering both big-data and small-data estimates. Fortunately I did not start writing code right away; instead I sorted all the tasks out in a unified way and found that many of them could be merged into a single class of code. In the process, the whole job was split into many small tasks that could proceed in parallel, and those small tasks were then solved one by one very quickly. When a Hadoop flow is complicated and tasks depend on each other's results, be sure to design the processing flow before starting.

13. Other practical experience

1) Write the mapper and reducer scripts in the same Python program, so they are easy to compare and read;

2) Keep a dictionary mapping each data source's field names to their column positions, so the fields are not easily confused;

3) Extract commonly used routines, such as writing output and reading data, into independent functions;

4) Keep test scripts and test data, run scripts, and map-reduce programs in separate directories;

