Common join
The common join strategy is not affected by the size of the data; it is also called a reduce side join. It is the least efficient join method, and it is completed by a single MapReduce job.
First, both the large table and the small table are read by map tasks. In the map shuffle stage, each map output key is rewritten as table_name_tag_prefix + join_column_value; however, only join_column_value is used for hashing during partitioning, so rows with the same join key from both tables land on the same reducer.
Each reducer accepts all the splits transmitted by the map tasks. In the reducer's shuffle phase, the table_name_tag_prefix is discarded from the map output key before comparison. Because the tag sorts the small table's records first, each reducer can load the small table's split into memory and turn it into a hashtable, then compare each record of the large table against it one by one.
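For illustration, a plain equi-join like the following (table names are made up) runs as a common join when no hint or auto-conversion applies, shuffling both inputs to the reducers as described above:

SELECT a.*, b.val
FROM big_table a
JOIN small_table b ON (a.key = b.key);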
Map join
A map join is computed in two steps. First, the small table's data is converted into a hashtable and broadcast to all the map tasks, while the large table's data is split as usual. Then, in the map stage, each row of the large table probes the small table's hashtable; if the join keys are equal, the joined result is written to HDFS.
It is called a map join because all of its work is computed on the map side.
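For example, with hypothetical table names, the following hinted query asks Hive to build small_table into a hashtable and probe it on the map side:

SELECT /*+ MAPJOIN(b) */ a.*, b.val
FROM big_table a
JOIN small_table b ON (a.key = b.key);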
Hive has made several optimizations on map join:
- In Hive 0.6, the table written later in the query was assumed to be the large table and the one written earlier the small table, or you could set this explicitly with the /*+ MAPJOIN(map_table) */ hint. In Hive 0.7 this decision is automated: Hive first determines automatically which table is small and which is large, controlled by (hive.auto.convert.join = true); the small-table size threshold is then controlled by the (hive.smalltable.filesize = 25000000L) parameter (25 MB by default). When the small table exceeds this size, Hive falls back to common join by default. See HIVE-1642, and the settings sketch after this list.
- First, the map stage over the small table is converted into a MapReduce local task, which fetches all of the small table's data from HDFS, converts it into a hashtable file, and compresses it into the DistributedCache.
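A minimal sketch of the Hive 0.7 settings named above (values are the defaults the text cites):

set hive.auto.convert.join = true;
-- small tables below this size (25 MB default) are auto-converted to map join
set hive.smalltable.filesize = 25000000;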
Currently there are still several limitations on map join in Hive. One is that it plans to use a BloomFilter to implement the hashtable; a BloomFilter saves roughly 8 to 10 times the memory compared with a hashtable, but its size is difficult to control.
Currently, the hashtable file in the DistributedCache is replicated three times by default. For a large table with 1,000 map tasks this replication factor is too small, and most map tasks end up waiting for the DistributedCache replication.
Bucket map join
When creating a Hive table, hash partitioning (bucketing) is supported by specifying the CLUSTERED BY (col_name, xxx) INTO number_buckets BUCKETS clause.
When the join key of the two joined tables is the bucket column, you can set

set hive.optimize.bucketmapjoin = true;

to make Hive execute a bucket map join. Note that your small table's number_buckets must be a multiple of the large table's; no matter how many tables are joined, this condition must be met. (In fact, if all bucket counts are powers of 2, the large table's can also be a multiple of the small table's, but this requires one extra computation. This works for int; whether it works for long and string is unclear.)
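Putting the pieces together, a hedged sketch with made-up table names: both tables are bucketed on the join key, the small table's bucket count is a multiple of the large table's, and the hint triggers the map-side execution:

CREATE TABLE big_t (key INT, val STRING)
CLUSTERED BY (key) INTO 16 BUCKETS;

CREATE TABLE small_t (key INT, ref STRING)
CLUSTERED BY (key) INTO 32 BUCKETS;

set hive.optimize.bucketmapjoin = true;

SELECT /*+ MAPJOIN(s) */ b.val, s.ref
FROM big_t b
JOIN small_t s ON (b.key = s.key);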
The bucket map join execution plan has two steps. First, the small table is converted into hashtables (one per bucket) and broadcast to the map side of all the large table's tasks. The map side of the large table receives the small table's number_buckets hashtables and does not need to merge them into one large hashtable; it can run the map operation directly. The map operation produces number_buckets splits, each split tagged the same as a small-table hashtable. When executing the probe, only one of the small table's hashtables needs to be in memory at a time while the corresponding split of the large table is taken out and matched, so the memory limit is the size of the small table's largest hashtable.
Bucket map join is also an implementation of map side join: all calculations are completed on the map side, and any join without a reduce phase is called a map side join. Bucketing is just Hive's hash-partition implementation; of course there is also value partitioning:
CREATE TABLE a (xxx) PARTITIONED BY (col_name)
However, in Hive two tables do not necessarily have the same partition key, and even when they do, it is not necessarily the join key; therefore Hive does not have this kind of value-based map side join. List partitioning in Hive is mainly used to filter (prune) data rather than to distribute it for joins; the two main parameters are (hive.optimize.cp = true and hive.optimize.pruner = true).
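As an illustration of value partitioning used for filtering (a hypothetical sketch, not from the original): with pruning enabled, only the matching partition directory is scanned:

CREATE TABLE logs (msg STRING) PARTITIONED BY (dt STRING);
-- only the dt='2012-01-01' partition is read
SELECT msg FROM logs WHERE dt = '2012-01-01';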
The Hadoop source code provides a map side join by default; you can find several related classes under the src/contrib/data_join/src directory of the Hadoop source. There, TaggedMapOutput can be used to implement hash or list partitioning, depending on how you decide to partition. Chapter 8 of Hadoop: The Definitive Guide, on map side joins and side data distribution, also gives an example of how to implement a map side join for value partitions.
Sort merge bucket map join
Bucket map join does not remove map join's limitation that the small table must be loaded entirely into memory. If you want neither the large table nor the small table on a node to be loaded into memory, the two tables must be ordered on the join key; you can specify SORTED BY (join_key) when creating the tables, or use an index.
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
Bucket columns = join columns = sort columns
In this way, only part of the small table's data needs to be read at a time and merged in order with the rows of the large table. Such a join is not limited by memory size, and it can also execute a full outer join.
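A hedged sketch of tables eligible for sort merge bucket map join (names are made up); note that the bucket, sort, and join columns coincide, as required above:

CREATE TABLE big_sorted (key INT, val STRING)
CLUSTERED BY (key) SORTED BY (key) INTO 32 BUCKETS;

CREATE TABLE small_sorted (key INT, ref STRING)
CLUSTERED BY (key) SORTED BY (key) INTO 32 BUCKETS;

-- with the three settings above, the join merges the sorted buckets:
SELECT /*+ MAPJOIN(s) */ b.val, s.ref
FROM big_sorted b
JOIN small_sorted s ON (b.key = s.key);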
Skew join
Skew in real data is a certainty. By default Hadoop uses

hive.exec.reducers.bytes.per.reducer = 1000000000

that is, each reducer processes 1 GB of data by default. If your join operation produces data skew, set the following in Hive:
set hive.optimize.skewjoin = true;
set hive.skewjoin.key = skew_key_threshold (default = 100000)
Hive has no way to determine at run time which key will produce how large a skew, so this parameter controls the skew threshold: once a key exceeds it, the remaining rows for that key are sent to reducers that have not yet reached the threshold. Generally, 2 to 4 times (total records processed / number of reducers) is acceptable.
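As a worked example with assumed numbers (not from the original): a job that processes 2,000,000,000 records with 100 reducers averages 20,000,000 records per reducer, so 2 to 4 times that suggests a threshold of roughly 40 to 80 million:

set hive.optimize.skewjoin = true;
-- 2 x (2,000,000,000 records / 100 reducers) = 40,000,000 (illustrative)
set hive.skewjoin.key = 40000000;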
Skew is common. Generally, when a query nests SELECTs more than two levels deep and translates into an execution plan of more than three MapReduce jobs, skew is easy to produce, so we recommend setting this parameter before running complex SQL. If you do not know what value to use, you can follow the default algorithm of 1 GB per reducer, giving skew_key_threshold = 1 GB / average row length; or just set it to 250000000 directly (which assumes an average row length of roughly 4 bytes).
Left semi join
Hive does not have clauses such as IN/EXISTS, so you need to convert such subqueries into a LEFT SEMI JOIN. A left semi join passes only the table's join key to the map stage; if the key set is small enough, a map join is still executed, otherwise a common join is performed.
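For example, an IN-style subquery over hypothetical tables is rewritten as:

-- instead of: SELECT * FROM orders WHERE customer_id IN (SELECT id FROM vip_customers)
SELECT o.*
FROM orders o
LEFT SEMI JOIN vip_customers v ON (o.customer_id = v.id);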
From http://hi.baidu.com/eagooqi/item/9f247a953462cb35326eeb07