Five Data Organization Patterns, 4) Full Sort and Shuffle

Source: Internet
Author: User

The partitioning and binning patterns discussed earlier do not care about the order of the data. The full sort and shuffle patterns covered next are concerned with ordering the data by a specified key in parallel.
Full sort explained: Sorting is easy to implement in a sequential program, but it is not easy to implement in MapReduce, or in parallel programming generally. It is a typical "divide and conquer" problem.
Each reducer sorts its own data by key, but that sort does not produce a global order. The goal here is a total order, in which the records of the entire data set are in sequence.
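
To make the difference between a per-reducer sort and a global order concrete, here is a minimal sketch in plain Python (not Hadoop code). The modulo partitioner and the helper names are assumptions made for illustration only.

```python
def hash_partition_sort(keys, num_reducers):
    """Simulate reducers that each sort their own keys after hash partitioning."""
    partitions = [[] for _ in range(num_reducers)]
    for key in keys:
        # Modulo stands in for a hash partitioner (an assumption of this sketch).
        partitions[key % num_reducers].append(key)
    # Each reducer sorts only the keys it received.
    return [sorted(part) for part in partitions]


def concatenate(partitions):
    """Read the reducer outputs back in partition order."""
    return [key for part in partitions for key in part]


keys = [7, 2, 9, 4, 1, 8, 3, 6]
parts = hash_partition_sort(keys, num_reducers=2)
print(parts)               # [[2, 4, 6, 8], [1, 3, 7, 9]]
print(concatenate(parts))  # [2, 4, 6, 8, 1, 3, 7, 9]
```

Each inner list is sorted, but the concatenated output is not, which is why a full sort needs range-based partitions.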

Purpose: Sorted data has many uses. For example, sorting by time provides a timeline-based view. In a sorted data set, a record can be found by binary search instead of linear search. With sorted MapReduce output, looking at the first and last records of a file gives the lower and upper bounds of the data in it. This is useful when locating records, and it is one of the main features of HBase. If the data is already sorted by a primary key or index column, bulk loading that data can be faster.
However, using MR to sort data is generally not recommended, because it is expensive: the full data set has to be scanned more than once. (Why would a good design need a second sort?)
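
The binary search benefit mentioned above can be sketched with Python's standard `bisect` module. The `find_record` helper and the sample keys are hypothetical and only illustrate the idea.

```python
from bisect import bisect_left


def find_record(sorted_keys, target):
    """Return the index of target in sorted_keys, or -1 if it is absent."""
    i = bisect_left(sorted_keys, target)
    if i < len(sorted_keys) and sorted_keys[i] == target:
        return i
    return -1


timestamps = [1001, 1005, 1010, 1042, 1077]  # hypothetical sorted keys
print(find_record(timestamps, 1042))  # 3
print(find_record(timestamps, 1000))  # -1
# The first and last entries are the lower and upper bounds of the data.
print(timestamps[0], timestamps[-1])
```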

Scope of use: The sort key must be comparable so that the data can be ordered. In addition, if quantiles of the data are needed to set the partition ranges, the data must be sampled in preparation.

Structure: A full sort requires dividing the data into partitions by value range so that each partition holds a subset of roughly the same size. The range of the whole data set (lower bound, upper bound) determines the range of data each reducer will sort. The data is then partitioned by a custom partitioner according to the sort key.
The full sort pattern consists of two phases:
    1 Analysis phase
        The analysis phase determines the range of each partition.
        If the distribution of the data does not change quickly over time, the analysis phase only needs to be run once.
        Guessing the partitions: if the data is evenly distributed, the partition ranges can be guessed. For example, to sort 1,000,000 (100W) records by sequentially increasing ID with 1,000 reducers, each partition covers 1,000 IDs.
        The analysis phase first takes a random sample of the data and then derives the partitions from that sample. The principle is that partitions that split the random sample evenly will also split the larger data set evenly.
    2 Sort phase
        The data is sorted in the sort phase.
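
The sampling step of the analysis phase can be sketched as follows. This is plain Python rather than a MapReduce job, and the function name, parameters, and seed are assumptions chosen for the example.

```python
import random


def sample_boundaries(keys, num_partitions, sample_size, seed=0):
    """Pick num_partitions - 1 boundary keys from a random sample of keys."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, sample_size))
    # Take evenly spaced elements of the sorted sample as split points.
    step = sample_size / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]


keys = list(range(100_000))
random.Random(1).shuffle(keys)  # the input arrives in no particular order
boundaries = sample_boundaries(keys, num_partitions=4, sample_size=1_000)
print(boundaries)  # three split points, each near a quartile of the data
```

Because the boundaries split the sample evenly, they should also split the full key set into partitions of similar size.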
Full sort example:
1 Analysis
Assume there are 1 billion records to sort and the sort will run with 1,000 reducers. A sample of about 100,000 (10W) records, roughly 100 per reducer, should yield fairly even partitions. That is, the analysis phase (sampling) only needs to cover 0.01% of the records.
Only one reducer is needed in the analysis job to collect the sampled data. The output key is the attribute being sorted on, and the value is null to save space. This output is saved as the range boundaries for the data.
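
The sampling arithmetic in the analysis step can be checked with a few lines of Python. The variable names are illustrative; the figures are the ones used in the example (1 billion records, 1,000 reducers, a 10W sample).

```python
total_records = 1_000_000_000   # 1 billion records to sort
num_reducers = 1_000
sample_size = 100_000           # "10W", i.e. 10 x 10,000

sample_fraction = sample_size / total_records
samples_per_reducer = sample_size // num_reducers

print(f"{sample_fraction:.2%}")  # 0.01%
print(samples_per_reducer)       # 100
```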
2 Sort
Mapper phase: Traverse the 1 billion records and compare each key against the list of boundary values from the analysis phase. The partitioner assigns each record to a partition based on those boundary values.
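
A minimal sketch of the sort phase, again in plain Python rather than Hadoop code: a range partitioner looks up each key in the boundary list, every partition is sorted on its own, and the outputs are read back in partition order. The function names and the boundary values are assumptions for illustration.

```python
from bisect import bisect_right


def range_partition(key, boundaries):
    """Return the partition index for key, given sorted boundary keys."""
    return bisect_right(boundaries, key)


def total_order_sort(keys, boundaries):
    """Partition keys by range, sort each partition, and concatenate."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for key in keys:  # the "mapper" plus partitioner step
        partitions[range_partition(key, boundaries)].append(key)
    # Each "reducer" sorts only its own partition.
    sorted_parts = [sorted(part) for part in partitions]
    # Every key in partition i is smaller than every key in partition i + 1,
    # so reading the outputs in partition order gives a global order.
    return [key for part in sorted_parts for key in part]


keys = [42, 7, 93, 15, 68, 30, 81, 55]
boundaries = [25, 50, 75]  # hypothetical output of the analysis phase
print(total_order_sort(keys, boundaries))  # [7, 15, 30, 42, 55, 68, 81, 93]
```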

If you need a primary sort plus a secondary sort, you can concatenate the two keys with a separator between them, for example province^city.
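
The concatenated key can be sketched like this. The `composite_key` helper and the sample records are hypothetical; the `^` separator follows the example in the text.

```python
SEPARATOR = "^"


def composite_key(province, city):
    """Join the primary and secondary sort keys into one sortable string."""
    return f"{province}{SEPARATOR}{city}"


records = [("Zhejiang", "Ningbo"), ("Hebei", "Tangshan"),
           ("Zhejiang", "Hangzhou"), ("Hebei", "Baoding")]
keys = sorted(composite_key(p, c) for p, c in records)
print(keys)
# ['Hebei^Baoding', 'Hebei^Tangshan', 'Zhejiang^Hangzhou', 'Zhejiang^Ningbo']
```

One caveat of this approach: plain string comparison also compares the separator itself, so the order can be off when one primary key is a prefix of another. Choosing a separator that sorts before every character used in the keys is one way to avoid this.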

Performance analysis: This is a costly operation because the pattern has to load and parse the data twice. The first pass establishes the partition ranges, and the second pass actually sorts the data.
Because all of the data has to be sent over the network and written to disk, a relatively large number of reducers should be used.
    


Shuffle: The effect of the shuffle pattern is the opposite of the full sort, but it too is concerned with the order of the data in the data set. (Uses include randomizing data, anonymizing it, and repeatable random sampling.)
Structure: Every mapper outputs the input record as the value, with a random key.
The reducers sort by the random keys, which leaves the data randomly distributed.
Performance analysis: The shuffle pattern performs very well because the records each reducer receives are random, so the data is evenly distributed across the reducers. The more reducers there are, the faster the data is spread out. The size of each output file can also be predicted: it is the size of the entire data set divided by the number of reducers.
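
The shuffle pattern can be simulated in a few lines of plain Python. This is a sketch under assumptions: the function name is hypothetical, the random key is a float in [0, 1), and records are assigned to reducers by the range of that key. A fixed seed is used here only so the run is repeatable.

```python
import random


def shuffle_records(records, num_reducers, seed=None):
    """Simulate the shuffle pattern: random key per record, sort by that key."""
    rng = random.Random(seed)
    partitions = [[] for _ in range(num_reducers)]
    for record in records:  # the "mapper" step
        key = rng.random()  # random key in [0, 1)
        # The range of the random key decides which reducer gets the record.
        partitions[int(key * num_reducers)].append((key, record))
    shuffled = []
    for part in partitions:  # the "reducer" step
        part.sort(key=lambda pair: pair[0])  # order by the random key
        shuffled.append([record for _, record in part])
    return shuffled


records = list(range(10_000))
parts = shuffle_records(records, num_reducers=4, seed=42)
print([len(p) for p in parts])  # each size is close to 10_000 / 4
```

The partition sizes come out close to the total size divided by the number of reducers, which matches the size prediction above.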









