Breaking Yahoo record--Microsoft 60 SEC Processing 1401GB Data

Source: Internet
Author: User
Keywords Must Yahoo Microsoft compare Inter

The Microsoft Institute recently broke the data-finishing speed record that was maintained by Yahoo. The 9-person team at the >microsoft Institute successfully completed 1401GB data http://www.aliyun.com/zixun/aggregation/11208.html in just 60 seconds. Their tests are based on minutesort benchmarks. Minutesort is the amount of data that is sorted in a minute. Microsoft has adopted a new distributed computing system (Flat Datacenter Storage) to speed up data processing.

It is worth mentioning that Microsoft's system uses 250 hosts (1033 disks), while Yahoo's record-making system uses 1406 hosts (5624 disks).

Microsoft believes that flat Datacenter storage can use its technical advantages to help Bing improve performance, and in the future Microsoft believes that flat Datacenter storage can make a difference in the field of machine learning. Currently the most popular processing technology in the field of large data processing is Hadoop and mapreduce, but now it seems that Microsoft's flat Datacenter storage technology is more advantageous. (terminator/Compilation)

Detailed test results

Click to view larger image

Extended Reading

Minutesort is a comparison of the amount of data that is sorted in a minute, graysort the sort rate (tbs/minute) when sorting large data (at least 100TB). The benchmark rules are specific as follows:

The input data must exactly match the data generated by the data generator

When the task starts, the input data cannot be in the operating system's file cache

Input and output data are not compressed

Output cannot override input

Output file must be stored on disk

You must calculate the CRC32 of each key/value pair for the input and output data, a total of 128-bit checksums, and of course, the input and output must correspond to equal

If the output is divided into multiple output files, then it must be completely orderly, that is, the output file must be connected to a completely ordered output

Start and distribute programs to the cluster also to be recorded in the calculation time

Any sampling should also be recorded in the calculated time

Yahoo researchers used Hadoop to arrange 1TB data in 62 seconds, and 1PB data in 16.25 hours.

(Responsible editor: The good of the Legacy)

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.