Big data processing tools: is Hadoop a bit of a misnomer?

Source: Internet
Author: User

Recently I have been talking with data center architects at Baidu, Alibaba, Tencent, and China Mobile, following discussions of leading big data analysis examples in online forums and communities, and talking with Internet and cloud developers. I was happy to find that big data analysis is very common in China: not only are American cultural elements such as Starbucks and House of Cards widely sought after here, but Hadoop is also widely accepted and dominates the discussions of Chinese cloud developers. Yet, as with other popular things, people have started to ask whether the current hype is justified. "If I say Hadoop is a bit of a misnomer, will someone come to challenge me?" Executives and developers around the world may well be asking themselves the same question. Today, companies that work terms like "big data" into their messaging to polish their image can be seen everywhere, and developers buying Hadoop books to advance their careers are not uncommon either.

More rational architects, however, should at least remember the practical reasons for adopting Hadoop in the first place:

a) It is free of charge.

b) It needs only cheap (commodity) hardware: a cluster of ordinary servers is still cheaper than high-performance, dedicated machines.

c) It is easy to develop for. With a large user base, the code base is growing at a staggering rate, and self-taught developers are everywhere; they are easy to find on BBS forums or among one's WeChat contacts.

These advantages are hard to resist. Building your own parallel processing platform with automated task tracking, data replication, and file sharing would mean developing and maintaining millions of lines of platform code; just thinking about it is enough to give any company's engineers a headache. In addition, implementing such a system would require custom hardware and add several more years before anyone could actually start developing analysis applications. So, do we have no choice but Hadoop?

Now let's listen to the hardware architects. Hadoop can be very inefficient for some tasks:

1) File-oriented: Hadoop's input comes from files, and files are also used to store intermediate results, so the performance of every map-reduce job depends on file I/O.

2) Shared-nothing: each node has its own local resources (CPU, DRAM, local SSD, local HDD) and relies entirely on them, unless it requests remote data through the distributed file system (HDFS).
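To make the file-oriented point concrete, here is a minimal sketch in plain Python (it does not use Hadoop's API) of a word count whose map output is written to intermediate files before the reduce step reads it back. The function names `map_phase`, `reduce_phase`, and `word_count`, and the temporary file layout, are assumptions made purely for illustration.

```python
import json
import os
import tempfile
from collections import defaultdict


def map_phase(input_paths, work_dir):
    """Read each input file and spill (word, 1) pairs to an intermediate file."""
    spill_paths = []
    for i, path in enumerate(input_paths):
        spill_path = os.path.join(work_dir, f"map-{i}.jsonl")
        with open(path) as src, open(spill_path, "w") as spill:
            for line in src:
                for word in line.split():
                    spill.write(json.dumps([word, 1]) + "\n")
        spill_paths.append(spill_path)
    return spill_paths


def reduce_phase(spill_paths, output_path):
    """Read every intermediate file back from disk and sum the counts per word."""
    counts = defaultdict(int)
    for spill_path in spill_paths:
        with open(spill_path) as spill:
            for line in spill:
                word, n = json.loads(line)
                counts[word] += n
    with open(output_path, "w") as out:
        json.dump(counts, out)
    return dict(counts)


def word_count(texts):
    """Run the whole job; every stage boundary is a round trip through files."""
    with tempfile.TemporaryDirectory() as work_dir:
        input_paths = []
        for i, text in enumerate(texts):
            path = os.path.join(work_dir, f"input-{i}.txt")
            with open(path, "w") as f:
                f.write(text)
            input_paths.append(path)
        spills = map_phase(input_paths, work_dir)
        return reduce_phase(spills, os.path.join(work_dir, "result.json"))
```

Even in this toy version, the data is written and re-read once between map and reduce; a job chained from several such stages pays that file I/O cost at every boundary.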

As a result, Hadoop is ideal for its original design goal: using a cluster of inexpensive machines to process very large data files in parallel, in batch mode, and to condense the resulting information into smaller files.

Now, our friends at Microsoft, Yahoo, and Facebook have uncovered some startling statistics: unless you are doing keyword indexing at full scale, big data is often not that "big" (it can be stored on a personal laptop)! It can also easily be split into small chunks for digestion (in other words, do not mine a whole year of history at once; do it day by day).

a) The median size of map-reduce job inputs at Microsoft and Yahoo is only 14GB.

b) 90% of Facebook's map-reduce jobs are smaller than 100GB.

Most of these analysis tasks would fit in the main memory of a single server. If there were a way to share memory among the servers in a single rack, perhaps 99% of tasks could be handled that way. So is it really necessary to churn through files? Why bother upgrading HDDs to SSDs? Any software engineer would say, "If all my data were in main memory... I could run 100 times faster!"
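The arithmetic behind this claim can be sketched as follows. The 14GB and 100GB job sizes come from the statistics above; the DRAM capacity per server and the number of servers per rack are hypothetical values chosen only for illustration, and the check ignores memory overhead from the operating system and the analysis framework.

```python
def placement(job_gb, server_dram_gb, servers_per_rack):
    """Return where a job's data would fit: one server, one rack, or neither."""
    if job_gb <= server_dram_gb:
        return "single server"
    if job_gb <= server_dram_gb * servers_per_rack:
        return "single rack (needs shared memory)"
    return "larger than one rack"


# Hypothetical hardware: 128GB of DRAM per server, 40 servers per rack.
SERVER_DRAM_GB = 128
SERVERS_PER_RACK = 40

for job_gb in (14, 100, 2000, 10000):
    print(job_gb, "GB ->", placement(job_gb, SERVER_DRAM_GB, SERVERS_PER_RACK))
```

Under these assumed numbers, both the 14GB and the 100GB jobs fit on a single server, and only jobs beyond roughly 5TB would outgrow the rack.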

This dream is now becoming a reality. When I posted on the BBS, I saw a flashing banner ad: "Spark Summit, April 19, Beijing: big data analysis 100 times faster than Hadoop". What has Spark changed that lets it claim to be 100 times faster than Hadoop?

a) In-memory data analysis: data no longer has to be accessed through the file system and disk I/O, which is an advantage for repeated iterations (map without reduce).

b) Shared-nothing: data sharing and requests for remote data are still made to, and served by, a remote node.
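The benefit of keeping data in memory across iterations can be illustrated with a small plain-Python sketch. It does not use Spark's API and does not measure the claimed 100x speedup; it only counts how often the dataset is read from disk when the same pass is repeated. All function names here are hypothetical.

```python
import os
import tempfile


def load_numbers(path, stats):
    """Read the dataset from disk and record that a file read happened."""
    stats["file_reads"] += 1
    with open(path) as f:
        return [float(line) for line in f]


def iterate_from_files(path, passes, stats):
    """File-oriented style: every pass goes back to disk for its input."""
    total = 0.0
    for _ in range(passes):
        total += sum(load_numbers(path, stats))
    return total


def iterate_in_memory(path, passes, stats):
    """In-memory style: load once, then repeat the pass over the cached list."""
    cached = load_numbers(path, stats)
    total = 0.0
    for _ in range(passes):
        total += sum(cached)
    return total


def compare(values, passes):
    """Run both styles on the same data; return (results, file read counts)."""
    with tempfile.TemporaryDirectory() as work_dir:
        path = os.path.join(work_dir, "data.txt")
        with open(path, "w") as f:
            f.write("\n".join(str(v) for v in values))
        file_stats = {"file_reads": 0}
        mem_stats = {"file_reads": 0}
        from_files = iterate_from_files(path, passes, file_stats)
        in_memory = iterate_in_memory(path, passes, mem_stats)
        return (from_files, in_memory), (file_stats["file_reads"], mem_stats["file_reads"])
```

Both styles compute the same result, but the file-oriented version reads the dataset once per pass while the in-memory version reads it once in total, which is why iterative workloads gain the most.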

So it seems that taking HDFS out of the path has brought a 100-fold improvement? Then what if the hardware allowed nodes to share data structures directly in memory? Could that bring a further 100-fold boost?

With that in mind, let me share the hardware architect's dream for big data:

a) Build a rack with a high-speed interconnect

b) Stack a set of CPUs in the rack (a CPU pool)

c) Add a shared pool of DRAM and/or non-volatile memory

d) Add a shared pool of SSDs/HDDs

This dream goes by several names: we at PMC ("P-Star") call it FDIO; Intel calls it RSA; for Facebook it is the future of Open Compute; Baidu has named it Scorpio 3.0. With it, 99% of the world's big data problems might be handled within a single rack.
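As a rough sketch of what such a rack looks like from a scheduler's point of view, the following Python models the pooled resources as a data structure. The class names `Pool` and `Rack`, the method names, and all capacities are hypothetical; none of the architectures named above is being described in detail here.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """A rack-wide pool of one resource, shared by every job in the rack."""
    name: str
    capacity: int
    used: int = 0

    def can_allocate(self, amount):
        """True if `amount` more units fit in what is left of the pool."""
        return amount <= self.capacity - self.used

    def allocate(self, amount):
        """Reserve `amount` units, or raise if the pool cannot cover it."""
        if not self.can_allocate(amount):
            raise RuntimeError(f"{self.name} pool exhausted")
        self.used += amount


@dataclass
class Rack:
    """Hypothetical rack: CPU, memory and storage pools on a shared interconnect."""
    cpus: Pool
    memory_gb: Pool
    storage_gb: Pool

    def place_job(self, cores, memory_gb, storage_gb):
        """Draw each resource from its pool, independent of any single server."""
        requests = [(self.cpus, cores),
                    (self.memory_gb, memory_gb),
                    (self.storage_gb, storage_gb)]
        # Check every pool first so a failed placement reserves nothing.
        for pool, amount in requests:
            if not pool.can_allocate(amount):
                raise RuntimeError(f"{pool.name} pool exhausted")
        for pool, amount in requests:
            pool.allocate(amount)


# Assumed capacities: 640 cores, 5120GB of DRAM, 200000GB of SSD/HDD storage.
rack = Rack(Pool("cpu", 640), Pool("dram", 5120), Pool("storage", 200_000))
rack.place_job(cores=16, memory_gb=100, storage_gb=500)
```

The point of the design is that a 100GB job is limited by the rack's memory pool rather than by the DRAM installed in whichever server happens to run it.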


Original link: http://blog.csdn.net/pmc/article/details/25194467
