Discussion on the pre-deployment of Hadoop for big data

Source: Internet
Author: User

The premise behind Hadoop adoption is that "data is valuable!" Almost everyone recognizes this by now, and the real world is acting on it: we all want to dig through vast amounts of data, such as system logs, network data, and social information, to discover valuable insights such as user behavior and habits, which then become the basis for effective marketing and decision-making. After the advent of Hadoop, this discovery of value in data was demonstrated even more vividly, especially once well-known Internet companies began to use and deploy Hadoop environments.

Faced with such a temptation, many traditional enterprise users want to follow the Internet companies' example and deploy Hadoop very quickly, so as to rapidly integrate their existing data and explore its value. But the reality matches the popular online saying, "the ideal is plump, the reality is skinny": a fast Hadoop deployment is not so easy. Why is that? I see a few reasons:

1. Hadoop offers us only a framework, not a complete solution.

It is like buying a house: the builder always delivers it as a bare shell, and the renovation is up to the owner, who designs and carries it out according to his or her own style and preferences; every owner has a different definition of what the finished interior should look like. Hadoop deployment is the same. Each enterprise's Hadoop environment can be said to be unique, and it requires the enterprise to have a very clear understanding of its own data landscape. What data do I need to analyze? What kind of information do I need to get out of it? What will I do with that information once I have it? Only when these questions have been thought through will a Hadoop deployment show its value. And these are not only technical problems; they also involve management awareness and even coordination at the business level.

2. The talent problem.

Hadoop is an open-source framework, and open source brings its own unresolved problems, such as having to develop and maintain the stack yourself because every deployment scenario is unique. Hadoop faces exactly these problems. People with solid Hadoop skills are still relatively scarce in the market, which makes deployment and application considerably harder for enterprises. A large amount of development work requires a large number of developers, and the scarcity of such people drives up both the cost and the difficulty of development.

3. Only issues suitable for distributed architecture can be solved by Hadoop.

Hadoop is not a panacea and does not solve every data-analysis problem. For querying and analyzing structured data, traditional databases still have their own advantages. Hadoop is a distributed architecture, and that architecture determines that "only issues that are appropriate for a distributed architecture can be resolved by Hadoop." For example, a pregnancy takes ten months; you cannot get the baby in one month by putting ten pregnant women on the job. In the final analysis, only when a problem can be split into a number of sub-problems, and those sub-problems are independent of one another, that is, when it can be processed in the iterative "key-value" style and the partial results then combined into the answer we need, is it the kind of problem Hadoop can solve.
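To make this "key-value" style of decomposition concrete, here is a minimal sketch based on the classic Hadoop MapReduce word-count pattern: each mapper processes its own input split independently and emits (word, 1) pairs, and the reducer merges the partial counts into the final result. The class name and input/output paths are illustrative, not taken from any particular project.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: each mapper works on its own input split independently,
  // emitting (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce step: all values for the same key arrive together,
  // so the independent partial results can be merged into the final count.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combine locally before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The job would be run with something like `hadoop jar wordcount.jar WordCount /input /output`. Because every (word, 1) pair can be counted independently of the others, the work spreads naturally across however many nodes the cluster has; a problem that cannot be split this way gains little from Hadoop.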

4. Hadoop is not suitable for handling small files.

In fact, "big" and "small" are only relative concepts; there is no absolute dividing line. The reason Hadoop is not suited to processing small files lies in a NameNode limitation of HDFS: for every file, the NameNode keeps the corresponding metadata, and to improve efficiency this information is held in memory while in use. If there are a great many small files, they consume a great deal of NameNode memory, and for a single node there is an upper limit to how much memory can be added. Conversely, if the files are relatively large, say gigabytes or more, the relative memory consumption is much smaller, and the system overhead during data processing is also much lower. These architectural characteristics and limitations determine that Hadoop is better suited to handling "large" data. Of course, as a matter of technical implementation, using it on small data is still possible; whether the overkill is worth it depends on the case.
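As a rough illustration of this limitation, the sketch below estimates NameNode heap usage under the commonly cited rule of thumb of roughly 150 bytes of heap per namespace object (file or block); the exact figure varies by Hadoop version, and the file counts are made up for the comparison.

```java
// Back-of-the-envelope estimate of why many small files strain the NameNode.
// Assumes ~150 bytes of NameNode heap per namespace object (file or block),
// a commonly cited rule of thumb rather than an exact value.
public class NameNodeMemoryEstimate {

  static long estimateBytes(long fileCount, long avgBlocksPerFile) {
    final long BYTES_PER_OBJECT = 150;                        // assumed rule of thumb
    long objects = fileCount + fileCount * avgBlocksPerFile;  // one inode plus its blocks
    return objects * BYTES_PER_OBJECT;
  }

  public static void main(String[] args) {
    // 10 million files of ~1 MB each (one block per file): about 10 TB in total.
    long smallFiles = estimateBytes(10_000_000L, 1);
    // The same ~10 TB stored as 10,000 files of ~1 GB each
    // (about 8 blocks per file with a 128 MB block size).
    long largeFiles = estimateBytes(10_000L, 8);

    System.out.printf("Small files: ~%d MB of NameNode heap%n", smallFiles / (1024 * 1024));
    System.out.printf("Large files: ~%d MB of NameNode heap%n", largeFiles / (1024 * 1024));
  }
}
```

Under these assumptions, 10 million 1 MB files would need on the order of 2 to 3 GB of NameNode heap just for metadata, while the same volume of data stored as 1 GB files needs only a dozen or so megabytes.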
