How important is the Hadoop infrastructure in a big data environment?


Hadoop and big data rose to prominence at the same time, and so the two became synonymous. But they are not the same thing. Hadoop is a parallel programming model implemented on a cluster of processors, aimed mainly at data-intensive distributed applications. That is where Hadoop shines. Hadoop existed well before big data became a buzzword; only later did its meaning broaden, and it is now treated as a framework for building big data infrastructure.

Hadoop is based on Google's MapReduce algorithm, a method for distributing an application's work across a cluster. Its core pieces, the MapReduce engine and the Hadoop Distributed File System (HDFS), are written almost entirely in Java, which brings its own set of problems. Hadoop also has to provide resilience through failover between nodes: in a large cluster, when a node fails, the system should detect the problem promptly and move the work to another node.
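To make the model concrete, here is the classic word-count job written against the Hadoop MapReduce Java API: the mapper emits a count of one for every word it sees, and the reducer sums those counts per word. Treat it as an illustrative sketch rather than a production job; the class name and the input/output paths passed on the command line are placeholders you would adapt to your own cluster.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for each token in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts for each word across all mappers.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The job is typically packaged into a jar and submitted with something like `hadoop jar wordcount.jar WordCount /input /output`, after which the framework handles splitting the input, scheduling the map and reduce tasks, and retrying them on other nodes if one fails.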

Going forward, I am not sure Hadoop will let anyone relax. In fact, there is broad agreement that several parts of the Hadoop infrastructure need work before it is ready for the enterprise. First, at the core of Hadoop is the NameNode, which holds the metadata for the cluster: which data lives on which node, how much capacity each node has, and how much work it can take on. This information is kept in only one place and is not replicated anywhere else, which makes the NameNode a single point of failure in the Hadoop infrastructure. If an important job is running on the Hadoop cluster, that has to be addressed. The second is the JobTracker. The JobTracker manages MapReduce tasks and schedules workloads across the different servers; in other words, it moves the analysis work closer to the data. It should be emphasized that the JobTracker is also a single point of failure, since it runs on only a single server. These are just the most obvious problems with the current Hadoop architecture.
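To see why the NameNode is so central, consider a minimal HDFS client sketch in Java. The NameNode address and the directory path below are assumptions for illustration, not values from the article. Every metadata operation the client performs, such as listing a directory, is answered by the NameNode; the DataNodes only hold the data blocks themselves, so with the NameNode down the data is effectively unreachable even though the blocks are still on disk.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsDirectory {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed NameNode URI; replace with your cluster's actual address.
    conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

    // Every call below goes through the NameNode's metadata service.
    try (FileSystem fs = FileSystem.get(conf)) {
      for (FileStatus status : fs.listStatus(new Path("/user/data"))) {
        System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
      }
    }
  }
}
```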

The Hadoop technology itself is not simple. If you plan to deploy Hadoop, you need enough programmers, and they have to be able to do things that the typical developer in your toolbox cannot. They need to know Pig, short for Pig Latin, the scripting language closely tied to the Hadoop runtime environment. They also need Java and JAQL, the query language for JSON (JavaScript Object Notation) data. Finding a competent PHP programmer these days is not hard; finding someone who spans that whole combination of skills is.

So, first, there are several single points of failure. Second, Hadoop needs specialized skills that are scarce in the technology job market. Third, there is performance. Every company that has deployed Hadoop has run into performance problems with its Hadoop operations, and as long as it runs big data analysis on Hadoop those problems will persist. Some of them come down to poorly written application code, but more of them stem from the architecture itself. Many companies throw additional server clusters, direct-attached storage, and extra software tools at the problem to improve the speeds and feeds of their Hadoop infrastructure.

Of course, infrastructure management is also a headache. Some people try to handle Hadoop infrastructure management with ZooKeeper, while many vendors try to address it with their own custom products. The problem is that there is still no good paradigm for managing Hadoop, and there seems to be little hope of one.
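As a rough illustration of the ZooKeeper approach, the sketch below (in Java, with an assumed ensemble address of zk-host:2181 and a made-up /workers path) registers a worker under an ephemeral znode. If the worker's process dies, its znode disappears automatically, which is the basic building block coordination tools use to detect failures in a cluster.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class WorkerRegistration {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);

    // Assumed ensemble address; replace with your ZooKeeper servers.
    ZooKeeper zk = new ZooKeeper("zk-host:2181", 3000, event -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connected.countDown();  // session established
      }
    });
    connected.await();

    // Make sure the persistent parent node exists.
    if (zk.exists("/workers", false) == null) {
      zk.create("/workers", new byte[0],
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Ephemeral, sequential node: removed automatically when this session ends.
    String path = zk.create("/workers/worker-", "alive".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    System.out.println("Registered as " + path);

    Thread.sleep(10_000);  // pretend to do work
    zk.close();            // session ends, the ephemeral znode vanishes
  }
}
```

A monitoring process can watch the /workers children and react when one disappears; that is the kind of coordination ZooKeeper provides, but it is a primitive, not a full Hadoop management solution.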

Not long ago, a Forbes article raised another important concern that I would like to share: Hadoop has come to be equated with the infrastructure that carries big data projects. But business people do not understand this process and do not care how big data gets handled; they just want business results, and a little faster. The article's author rightly observed that Hadoop can be a great fit for sifting through data at scale (the article's point), but it is definitely not suited to fast, specialized analysis or real-time analysis. As a result, it cannot be used for business processing; it still has value, but it is just one way of taming the data.

That points to the core of the problem, and the ultimate question: where do we actually use big data? Many people have not thought this through, apart from those in marketing who want to use big data to target their products and services more precisely at specific customer groups.
