Understand the pros and cons of Hadoop for log analysis


First, let's define what we mean by log analysis. The most common use case is using Apache Hadoop to process machine-generated logs, usually the clickstream of a web application and the systems that support it. Log analysis ingests a large amount of semi-structured information, distills it into more usable datasets, and summarizes the important facts about those interactions. Log processing is the core use case Hadoop was created for, so it is not surprising that it works well in this scenario.
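To make the use case concrete, here is a minimal sketch of the kind of summarization such a job performs, assuming the standard Apache "combined" access-log format: it reduces raw clickstream lines to a hit count per requested path. The script name, log format, and invocation are illustrative assumptions, not from the original article; in a real Hadoop deployment the parsing step would run as the mapper and the counting step as the reducer of a Streaming job.

    #!/usr/bin/env python3
    # hitcount.py -- illustrative sketch only: summarize an Apache access
    # log into hits per requested path (log format is an assumption).
    import sys
    from collections import Counter

    def parse_path(line):
        """Return the request path from a combined-format line, or None."""
        parts = line.split('"')
        if len(parts) < 2:
            return None             # tolerate malformed lines, don't fail
        request = parts[1].split()  # e.g. ['GET', '/index.html', 'HTTP/1.1']
        return request[1] if len(request) >= 2 else None

    # Map step: parse each line. Reduce step: sum the hits per path.
    hits = Counter()
    for line in sys.stdin:
        path = parse_path(line)
        if path:
            hits[path] += 1

    for path, n in hits.most_common():
        print(f"{path}\t{n}")

Run locally as cat access.log | python3 hitcount.py; split into separate mapper and reducer scripts, the same two steps run unchanged under the hadoop-streaming.jar bundled with a standard Hadoop distribution.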

Google, Yahoo, and many other Internet properties have built their businesses around exactly these operations, and they are genuinely good at them. A typical large enterprise, by contrast, does not learn about web events in anything close to real time; there is latency, measured not in hours or days but in weeks, before click or browsing behavior is understood. With a starting point that low, it is not hard to show a big improvement.

In addition, because most companies do not want to shut down their existing data analysis systems (often run by third parties specializing in web-click analysis), a Hadoop log analysis project is an extremely low-risk, and therefore good, starting point for adopting big data technology. It is not mission-critical: in a log analysis use case, even if something goes wrong, users are not seriously affected and no large amount of money is put at risk.

For the many traditional companies just starting out with log analysis, a log-processing use case is also attractive to Hadoop vendors: it relies on non-critical data and, frankly, it is not hard to do. The cost of failure and experimentation is low, the work can be kept independent of other production applications and job flows, and it can be accomplished with the command-line tools that ship with any standard Hadoop distribution, as in the streaming sketch above. You do not have to disclose the experiment or its methods to anyone else in the enterprise at all.

About the drawbacks ...

The point is that successfully analyzing log data with Hadoop does not predict success in a typical business scenario. The factors that make Hadoop well suited to log analysis can obscure what real enterprise applications require to succeed. Log data is quite uniform in structure: the volume may be considerable, but it is, regrettably, highly repetitive, so these projects never really exercise data drawn from varied sources and structures.

I have found that most log analysis projects are static, predictable jobs, and therefore count as log ETL rather than analysis. There are no data-lineage issues to deal with, and there is usually only a single source of information, so the data is assumed to be valid and its quality "good enough". Governance is rarely considered (and even when it is, governance measures are seldom implemented). There is generally no SLA: the job runs overnight, and whether it finishes at four or six in the morning makes no practical difference to the use case.

These jobs have low visibility requirements, if any, usually because you simply crunch the data and hand the results off to other systems or manual processes. Hadoop's usability for non-developers is never put to the test, and there is no integration between Hadoop and the company's other business intelligence and reporting systems. In other words, these projects are not a representative test of real production success. They do not exercise real data flows, and they often cannot support a second or third use case on the same platform with the same technology.

To be clear, I am not arguing that log analysis is an invalid use case, or that Hadoop is bad. What I am saying is: do not take it for granted that initial success with Hadoop in log analysis will lead to a large-scale enterprise deployment. Do not mistake that kind of success, which is essentially just another way to run an isolated, single-domain ETL job with no data-quality or SLA requirements, for evidence of what will work in your typical enterprise production environment.

What do you think? Is log analysis a good starting point for a big data journey, or a poor one? Please share your thoughts in the comments.
