The father of Hadoop outlines the future of big data platforms

Apache Hadoop, the open-source software framework for large-scale data processing, is best known as a batch computing engine. Does that mean Hadoop cannot serve the online, interactive workloads that demand real-time visibility into data? Doug Cutting, creator of Hadoop and founder of the Apache Hadoop project (and chief architect at Cloudera), believes Hadoop has a future well beyond batch processing.

"Batch processing is useful, such as when you need to move a lot of data and analyze all the data, but I think what people really want is a combination of batch processing and online computing," says cutting. Hadoop will become the core of the future mainstream data processing system for the enterprise. ”


At the O'Reilly Strata Conference + Hadoop World in New York, Cutting explained the core ideas behind the Hadoop stack and where it is heading.

"Hadoop is seen as a batch computing engine, in fact, this is where we started (combining MapReduce)," Cutting says, "MapReduce is a great tool with a lot of books on how to deploy algorithms on MapReduce." ”

MapReduce is a programming model designed at Google to process large-scale data in parallel using distributed computing. A MapReduce job takes an input and divides it into smaller sub-problems, which are assigned to nodes that process them in parallel. The answers to the sub-problems are then regrouped to form the output.
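
To make the model concrete, here is a minimal word-count sketch in Python. It is a self-contained local simulation of the map/shuffle/reduce flow just described, not the Hadoop API itself, and the sample corpus is invented for illustration.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: break the input into sub-problems by emitting (word, 1) pairs.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group intermediate values by key, as the framework does
    # between the map and reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(grouped):
    # Reduce: combine each key's values into the final answer.
    for word, counts in grouped:
        yield word, sum(counts)

corpus = ["hadoop moves computation to data", "data wants hadoop"]
for word, total in sorted(reduce_phase(shuffle(map_phase(corpus)))):
    print(word, total)
```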

"It's very effective," cutting says, "it allows you to move your calculations to your data, so that when you're working with the data, you don't copy the data everywhere, and it forms a shared platform." Building a distributed system is a complex process, not something that can be done ' overnight ', so we don't want to redeploy it over and over again. MapReduce has proven to be a solid foundation on which we have seen many tools, such as pig and Hive, have been developed. ”

He added: "Of course, this platform is not only for batch computing; it is a very general platform."

Features of the Hadoop platform

To illustrate this point, Cutting described what he considers the two core themes of Hadoop.

First, the Hadoop platform is highly scalable: it works on small datasets that fit in memory, and it can scale up to handle very large datasets.

"One of the key factors of scalability is affordability, although we rarely hear about it," he says. "We run on commodity hardware because it allows you to expand further." If you can buy 10 times times of storage, you can store 10 times times the amount of data. So affordability is the key, and that's why we use commodity hardware because it's the most affordable platform. ”

The second theme, he points out, is that Hadoop is open source.

"Similarly, open source software is very affordable," he says, "when people develop their apps, the most conflicting platforms are free platforms." You may pay the supplier, but you pay for the value they provide, you don't have to pay for it year in, and over time the supplier needs to earn your trust and confidence by providing you with value. ”

Other features of Hadoop include the following:

"There is a concept that when you load data, you don't need to use strict patterns to limit your data," he says, "and for Hadoop, you can save data in the original form, and then use different patterns when you use the data." ”

Another popular practice in big data is that, in general, analyzing more data helps you understand your problem better than a smarter algorithm does. That is, you should spend more time collecting data than tuning algorithms on smaller datasets. Intuitively, it is much like working with a high-resolution image: if you are trying to analyze a picture, you would rather zoom in on a high-resolution image than on a low-resolution one.

HBase is an example of online computing in Hadoop

Batch processing, he points out, is not the defining feature of Hadoop. Apache HBase, part of the Hadoop stack and a very successful open-source, non-relational distributed database (modeled on Google's BigTable), is an online computing system, not a batch computing system.

Cutting explains: "HBase also supports batch processing, and it shares its storage, HDFS, with the other components of the Hadoop stack; I think that is one of the reasons it is so popular. HBase is integrated with the rest of the system rather than being a stand-alone system, so it can share availability, security, and disaster recovery with the other components of the stack."
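
As an illustration of what online, row-at-a-time access to HBase looks like, here is a hedged sketch using the third-party happybase Python client (which talks to HBase's Thrift gateway). The host, table name, and column family are assumptions, and the table is presumed to already exist.

```python
import happybase  # third-party client for HBase's Thrift gateway

# Hypothetical connection details; an HBase Thrift server must be running.
connection = happybase.Connection('hbase-host.example.com')
table = connection.table('users')  # assumes a 'users' table exists

# Online write: a single row is inserted and readable immediately,
# with no batch job in between.
table.put(b'user-42', {b'profile:name': b'Ada', b'profile:city': b'London'})

# Online read: fetch one row by key with low latency.
row = table.row(b'user-42')
print(row[b'profile:name'])  # b'Ada'

connection.close()
```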

Looking toward the "Holy Grail" of Hadoop

If Hadoop is not just a batch computing platform but a more general data-processing platform, what will it look like, and how will it get there?

"For the ' Holy Grail ' of large data systems, we think there should be a lot of factors," cutting says. "Of course, we want it to be open source and run on normal hardware." We also want it to have a linear extension: If you need to store 10 times times the data, you only need to buy 10 times times the hardware, and no matter how big your dataset becomes, you can extend that. ”

The same is true of performance, Cutting says: if you need more batch throughput or lower latency, you just add more hardware. The same holds for interactive queries. Adding hardware gives you linear scaling in both performance and data volume.

He added: "People think that when you use a big data platform you need to give something up. I don't, and I don't think we will need to give anything up in the long run."

Google provides us with a roadmap

"Google has provided us with a roadmap," he says, "and we know where we are going." After they started releasing their GFS and MapReduce papers, we quickly copied them into the Hadoop project, which in many ways has inspired the Open-source stack over the years. Google's sawzall system spawned pig and hive, and bigtable directly inspired HBase. I am excited to see this year's Google-published article called Spanner, which describes the systems that implement transmission in distributed systems (multiple-table transmissions running on databases around the world), and many people will assume that this will not happen soon, but it shows us the way forward. ”

Cutting points out that Spanner is a complex technology that will not become part of Hadoop quickly, but it indicates a direction. He also mentions Impala, Cloudera's new database engine, which can run SQL queries against datasets stored in HDFS and HBase.
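
As a sketch of the interactive SQL access Impala provides, here is an example using the third-party impyla Python client (DB-API style). The host, port, and table name are illustrative assumptions.

```python
from impala.dbapi import connect  # third-party 'impyla' package

# Hypothetical coordinator address; 21050 is Impala's usual HiveServer2 port.
conn = connect(host='impala-host.example.com', port=21050)
cursor = conn.cursor()

# An interactive aggregate over data living in HDFS or HBase, answered
# directly by Impala rather than compiled into a batch MapReduce job.
cursor.execute(
    'SELECT country, COUNT(*) AS events FROM web_events GROUP BY country'
)
for country, events in cursor.fetchall():
    print(country, events)

cursor.close()
conn.close()
```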

"We know where we are going, and we know how to achieve our goals," says cutting. So, I encourage you to start using Hadoop now, because in the future you will gain more. ”
