The Father of Hadoop Outlines the Future of the Big Data Platform


"Big Data is neither a hype nor a bubble. Hadoop will continue to follow Google's footsteps in the future ." Doug cutting, creator of hadoop and founder of Apache hadoop, said recently.

As a batch-processing computing engine, Apache Hadoop is the core open-source software framework of big data. Conventional wisdom holds that Hadoop is not suitable for the online, interactive data processing that real-time data visibility requires. Is that true? Doug Cutting, creator of Hadoop and founder of the Apache Hadoop project (currently the chief architect at Cloudera), said: "We believe Hadoop has a future beyond batch processing."

Doug Cutting: creator of Hadoop, founder of the Apache Hadoop project, and chief architect at Cloudera

"Batch processing is useful. For example, when you need to move a large amount of data and analyze all the data. But I still think that what people really want is a combination of batch processing and online computing. Hadoop will become the core of the enterprise's mainstream data processing system in the future ." Said cutting.

Where Is Hadoop Headed?

At the just-concluded Strata Conference + Hadoop World, Cutting explained the core ideas of the Hadoop stack and its future direction. "Hadoop is seen as a batch-processing computing engine. In fact, that is where we started, with MapReduce. MapReduce is a great tool; there are many books on the market about how to implement various algorithms on MapReduce," said Cutting.

MapReduce is a programming model designed by Google for batch parallel processing of massive data on a distributed system. MapReduce takes an input and divides it into many smaller subproblems, which are distributed to different nodes for parallel processing; the answers to the subproblems are then recombined to form the output.
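The map-shuffle-reduce flow described above can be sketched in a few lines of plain Python. This is an illustrative single-process word count, not Hadoop's actual API; the function names (`map_phase`, `shuffle`, `reduce_phase`) are chosen here only to mirror the three stages of the model.

```python
from collections import defaultdict

def map_phase(splits):
    # Map: each input split independently emits (key, value) pairs.
    # In a real cluster, each split would run on a different node.
    for text in splits:
        for word in text.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all intermediate values by key, so every
    # reducer sees the complete list of values for its keys.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into the final output.
    return {word: sum(counts) for word, counts in groups.items()}

splits = ["big data big platform", "data platform"]
counts = reduce_phase(shuffle(map_phase(splits)))
# counts == {'big': 2, 'data': 2, 'platform': 2}
```

Because the map calls are independent and the reduce calls only depend on their own key's values, both phases parallelize naturally across nodes, which is exactly what makes the model suitable for batch processing at scale.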

"This is very effective," cutting says. "It allows you to move computing to data. In this way, you do not need to copy data everywhere when processing data, and it also forms a sharing platform. Building a distributed system is a complex process, so we do not want to re-deploy it. Mapreduce has proved to be a solid foundation, and many tools have been developed by mapreduce, such as pig and hive ." Key features of hadoop

To demonstrate the versatility of the Hadoop big data platform, Cutting described what he considers Hadoop's two core characteristics. First, the Hadoop platform has good scalability: it is applicable not only to small datasets stored in memory, but can also scale out to process large datasets.

"A key factor in evaluating scalability is economic affordability. We run on a general hardware platform because it allows you to expand further. If you can purchase 10 times of storage, you can store 10 times of data. Therefore, economic affordability is the key, and this is why we use general-purpose hardware because it is the most cost-effective platform ." Said cutting.

Another key feature of Hadoop is open source. Cutting points out that open-source software is very affordable: users pay only for the value a vendor actually provides, rather than paying license fees year after year. Over time, vendors must keep delivering value to earn users' trust and confidence. In addition, with Hadoop, users can store data in its original form and then apply different schemas when the data is used.

Another popular idea in the big data field is that analyzing more data, rather than building smarter algorithms, helps you better understand your problem. That is, you should spend more time collecting data rather than tuning an algorithm on a smaller dataset. Intuitively, this is like working with a higher-resolution image: if you want to examine a detail, you are better off zooming into the high-resolution image than the low-resolution one.

Cutting also pointed out that batch processing is not Hadoop's defining feature. For example, HBase, modeled on Google's Bigtable, is part of the Hadoop stack and has become a very successful open-source non-relational distributed database. HBase is an online computing system, not a batch computing system. "HBase also supports batch processing, and it shares storage, HDFS, with the other Hadoop stack components. I think this is one of the reasons HBase is so popular. HBase is integrated with the rest of the system rather than being an independent system; it shares features such as availability, security, and disaster recovery with the other components in the stack," Cutting explains.

Future Prospects of the Technology

If Hadoop is not just a batch-processing platform but a more general data processing platform, what will it become and where will it go? Cutting said that we certainly want an open-source big data platform that runs on commodity hardware. At the same time, we also want it to have linear scalability: if you need to store 10 times the data, you only need to buy 10 times the hardware. You can scale to a dataset of any size this way.

The same is true for performance. For batch processing, if you need greater throughput or lower latency, you simply add more hardware. The same applies to interactive queries: by adding hardware, you achieve linear scaling in both performance and data volume. Cutting also said: "People usually think that in adopting a big data platform they have to give something up. I don't think so. In the long run, we don't need to give up any features."

Regarding Hadoop's future technical direction, Cutting said Google has provided a roadmap. "After Google published the GFS and MapReduce papers, we soon replicated them in the Hadoop project. Over the years, Google has inspired the Hadoop open-source stack in many ways. Google's Sawzall system gave birth to Pig and Hive, while Bigtable directly inspired HBase. I was very excited to see Google publish the Spanner paper this year, which describes a mechanism for implementing transactions in a distributed database system. Many may think this will not become reality any time soon, but it shows us the way forward," said Cutting.

Cutting pointed out that, as a complex technology, Spanner will not become part of Hadoop any time soon, but it does clarify the direction of development. He also mentioned Impala, the latest database engine released by Cloudera, which can use SQL to query datasets stored in HBase. Impala will give users a new experience of interactive online queries. It, too, follows some of Google's research results, and it has been available for a while. Cutting believes that Impala will develop into a general-purpose technical platform.

"We already know the way forward and how to achieve our goals. Therefore, I encourage you to start using hadoop now, because you will gain more in the future ." Said cutting.
