The present and future of Hadoop

Today, Big Data and Hadoop are taking the computing industry by storm. From CEOs and CIOs to developers, everyone has an opinion about how to use them. According to Wikipedia:

"Apache Hadoop is an open source software framework that supports data-intensive, distributed applications, with license authorization attached to Apache V2 license. [1] It enables applications to work with Byte (petabytes)-level data and can be run on thousands of stand-alone computers. Hadoop originates from Google's MapReduce and Google File System (GFS) papers. It is now generally assumed that the complete Apache Hadoop ' platform ' consists of the Hadoop kernel, MapReduce and HDFs, as well as a number of related projects-including Apache Hive, Apache HBase, etc.

Unfortunately, this definition does not really explain what Hadoop is or what role it plays in the enterprise.

In this virtual symposium, InfoQ interviewed a number of Hadoop vendors and users, asking them to comment on the present and future of Hadoop and to discuss what is key to Hadoop's continued success and wider adoption.

Participants:

Omer Trajman, Vice President of Technology Solutions, Cloudera

Jim Walker, Director of Product, Hortonworks

Ted Dunning, Chief Application Architect, MapR

Michael Segel, founder of the Chicago Hadoop User Group

Questions:

1. How do you define Hadoop? As architects, we are used to thinking in well-defined terms such as servers and databases. At what level does Hadoop sit in your view?

2. Although people talk about Apache Hadoop, they rarely download it directly from the Apache website. Today, most people install a "distribution" from Cloudera, Hortonworks, MapR, Amazon, and so on. What do you think explains this, and what are the differences between these distributions? (We ask the vendors to be objective; we know yours is the best.)

3. What do you think are the most common uses of Hadoop today? What about in the future?

4. Apart from Flume, Scribe, and Sqoop, Hadoop has very little integration with the rest of enterprise computing. Do you think Hadoop will start to play a bigger role in enterprise IT infrastructure?

5. With the notable exception of Percolator, most of Google's published systems now have Hadoop implementations. Do you think such a project belongs under the Apache umbrella? Do you know of other efforts toward real-time Hadoop?

6. Many people are trying to improve their Hadoop skills, and many companies are looking for people familiar with Hadoop. But it is still unclear how best to master Hadoop: read a book? Attend training? Take a certification exam?

Question 1: How do you define Hadoop? As architects, we are used to thinking in well-defined terms such as servers and databases. At what level does Hadoop sit in your view?

Omer Trajman: Hadoop is a new data management system that combines the storage of traditionally unstructured, non-relational data with the processing power of a compute grid. Although it borrows heavily from traditional massively parallel processing (MPP) database design, Hadoop differs in several key ways. First, it is designed for a low cost per byte: Hadoop runs on almost any hardware and is very tolerant of heterogeneous configurations and occasional failures. Second, Hadoop scales very easily. The first versions of Hadoop could scale to thousands of nodes, and testing of current versions shows that they continue to scale beyond tens of thousands of nodes; with mainstream dual-socket, eight-core servers, that is 80,000 cores of computing power. Third, Hadoop is extremely flexible about the types of data it can store and process: it accepts data of any format and any type, and offers a rich set of APIs for reading and writing data in any format.
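To make that last point concrete, here is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API. The class names and the command-line input/output paths are placeholders for this example, not anything prescribed by Hadoop itself.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: reads raw text lines and emits (word, 1) pairs.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reducer: sums the counts emitted for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The same job could read from any InputFormat (plain text, SequenceFiles, or a custom format), which is the format flexibility Trajman describes.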

Jim Walker: Hadoop is highly scalable open source data management software that makes it easy to acquire, process, and exchange any data. Hadoop connects to almost every layer of the traditional enterprise data stack, and so occupies a central position in the data center, exchanging data and providing data services between systems and users. At the technical level, that is what Hadoop does; but because it brings supercomputing power to the masses, it is also driving a business transformation. It is open source software, created by a community, that brings massively parallel processing and horizontally scalable storage to commodity hardware. It does not replace existing systems; rather, it pushes existing tools to become more specialized, and it has earned its own place in the modern data architecture toolbox.

Ted Dunning: It is impossible to define Hadoop so precisely that everyone will agree with you. Even so, most working definitions come very close to one of these two:

A. The Apache project of the same name, which has released a MapReduce implementation and a distributed file system.

B. A collection of projects, Apache and otherwise, that use or are in some way associated with the Apache Hadoop project.

The first definition is also often used to refer to the software released by the Apache Hadoop project, but how close a piece of software must be to that code before it can (or should) be called Hadoop, or a Hadoop derivative, is a contentious topic.

For me, as a heavy user of and contributor to Hadoop-related software in the community, the less common reading is the one I prefer. I use the term "Hadoop" first to mean the community, and only secondarily to mean any particular codebase or project. To me, the community is more important than any single project's code.

Michael Segel: I think of Hadoop as a framework and a set of tools for distributed, parallel processing: distributed storage in HDFS, a distributed computing model in the JobTracker and TaskTrackers, and distributed persistent object storage in HBase. As for where Hadoop belongs, I think it depends on the specific solution.
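As a small illustration of the HDFS layer Segel mentions, the following is a minimal sketch that writes and reads a file through the Hadoop FileSystem Java API. The path /tmp/hello.txt is a made-up example, and the sketch assumes a cluster address (fs.defaultFS) is already configured in core-site.xml.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
      public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (e.g. hdfs://namenode:8020) from core-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/hello.txt"); // hypothetical example path

        // Write a file; HDFS replicates its blocks across DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
          out.writeBytes("hello from HDFS\n");
        }

        // Read it back through the same API.
        try (BufferedReader in =
            new BufferedReader(new InputStreamReader(fs.open(file)))) {
          System.out.println(in.readLine());
        }
      }
    }

MapReduce jobs and HBase tables sit on top of this same storage layer, which is part of why the pieces compose into such different solutions.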

It is difficult to put Hadoop into a single category. In some scenarios, Hadoop is used for intermediate processing that is difficult to do in a traditional RDBMS: Hadoop handles the intermediate step, and existing BI tools perform the final analysis. Hadoop has also been used to provide "real-time" (subjectively speaking) data processing, for example by integrating Lucene/Solr with an existing Hadoop/HBase stack as part of a real-time search engine. The key point is that Hadoop is used to solve different kinds of problems in different organizations, and even within the same enterprise. This may be one of Hadoop's biggest advantages: it is a foundational framework that can be applied to many different classes of problems. More people are using Hadoop and pushing it to its limits, and that will produce an ever wider variety of solutions.
