Learn 12 facts about Hadoop

Source: Internet
Author: User

Today, almost everyone has heard of Apache Hadoop. When Doug Cutting, a Yahoo search engineer, developed this open source software library for building distributed computing environments and named it after his son's toy elephant, who could have imagined that it would one day take the top spot among "big data" technologies?

Although Hadoop has become popular along with big data, I believe many users still do not understand it well. At the TDWI Solutions Summit last week, Philip Russom, TDWI research director and industry analyst, delivered a keynote address titled "12 facts about Hadoop." Below is a summary of the key points of the presentation, which we hope will help you understand Hadoop better.

Fact 1: Hadoop is made up of multiple products.

When people talk about Hadoop, they often think of it as a single product, but in fact it consists of several different products.

Russom said: "Hadoop is a portfolio of open source products that are part of the Apache Software Foundation."

When it comes to Hadoop, people tend to associate it with MapReduce, but HDFS is, like MapReduce, one of the foundations of Hadoop.

Fact 2: Apache Hadoop is open source, but proprietary vendors also offer Hadoop products.

Because Hadoop is open source and freely available for download, vendors such as IBM, Cloudera, and EMC Greenplum have been able to launch their own Hadoop distributions.

These vendor distributions generally include additional features such as advanced management tools, along with related support and maintenance services. Some may scoff: since the open source community version is free, why pay for it? Russom explained that the vendor versions are a better fit for some IT departments, especially those with relatively mature enterprise IT systems.

Fact 3: Hadoop is an ecosystem, not a product.

Hadoop is jointly developed and promoted by the open source community and various vendors. In particular, the vendors' Hadoop products tend to be more structured and relational.

Russom said: "Reporting platforms and data integration platforms have always provided a wide range of interfaces to newer platforms, and Hadoop is no exception."

Fact 4: HDFS is a file system, not a database management system.

What Russom finds hardest to tolerate is that people often confuse the two. The ability to manage a data set is one of the most important features of a database management system, and HDFS does not have it.

With a database management system, we can use indexes to query data through random access. Such systems usually handle structured data, which is not the kind of data Hadoop typically handles.
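The difference between scanning a file and looking a record up through an index can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how HDFS or any particular database is implemented: the sample records, the in-memory `io.BytesIO` file, and the function names `scan_lookup`, `build_index`, and `indexed_lookup` are all hypothetical.

```python
import io

# Hypothetical sample data: one "key,value" record per line.
RAW = b"u1,Berlin\nu2,Lima\nu3,Osaka\n"


def scan_lookup(file_obj, key):
    """File-system style: read records in order until the key is found."""
    file_obj.seek(0)
    for line in file_obj:
        record_key, value = line.rstrip(b"\n").split(b",", 1)
        if record_key.decode() == key:
            return value.decode()
    return None


def build_index(file_obj):
    """DBMS style: record the byte offset of every key once, up front."""
    index = {}
    file_obj.seek(0)
    while True:
        offset = file_obj.tell()
        line = file_obj.readline()
        if not line:
            break
        index[line.split(b",", 1)[0].decode()] = offset
    return index


def indexed_lookup(file_obj, index, key):
    """Jump straight to the record by offset instead of scanning."""
    offset = index.get(key)
    if offset is None:
        return None
    file_obj.seek(offset)
    line = file_obj.readline()
    return line.rstrip(b"\n").split(b",", 1)[1].decode()


data_file = io.BytesIO(RAW)
index = build_index(data_file)
print(scan_lookup(data_file, "u2"))
print(indexed_lookup(data_file, index, "u2"))
```

Both lookups return the same value; the indexed one reads a single record, while the scan reads every record that precedes the match.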

Fact 5: Hive is similar to SQL, but not standard SQL.

Most traditional business tools for accessing data are based on SQL. This is a headache, because Hadoop uses a language that resembles SQL but is not SQL: Apache Hive's HiveQL.

Russom said: "I often hear people say, 'Hive is very easy to learn, so just use Hive.' But this does not solve the fundamental problem of compatibility with SQL tools."

Russom believes that compatibility is only a short-term issue, but one that hinders the adoption of Hadoop.

Fact 6: Hadoop and MapReduce are interrelated, but not interdependent.

Google developed MapReduce before HDFS existed. In addition, vendors such as MapR have been promoting variations of MapReduce that do not require HDFS.

However, Russom thinks the two are highly complementary. Much of the value of HDFS comes from the tools that can be layered on top of the distributed file system.

Fact 7: MapReduce provides control over the analysis, not the analysis itself.

MapReduce is a general-purpose execution engine that helps with big data analytics. It takes hand-written code, runs it in parallel over the data, and combines the mapped results into a single collection. However, we need to be clear that MapReduce does not perform the analysis itself.

"MapReduce can be thought of as an upgraded version of the MPP architecture: however you write your code, it can be run in parallel," said Russom.
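The split between the engine and the analysis can be illustrated with a minimal single-process sketch in Python. It is an assumption-laden toy, not Hadoop's API: the names `run_job`, `map_phase`, `shuffle`, and `reduce_phase` are hypothetical, and nothing here actually runs in parallel. The engine functions know nothing about word counting; that logic lives entirely in the hand-written `word_mapper` and `sum_reducer`.

```python
from collections import defaultdict


# --- "Engine" side: generic control flow, no analysis logic ---

def map_phase(records, mapper):
    """Apply the user-supplied mapper to every input record."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the user-supplied reducer to each key's values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    """Chain map, shuffle, and reduce into one result collection."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


# --- "User" side: the hand-written analysis (here, a word count) ---

def word_mapper(line):
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    return sum(values)


counts = run_job(["big data", "Big elephant"], word_mapper, sum_reducer)
print(counts)
```

Swapping in a different mapper and reducer changes the analysis without touching the engine functions, which is the sense in which MapReduce provides control over the analysis rather than the analysis itself.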

Fact 8: The significance of Hadoop lies not only in the amount of data, but also in the diversity of data.

Some people classify Hadoop as a technology for processing massive amounts of data, but the real value of Hadoop is its ability to handle diverse data.

Russom said: "Hadoop can handle data that is beyond the scope of most data warehouses, such as semi-structured and completely unstructured data."

Fact 9: Hadoop is a data warehouse supplement, not a data warehouse replacement.

Hadoop's ability to manage a wide variety of data types has fueled talk that "the data warehouse is dying," but Russom rejected that claim.

He asked: "How often do people actually replace a technology in IT? Never."

Data warehouses still perform well in their own domain, and Hadoop can complement data warehouse technology. The architecture of data warehouses and other systems is increasingly moving toward distribution, and Hadoop will play a role here.

Fact 10: Hadoop is more than just web analytics.

Hadoop is widely used on the Internet, and Russom believes that Hadoop's popularity is due in part to its ability to handle more types of analytics.

Russom cited examples from railroad companies, robotics, and retailers. Railroad companies, for instance, can use sensors to detect rail cars with abnormally high temperatures and so prevent accidents.

Russom is optimistic about the future of Hadoop, but he also thinks that widespread adoption will take years.

Fact 11: Big data does not necessarily mean Hadoop.

Although big data and Hadoop now seem inextricably linked, Russom thinks Hadoop is not the "only" option for big data. He mentioned products from many other vendors, such as Teradata, Sybase IQ (acquired by SAP), and Vertica (acquired by HP).

In addition, some companies began working with big data before Hadoop was born. For example, the telecommunications industry has had call detail records for many years.

Fact 12: Hadoop is not a "free lunch."

Although Hadoop is open source technology, installing and deploying the software is costly. Russom said that because Hadoop lacks management tools and support services, enterprises are prone to incurring additional costs while using it. In addition, because Hadoop has no optimizer, companies have to rely on specialists to hand-write code for their operating environment, and these specialists are expensive.

Not to mention the cost of deploying Hadoop cluster hardware and related configurations.

He said: "Do not think Hadoop is free or very cheap. There are hidden costs behind it that you cannot see right away."

Original link: http://www.searchbi.com.cn/showcontent_62856.htm
