Some Hadoop facts that programmers must know

Source: Internet
Author: User

Programmers should know some basic facts about Hadoop. By now, almost everyone has heard of Apache Hadoop. Doug Cutting, a Yahoo search engineer, developed this open-source software to create a distributed computing environment ......

1: Hadoop is composed of multiple products.
People often talk about Hadoop as if it were a single product, but it is in fact made up of several different products.
Russom said: "Hadoop is a combination of a series of open-source products, all of which are projects of the Apache Software Foundation."
People usually mention Hadoop in the same breath as MapReduce, but it is HDFS and MapReduce together that form the basis of Hadoop.
2: Apache Hadoop is an open-source technology, but proprietary vendors also provide Hadoop products.
Hadoop is an open-source technology and can be downloaded for free. Because the code is open, vendors such as IBM, Cloudera, and EMC Greenplum can also release their own Hadoop distributions.
These distributions generally add features such as advanced management tools, along with support and maintenance services. Some may scoff at this: if the open-source software is free, why pay a vendor for it? Russom explained that these distributions suit some IT departments better, especially those that already run mature enterprise IT systems.
3: Hadoop is an ecosystem rather than a product.
Hadoop is developed and promoted jointly by the open-source community and various vendors. The vendors' Hadoop products, in particular, tend to be more structured and more focused on integration with other systems.
Russom said: "Reporting platforms and data integration platforms have always provided interfaces to newer platforms, and Hadoop is no exception."
4: HDFS is a file system rather than a database management system.
What Russom finds most frustrating is that people often confuse the two. The ability to manage datasets is a defining feature of a database management system, and HDFS does not have it.
In a database management system, we can query indexes to get random access to data. A DBMS typically handles structured data, whereas Hadoop is generally not used to process data of that kind.
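The difference between file-style access and indexed random access can be illustrated with a minimal Python sketch. It does not use HDFS or any real DBMS; the records, field names, and function names are all hypothetical and chosen only for illustration.

```python
# Hypothetical records; the field names are illustrative only.
records = [
    {"id": 101, "name": "alice"},
    {"id": 205, "name": "bob"},
    {"id": 309, "name": "carol"},
]


def scan_lookup(rows, wanted_id):
    """File-style access: read rows in order until a match is found."""
    for row in rows:
        if row["id"] == wanted_id:
            return row
    return None


def build_index(rows):
    """DBMS-style access: build a key -> row index once, up front."""
    return {row["id"]: row for row in rows}


def index_lookup(idx, wanted_id):
    """Jump straight to the row through the index, without scanning."""
    return idx.get(wanted_id)


index = build_index(records)
```

Both lookups return the same row, but the scan has to touch every preceding record, while the index goes directly to the key. A plain file system offers only the first kind of access unless another layer adds the second.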
5: Hive is similar to SQL, but non-standard SQL.
Traditional business tools for data retrieval are mostly SQL-based. This causes headaches, because Hadoop uses a language that resembles SQL but is not SQL: HiveQL, the query language of Apache Hive.
Russom said: "I often hear people say that 'Hive is very simple to learn, so just learn Hive directly.' But this does not solve the fundamental problem of compatibility with SQL tools."
Russom believes that compatibility is only a short-term problem, but one that does hinder the adoption of Hadoop.
6: Hadoop and MapReduce are mutually related, but they are not mutually dependent.
MapReduce was developed and introduced by Google before HDFS existed. In addition, vendors such as MapR have been promoting implementations of MapReduce that do not rely on HDFS.
Even so, Russom believes the two are complementary. Most of the value of HDFS lies in the tools that can be layered on top of the distributed file system.
7: MapReduce provides control over analysis, rather than the analysis itself.
MapReduce is a general-purpose execution engine that can assist in big data analysis. It takes hand-written code, automatically runs it in parallel over the data, and merges the results into a single set. However, it should be made clear that MapReduce does not perform the analysis itself.
Russom said: "MapReduce can be seen as an upgraded MPP architecture. No matter how you write the code, it can parallelize it, which is very powerful."
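The split between the engine and the analysis can be sketched in a few lines of Python. This is a single-process illustration of the map, shuffle, and reduce steps, not the Hadoop API; the function names and the word-count task are assumptions made for the example. The analysis lives in the two small user functions, and the engine only routes data between them.

```python
from collections import defaultdict


def map_phase(line):
    """User-written map step: emit a (word, 1) pair for every word."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Engine step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """User-written reduce step: combine the values for one key."""
    return key, sum(values)


def run_job(lines):
    """Engine driver: map every line, shuffle, then reduce each group."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = run_job(["big data", "Big Hadoop"])
```

In a real cluster the map and reduce calls would run on many machines at once. Here they run sequentially, which keeps the sketch runnable while showing the same data flow.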
8: Hadoop is not only about data volume, but also about data diversification.
Some people classify Hadoop as a technology for processing massive data volumes, but its real value lies in its ability to process diverse data.
Russom said: "Hadoop is suited to data that most data warehouses are not, such as semi-structured and fully unstructured data."
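What handling diverse data means in practice can be hinted at with a small Python sketch. It is not Hadoop code; the sample lines and the `parse` helper are hypothetical. The idea shown is that structure is applied when the data is read, so JSON records with different fields and free-form text can sit in the same input.

```python
import json

# Hypothetical input: two JSON records with different fields, one plain log line.
raw_lines = [
    '{"user": "u1", "action": "click", "page": "/home"}',
    '{"user": "u2", "action": "purchase", "amount": 19.5}',
    "WARN disk nearly full",
]


def parse(line):
    """Interpret a line at read time: JSON object if possible, else raw text."""
    try:
        record = json.loads(line)
        if isinstance(record, dict):
            return {"kind": "json", **record}
    except json.JSONDecodeError:
        pass
    return {"kind": "text", "raw": line}


parsed = [parse(line) for line in raw_lines]
```

No schema has to be declared before the data is stored. Each job decides how to interpret the lines it reads.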
9: Hadoop is a supplement to the data warehouse, not a substitute for the data warehouse.
Hadoop's ability to manage diverse data types has fueled widespread talk that "the data warehouse will die", but Russom disagrees.
He asked: "How often do people actually replace a technology in the IT field? Almost never."
The data warehouse still performs very well in its own field, and Hadoop can supplement it. The architecture of data warehouses and other systems is moving steadily toward distributed designs, and this is where Hadoop will play its role.
10: Hadoop is not just Web analysis.
Hadoop is widely used by Internet companies. Russom believes that part of Hadoop's popularity comes from its ability to handle many more types of analysis.
Russom cited examples from rail companies, robotics, and retail. A railway company, for instance, can use sensors to detect rail vehicles running at abnormally high temperatures and so prevent accidents.
Although he is very optimistic about Hadoop's prospects, Russom believes it will take several years for it to become mainstream.
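The rail example above can be sketched as a small grouping job in Python. Everything in it is made up for illustration: the vehicle names, the temperature readings, and the 90-degree threshold are assumptions, not figures from Russom or from any railway.

```python
from collections import defaultdict

THRESHOLD_C = 90.0  # hypothetical alarm threshold, in degrees Celsius

# Made-up sensor readings: (vehicle id, temperature in Celsius).
readings = [
    ("car-1", 41.0),
    ("car-2", 95.5),
    ("car-1", 43.2),
    ("car-2", 88.0),
    ("car-3", 39.9),
]


def hottest_by_vehicle(rows):
    """Group readings by vehicle and keep the peak temperature of each."""
    peaks = defaultdict(lambda: float("-inf"))
    for vehicle, temp in rows:
        peaks[vehicle] = max(peaks[vehicle], temp)
    return dict(peaks)


def overheated(rows, threshold=THRESHOLD_C):
    """Return the sorted ids of vehicles whose peak exceeds the threshold."""
    peaks = hottest_by_vehicle(rows)
    return sorted(v for v, peak in peaks.items() if peak > threshold)


alerts = overheated(readings)
```

The grouping step is the part a framework such as MapReduce would spread across machines. The threshold check is the analysis that the user has to supply.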
11: Big data does not necessarily mean Hadoop.
Although big data and Hadoop are closely linked, Russom thinks that Hadoop is not the "only" option for big data. He mentioned products from many other vendors, such as Teradata, Sybase IQ (acquired by SAP), and Vertica (acquired by HP).
In addition, some enterprises were working with big data before Hadoop existed. The telecom industry, for example, began recording call details many years ago.
12: Hadoop is not a "free lunch".
Although Hadoop is an open-source technology, installing and deploying the software costs a lot. Russom said that because Hadoop lacks management tools and support services, enterprises easily run into additional costs when using it. In addition, because there is no optimizer, specialists have to write code by hand for the runtime environment, and these specialists command high salaries.
That is before counting the cost of the hardware and related configuration needed to deploy a Hadoop cluster.
Finally, Russom cautioned: "Never think that Hadoop is free or cheap. You cannot see the hidden overhead behind it at first glance."
