Today, almost everyone has heard of Apache Hadoop. When Doug Cutting, the Yahoo search engineer, developed this open-source software library for distributed computing and named it after his son's toy elephant, who would have thought it would one day take the top spot among "big data" technologies?
While Hadoop is closely associated with big data, many users are thought to know little about it. At last week's TDWI Solution Summit, TDWI research director and industry analyst Philip Russom delivered a keynote titled "12 Facts about Hadoop," which we summarize in this article in the hope of helping you learn more about Hadoop.
Fact 1: Hadoop is made up of multiple products.
When people talk about Hadoop, they often treat it as a single product, but it is in fact made up of a number of different products.
Russom said: "Hadoop is a family of open source products, all of which are Apache Software Foundation projects."
People also tend to equate Hadoop with MapReduce, but in fact HDFS and MapReduce together form the foundation of Hadoop.
Fact 2: Apache Hadoop is open source technology, but proprietary vendors also offer Hadoop products.
Because Hadoop is open source and free to download, vendors such as IBM, Cloudera and EMC Greenplum can release their own distributions of Hadoop.
These distributions typically add features such as advanced management tools, along with support and maintenance services. One might scoff: if the open source version is free, why pay for a vendor's services? Russom explains that these vendor versions are a better fit for some IT organizations, especially those with relatively mature enterprise IT systems.
Fact 3: Hadoop is an ecosystem, not a product.
Hadoop is developed and promoted by both the open source community and a range of vendors. The vendors' Hadoop-related products, in particular, tend to be more structured and more relational.
Russom said: "Reporting platforms and data integration platforms are all adding interfaces to new platforms, and Hadoop is certainly no exception."
Fact 4: HDFS is a file system, not a database management system.
What bothers Russom most is that people often confuse the two. Managing collections of data is one of the most important jobs of any data management system, and that is the job HDFS does.
A database management system, by contrast, lets us query an index to get random access to the data, and it usually handles structured data. Hadoop does not handle data in that way.
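The difference between scanning files and querying an index can be illustrated with a small Python sketch. It is only an analogy, not how HDFS or any particular database is implemented; the records and the function names (`scan_lookup`, `build_index`, `index_lookup`) are made up for the example.

```python
# Hypothetical records, invented for this illustration.
records = [
    {"id": 101, "city": "Austin"},
    {"id": 102, "city": "Boston"},
    {"id": 103, "city": "Denver"},
]

def scan_lookup(rows, wanted_id):
    """File-system style: read every record in order until one matches."""
    for row in rows:
        if row["id"] == wanted_id:
            return row
    return None

def build_index(rows):
    """DBMS style: build a key-to-record index once, up front."""
    return {row["id"]: row for row in rows}

def index_lookup(index, wanted_id):
    """Random access: jump straight to the record through the index."""
    return index.get(wanted_id)

index = build_index(records)
print(scan_lookup(records, 102))
print(index_lookup(index, 102))
```

Both calls find the same record, but the scan may have to read the whole data set, while the index lookup goes directly to the key it was asked for.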
Fact 5: Hive resembles SQL but is not standard SQL.
Most traditional business tools for accessing data are SQL-based, which is a bit of a headache, because Hadoop uses a language that resembles SQL but is not SQL: Apache Hive and its HiveQL.
Russom said: "I often hear people say, 'Hive is easy to learn, so just learn Hive.' But that doesn't solve the fundamental problem of compatibility with SQL-based tools."
Russom thinks compatibility is a short-term problem, but one that is holding back the adoption of Hadoop.
Fact 6: Hadoop and MapReduce are related, but do not depend on each other.
Google developed MapReduce before HDFS existed. In addition, vendors such as MapR have been promoting variants of MapReduce that work without HDFS.
Nevertheless, Russom considers the two complementary. Most of the value of HDFS comes from the tools layered on top of the distributed file system.
Fact 7: MapReduce provides control for analytics, not analytics itself.
MapReduce is a general-purpose execution engine that assists in big data analysis. It takes hand-written code, runs it in parallel across the data, and gathers the mapped results into a single collection. We need to be clear, however, that MapReduce does not perform any analytical work on its own.
Russom said: "MapReduce can be seen as a step up from the MPP architecture. However you write the code, it can be run in parallel, which is very powerful."
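The split between the engine and the analysis can be seen in a toy Python sketch. This is not Hadoop's API: `run_job` stands in for the engine and only handles parallel execution and grouping, while the hand-written `map_words` and `reduce_counts` carry the actual analysis (a word count here). All names are invented for the example, and a thread pool on one machine stands in for a cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_words(line):
    """Hand-written map step: emit a (word, 1) pair for every word."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_counts(key, values):
    """Hand-written reduce step: add up the counts for one word."""
    return key, sum(values)

def shuffle(mapped):
    """Group the mapped values by key so each key is reduced once."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def run_job(lines, mapper, reducer, workers=4):
    """Toy 'engine': run the mapper in parallel, group, then reduce."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(mapper, lines))
    groups = shuffle(mapped)
    return dict(reducer(key, values) for key, values in groups.items())

print(run_job(["big data", "Big elephant"], map_words, reduce_counts))
```

Swapping in a different mapper and reducer changes the analysis without touching `run_job`, which is the sense in which the engine provides control rather than the analysis itself.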
Fact 8: Hadoop is about the variety of data, not just its volume.
Some people classify Hadoop purely as a technology for processing massive volumes of data, but its real value lies in its ability to process many different kinds of data.
Russom said: "Hadoop handles the data that most data warehouses do not, such as semi-structured and completely unstructured data."
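To make "semi-structured and completely unstructured" a little more concrete, here is a minimal Python sketch that reads lines whose shape is not known in advance. It is an illustration only, not something from Russom's talk; the sample lines and the names `parse_line` and `observed_fields` are hypothetical.

```python
import json

def parse_line(line):
    """Return a dict for JSON object lines; wrap anything else as raw text."""
    try:
        value = json.loads(line)
    except json.JSONDecodeError:
        return {"raw_text": line}
    return value if isinstance(value, dict) else {"raw_text": line}

def observed_fields(lines):
    """Collect every field name that shows up, since no schema is fixed."""
    fields = set()
    for line in lines:
        fields.update(parse_line(line).keys())
    return fields

lines = [
    '{"user": "a", "clicks": 3}',        # semi-structured record
    '{"user": "b", "device": "phone"}',  # same source, different fields
    "sensor offline at dawn",            # unstructured free text
]
print(sorted(observed_fields(lines)))
```

A relational table would need its columns decided before loading such data; here the structure is worked out only when the data is read.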
Fact 9: Hadoop complements the data warehouse; it does not replace it.
Hadoop's ability to manage diverse data types has led to claims that "the data warehouse will die," but Russom rejects that view.
"How often do people actually replace a technology in IT?" he asked. "Almost never."
The data warehouse still performs well in its own domain, and Hadoop can complement it. The architecture of data warehouses and other systems is increasingly moving toward distributed designs, and that is where Hadoop has its part to play.
Fact 10: Hadoop is about more than web analytics.
Hadoop is widely used by Internet companies, and Russom believes its use is spreading partly because it can handle many other types of analysis.
Russom cited examples from railroads, robotics and retailing. Railway companies, for instance, can use sensors to detect rail cars running at abnormally high temperatures and so prevent accidents.
Russom, while bullish on the future of Hadoop, believes it will take years for it to become mainstream.
Fact 11: Big data is not necessarily Hadoop.
While big data and Hadoop are closely linked, Russom says Hadoop is not the "only" big data technology. He mentions products from many other vendors, such as Teradata, Sybase IQ (Sybase was acquired by SAP) and Vertica (acquired by HP).
In addition, some companies were working with big data before Hadoop existed. The telecommunications industry, for example, has been handling call detail records for years.
Fact 12: Hadoop is not a "free lunch."
Although Hadoop is open source technology, putting the software into use is expensive. Russom said that because Hadoop lacks management tools and support services, businesses can easily run up additional costs while using it. In addition, because it has no optimizer, professionals have to be brought in to hand-write code for the runtime environment, and such professionals do not come cheap.
On top of that comes the cost of the hardware for a Hadoop cluster and of deploying and configuring it.
He said: "Do not think Hadoop is free or very cheap. There are hidden costs behind it that you cannot see."
(Responsible editor: The good of the Legacy)