"Abstract" when Hadoop enters the enterprise, it must face the problem of how to solve and deal with the traditional and mature it information architecture. In the past, MapReduce was mainly used to solve unstructured data such as log file analysis, Internet click Stream, Internet index, machine learning, financial analysis, scientific simulation, image storage and matrix calculation. But in the enterprise, how to deal with the original structured data is a difficult problem for enterprises to enter into large data field. Enterprises need large data technologies that can handle both unstructured and structured data.
In the large data age, Hadoop is mainly used to deal with unstructured data, and how to deal with the traditional IOE architecture of structured data is a difficult problem for enterprises. In this context, SQL on Hadoop, which handles both structured data and unstructured data, has emerged.
SQL on Hadoop was the hottest topic of the 2013, and it was pushed by Cloudera Impala's release version. Currently, SQL on Hadoop is in its infancy and has many technical practices. And because the enterprise has adapted to the flexible processing of small data, go to Hadoop suddenly become disoriented, so the voice of SQL on Hadoop is growing. SQL on Hadoop guarantees both Hadoop performance and SQL flexibility. About SQL on Hadoop, the industry has different views, industry major data companies are also actively studying.
1. The traditional DB-on-top approach
Some North American vendors take the traditional DB-on-top approach to SQL on Hadoop, combining different computing frameworks for different data operations. Representative vendors are EMC Greenplum, Hadapt, and Citus Data. Hadapt connects a PostgreSQL layer to Hadoop to query structured data. It provides a unified data-processing environment that leverages the high scalability of Hadoop and the high speed of relational databases by splitting query execution between Hadoop and the relational database. Citus Data uses distributed processing techniques and completes queries by converting multiple data types into the database's native types.
Figure 1. Hadapt
The DB-on-top approach was an early industry attempt to handle structured and unstructured data together. It was first proposed by Hadapt in 2010 and could run on the Amazon EMR Community Edition. In essence, however, the data is stored separately in two computing frameworks. As Figure 1 shows, structured data is stored in a high-performance relational engine (High-Performance Relational Engine for Structured Data), while unstructured data is stored in the Hadoop Distributed File System (Hadoop Distributed File System for Unstructured Data), and interaction between the two kinds of data depends on split execution of the query. Given how metadata has to be organized and controlled across the two stores, this approach can only be a transitional technology in the evolution of system scaling.
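The idea of split execution can be illustrated with a minimal Python sketch. It is not Hadapt's implementation: an in-memory SQLite database stands in for the relational engine, a plain list of log lines stands in for files on HDFS, and every name here (`customers`, `LOG_LINES`, `count_clicks_per_user`, `split_query`) is hypothetical and chosen only for illustration.

```python
import sqlite3
from collections import Counter


def build_relational_store():
    """Structured side: in-memory SQLite stands in for the relational engine."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(1, "alice"), (2, "bob")],
    )
    return conn


# Unstructured side: raw log lines stand in for files stored on HDFS.
LOG_LINES = [
    "2013-05-01 click user=1 page=/home",
    "2013-05-01 click user=2 page=/cart",
    "2013-05-02 click user=1 page=/cart",
    "malformed line",
]


def count_clicks_per_user(lines):
    """MapReduce-style pass: extract user ids (map) and count them (reduce)."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            if token.startswith("user="):
                counts[int(token[len("user="):])] += 1
    return counts


def split_query(conn, lines):
    """Run one half of the query on each store, then join the partial results."""
    clicks = count_clicks_per_user(lines)  # unstructured half
    rows = conn.execute(
        "SELECT id, name FROM customers ORDER BY id"
    ).fetchall()  # structured half
    return [(name, clicks.get(cid, 0)) for cid, name in rows]


result = split_query(build_relational_store(), LOG_LINES)
print(result)  # [('alice', 2), ('bob', 1)]
```

Even in this toy form, the join step has to know which store holds which data and how the keys line up, which is the metadata coordination burden described above.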