How big data can inherit traditional data processing in the enterprise

Source: Internet
Author: User
Keywords: big data, traditional data processing

When Hadoop enters the enterprise, it must confront the question of how to fit into a traditional, mature IT information architecture. Above all, how to handle the existing structured data is the hard problem enterprises face when moving into the big data field.

In the past, MapReduce was used mainly for unstructured workloads such as log file analysis, Internet clickstreams, web indexing, machine learning, financial analysis, scientific simulation, image storage, and matrix computation. Inside the enterprise, however, most of the existing data is structured, so enterprises need big data technologies that can handle both unstructured and structured data.

In the big data era, Hadoop is used mainly to process unstructured data; how to handle the structured data held in the traditional IOE architecture (IBM minicomputers, Oracle databases, EMC storage) is the hard problem for enterprises. Against this background, SQL on Hadoop, which handles structured and unstructured data alike, has emerged.

SQL on Hadoop was the hottest topic of 2013, driven by the release of Cloudera Impala. It is still in its infancy, with many competing technical approaches. Enterprises accustomed to the flexibility of SQL over small data feel disoriented when they suddenly move to Hadoop, so the call for SQL on Hadoop keeps growing: it promises both Hadoop's performance and SQL's flexibility. Opinions across the industry differ, and the major big data vendors are all actively researching it.

1. The traditional DB-on-top approach

Some North American vendors use the traditional DB-on-top approach to deliver SQL on Hadoop, combining different computing frameworks for different data operations; representatives include EMC Greenplum, Hadapt, and Citus Data. Hadapt attaches a PostgreSQL engine to Hadoop to answer queries over structured data. It provides a unified data processing environment that exploits Hadoop's high scalability and the relational database's high speed, splitting each query between Hadoop and the relational engine. Citus Data uses distributed processing techniques, converting multiple data types into the database's native types to complete the query.

Figure 1: Hadapt

The DB-on-top approach was an early industry attempt to handle structured and unstructured data together; it was first proposed by Hadapt in 2010, with a community edition ready to run on Amazon EMR. In essence, however, the data is stored separately in two computing frameworks: as shown in Figure 1, structured data lives in a high-performance relational engine, unstructured data lives in the Hadoop Distributed File System, and the two interact through split execution of each query. Because metadata must be organized and controlled across both engines, this looks like a transitional technology in the evolution of such systems.
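To make the split concrete, here is a minimal illustrative sketch: the structured half of a query is pushed to a relational engine (PostgreSQL over JDBC), and the unstructured half is a scan of raw logs in HDFS. The host names, table, columns, and paths are invented for the example, and this is not Hadapt's actual API; the PostgreSQL JDBC driver and Hadoop client jars are assumed to be on the classpath.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SplitQuerySketch {
    public static void main(String[] args) throws Exception {
        // Structured side: push the relational part of the query to PostgreSQL.
        Set<String> orderIds = new HashSet<>();
        try (Connection db = DriverManager.getConnection(
                 "jdbc:postgresql://db-host:5432/sales", "user", "secret");
             Statement st = db.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT order_id FROM orders WHERE amount > 1000")) {
            while (rs.next()) {
                orderIds.add(rs.getString("order_id"));
            }
        }

        // Unstructured side: scan raw click logs in HDFS and keep only the
        // lines that mention one of the order IDs returned by the database.
        FileSystem fs = FileSystem.get(new Configuration());
        try (BufferedReader logs = new BufferedReader(new InputStreamReader(
                 fs.open(new Path("/logs/clickstream/part-00000"))))) {
            String line;
            while ((line = logs.readLine()) != null) {
                for (String id : orderIds) {
                    if (line.contains(id)) {
                        System.out.println(id + "\t" + line);
                        break;
                    }
                }
            }
        }
    }
}
```

A real DB-on-top engine plans and pushes down this split automatically; the point here is only that the two halves of one query run in different systems and meet in the middle, which is exactly where the metadata coordination problem comes from.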

2. Optimizing native Hive

In the open source community, Hortonworks Stinger and Apache Drill are representative. Hortonworks Stinger reworks native Hive to optimize SQL query speed, targeting completion of SQL queries in 5-30 seconds. Apache Drill likewise completes SQL queries by building on and optimizing the native Hive ecosystem (see the client sketch after Figure 2).

  

Figure 2: Hortonworks Stinger
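Whichever engine does the optimizing, applications usually reach it through HiveServer2's JDBC interface. The minimal sketch below assumes a HiveServer2 instance at hive-host:10000 and a web_logs table; both names are placeholders, and the Hive JDBC driver jar must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (newer jars self-register).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hive-host:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT region, COUNT(*) AS cnt FROM web_logs GROUP BY region")) {
            while (rs.next()) {
                System.out.println(rs.getString("region") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}
```

The appeal of this path is that existing SQL tools and reporting code keep working unchanged; only the engine underneath gets faster.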

The open source community's goal in reworking native Hive is to establish a common computing framework and interface. Although these open source projects are still at the incubating stage, they have won industry support: the Apache Drill project, with its open data formats and query language, is backed by MapR, a commercial Hadoop distribution vendor.

The development and contributions of the open source community will be the main force driving large-scale industrial adoption of SQL on Hadoop.

3. Human-machine process interaction

In China, SQL on Hadoop work centers on two things: SQL data processing workflows and ad hoc analysis. In a SQL data processing workflow, many operations can be predefined as a data flow and then run as batches of MapReduce jobs; ETL is the typical case, covering the extraction, cleaning, transformation, and loading of data. By defining the data flow in a friendly graphical interface, users assemble MapReduce jobs with drag-and-drop operations into a workflow that replaces traditional SQL processing, as in the sketch below.
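As a minimal sketch of one such predefined step, the map-only MapReduce job below performs the "cleaning" stage of an ETL flow. The CSV layout, field positions, and HDFS paths are assumptions made for illustration; a real workflow tool would generate and chain many jobs like this one.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CleanRecordsJob {

    // Keeps well-formed 4-field CSV lines and normalizes the last field.
    public static class CleanMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length == 4 && !fields[0].isEmpty()) {   // drop malformed rows
                fields[3] = fields[3].trim().toLowerCase();     // normalize one field
                context.write(NullWritable.get(), new Text(String.join(",", fields)));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "etl-clean");
        job.setJarByClass(CleanRecordsJob.class);
        job.setMapperClass(CleanMapper.class);
        job.setNumReduceTasks(0);                               // map-only cleaning step
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("/etl/raw/orders"));
        FileOutputFormat.setOutputPath(job, new Path("/etl/clean/orders"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Feeding the cleaned output path into the next job's input path is exactly the wiring that the drag-and-drop workflow interface hides from the user.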

4. Ad hoc queries over a multilevel index structure

Ad hoc querying is one of the hard problems of big data: at the PB scale, query efficiency and performance are unsatisfactory. In the traditional data warehouse environment, enterprises rely heavily on OLAP cubes, which preprocess the data and pre-aggregate it along its dimensions, so that ad hoc analysis over small data can be completed simply by configuring dimensions. But in a PB-scale big data environment, how do you build a cube that balances the flexibility of front-end applications against query efficiency? HBase's fast row-key positioning enables millisecond responses and high concurrency for ad hoc queries. Sky Cloud big data builds a multilevel index on top of HBase and applies an MPP approach to statistical analysis, which both works around HBase's query limitations and satisfies ad hoc querying of PB-scale data.
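The sketch below illustrates the kind of secondary-index pattern this describes, assuming an index table orders_by_city_idx whose row keys have the form "<city>|<order row key>" and a data table orders with a column family d. The schema is invented for the example and is not Sky Cloud's actual design.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexedAdHocQuery {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table index = conn.getTable(TableName.valueOf("orders_by_city_idx"));
             Table data  = conn.getTable(TableName.valueOf("orders"))) {

            // Index rows are keyed as "<city>|<orderRowKey>", so a prefix scan
            // over the index returns the row keys of all matching orders.
            Scan scan = new Scan().setRowPrefixFilter(Bytes.toBytes("Beijing|"));
            try (ResultScanner hits = index.getScanner(scan)) {
                for (Result hit : hits) {
                    String indexKey = Bytes.toString(hit.getRow());
                    String orderKey = indexKey.substring(indexKey.indexOf('|') + 1);

                    // Point Get on the data table by row key: millisecond-level read.
                    Result order = data.get(new Get(Bytes.toBytes(orderKey)));
                    System.out.println(orderKey + " -> " + Bytes.toString(
                        order.getValue(Bytes.toBytes("d"), Bytes.toBytes("amount"))));
                }
            }
        }
    }
}
```

The index scan narrows a PB-scale question down to a handful of point reads, which is where the millisecond-level response and high concurrency come from.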

5. Operational SQL on Hadoop

For operational Hadoop, SQL on Hadoop query and response work has been moving from the storage (disk) level into memory. Because of the consistency requirements of distributed memory, progress has been slow, and it has not yet reached the enterprise application level. Distributed in-memory computing is nonetheless flourishing, with representative pioneers such as Splice Machine and SQLstream, and the industry is actively exploring operational Hadoop.

Faced with the large volumes of structured data that enterprises have accumulated over many years, SQL on Hadoop has undoubtedly become the stepping stone for distributed computing frameworks entering the traditional computing market. But we should also recognize that the stage for mainstream distributed computing such as Hadoop is far larger than this: it opens up a broader white-space market of enterprise computing beyond SQL.

The complex world cannot be described simply by flat, unfolded table structures, and SQL is only capable of querying and numerical computation. How are masses of fragmented text and images to be computed over? What does "buying" + … equal? Does "female" + "Dior" equal "elegant" or "sexy"? Can SQL's SUM, GROUP BY, and JOIN perform topic extraction, classification, and clustering of unstructured information? We will discuss these questions in subsequent articles.
