#Data Technology Selection# Ad hoc query with Shib + Presto; cluster task scheduling with Hue + Oozie


Created by Zheng on 2014/10/30; last updated 2014/10/31.

I) Selection: Shib + Presto

Application scenario: ad hoc (Ad-hoc) queries.

1.1. Goals of ad hoc query

The users are data analysts on the product, operations, and sales teams. Analysts are expected to be able to write SQL query scripts and to know which data marts hold which business data. Whether their jobs run against a database or against Hadoop, the computation can take a long time, so waiting online is impractical. Instead, a user submits a computing task (a Pig script or Hive SQL), and the console reports that the task has been queued, with an estimated completion time and other friendly hints. These jobs run at low priority. Users and administrators can view the tasks in the queue, including each task's start time, running duration, and result. When a task finishes, the console shows a notification (or an e-mail is sent), and the user can view and download the results online.
1.2. Current technology selection for ad hoc query

Graphical interactive interface: Shib. Data query engine: Facebook Presto.

1.3. Why change the data query engine?

MapReduce-based Hadoop is good at batch processing but poorly suited to ad hoc query. MySQL, built on the InnoDB/MyISAM storage engines, is naturally unsuitable as well. We also evaluated InfiniDB and Infobright, column-store database engines (still based on MySQL); they fit historical archive data that never changes, which makes them a poor match for e-commerce scenarios. And our Eagle Eye (tracing) project had already stumbled on real-time query: its HBase back end could not sustain the live insert and query load at large data volumes.

"Hive is better suited to long-running batch query analysis, while Impala, Shark, Stinger, and Presto target real-time interactive SQL queries; they give data analysts big-data analysis tools for rapidly testing and validating ideas. So you can use Hive for data transformation, then use one of these four systems for fast analysis on the result set Hive produces. Impala, Shark, Stinger, and Presto are all SQL-like real-time big-data query engines, but their technical focuses are completely different, and none of them was born to replace Hive: Hive remains extremely valuable for data warehousing. These four systems and Hive are all data query tools built on top of Hadoop, each adapted to a different workload, but from the client's perspective they share a great deal with Hive: table metadata, Thrift interfaces, ODBC/JDBC drivers, SQL syntax, flexible file formats, storage resource pools, and so on." -- "The State of Open-Source Big Data Query and Analysis Engines, 2014"

In the end we chose Presto. Facebook open-sourced Presto, a distributed SQL query engine, in November 2013; it is designed for high-speed, real-time data analysis. It supports standard ANSI SQL, including complex queries, aggregations, joins, and window functions. Presto also defines a simple data-storage abstraction layer, so SQL queries can run against a variety of data storage systems, including HBase, HDFS, Scribe, and so on.
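As an illustration of that claim, here is a minimal sketch of the kind of aggregation-plus-window-function query Presto handles interactively; the orders table and its columns are hypothetical, not from the original article:

```sql
-- Hypothetical schema: orders(merchant_id, order_date, amount).
-- Daily revenue per merchant plus a 7-day moving average, combining
-- a GROUP BY aggregation with an ANSI SQL window function.
SELECT
    merchant_id,
    order_date,
    SUM(amount) AS daily_revenue,
    AVG(SUM(amount)) OVER (
        PARTITION BY merchant_id
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS revenue_7d_avg
FROM orders
GROUP BY merchant_id, order_date
ORDER BY merchant_id, order_date;
```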

Presto's simplified architecture is shown in Figure 1. The client sends a SQL query to the Presto coordinator. The coordinator checks the syntax, analyzes the query, and plans its execution. The scheduler assembles the execution pipeline, assigns tasks to the nodes closest to the data, and monitors their progress. The client pulls data from the output stage, which in turn pulls data from the stages below it.

Presto's execution model is fundamentally different from Hive's. Hive translates a query into multiple stages of MapReduce tasks that run one after another, each stage reading its input from disk and writing its intermediate results back to disk. Presto does not use MapReduce at all. It uses a custom query execution engine with operators designed to support SQL semantics. Beyond an improved scheduling algorithm, all data processing happens in memory, with the processing stages connected into a pipeline over the network. This avoids unnecessary disk reads and writes and the latency they add. The pipelined execution model runs multiple processing stages concurrently, streaming data from one stage to the next as soon as it becomes available.

This greatly reduces end-to-end response time for many kinds of queries.
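To see how Presto breaks a query into pipelined stages rather than MapReduce jobs, you can ask it for its distributed plan. A minimal sketch, assuming the same hypothetical orders table as above (the EXPLAIN option shown here is what Presto documents; output details vary by version):

```sql
-- Show the distributed execution plan: each fragment in the output is a
-- stage of the in-memory pipeline, exchanging data over the network.
EXPLAIN (TYPE DISTRIBUTED)
SELECT merchant_id, COUNT(*) AS order_cnt
FROM orders
GROUP BY merchant_id;
```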

At the same time, Presto defines a simple data-storage abstraction layer so that SQL queries can run against different data storage systems. Besides Hive/HDFS, storage connectors currently exist for HBase, Scribe, and custom in-house systems.
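From the SQL side, a connector surfaces as a catalog in a fully qualified table name. A hypothetical sketch (the hive catalog name matches Presto's convention; the schema echoes the presto-wowo_dw warehouse mentioned below, and the table and columns are made up):

```sql
-- Tables are addressed as catalog.schema.table, so the same SQL syntax
-- reaches whichever storage system the connector wraps.
SELECT user_id, COUNT(*) AS visits
FROM hive.wowo_dw.page_views
GROUP BY user_id
LIMIT 100;
```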

Figure 1. Presto architecture

1.4. Choosing between Hue and Shib

Everyone has probably heard of Hue. Shib is less familiar; it introduces itself as "WebUI for query engines: hive and Presto". Pan summarized the pros and cons of both.

Hue. Development language: Python. Pros: Hue is a web application for interacting with Apache Hadoop, an open-source Apache Hadoop UI. We already use Hue in our production environment; it is a great advantage for managing HBase/Pig/Hive, and it ships with an Oozie app for creating and monitoring workflows. Cons: Hue is a relatively heavy tool; any change touches many things, and every subsequent upgrade may force us to re-apply our modifications.

Shib. Development language: Node.js. Pros: with simple configuration, Shib can drive Hive and Presto directly. The code base is small, so modifications are cheap. Cons: we are not familiar with Node.js, so there is a learning cost.

In the end we picked Shib for its relatively small code base and low development effort.

1.5. The ad hoc query interface

After logging in to Shib, select the data warehouse presto-wowo_dw. While writing SQL, you can move the table-structure hint box to the side and consult it as you type, as shown.

Figure 2. Consulting the data structure while writing a query

Because all queries are asynchronous, you can check the execution status and results of your statements in the "My Queries" list instead of waiting on the query screen, as shown.

Figure 3. My Queries

You can also save frequently used statements as "bookmarks", which is a very handy feature. Next we plan to build an in-site notification mechanism for SQL query results and a finer-grained user access control mechanism.

II) Selection: Hue + Oozie

Application scenario: a scheduling and management platform for Hadoop cluster computing tasks.

2.1. Difficulties in running the data platform's jobs

The e-commerce data platform produces reports along many dimensions: the general briefing view, the operations view, the media view, and so on, as well as slices by merchandise, merchant, user, and competitors, each with daily, weekly, and monthly variants. A large number of computing tasks correspond to these reports, and each task is really a workflow: the computation is complex, one step chained to the next. Hue + Oozie gives us visual management and scheduling of those workflows.

What was life like before Oozie? First, computing scripts were configured as cron jobs; when one blew up, we could only hunt through huge logs for the needle in the haystack, and since we could not tell where it broke, we had to clear the data and rerun from the beginning. A task could run for a very long time with no indication of where it was or how much longer it needed. Second, it was hard to make task B start exactly when task A finished; we could only leave generous gaps between cron schedules, which lacks elasticity.

2.2. What is Oozie?

Oozie is a Java web application that runs in Tomcat and uses a database to store workflow definitions and the currently running workflow instances (including each instance's state and variables). What we appreciate most are three points:
    • Oozie lets a failed workflow be rerun from any point, which is handy when a transient error hits a workflow after an earlier, time-consuming step has already succeeded.
    • It visualizes the workflow's execution progress.
    • The logs and error messages of every workflow step can be viewed with a click, and they scroll in real time, which makes troubleshooting easy.
2.3. A look at the interface

First select "Oozie Editor/Dashboard" in the Hue navigation bar to reach the default panel:

Figure 5. The Oozie default panel

Click a workflow to open its details page:

Figure 6. Workflow details page

A workflow definition, shown in Figure 7, is written in HPDL, an XML format. HPDL is a very concise language with only a handful of flow-control and action nodes. Control nodes define the flow of execution: the start and end points of the workflow (the start, end, and fail nodes) and the mechanisms that steer its execution path (the decision, fork, and join nodes). A minimal hypothetical HPDL sketch is given after the summary below.

Figure 7. Workflow definition

We are now migrating the data platform's various computing tasks to Oozie, redefining them one by one in the HPDL format.

III) Summary of the data center's technology selections

Listed below without further explanation:

Apache Hadoop/Hive/HBase
Apache Pig
Flume/Kafka/Storm/Sqoop/awk
Facebook Presto
MySQL
Hue/Shib
Oozie

-over-
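For reference, here is a minimal hypothetical HPDL workflow illustrating the control nodes named in section 2.3 (start, end, kill/fail, fork, and join). The workflow name, action names, and Hive scripts are made up for illustration; they are not the article's actual jobs.

```xml
<workflow-app name="daily-report-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="fork-reports"/>

    <!-- Fork: run the daily and weekly aggregations in parallel. -->
    <fork name="fork-reports">
        <path start="daily-agg"/>
        <path start="weekly-agg"/>
    </fork>

    <action name="daily-agg">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>daily_agg.sql</script>
        </hive>
        <ok to="join-reports"/>
        <error to="fail"/>
    </action>

    <action name="weekly-agg">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>weekly_agg.sql</script>
        </hive>
        <ok to="join-reports"/>
        <error to="fail"/>
    </action>

    <!-- Join: continue only after both branches have finished. -->
    <join name="join-reports" to="end"/>

    <kill name="fail">
        <message>Workflow failed at [${wf:lastErrorNode()}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```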

