Eleme's Big Data Computing Engine: Practice and Application


Eleme's BDI (Big Data Platform) R&D team currently has about 20 people, mainly responsible for offline and real-time infrastructure and platform tool development. The 6-person offline team maintains a big data cluster of the following scale:

  • Hadoop cluster of 1,300+ nodes

  • 40+ PB of data stored in HDFS; 3.5+ PB read and 500+ TB written per day

  • 140,000 MapReduce jobs, 100,000 Spark jobs, and 250,000 Presto queries per day

On top of that, the team maintains internal versions of Hadoop, Spark, Hive, Presto, and other components, and resolves the day-to-day problems of the company's 400+ big data cluster users.

This article focuses on how the Eleme big data team lowered the barrier to entry for users by unifying the computing engine entry point; how it enables users to analyze task anomalies and failure causes on their own; and how it uses the task data generated by the cluster itself to monitor compute/storage resource consumption, cluster status, and abnormal tasks.


Unified engine entry point

At present, the query engines provided within the company are mainly Presto, Hive, and Spark, where Spark runs in two modes: Spark Thrift Server and Spark SQL. Kylin is in a steady trial phase, and Druid is under investigation. Each computing engine has its own strengths and weaknesses and suits different computing scenarios.

From the user's point of view, ordinary users cannot reliably judge which engine suits a given workload, so the learning cost is high. And when users choose the execution engine themselves, they tend to pick whichever is reputedly fastest; this inevitably congests that engine, or sends a completely unsuitable task to it, lowering the task success rate.

From a management perspective, too many entry points to the big data cluster make unified management difficult: load balancing and permission control are hard to enforce, and the cluster's overall capacity for external service is hard to govern. And whenever a new computing requirement needs access, we must deploy the corresponding client environment for it.


Functional module

In response to this situation, the Big Data team developed the Dispatcher.

All user tasks are submitted through the Dispatcher, which gives us a single place to implement unified authentication and unified task execution tracking. The Dispatcher can also route queries to execution engines automatically, control the load on each engine, and improve the task success rate through engine downgrading.
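
As one illustration of the load-control idea, the sketch below caps the number of concurrent queries per engine with counting semaphores. This is a minimal sketch under assumed conventions: the engine names come from the article, but the class name, API, and concurrency limits are invented for illustration and are not Eleme's actual implementation.

```java
import java.util.Map;
import java.util.concurrent.Semaphore;

// Hypothetical per-engine load control: each engine gets a fixed number
// of execution slots; a saturated engine rejects (or downgrades) new work.
public class EngineLoadController {
    // Illustrative concurrency caps; real limits would be tuned per engine.
    private final Map<String, Semaphore> slots = Map.of(
            "presto", new Semaphore(100),
            "spark",  new Semaphore(200),
            "hive",   new Semaphore(300));

    /** Try to reserve an execution slot; returns false if the engine is saturated. */
    public boolean tryAcquire(String engine) {
        Semaphore s = slots.get(engine);
        return s != null && s.tryAcquire();
    }

    /** Release the slot once the query finishes or is routed elsewhere. */
    public void release(String engine) {
        Semaphore s = slots.get(engine);
        if (s != null) s.release();
    }
}
```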


Logical architecture

Currently, users can call the Dispatcher service in JDBC mode, or run the Dispatcher directly in Driver mode. After receiving a query request, the Dispatcher performs authentication, engine routing, and other operations, then submits the query to the corresponding engine. The Dispatcher also contains a SQL conversion module that automatically converts Presto SQL into HiveQL when a query is downgraded from the Presto engine to the Spark/Hive engine.
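
Since access is plain JDBC, a client needs little more than a driver and a connection string. The snippet below is a hedged sketch: the JDBC URL scheme, host, port, credentials, and the queried table are placeholder assumptions, as the article does not specify the actual endpoint.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal client-side access to the Dispatcher over JDBC.
public class DispatcherJdbcExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical Dispatcher endpoint; user/password feed the unified authentication.
        String url = "jdbc:dispatcher://dispatcher.example.com:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "analyst", "***");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT dt, COUNT(*) FROM orders GROUP BY dt")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
        // Which engine actually ran the query (Presto, Spark, or Hive) is
        // decided by the Dispatcher; the client never selects it.
    }
}
```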


The benefits of unifying the query entry point through the Dispatcher are as follows:

  • The user access threshold is low: there is no need to learn each engine's usage and trade-offs, nor to select an execution engine manually;

  • The deployment cost is low: clients can connect quickly via JDBC;

  • Authentication and monitoring are unified;

  • The downgrade module improves the task success rate;

  • Load is balanced across the engines;

  • The set of engines is extensible.


Engine extensibility means that query engines such as Kylin and Druid can be added later without users noticing. Because every query submitted to the cluster is collected, we can derive heat data from the existing query plans: which tables are used most frequently across all queries, which tables are often joined together, which fields are often aggregated, and so on. When Kylin is brought online later, this data makes it possible to create or optimize cubes quickly, as sketched below.
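
The sketch below shows one way such heat data could be accumulated from collected query records. QueryRecord and its fields are assumptions standing in for whatever per-query structure the platform actually stores; the frequency counting itself is the point.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical heat-data aggregation over collected query metadata.
public class QueryHeatStats {

    /** Assumed per-query record: tables read and join pairs used. */
    public record QueryRecord(List<String> tablesRead, List<String> joinPairs) {}

    public static void main(String[] args) {
        List<QueryRecord> queries = List.of(
                new QueryRecord(List.of("orders", "users"), List.of("orders<->users")),
                new QueryRecord(List.of("orders"), List.of()));

        Map<String, Integer> tableHeat = new HashMap<>();
        Map<String, Integer> joinHeat = new HashMap<>();
        for (QueryRecord q : queries) {
            q.tablesRead().forEach(t -> tableHeat.merge(t, 1, Integer::sum));
            q.joinPairs().forEach(j -> joinHeat.merge(j, 1, Integer::sum));
        }
        // Tables ranked by read frequency would guide which cubes to build first.
        System.out.println(tableHeat); // e.g. {orders=2, users=1}
        System.out.println(joinHeat);  // e.g. {orders<->users=1}
    }
}
```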


SQL portrait

The core of the Dispatcher is the SQL portrait module.

After a query is submitted, the Dispatcher connects to HiveServer to parse the query plan, from which all of the query's metadata can be obtained, for example:

  • Volume of data read

  • Number of tables/partitions read

  • Types of Join used

  • Number of join fields

  • Aggregation complexity

  • Filter conditions

  • ...

This metadata describes each query quite accurately, and statistics over these dimensions allow each query to be dispatched to a suitable engine, as in the sketch below.
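
As an illustration, the following sketch routes queries based on portrait dimensions like those listed above. The SqlPortrait fields mirror that metadata, but the thresholds, class names, and the specific rules are invented assumptions, not Eleme's actual routing policy.

```java
// Hypothetical portrait-based engine routing.
public class EngineRouter {

    /** A subset of the metadata the SQL portrait module extracts. */
    public record SqlPortrait(long bytesRead, int partitionsRead,
                              int joinCount, int aggregateColumns) {}

    /**
     * Route small interactive scans to Presto, mid-size work to Spark,
     * and very large batch queries to Hive. Thresholds are illustrative.
     */
    public static String route(SqlPortrait p) {
        long gb = p.bytesRead() / (1L << 30);
        if (gb < 50 && p.joinCount() <= 3 && p.partitionsRead() < 200) {
            return "presto";   // low-latency engine for light queries
        }
        if (gb < 1024) {
            return "spark";    // mid-size workloads
        }
        return "hive";         // most stable for very large scans
    }

    public static void main(String[] args) {
        SqlPortrait p = new SqlPortrait(10L << 30, 5, 1, 2); // 10 GiB scan
        System.out.println(route(p)); // -> presto
    }
}
```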
