Implementing an HTTP Storage Plugin in Apache Drill


Apache Drill can be used for real-time analysis of big data. To quote an introduction:

Inspired by Google's Dremel, Apache's Drill project is a distributed system for interactive analysis of large data sets. Drill does not attempt to replace existing big data batch processing frameworks such as Hadoop MapReduce, or stream processing frameworks such as S4 and Storm. Instead, it fills the remaining gap: real-time interactive processing of large data sets.

In short, Drill accepts SQL queries, fetches data on the backend from multiple data sources such as HDFS and MongoDB, analyzes it, and outputs the results. A single query can combine data from multiple sources, and thanks to its distributed architecture, Drill can deliver second-level query latency.

Drill is architecturally flexible: its front end need not be a SQL query language, and back-end data sources can be added by plugging in storage plugins. Here I have implemented a storage plugin demo that fetches data from an HTTP service. The service is accessed via GET requests and must return JSON. The source code is available on my GitHub: drill-storage-http

Examples include:
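A representative query against this plugin (this form appears again later in the post; the service path and parameters are just those of the demo service):

```sql
select name, length from http.`/e/api:search?q=avi&p=2` where length > 0
```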

Implementation

For implementing your own storage plugin there is almost no Drill documentation; you can only learn from the source of other storage plugins, such as the MongoDB one in the Drill sub-project drill-mongo-storage. The finished storage plugin is packaged as a jar and placed in the jars directory; Drill loads it automatically on startup, and the corresponding type is then configured in the web UI.

The main classes that need to be implemented include:

AbstractStoragePlugin
StoragePluginConfig
SchemaFactory
BatchCreator
AbstractRecordReader
AbstractGroupScan
AbstractStoragePlugin

StoragePluginConfig is used to configure the plugin, for example:

{
  "type": "http",
  "connection": "http://xxx.com:8000",
  "resultKey": "results",
  "enabled": true
}

It must be JSON serializable/deserializable. Drill stores the storage configuration in /tmp/drill/sys.storage_plugins (on Windows, for example, D:\tmp\drill\sys.storage_plugins).

AbstractStoragePlugin is the main plugin class, and it must be paired with a StoragePluginConfig. When implementing this class, the constructor must follow the parameter convention, for example:

public HttpStoragePlugin(HttpStoragePluginConfig httpConfig, DrillbitContext context, String name)

When Drill starts, it automatically scans for AbstractStoragePlugin implementation classes (via StoragePluginRegistry) and establishes a mapping from StoragePluginConfig.class to the AbstractStoragePlugin constructor. The interfaces that AbstractStoragePlugin needs to implement include:

// selection contains the database name and table name; the corresponding
// AbstractGroupScan implementation needs to be returned
public AbstractGroupScan getPhysicalScan(String userName, JSONOptions selection)

// register schemas
public void registerSchemas(SchemaConfig schemaConfig, SchemaPlus parent) throws IOException

// StoragePluginOptimizerRule is for optimizing the Drill-generated plan; implementing it is optional
public Set<StoragePluginOptimizerRule> getOptimizerRules()

The schema in Drill describes a database, and things such as table handling must be implemented; otherwise any SQL query will fail because the corresponding table cannot be found. AbstractGroupScan provides information for a single query, such as which columns are being queried.

When Drill executes a query, there is an intermediate (JSON-based) data structure called a plan, which is divided into a logical plan and a physical plan. The logical plan is the first intermediate structure that fully expresses a query; it is what SQL or another front-end query language is converted into. It is then converted into a physical plan, also called the execution plan, an optimized plan that can be used to interact with the data source for the actual query. StoragePluginOptimizerRule is used to optimize the physical plan. The final structure of these plans is somewhat like a syntax tree; after all, SQL can also be considered a programming language. StoragePluginOptimizerRule can be understood as rewriting these syntax trees. For example, the Mongo storage plugin implements this class to convert the filter in a WHERE clause into MongoDB's own filter (such as {'$gt': 2}), thereby optimizing the query.
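As an illustration, a condition such as where age > 2 could be rewritten by such a rule into MongoDB's native filter document (the field name here is hypothetical):

```json
{ "age": { "$gt": 2 } }
```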

This involves another Apache project: Calcite, formerly known as Optiq. The entire execution of SQL in Drill relies mainly on this project. Optimizing the query plan is difficult, again because of the lack of documentation and the amount of related code that has to be read.

SchemaFactory

registerSchemas mainly calls the SchemaFactory.registerSchemas interface. The schema in Drill is a tree-like structure, so what registerSchemas actually does is add a child to the parent:

public void registerSchemas(SchemaConfig schemaConfig, SchemaPlus parent) throws IOException {
    HttpSchema schema = new HttpSchema(schemaName);
    parent.add(schema.getName(), schema);
}

HttpSchema derives from AbstractSchema; the main interface to implement is getTable. Because in this HTTP storage plugin the table is actually the query passed to the HTTP service, the table is dynamic, so the getTable implementation is relatively simple:

public Table getTable(String tableName) { // table name can be any string
    HttpScanSpec spec = new HttpScanSpec(tableName); // will be passed to getPhysicalScan
    return new DynamicDrillTable(plugin, schemaName, null, spec);
}

HttpScanSpec is used to save some parameters of the query; for example, the table name, which here is the HTTP service query such as /e/api:search?q=avi&p=2, is saved in it. It is passed to AbstractStoragePlugin.getPhysicalScan through the JSONOptions:

public AbstractGroupScan getPhysicalScan(String userName, JSONOptions selection) throws IOException {
    HttpScanSpec spec = selection.getListWith(new ObjectMapper(), new TypeReference<HttpScanSpec>() {});
    return new HttpGroupScan(userName, httpConfig, spec);
}

HttpGroupScan is created here; you will see its use later.

AbstractRecordReader

AbstractRecordReader is responsible for actually reading the data and returning it to Drill. BatchCreator is used to create the AbstractRecordReader.

public class HttpScanBatchCreator implements BatchCreator<HttpSubScan>

Since AbstractRecordReader is responsible for actually reading the data, it needs to know the query that is passed to the HTTP service. But this query went first into HttpScanSpec and then into HttpGroupScan, so HttpGroupScan passes the parameter information on to HttpSubScan.

Drill also automatically scans for BatchCreator implementation classes, so HttpScanBatchCreator does not need to worry about where it is created.

The implementation of HttpSubScan is relatively simple; it is mainly used to store the HttpScanSpec:

public class HttpSubScan extends AbstractBase implements SubScan // needs to implement SubScan

Back in HttpGroupScan, the interface that must be implemented:

public SubScan getSpecificScan(int minorFragmentId) { // passed to HttpScanBatchCreator
    return new HttpSubScan(config, scanSpec); // will eventually be passed to the HttpScanBatchCreator.getBatch interface
}

The final query is passed to HttpRecordReader. The interfaces this class needs to implement include setup and next, somewhat like an iterator: the data is queried in setup, and next converts the data and hands it to Drill. When converting, you can use VectorContainerWriter and JsonReader. This is the legendary vector data format in Drill, i.e. columnar storage.
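The setup/next protocol can be illustrated with a self-contained sketch. Note this is not Drill's actual RecordReader API; the class and method shapes below are simplified stand-ins for the iterator-like contract described above:

```java
import java.util.Iterator;
import java.util.List;

// Illustrative sketch only: mimics the setup()/next() protocol of a Drill
// RecordReader without depending on Drill's APIs. In the real plugin, setup()
// would issue the HTTP GET, and next() would write rows into value vectors
// via VectorContainerWriter/JsonReader.
class RecordReaderSketch {
    private Iterator<String> rows; // stands in for the JSON records fetched in setup()

    // setup(): fetch the data once (here, the "HTTP response" is injected)
    public void setup(List<String> fetchedJsonRecords) {
        this.rows = fetchedJsonRecords.iterator();
    }

    // next(): hand a batch of records to the caller; returns the number of
    // records written, and 0 when the data is exhausted
    public int next(List<String> outBatch, int maxBatchSize) {
        int count = 0;
        while (count < maxBatchSize && rows.hasNext()) {
            outBatch.add(rows.next()); // real code writes into value vectors instead
            count++;
        }
        return count;
    }
}
```

Drill keeps calling next until it returns 0, which is why the reader must track its position across calls.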

Summary

The above covers the creation of the plugin itself and the delivery of a query. Column selections such as select title, name are passed to the HttpGroupScan.clone interface, but I do not deal with them here. With all this in place, you can query data in an HTTP service through Drill.

As for the WHERE filter in select * from xx where ..., Drill itself will filter the data after it has been fetched. If you want to construct a filter for the data source, as the Mongo plugin does for MongoDB, you need to implement StoragePluginOptimizerRule.

The HTTP storage plugin I implemented here assumes that the query passed to the HTTP service might be built dynamically, for example:

select name, length from http.`/e/api:search` where $p=2 and $q='avi'
# p=2&q=avi is built dynamically; its values could be derived from other query results

select name, length from http.`/e/api:search?q=avi&p=2` where length > 0
# this one is static

The first query needs the help of StoragePluginOptimizerRule: it collects all the filters in the WHERE clause and ultimately uses them as the HTTP service query. The implementation here is not perfect.

Overall, because the Drill project is relatively new, it is quite difficult to extend, especially the plan optimization part.

Original address: http://codemacro.com/2015/05/30/drill-http-plugin/
Written by Kevin Lynx, posted at http://codemacro.com

