Implement an HTTP storage plugin in Drill


Apache Drill can be used for real-time big data analysis:

Inspired by Google's Dremel, Apache's Drill project is a distributed system for interactive analysis of large datasets. Drill does not try to replace existing big data batch processing frameworks, such as Hadoop MapReduce, or stream processing frameworks, such as S4 and Storm. Instead, it aims to fill the gap they leave: real-time interactive processing of large datasets.

In short, Drill can accept SQL queries, fetch data from multiple data sources, such as HDFS and MongoDB, and analyze it to produce results; a single analysis can pull data from several sources at once. In addition, its distributed architecture supports second-level query latency.

Drill has a flexible architecture. Its front end is not necessarily SQL, and its back end can be connected to other data sources through storage plugins. Here I implemented a demo storage plugin that retrieves data from an HTTP service; it accesses JSON-returning HTTP services via GET requests. The source code can be obtained from my GitHub: drill-storage-http

Examples include:

select name, length from http.`/e/api:search` where $p=2 and $q='avi'
select name, length from http.`/e/api:search?q=avi&p=2` where length > 0
Implementation

There is almost no Drill documentation on implementing a storage plugin of your own; you can only start from the source of existing storage plugins, such as the mongodb one in the Drill sub-project drill-mongo-storage. The implemented storage plugin is packaged as a jar and placed in the jars directory, where it is loaded automatically when Drill starts; the plugin can then be configured with the specified type on the web UI.
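The jar is picked up through Drill's classpath scanning, which reads a drill-module.conf file packaged inside the jar. A minimal sketch, assuming the plugin classes live under org.apache.drill.exec.store.http (the package name is my choice, and the exact mechanism varies, so check it against your Drill version):

# src/main/resources/drill-module.conf
drill.classpath.scanning.packages += "org.apache.drill.exec.store.http"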

Main classes to be implemented include:

AbstractStoragePlugin
StoragePluginConfig
SchemaFactory
BatchCreator
AbstractRecordReader
AbstractGroupScan
AbstractStoragePlugin

StoragePluginConfig is used to configure the plugin, for example:

{  "type" : "http",  "connection" : "http://xxx.com:8000",  "resultKey" : "results",  "enabled" : true}

It must be JSON serializable/deserializable. Drill stores the storage configuration in /tmp/drill/sys.storage_plugins; on Windows, for example, that is D:\tmp\drill\sys.storage_plugins.
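As a concrete illustration, the config class pairs Jackson annotations with the JSON above. This is a minimal sketch modeled on the mongo plugin's config class, not the exact code of the demo:

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.annotation.JsonTypeName;
import org.apache.drill.common.logical.StoragePluginConfigBase;

@JsonTypeName(HttpStoragePluginConfig.NAME) // matches the "type" field in the JSON config
public class HttpStoragePluginConfig extends StoragePluginConfigBase {
  public static final String NAME = "http";

  private final String connection; // base URL of the HTTP service
  private final String resultKey;  // JSON key under which result rows are found

  @JsonCreator
  public HttpStoragePluginConfig(@JsonProperty("connection") String connection,
                                 @JsonProperty("resultKey") String resultKey) {
    this.connection = connection;
    this.resultKey = resultKey;
  }

  public String getConnection() { return connection; }
  public String getResultKey() { return resultKey; }

  @Override // Drill compares configs to decide whether a plugin instance can be reused
  public boolean equals(Object o) {
    if (this == o) { return true; }
    if (o == null || getClass() != o.getClass()) { return false; }
    HttpStoragePluginConfig that = (HttpStoragePluginConfig) o;
    return connection.equals(that.connection) && resultKey.equals(that.resultKey);
  }

  @Override
  public int hashCode() {
    return 31 * connection.hashCode() + resultKey.hashCode();
  }
}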

AbstractStoragePlugin is the main class of the plugin and must be used together with a StoragePluginConfig. When implementing this class, the constructor must follow the parameter convention, for example:

public HttpStoragePlugin(HttpStoragePluginConfig httpConfig, DrillbitContext context, String name)

When Drill starts, it automatically scans for AbstractStoragePlugin implementation classes (via StoragePluginRegistry) and builds a mapping from StoragePluginConfig.class to the AbstractStoragePlugin constructor. The interfaces AbstractStoragePlugin needs to implement include:

// Needs to return an AbstractGroupScan implementation.
// selection includes the database name and table name.
public AbstractGroupScan getPhysicalScan(String userName, JSONOptions selection) throws IOException;

// Registers the schema.
public void registerSchemas(SchemaConfig schemaConfig, SchemaPlus parent) throws IOException;

// Optional: rules used to optimize the plan generated by Drill.
public Set<StoragePluginOptimizerRule> getOptimizerRules();

The schema in Drill is used to describe a database and to handle things such as tables; it must be implemented, otherwise no table can be found in any SQL query. AbstractGroupScan provides information about a query, for example which columns are queried.

During a query, Drill uses an intermediate (JSON-based) data structure called a Plan, which is divided into the Logic Plan and the Physical Plan. The Logic Plan is the first intermediate structure, used to fully express a query; it is what SQL or another front-end query language is converted into. It is then converted into the Physical Plan, also known as the Execution Plan, an optimized plan that can interact with the data source to perform the real query. StoragePluginOptimizerRule is used to optimize the Physical Plan. The final structure of these plans resembles a syntax tree; after all, SQL can also be considered a programming language. StoragePluginOptimizerRule can be understood as rewriting these syntax trees. For example, the mongo storage plugin implements this class to convert the filter in a where clause into mongodb's own filter (for example, {'$gt': 2}), thereby optimizing the query.

Another Apache project is involved here: Calcite, formerly known as OptiQ. The entire execution of SQL statements in Drill relies mainly on this project. Optimizing the Plan is difficult, because documentation is lacking and there is a lot of related code.

SchemaFactory

registerSchemas mainly calls the SchemaFactory.registerSchemas interface. The Schema in Drill is a tree structure, so you can see that registerSchemas actually adds a child to the parent:

public void registerSchemas(SchemaConfig schemaConfig, SchemaPlus parent) throws IOException {
    HttpSchema schema = new HttpSchema(schemaName);
    parent.add(schema.getName(), schema);
}

HttpSchema derives from AbstractSchema and implements its interfaces, mainly getTable. Because a table in my HTTP storage plugin is actually the query passed to the HTTP service, tables are dynamic, so the getTable implementation is relatively simple:

public Table getTable(String tableName) { // table name can be any string
    HttpScanSpec spec = new HttpScanSpec(tableName); // will be passed to getPhysicalScan
    return new DynamicDrillTable(plugin, schemaName, null, spec);
}
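For context, the class around getTable can be shaped roughly like this (a sketch; AbstractSchema's constructor takes the parent path plus the schema name, and the exact set of methods to override varies across Drill versions):

import java.util.Collections;
import org.apache.drill.exec.store.AbstractSchema;

public class HttpSchema extends AbstractSchema {
  private final String schemaName;

  public HttpSchema(String schemaName) {
    super(Collections.<String>emptyList(), schemaName); // empty parent path: a root-level schema
    this.schemaName = schemaName;
  }

  @Override
  public String getTypeName() {
    return HttpStoragePluginConfig.NAME; // "http"
  }

  // getTable(...) as shown above
}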

Here HttpScanSpec holds some parameters of the query, for example the table name, which is the HTTP service query such as /e/api:search?q=avi&p=2. It is passed to AbstractStoragePlugin.getPhysicalScan in JSONOptions:

public AbstractGroupScan getPhysicalScan(String userName, JSONOptions selection) throws IOException {
    HttpScanSpec spec = selection.getListWith(new ObjectMapper(), new TypeReference<HttpScanSpec>() {});
    return new HttpGroupScan(userName, httpConfig, spec);
}

You will see the usage of HttpGroupScan later.
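For the getListWith deserialization above to work, HttpScanSpec itself only needs to be a plain Jackson-serializable holder. A minimal sketch (the field name is my assumption):

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;

public class HttpScanSpec {
  private final String tableName; // e.g. "/e/api:search?q=avi&p=2"

  @JsonCreator
  public HttpScanSpec(@JsonProperty("tableName") String tableName) {
    this.tableName = tableName;
  }

  public String getTableName() {
    return tableName;
  }
}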

AbstractRecordReader

AbstractRecordReader is responsible for reading data and returning it to Drill. BatchCreator is used to create the AbstractRecordReader:

public class HttpScanBatchCreator implements BatchCreator<HttpSubScan> {
  @Override
  public CloseableRecordBatch getBatch(FragmentContext context,
      HttpSubScan config, List<RecordBatch> children)
      throws ExecutionSetupException {
    List<RecordReader> readers = Lists.newArrayList();
    readers.add(new HttpRecordReader(context, config));
    return new ScanBatch(config, context, readers.iterator());
  }
}

Since AbstractRecordReader needs to know the query passed to the HTTP service, the query saved in HttpScanSpec and then carried by HttpGroupScan must reach it; this is why you will see HttpGroupScan pass its parameter information on to HttpSubScan.

Drill also automatically scans for BatchCreator implementations, so you don't have to worry about registering HttpScanBatchCreator anywhere.

The HttpSubScan implementation is relatively simple; it is mainly used to store the HttpScanSpec:

public class HttpSubScan extends AbstractBase implements SubScan // SubScan is required
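Fleshed out a little, HttpSubScan mostly just carries the config and the spec over to the BatchCreator. A rough sketch (AbstractBase and SubScan demand a few more overrides, e.g. getOperatorType, which I omit here; signatures may differ by Drill version):

import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import org.apache.drill.exec.physical.base.AbstractBase;
import org.apache.drill.exec.physical.base.PhysicalOperator;
import org.apache.drill.exec.physical.base.PhysicalVisitor;
import org.apache.drill.exec.physical.base.SubScan;

public class HttpSubScan extends AbstractBase implements SubScan {
  private final HttpStoragePluginConfig config;
  private final HttpScanSpec scanSpec;

  @JsonCreator
  public HttpSubScan(@JsonProperty("config") HttpStoragePluginConfig config,
                     @JsonProperty("scanSpec") HttpScanSpec scanSpec) {
    this.config = config;
    this.scanSpec = scanSpec;
  }

  public HttpStoragePluginConfig getConfig() { return config; }
  public HttpScanSpec getScanSpec() { return scanSpec; }

  @Override
  public <T, X, E extends Throwable> T accept(PhysicalVisitor<T, X, E> visitor, X value) throws E {
    return visitor.visitSubScan(this, value);
  }

  @Override
  public PhysicalOperator getNewWithChildren(List<PhysicalOperator> children) {
    return new HttpSubScan(config, scanSpec); // leaf operator: nothing to rewire
  }

  @Override
  public Iterator<PhysicalOperator> iterator() {
    return Collections.emptyIterator(); // a leaf in the operator tree
  }
  // getOperatorType() etc. omitted
}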

Back in HttpGroupScan, the interface that must be implemented:

public SubScan getSpecificScan(int minorFragmentId) {
    // will be passed to the HttpScanBatchCreator.getBatch interface
    return new HttpSubScan(config, scanSpec);
}

The final query is handed to HttpRecordReader. The interfaces to implement in this class are setup and next, a bit like an iterator: connect and query in setup, then in next feed the data to the Drill instance, using VectorContainerWriter and JsonReader. This is the legendary vector data format in Drill, that is, column-oriented storage.
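A skeletal HttpRecordReader could look like the following. The VectorContainerWriter/JsonReader APIs changed across Drill versions, so treat the calls here as illustrative assumptions rather than exact signatures:

import org.apache.drill.common.exceptions.ExecutionSetupException;
import org.apache.drill.exec.ops.FragmentContext;
import org.apache.drill.exec.ops.OperatorContext;
import org.apache.drill.exec.physical.impl.OutputMutator;
import org.apache.drill.exec.store.AbstractRecordReader;
import org.apache.drill.exec.vector.complex.impl.VectorContainerWriter;

public class HttpRecordReader extends AbstractRecordReader {
  private final HttpSubScan subScan;
  private VectorContainerWriter writer;

  public HttpRecordReader(FragmentContext context, HttpSubScan subScan) {
    this.subScan = subScan;
  }

  @Override
  public void setup(OperatorContext context, OutputMutator output) throws ExecutionSetupException {
    // Issue the HTTP GET built from subScan.getScanSpec() and keep the parsed
    // JSON rows; then prepare the vector writer.
    writer = new VectorContainerWriter(output);
    // a JsonReader would also be created here; its constructor is version-dependent
  }

  @Override
  public int next() {
    writer.reset();
    int count = 0;
    // For each remaining JSON record in the response (up to one batch):
    //   writer.setPosition(count);
    //   jsonReader.write(writer); // writes one record into the value vectors
    //   count++;
    writer.setValueCount(count);
    return count; // returning 0 tells Drill this reader is exhausted
  }

  @Override
  public void close() {
    // release the HTTP response, buffers, etc.
  }
}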

Summary

The sections above cover creating the plugin and passing the query along during a query. Similarly, the columns in select title, name are passed through the HttpGroupScan.clone interface, but I am not concerned with them here. Once all this is done, you can query the data in an HTTP service through Drill.

As for the where filter in select * from xx where xx, Drill filters the queried data itself. If you want to construct the data source's own filter, as the mongo plugin does for mongodb, you need to implement StoragePluginOptimizerRule.
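For orientation, the mongo plugin's rule is shaped roughly like the skeleton below (class names are from Drill/Calcite around the 1.x line; the body is schematic, and an http version would build a service query string instead of a mongodb filter):

import org.apache.calcite.plan.RelOptRuleCall;
import org.apache.calcite.rex.RexNode;
import org.apache.drill.exec.planner.logical.RelOptHelper;
import org.apache.drill.exec.planner.physical.FilterPrel;
import org.apache.drill.exec.planner.physical.ScanPrel;
import org.apache.drill.exec.store.StoragePluginOptimizerRule;

public class HttpPushDownFilterForScan extends StoragePluginOptimizerRule {
  public static final HttpPushDownFilterForScan INSTANCE = new HttpPushDownFilterForScan();

  private HttpPushDownFilterForScan() {
    // match a Filter sitting directly on top of a Scan in the physical plan
    super(RelOptHelper.some(FilterPrel.class, RelOptHelper.any(ScanPrel.class)),
        "HttpPushDownFilterForScan");
  }

  @Override
  public void onMatch(RelOptRuleCall call) {
    final FilterPrel filter = call.rel(0);
    final ScanPrel scan = call.rel(1);
    final RexNode condition = filter.getCondition();
    // Translate 'condition' into the HTTP query (e.g. q=avi&p=2), build a new
    // group scan that carries it, and hand the rewritten scan back to the planner:
    // call.transformTo(newScanPrel);
  }
}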

The HTTP storage plugin I implemented here assumes that the query passed to the HTTP service may be constructed dynamically, for example:

select name, length from http.`/e/api:search` where $p=2 and $q='avi'     # p=2&q=avi is built dynamically; the values come from the where clause
select name, length from http.`/e/api:search?q=avi&p=2` where length > 0  # static

The first query requires the help of StoragePluginOptimizerRule, which collects all the filters in the where clause and uses them as the query to the HTTP service. The implementation here, however, is not complete.

In general, it is difficult to extend Drill because the project is relatively new, especially when it comes to Plan optimization.

Original address: http://codemacro.com/2015/05/30/drill-http-plugin/
Written by Kevin Lynx, posted at http://codemacro.com
