Implementing an HTTP storage plugin for Apache Drill
Apache Drill can be used for real-time analysis of big data:

Inspired by Google's Dremel, Apache Drill is a distributed system for interactive analysis of large datasets. Drill does not try to replace existing big-data batch processing frameworks such as Hadoop MapReduce, or stream processing frameworks such as S4 and Storm. Instead, it fills the gap they leave: real-time interactive processing of large datasets.
In short, Drill accepts SQL queries, fetches data from multiple data sources such as HDFS and MongoDB, and produces analysis results. A single analysis can draw on several data sources at once, and Drill's distributed architecture lets queries return within seconds.
Drill's architecture is flexible: its front end is not necessarily SQL, and its back end can be connected to additional data sources through storage plugins. Here I implemented a storage plugin demo that retrieves data from an HTTP service. The demo accesses HTTP services that return JSON, using GET requests. The source code is available on my GitHub: drill-storage-http.
Examples include:
select name, length from http.`/e/api:search` where $p=2 and $q='avi'
select name, length from http.`/e/api:search?q=avi&p=2` where length > 0
Implementation
There is almost no Drill documentation on implementing a storage plugin of your own; the only way to start is from the source of existing storage plugins, such as the MongoDB one in the Drill sub-project drill-mongo-storage. The finished storage plugin is packaged as a jar and placed in Drill's jars directory, where it is loaded automatically when Drill starts; the plugin type can then be configured in the web UI.
Main classes to be implemented include:
AbstractStoragePlugin
StoragePluginConfig
SchemaFactory
BatchCreator
AbstractRecordReader
AbstractGroupScan
AbstractStoragePlugin
StoragePluginConfig
It is used to configure the plugin, for example:

{
  "type" : "http",
  "connection" : "http://xxx.com:8000",
  "resultKey" : "results",
  "enabled" : true
}
The config class must be JSON serializable/deserializable. Drill stores storage plugin configurations in /tmp/drill/sys.storage_plugins (on Windows, for example, D:\tmp\drill\sys.storage_plugins).
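For reference, here is a minimal sketch of what the config class behind that JSON might look like. The class and field names are my own guesses at the demo's code; StoragePluginConfigBase and the Jackson annotations are what Drill uses to deserialize plugin configs, and value-based equals/hashCode is needed because Drill compares configs to decide whether a plugin instance can be reused.

```java
import java.util.Objects;
import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.annotation.JsonTypeName;
import org.apache.drill.common.logical.StoragePluginConfigBase;

// Hypothetical HttpStoragePluginConfig: field names mirror the JSON above.
@JsonTypeName(HttpStoragePluginConfig.NAME)
public class HttpStoragePluginConfig extends StoragePluginConfigBase {
  public static final String NAME = "http";

  private final String connection;
  private final String resultKey;

  @JsonCreator
  public HttpStoragePluginConfig(@JsonProperty("connection") String connection,
                                 @JsonProperty("resultKey") String resultKey) {
    this.connection = connection;
    this.resultKey = resultKey;
  }

  public String getConnection() { return connection; }
  public String getResultKey() { return resultKey; }

  // Drill compares configs when (re)registering plugins, so these must be
  // value-based rather than identity-based.
  @Override
  public boolean equals(Object o) {
    if (this == o) return true;
    if (o == null || getClass() != o.getClass()) return false;
    HttpStoragePluginConfig that = (HttpStoragePluginConfig) o;
    return Objects.equals(connection, that.connection)
        && Objects.equals(resultKey, that.resultKey);
  }

  @Override
  public int hashCode() { return Objects.hash(connection, resultKey); }
}
```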
AbstractStoragePlugin
This is the main class of the plugin and works together with a StoragePluginConfig. The constructor must follow the expected parameter convention, for example:

public HttpStoragePlugin(HttpStoragePluginConfig httpConfig, DrillbitContext context, String name)

When Drill starts, it automatically scans for AbstractStoragePlugin implementation classes (see StoragePluginRegistry) and associates StoragePluginConfig.class with the matching AbstractStoragePlugin constructor. The AbstractStoragePlugin interfaces to be implemented include:
// Required: return an AbstractGroupScan; selection carries the database and table name
public AbstractGroupScan getPhysicalScan(String userName, JSONOptions selection)
// Required: register schemas
public void registerSchemas(SchemaConfig schemaConfig, SchemaPlus parent) throws IOException
// Optional: rules used to optimize the plan generated by Drill
public Set<StoragePluginOptimizerRule> getOptimizerRules()
The schema in Drill describes a database and handles things such as tables; it must be implemented, otherwise no table can be resolved in any SQL query. AbstractGroupScan provides information about a single query, for example which columns are being queried.
During a query, Drill uses an intermediate (JSON-based) data structure called a Plan, split into the Logic Plan and the Physical Plan. The Logic Plan is the first intermediate form and fully expresses a query; it is what SQL or another front-end query language is converted into. It is then converted into the Physical Plan, also known as the Execution Plan: an optimized plan that can interact with the data source to perform the real query. StoragePluginOptimizerRule is used to optimize the Physical Plan. These plans are ultimately structured much like syntax trees; after all, SQL can also be considered a programming language. A StoragePluginOptimizerRule can therefore be understood as rewriting these syntax trees. The Mongo storage plugin, for example, implements this class to convert the filter in a where clause into MongoDB's own filter (such as {'$gt': 2}) and thereby optimize the query.
Another Apache project is involved here: Calcite, formerly known as Optiq. The execution of SQL statements in Drill relies mainly on this project. Working on Plan optimization is difficult because documentation is scarce and there is a lot of related code.
SchemaFactory
The plugin's registerSchemas mainly calls the SchemaFactory.registerSchemas interface. Schemas in Drill form a tree structure, and as you can see, registerSchemas simply adds a child to the parent:
public void registerSchemas(SchemaConfig schemaConfig, SchemaPlus parent) throws IOException {
  HttpSchema schema = new HttpSchema(schemaName);
  parent.add(schema.getName(), schema);
}
HttpSchema derives from AbstractSchema, and the main interface to implement is getTable. Because a table in my HTTP storage plugin is really just the query passed to the HTTP service, tables are dynamic, so the getTable implementation is relatively simple:
public Table getTable(String tableName) { // table name can be any string
  HttpScanSpec spec = new HttpScanSpec(tableName); // will be passed to getPhysicalScan
  return new DynamicDrillTable(plugin, schemaName, null, spec);
}
Here HttpScanSpec is used to hold some parameters of the query, for example the table name, which in this plugin is the HTTP service query such as /e/api:search?q=avi&p=2. It is passed into AbstractStoragePlugin.getPhysicalScan wrapped in JSONOptions:
public AbstractGroupScan getPhysicalScan(String userName, JSONOptions selection) throws IOException {
  HttpScanSpec spec = selection.getListWith(new ObjectMapper(), new TypeReference<HttpScanSpec>() {});
  return new HttpGroupScan(userName, httpConfig, spec);
}
The use of HttpGroupScan will be shown later.
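Stripped to its essence, HttpScanSpec is just a holder for the query string. The sketch below uses my own guesses at the demo's names, and leaves out the Jackson annotations (such as @JsonCreator/@JsonProperty) that the real class would need so that JSONOptions can deserialize it:

```java
// Hypothetical HttpScanSpec: holds the "table name", which for this plugin
// is really the path-plus-query sent to the HTTP service.
class HttpScanSpec {
  private final String tableName;

  HttpScanSpec(String tableName) {
    this.tableName = tableName;
  }

  String getTableName() {
    return tableName;
  }

  // Full URL for the GET request: plugin "connection" setting + table name.
  String getFullUrl(String connection) {
    return connection + tableName;
  }
}
```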
AbstractRecordReader
AbstractRecordReader is responsible for reading data and returning it to Drill. BatchCreator is used to create the AbstractRecordReader:
public class HttpScanBatchCreator implements BatchCreator<HttpSubScan> {
  @Override
  public CloseableRecordBatch getBatch(FragmentContext context, HttpSubScan config,
      List<RecordBatch> children) throws ExecutionSetupException {
    List<RecordReader> readers = Lists.newArrayList();
    readers.add(new HttpRecordReader(context, config));
    return new ScanBatch(config, context, readers.iterator());
  }
}
Since the AbstractRecordReader must know the query to pass to the HTTP service, and that query lives in the HttpScanSpec that was handed to HttpGroupScan, you will see HttpGroupScan pass its parameter information on to HttpSubScan. Drill also scans for BatchCreator implementations automatically, so you do not need to worry about where HttpScanBatchCreator gets constructed.
HttpSubScan is relatively simple to implement; it mainly stores the HttpScanSpec:

public class HttpSubScan extends AbstractBase implements SubScan // implementing SubScan is required
Back in HttpGroupScan, the required interface:

public SubScan getSpecificScan(int minorFragmentId) {
  return new HttpSubScan(config, scanSpec); // will be passed to HttpScanBatchCreator.getBatch
}
The query finally reaches HttpRecordReader. The interfaces this class must implement are setup and next, a bit like an iterator: connect to the HTTP service and fetch the data in setup, then in next convert the data and hand it to the Drill instance. The conversion can be done with VectorContainerWriter and JsonReader. This is Drill's fabled vector data format, that is, column-store data.
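As a rough skeleton of that class (method bodies abbreviated, Drill imports omitted): this is a non-runnable sketch against the Drill API of that era, and names such as fetchFromHttpService and the exact JsonReader/VectorContainerWriter signatures are my assumptions, not confirmed by the source.

```java
// Hypothetical HttpRecordReader skeleton: fetch JSON from the HTTP service
// in setup(), then feed rows into Drill's value vectors in next().
public class HttpRecordReader extends AbstractRecordReader {
  private final HttpSubScan subScan;
  private VectorContainerWriter writer;
  private JsonReader jsonReader;
  private Iterator<JsonNode> results; // rows fetched from the HTTP service

  public HttpRecordReader(FragmentContext context, HttpSubScan subScan) {
    this.subScan = subScan;
  }

  @Override
  public void setup(OutputMutator output) throws ExecutionSetupException {
    // Issue the GET request (connection + scan-spec query), parse the JSON
    // body, and keep an iterator over the rows found under "resultKey".
    writer = new VectorContainerWriter(output);
    jsonReader = new JsonReader(); // writes JSON objects as Drill vectors
    results = fetchFromHttpService(subScan); // hypothetical helper
  }

  @Override
  public int next() {
    // Write up to one batch of rows into the vectors; return the row count.
    int count = 0;
    writer.allocate();
    writer.reset();
    while (results.hasNext() /* && count < batch limit */) {
      writer.setPosition(count);
      jsonReader.write(results.next(), writer); // hypothetical signature
      count++;
    }
    writer.setValueCount(count);
    return count; // 0 tells Drill there is no more data
  }
}
```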
Summary
The above covers the creation of the plugin and how the query is passed along. For a query like select title, name, the selected columns are passed through the HttpGroupScan.clone interface, though I do not make use of them here. Once all this is done, you can query the data behind an HTTP service through Drill.
As for the where clause in select * from xx where xx, Drill applies the filter itself to the data it has fetched. If you want to construct the data source's own filter, as the Mongo plugin does for MongoDB, you need to implement a StoragePluginOptimizerRule.
The HTTP storage plugin I implemented here anticipates that the query passed to the HTTP service may be constructed dynamically, for example:

select name, length from http.`/e/api:search` where $p=2 and $q='avi'  -- p=2&q=avi is built dynamically; the values can come from
select name, length from http.`/e/api:search?q=avi&p=2` where length > 0  -- static
The first form requires the help of a StoragePluginOptimizerRule, which collects all the filters in the where clause and uses them as the HTTP service query. However, the implementation here is not complete.
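Setting the Calcite machinery aside, the essence of that rule is to fold the collected where-clause conditions into the query string appended to the HTTP path. A plain-Java illustration (class and method names are mine, not from the demo):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: turn collected filter pairs like $p=2, $q='avi'
// into the query string sent to the HTTP service.
class FilterPushDown {
  static String rewrite(String path, Map<String, String> filters) {
    if (filters.isEmpty()) {
      return path;
    }
    StringBuilder url = new StringBuilder(path).append('?');
    boolean first = true;
    for (Map.Entry<String, String> e : filters.entrySet()) {
      if (!first) {
        url.append('&');
      }
      first = false;
      // drop the leading '$' marker used in the SQL where clause
      url.append(e.getKey().substring(1)).append('=').append(e.getValue());
    }
    return url.toString();
  }
}
```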
In general, extending the Drill project is difficult because it is relatively new, especially where Plan optimization is concerned.
Address: http://codemacro.com/2015/05/30/drill-http-plugin/
Written by Kevin Lynx, posted at http://codemacro.com