Apache Drill can be used for real-time analysis of big data. Quoting an introduction:
Inspired by Google's Dremel, Apache's Drill project is a distributed system for interactive analysis of large-scale datasets. Drill does not try to replace existing big data batch processing frameworks, such as Hadoop MapReduce, or stream processing frameworks, such as S4 and Storm. Instead, it aims to fill the existing gap: real-time interactive processing of large datasets.
In short, Drill accepts SQL queries, fetches data on the backend from multiple data sources such as HDFS and MongoDB, analyzes it, and outputs the analysis results. Within a single query it can combine data from multiple data sources, and thanks to its distributed architecture it can answer queries in seconds.
Drill is architecturally flexible: its front end does not have to be the SQL query language, and back-end data sources can be plugged in as storage plugins to support new sources. Here I have implemented a storage plugin demo that gets its data from an HTTP service, one that is accessed via GET requests and returns JSON. The source code is available on my GitHub: drill-storage-http
For example, queries like the following become possible (they are explained in more detail at the end of this post):
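select name, length from http.`/e/api:search?q=avi&p=2` where length > 0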
Implementation
To implement your own storage plugin, there is almost no documentation from Drill; the only way is to read the source of other storage plugins, such as the MongoDB one (see the Drill sub-project drill-mongo-storage). The finished storage plugin is packaged as a jar and placed in the jars directory; Drill loads it automatically at startup, and the plugin can then be configured with its type on the web UI.
The main classes that need to be implemented include:
AbstractStoragePlugin
StoragePluginConfig
SchemaFactory
BatchCreator
AbstractRecordReader
AbstractGroupScan
AbstractStoragePlugin
StoragePluginConfig is used to configure the plugin, for example:
{ "type": "http", "Connection": "http://xxx.com:8000", "Resultkey": "Results", "Enabled": true}
It must be JSON serializable/deserializable, and Drill stores the storage configuration in /tmp/drill/sys.storage_plugins, for example D:\tmp\drill\sys.storage_plugins on Windows.
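As a sketch of what such a config class might look like (an assumption patterned on Drill's bundled plugins such as Mongo, not this plugin's exact source; "type" is covered by @JsonTypeName and "enabled" by Drill's base config class):

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.annotation.JsonTypeName;
import org.apache.drill.common.logical.StoragePluginConfigBase;

@JsonTypeName(HttpStoragePluginConfig.NAME)
public class HttpStoragePluginConfig extends StoragePluginConfigBase {
  public static final String NAME = "http";

  private final String connection; // base URL of the HTTP service
  private final String resultKey;  // key under which the JSON response holds rows

  @JsonCreator
  public HttpStoragePluginConfig(@JsonProperty("connection") String connection,
                                 @JsonProperty("resultKey") String resultKey) {
    this.connection = connection;
    this.resultKey = resultKey;
  }

  public String getConnection() { return connection; }
  public String getResultKey() { return resultKey; }

  @Override
  public boolean equals(Object that) {
    if (this == that) { return true; }
    if (that == null || getClass() != that.getClass()) { return false; }
    HttpStoragePluginConfig other = (HttpStoragePluginConfig) that;
    return connection.equals(other.connection) && resultKey.equals(other.resultKey);
  }

  @Override
  public int hashCode() {
    return 31 * connection.hashCode() + resultKey.hashCode();
  }
}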
AbstractStoragePlugin is the main class of the plugin; it must be paired with a StoragePluginConfig. When implementing this class, the constructor must follow the parameter convention, for example:
public HttpStoragePlugin(HttpStoragePluginConfig httpConfig, DrillbitContext context, String name)
When Drill starts, it automatically scans for AbstractStoragePlugin implementation classes (see StoragePluginRegistry) and builds a mapping from StoragePluginConfig.class to the AbstractStoragePlugin constructor. The interfaces that AbstractStoragePlugin needs to implement include:
// The corresponding AbstractGroupScan also needs to be implemented.
// selection contains the database name and table name; it can go unused.
public AbstractGroupScan getPhysicalScan(String userName, JSONOptions selection)

// Registers schemas.
public void registerSchemas(SchemaConfig schemaConfig, SchemaPlus parent) throws IOException

// StoragePluginOptimizerRule is used to optimize the plan generated by Drill;
// it may or may not be implemented.
public Set<StoragePluginOptimizerRule> getOptimizerRules()
The schema in Drill is used to describe a database, and things like table handling must be implemented; otherwise any SQL query will be treated as unable to find the corresponding table. AbstractGroupScan provides the information for a single query, such as which columns are being selected.
When Drill executes a query, there is an intermediate, JSON-based data structure called a plan, divided into a logical plan and a physical plan. The logical plan is the first intermediate structure that fully expresses a query; it is what SQL or other front-end query languages are converted into. It is then converted into the physical plan, also called the execution plan, an optimized plan that can be used to interact with the data source for the actual query. StoragePluginOptimizerRule is used to optimize the physical plan. The final structure of these plans is somewhat like a syntax tree; after all, SQL can also be regarded as a programming language. A StoragePluginOptimizerRule can be understood as rewriting these syntax trees. For example, the Mongo storage plugin implements this class to convert the filter in a where clause into MongoDB's own filter (such as {'$gt': 2}), thereby optimizing the query.
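As an illustration (my own example, not taken from the Mongo plugin's source; the table path is hypothetical), such a rule lets a query like

select * from mongo.test.users where age > 2

push its where condition down to MongoDB as the filter { "age": { "$gt": 2 } }, instead of letting Drill fetch everything and filter the rows itself.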
Another Apache project is involved here: Calcite, formerly known as Optiq. The entire execution of SQL in Drill relies mainly on this project. Optimizing the query plan is difficult, again because of the lack of documentation and the amount of related code.
SchemaFactory

The main work is the SchemaFactory.registerSchemas interface. The schema in Drill is a tree-like structure, so you can see that registerSchemas actually just adds a child to the parent:
public void registerSchemas(SchemaConfig schemaConfig, SchemaPlus parent) throws IOException {
  HttpSchema schema = new HttpSchema(schemaName);
  parent.add(schema.getName(), schema);
}
HttpSchema derives from AbstractSchema, and the main interface to implement is getTable. Because a table in this HTTP storage plugin is really just the query passed on to the HTTP service, tables here are dynamic, so the getTable implementation is relatively simple:
public Table getTable(String tableName) { // the table name can be any string
  HttpScanSpec spec = new HttpScanSpec(tableName); // will be passed to getPhysicalScan
  return new DynamicDrillTable(plugin, schemaName, null, spec);
}
HttpScanSpec is used to carry some of the query's parameters; here, for example, it stores the table name, which is really the HTTP service query, e.g. /e/api:search?q=avi&p=2. It is passed to AbstractStoragePlugin.getPhysicalScan inside the JSONOptions:
public AbstractGroupScan getPhysicalScan(String userName, JSONOptions selection) throws IOException {
  HttpScanSpec spec = selection.getListWith(new ObjectMapper(), new TypeReference<HttpScanSpec>() {});
  return new HttpGroupScan(userName, httpConfig, spec);
}
You will see the use of HttpGroupScan later.
AbstractRecordReader

AbstractRecordReader is responsible for actually reading the data and returning it to Drill. BatchCreator is used to create the AbstractRecordReader:
public class HttpScanBatchCreator implements BatchCreator<HttpSubScan>
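The declaration above implies a getBatch implementation. Here is a minimal sketch, patterned on the Mongo plugin rather than taken from this plugin's source; the HttpRecordReader constructor arguments are an assumption, and signatures vary slightly across Drill versions:

import java.util.List;

import org.apache.drill.common.exceptions.ExecutionSetupException;
import org.apache.drill.exec.ops.FragmentContext;
import org.apache.drill.exec.physical.impl.BatchCreator;
import org.apache.drill.exec.physical.impl.ScanBatch;
import org.apache.drill.exec.record.CloseableRecordBatch;
import org.apache.drill.exec.record.RecordBatch;
import org.apache.drill.exec.store.RecordReader;
import com.google.common.collect.Lists;

public class HttpScanBatchCreator implements BatchCreator<HttpSubScan> {
  @Override
  public CloseableRecordBatch getBatch(FragmentContext context, HttpSubScan subScan,
      List<RecordBatch> children) throws ExecutionSetupException {
    // Wrap a single HttpRecordReader in a ScanBatch; Drill finds this class
    // itself because it scans for BatchCreator implementations.
    List<RecordReader> readers = Lists.newArrayList();
    readers.add(new HttpRecordReader(context, subScan));
    return new ScanBatch(subScan, context, readers.iterator());
  }
}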
Since AbstractRecordReader is responsible for actually reading the data, it certainly needs to know the query to pass to the HTTP service. That query was first stored in HttpScanSpec and then handed to HttpGroupScan, and this is where HttpGroupScan's use shows up: it passes its parameter information on to HttpSubScan.
Drill also automatically scans for BatchCreator implementation classes, so there is no need to worry about where HttpScanBatchCreator gets instantiated.
The implementation of HttpSubScan is relatively simple; it is mainly used to store the HttpScanSpec:
public class HttpSubScan extends AbstractBase implements SubScan // SubScan needs to be implemented
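To make "mainly used to store HttpScanSpec" concrete, here is a hedged sketch; the Jackson annotations, field names, and boilerplate bodies are assumptions modeled on the Mongo sub-scan, not this plugin's exact source:

import java.util.Collections;
import java.util.Iterator;
import java.util.List;

import org.apache.drill.common.exceptions.ExecutionSetupException;
import org.apache.drill.exec.physical.base.AbstractBase;
import org.apache.drill.exec.physical.base.PhysicalOperator;
import org.apache.drill.exec.physical.base.PhysicalVisitor;
import org.apache.drill.exec.physical.base.SubScan;
import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.annotation.JsonTypeName;

@JsonTypeName("http-sub-scan")
public class HttpSubScan extends AbstractBase implements SubScan {
  private final HttpStoragePluginConfig config;
  private final HttpScanSpec scanSpec;

  @JsonCreator
  public HttpSubScan(@JsonProperty("config") HttpStoragePluginConfig config,
                     @JsonProperty("scanSpec") HttpScanSpec scanSpec) {
    this.config = config;
    this.scanSpec = scanSpec;
  }

  public HttpStoragePluginConfig getConfig() { return config; }
  public HttpScanSpec getScanSpec() { return scanSpec; } // read by HttpRecordReader

  @Override
  public <T, X, E extends Throwable> T accept(PhysicalVisitor<T, X, E> visitor, X value) throws E {
    return visitor.visitSubScan(this, value);
  }

  @Override
  public PhysicalOperator getNewWithChildren(List<PhysicalOperator> children) throws ExecutionSetupException {
    return new HttpSubScan(config, scanSpec); // a sub-scan has no children
  }

  @Override
  public Iterator<PhysicalOperator> iterator() {
    return Collections.emptyIterator();
  }

  @Override
  public int getOperatorType() {
    return -1; // hypothetical; a real plugin would register a proper operator type
  }
}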
Back in HttpGroupScan, the interface that must be implemented is:
public SubScan getSpecificScan(int minorFragmentId) { // passed to HttpScanBatchCreator
  return new HttpSubScan(config, scanSpec); // will eventually reach HttpScanBatchCreator.getBatch
}
The query is finally passed to HttpRecordReader. The interfaces this class needs to implement include setup and next, somewhat like an iterator: the data is fetched in setup, and next converts it into Drill's format. For the conversion you can use VectorContainerWriter and JsonReader. This is the legendary vector data format in Drill, that is, column-stored data.
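A minimal sketch of that setup/next contract, patterned on Drill's Mongo record reader of the same era rather than this plugin's source. fetchRows and createJsonReader are hypothetical helpers (the real HTTP GET and the version-specific JsonReader construction are elided), the 4096 cap is an arbitrary per-batch row limit for this sketch, and exact signatures vary across Drill versions:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

import org.apache.drill.common.exceptions.ExecutionSetupException;
import org.apache.drill.exec.ops.FragmentContext;
import org.apache.drill.exec.physical.impl.OutputMutator;
import org.apache.drill.exec.store.AbstractRecordReader;
import org.apache.drill.exec.vector.complex.fn.JsonReader;
import org.apache.drill.exec.vector.complex.impl.VectorContainerWriter;

public class HttpRecordReader extends AbstractRecordReader {
  private final FragmentContext context;
  private final HttpSubScan subScan;
  private VectorContainerWriter writer;
  private JsonReader jsonReader;
  private Iterator<String> rows; // one JSON document per row, fetched in setup()

  public HttpRecordReader(FragmentContext context, HttpSubScan subScan) {
    this.context = context;
    this.subScan = subScan;
  }

  public void setup(OutputMutator output) throws ExecutionSetupException {
    // Query the HTTP service once up front; next() only converts the results.
    writer = new VectorContainerWriter(output);
    jsonReader = createJsonReader();
    rows = fetchRows(subScan.getScanSpec());
  }

  public int next() {
    writer.allocate();
    writer.reset();
    int count = 0;
    try {
      while (rows.hasNext() && count < 4096) {
        writer.setPosition(count);
        jsonReader.setSource(rows.next().getBytes(StandardCharsets.UTF_8));
        jsonReader.write(writer); // converts the JSON into Drill's columnar vectors
        count++;
      }
    } catch (IOException e) {
      throw new RuntimeException("failed to convert a JSON row", e);
    }
    writer.setValueCount(count);
    return count; // returning 0 tells Drill there is no more data
  }

  public void close() { } // release resources if any (hook name varies by version)

  // Hypothetical helper: performs the HTTP GET and yields one JSON string per row.
  private Iterator<String> fetchRows(HttpScanSpec spec) {
    throw new UnsupportedOperationException("HTTP fetch elided in this sketch");
  }

  // Hypothetical helper: hides the version-specific JsonReader construction.
  private JsonReader createJsonReader() {
    throw new UnsupportedOperationException("JsonReader setup elided in this sketch");
  }
}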
Summary

The above covers both creating the plugin itself and handing over the query. The columns in a query like select title, name are passed to the HttpGroupScan.clone interface, but I do not care about them here. With all of this, you can query the data in an HTTP service through Drill.
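For reference, a hedged sketch of that clone interface as it might appear inside HttpGroupScan (the copy constructor and the columns field are assumptions):

// Drill hands the projected columns of "select title, name ..." to the group scan here.
@Override
public GroupScan clone(List<SchemaPath> columns) {
  HttpGroupScan scan = new HttpGroupScan(this); // assumed copy constructor
  scan.columns = columns; // record the projection; this plugin simply ignores it
  return scan;
}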
As for the where filter in select * from xx where xx, Drill itself will filter the data that comes back. If you want to build a filter for the data source, as the Mongo plugin does for MongoDB, you need to implement StoragePluginOptimizerRule.
With the HTTP storage plugin I implemented here, the idea is that the query passed to the HTTP service might be built dynamically, for example:
select name, length from http.`/e/api:search` where $p=2 and $q='avi'  # p=2&q=avi is built dynamically; the values can come from other query results

select name, length from http.`/e/api:search?q=avi&p=2` where length > 0  # this one is static
The first query needs the help of a StoragePluginOptimizerRule, which would collect all the filters in the where clause and ultimately use them as the HTTP service query. The implementation here is not perfect yet.
Overall, because the Drill project is relatively young, it is fairly difficult to extend, especially in the plan optimization part.
Original address: http://codemacro.com/2015/05/30/drill-http-plugin/
Written by Kevin Lynx, posted at http://codemacro.com