Implement an HTTP storage plugin in Drill
Apache Drill can be used for real-time big data analysis:
Inspired by Google's Dremel, Apache Drill is a distributed system for interactive analysis of large datasets. Drill does not try to replace existing big data batch processing frameworks, such as Hadoop MapReduce, or stream processing frameworks, such as S4 and Storm. Instead, it fills an existing gap: real-time interactive processing of large datasets.
In short, Drill accepts SQL queries, fetches data from multiple data sources, such as HDFS and MongoDB, and analyzes it to produce results. A single analysis can gather data from several data sources, and the distributed architecture keeps query latency at the level of seconds.
Drill's architecture is flexible. Its front end is not necessarily SQL, and its back end can be connected to additional data sources through storage plugins. Here I implemented a demo storage plugin for retrieving data from an HTTP service. The demo can access HTTP services that return JSON, using GET requests. The source code is available on my GitHub: drill-storage-http
Examples include:
select name, length from http.`/e/api:search` where $p=2 and $q='avi'
select name, length from http.`/e/api:search?q=avi&p=2` where length > 0
Implementation
There is almost no documentation on implementing a storage plugin of your own; you can only start from the source code of existing storage plugins, such as the one for mongodb. For details, refer to the Drill sub-project drill-mongo-storage. An implemented storage plugin is packaged as a jar in the jars directory, loaded automatically when Drill starts, and the corresponding type can then be configured on the web UI.
Main classes to be implemented include:
- AbstractStoragePlugin
- StoragePluginConfig
- SchemaFactory
- BatchCreator
- AbstractRecordReader
- AbstractGroupScan
AbstractStoragePlugin
StoragePluginConfig is used to configure the plugin, for example:
{ "type" : "http", "connection" : "http://xxx.com:8000", "resultKey" : "results", "enabled" : true}
It must be serializable to and from JSON. Drill stores the storage configuration in /tmp/drill/sys.storage_plugins, for example D:\tmp\drill\sys.storage_plugins on Windows.
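For illustration, the config class can look roughly like this. This is a minimal sketch assuming Jackson annotations in the style of the mongo plugin's config; the field names simply mirror the JSON above. Note that StoragePluginConfig implementations need equals/hashCode, since Drill compares configs to detect changes.

import java.util.Objects;

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.annotation.JsonTypeName;

import org.apache.drill.common.logical.StoragePluginConfigBase;

@JsonTypeName(HttpStoragePluginConfig.NAME)
public class HttpStoragePluginConfig extends StoragePluginConfigBase {
  public static final String NAME = "http";

  private final String connection;  // base URL of the HTTP service
  private final String resultKey;   // JSON key under which result rows live

  @JsonCreator
  public HttpStoragePluginConfig(@JsonProperty("connection") String connection,
                                 @JsonProperty("resultKey") String resultKey) {
    this.connection = connection;
    this.resultKey = resultKey;
  }

  public String getConnection() { return connection; }
  public String getResultKey() { return resultKey; }

  @Override
  public boolean equals(Object that) {
    if (this == that) { return true; }
    if (that == null || getClass() != that.getClass()) { return false; }
    HttpStoragePluginConfig other = (HttpStoragePluginConfig) that;
    return Objects.equals(connection, other.connection)
        && Objects.equals(resultKey, other.resultKey);
  }

  @Override
  public int hashCode() {
    return Objects.hash(connection, resultKey);
  }
}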
AbstractStoragePlugin is the main class of the plugin. It must be used together with a StoragePluginConfig. When implementing this class, the constructor must follow the parameter convention, for example:
public HttpStoragePlugin(HttpStoragePluginConfig httpConfig, DrillbitContext context, String name)
When Drill starts, it automatically scans for AbstractStoragePlugin implementations (via the StoragePluginRegistry class) and establishes the mapping between each StoragePluginConfig class and the corresponding AbstractStoragePlugin constructor. The interfaces to be implemented by AbstractStoragePlugin include:
// You need to return an AbstractGroupScan; selection includes the database name and table name.
public AbstractGroupScan getPhysicalScan(String userName, JSONOptions selection)

// Register schemas.
public void registerSchemas(SchemaConfig schemaConfig, SchemaPlus parent) throws IOException

// StoragePluginOptimizerRule is used to optimize the plan generated by Drill; implementing this one is optional.
public Set<StoragePluginOptimizerRule> getOptimizerRules()
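Put together, the skeleton of the plugin class looks roughly like this. It is a sketch modeled on the mongo plugin; the HttpSchemaFactory field is an assumption about how this demo is wired, and imports are omitted because package paths differ across Drill versions.

// imports omitted; package paths differ across Drill versions
public class HttpStoragePlugin extends AbstractStoragePlugin {
  private final HttpStoragePluginConfig engineConfig;
  private final DrillbitContext context;
  private final HttpSchemaFactory schemaFactory;  // assumed helper, see SchemaFactory below

  public HttpStoragePlugin(HttpStoragePluginConfig config, DrillbitContext context, String name) {
    this.engineConfig = config;
    this.context = context;
    this.schemaFactory = new HttpSchemaFactory(this, name);
  }

  @Override
  public HttpStoragePluginConfig getConfig() {
    return engineConfig;
  }

  @Override
  public void registerSchemas(SchemaConfig schemaConfig, SchemaPlus parent) throws IOException {
    // delegate to the SchemaFactory, described in the next section
    schemaFactory.registerSchemas(schemaConfig, parent);
  }

  // getPhysicalScan is shown later, together with HttpScanSpec.
}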
The Schema in Drill describes a database and handles things such as tables. It must be implemented, otherwise no table can be resolved for any SQL query. AbstractGroupScan provides the information for one query, for example which columns are queried.
During a query, Drill uses an intermediate (JSON-based) data structure called a Plan, divided into the Logical Plan and the Physical Plan. The Logical Plan is the first intermediate structure and fully expresses a query; it is what SQL or other front-end query languages are converted into. It is then converted into the Physical Plan, also known as the Execution Plan, an optimized plan that can be used to interact with the data source for the real query. StoragePluginOptimizerRule is used to optimize the Physical Plan. These plans are structured much like syntax trees; after all, SQL can also be considered a programming language. StoragePluginOptimizerRule can therefore be understood as rewriting these syntax trees. For example, the mongo storage plugin implements this class to convert the filter in a where clause into mongodb's own filter (such as {'$gt': 2}), thereby optimizing the query.
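To give a feel for the shape such a rule takes, here is a skeleton modeled on the mongo plugin's MongoPushDownFilterForScan. The class name HttpPushDownFilterForScan and the translation step are hypothetical; the matching machinery (RelOptHelper, ScanPrel, FilterPrel) comes from Drill's planner.

// imports omitted; package paths differ across Drill versions
public class HttpPushDownFilterForScan extends StoragePluginOptimizerRule {
  public static final StoragePluginOptimizerRule INSTANCE = new HttpPushDownFilterForScan();

  private HttpPushDownFilterForScan() {
    // match a filter sitting directly on top of a scan in the physical plan
    super(RelOptHelper.some(FilterPrel.class, RelOptHelper.any(ScanPrel.class)),
        "HttpPushDownFilterForScan");
  }

  @Override
  public boolean matches(RelOptRuleCall call) {
    // only fire for scans backed by this plugin
    final ScanPrel scan = (ScanPrel) call.rel(1);
    return scan.getGroupScan() instanceof HttpGroupScan;
  }

  @Override
  public void onMatch(RelOptRuleCall call) {
    final FilterPrel filter = (FilterPrel) call.rel(0);
    final ScanPrel scan = (ScanPrel) call.rel(1);
    // Translate filter.getCondition() into an HTTP query string, build a new
    // HttpGroupScan carrying it, and hand Drill the rewritten scan via
    // call.transformTo(...); the translation itself is plugin-specific.
  }
}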
Another Apache project is involved here: Calcite, formerly known as OptiQ. The entire execution of SQL statements in Drill relies mainly on this project. Optimizing the Plan is difficult, because documentation is scarce and there is a lot of related code.
SchemaFactory
registerSchemas mainly calls the SchemaFactory.registerSchemas interface. A Schema in Drill is a tree structure, so registerSchemas actually just adds a child to the parent:
public void registerSchemas(SchemaConfig schemaConfig, SchemaPlus parent) throws IOException {
  HttpSchema schema = new HttpSchema(schemaName);
  parent.add(schema.getName(), schema);
}
HttpSchema derives from AbstractSchema and mainly needs to implement the getTable interface. Because a table in my http storage plugin is actually the query passed to the HTTP service, tables are dynamic, so the implementation of getTable is relatively simple:
public Table getTable(String tableName) { // table name can be any string
  HttpScanSpec spec = new HttpScanSpec(tableName); // will be passed to getPhysicalScan
  return new DynamicDrillTable(plugin, schemaName, null, spec);
}
Here, HttpScanSpec saves some of the query parameters. In this case it saves the table name, which is the query for the HTTP service, such as /e/api:search?q=avi&p=2. It is handed over as JSONOptions to AbstractStoragePlugin.getPhysicalScan:
public AbstractGroupScan getPhysicalScan(String userName, JSONOptions selection) throws IOException {
  HttpScanSpec spec = selection.getListWith(new ObjectMapper(), new TypeReference<HttpScanSpec>() {});
  return new HttpGroupScan(userName, httpConfig, spec);
}
HttpGroupScan will come up again later.
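Since HttpScanSpec travels through the plan as JSON, it must be serializable. A minimal sketch, with field and property names of my own choosing:

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;

// A JSON-serializable holder for the table name, i.e. the HTTP query.
public class HttpScanSpec {
  private final String tableName;

  @JsonCreator
  public HttpScanSpec(@JsonProperty("tableName") String tableName) {
    this.tableName = tableName;
  }

  public String getTableName() {
    return tableName;
  }
}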
AbstractRecordReader
AbstractRecordReader is responsible for actually reading the data and returning it to Drill. BatchCreator is used to create the AbstractRecordReader.
public class HttpScanBatchCreator implements BatchCreator<HttpSubScan> {
  @Override
  public CloseableRecordBatch getBatch(FragmentContext context, HttpSubScan config,
      List<RecordBatch> children) throws ExecutionSetupException {
    List<RecordReader> readers = Lists.newArrayList();
    readers.add(new HttpRecordReader(context, config));
    return new ScanBatch(config, context, readers.iterator());
  }
}
Since AbstractRecordReader is responsible for actually reading the data, it must know the query passed to the HTTP service. That query was first stored in HttpScanSpec and then handed to HttpGroupScan, so HttpGroupScan in turn passes the parameter information on to HttpSubScan.
Drill automatically scans for implementation classes of BatchCreator, so you don't need to worry about where HttpScanBatchCreator gets instantiated.
The implementation of HttpSubScan is relatively simple; it mainly stores the HttpScanSpec:
public class HttpSubScan extends AbstractBase implements SubScan // SubScan is required
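A sketch of the data-carrying part of HttpSubScan, assuming Jackson serialization as in the mongo plugin; the property names are my own choice, and the AbstractBase boilerplate (accept, getNewWithChildren, iterator, ...) is omitted.

// imports omitted; package paths differ across Drill versions
@JsonTypeName("http-sub-scan")
public class HttpSubScan extends AbstractBase implements SubScan {
  private final HttpStoragePluginConfig config;
  private final HttpScanSpec scanSpec;

  @JsonCreator
  public HttpSubScan(@JsonProperty("config") HttpStoragePluginConfig config,
                     @JsonProperty("scanSpec") HttpScanSpec scanSpec) {
    this.config = config;
    this.scanSpec = scanSpec;
  }

  public HttpStoragePluginConfig getConfig() { return config; }
  public HttpScanSpec getScanSpec() { return scanSpec; }

  // ... AbstractBase / SubScan methods omitted ...
}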
Back to HttpGroupScan. Its required interfaces include:
public SubScan getSpecificScan(int minorFragmentId) {
  // will be passed to the HttpScanBatchCreator.getBatch interface
  return new HttpSubScan(config, scanSpec);
}
The final query is passed to HttpRecordReader. The interfaces to implement in this class are setup and next, which work somewhat like an iterator: fetch the data in setup, and convert it into Drill's format in next. VectorContainerWriter and JsonReader can be used for the conversion. This is the legendary vectorized data format in Drill, that is, columnar storage.
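A rough skeleton of how this can look. The exact setup/next signatures vary across Drill versions, and the HTTP fetch itself is only sketched in comments:

// imports omitted; package paths differ across Drill versions
public class HttpRecordReader extends AbstractRecordReader {
  private final FragmentContext context;
  private final HttpSubScan subScan;
  private VectorContainerWriter writer;
  private JsonReader jsonReader;

  public HttpRecordReader(FragmentContext context, HttpSubScan subScan) {
    this.context = context;
    this.subScan = subScan;
  }

  @Override
  public void setup(OutputMutator output) throws ExecutionSetupException {
    writer = new VectorContainerWriter(output);
    // Issue the GET request for subScan.getScanSpec() here and keep the parsed
    // JSON records (e.g. the array under the configured resultKey).
  }

  @Override
  public int next() {
    writer.allocate();
    writer.reset();
    int count = 0;
    // For each remaining JSON record: writer.setPosition(count), let jsonReader
    // write the record into the value vectors, then count++.
    writer.setValueCount(count);
    return count; // returning 0 tells Drill this reader is exhausted
  }

  @Override
  public void close() {
    // release the HTTP response / buffers here if needed
  }
}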
Summary
The sections above cover the creation of the plugin and how the query travels through it. The columns in a query like select title will be passed to the HttpGroupScan.clone interface, but I did not concern myself with them here. Once all this is done, you can query data in an HTTP service through Drill.
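If you did want the projected columns, HttpGroupScan.clone is where they arrive; a minimal sketch, with field names assumed:

@Override
public GroupScan clone(List<SchemaPath> columns) {
  // columns from e.g. "select name, length ..." arrive here; this demo ignores them
  HttpGroupScan scan = new HttpGroupScan(getUserName(), config, scanSpec);
  scan.columns = columns;
  return scan;
}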
For select * from xx where xx, Drill itself filters the data after it has been read. To push the filter down to the data source instead, the way the mongo plugin constructs a mongodb filter, you need to implement StoragePluginOptimizerRule.
The HTTP storage plugin implemented here assumes that the query passed to the HTTP service may be constructed dynamically, for example:
select name, length from http.`/e/api:search` where $p=2 and $q='avi'    # p=2&q=avi is built dynamically; the values come from the where clause
select name, length from http.`/e/api:search?q=avi&p=2` where length > 0  # static
The first query requires StoragePluginOptimizerRule, which would collect all the filters in the where clause and eventually turn them into the query for the HTTP service. The implementation here, however, is not complete.
In general, extending Drill is difficult because the project is still relatively new, especially when it comes to Plan optimization.