Druid is an open-source, distributed, column-store system that is especially well suited to real-time statistical analysis of big data, and it offers good stability (high availability). It is relatively lightweight, well documented, and easy to get started with.
Druid vs. other systems
Druid vs. Impala/Shark
The comparison between Druid and Impala/Shark basically boils down to what kind of system you need to design.
Druid is designed for:
- An always-on service
- Real-time data ingestion
- Handling slice-and-dice style ad-hoc queries
Query speed is different:
- Druid uses column storage, and the data is compressed into an index structure. Compression increases how much data fits in RAM, so more data can be served from fast memory. The index structure also means that when a filter is added to a query, Druid scans less data and answers faster (see the sketch after this list).
- Impala/Shark can be thought of as a daemon caching layer on top of HDFS; they do not go beyond caching to fundamentally improve query speed.
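To make the index argument concrete, here is a minimal, illustrative Python sketch (not Druid's actual code) of how a bitmap index turns a filter into cheap bitwise operations; Druid additionally compresses these bitmaps with the Concise algorithm, as noted in the summary at the end.

```python
# Illustrative only: a toy bitmap index for one column, not Druid's implementation.
# Each distinct value maps to a bitmap (here a Python int used as a bitset)
# with one bit per row.

def build_bitmap_index(column):
    """Map each distinct value to a bitset of the rows containing it."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index

city = ["beijing", "shanghai", "beijing", "shenzhen", "beijing"]
os_ = ["ios", "android", "android", "ios", "android"]

city_idx = build_bitmap_index(city)
os_idx = build_bitmap_index(os_)

# Filter: city = 'beijing' AND os = 'android' is a single bitwise AND,
# touching no row data at all.
matches = city_idx["beijing"] & os_idx["android"]
rows = [r for r in range(len(city)) if matches >> r & 1]
print(rows)  # -> [2, 4]
```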
Data ingestion is different:
- Druid can ingest data in real time.
- Impala/Shark rely on HDFS or another backing store, which limits how quickly new data becomes available.
The form of the query is different:
- Druid supports timeseries and groupBy-style queries, but not joins.
- Impala/Shark support SQL-style queries.
Druid vs Elasticsearch
Elasticsearch (ES) is a search server based on Apache Lucene. It provides full-text search and access to raw event-level data, and it also supports analysis and aggregation. According to research, ES consumes more resources than Druid for data ingestion and aggregation.
Druid focuses on OLAP workflows. It is optimized for high performance (fast aggregation and ingestion) at lower cost, and supports a wide range of analytical operations. Druid provides some basic search support for structured event data.
Segment: the basic unit of data in Druid is called a segment, which Druid generates from raw data via bitmap indexing (in batch or in real time). Segments are what guarantee query speed. You can choose the data granularity each segment corresponds to; in our ad-traffic application the minimum query granularity is one day, so each day's data is built into one segment. Note that segments are immutable: if data needs to change, you must modify the raw data and rebuild the segment.
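As a hedged illustration, a daily segment like the one described above is typically requested through the granularitySpec portion of an ingestion spec. The field names below follow the Druid documentation, but the exact layout may vary across Druid versions, and the interval is made up.

```python
# Sketch of the granularitySpec part of a Druid ingestion spec, written as a
# Python dict mirroring the JSON. With segmentGranularity "DAY", each day of
# raw data is built into its own immutable segment.
granularity_spec = {
    "type": "uniform",
    "segmentGranularity": "DAY",   # one segment per day of data
    "queryGranularity": "NONE",    # keep full timestamp precision inside the segment
    "intervals": ["2015-09-01/2015-09-08"],  # hypothetical time range to index
}
```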
Architecture
Druid itself consists of 5 kinds of components: broker nodes, historical nodes, realtime nodes, coordinator nodes, and indexing services. Their roles are as follows:
- Broker nodes: respond to external query requests; using ZooKeeper, they split each request by segment across the relevant historical and realtime nodes, then merge the partial results and return them to the caller;
- Historical nodes: store and serve queries over "historical" segments. They load segments from deep storage and answer requests from broker nodes. Historical nodes keep a local copy of the segments they serve, so even if deep storage becomes inaccessible, they can still answer queries for the segments they have already synced;
- Realtime nodes: store and serve hot data, and periodically build that data into segments that are handed off to historical nodes. An external dependency such as Kafka is generally used to improve the availability of realtime data ingestion. If you do not need to ingest data into the cluster in real time, you can drop the realtime nodes and only batch-ingest data into deep storage;
- Coordinator nodes: can be considered the master of Druid; they manage historical and realtime nodes through ZooKeeper and manage segment metadata through MySQL;
- Indexing services: the usual path for importing data into Druid; both batch and streaming data can be imported by sending a request to the indexing service, as in the sketch below.
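For example, a batch ingestion task can be submitted to the indexing service with a plain HTTP POST. The sketch below assumes an indexing-service (overlord) endpoint of http://overlord:8090/druid/indexer/v1/task, which matches the Druid documentation but should be checked against your deployment; the task spec is heavily abbreviated and the data source name is made up.

```python
import json
import urllib.request

# Minimal "index" task spec; a real spec also describes the input location,
# parser/timestamp, dimensions, and aggregators (elided here).
task = {
    "type": "index",
    "spec": {
        "dataSchema": {"dataSource": "ad_traffic"},  # hypothetical data source
        # ... ioConfig and tuningConfig would go here ...
    },
}

# Assumed overlord address; adjust to your cluster.
req = urllib.request.Request(
    "http://overlord:8090/druid/indexer/v1/task",
    data=json.dumps(task).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())  # e.g. {"task":"<task id>"}
```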
Druid also has 3 external dependencies:
- MySQL: stores various Druid metadata (the data inside is created and inserted by Druid itself) in 3 tables: "druid_config" (usually empty), "druid_rules" (rule information used by coordinator nodes, such as where segments should be loaded), and "druid_segments" (metadata for each segment);
- Deep storage: stores the segments; Druid currently supports local disk, NFS-mounted disks, HDFS, S3, and so on. Deep storage data has 2 sources: batch ingestion and realtime nodes;
- ZooKeeper: Used by Druid to manage the state of the current cluster, such as recording which segments moved from real-time nodes to historical nodes;
Queries
Druid is queried, per the official Druid documentation, by sending an HTTP POST request to a broker node (or directly to a historical or realtime node). The query is described by a JSON body, and the response is likewise JSON. Druid queries come in the following 4 types (a request sketch follows this list):
- Time Boundary Queries: return the time span (earliest and latest timestamps) of the available data;
- GroupBy Queries: Druid's most typical query form, very similar to a MySQL GROUP BY query. The elements of the query body can be understood as follows:
  - "aggregations": corresponds to the MySQL "SELECT xx FROM" part, i.e., which aggregated column results you want;
  - "dimensions": corresponds to MySQL "GROUP BY xx", i.e., which columns to aggregate over;
  - "filter": corresponds to the MySQL "WHERE xx" condition, i.e., the filter condition;
  - "granularity": the time granularity of the aggregation;
- Timeseries queries: aggregate over the rows matching the filter condition; unlike groupBy queries, they do not group by any dimension columns, so they are more efficient;
- TopN queries: return the top N values of a column, sorted by some metric;
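Putting the elements above together, here is a hedged sketch of a groupBy query POSTed to a broker. The query fields follow the Druid documentation, while the data source, dimension, and metric names (ad_traffic, city, os, clicks) and the broker address are made up for illustration.

```python
import json
import urllib.request

# groupBy query: roughly SELECT city, SUM(clicks) FROM ad_traffic
# WHERE os = 'android' GROUP BY city, bucketed by day.
query = {
    "queryType": "groupBy",
    "dataSource": "ad_traffic",      # hypothetical data source
    "granularity": "day",
    "dimensions": ["city"],          # GROUP BY city
    "aggregations": [
        {"type": "longSum", "name": "clicks", "fieldName": "clicks"}
    ],
    "filter": {"type": "selector", "dimension": "os", "value": "android"},
    "intervals": ["2015-09-01/2015-09-08"],
}

# Assumed broker address; adjust to your cluster.
req = urllib.request.Request(
    "http://broker:8082/druid/v2/?pretty",
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())  # JSON array of result rows
```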
Summary of this article
- Druid is an open-source, distributed, column-store system for real-time data analysis, with detailed documentation and a low barrier to entry;
- Druid's design fully accounts for high availability: any kind of node going down will not stop Druid from working (although cluster state will stop updating);
- The components of Druid are loosely coupled; if streaming data ingestion is not required, the realtime nodes can be dropped;
- Druid's data unit, the segment, is immutable; our practice is to generate new segments to replace existing ones;
- Druid uses bitmap indexes to speed up column-store queries, and compresses them with an algorithm called Concise, making the generated segments much smaller than the original raw text files;
- In our application scenario (10 machines in total, about 100 columns, on the order of a billion rows), the average query latency is under 2 seconds, roughly 1/100 to 1/10 that of a MySQL cluster with the same number of machines;
- Some "limitations" of Druid:
  - The immutability of segments simplifies Druid's implementation, but if the data must be modified, the segment has to be rebuilt, and bitmap indexing is time-consuming;
  - Druid only accepts data in relatively simple formats; for example, it cannot handle nested structures.