1 Historical Node
A historical node's sole responsibility is to load and serve the segments of historical data that fall within Druid's non-real-time window and satisfy the load rules. Each historical node keeps itself in sync through ZooKeeper and does not communicate directly with other node types or with other historical nodes.
As described in the previous section, coordinator nodes periodically (by default every minute) synchronize with the metadata repository, detect newly generated segments, and write the load information for those segments into the load queue directory of a historical node in ZooKeeper. When the historical node senses that a new segment needs to be loaded, it first checks its local disk directory to see whether the segment has already been downloaded; if not, it downloads the segment according to the metadata recorded in ZooKeeper, which describes where the segment is stored in deep storage, how to decompress it, and how to process it. The historical node uses memory-file mapping to load the xxxxx.smoosh files inside index.zip into memory and announces, under its served-segments directory in ZooKeeper, that the segment has been loaded, at which point the segment can be queried. A historical node that is restarted likewise scans its local storage path after booting and memory-maps every segment it finds so that those segments can be queried.
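The flow just described can be summarized with a minimal Python sketch. It is illustrative only, since Druid's real implementation is in Java; the cache directory, the load-spec fields, and the ZooKeeper announcement helper are assumptions made for the example.

```python
# Minimal sketch of a historical node's segment load flow (illustrative only).
import mmap
import os
import shutil
import zipfile

SEGMENT_CACHE = "/tmp/druid/segment-cache"   # hypothetical local cache directory

def load_segment(segment_id: str, load_spec: dict):
    local_dir = os.path.join(SEGMENT_CACHE, segment_id)
    if not os.path.isdir(local_dir):
        # Segment is not on local disk yet: fetch index.zip from deep storage
        # according to the load spec and unpack it.
        os.makedirs(local_dir, exist_ok=True)
        index_zip = os.path.join(local_dir, "index.zip")
        shutil.copy(load_spec["path"], index_zip)        # stand-in for the download step
        with zipfile.ZipFile(index_zip) as zf:
            zf.extractall(local_dir)

    # Memory-map the smoosh file(s) so the segment can be served without
    # reading the whole file into heap memory.
    mapped = []
    for name in os.listdir(local_dir):
        if name.endswith(".smoosh"):
            f = open(os.path.join(local_dir, name), "rb")
            mapped.append(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ))

    announce_in_zookeeper(segment_id)    # placeholder for the ZooKeeper announcement
    return mapped

def announce_in_zookeeper(segment_id: str):
    print(f"announced {segment_id} under the served-segments path")
```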
2 Broker Node
The broker node is the gateway for all queries to the cluster and acts as the query router. From ZooKeeper it learns the metadata of every published segment in the cluster, that is, which storage nodes each segment lives on, and it builds a timeline for each datasource that describes, in chronological order, where each segment is stored. Since every query request carries a datasource and an interval, the broker node uses these two pieces of information to look up in the timeline the storage nodes holding all segments that match the query, and forwards the query request to those nodes.
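The following toy sketch shows the idea of the timeline lookup: given a datasource and a query interval, pick the segments that overlap the interval and the nodes that serve them. The timeline data structure and the names used here are assumptions for illustration, not Druid's internal representation.

```python
# Toy sketch of the broker's timeline lookup.
from datetime import datetime

# timeline[datasource] = list of (segment_start, segment_end, segment_id, node)
timeline = {
    "wikipedia": [
        (datetime(2016, 1, 1), datetime(2016, 1, 2), "wiki_2016-01-01", "historical-1"),
        (datetime(2016, 1, 2), datetime(2016, 1, 3), "wiki_2016-01-02", "historical-2"),
    ]
}

def route_query(datasource: str, start: datetime, end: datetime):
    """Return {node: [segment ids]} for all segments overlapping [start, end)."""
    plan = {}
    for seg_start, seg_end, seg_id, node in timeline.get(datasource, []):
        if seg_start < end and seg_end > start:          # interval overlap test
            plan.setdefault(node, []).append(seg_id)
    return plan

print(route_query("wikipedia", datetime(2016, 1, 1), datetime(2016, 1, 3)))
```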
For the data returned by each node, the broker node applies an LRU cache policy by default, and Druid can use memcached so that multiple broker nodes in the cluster share the cache. Results returned by historical nodes are considered "trustworthy" and are cached, whereas the data returned by real-time nodes lies within the real-time window and is still mutable, so it is considered "untrusted" and is not cached. For each query request, the broker node therefore consults its local cache first; only for the parts not found in the cache does it look up the timeline and send query requests to the appropriate nodes.
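A simplified sketch of that per-segment caching decision follows. The plain dict stands in for the LRU/memcached cache, and the cache key is simplified; in Druid the key also incorporates the query itself.

```python
# Simplified sketch: cache results from historical nodes, never from real-time nodes.
cache = {}

def query_segment(segment_id: str, node_type: str, run_query):
    key = segment_id                      # real keys would also include the query
    if key in cache:
        return cache[key]                 # cache hit: no request to the node
    result = run_query(segment_id)
    if node_type == "historical":         # immutable, "trustworthy" -> cache it
        cache[key] = result
    return result                         # real-time results are returned uncached
```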
3 Coordinator Node
The coordinator node is primarily responsible for managing and distributing segments in the Druid cluster, including loading new segments, dropping segments that no longer satisfy the rules, managing segment replicas, and balancing segment load. If more than one coordinator node exists in the cluster, a leader is chosen by an election algorithm and the other instances act as backups.
The coordinator periodically (by default every minute) synchronizes the data topology of the whole cluster from ZooKeeper, the list of all valid segments from the metadata repository, and the rule library, and uses them to decide what to do next. For a valid but unassigned segment, the coordinator node first sorts the historical nodes in reverse order of used capacity, so that the node holding the least data has the highest priority, and then assigns the new segment to the highest-priority historical node. As described in section 3.3.4.1, the coordinator node does not talk to historical nodes directly; instead it creates temporary information for the segment under the historical node's load queue directory in ZooKeeper and waits for the historical node to load the segment.
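A toy version of that assignment policy is sketched below: sort historical nodes by how much data they already hold (smallest first) and hand the new segment to the least-loaded node by appending an entry to its load queue. The load-queue write is simulated with a dict; in Druid it is a znode under the node's load queue path in ZooKeeper.

```python
# Toy capacity-ordered segment assignment (illustrative only).
def assign_segment(segment_id: str, segment_size: int, historicals: dict, load_queues: dict):
    """historicals maps node name -> bytes currently loaded on that node."""
    target = min(historicals, key=historicals.get)       # least-loaded node wins
    load_queues.setdefault(target, []).append(segment_id)
    historicals[target] += segment_size
    return target

historicals = {"historical-1": 500, "historical-2": 120, "historical-3": 300}
load_queues = {}
print(assign_segment("seg_2016-01-04", 50, historicals, load_queues))   # -> historical-2
```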
After each start, the coordinator compares the current data topology stored in ZooKeeper with the data stored in the metadata repository. Any segment that is loaded in the cluster but is marked invalid in, or missing from, the metadata repository is recorded by the coordinator node in a remove list. This list also covers the case, described in section 3.3.3, of new and old versions of the same segment: the old-version segments are placed in the remove list as well and are eventually logically dropped.
For a historical node that goes offline, the coordinator node marks all segments on that node as invalid and notifies the other historical nodes in the cluster to load those segments. In a production environment, however, we often encounter machines that go offline only temporarily, with the historical node restoring service within a short time, so this blunt strategy would needlessly aggravate the network load across the whole cluster. For this scenario, the coordinator keeps a lifetime for every dropped segment in the cluster, which is the maximum time the coordinator node will wait, after a segment is marked as dropped, before reassigning it. If the historical node comes back online within that time, its segments are marked valid again; if the time is exceeded, the segments are reassigned to other historical nodes according to the load rules.
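The lifetime rule can be sketched as follows: a segment dropped because its historical node went offline is only reassigned after a grace period, so a node that comes back quickly does not trigger a cluster-wide reshuffle. The grace period value and data structure are illustrative assumptions.

```python
# Sketch of the "lifetime" grace period before reassigning dropped segments.
import time

DROP_GRACE_SECONDS = 15 * 60           # assumed grace period for the example

dropped_at = {}                         # segment_id -> time it was marked dropped

def mark_dropped(segment_id: str):
    dropped_at[segment_id] = time.time()

def segments_to_reassign():
    """Segments whose owner did not come back within the grace period."""
    now = time.time()
    return [s for s, t in dropped_at.items() if now - t > DROP_GRACE_SECONDS]
```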
Consider the most extreme case, in which every coordinator node in the cluster stops serving. The whole cluster remains available to the outside, but new segments will not be loaded and outdated segments will not be dropped; in other words, the data topology of the cluster stays exactly as it is until a new coordinator node service comes online.
4 Indexing Service
The Indexing Service is a highly available, distributed, master/slave-architecture service responsible for "producing" segments. It consists of three main components: the peon, which runs an indexing task; the MiddleManager, which controls peons; and the Overlord, which distributes tasks to MiddleManagers. The relationship between the three can be summarized as: the Overlord is the master of the MiddleManagers, and a MiddleManager is the master of its peons. The Overlord and the MiddleManagers can be deployed on different machines, but by default a peon runs on the same machine as its MiddleManager. Figure 3.5 shows the overall architecture of the Indexing Service.
Overlord
The Overlord is responsible for accepting tasks, coordinating task assignment, creating task locks, and collecting and returning task run status to callers. When there are multiple Overlords in the cluster, a leader is chosen by an election algorithm and the other instances act as backups.
The Overlord can run in local (the default) or remote mode. In local mode the Overlord is also responsible for creating and running peons; in remote mode the Overlord and the MiddleManagers each perform their own roles. As shown in Figure 3.6, the Overlord accepts indexing tasks generated by real-time or batch data streams and registers the task information under the task directory in ZooKeeper for all online MiddleManagers; the MiddleManagers sense the newly assigned tasks, and the peons periodically synchronize the status of each indexing task to the /status directory in ZooKeeper, so that the Overlord can perceive the health of all currently running indexing tasks.
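Besides ZooKeeper, callers interact with the Overlord over HTTP. The sketch below submits a task spec and polls its status using the Overlord task API paths (`/druid/indexer/v1/task` and `/druid/indexer/v1/task/<id>/status`); the host and port are placeholders, and the endpoints should be verified against the Druid version in use.

```python
# Hedged sketch of submitting a task to the Overlord and polling its status.
import requests

OVERLORD = "http://overlord-host:8090"          # placeholder address

def submit_task(task_spec: dict) -> str:
    resp = requests.post(f"{OVERLORD}/druid/indexer/v1/task", json=task_spec)
    resp.raise_for_status()
    return resp.json()["task"]                  # task id echoed back by the Overlord

def task_status(task_id: str) -> dict:
    resp = requests.get(f"{OVERLORD}/druid/indexer/v1/task/{task_id}/status")
    resp.raise_for_status()
    return resp.json()
```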
The Overlord also provides a web console: by visiting http://<OVERLORD_IP>:<PORT>/console.html we can observe all indexing tasks currently running in the cluster, the available peons, and the recently completed indexing tasks, whether successful or failed.
MiddleManager
A MiddleManager is responsible for receiving the indexing tasks assigned by the Overlord and for creating a new process to start a peon for each indexing task; each MiddleManager can run multiple peon instances.
On a machine running a MiddleManager instance, we can observe directories whose names start with xxx_index_xxx under the ${java.io.tmpdir} directory, each corresponding to one peon instance. At the same time, information about all currently running indexing tasks is kept in a restore.json file, which on the one hand records task status and on the other hand allows the indexing tasks to be restarted from this file if the MiddleManager crashes.
Peon
A peon is the smallest unit of work in the Indexing Service and the concrete executor of an indexing task; all currently running peon tasks can be viewed through the web console provided by the Overlord.
5 Real-time Node
Real-time nodes are primarily responsible for real-time data ingestion and for generating segment files. There are two data processing modes: stream push and stream pull.
Stream Pull
Real-time nodes consume real-time data through a firehose. The firehose is Druid's abstraction for real-time data consumption and can have different implementations; Druid ships with one that consumes data from Kafka through the Kafka high-level consumer API (the druid-kafka-eight firehose). Besides the firehose, another important role on the real-time node is the plumber, which is responsible for merging data files according to the specified period.
If, in stream pull mode, Druid is to pull data from an external data source autonomously and generate Indexing Service tasks, we need to set up real-time nodes. A real-time node mainly contains two "factories": the firehose (literally "water hose" in the Chinese translation, which vividly describes its responsibility), which connects to the streaming data source and handles data ingestion; and the plumber (literally "porter" in the Chinese translation, likewise a graphic description of its duty), which handles segment publishing and hand-off. In the Druid source code both components are abstract factory methods, and users can create different types of firehose or plumber according to their needs. To me, the firehose and plumber feel similar to the Kafka Connect framework released with Kafka 0.9.0: the firehose is like a Kafka Connect source, defining the entry point of the data without caring what type of data source is being connected, while the plumber is like a Kafka Connect sink, defining the exit of the data without caring where the final output goes.
The following discusses the principle, structure, and some problems of druid-kafka-eight.
When druid-kafka-eight is used to consume data from Kafka, the firehose makes real-time nodes highly scalable. When multiple real-time nodes are started, data is fetched from Kafka through a Kafka consumer group, which maintains each node's offsets through ZooKeeper; whether nodes are added or removed, the high-level consumer API ensures that the Kafka data is consumed by the Druid cluster at least once. For a detailed description of Kafka consumer groups, see: http://blog.csdn.net/eric_sunah/article/details/44243077
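To make the consumer-group behaviour concrete, here is a minimal kafka-python example (not Druid code): all members sharing the same group_id split the topic's partitions between them, and offsets are tracked per group, so adding or removing a consumer triggers a rebalance rather than data loss. Note that modern clients keep offsets in Kafka itself, whereas the old high-level consumer described above kept them in ZooKeeper; topic and broker names below are placeholders.

```python
# Minimal kafka-python consumer-group example (pip install kafka-python).
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "druid-ingest-topic",                      # placeholder topic name
    bootstrap_servers=["kafka-host:9092"],     # placeholder broker address
    group_id="druid-realtime-nodes",           # all real-time nodes share this group
    auto_offset_reset="earliest",
    enable_auto_commit=True,                   # at-least-once delivery semantics
)

for message in consumer:
    print(message.partition, message.offset, message.value[:80])
```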
The high availability mechanism implemented by Druid-kafka-eight can be represented by the following diagram:
Although druid-kafka-eight guarantees high availability, careful analysis reveals a defect: the segment files generated by a failed real-time node may never be uploaded to deep storage. The problem can be resolved in two ways: restart the real-time node, or use Tranquility together with the Indexing Service for exactly-once consumption and backup of the Kafka data. Because Tranquility pushes data into the Druid cluster, it can create multiple replica tasks for the same partition of data at once, and when one data-consuming task fails, the system can use the segment created by an identical task instead. More information: http://blog.csdn.net/eric_sunah/article/details/44243077
Stream Push
If the stream push strategy is used, we need to set up a "copy service" that pulls data from the data source and generates Indexing Service tasks, thereby "pushing" the data into Druid; we used this pattern before druid_0.9.1. This mode requires the external service Tranquility. Tranquility components can connect to a variety of streaming data sources, such as Spark Streaming, Storm, and Kafka, which is why external components such as Tranquility-storm and Tranquility-kafka have appeared. The principle and use of Tranquility-kafka are described in detail in section 3.4.
6 External Dependencies
The Druid cluster relies on some external components, although this is less a dependency than a reflection of Druid's open architecture, so users can choose different external components according to their own needs.
Deep Storage
Druid currently supports saving segments and indexing task logs on the local disk (stand-alone mode), NFS-mounted disks, HDFS, Amazon S3, and other storage systems.
Zookeeper
Druid uses ZooKeeper as the communication component within the distributed cluster. The various nodes register their instances and services in ZooKeeper through the Curator framework, and information that needs to be shared across the cluster is also stored under ZooKeeper directories. This simplifies otherwise complex logic such as automatic connection management within the cluster, leader election, distributed locks, path caches, and distributed queues.
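The two basic patterns mentioned here (announcing a node and watching a shared path) are sketched below using the Python kazoo client rather than the Java Curator framework Druid actually uses; the paths and host address are placeholders for illustration.

```python
# Sketch of ZooKeeper announcement and watching with kazoo (pip install kazoo).
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk-host:2181")   # placeholder ZooKeeper address
zk.start()

# Announce this node with an ephemeral znode: it disappears automatically if
# the session dies, which is how the rest of the cluster learns a peer is gone.
zk.ensure_path("/druid/announcements")
zk.create("/druid/announcements/historical-1", b"host=historical-1", ephemeral=True)

# Watch a shared directory (here, a hypothetical load queue path) for changes.
zk.ensure_path("/druid/loadQueue/historical-1")

@zk.ChildrenWatch("/druid/loadQueue/historical-1")
def on_queue_change(children):
    print("segments waiting to be loaded:", children)
```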
Metadata Storage
Druid cluster metadata is stored in MySQL or PostgreSQL; the standalone version uses Derby. In the druid_0.9.1.1 version, the metadata repository consists of 10 tables, all prefixed with "druid_", as shown in Figure 3.7.
7 Loading Data
For loading external data, Druid supports two modes: real-time ingestion and batch ingestion.
Real-time ingestion
In the real-time ingestion process, data can be produced by stream-processing frameworks such as Apache Storm or Apache Spark Streaming, carried through message-bus components such as Apache Kafka, ActiveMQ, or RabbitMQ, and then consumed in stream pull or stream push mode to generate Indexing Service tasks, with the results finally stored in Druid.
Batch Ingestion
The batch ingestion mode can use structured data as its source, such as files in JSON, Avro, or Parquet format; internally, Druid uses the MapReduce batch framework to import the data.
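For concreteness, here is an abbreviated sketch of a Hadoop/MapReduce batch ingestion task spec expressed as a Python dict. The field names follow the general shape of 0.9.x-era batch specs, but the spec is trimmed and the datasource, paths, and granularities are placeholders; treat it as an illustration rather than a complete, version-accurate spec.

```python
# Abbreviated batch (index_hadoop) task spec, for illustration only.
import json

batch_task = {
    "type": "index_hadoop",
    "spec": {
        "dataSchema": {
            "dataSource": "wikipedia",
            "granularitySpec": {"segmentGranularity": "DAY", "queryGranularity": "NONE"},
        },
        "ioConfig": {
            "type": "hadoop",
            "inputSpec": {"type": "static", "paths": "hdfs:///data/wikipedia/2016-01-01.json"},
        },
        "tuningConfig": {"type": "hadoop"},
    },
}

print(json.dumps(batch_task, indent=2))   # this JSON is what gets submitted to the Overlord
```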
8 High Availability
Druid's high availability can be summarized in the following points:
Historical Node
If a historical node stays offline longer than a certain threshold, the coordinator node reassigns the segments loaded on that node to other online historical nodes, as described in section 3.3.4.1, ensuring that all segments satisfying the load rules are not lost and remain queryable.
Coordinator Node
The cluster can be configured with multiple coordinator node instances working in a leader/follower fashion: an election algorithm chooses the leader and the other instances act as backups. When the leader goes down, another follower can take over quickly.
Even when all coordinator nodes fail, the whole cluster remains available to the outside, but new segments will not be loaded and outdated segments will not be dropped; that is, the data topology of the cluster stays as it is until a new coordinator node service comes online.
Broker Node
Broker nodes are deployed for high availability in the same way as coordinator nodes.
Indexing Service
Druid can be configured with multiple Indexing Service task replicas for the same segment to ensure data integrity.
Real-time
The data integrity of the real-time pipeline is determined mainly by the semantics of the real-time stream ingestion being used. Before the 0.9.1.1 release we used the Tranquility-kafka component to ingest real-time data. Because of the time window, data inside the window is submitted to the firehose while data outside the window is discarded, so if Tranquility-kafka goes offline even briefly, some Kafka data "expires" past the window and data integrity is not guaranteed. This "copy service" mode of use also consumes a lot of CPU and memory and does not provide atomic operations, so after 0.9.1.1 we switched to Druid's new Kafka Indexing Service feature, in which Druid manages Kafka partitions and offsets itself through the Kafka consumer API to guarantee exactly-once semantics and protect data integrity as far as possible. Even so, we have still encountered data loss problems in practice.
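For reference, here is an abbreviated, hedged sketch of a Kafka Indexing Service supervisor spec as a Python dict. The field names follow the general shape documented for the Kafka indexing service, but the spec is trimmed, the datasource, topic, and broker address are placeholders, and the exact fields should be checked against the Druid version in use.

```python
# Abbreviated Kafka Indexing Service supervisor spec, for illustration only.
import json

supervisor_spec = {
    "type": "kafka",
    "dataSchema": {"dataSource": "events"},
    "ioConfig": {
        "topic": "events-topic",
        "consumerProperties": {"bootstrap.servers": "kafka-host:9092"},
        "taskCount": 2,        # parallel ingestion tasks
        "replicas": 2,         # replica tasks, for availability
        "taskDuration": "PT1H",
    },
}

# The spec is submitted to the Overlord's supervisor endpoint, e.g.
# http://overlord-host:8090/druid/indexer/v1/supervisor
print(json.dumps(supervisor_spec, indent=2))
```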
Metadata Storage
If the metadata storage fails, the coordinator can no longer perceive newly generated segments and the data topology of the whole cluster stops changing, but access to the existing data is not affected.
Zookeeper
If ZooKeeper fails, the data topology of the whole cluster does not change, and thanks to the broker node cache, the data already in the cache can still be queried.
9 Data Tiering
Druid's access control strategy uses data tiering (tiers) for the following two purposes:
Divide the historical nodes into different groups, so that users with different permissions (priorities) in the cluster access different groups at query time.
By dividing tiers, historical nodes can be made to load data for different time ranges; for example, tier_1 loads the 2016 Q1 data, tier_2 loads the 2016 Q2 data, tier_3 loads the 2016 Q3 data, and so on. Requests are then sent to the historical nodes of the corresponding tier according to each user's query, which not only controls users' access requests but also speeds up queries by reducing the number of historical nodes that have to respond. A sketch of such tiered load rules follows.
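The sketch below expresses the quarter-per-tier example as the kind of interval-based load rules the coordinator accepts. The field names follow the 0.9.x-era rule format and the intervals, tier names, and replica counts are illustrative; verify the rule schema and the coordinator endpoint against the Druid version in use.

```python
# Hedged sketch of interval-based load rules pinning each 2016 quarter to a tier.
import json

rules = [
    {"type": "loadByInterval", "interval": "2016-01-01/2016-04-01",
     "tieredReplicants": {"tier_1": 2}},
    {"type": "loadByInterval", "interval": "2016-04-01/2016-07-01",
     "tieredReplicants": {"tier_2": 2}},
    {"type": "loadByInterval", "interval": "2016-07-01/2016-10-01",
     "tieredReplicants": {"tier_3": 2}},
    {"type": "dropForever"},   # anything not matched above is not loaded onto historicals
]

# These rules would be submitted to the coordinator rules API, e.g.
# http://coordinator-host:8081/druid/coordinator/v1/rules/<dataSource>
print(json.dumps(rules, indent=2))
```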
Reference: https://blog.csdn.net/eric_sunah/article/details/78563634