Druid: An open source distributed system for real-time processing of big data

Source: Internet
Author: User

Druid is a highly fault-tolerant, high-performance, open-source distributed system for real-time querying and analysis of big data. It is designed to process large-scale data quickly and to support fast queries and analysis. In particular, Druid is designed to maintain 100% uptime during code deployments, machine failures, and other production incidents. Druid was created primarily to address query latency: attempts to use Hadoop for interactive query analysis could not meet the needs of real-time analytics. Druid instead provides interactive access to data, balancing query flexibility against performance, and uses a specialized storage format.

Functionally, Druid sits between PowerDrill and Dremel: it implements almost all of Dremel's functionality and borrows some interesting data formats from PowerDrill. Like Dremel and PowerDrill, Druid supports single-table queries, and it adds new features such as a column-store format for partially nested data structures, indexes for fast filtering, real-time ingestion and querying, and a highly fault-tolerant distributed architecture. According to the official documentation, Druid has the following main characteristics:

    • Designed for analytics: Druid is built for exploratory analysis in OLAP workflows and supports a variety of filters, aggregators, and query types;
    • Fast interactive queries: Druid's low-latency data ingestion architecture allows events to be queried within milliseconds of being created;
    • Highly available: Druid's data remains available while the system is being updated, and scaling the cluster up or down does not cause data loss;
    • Scalable: existing Druid deployments handle billions of events and terabytes of data per day.
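To make the "filters, aggregators, and query types" item more concrete, here is a minimal sketch of how a Druid native timeseries query could be assembled as JSON. The data source `ad_events`, the dimension `country`, and the metric `impressions` are hypothetical names chosen for illustration; they do not come from the article. The sketch only builds and prints the query body, which a client would then send to a broker node over HTTP.

```python
import json


def build_timeseries_query(data_source, interval, dimension, value, metric):
    """Build a Druid native 'timeseries' query as a plain dict."""
    return {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": "hour",          # one result bucket per hour
        "intervals": [interval],        # ISO 8601 start/end interval
        # Filter: keep only rows where the dimension equals the given value.
        "filter": {"type": "selector", "dimension": dimension, "value": value},
        # Aggregators: count matching rows and sum a numeric metric.
        "aggregations": [
            {"type": "count", "name": "rows"},
            {"type": "longSum", "name": "total_" + metric, "fieldName": metric},
        ],
    }


# All names below are hypothetical and used only for illustration.
query = build_timeseries_query(
    data_source="ad_events",
    interval="2015-04-01T00:00:00Z/2015-04-02T00:00:00Z",
    dimension="country",
    value="US",
    metric="impressions",
)
payload = json.dumps(query, indent=2)
print(payload)
```

The query combines one filter with two aggregators over a single table, which matches the single-table query model described above.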

Druid's application scenarios resemble those of the ad analytics startup Metamarkets: ad analytics, monitoring of Internet advertising systems, and network monitoring. Druid is a good technical choice when the business has any of the following needs:

    • Large amounts of data must be aggregated interactively and explored quickly;
    • Real-time query analysis is required;
    • The data volume is large, for example hundreds of millions of new events or 10 TB of new data per day;
    • Data, especially big data, must be analyzed in real time;
    • A highly available, highly fault-tolerant, high-performance database is required.
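As a rough back-of-the-envelope check on the data volumes mentioned above, the following sketch converts daily totals into sustained per-second rates. The figure of 300 million events per day is an assumed stand-in for "hundreds of millions", and 1 TB is taken as 10^12 bytes.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(daily_total):
    """Convert a daily total into an average per-second rate."""
    return daily_total / SECONDS_PER_DAY


daily_bytes = 10 * 10**12     # 10 TB per day, taking 1 TB = 10^12 bytes
daily_events = 300_000_000    # assumed value for "hundreds of millions"

bytes_rate = per_second(daily_bytes)
events_rate = per_second(daily_events)

print(f"{bytes_rate / 10**6:.1f} MB/s sustained ingest")
print(f"{events_rate:.0f} events/s sustained")
```

Under these assumptions the cluster must absorb roughly 116 MB and about 3,500 events every second on average, before accounting for traffic peaks.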

A Druid cluster consists of several types of nodes, each designed to do one job well:

    • Historical nodes store and serve queries over non-real-time data;
    • Realtime nodes ingest data in real time and listen to input data streams;
    • Coordinator nodes monitor the historical nodes;
    • Broker nodes receive queries from external clients and forward them to the realtime and historical nodes;
    • Indexer nodes run the indexing service.
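The division of labor between broker, historical, and realtime nodes can be illustrated with a toy model. This is only a sketch of the scatter-and-merge idea, not Druid's actual implementation: the classes `ToyNode` and `ToyBroker`, the integer timestamps, and the sample events are all hypothetical.

```python
from collections import Counter


class ToyNode:
    """Holds events for a half-open time range [start, end) and answers count queries."""

    def __init__(self, name, start, end, events):
        self.name = name
        self.start = start
        self.end = end
        self.events = events  # list of (timestamp, dimension_value) pairs

    def overlaps(self, q_start, q_end):
        # True if this node's time range intersects the query interval.
        return self.start < q_end and q_start < self.end

    def count_by_value(self, q_start, q_end):
        # Partial result: per-value counts for events inside the query interval.
        return Counter(v for t, v in self.events if q_start <= t < q_end)


class ToyBroker:
    """Forwards a query to every node holding relevant data and merges the results."""

    def __init__(self, nodes):
        self.nodes = nodes

    def query(self, q_start, q_end):
        merged = Counter()
        for node in self.nodes:
            if node.overlaps(q_start, q_end):
                merged.update(node.count_by_value(q_start, q_end))
        return dict(merged)


# Older data lives on the historical node, recent data on the realtime node.
historical = ToyNode("historical", 0, 100, [(10, "US"), (50, "DE"), (90, "US")])
realtime = ToyNode("realtime", 100, 200, [(120, "US"), (150, "FR")])
broker = ToyBroker([historical, realtime])

# This query spans both nodes, so the broker merges two partial results.
print(broker.query(40, 130))
```

The client talks only to the broker and never needs to know which node holds which time range, which is the point of the role separation described above.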

[Figure: data flow between the individual nodes during query operations]

[Figure: management architecture of the Druid cluster, showing the relationships between the nodes and the external components the cluster depends on for management, such as the ZooKeeper cluster responsible for service discovery]

Druid is open source under the Apache License 2.0, and its code is hosted on GitHub. Its current stable version is 0.7.1.1. Currently, Druid has 63 code contributors and nearly 2,000 followers. Major contributors include the advertising analytics startup Metamarkets, the movie streaming site Netflix, and Yahoo. The Druid project has also published comparisons of Druid with Shark, Vertica, Cassandra, Hadoop, Spark, Elasticsearch, and others in terms of fault tolerance, flexibility, query performance, and other aspects. For more information about Druid, see the official introductory tutorials, white paper, design documentation, and more.

Excerpted from http://www.infoq.com/cn/news/2015/04/druid-data/
