Druid.io Series 1: Background Introduction

Source: Internet
Author: User

Druid.io (hereinafter referred to as Druid) is an OLAP storage system for real-time querying and analysis of massive data sets. Its four key features are summarized as follows:

Sub-second OLAP query analysis. Druid uses key technologies such as columnar storage, inverted indexes, and bitmap indexes to filter, aggregate, and perform multidimensional analysis on massive data sets at sub-second latency.
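To make the idea of columnar storage combined with bitmap indexes more concrete, here is a minimal Python sketch. It is a toy illustration only, not Druid's implementation: plain integers stand in for compressed bitmaps, and the column names and sample rows are hypothetical.

```python
# Toy inverted/bitmap index over dimension columns (illustrative only;
# a real engine would use compressed bitmaps rather than Python integers).
from collections import defaultdict


def build_bitmap_index(column):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    index = defaultdict(int)
    for row_id, value in enumerate(column):
        index[value] |= 1 << row_id
    return dict(index)


def matching_rows(bitmap):
    """Expand a bitmap back into a sorted list of row ids."""
    rows, row_id = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(row_id)
        bitmap >>= 1
        row_id += 1
    return rows


# Hypothetical sample data, stored column by column.
country = ["US", "CN", "US", "DE", "CN", "US"]
device = ["ios", "android", "android", "ios", "ios", "android"]
clicks = [3, 5, 2, 7, 1, 4]

country_idx = build_bitmap_index(country)
device_idx = build_bitmap_index(device)

# The filter country == "US" AND device == "android" is one bitwise AND.
hits = matching_rows(country_idx["US"] & device_idx["android"])

# Aggregation only touches the metric column for the matching rows.
total_clicks = sum(clicks[i] for i in hits)
print(hits, total_clicks)
```

The point of the sketch is that a filter never scans whole rows: it combines per-value bitmaps and then reads only the metric values it needs.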

Real-time streaming data analysis. Unlike traditional analytic databases, which rely on bulk data imports, Druid analyzes data as it streams in. Its LSM (log-structured merge) tree structure gives Druid very high real-time write performance, and real-time data becomes available for visualization within a sub-second delay.
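The general LSM idea can be sketched in a few lines of Python. This is a generic, simplified illustration under stated assumptions, not Druid's actual ingestion code: writes go to an in-memory buffer, the buffer is periodically flushed to an immutable sorted segment, and reads check the newest data first. The class name `TinyLSM` and the `memtable_limit` parameter are hypothetical.

```python
import bisect


class TinyLSM:
    """Minimal log-structured merge store: in-memory buffer plus sorted segments."""

    def __init__(self, memtable_limit=3):
        self.memtable = {}    # recent writes, mutable and in memory
        self.segments = []    # immutable sorted lists of (key, value), oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Writes never touch existing segments, which keeps them cheap.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Persist the buffer as one sorted, immutable segment.
        if self.memtable:
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Newest data wins: check the buffer, then segments from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None


store = TinyLSM(memtable_limit=2)
store.put("a", 1)
store.put("b", 2)    # reaches the limit, so the buffer is flushed to a segment
store.put("a", 10)   # newer value stays in memory and shadows the old one
print(store.get("a"), store.get("b"), store.get("z"))
```

A real system would also merge small segments into larger ones in the background; that step is omitted here to keep the sketch short.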

Rich data analysis capabilities. For different user groups, Druid provides a friendly visual interface, a SQL-like query language, and a REST query interface.
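As a rough idea of what a request to the REST query interface can look like, the sketch below builds the JSON body of a native timeseries query in Python. The data source, dimension, and metric names are hypothetical, and the exact fields supported depend on the Druid version, so treat this as an illustration rather than a reference. The body would typically be sent as an HTTP POST to a Broker node; the sketch only builds and prints it.

```python
import json


def build_timeseries_query(data_source, interval, dimension, value, metric):
    """Build a native timeseries query body (all names passed in are hypothetical)."""
    return {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": "hour",
        "intervals": [interval],
        # Keep only rows where the dimension equals the given value.
        "filter": {"type": "selector", "dimension": dimension, "value": value},
        # Sum the metric column for each hourly bucket.
        "aggregations": [
            {"type": "longSum", "name": "total_" + metric, "fieldName": metric}
        ],
    }


query = build_timeseries_query(
    data_source="ad_events",              # hypothetical data source
    interval="2016-01-01/2016-01-02",     # ISO 8601 interval
    dimension="country",
    value="US",
    metric="clicks",
)
body = json.dumps(query, indent=2)
print(body)
```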

High availability and scalability. Druid uses a distributed, SN (shared-nothing) architecture. Management nodes can be configured for HA, and each type of worker node has a single function and does not depend on the others. These characteristics make a Druid cluster very simple to manage, to make fault tolerant, to prepare for disaster recovery, and to scale out.

1 Why do we need Druid?

Big data technology has been around for more than 10 years since the earliest Hadoop project. Druid was open-sourced at the end of 2013; although it is not yet an Apache top-level project, as a rising star it has attracted a lot of users and its community is very active. So why does Druid exist, and which "pain points" of traditional big data processing frameworks does it solve? Let us answer each question in turn.

In the era of big data, how to extract valuable information from massive data is a problem that urgently needs solving. To address it, IT giants have developed a large number of data storage and analysis products, such as IBM Netezza, HP Vertica, and EMC Greenplum. However, most of these are expensive commercial products, and only a handful of companies in the industry use them.

Benefiting from the rise of the open source spirit in recent years, many outstanding open source projects have emerged in the industry, the most famous of which is the Apache Hadoop ecosystem. Today, Hadoop has become a "standard" solution for big data, but while people enjoy the easy data analysis Hadoop offers, they must also endure many "pain points" in its design. Three of them are listed below:

When data becomes queryable. The Map/Reduce batch framework used by Hadoop offers no guarantee of when data will be available for querying.

Random I/O problem. The data processed by the Map/Reduce batch framework must be stored in HDFS, and HDFS is a distributed file system that uses the cluster's hard disks as its storage resource pool. Processing massive data therefore inevitably causes a large number of read and write operations, and random I/O becomes the performance bottleneck in high-concurrency scenarios.

Data visualization issues. HDFS is an excellent distributed file system, but it is not an optimal choice for data analysis and ad hoc querying.

As a traditional big data processing architecture, Hadoop leans toward being a "back-end batch processing data warehouse system". As a general-purpose solution, it excels at storing large volumes of historical data and analyzing cold data. But when it comes to guaranteeing query and analysis performance on massive data under high concurrency, or to querying, analyzing, and visualizing massive real-time data, Hadoop does seem powerless.

2 The pain points Druid addresses

Druid's parent company, Metamarkets, was also a Hadoop fan before 2011. Under high concurrency, however, Hadoop could not provide product-level guarantees of data availability and query performance, so Metamarkets had to look for a new solution. After trying a variety of relational databases and NoSQL products, they felt that none of these tools could solve their "pain points", so in 2011 they decided to build their own "wheel", Druid. They defined Druid as "an open source, distributed, column-oriented real-time data store". The "pain point" it targets is the one mentioned above: "to guarantee query and analysis performance on large data in a high-concurrency environment, while providing query, analysis, and visualization of massive real-time data."
