Building a streaming data platform with Elasticsearch, Kafka, and Cassandra
Over the past year, I've spoken with a number of software companies about how they process application data (usually logs and metrics). In these conversations I often hear frustration at having to stitch together a collection of fragmented tools to aggregate this data over time. These tools include:
- Tools used by operations teams for monitoring and alerting
- Tools used by developers to track performance and diagnose problems
- A completely separate system that business intelligence (BI) and the business rely on to analyze user behavior
Although these tools have different perspectives and apply to different scenarios, they all revolve around the same data sources and data types. As a result, many software teams say, "if only we had the time, we could build something better than this." Frankly, with so much excellent open-source software available, whether building your own is worthwhile is debatable. That is what we did at Jut: we built a streaming data analysis system out of open-source big data components. This article describes the pieces we used and how we put them together. We will cover:
- Data ingestion: how to bring in different types of data streams
- Indexing and data storage: efficient storage and unified querying
- Putting it together: how data flows through the system
- Optimization: people will only use the system if the whole pipeline is genuinely fast
I hope reading this article helps you build your own system in a sensible, scalable way and avoid some of the traps we ran into.
1. Data ingestion
The data types, formats, and transport protocols involved in operational analytics and monitoring are anything but fixed, so the system must be able to support different data sources and data senders. For example, your data may include any of the following:
- Custom application events.
- Container-level metrics and logs.
- statsd or collectd metrics.
- Webhook events from third parties such as GitHub or Stripe.
- Application or server logs.
- User behavior data.

Although these arrive in different formats and representations, they need a unified format inside the system. Whichever format you choose, you will have to convert the incoming data streams into it.
We chose a simple and flexible data format: each record (a "point") is a set of key/value pairs, which can be conveniently represented as a JSON object. All points have a "time" field, and a metric point also has a numeric "value" field; other points can have any "shape". A front-end HTTPS server (running Nginx) receives the data and fans it out to local "connector" processes (running Node.js). These processes convert the incoming data into our internal format and publish it to a Kafka topic (for reliability), from which it can be consumed for indexing and/or processing. Beyond the data types listed above, connectors like these make it easy for your team to feed new inputs onto your data bus. You may not need as much generality or flexibility as I have described here, but it is always good to design in some flexibility so the system can accept new data types without being rebuilt later.
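To make this concrete, here is a minimal TypeScript sketch of what such a connector could look like, assuming the kafkajs client; the topic name and the field handling are illustrative assumptions, not Jut's actual code.

```typescript
// Minimal connector sketch: normalize an incoming record into a "point"
// (key/value pairs with a required "time" field) and publish it to Kafka.
// The topic name "imported-points" is hypothetical.
import { Kafka } from "kafkajs";

interface Point {
  time: string;            // ISO-8601 timestamp, required on every point
  value?: number;          // present only on metric points
  [key: string]: unknown;  // arbitrary additional fields (the point's "shape")
}

const kafka = new Kafka({ clientId: "connector", brokers: ["localhost:9092"] });
const producer = kafka.producer();

// Convert a raw input record (e.g. parsed JSON from the HTTPS front end)
// into the internal point format, stamping a time if one is missing.
function toPoint(raw: Record<string, unknown>): Point {
  return {
    ...raw,
    time: typeof raw.time === "string" ? raw.time : new Date().toISOString(),
  };
}

export async function publish(raw: Record<string, unknown>): Promise<void> {
  await producer.connect();
  await producer.send({
    topic: "imported-points",
    messages: [{ value: JSON.stringify(toPoint(raw)) }],
  });
}
```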
2. Indexing and storing data
All of this data has to be stored somewhere, preferably in a database that scales as the data grows, and better still one that supports analytical queries. If this data hub only stored logs and events, we could simply have picked Elasticsearch. If it were only about metrics, we could have picked a time series database (TSDB). But we needed to handle both, so we ended up with a system with multiple local data stores, which lets us handle each kind of data as efficiently as possible.
Elasticsearch stores logs and events
We use Elasticsearch as the event database. Events can have different shapes depending on the source they come from. We use a number of Elasticsearch APIs, in particular the query and aggregation APIs.
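As an illustration (not Jut's actual code), this is roughly how fetching recent error events might look with a recent @elastic/elasticsearch client; the "events" index and the field names are assumptions.

```typescript
// Query sketch: fetch events from the last 15 minutes whose "level" is "error".
import { Client } from "@elastic/elasticsearch";

const es = new Client({ node: "http://localhost:9200" });

export async function recentErrors() {
  const result = await es.search({
    index: "events",
    size: 100,
    query: {
      bool: {
        filter: [
          { range: { time: { gte: "now-15m" } } },
          { match: { level: "error" } },
        ],
      },
    },
  });
  // Each hit's _source is the original event document.
  return result.hits.hits.map((hit) => hit._source);
}
```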
Cassandra and Elasticsearch store metrics
In principle, metrics could be stored entirely in Elasticsearch (or any other database), but it is more efficient to use a database purpose-built for the structure and redundancy of metric data. The obvious approach is to use an existing open-source time series database (TSDB), and that is what we did originally, using an open-source TSDB with Cassandra as its backend. The challenge with that approach is that the TSDB has its own query API, different from Elasticsearch's. With two divergent APIs it is hard to offer a unified search and query interface across events and metrics. That is why we eventually decided to write our own TSDB, storing metrics across Cassandra and Elasticsearch.
Specifically, we store the time/value pairs in Cassandra, store the metadata in Elasticsearch, and put a query and management layer on top. In this way both events and metrics can be searched and queried through Elasticsearch.
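A rough sketch of what the write path of such a split store might look like, assuming the cassandra-driver and a recent @elastic/elasticsearch client; the keyspace, table, and index names are hypothetical, not Jut's actual schema.

```typescript
// Write path sketch: the time/value sample goes to Cassandra, while the
// series metadata (metric name and tags) is indexed in Elasticsearch so
// metrics can be searched alongside events.
import { Client as CassandraClient } from "cassandra-driver";
import { Client as EsClient } from "@elastic/elasticsearch";

const cassandra = new CassandraClient({
  contactPoints: ["localhost"],
  localDataCenter: "datacenter1",
  keyspace: "metrics",
});
const es = new EsClient({ node: "http://localhost:9200" });

export async function writeMetric(
  name: string,
  tags: Record<string, string>,
  time: Date,
  value: number
): Promise<void> {
  const seriesId = `${name}:${JSON.stringify(tags)}`; // naive series key

  // 1. Store the raw sample in Cassandra, clustered by time within a series.
  await cassandra.execute(
    "INSERT INTO samples (series_id, time, value) VALUES (?, ?, ?)",
    [seriesId, time, value],
    { prepare: true }
  );

  // 2. Index the series metadata in Elasticsearch (idempotent by series id).
  await es.index({
    index: "metric-metadata",
    id: seriesId,
    document: { name, tags },
  });
}
```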
Stream processing engine

Now we have a way to get data in and some databases to hold it. Can we just add a front-end application and start using our data? No! Although Elasticsearch itself can perform log and event analytics, we still need a processing engine, because:
- We need a unified way to access events and metrics, whether real-time or historical.
- In some cases (monitoring, alerting) we need to process the data in real time, as it arrives.
- Metrics! We want to do more than just look up stored metrics and read them back out; we want to compute new metrics derived from the existing ones.
- Even for events, we need more general processing capabilities than the Elasticsearch API offers, for example joining different sources and data sets, string parsing, or custom aggregations.

From here on, things get really interesting. You could spend a day (or more) reading about how others have built their data pipelines and studying architectures such as Lambda and Kappa; there is a lot of very good material out there. Cutting to the chase: what we built is a processing engine that fully supports both real-time streams and batch computation. If you are interested, you can read more here and here.
Here, unlike ingestion and storage, we built our own processing engine from scratch, not because there are no other stream processing engines, but because we care deeply about query performance; we discuss that separately in the sections below. More specifically, we built a stream processing engine that implements a dataflow processing model: a computation is represented as a directed graph of operations (aggregation, windowing, filtering, join) that transforms inputs into outputs. This model naturally unifies queries and computation, and it is well suited to real-time, batch, and distributed execution.
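As a toy illustration of this dataflow model (not Jut's engine), here is how a small filter-then-aggregate graph could be expressed; every name here is made up.

```typescript
// A computation is a graph of operators, each transforming a stream of points.
type Point = { time: number; [key: string]: unknown };
type Operator = (input: Point[]) => Point[];

// Compose operators into a linear pipeline (the simplest directed graph).
const pipeline = (...ops: Operator[]): Operator => (input) =>
  ops.reduce((pts, op) => op(pts), input);

// Example operators: a filter, and a one-minute-window count.
const filter = (pred: (p: Point) => boolean): Operator => (pts) =>
  pts.filter(pred);

const countPerMinute: Operator = (pts) => {
  const buckets = new Map<number, number>();
  for (const p of pts) {
    const minute = Math.floor(p.time / 60000) * 60000;
    buckets.set(minute, (buckets.get(minute) ?? 0) + 1);
  }
  return [...buckets.entries()].map(([time, count]) => ({ time, count }));
};

// filter -> window/aggregate: analogous to the graphs the engine executes.
export const errorsPerMinute = pipeline(
  filter((p) => p.status_code === 500),
  countPerMinute
);
```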
Of course, unless you are looking for a new project to take on, we recommend using an existing open-source stream processing engine; take a look at Riemann, Spark Streaming, or Apache Flink.
3. Query and computation
We use the stream processing engine to run dataflow computations. But how do users express queries and build such dataflow graphs? One approach is to provide an API or an embedded DSL. Such an interface offers ways to query and filter data, to define transformations and other processing operations and, most importantly, a way to compose multiple processing stages into a flow graph. Each of the projects mentioned above has its own API, and personal preferences vary. A common challenge with APIs is that analysts who are comfortable with SQL or Excel cannot use them easily.
One possible answer to that problem is to give these users access through tools built on top of those APIs (for example, a simple web application).
Another approach is to provide a simple query language. That is what we did at Jut. Since there was no existing query language for dataflow (the way SQL exists for relational queries), we created one called Juttle. At its core, Juttle is a stream-graph query language that lets you declare processing pipelines with a simple syntax, as shown in the example below.
It has primitives for search, windowing, joins, aggregations, and group-by, all with a simple syntax. Of course, before a flowgraph can process data, it has to get the data: Juttle lets you define a query to fetch data across any combination of events and/or metrics, real-time and/or historical, all with the same syntax and structure. Here is a simple example following the pattern query | analyze | view (note the pipe operator used to chain stages, much as in a shell):

```
read -from :1 day ago: type = 'weblog'
  | reduce -every :minute: count() by status_code
  | @timechart
```
4. Putting it together: an anomaly detection example
So far we have taken a component-centric perspective: we have discussed the pieces and their roles, but not how they are combined. Let's now switch to a data-centric perspective and look at what it takes to support both real-time and historical queries. We will use an anomaly detection algorithm as an example. It is a good one because we need to query historical data to train the underlying statistical model, test live streaming data against it for anomalies, and then write results back into the system and send alerts when anomalies occur.
Before we can run any queries, however, the whole ingestion path has to be wired up so that incoming data is written to the index stores. This is the job of the import service, which includes the write path of the time series database: metric data goes to Cassandra and metric metadata goes to Elasticsearch.
Now a user starts an anomaly detection job. This requires reading historical data, so the processing engine queries the underlying databases directly. Queries can be optimized in various ways depending on the query and the data (discussed below), and the read path of the metrics database mirrors the write path: the metadata in Elasticsearch is queried to find the metric values to fetch from Cassandra, and actual metric points are generated from the results.
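Here is an illustrative sketch of that read path, mirroring the write-path sketch above; the schema and index names remain assumptions rather than Jut's actual code.

```typescript
// Read path sketch: find matching series in the Elasticsearch metadata index,
// then pull their samples from Cassandra and emit them as points.
import { Client as CassandraClient } from "cassandra-driver";
import { Client as EsClient } from "@elastic/elasticsearch";

const cassandra = new CassandraClient({
  contactPoints: ["localhost"],
  localDataCenter: "datacenter1",
  keyspace: "metrics",
});
const es = new EsClient({ node: "http://localhost:9200" });

export async function readMetric(name: string, from: Date, to: Date) {
  // 1. Find the series that match the query in the metadata index.
  const meta = await es.search<{ name: string; tags: Record<string, string> }>({
    index: "metric-metadata",
    size: 1000,
    query: { term: { name } },
  });

  // 2. Fetch each series' samples from Cassandra and turn them into points.
  const points: Array<Record<string, unknown>> = [];
  for (const hit of meta.hits.hits) {
    const result = await cassandra.execute(
      "SELECT time, value FROM samples WHERE series_id = ? AND time >= ? AND time <= ?",
      [hit._id, from, to],
      { prepare: true }
    );
    for (const row of result.rows) {
      points.push({ time: row.time, value: row.value, name, ...hit._source?.tags });
    }
  }
  return points;
}
```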
The historical read covers data from some time in the past up to the present, at which point the processing engine switches the flowgraph over from historical to real-time data. To do this, the engine taps directly into the entry point of the import service. Note that this kind of switchover has to be handled carefully to avoid dropping or duplicating points.
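One way to picture a careful switchover is to buffer live points while the historical read finishes, then splice them in at the boundary timestamp; the sketch below is a simplification, not Jut's implementation.

```typescript
// Switchover sketch: emit historical points first, hold live points aside,
// then replay only the buffered points newer than the last historical point.
type Point = { time: number; [key: string]: unknown };

export class Switchover {
  private buffer: Point[] = [];
  private live = false;
  private boundary = 0;

  constructor(private emit: (p: Point) => void) {}

  onHistorical(p: Point): void {
    this.boundary = Math.max(this.boundary, p.time);
    this.emit(p);
  }

  onLive(p: Point): void {
    if (!this.live) this.buffer.push(p); // historical read still in progress
    else this.emit(p);
  }

  historicalDone(): void {
    for (const p of this.buffer) if (p.time > this.boundary) this.emit(p);
    this.buffer = [];
    this.live = true;
  }
}
```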
At this point we have a trained anomaly detection flowgraph running on live data. When it detects an anomaly, we want it to send an alert to some external system, which the processing engine can do by POSTing data to an external HTTP service. Besides sending alerts, we also want to keep track of the anomalies internally; in other words, we want to write the resulting stream back into the system. Conceptually, this is a path from the processing engine back into the ingestion pipeline.
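A minimal sketch of such an alert path, assuming Node 18+ for the built-in fetch, the hypothetical connector `publish` sketch from earlier, and a made-up webhook URL.

```typescript
// Alert sketch: POST the anomaly to an external HTTP service and also write it
// back into the system by publishing it onto the ingestion topic like any point.
import { publish } from "./connector"; // the Kafka publish sketch shown earlier

export async function onAnomaly(point: Record<string, unknown>): Promise<void> {
  await fetch("https://alerts.example.com/hook", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(point),
  });
  await publish({ ...point, type: "anomaly" });
}
```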
5. Optimization
We now have a data ingestion pipeline, some databases, and a processing engine. Can we add a front-end application and start analyzing our data? No! Well, we actually could, but the problem is that query performance would still be slow. And slow queries mean... nobody will use our system.
So let's revisit the notion of a "unified processing engine". By our definition, the same system processes historical and real-time data using the same structures, abstractions, and queries. The performance challenge comes from the fact that there is far more historical data than real-time data. Suppose, for example, that data enters the system at one million points per second and the processing pipeline is fast enough to query it in real time as it arrives. Now run a query with the same semantics over the past day's data.
That means processing tens of billions of points in one go (or at least reading them out of storage fast enough to keep up). If the computation is distributed, we can throw more compute nodes at it, but at best that is inefficient and expensive. So this is where optimization comes in. There are many ways to optimize dataflow queries. Some involve transforming the query itself, for example pushing filters or aggregations upstream as far as possible without changing the query's semantics. The optimizations we focus on here push as much filtering and processing as possible down into the database, which requires the following (a sketch of such a pushdown follows the list):
- Automatically identifying which parts of the query the database can execute
- Translating those parts into the target database's query language
- Running the backend query and injecting the results at the right place in the dataflow graph
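For example, the per-minute status-code count from the Juttle example earlier could be pushed down into Elasticsearch as an aggregation, so only aggregated buckets (not raw points) ever leave the database. This sketch assumes a recent @elastic/elasticsearch client and hypothetical index and field names.

```typescript
// Pushdown sketch, roughly equivalent to:
//   read -from :1 day ago: type = 'weblog'
//     | reduce -every :minute: count() by status_code
import { Client } from "@elastic/elasticsearch";

const es = new Client({ node: "http://localhost:9200" });

export async function pushedDownCounts() {
  const result = await es.search({
    index: "events",
    size: 0, // no raw points are returned, only aggregated buckets
    query: {
      bool: {
        filter: [
          { term: { type: "weblog" } },
          { range: { time: { gte: "now-1d" } } },
        ],
      },
    },
    aggs: {
      per_minute: {
        date_histogram: { field: "time", fixed_interval: "1m" },
        aggs: { by_status: { terms: { field: "status_code" } } },
      },
    },
  });
  // These buckets are then injected back into the dataflow graph as points.
  return result.aggregations;
}
```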
6. Conclusion
We did it! Well, we would be done if we did not also need a visualization layer; as it stands, the system can only be queried through APIs. Building a client application that creates queries, streams and visualizes data, and assembles dashboards is another tricky problem, so we will save that story for another day.
For now, let's recap what we covered while building this data platform:
- An ingestion path that accepts input data from different sources, converts it to a uniform format, and stores it for downstream consumption (in Jut, this is built on Kafka).
- Databases for events and metrics: in Jut, events are stored in Elasticsearch and metrics in our own metrics database built on Cassandra.
- A processing engine (or two, if you go for a lambda-ish architecture).
- An API or query language for running queries against the system.

Building this system was a long and interesting journey. Even if you want to build your own, try Jut first; you may find it does what you need.