Design optimization of HBase log storage for a log system


Articles on this blog are original unless otherwise noted. If you reproduce them, please cite the source: http://blog.csdn.net/yanghua_kobe/article/details/46482319

Continuing with the log system I recently took over: the previous post was about log collection; this time let's talk about log storage.

Brief introduction

Let's begin by summarizing the characteristics of this data: it has almost no update requirements; a single component or system usually has a fixed log format, but across multiple components or systems there is a wide variety of custom tags. These tags are usually created for later querying and for troubleshooting online problems, so the fields used to retrieve logs are also flexible.

Our log storage selection is HBase, mainly because we think HBase is a good fit for log data:

(1) HBase's qualifiers are quite flexible and can be created dynamically, making them ideal for semi-structured data like logs (the flexibility matters mainly for tag storage)

(2) HBase belongs to the Hadoop ecosystem, which makes downstream offline analysis and data mining convenient.

Combined with the log characteristics above: because tags are flexible and changeable, tag-based queries are where HBase seems somewhat inadequate. This is mainly because HBase itself does not provide secondary indexes, and you cannot search by column value. If the rowkey or rowkey range cannot be determined and there is no secondary index, a full table scan is performed. From this point of view, you can treat HBase as a key-value database (like Redis).

The flawed self-built index design on HBase

Because HBase itself does not provide a secondary indexing mechanism, a common practice is to build indexes externally, which is what the log system I took over did. The basic idea: the logs are stored in a log table; the index information is kept in an index metadata table; each entry in the metadata table corresponds to one index table; and each index table uses the column family "scale" to store the rowkeys of the matching log records. Summarized as follows:

(1) Log table: Store log records

(2) Meta table: Store index metadata (which contains table names for dynamic index tables)

(3) Dynamic index table: stores the concrete index entries; one index corresponds to one table

Let's take a look at the schema design for these tables:
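In outline, pieced together from the description above and the rowkey discussion that follows (the column names here are illustrative, not necessarily the actual ones):

(1) Log table: rowkey = distributed auto-increment log ID; one column family holding the raw log fields (time, host, level, message, the custom tags, ...)

(2) Meta table: rowkey = the tag array serialized to JSON bytes; columns holding the index name (which is also the dynamic index table's name) and the time span

(3) Dynamic index table (one per index, named after the index): rowkey = time bucket + tag combination; the column family "scale" holds the rowkeys of the matching rows in the log table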


Here is the approximate logic for creating a dynamic index table. It requires three parameters:

(1) IndexName: index name

(2) Tags: the tag array for which you need to index

(3) Span: time interval

First, the tags array is converted to byte[] (via fastjson's toJSONBytes) and used as the rowkey for a lookup against the meta table, to check whether an index has already been built for that tag combination. (HBase only understands rowkeys as byte[]; the meta table's rowkey is the JSON representation of the tag array serialized to byte[].)

If the metadata for the index does not exist, a dynamic index table is created, with the index name as its table name.
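As a rough sketch of that lookup-and-create step (assuming the HBase 1.x client API and fastjson; the meta table name, column family, and qualifiers here are illustrative, not the system's actual names):

import com.alibaba.fastjson.JSON;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexMeta {
    // Check the meta table for an index on this tag combination; create it if missing.
    public static void ensureIndex(Connection conn, String indexName, String[] tags, long span) throws Exception {
        byte[] metaRow = JSON.toJSONBytes(tags);                          // rowkey = tag array serialized to JSON bytes
        Table meta = conn.getTable(TableName.valueOf("log_index_meta"));  // illustrative table name
        if (meta.get(new Get(metaRow)).isEmpty()) {
            // record the index metadata: the index (table) name and the time span
            Put put = new Put(metaRow);
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("indexName"), Bytes.toBytes(indexName));
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("span"), Bytes.toBytes(span));
            meta.put(put);
            // create the dynamic index table, named after the index, with the "scale" family
            Admin admin = conn.getAdmin();
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf(indexName));
            desc.addFamily(new HColumnDescriptor("scale"));
            admin.createTable(desc);
            admin.close();
        }
        meta.close();
    }
}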

The index table's rowkey contains two properties:

(1) Time: the "quasi" time of the log record. Note this is not the actual record time but the start of the time bucket the record falls into, i.e. (timestamp / span) * span; for example, with a span of 300 seconds, a log written at 10:03:27 is bucketed to 10:00:00.

(2) Tags: the string array of tags

A full table scan is then run over the log table, and for each log record:

(1) Get the time of log generation

(2) Loop over the tags; for each tag, check whether the log record contains it. If any tag is missing, break out of the loop immediately; if the record matches every one of the tags, it should be indexed

(3) If an index is required, add a piece of data to the index table

In a sense, the index built here matches a tag set against a time shard: log records that satisfy the tag conditions are gathered under the time bucket closest to their own point in time.
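A minimal sketch of that back-fill pass, under the same illustrative names as above and assuming each tag is stored as a qualifier on the log row:

import com.alibaba.fastjson.JSON;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexBackfill {
    public static void buildIndex(Connection conn, String indexName, String[] tags, long span) throws Exception {
        Table logTable = conn.getTable(TableName.valueOf("log"));
        Table indexTable = conn.getTable(TableName.valueOf(indexName));
        ResultScanner logs = logTable.getScanner(new Scan());    // the full table scan criticized below
        for (Result log : logs) {
            long ts = Bytes.toLong(log.getValue(Bytes.toBytes("f"), Bytes.toBytes("time")));
            boolean matchesAll = true;
            for (String tag : tags) {
                // break out as soon as one required tag is missing from this record
                if (!log.containsColumn(Bytes.toBytes("f"), Bytes.toBytes(tag))) {
                    matchesAll = false;
                    break;
                }
            }
            if (matchesAll) {
                // index rowkey = time bucket + tag combination; the cell stores the log's rowkey
                byte[] indexRow = Bytes.add(Bytes.toBytes(ts / span * span), JSON.toJSONBytes(tags));
                Put put = new Put(indexRow);
                put.addColumn(Bytes.toBytes("scale"), log.getRow(), log.getRow());
                indexTable.put(put);
            }
        }
        logs.close();
    }
}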

Problems with index design

(1) Building the index table is inefficient: it requires a two-layer nested loop whose outer layer is a full table scan, and with a large amount of data that is hard to accept. In effect it is an after-the-fact compensation mechanism applied to data that already exists. The usual practice is to create the index table empty and populate it dynamically as data that needs indexing arrives (this second half is also implemented, via Storm)

(2) Looking up logs through the index also takes two layers of loops: the outer layer finds the matching rows in the dynamic index table, and the inner layer fetches, for each of those rows, all the log table rowkeys recorded in it. If the queried time range is long, or the time-shard interval is small, there will be a great many time points, and the more time points there are, the more times the outer loop runs. To avoid that, the implementation limits the time sharding so that it cannot exceed a certain range; the consequence is that many more logs fall onto each time point, and the inner loop then runs far more times.

(3) Query efficiency depends heavily on how complete the index is. This forces the indexed tag sets to be a fairly exhaustive roll-up, and if they are made large and all-encompassing, the precision of the index's condition matching drops. And if there is simply no index for the tags you want to query, you are back to a full table scan.

(4) The log table's rowkey is a distributed auto-increment ID, and the other tables use the string form of a JSON object as the rowkey; neither design pays any attention to how important the rowkey is for HBase queries.

Optimizing queries for logs stored in HBase

Basics of HBase queries

These problems all stem from how the self-built index was implemented. To optimize the log system's queries, we first need a basic understanding of how HBase is queried. There are three ways to access row records in HBase:

(1) Make a unique match by Rowkey

(2) match a range by Rowkey range, then filter in the range with multiple filters

(3) Full table scan

From a programming point of view, HBase's query API supports only two forms:

(1) Get: Specify Rowkey to get only one record

(2) Scan: obtain a batch of records according to specified conditions (this covers approaches 2 and 3 above); a minimal illustration follows
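A minimal illustration, given an open Connection (HBase 1.x client API; the table name and rowkeys are placeholders):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

Table table = connection.getTable(TableName.valueOf("log"));

// Get: exactly one row, addressed by its full rowkey
Result one = table.get(new Get(Bytes.toBytes("20150610103000app0104000001")));

// Scan: a batch of rows bounded by a rowkey range;
// leave the bounds out and it degenerates into a full table scan
ResultScanner rows = table.getScanner(new Scan(Bytes.toBytes("20150610"), Bytes.toBytes("20150611")));
for (Result r : rows) {
    // process one matching row
}
rows.close();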

Generally, a full table scan is rarely the approach we expect. So if we want to improve our query efficiency, we have to design rowkey carefully.

Putting the problems of the self-built index above together with this basic understanding of HBase queries, the problems fall into two main areas:

(1) Self-built index implementation is not efficient

(2) No good design for Rowkey (log ID with distributed self-increment ID)

Let's talk about the optimization strategy for these two points.

Optimization of Rowkey

The rowkey must never be handled the way a traditional RDBMS handles a primary key, with a UUID or an auto-increment ID. HBase sorts rowkeys lexicographically (specifically, by the ASCII codes of the key bytes). Our idea is to bake the query factors we care about into the rowkey; by combining several factors we can narrow the search range step by step. For example, time is a factor we should definitely put in the rowkey: a start time and an end time form a time range, which pins down a range of the result set.

It is easy to see that the more query factors are encoded into the rowkey, the more precisely the query range can be located. But the query factors have to be abstracted from a large number of logs (host, level, timestamp, and so on), which requires them to be common attributes of every log record. In our current log system, logs fall roughly into two categories:

(1) Business system/framework logs with a fixed format (e.g. the business framework, web applications, etc.)

(2) Technical system/component/framework logs with no fixed format (e.g. Nginx, Redis, RabbitMQ, etc.)

For fixed-format logs, our rowkey rule is:


For non-fixed-format logs, our rowkey rule is:


Because the log formats of the various technology components vary widely, we cannot parse the time out of them, so here we use the log's collection time as the reference timestamp. We can only assume the whole log pipeline is working well, i.e. the generation time is close to the collection time. Such an assumption is no doubt sometimes inaccurate, but it does not matter much that we are not using the real time as the benchmark, because this type of log is re-parsed and re-dumped by an offline batch job, so the exact log timestamp is eventually recovered.

The rowkey is best designed with a fixed length, and each segment of the rowkey is best converted to pure digits or pure letters, which map easily to ASCII and make it easy to set artificial maximum and minimum values. For example: if the leading characters are fixed and the last three digits are indeterminate, and those digits are numeric, then the interval falls between xxxxx000 and xxxxx999.

It is quite normal for a query factor we want to add to the rowkey to have values that are neither digits nor letters; in that case we can map it through a code table. For example, the log-level factor of our application logs is mapped through a code table: we currently use a two-digit number to represent each possible level.
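As only an illustrative sketch of this rowkey-building idea (the field layout, widths, and level code table below are assumptions, not our actual rules): fixed-width segments, the time factor first, other factors mapped to fixed-length codes.

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

public class RowkeyBuilder {
    // hypothetical code table: map log levels to fixed two-digit codes so they stay fixed-width and sortable
    private static final Map<String, String> LEVEL_CODES = new HashMap<String, String>();
    static {
        LEVEL_CODES.put("DEBUG", "01");
        LEVEL_CODES.put("INFO",  "02");
        LEVEL_CODES.put("WARN",  "03");
        LEVEL_CODES.put("ERROR", "04");
    }

    // fixed-length, ASCII-sortable rowkey: time, then host, then level code, then a sequence number
    public static String build(Date logTime, String host, String level, long seq) {
        String time = new SimpleDateFormat("yyyyMMddHHmmss").format(logTime);
        String paddedHost = (host + "________________").substring(0, 16);   // pad/truncate host to 16 chars
        return time + paddedHost + LEVEL_CODES.get(level) + String.format("%06d", seq);
    }
}

With time as the leading segment, a start key built from 20150610100000 and a stop key built from 20150610110000 bound one hour of data; host and level are then narrowed further with filters, as in the next section.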

Filters

After the rowkey range has pinned down the scope of the result set, the next step is to use HBase's built-in filters for more precise filtering; HBase ships with many filters for filtering on rowkey, column family, qualifier, and so on. Of course, if the rowkey range is still very wide, the effect is close to a full table scan anyway. All we can really do is constrain the query conditions, for example (a sketch follows this list):

(1) Limit the span of the query time interval to a certain extent

(2) Return the query results in pages
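For example, a one-day scan that additionally filters on level and caps the page size might look roughly like this (HBase 1.x client API; the rowkey layout and column names follow the illustrative scheme above):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.PageFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

Table logTable = connection.getTable(TableName.valueOf("log"));

// 1. bound the result set by a rowkey range: the leading time segment limits the interval
Scan scan = new Scan(Bytes.toBytes("20150610000000"), Bytes.toBytes("20150611000000"));

// 2. filter more precisely inside that range (only ERROR rows) and cap the rows per page
FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
filters.addFilter(new SingleColumnValueFilter(
        Bytes.toBytes("f"), Bytes.toBytes("level"),
        CompareFilter.CompareOp.EQUAL, Bytes.toBytes("ERROR")));
filters.addFilter(new PageFilter(100));
scan.setFilter(filters);

ResultScanner page = logTable.getScanner(scan);
for (Result r : page) {
    // render one page of matching log records
}
page.close();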

Back to the self-built index

Since the index is the key element in optimizing queries, the idea of building one is sound. But however it is done, a self-built index still requires careful rowkey design, for the data table's rowkey as well as the index table's. Sometimes, for query efficiency, the leading segments of a rowkey are even deliberately fixed so that the data they represent lands in the same region. The reason to design the rowkey carefully is, again, HBase's query characteristic: the more precisely you can bound the rowkey range, the faster the lookup.

Coprocessor

In general, building the index should not require a full scan of the existing table; instead, every piece of data entering the log table should be processed so that the index data is produced as it arrives. In our current system this is done by Storm, which analyzes the records and inserts the index entries. But Storm was not introduced for this purpose; its main job is to pick out ERROR-level logs in real time and deliver quasi-real-time notifications. So the question is: if we did not have that requirement, would we stand up a Storm cluster just to compute the index? The answer is: not necessarily.

What we really need is a hook, or callback, that intercepts each piece of data as it is inserted into HBase, analyzes it, and decides whether a corresponding rowkey should be written into the index table. Since version 0.92, HBase has provided a mechanism called the coprocessor, which lets you write code that runs on the HBase servers and intercepts data. Coprocessors fall broadly into two categories:

(1) Observer (analogous to triggers in RDBMS)

(2) EndPoint (analogous to stored procedures in an RDBMS)

We can intercept log records with an observer and add processing logic there to build the index. Introducing HBase's technical details is not the focus of this article, so I will only mention it here and, if there is a chance later, explore it further.
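Still, as a rough sketch (assuming the pre-2.0 BaseRegionObserver API; the index table, column names, and the buildIndexRowkey helper below are hypothetical), an observer that maintains the index on every write to the log table would look something like this:

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

public class LogIndexObserver extends BaseRegionObserver {

    @Override
    public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                        Put put, WALEdit edit, Durability durability) throws IOException {
        // after a log record is written, derive its index row and write it to the index table
        byte[] logRowkey = put.getRow();
        byte[] indexRowkey = buildIndexRowkey(put);   // hypothetical: time bucket + tag values
        if (indexRowkey == null) {
            return;                                   // this record matches no index definition
        }
        Table indexTable = ctx.getEnvironment().getTable(TableName.valueOf("log_index"));
        Put indexPut = new Put(indexRowkey);
        indexPut.addColumn(Bytes.toBytes("scale"), logRowkey, logRowkey);
        indexTable.put(indexPut);
        indexTable.close();
    }

    private byte[] buildIndexRowkey(Put put) {
        // extract the timestamp and tags from the Put and apply the rowkey rules sketched earlier
        return null; // omitted
    }
}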

Back to the topic of self-built indexes. Having covered the technical points a self-built index depends on, let me recommend one index design idea. There is a good article that satisfies multi-condition queries by cleverly designing the index table's rowkey; it is a secondary, multi-column index design. Multiple query condition keys and values are mapped into the rowkey, progressively narrowing the index table's rowkey interval until the unique target rowkey is determined, and the data table's rowkey is then read from that cell. However, this index design presupposes that the table structure is known and the query conditions are fixed. Our log table obviously contains all sorts of unpredictable tags, so such a design cannot be borrowed. This scenario is better served by a search engine that specializes in full-text retrieval.

Third-party professional indexing mechanism

As the discussion above shows, it is hard for HBase alone to index efficiently when a table has heavy full-text retrieval requirements. In that case we can hand the indexing over to a full-text search engine, letting it index the HBase rowkeys while HBase is responsible only for storing the underlying data. The industry already has plenty of practical experience with this idea (index + storage). For the full-text index you can choose Solr, or Elasticsearch (which has its own storage mechanism and is arguably an even better fit for log search). For a concrete solution you can refer to this slide deck.
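Whichever engine does the indexing, the read path has the same shape: send the user's conditions to the search engine, get back the matching rowkeys, then fetch the full records from HBase. The HBase side is just a batch Get, roughly like this (1.x client API; searchEngine.queryRowkeys is a hypothetical client call, not a real library method):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// rowkeys returned by the full-text engine (Solr/Elasticsearch) for the user's query
List<String> matchedRowkeys = searchEngine.queryRowkeys(userQuery);   // hypothetical call

Table logTable = connection.getTable(TableName.valueOf("log"));
List<Get> gets = new ArrayList<Get>();
for (String rowkey : matchedRowkeys) {
    gets.add(new Get(Bytes.toBytes(rowkey)));
}
Result[] logRecords = logTable.get(gets);   // one batched round trip for the whole result page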

Here is a picture of the overall architecture pattern:


If there is time and opportunity later, I will write about our practice of full-text log retrieval with the Elasticsearch + HBase combination.

