Application of HBase in a Content Recommendation Engine System


After Facebook moved away from Cassandra, it contributed a great deal of stability work to HBase 0.89, making it a truly industrial-grade system for storing and retrieving structured data. Facebook's Puma, Titan, and ODS time-series monitoring systems all use HBase as their back-end data store, and HBase is also used in projects at a number of Chinese companies.

HBase belongs to the Hadoop ecosystem. From the very start of its design, great importance was attached to system scalability: dynamic cluster expansion, load balancing, fault tolerance, and data recovery were all fully considered. Compared with a traditional relational database, HBase is better suited to applications with large data volumes, high read/write throughput, demands on data reliability and consistency, and low transactional requirements on data manipulation.

HBase uses HDFS as its storage layer. HDFS masks the heterogeneity of the underlying file systems, and load balancing, fault tolerance, and failure recovery for cluster data are transparent to the layers above, which keeps HBase's architecture simple and clear and makes its cluster scalability outstanding. At the same time, HBase uses ZooKeeper as distributed coordination middleware to manage the runtime state of each node in the cluster and to keep distributed state changes consistent.

By building on HDFS and ZooKeeper, HBase achieves an essentially stateless runtime design for both the management node (the HMaster) and the service nodes (the RegionServers): the MemStore and BlockCache structures a RegionServer manages are, in essence, caches. Because replacing, adding, or removing a service node at runtime does not depend on any information stored on that node, data recovery during load balancing, cluster expansion, and failover is very simple. Operations such as adding a server, or rebalancing the cluster load after a server goes offline, can be completed in a minute or even seconds, without needing to roll back logs. The HMaster's only job is to maintain the sequence of cluster state changes stored in ZooKeeper, acting as a watchdog; when the management node fails, a Backup Master can take over immediately without affecting normal use of the cluster.

The various problems of HBase

HBase also has shortcomings that users often criticize. For example, native HBase does not support secondary indexes (which many NoSQL databases, such as the popular MongoDB, treat as a basic feature); it supports only primary-key reads, writes, and range queries, and filtering on non-primary-key columns can only be done with filters, which are inefficient. If users build an index table themselves on the client side, they must maintain consistency between the index table and the data table on their own, and because HBase supports neither cross-row nor cross-table transactions, the complex rollback logic needed when part of a write fails also has to be implemented by the user.
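As a rough illustration of this limitation, a filtered scan on a non-key column with the standard HBase 2.x Java client might look like the sketch below; the table, family, and column names are invented for illustration. Such a scan still examines every row on the server side, which is exactly why it is slow on large tables.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FilterScanExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_profile"))) { // hypothetical table
            // Filter on a non-key column: every row still has to be examined server-side.
            SingleColumnValueFilter filter = new SingleColumnValueFilter(
                    Bytes.toBytes("f"),            // hypothetical column family
                    Bytes.toBytes("city"),         // hypothetical column
                    CompareOperator.EQUAL,
                    Bytes.toBytes("beijing"));
            filter.setFilterIfMissing(true);       // skip rows that lack the column entirely
            Scan scan = new Scan();
            scan.setFilter(filter);
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}
```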

At the bottom, HBase uses HDFS as its persistence layer. Because HDFS keeps replica consistency very simple, a file cannot be modified once it has been written. HBase therefore has to use an LSM-tree structure to simulate real-time modification of data: new data is flushed into new files, a query reads the contents of several store files at once, and the results are merged by timestamp. After a long period of writes, many store files accumulate. A traditional mechanical disk supports only a limited number of random seeks per second, and each random seek takes around 10 ms, far longer than the time HBase needs for operations such as locating a block; having to look up data across multiple store files therefore hurts read performance, and random read performance in particular.

To mitigate the read-performance loss after many store files have accumulated, HBase periodically merges data files in a compaction. A compaction generally has to read all the store files of a region, sort the records, and write them back into a new store file. While it runs it consumes large amounts of network bandwidth, memory, disk I/O, and CPU, and can easily overload the system. Once excessive bandwidth causes network delays, or excessive memory pressure forces a long GC, a RegionServer can stop serving for a long time. If the pause lasts until the ZooKeeper lease times out, the Master assumes the server is down and redistributes the regions it was hosting to other servers. This process usually takes two to three minutes, and in the worst case, if the meta table was hosted on that RegionServer, the whole cluster may become unable to serve requests.

In addition, because of the way HBase solidifies data in its storage format, where every cell repeats the row key, column family, and column name, non-sparse data carries a lot of redundancy and bloats on disk. For a table of about ten columns, the expansion factor is typically 3 to 5 times even before counting replicas. The simplest way to reduce this bloat is to keep row keys, column families, and column names as short as possible. In extreme cases we have packed the data itself into the row key of an HBase table, both to reduce the expansion factor and to speed up range queries and random reads. This degrades HBase into a key-value store in exchange for read/write performance, but most applications still want the storage structure to stay as close as possible to the logical structure, or to the table structure of a relational database, so they have to compress the data with the Snappy algorithm or use the row-prefix compression of HFile v2 to reduce redundancy. Both approaches, and the compression algorithm in particular, significantly hurt HBase's random read performance: to compress efficiently, the algorithm needs to maintain a suitably sized buffer and compress its contents into a compressed block. The default block size of an HBase store file is 64 KB, while the Snappy compression buffer is 256 KB, which greatly increases the amount of data that has to be processed for each random read, further degrading HBase's already weak random read performance.
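The mitigations mentioned above can be configured when a table is created. The following is a minimal sketch with the HBase 2.x admin API, using invented table and family names: a one-byte column family, Snappy compression, prefix data-block encoding, and an explicit 64 KB block size.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateCompactTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            ColumnFamilyDescriptorBuilder cf =
                ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("f")) // one-byte family name
                    .setCompressionType(Compression.Algorithm.SNAPPY)        // trades CPU for disk and I/O
                    .setDataBlockEncoding(DataBlockEncoding.PREFIX)          // row-prefix encoding in HFiles
                    .setBlocksize(64 * 1024);                                // default 64 KB block size, set explicitly
            admin.createTable(TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("ui"))                     // short, hypothetical table name
                    .setColumnFamily(cf.build())
                    .build());
        }
    }
}
```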

Introduction and characteristics of the recommendation system

Sohu's recommendation engine system was built up gradually from nothing through a very intense period of development. It currently ingests user behavior logs, handles an information volume of millions of items per day, and processes and stores tens of thousands of user log entries per second in real time. On top of this data volume it serves a large number of recommendation requests and related-news requests per second, with the recommendation response latency kept within 70 ms. At the same time, the system is required to complete the path from a log entry arriving to the user model being corrected within 10 seconds.

This 10-second real-time feedback is the main difficulty of the current system. We have to maintain roughly 200 GB of short-term attribute information for users and, driven by the real-time changes in these behavioral attributes, update the article topics each user is interested in and compute the interest groups each user belongs to in real time, so as to complete both short-term-interest-driven content recommendation and user-group collaborative recommendation.

A user's short-term interest attributes must be updated frequently according to the user's operations, such as clicks and pull-to-refresh. As soon as the system receives a user's log entry, it must look up the corresponding article information, fetch the user's related attribute data, and increase or decrease the weights of all related attributes according to the operation. Weight-increasing operations include clicks, browsing time, and the number of screens viewed; the main weight-decreasing operation is recommendation exposure. The updated data is written back to the user store in real time, and every recommendation request fetches the user's short-term interest model directly from that store, so the model reflects what the user is browsing and reading right now. In addition, there are some low-frequency operations, such as recording users' browsing history and periodically computing hot articles; these are done on HBase.
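The weighting code itself is not shown in the article; the following is only a hypothetical sketch of the read-modify-write it describes, with invented table, family, and column names. In the production system these reads and writes actually go to the in-memory MEMT store introduced later, not directly to HBase.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class InterestWeightUpdate {
    // Apply a weight delta (positive for a click, negative for an exposure) to one topic.
    static void applyDelta(Table profiles, String userId, String topic, double delta) throws Exception {
        byte[] row = Bytes.toBytes(userId);
        byte[] fam = Bytes.toBytes("s");              // hypothetical short-term-interest family
        byte[] col = Bytes.toBytes(topic);
        Result current = profiles.get(new Get(row).addColumn(fam, col));
        double weight = current.isEmpty() ? 0.0
                : Bytes.toDouble(current.getValue(fam, col));
        Put put = new Put(row).addColumn(fam, col, Bytes.toBytes(weight + delta));
        profiles.put(put);                            // write the adjusted weight back
    }

    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table profiles = conn.getTable(TableName.valueOf("user_short_interest"))) { // hypothetical
            applyDelta(profiles, "u12345", "sports", +1.0);   // a click adds weight
            applyDelta(profiles, "u12345", "finance", -0.2);  // an exposure reduces weight
        }
    }
}
```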

The most stringent requirement on the system is handling tens of thousands of user log entries per second. A single log entry touches roughly 5 to 10 article attributes, and even the simplest attribute update has to read the user's existing attributes, apply the weight increase or decrease, and write the result back to the attribute table. As a result, the storage system performs on the order of hundreds of thousands of random reads and writes per second just to process logs. The system must also handle the recommendation requests arriving every second, many of which need to read the current short-term model of a user, and the response time must stay within 70 ms; within that budget, a random disk seek, a read that has to hit disk, or a JVM GC pause are all things the storage system tries hard to avoid.

Meeting demanding random read/write requirements

At present, the core pressure of the whole system falls on HBase, and the data HBase reads and writes is the users' short-term attributes. The biggest problem with native HBase is that random reads and writes are too slow. To meet the needs of the current application, we developed a data storage system on top of HBase that makes full use of memory. The following two parts describe how this memory-based storage system and HBase together carry the enormous front-end load of inserting, deleting, updating, and querying data.

MEMT bears the core system pressure

Because the in-memory data storage system built on top of HBase lives in a package named memtable in our code, we refer to it here as MEMT. MEMT is currently deployed as a single cluster of 10 servers that act as hot standbys for one another, and it mainly stores 200 GB of users' short-term interest data together with summary information for articles from the last 30 days.

MEMT's main capabilities include supporting nearly 200,000 update operations per second on a single server, the same row/column-family/column table structure as HBase, TTL- and timestamp-based data management, and all of HBase's filters for data filtering. It also packages up some functions the system uses frequently, such as taking the top N columns or column values within a row, and smoothing data over time and computing decay.

To ensure availability, MEMT keeps two copies of each memory table within a single cluster; when a node goes down, the client automatically switches to the copy that is still available, so the application is generally unaware of the outage. MEMT also reuses HBase's own load balancer and crash-recovery strategy for regions to manage its own in-memory data shards. Because the client switches to the available copy quickly when one replica becomes unavailable, there is no waiting for a session to expire as there is when an HBase RegionServer crashes. All the data shards on a node that has stopped serving are reassigned to other servers in the cluster, and the servers receiving the new shards begin loading data into memory while continuing to serve. The replicas in the cluster synchronize their memory through a log table in HBase: a client can choose to write data only to the log table, or to force the write to be flushed into each server's memory to synchronize the data. The log table is hashed into 40 regions spread across the cluster, so after a server goes down its data is divided among the other servers and the whole cluster works together to restore the failed server's in-memory data, which makes recovery very fast. After replaying the recent log, the contents of the dump table also have to be restored; that process is described in detail below. Currently, when one server in the online cluster dies, the time from detecting it in the logs to restoring roughly 20 GB of memory data is under one minute.

When in-memory data grows beyond the user-configured threshold (currently 25 GB), the system sorts regions by size and, starting from the largest region, evicts in-memory data into the corresponding HBase dump table according to an LRU rule, setting the row's dump flag to true in memory. When such a row is read again, the contents of the dump table are loaded back into memory and merged with the in-memory data by timestamp, and the dump flag is reset to false. While the dump flag is true, updates to that row are also written directly into the dump table to save memory. The HBase regions of the dump table and the corresponding MEMT data shards are assigned to the same server to keep their interaction fast.
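MEMT's code is not public, so the following is purely a hypothetical sketch, with invented types, of the read path described above: when the dump flag is set, the evicted versions are reloaded from the dump table and merged with the in-memory versions by timestamp before the flag is cleared.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory row: timestamp -> value versions, plus a dump flag.
class MemRow {
    final NavigableMap<Long, byte[]> versions = new TreeMap<>();
    volatile boolean dumped;          // true if older versions were evicted to the HBase dump table
}

interface DumpTable {
    // Hypothetical accessor over the HBase dump table for one row.
    NavigableMap<Long, byte[]> load(byte[] rowKey);
}

class MemtReadPath {
    // Read a row, merging dumped versions back in when the dump flag is set.
    NavigableMap<Long, byte[]> read(byte[] rowKey, MemRow row, DumpTable dump) {
        if (row.dumped) {
            // Reload evicted versions and merge by timestamp; in-memory versions win on conflict.
            NavigableMap<Long, byte[]> merged = new TreeMap<>(dump.load(rowKey));
            merged.putAll(row.versions);
            row.versions.clear();
            row.versions.putAll(merged);
            row.dumped = false;        // the row is now fully resident in memory again
        }
        return row.versions;
    }
}
```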

The contents of the system log table are marked to expire after 6 hours, and every 4 hours the system takes a snapshot of the data in memory. The snapshot process is similar to dumping data into the dump table when memory runs short, except that it does not change the dump flag of any row. Once a memory shard has completed a snapshot, the log entries written before the snapshot can be discarded, and the data can be recovered directly from the snapshot.

In addition, the system requires every recommendation request to be answered within 70 ms. To let MEMT handle thousands of requests per second without producing large amounts of memory fragmentation and frequent GC, we rewrote HBase's RPC layer and designed a cache for the classes, such as the connections and handlers, that make up the RPC path and account for most of the request memory. As long as the sizes of RPC requests and responses stay stable for a while, connections and handlers can reuse almost all of the data structures that would otherwise be discarded, essentially eliminating memory garbage. At one point we removed the RPC reader role altogether and had handlers process all requests and return results directly, which reduced the memory used to process requests; however, without the request queue the ordering of requests was no longer guaranteed, first-come-first-served could not be ensured, and clients would occasionally see requests with abnormally high latency.

Use of HBase

Our principles for using HBase are as follows.

1. Avoid transactional applications. By default, HBase only guarantees the ordering and consistency of each user's operations on a single row. If an application needs cross-row or even cross-table transactions, it has to hold locks on multiple rows from the client side; under HBase's highly concurrent data access this easily leads to deadlocks that block data access, along with a variety of client-side problems. If the application needs to lock a table segment or even a whole table, the lock requests have to be handled on the server side via a coprocessor or by modifying the region code. Such operations are dangerous: they can exhaust all the RegionServer handler threads across the cluster in circular waits and bring the entire cluster to a stop.

At present, the cheapest way to handle transactions on HBase is through data versions: different operations write under different transaction IDs, and reads filter out the versions that belong to unfinished transactions. In short, using HBase for transactional or strongly consistent applications is a poor fit and goes against HBase's design goal of high-throughput data access.
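No code is given in the article for this version-based scheme. One rough sketch, under the simplifying assumption that the cell timestamp is used to carry the transaction ID (the family and qualifier names are invented, and the column family must be configured to retain enough versions), might look like this:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.TimestampsFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedWrites {
    // Write a cell tagged with its transaction ID, using the cell timestamp to carry it.
    static void writeWithTxn(Table table, byte[] row, long txnId, byte[] value) throws Exception {
        table.put(new Put(row)
                .addColumn(Bytes.toBytes("f"), Bytes.toBytes("v"), txnId, value));
    }

    // Build a scan that reads back only the versions written by transactions known to be committed.
    static Scan scanCommitted(List<Long> committedTxnIds) {
        Scan scan = new Scan();
        scan.readVersions(Integer.MAX_VALUE);                   // keep all versions visible to the filter
        scan.setFilter(new TimestampsFilter(new ArrayList<>(committedTxnIds)));
        return scan;
    }
}
```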

If an application has higher transactional requirements, a traditional relational database or one of the newer NewSQL databases can be chosen instead. For example, the in-memory database VoltDB binds its processing threads to CPUs and data partitions; every data-modifying operation is first sent to the master replica among the multiple replicas, the master replica's management thread determines a single global order, and each replica then executes the operations independently. A transaction that would normally need multiple locks to be taken and released can thus complete entirely lock-free, and the measured transactions per second far exceed those of a typical relational database, making it a good choice for OLTP applications.

2. Avoid writing large volumes of data over long periods, and balance the cluster load. Because HBase has to merge written data through compactions to keep read performance acceptable, and compactions consume system resources, it is best to control the compaction time of data tables manually so the system can serve stably, and to shrink the volume of written data to cut the system's I/O consumption: users can enable HFile prefix compression and shorten row keys, column families, and column names, and a well-designed primary key scatters writes across all servers to relieve pressure. Automatic compaction should be turned off and instead triggered on a rolling schedule during low-traffic periods. It is also best to disable HBase's split function and pre-partition the table into data shards when defining it; this avoids the low early read/write throughput of a new table with too few regions and sidesteps the many problems splits can cause. Finally, users are best served by implementing their own balancing, for example balancing at table granularity, which spreads the load across the cluster more quickly.
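As a concrete illustration of pre-splitting and manual compaction control, the following sketch uses the HBase 2.x admin API with an invented table name: it pre-splits the table into 40 regions, disables automatic splitting and periodic major compaction for that table, and triggers a major compaction explicitly, as a scheduled job in a low-traffic window might.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.RegionSplitter;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        TableName name = TableName.valueOf("user_log");   // hypothetical table name
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Pre-split into 40 regions over hex-encoded row keys so writes spread across the cluster.
            byte[][] splits = new RegionSplitter.HexStringSplit().split(40);

            admin.createTable(TableDescriptorBuilder.newBuilder(name)
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("f"))
                    // Turn off automatic region splitting; the regions are fixed by the pre-split above.
                    .setRegionSplitPolicyClassName(
                            "org.apache.hadoop.hbase.regionserver.DisabledRegionSplitPolicy")
                    // Disable periodic automatic major compaction for this table.
                    .setValue(HConstants.MAJOR_COMPACTION_PERIOD, "0")
                    .build(), splits);

            // Triggered later from a scheduled job during a low-traffic window.
            admin.majorCompact(name);
        }
    }
}
```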

3. Ensure the availability of the meta table. The regions of every user table in HBase rely on the meta table to locate their current position, so the availability of the meta table determines whether the whole cluster can serve requests at all. To keep the meta table available, we regularly move it to the server in the cluster with the lowest load and the least memory pressure, and when moving it we flush the latest changes to the file system to prevent loss of meta data.
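For reference, relocating the meta region with the HBase 2.x admin API might look like the sketch below; the target server name is a placeholder, and the exact signature of move() differs slightly between 2.x versions.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionInfo;

public class MoveMetaRegion {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Persist the latest meta edits to the file system before relocating.
            admin.flush(TableName.META_TABLE_NAME);
            // The lightly loaded server chosen as the target; "host,port,startcode" placeholder values.
            ServerName target = ServerName.valueOf("lowload-host.example.com,16020,1700000000000");
            for (RegionInfo region : admin.getRegions(TableName.META_TABLE_NAME)) {
                admin.move(region.getEncodedNameAsBytes(), target);  // relocate hbase:meta
            }
        }
    }
}
```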

4. Reduce the pressure on the ZooKeeper nodes. The scheduling order of all HBase service nodes and data shards, the runtime sessions of all service nodes, and the clients' lookups of service node addresses all go through ZooKeeper, and the ZooKeeper nodes also have to synchronize all of this data among themselves, so lowering their load and keeping them reachable over the network is very important. When server resources allow, it is generally recommended to deploy the Master, Backup Master, and ZooKeeper nodes together, and not to run RegionServers or other resource-hungry processes on those machines.

5. Avoid random reads, and use caching to reduce the latency of hot data. In the recommendation system, the data that is read and updated most frequently and has the strictest real-time requirements, the users' short-term interest data, is kept in MEMT, while data that is larger but updated less often is stored in HBase, for example the raw data of all news articles and the long-term interest models of all users. Such data is basically never updated after it is written, so the front-end recommendation servers can cache the hot portion locally for long periods without visiting HBase; acceleration of this data is essentially a local cache inside each application.
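The article does not name a caching library; one hypothetical way to cache rarely changing article data inside the recommendation server is a loading cache in front of HBase, as sketched below with Guava and invented table and key names.

```java
import java.util.concurrent.TimeUnit;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ArticleCache {
    private final Table articles;
    private final LoadingCache<String, Result> cache = CacheBuilder.newBuilder()
            .maximumSize(100_000)                      // keep the hot articles resident locally
            .expireAfterWrite(6, TimeUnit.HOURS)       // articles rarely change once written
            .build(new CacheLoader<String, Result>() {
                @Override
                public Result load(String articleId) throws Exception {
                    return articles.get(new Get(Bytes.toBytes(articleId))); // fall back to HBase on a miss
                }
            });

    ArticleCache(Connection conn) throws Exception {
        this.articles = conn.getTable(TableName.valueOf("article_raw"));    // hypothetical table
    }

    Result get(String articleId) throws Exception {
        return cache.get(articleId);
    }

    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create())) {
            ArticleCache cache = new ArticleCache(conn);
            System.out.println(cache.get("news_20240101_0001").isEmpty());  // hypothetical article key
        }
    }
}
```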

6. Prevent RegionServers from entering a zombie state. Normally, when a RegionServer process hangs or exits because of GC or other reasons, the session it maintains in ZooKeeper expires and the Master's data recovery process kicks in. In rare cases, however, we have also seen a RegionServer become unable to serve while its session does not expire, which leaves some data inaccessible. To guard against this, our monitoring process periodically reads the first row of every region and calls a script to restart the RegionServer when there is no response or a timeout, so that a faulty service node is discovered quickly and its data is reassigned quickly. In addition, because RegionServers quite commonly go down because of GC, we periodically restart all RegionServers on a rolling basis, which also rebalances the cluster load.
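The monitoring script itself is not shown in the article; a hypothetical probe that reads the first row of every region, in the spirit described above, might look like this:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionInfo;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class RegionProbe {
    // Returns false if reading the first row of any region of the table fails or times out.
    static boolean probe(Connection conn, TableName tableName) {
        try (Admin admin = conn.getAdmin();
             Table table = conn.getTable(tableName)) {
            for (RegionInfo region : admin.getRegions(tableName)) {
                Scan scan = new Scan()
                        .withStartRow(region.getStartKey())   // start at the region boundary
                        .setLimit(1)                          // only the first row matters
                        .setCaching(1);
                try (ResultScanner scanner = table.getScanner(scan)) {
                    scanner.next();                           // a hang or exception marks the probe as failed
                }
            }
            return true;
        } catch (Exception e) {
            return false;                                     // the caller restarts the RegionServer via a script
        }
    }

    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create())) {
            System.out.println(probe(conn, TableName.valueOf("user_short_interest"))); // hypothetical table
        }
    }
}
```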

Many points to watch out for when using HBase have been described above, but in practice HBase is stable most of the time and performs well: its sequential scans and writes can reach tens of thousands of operations per second. The system currently also runs many data-warehouse-style processing jobs on HBase, because the data involved already lives there, such as converting data structures between sources, joining log data with user profile data, and computing the heat distribution of articles. Most of these jobs use HBase's sequential reads and writes; although the data volumes are fairly large, they have not put excessive pressure on the online system, and running them directly on HBase simplifies the overall system.

In short, HBase can use large numbers of inexpensive PC servers to deliver excellent high-concurrency, high-volume read and write performance. Even without fine-grained optimization, simply adding servers multiplies read/write capacity and improves the system's processing capacity and stability.

Other modules of the system

The other modules of the system currently include Kafka queues used to transfer logs and other messages, offline computation with Hive, Pig, and Mahout, and other systems for operating and maintaining the user models. Kafka's read and write performance is very good, but it can reorder messages and deliver them more than once. At present, all of the system's statistics are produced by processing logs with Hive; Hive's development cost is very low, it is easy to use, and it is highly productive. Pig is mainly used for initial log cleaning, and Mahout for user model computation.

Conclusion

The content recommendation engine system integrates a large number of open source systems, standing on the shoulders of giants to reach its current results.

Compared with other NoSQL systems (such as Redis, MongoDB, and Cassandra), HBase, built on HDFS, does not support complex transactions; the biggest consideration in its original design was scalability. Its design is cluster-oriented from the outset, its scaling and recovery mechanisms are clear and efficient, and its load distribution based on horizontal sharding is easy to adjust.

These qualities lower the difficulty of designing our system: good scalability means that as the user base multiplies we do not suddenly have to solve data sharding, scheduling, synchronization, reliability, and a whole series of related problems ourselves. Growing the cluster linearly in step with the user base is the cheapest way to scale the system.

At the same time, HBase's simple and clear code structure makes it feasible for us to fix its various problems ourselves or do custom secondary development. Many powerful components in HBase, such as the Bloom filter, HFile, and the RPC layer, have also been pulled out and reused in other systems. The current HBase deployment, and the series of derivative systems built on it, have been able to meet the most demanding requirements and have provided service for a long time at low load and in a steady state.
