Basic knowledge:
Goals: 1. Use efficient row and column key design to organize data storage, and use a smooth data persistence strategy to relieve cluster pressure
2. Ensure data consistency with ZooKeeper (leader election)
Technologies to improve performance: data compression, indexing, and materialized views
ZooKeeper monitors the HRegionServers and stores the actual address of the root region and the physical address of the HMaster, relieving distributed application developers of the burden of building coordination services from scratch
The HMaster manages load balancing across HRegionServers
Logs are stored in Hadoop's SequenceFile format
HBase primarily manages the actual data files (HFiles) and the log files (write-ahead logs)
The HRegionServer forwards each request to the appropriate HRegion for processing
Merging and splitting of underlying storage files:
1. Memory stores (MemStores) are continuously flushed to disk; when the number of files on disk reaches a threshold, a compaction thread merges them into larger files
2. When a large file reaches the split threshold, region splitting is triggered in the HRegionServer
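The flush -> compaction -> split flow above can be sketched as a small state machine. This is a minimal illustration, not HBase's actual implementation; the class name and both thresholds (COMPACTION_FILE_COUNT, SPLIT_SIZE_BYTES) are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the flush -> compaction -> split flow. Thresholds are
// illustrative assumptions, not HBase's real defaults.
class StoreFileManager {
    static final int COMPACTION_FILE_COUNT = 3;              // merge when this many files accumulate
    static final long SPLIT_SIZE_BYTES = 10L * 1024 * 1024;  // request a split above this size

    final List<Long> storeFileSizes = new ArrayList<>();
    boolean splitRequested = false;

    // Called when a MemStore flush writes a new file to disk.
    void onFlush(long flushedBytes) {
        storeFileSizes.add(flushedBytes);
        if (storeFileSizes.size() >= COMPACTION_FILE_COUNT) {
            compact();
        }
    }

    // Merge all small files into one large file, then check the split threshold.
    void compact() {
        long total = storeFileSizes.stream().mapToLong(Long::longValue).sum();
        storeFileSizes.clear();
        storeFileSizes.add(total);
        if (total >= SPLIT_SIZE_BYTES) {
            splitRequested = true; // the region split would be triggered here
        }
    }
}
```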
Architecture:
Data is uploaded by the IoT sensor device group to the real-time/historical database server farm, where it is compressed and cached, then written to the HBase server cluster
The ZooKeeper server group provides device registration management for the IoT sensor devices, process monitoring for the real-time/historical database servers, and service coordination for the HBase cluster
Persistence policy:
In plain terms, raw data first passes through a real-time cache for preprocessing, where unqualified or out-of-order data is discarded; the remaining data is cached and compressed as historical data and finally written into HBase
The data cache pool is divided into two blocks of equal size; a lossy-compression thread pool writes to a fixed position in one block, so that while one block is receiving data the other is being written to HBase. Writes are triggered either on a timer or when a threshold is reached
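The two-block cache pool is essentially a double buffer that swaps on flush. A minimal sketch, assuming a string record type, a threshold of 4 records, and a `Sink` interface standing in for the HBase writer (all three are illustrative, not from the source):

```java
import java.util.ArrayList;
import java.util.List;

// Double-buffer sketch: one block receives records while the other
// is drained to HBase. Capacity and Sink are illustrative assumptions.
class DoubleBufferPool {
    interface Sink { void write(List<String> batch); } // stands in for the HBase writer

    static final int FLUSH_THRESHOLD = 4;  // assumed block capacity

    private List<String> receiving = new ArrayList<>();
    private List<String> draining = new ArrayList<>();
    private final Sink sink;

    DoubleBufferPool(Sink sink) { this.sink = sink; }

    // Threshold-triggered path: swap blocks when the receiving block fills up.
    synchronized void append(String record) {
        receiving.add(record);
        if (receiving.size() >= FLUSH_THRESHOLD) {
            flush();
        }
    }

    // Also callable from a timer thread, covering the timed-flush path.
    synchronized void flush() {
        if (receiving.isEmpty()) return;
        List<String> tmp = draining;   // swap the two blocks
        draining = receiving;
        receiving = tmp;
        receiving.clear();
        sink.write(draining);          // drain the full block to HBase
    }
}
```

The swap keeps ingestion and persistence from blocking each other: producers only ever touch the receiving block, and the writer only ever touches the draining block.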
Thinking:
1. Flush and compaction optimization
2. Improvements to the split mechanism
The database uses a narrow-table schema; the row key is a composite key of the form [tag_name][data_timestamp]
Data consolidation is done offline: a MapReduce job merges rows with the same tag_name and time range into a single row; the drawback is the extra cost and resource consumption
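The composite [tag_name][data_timestamp] key can be sketched with fixed-width fields, so HBase's lexicographic row ordering matches time order within each tag. The 16-character tag width and 13-digit millisecond timestamp are illustrative assumptions, not values from the source:

```java
// Sketch of the composite row key [tag_name][data_timestamp].
// Field widths are illustrative assumptions.
class RowKeys {
    static final int TAG_WIDTH = 16;  // assumed fixed tag field width

    static String encode(String tagName, long timestampMillis) {
        if (tagName.length() > TAG_WIDTH)
            throw new IllegalArgumentException("tag too long: " + tagName);
        // pad the tag to a fixed width, then append a zero-padded timestamp
        return String.format("%-" + TAG_WIDTH + "s%013d", tagName, timestampMillis);
    }

    // Query-middleware side: split a row key back into (tag, timestamp).
    static String decodeTag(String rowKey) {
        return rowKey.substring(0, TAG_WIDTH).trim();
    }

    static long decodeTimestamp(String rowKey) {
        return Long.parseLong(rowKey.substring(TAG_WIDTH));
    }
}
```

Zero-padding the timestamp matters: without it, "999" would sort after "1000" lexicographically and scans over a time range would return rows out of order.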
ZooKeeper: 1. Device registration (registers a data node and stores the physical node location information needed to access every acquisition point)
2. When the application layer opens a query transaction, it first decomposes the transaction, obtains the location information, and routes the query request to the correct physical node
3. When a device index changes, the change is submitted and the node completes its own maintenance
Data statistics and analysis module:
for example, the single (first-order) moving-average method
Predict the volume of data that will be written in the next cycle and compare it with the current write volume; if the trend is downward or the forecast file size is small, defer the split request
Query middleware:
1. Interprets the row keys for data items
2. Parses query requests, then merges and packages the results
It also caches the index information needed by applications with heavy real-time query demands
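The middleware's query decomposition can be sketched as translating a (tag, start, end) request into a lexicographic [startKey, stopKey) scan range over the composite [tag_name][data_timestamp] row keys. The fixed field widths below are illustrative assumptions:

```java
// Sketch of query decomposition: one (tag, time range) request becomes
// one half-open scan range per tag. Field widths are assumptions.
class QueryPlanner {
    static final int TAG_WIDTH = 16;   // assumed fixed tag field width

    static String rowKey(String tag, long ts) {
        return String.format("%-" + TAG_WIDTH + "s%013d", tag, ts);
    }

    // Half-open scan range covering [startTs, endTs) for one tag. A multi-tag
    // query would produce one such range per tag, route each to its physical
    // node, and merge the per-node results afterwards.
    static String[] scanRange(String tag, long startTs, long endTs) {
        return new String[] { rowKey(tag, startTs), rowKey(tag, endTs) };
    }
}
```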
HBase-based time series database (improved)