- Hadoop ecosystem
- ZooKeeper is responsible for coordination; HBase must rely on ZooKeeper
- Flume is a log-collection tool
- Sqoop is responsible for data transfer between HDFS and relational databases (DBMS)
- About HBase
The Hadoop database
- A highly reliable, high-performance, column-oriented, scalable, real-time read/write distributed database
- Uses Hadoop HDFS as its file storage system, Hadoop MapReduce to process HBase's massive data, and ZooKeeper as its distributed coordination service
- Used primarily to store unstructured and semi-structured loose data (a NoSQL database)
HBase Data Model
Row Key
- Uniquely identifies a row of data
- Rows are sorted lexicographically by row key (byte order)
- A row key can hold at most 64 KB of byte data
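The byte-order sorting above is worth internalizing: it is why numeric row keys are usually zero-padded. A minimal sketch in plain Python (not the HBase API) of how HBase orders row keys:

```python
# HBase sorts rows by the raw bytes of the row key, not numerically.
# b"row-2" therefore sorts AFTER b"row-10", because 0x32 ('2') > 0x31 ('1').
row_keys = [b"row-10", b"row-2", b"row-01"]
sorted_keys = sorted(row_keys)  # lexicographic byte order, as in HBase
print(sorted_keys)
```

Zero-padding (`row-02`, `row-10`) restores the intuitive numeric ordering.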
Column Family & Qualifier (columns)
- Every column in an HBase table belongs to a column family, and column families must be declared as part of the table schema definition, e.g. create 'test', 'cf' in the HBase shell;
- Column names are prefixed with their column family; each column family can have multiple column members, such as test:testfirst, and new column members can then be added dynamically on demand;
- Permission control, storage, and tuning are all done at the column-family level.
- HBase stores data from the same column family in the same directory, saved in several files
Timestamp
- In HBase, each cell can store multiple versions of the same data. Versions are distinguished by a unique timestamp, and the different versions of a cell are sorted in reverse chronological order, with the most recent version first.
- The timestamp is a 64-bit integer.
- The timestamp can be assigned by HBase (automatically, when the data is written), in which case it is the current system time in milliseconds.
- Timestamps can also be assigned explicitly by the client; if the application wants to avoid data version conflicts, it must generate unique timestamps itself.
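The reverse-chronological versioning can be sketched in a few lines of plain Python (a conceptual model, not HBase's implementation; the millisecond timestamps are made up for illustration):

```python
# Multiple versions of one cell, keyed by a 64-bit millisecond timestamp.
versions = {
    1700000000000: b"v1",
    1700000001000: b"v2",
    1700000002000: b"v3",
}
# HBase returns versions newest-first; a plain read sees the latest value.
newest_first = sorted(versions.items(), key=lambda kv: kv[0], reverse=True)
print(newest_first[0])  # the most recent (timestamp, value) pair
```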
Cell
- Determined by the intersection of the row and column coordinates;
- Cells are versioned;
- The content of a cell is an uninterpreted array of bytes;
- Uniquely determined by {row key, column (=<family> + <qualifier>), version}.
- The data in a cell has no type and is stored entirely as raw bytes.
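The {row key, column, version} addressing above can be modeled as a dictionary keyed by that triple (a toy sketch, not the HBase API; the row/column names are hypothetical):

```python
# A cell is located by (row key, family:qualifier, version);
# the stored value is just an uninterpreted byte array.
store = {}
coord = (b"row1", b"cf:name", 1700000000000)  # (row key, column, version)
store[coord] = b"Alice"                        # value kept as raw bytes
print(store[(b"row1", b"cf:name", 1700000000000)])
```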
HLog (WAL, write-ahead log)
- The HLog file is an ordinary Hadoop SequenceFile. The SequenceFile key is an HLogKey object, which records the attribution of the written data: in addition to the table and region names, it includes a sequence number and a timestamp. The timestamp is the write time; the sequence number starts at 0, or at the last sequence number persisted to the file system.
- The value of the HLog SequenceFile is HBase's KeyValue object, which corresponds to the KeyValue in an HFile.
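The (HLogKey, KeyValue) record structure described above can be sketched as an append-only list (a conceptual model only, not HBase's file format; the table/region names and timestamps are hypothetical):

```python
# A WAL as an append-only list of (HLogKey-like tuple, KeyValue-like payload).
# The key carries table name, region name, a monotonically increasing
# sequence number, and the write timestamp.
wal = []

def wal_append(table, region, seq, ts, kv):
    """Append one record; in HBase this happens before the in-memory write."""
    wal.append(((table, region, seq, ts), kv))

wal_append(b"test", b"region-1", 0, 1700000000000, (b"row1", b"cf:name", b"Alice"))
wal_append(b"test", b"region-1", 1, 1700000001000, (b"row1", b"cf:name", b"Bob"))
print(len(wal), wal[0][0][2], wal[1][0][2])  # sequence numbers increase
```

Because the log is append-only and ordered by sequence number, a crashed region server's writes can be replayed in order.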
HBase architecture
Client
- Includes an interface for accessing HBase and maintains a cache to speed up access to HBase
ZooKeeper
- Ensures that there is only one Master in the cluster at any time
- Stores the addressing entry for all regions
- Monitors region servers going online and offline in real time and notifies the Master
- Stores HBase's schema and table metadata
Master
- Assigns regions to region servers
- Responsible for load balancing across region servers
- Discovers failed region servers and reassigns the regions that were on them
- Manages user requests to add and delete tables
RegionServer
- A region server maintains regions and processes IO requests to those regions
- A region server is responsible for splitting regions that have grown too large during operation
Region
- HBase automatically divides a table horizontally into regions; each region holds a contiguous slice of the table's data
- Each table starts with only one region; as data is inserted, the region grows, and when it reaches a threshold it is split into two new regions (fission)
- As the rows in the table grow, there are more and more regions, so a complete table ends up stored across multiple region servers.
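The fission process can be sketched as a toy model (plain Python, not HBase's logic; the real trigger is a size-based threshold, and the row-count threshold here is a hypothetical stand-in):

```python
# Toy region fission: once a region exceeds a threshold, split it into two
# contiguous halves at the midpoint row key.
SPLIT_THRESHOLD = 4  # hypothetical threshold, for illustration only

def maybe_split(region_rows):
    """Return [region] unchanged, or two half-regions once the threshold is crossed."""
    if len(region_rows) <= SPLIT_THRESHOLD:
        return [region_rows]
    mid = len(region_rows) // 2
    return [region_rows[:mid], region_rows[mid:]]  # two contiguous key ranges

rows = sorted([b"r1", b"r2", b"r3", b"r4", b"r5", b"r6"])
print(maybe_split(rows))
```

Note that each half still covers a contiguous row-key range, preserving the invariant that a region holds a contiguous slice of the table.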
MemStore and StoreFile
- A region is made up of multiple stores, and each store corresponds to one column family (CF)
- A store consists of an in-memory MemStore and StoreFiles on disk. Writes go first to the MemStore; when the data in the MemStore reaches a threshold, the HRegionServer starts the flush process to write it out as a StoreFile, and each flush produces a separate StoreFile
- When the number of StoreFiles grows past a threshold, the system merges them (minor/major compaction); during a major compaction, versions are merged and deletions are applied, producing one larger StoreFile
- When the total size and number of all StoreFiles in a region exceed a threshold, the current region is split in two, and the HMaster assigns the new regions to region servers, achieving load balancing
- When a client retrieves data, it first looks in the MemStore, then in the StoreFiles
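The write path (MemStore, then flush) and read path (MemStore first, then StoreFiles) can be sketched together (a conceptual model, not HBase's implementation; the flush threshold here is hypothetical):

```python
# Writes land in an in-memory MemStore; when it reaches a threshold it is
# flushed as a new immutable StoreFile. Reads check the MemStore first,
# then StoreFiles from newest to oldest.
FLUSH_THRESHOLD = 2  # hypothetical count-based threshold (real one is size-based)

memstore = {}
storefiles = []  # flushed immutable snapshots, newest last

def put(key, value):
    memstore[key] = value
    if len(memstore) >= FLUSH_THRESHOLD:
        storefiles.append(dict(memstore))  # flush: write out a new StoreFile
        memstore.clear()

def get(key):
    if key in memstore:                    # 1) look in the MemStore
        return memstore[key]
    for sf in reversed(storefiles):        # 2) then StoreFiles, newest first
        if key in sf:
            return sf[key]
    return None

put(b"a", b"1")
put(b"b", b"2")  # reaches the threshold and triggers a flush
put(b"c", b"3")
print(get(b"a"), get(b"c"))  # one hit from a StoreFile, one from the MemStore
```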
HRegion
- HRegion is the smallest unit of distributed storage and load balancing in HBase. "Smallest unit" means that different HRegions can be distributed across different HRegionServers.
- An HRegion consists of one or more stores, each store holding one column family
- Each store is made up of one MemStore and zero or more StoreFiles. StoreFiles are stored in HFile format on HDFS.
HBase, part of the Big Data Learning series