I was fortunate enough to join the Hadoop trial class that MOOC College ran at Little Elephant College.
These are my overview notes for chapter eight of Little Elephant College's Hadoop 2.X course.
The chapter mainly introduces application cases for HBase, a distributed database.
Case overview:
1) Time series database (OpenTSDB)
Uses HBase to store time series data, with data recorded at every moment; the database is open source
2) HBase crawler scheduling library
Vertical search crawler
Mass crawler (whole-web crawler)
This case defines the URL scheduler for the crawler
3) HBase document library
Document database storage, with the emphasis on storage
4) Bank RMB inquiry system
If you are reading this anywhere other than cnblogs (Blog Park), note that the original post is at http://www.cnblogs.com/weibaar
Only the cnblogs version is guaranteed to have a clean layout with code blocks and pictures displayed correctly. If you repost this elsewhere, please keep the author information and respect the copyright.
Applying HBase to practical problems:
HBase fits when an application needs random reads and writes, or high concurrency (many operations over big data), or when the data structure is simple but the volume is large (non-relational data; a relational design would need a large number of join operations)
Relational queries such as joins are difficult to do in HBase
The key is to design the rowkey so that queries are fast
Java is the usual language; other languages can operate on HBase through Thrift
When designing the rowkey, avoid rowkey hot spots, make full use of the fact that rowkeys are stored in order, and combine the fields a query needs into the rowkey
Time Series Database
OpenTSDB is a distributed, scalable time series database
It captures data at second-level granularity, supports permanent storage and capacity planning, and stores and indexes data from different metrics
An ordinary MySQL setup does not have enough capacity or enough support for dimensions
Lessons learned with this database (there are probably omissions):
1) More columns and more data per row make scans faster (scanning across columns is faster than scanning across rows)
2) Make each row of data relatively independent. Slice rows according to a fixed rule (for example, one row per 10 seconds of data, based on the timestamp)
3) Store more data in each KeyValue
4) Do not use synchronous clients (such as HTable/HTablePool) in the server; prefer asynchbase to handle highly concurrent database access
5) Make the key as long as possible
6) Do not store too much in one region?
Methods for storing time series
Each row holds a metric and a time, plus values, which can be stored under different dimensions
Putting the metric ID in front of the time to form the key makes scanning a given dimension faster and also saves storage space (the key uses a metric number rather than the metric name itself)
You can also widen the rows so that each row stores more data (+0, +1, +2), but this does not save any space; it only changes how the data is laid out.
However, a row cannot be widened indefinitely.
In addition, so that a network interruption does not put data in the wrong row, it is recommended to split rows by timestamp rather than by time +1, +2, +3
There is a corresponding PDF, which can be found by searching the internet.
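As a rough illustration of the key layout described above, here is a minimal Python sketch. It assumes a 3-byte metric ID, a 4-byte row base time, and one row per hour; `ROW_SPAN_SECONDS` and `make_cell` are hypothetical names chosen for this example and are not OpenTSDB's actual API. The row is derived from the timestamp itself, so a lost or delayed point cannot shift later points into the wrong row.

```python
import struct

# Assumption for this sketch: one row covers one hour of data.
ROW_SPAN_SECONDS = 3600


def make_cell(metric_id: int, timestamp: int) -> tuple[bytes, bytes]:
    """Return (rowkey, qualifier) for one data point.

    rowkey    = 3-byte metric ID + 4-byte base time (start of the row's span)
    qualifier = 2-byte offset of the point inside that span
    """
    base = timestamp - timestamp % ROW_SPAN_SECONDS
    offset = timestamp - base
    # Big-endian, fixed-width encoding keeps byte order equal to numeric order.
    rowkey = metric_id.to_bytes(3, "big") + struct.pack(">I", base)
    qualifier = struct.pack(">H", offset)
    return rowkey, qualifier


# Two points ten seconds apart land in the same row, in different columns.
row_a, col_a = make_cell(metric_id=1, timestamp=1_700_000_000)
row_b, col_b = make_cell(metric_id=1, timestamp=1_700_000_010)
print(row_a == row_b, col_a != col_b)
```

Because the metric ID comes first, all rows of one metric sit next to each other in key order, so a scan over one metric for a time range touches one contiguous block of rows.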
Summary
Widening rows increases scan speed and makes good use of the rowkey, but neither of these saves space
Only merging columns and shortening the column family name save space, to a certain extent
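To see why widening rows saves no space while merging columns does, one can estimate cell sizes. The sketch below assumes the classic HBase KeyValue layout (4-byte key length, 4-byte value length, 2-byte row length, row, 1-byte family length, family, qualifier, 8-byte timestamp, 1-byte type, value) and ignores tags, compression, and block encoding. The row, family, and value sizes are made-up numbers for illustration only.

```python
def keyvalue_size(row: bytes, family: bytes, qualifier: bytes, value: bytes) -> int:
    """Approximate serialized size of one HBase KeyValue (cell), in bytes."""
    key_len = 2 + len(row) + 1 + len(family) + len(qualifier) + 8 + 1
    return 4 + 4 + key_len + len(value)


POINTS = 360            # assumption: one point every 10 seconds for an hour
ROW = b"\x00" * 7       # assumption: 3-byte metric ID + 4-byte time
VALUE = b"\x00" * 8     # assumption: 8-byte value per point

# Narrow rows: one cell per row, empty qualifier, full timestamp in the rowkey.
narrow = POINTS * keyvalue_size(ROW, b"t", b"", VALUE)

# Wide rows: same number of cells; every cell still repeats the rowkey.
wide = POINTS * keyvalue_size(ROW, b"t", b"\x00" * 2, VALUE)

# Merged columns: one cell whose qualifier and value concatenate all points.
merged = keyvalue_size(ROW, b"t", b"\x00" * 2 * POINTS, VALUE * POINTS)

# Same wide layout, but with a long column family name.
long_family = POINTS * keyvalue_size(ROW, b"timeseries", b"\x00" * 2, VALUE)

print(narrow, wide, merged, long_family)
```

Under these assumptions the wide layout is no smaller than the narrow one, because each cell carries its own copy of the rowkey and family; only the merged cell and the shorter family name reduce the total.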
Vertical Crawler Scheduling Library
Multiple groups (news groups, and so on) are processed at the same time and stored in the scheduling library, which only needs to be read from HBase at regular intervals
Characteristics
The crawler needs to schedule which URLs to crawl based on timeliness, priority, and other stored criteria.
The crawler also needs to maintain a list of URLs for each group.
The library behaves basically like a queue: the URL inserted first is crawled first. However, priority levels can also be customized. And because data volumes differ greatly (large pictures, for example), resources need to be allocated sensibly.
Examples include scheduling by vertical business, rate limiting for site crawls, and scheduling by timestamp.
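The queue behaviour described above (first in, first crawled, with an optional custom priority) can be sketched in memory as follows. This is only an illustration of the ordering rule, not the HBase-backed scheduler from the course; `UrlQueue` and its default priority are hypothetical, and a smaller number is assumed to mean a higher priority.

```python
import heapq
import itertools


class UrlQueue:
    """In-memory sketch: lower priority number first, FIFO within a priority."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        self._counter = itertools.count()  # insertion order breaks ties

    def push(self, url: str, priority: int = 5) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = UrlQueue()
queue.push("http://example.com/a")
queue.push("http://example.com/b")
queue.push("http://example.com/urgent", priority=1)
print([queue.pop() for _ in range(len(queue))])
```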
Scheduling Library
It stores host features and per-host URL lists for the different channels.
URLs are sorted by host ID and priority
As with OpenTSDB above, the rowkey does not use the name but an ID (taken from the host name table) for sorting
This lets scan threads each work through one interval of URLs
Summary:
Make full use of the rowkey's sorted order
Build the rowkey from useful fields: hostid+pid+urlid
Do not use strings directly as the rowkey; encode them as integers instead, which makes scans faster and saves space (because the rowkey is stored with every column)
Converting to integers also normalizes the keys.
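A minimal sketch of the hostid+pid+urlid idea, using a sorted Python list to stand in for HBase's ordered rowkeys. The field widths (4-byte host ID, 1-byte priority, 8-byte URL ID), the reading of pid as a priority, and the function names are all assumptions made for this example.

```python
import bisect
import struct


def url_rowkey(host_id: int, priority: int, url_id: int) -> bytes:
    """Pack hostid + pid + urlid as fixed-width big-endian integers (13 bytes)."""
    return struct.pack(">IBQ", host_id, priority, url_id)


def scan_host(sorted_keys: list[bytes], host_id: int) -> list[bytes]:
    """Return every key of one host, like a scan over [start, stop).

    Note: host_id + 1 would overflow for the largest 4-byte host ID.
    """
    start = struct.pack(">I", host_id)
    stop = struct.pack(">I", host_id + 1)
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


keys = sorted(
    url_rowkey(h, p, u)
    for h, p, u in [(12, 1, 9), (7, 2, 100), (3, 1, 1), (7, 1, 200), (7, 1, 50)]
)
for key in scan_host(keys, 7):
    print(struct.unpack(">IBQ", key))
```

Fixed-width big-endian integers sort the same way as the numbers they encode, which plain strings do not: as strings, "10" sorts before "9".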
Document Library
The document library works on a similar principle to the scheduling library
It stores web pages so that more refined data can be extracted from them later
Characteristics:
Data formats vary, reads and writes (and updates) must happen in real time, and stored data is related (for example, blog comments are related to the post body)
Technical features
Split basic data and dynamic data (two column families)
Basic data does not change (page title, content, creation time)
Dynamic data can change in real time (page views, and so on)
There is no longer one server per group; instead, multiple servers serve multiple groups, to meet each group's different data refinement requirements
Association
Characteristics of the bank RMB inquiry system:
Large scale with dispersed equipment (ATMs, cash counters, and so on); data collection must be timely and must not miss anything
The rowkey can be built from the banknote's crown number (serial number), either hashed or reversed (crown numbers may be consecutive, so the numbers of some banknotes would be stored together; the data then cannot be split effectively and access hot spots can result, so the crown number needs to be transformed before it is used as the rowkey)
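The two transformations mentioned above, reversal and hashing, can be sketched as follows. The crown number format `AB1234567x`, the function names, and the 2-byte MD5 prefix are assumptions for illustration; the course notes do not specify which hash or key width the real system uses.

```python
import hashlib


def reversed_key(serial: str) -> bytes:
    """Reverse the serial so the fastest-changing digit leads the rowkey."""
    return serial[::-1].encode("ascii")


def hashed_key(serial: str, prefix_len: int = 2) -> bytes:
    """Prefix the serial with a few bytes of its MD5 digest.

    The original serial is kept in the key so it can still be read back.
    """
    digest = hashlib.md5(serial.encode("ascii")).digest()
    return digest[:prefix_len] + serial.encode("ascii")


# Hypothetical consecutive crown numbers, as one batch of banknotes might have.
serials = [f"AB1234567{d}" for d in range(10)]

print(len({s[:1] for s in serials}))                # leading characters of raw keys
print(len({reversed_key(s)[:1] for s in serials}))  # leading bytes after reversal
```

Raw consecutive serials share a long prefix and so sort next to each other, while the reversed keys start with different bytes and spread across the key space. A lookup by exact crown number still works with either transformation, because the same function can be applied to the queried number.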
Requirements
Timely and reliable, with fast retrieval and storage, and good scalability
Because data is collected from many devices, Flume + HBase can solve the problem
HBase was chosen because the application is very simple: only simple queries are needed, and HBase is enough for that
Cloudera's open source log collection system can serve as a reference
Summary
HBase often needs to be used together with other systems
Avoid creating access hot spots (in particular, avoid using time directly as the rowkey) by breaking up sequential numbers
Hadoop-HBase case study: Hadoop learning notes (2)