Hadoop HBase Case Studies: Hadoop Learning Notes (Part 2)

Last Update:2015-08-28 Source: Internet

Author: User

Tags opentsdb

I was fortunate enough to take the Hadoop trial class offered through MOOC College.
These are my notes on Chapter 8 of the Little Elephant College Hadoop 2.x overview course.
The chapter covers application case studies for HBase, a distributed database.

Case Overview:

1) Time series database (OpenTSDB)
An open-source database that uses HBase to store time series data at fine time resolution
2) HBase crawler scheduling library
Vertical search crawlers
Mass crawlers (whole-web crawlers)
The case here defines a URL scheduler for crawlers
3) HBase document library
A document database, with the emphasis on storage
4) Bank RMB inquiry system

If you are reading this anywhere other than cnblogs (Blog Park), note that this post belongs to http://www.cnblogs.com/weibaar.

Only the cnblogs version is guaranteed to have a clean layout with code blocks and images displayed correctly. If you repost it on another site, please keep the author information and respect the copyright.

Applying HBase to practical problems:

HBase fits applications that need random reads and writes, high concurrency (many operations over large data), or data that is simple in structure but large in volume (non-relational data, where a relational design would need many join operations).
Relational-style queries such as joins are difficult to do in HBase.
The key is to design the rowkey so that queries are fast.
The usual client language is Java; other languages can operate on HBase through Thrift.

When designing the rowkey, avoid rowkey hot spots, make full use of the rowkey's sort order, and combine the fields you need to query on into the rowkey.

Time Series Database

OpenTSDB is a distributed, scalable time series database.
It collects data at second-level granularity, supports permanent storage and capacity planning, and stores and indexes data from different metrics.
An ordinary MySQL setup does not have enough capacity and does not support enough dimensions.
Lessons learned with this database (this list probably has omissions):
1) More columns holding more data make scans faster (scanning across columns is faster than scanning across rows)
2) Keep each row of data relatively independent. Split rows by a fixed rule (for example, one row per 10 seconds of data, keyed by timestamp)
3) Store more data in each KeyValue
4) Avoid the synchronous clients (HTable/HTablePool, etc.) in the server; use asynchbase instead to handle highly concurrent database access
5) Key as long as possible
6) do not store too much in a region?

Methods for storing time series

Each row holds one metric and time, plus values that can be stored under different dimensions.
Putting the metric ID in front of the time to form the combined key makes scanning a given metric faster, and it also saves storage space (the key uses the metric's numeric ID rather than its name).
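
The key layout described above can be sketched in a few lines of Python. This is only an illustration, not OpenTSDB's actual schema: the field widths (4 bytes each), the METRIC_IDS table, and the function names are all assumptions made for the example. It shows why placing the metric ID first keeps one metric's rows contiguous, so a time window becomes a simple start/stop key range.

```python
import struct

# Hypothetical name-to-ID table; a real system would persist this mapping.
METRIC_IDS = {"cpu.user": 1, "mem.free": 2}


def row_key(metric, timestamp):
    """Build a key: metric ID (4 bytes, big-endian) then timestamp (4 bytes)."""
    return struct.pack(">II", METRIC_IDS[metric], timestamp)


def scan_range(metric, start_ts, end_ts):
    """Return (inclusive start key, exclusive stop key) for one metric's window."""
    return row_key(metric, start_ts), row_key(metric, end_ts)


# Keys sort bytewise, the same way HBase orders rowkeys.
keys = sorted(
    row_key(metric, ts)
    for metric in ("mem.free", "cpu.user")
    for ts in (1440720020, 1440720000, 1440720010)
)

# Emulate a range scan over the sorted keys.
start, stop = scan_range("cpu.user", 1440720000, 1440720020)
selected = [k for k in keys if start <= k < stop]
```

Big-endian packing matters here: it makes the byte order of the keys match the numeric order of the IDs and timestamps.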

You can also widen the rows so that each row stores more data (columns +0, +1, +2, ...), but this does not save any space; it only changes the layout.
A row cannot be widened indefinitely, however.
In addition, so that a network interruption does not leave data in the wrong row, it is recommended to split rows by timestamp rather than by counting time +1, +2, +3.
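
As a rough sketch of splitting rows by timestamp, the row and column for a data point can both be derived from the point's own timestamp. The 10-second ROW_SPAN follows the example given earlier in these notes; the function names are hypothetical. Because nothing depends on a running counter, a point that is re-sent or arrives late after an interruption still lands in the same row and column.

```python
from collections import defaultdict

ROW_SPAN = 10  # seconds of data per row (assumed, as in the 10-second example)


def split_timestamp(timestamp, row_span=ROW_SPAN):
    """Return (row base time, column offset within the row)."""
    base = timestamp - timestamp % row_span
    return base, timestamp - base


def group_into_rows(points, row_span=ROW_SPAN):
    """Group (timestamp, value) points into rows of {offset: value}."""
    rows = defaultdict(dict)
    for timestamp, value in points:
        base, offset = split_timestamp(timestamp, row_span)
        rows[base][offset] = value
    return dict(rows)


points = [(1440720003, 0.5), (1440720007, 0.7), (1440720012, 0.9)]
rows = group_into_rows(points)
```

Here the first two points share the row starting at 1440720000 (offsets 3 and 7), and the third falls into the row starting at 1440720010 (offset 2).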
A PDF covering this material is available; it can be found by searching online.

Summary

Widening rows can speed up scans and makes good use of the rowkey, but neither saves space.
Only merging columns and shortening the column family name reduce storage space to some extent.

Vertical Crawler Scheduling Library

Multiple groups (newsgroups, etc.) are processed at the same time and stored in the dispatch library; the crawlers only need to read from HBase at regular intervals.

Characteristics

The crawler needs to fetch URLs according to a stored schedule based on freshness, priority, and other factors.
It also needs to maintain a URL list for each group.
The scheduler behaves basically like a queue: URLs inserted first are crawled first. Priorities can also be customized, however. And because data volumes differ widely (large images, for example), resources must be allocated sensibly.
Examples include scheduling by vertical business, rate limiting per site, and scheduling by timestamp.

Dispatch Library

Stores host attributes and the per-host URL lists for the different channels.
Rows are sorted by host ID and by the priority within the URL key.
As with OpenTSDB above, the rowkey does not use the name but an ID (taken from the host name table), and rows sort by that ID.
This lets each scan thread take one rowkey interval and process the URLs in it.

Summary:

Make full use of the rowkey's sort order.
Build the useful fields into the rowkey: hostid+pid+urlid.
Do not use strings directly as the rowkey; encode them as integers instead, which helps scanning and saves space (because the rowkey is stored with every column).
Integer encoding also normalizes the keys.
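
A minimal sketch of such a composite key follows, under stated assumptions: the field widths (4-byte host ID, 1-byte priority, 8-byte URL ID), the rule that a lower priority number is crawled first, and the function names are all hypothetical, chosen only to illustrate the hostid+pid+urlid idea.

```python
import struct

KEY_FORMAT = ">IBQ"  # assumed widths: host ID, priority, URL ID (13 bytes)


def url_row_key(host_id, priority, url_id):
    """Pack hostid+pid+urlid into a fixed-width, big-endian key."""
    return struct.pack(KEY_FORMAT, host_id, priority, url_id)


def host_scan_range(host_id):
    """Start/stop keys covering every URL of one host.

    Assumes host_id is below the 4-byte maximum, so host_id + 1 still fits.
    """
    return struct.pack(">I", host_id), struct.pack(">I", host_id + 1)


# Sorting bytewise mimics the order in which HBase stores the rows.
queue = sorted([
    url_row_key(7, 2, 100),
    url_row_key(8, 0, 5),
    url_row_key(7, 0, 300),
    url_row_key(7, 0, 120),
])

# A scan thread assigned to host 7 sees its URLs in priority order.
start, stop = host_scan_range(7)
host7 = [struct.unpack(KEY_FORMAT, k) for k in queue if start <= k < stop]
```

Because every key has the same width, a thread can be handed one host interval and read its URLs already ordered by priority and then URL ID.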

Document Library

The document library works on a principle similar to the dispatch library.
It stores web pages so that more refined data can be extracted and analyzed later.

Characteristics:

Data formats vary, reads and writes (and updates) must happen in real time, and the stored data is interrelated (for example, blog comments are related to the post body).

Technical features

Split the data into basic data and dynamic data (two column families)
Basic data does not change (page title, content, creation time)
Dynamic data can change in real time (page views, etc.)
Here it is no longer one server per group; instead multiple servers serve multiple groups, to meet the different data-refinement requirements of the different groups
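
The two-column-family split can be pictured with a small sketch that maps a page record to "family:qualifier" cells. The family names ("base", "dyn"), the STATIC_FIELDS set, and the function name are assumptions for illustration; no HBase client is involved, only the shape of the data.

```python
BASE_FAMILY = "base"    # hypothetical family for data that rarely changes
DYNAMIC_FAMILY = "dyn"  # hypothetical family for frequently updated data

STATIC_FIELDS = {"title", "content", "created"}


def to_columns(page):
    """Map a flat page record to {'family:qualifier': bytes value} cells."""
    columns = {}
    for field, value in page.items():
        family = BASE_FAMILY if field in STATIC_FIELDS else DYNAMIC_FAMILY
        columns[f"{family}:{field}"] = str(value).encode("utf-8")
    return columns


page = {"title": "HBase notes", "content": "...", "created": "2015-08-28", "views": 42}
cells = to_columns(page)

# A later update touches only the dynamic family.
dynamic_update = to_columns({"views": 43})
```

Keeping the frequently rewritten counters in their own family means updates to them do not rewrite the larger, stable page content.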
Association

Characteristics of the bank RMB inquiry system:

Large scale with widely dispersed equipment (ATMs, cash counters, and so on); data collection must be timely and must not miss any records
The rowkey can be a hash or a reversal of the banknote serial number (serial numbers may be consecutive, so the records for some banknotes would be stored together; the data then cannot be split effectively across storage, which sometimes causes access hot spots, so the serial number has to be transformed before it is used as the rowkey)
Requirements
Timely and reliable, fast retrieval and storage, and good scalability
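
The two transformations mentioned above, reversal and hashing, might look like the following sketch. The serial number format, the 16-bucket salt, and the function names are assumptions for illustration only.

```python
import hashlib


def reversed_key(serial):
    """Reverse the serial so the fastest-changing digit leads the key."""
    return serial[::-1].encode("ascii")


def salted_key(serial, buckets=16):
    """Prefix the serial with a one-byte salt derived from its MD5 hash."""
    digest = hashlib.md5(serial.encode("ascii")).digest()
    salt = digest[0] % buckets
    return bytes([salt]) + serial.encode("ascii")


# Hypothetical consecutive serial numbers, as a counting machine might emit.
serials = ["AB12345670", "AB12345671", "AB12345672"]
reversed_keys = [reversed_key(s) for s in serials]
salted_keys = [salted_key(s) for s in serials]
```

Both keys can be recomputed from the serial number alone, so an exact lookup of one banknote still works. What is given up is the ability to range-scan consecutive serial numbers.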

Because data comes in from many collection devices, Flume + HBase can solve the problem.
HBase was chosen because the application is very simple: it only needs simple queries, and HBase is enough for that.
The Cloudera open-source log collection system can serve as a reference.

Summary

HBase often needs to be used together with other systems.
Avoid creating access hot spots (in particular, avoid using time directly as the rowkey) by breaking up sequential numbers.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
