Directory:
1 - Installation of HBase
2 - Java Manipulating HBase Examples
3 - HBase Simple Optimization Techniques
4 - Storage
5 - Cluster: Load Balancing and Failover
6 - MySQL (RDBMS) and HBase in Plain Language
7 - Security and Permissions
1 - Installation of HBase
What is HBase?
HBase is a subproject of Apache Hadoop. HBase relies on Hadoop's HDFS as its most basic storage unit; you can see the structure of the data storage folders by using Hadoop's DFS tools, and you can operate on HBase through the MapReduce framework, as shown in the diagram on the right:
HBase also bundles Jetty in the product. When HBase starts, it launches Jetty in embedded mode, so you can manage and view some of the currently running state through a lightweight web interface.
Why use HBase?
HBase differs from an ordinary relational database: it is a database suited to unstructured data storage. By "unstructured data storage" we mean that HBase reads and writes your big data content using a column-based rather than a row-based model.
HBase stores data in a way that sits between a map entry (key/value pair) and a database row. It is a little like the currently popular memcached, but it is not just a simple key mapped to a value: you probably need to store multiple attributes of a data structure. Yet there are none of the many association relationships of a traditional database table; this is what is called loose data.
Simply put, a table you create in HBase can be seen as one large table, and the columns of this table can grow dynamically as required. There are no cross-table join queries in HBase. All you have to do is tell HBase which column families your data is stored in; you do not need to specify concrete types such as char, varchar, int, tinyint, text and so on. Note, however, that HBase does not provide functionality such as transactions.
Apache HBase is very similar to Google Bigtable: a data row has a sortable key and an arbitrary number of columns. Tables are stored loosely, so users can define all kinds of columns for a row, which is useful in large projects and reduces the cost of design and upgrades.
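As a minimal sketch of this "loose" schema, using the same 0.20-era client API that appears later in this article (the table name "blog" and column family "content" are made up, and the table is assumed to already exist): two rows in one table carry completely different column qualifiers under the same family, with no schema change needed.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;

public class LooseSchemaSketch {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "blog");

        // Row "post1" stores a "content:title" column.
        BatchUpdate row1 = new BatchUpdate("post1");
        row1.put("content:title", "Hello HBase".getBytes());
        table.commit(row1);

        // Row "post2" stores a "content:tags" column instead; qualifiers
        // under a family are simply created on write.
        BatchUpdate row2 = new BatchUpdate("post2");
        row2.put("content:tags", "nosql,hbase".getBytes());
        table.commit(row2);
    }
}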
How to run HBase
Download a stable version of HBase, for example http://mirrors.devlib.org/apache/hbase/stable/hbase-0.20.6.tar.gz from an Apache HBase mirror site, and decompress it after the download completes. Make sure the Java SDK and SSH are installed correctly on your machine, otherwise it will not run correctly.
$ cd /work/hbase
Enter this directory
$ vim conf/hbase-env.sh
export JAVA_HOME=/jdk_path
Edit the conf/hbase-env.sh file and set JAVA_HOME to your JDK installation directory
$ vim conf/regionservers
Enter all of your HBase server names, localhost, or IP addresses
$ bin/start-hbase.sh
Start HBase. During startup you will be asked to enter your password twice (you can also configure passwordless SSH so no password is needed). After a successful start, you will see output as shown in the figure:
$ bin/hbase rest start
After you start the HBase REST service, you can perform REST-style data operations on HBase by issuing the usual REST verbs (GET/POST/PUT/DELETE) against the URI http://localhost:60050/api/.
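As a sketch of what such a REST call can look like from Java, here is a plain HTTP GET against the endpoint above. The resource path /api/tab1 and the table name tab1 are made-up examples; consult the REST service's documentation for the exact URI layout.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class HBaseRestSketch {
    public static void main(String[] args) throws Exception {
        // GET against the service started with "bin/hbase rest start".
        // The /api/tab1 path is an illustrative assumption.
        URL url = new URL("http://localhost:60050/api/tab1");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        System.out.println("HTTP status: " + conn.getResponseCode());
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()));
        for (String line; (line = in.readLine()) != null; ) {
            System.out.println(line);  // raw response body
        }
        in.close();
    }
}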
You can also enter the following command to enter the HQL command mode:
$ bin/hbase shell
$ bin/stop-hbase.sh
Stop the HBase service
Problems at startup
If the hostname of the Linux system is not configured correctly, there may be a problem running the HBase server, as shown in the figure:
2010-11-05 11:10:20,189 ERROR org.apache.hadoop.hbase.master.HMaster: Can not start master
java.net.UnknownHostException: ubuntu-server216: ubuntu-server216
This indicates that your hostname is not correct. First check the entries in /etc/hosts, then modify the hostname with the hostname command: hostname your_server_name
Viewing the running status: if you need to monitor the HBase logs, you can view the log files under hbase-x.x.x/logs/, for example with tail -f. You can view the ZooKeeper running under HBase on the web at http://localhost:60010/zk.jsp, and if you need to see the current running state of the HBase server you can also view it on the web, as shown in the figure:
Extended Reading 1:
The Apache Hadoop project includes the following products, as shown in the figure:
Pig is a query language (SQL-like) built on MapReduce, suitable for massive parallel computation.
Chukwa is a monitoring system built on a Hadoop cluster; simply put, a "watchdog" (WatchDog).
Hive is the intersection of a data warehouse and MapReduce, suitable for ETL work.
HBase is a column-oriented distributed database.
MapReduce is an algorithm proposed by Google for parallel computation over very large datasets.
HDFS is a distributed file system that can support very large amounts of data.
ZooKeeper's features include configuration maintenance, naming services, distributed synchronization, group services and so on, providing reliable coordination for distributed systems.
Avro is a data serialization system designed for applications that support large-scale data exchange.
Extended Reading 2:
What is column storage? Column storage differs from a traditional relational database, which stores its data in the table row by row. An important benefit of the column approach is that because the selection rules in queries are defined in terms of columns, the entire database is effectively self-indexing. Storing each field's data aggregated by column means that when a query needs only a few fields, the amount of data read can be greatly reduced; the aggregated storage of one field's data also makes it easier to design better compression/decompression algorithms for that clustered data. This diagram describes the differences between traditional row storage and column storage:
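A tiny, purely illustrative sketch of the difference (the records are made up): serializing three records row by row versus column by column. In the column layout, a query that only needs the age field touches one contiguous block instead of walking every record, and similar values stored side by side compress better.

public class RowVsColumnSketch {
    public static void main(String[] args) {
        String[] names = {"tom", "anna", "lee"};
        int[] ages = {25, 31, 28};

        // Row storage: all fields of one record sit together,
        // e.g. (tom,25) (anna,31) (lee,28).
        for (int i = 0; i < names.length; i++) {
            System.out.println("record " + i + ": " + names[i] + "," + ages[i]);
        }

        // Column storage: all values of one field sit together,
        // e.g. names = [tom, anna, lee], ages = [25, 31, 28].
        // Reading only "age" scans just this one array.
        for (int age : ages) {
            System.out.print(age + " ");
        }
        System.out.println();
    }
}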
Extended Reading 3:
The system's massive log4j logs can be stored on one centralized machine; installing Splunk on that machine makes it convenient to view all the logs. For the installation method you can consult:
http://www.splunk.com/base/Documentation/latest/Installation/InstallonLinux
2 - Java Manipulating HBase Examples
This article describes operating the HBase server with HBase shell commands and with the HBase Java API. Before that, it helps to have a general understanding of HBase: for example, what are the main components inside the HBase server, and how does HBase work internally? In learning any technology, the attitude cannot be to only know how to use it and ignore how the product is built internally; otherwise, when a problem appears it is hard to find the answer quickly. Ultimately we want to digest a technology through our own experience, take its design ideas to build our own solutions, and handle ever-changing computing scenarios and architecture designs more flexibly. My current understanding of HBase is not deep enough; as I keep learning, I will share the bits I know on this blog.
Let's take a look at how HBase works by following the read of a row. First, the HBase client connects to the ZooKeeper quorum (as can be seen from the code below, for example: HBASE_CONFIG.set("hbase.zookeeper.quorum", "192.168.50.216")). Through ZooKeeper the client learns which server manages the -ROOT- region. The client then accesses the server managing -ROOT-; all of HBase's table information is recorded in .META. (you can use the command scan '.META.' to list the details of all the tables you have created), and from it the client obtains the region distribution. Once the client has the location of the row, for example which region it belongs to, it caches this information and accesses the HRegionServer directly. Over time the client caches more and more information, so even without visiting the .META. table it knows which HRegionServer to access. HBase contains two basic types of files: one stores the write-ahead log (WAL), and the other stores the actual data; both are written to the distributed file system HDFS through the DFS client.
As shown in the figure:
Now take a look at some of HBase's internal components:
HMaster - there is only one master server in HBase.
HRegionServer - responsible for serving multiple HRegions to clients; an HBase cluster contains multiple HRegionServers.
ServerManager - responsible for managing region server information, such as each region server's HServerInfo (this object contains HServerAddress and startCode), the number of regions it carries (its load), and the list of dead region servers.
RegionManager - responsible for assigning regions to region servers and for monitoring the state of the two system-level regions, root and meta.
RootScanner - periodically scans the root region to find meta regions that have not been assigned.
MetaScanner - periodically scans the meta regions to discover user regions that have not been assigned.
HBase basic commands
Let's look at some of the basic operation commands for HBase. I have listed several common HBase shell commands, as follows:
Create a table: create 'table name', 'column family 1', 'column family 2', ..., 'column family n'
Add a record: put 'table name', 'row key', 'column family:column', 'value'
View a record: get 'table name', 'row key'
View the total number of records in a table: count 'table name'
Delete a record: delete 'table name', 'row key', 'column family:column'
Delete a table: the table must first be disabled before it can be dropped; step 1: disable 'table name', step 2: drop 'table name'
View all records: scan 'table name'
View all data in one column family of a table: scan 'table name', ['column family:']
Update a record: put the new value again; it overwrites the old one
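The get and delete commands above have straightforward Java equivalents that the full example later in this article does not cover. A short sketch using the 0.20-era client classes org.apache.hadoop.hbase.client.Get and Delete (the table name "tab1" and row key "row1" are made up):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GetDeleteSketch {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "tab1");

        // get 'tab1', 'row1' -- fetch a whole row.
        Result r = table.get(new Get(Bytes.toBytes("row1")));
        System.out.println("row1: " + r);

        // delete 'tab1', 'row1' -- remove the row again.
        table.delete(new Delete(Bytes.toBytes("row1")));
    }
}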
If you are new to HBase and not very familiar with some of its commands, you can enter HBase shell mode and type the help command to see the commands you can execute and their descriptions. Take the scan command, for example: help not only mentions the command, it also explains in detail the parameters that can be used with it, such as how to query by column name and how to use LIMIT and STARTROW:
scan   Scan a table; pass table name and optionally a dictionary of scanner
       specifications. Scanner specifications may include one or more of the
       following: LIMIT, STARTROW, STOPROW, TIMESTAMP, or COLUMNS. If no
       columns are specified, all columns will be scanned. To scan all
       members of a column family, leave the qualifier empty as in
       'col_family:'. Examples:

       hbase> scan '.META.'
       hbase> scan '.META.', {COLUMNS => 'info:regioninfo'}
       hbase> scan 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
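The same kind of scan can also be expressed through the Java API covered in the next section. A hedged sketch (table "t1" and column family "c1" are made up; the LIMIT is enforced client-side by breaking out of the loop, to stay within calls this article itself relies on):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanSketch {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "t1");

        // STARTROW => 'xyz': begin scanning at this row key.
        Scan scan = new Scan(Bytes.toBytes("xyz"));
        // COLUMNS => ['c1:']: only return the 'c1' column family.
        scan.addFamily(Bytes.toBytes("c1"));

        ResultScanner scanner = table.getScanner(scan);
        int count = 0;
        for (Result row : scanner) {
            System.out.println(row);
            if (++count >= 10) break;  // LIMIT => 10, enforced client-side
        }
        scanner.close();
    }
}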
Using the Java API to manipulate the HBase server
The following jar packages are required:

hbase-0.20.6.jar
hadoop-core-0.20.1.jar
commons-logging-1.1.1.jar
zookeeper-3.3.0.jar
log4j-1.2.91.jar

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.BatchUpdate;

@SuppressWarnings("deprecation")
public class HBaseTestCase {

    static HBaseConfiguration cfg = null;

    static {
        Configuration HBASE_CONFIG = new Configuration();
        HBASE_CONFIG.set("hbase.zookeeper.quorum", "192.168.50.216");
        HBASE_CONFIG.set("hbase.zookeeper.property.clientPort", "2181");
        cfg = new HBaseConfiguration(HBASE_CONFIG);
    }

    /**
     * Create a table.
     */
    public static void createTable(String tablename) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(cfg);
        if (admin.tableExists(tablename)) {
            System.out.println("table exists!!!");
        } else {
            HTableDescriptor tableDesc = new HTableDescriptor(tablename);
            tableDesc.addFamily(new HColumnDescriptor("name:"));
            admin.createTable(tableDesc);
            System.out.println("create table ok.");
        }
    }

    /**
     * Add one row of data.
     */
    public static void addData(String tablename) throws Exception {
        HTable table = new HTable(cfg, tablename);
        BatchUpdate update = new BatchUpdate("huangyi");
        update.put("name:java", "http://www.javabloger.com".getBytes());
        table.commit(update);
        System.out.println("add data ok.");
    }

    /**
     * Show all data.
     */
    public static void getAllData(String tablename) throws Exception {
        HTable table = new HTable(cfg, tablename);
        Scan s = new Scan();
        ResultScanner ss = table.getScanner(s);
        for (Result r : ss) {
            for (KeyValue kv : r.raw()) {
                System.out.print(new String(kv.getColumn()));
                System.out.println(new String(kv.getValue()));
            }
        }
    }

    public static void main(String[] args) {
        try {
            String tablename = "tablename";
            HBaseTestCase.createTable(tablename);
            HBaseTestCase.addData(tablename);
            HBaseTestCase.getAllData(tablename);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
3 - HBase Simple Optimization Techniques
This article briefly discusses some HBase optimization techniques from a few angles. It is only part of my study notes; having learned a lot and fearing to forget it, I keep these here for my own review.
1 Modifying Linux system parameters
On Linux the default maximum number of open files is 1024. If you do not change it, a "Too many open files" error will appear under concurrency and the whole HBase will become inoperable. You can modify it with the ulimit -n command, or by modifying the /etc/security/limits.conf and /proc/sys/fs/file-max parameters. For exactly how to modify them, google the keywords "Linux limits.conf".
2 JVM configuration
Modify the configuration parameters in the hbase-env.sh file, setting appropriate values based on your machine's hardware and the JVM (32/64-bit) of the current operating system:
HBASE_HEAPSIZE 4000 - the size of the JVM heap HBase uses, in MB
HBASE_OPTS "-server -XX:+UseConcMarkSweepGC" - JVM GC options
HBASE_MANAGES_ZK false - whether HBase manages the ZooKeeper instance itself
3 HBase persistence
After the operating system restarts, the data in HBase is completely gone. You can try it: create a table and write a piece of data without making any other changes, then restart the machine; after the reboot, enter HBase's shell and use the list command to look at the current tables - none of them are there. Quite a tragedy! It doesn't matter: you can set the hbase.rootdir value in hbase/conf/hbase-default.xml to specify a folder where the data files are saved, for example: <value>file:///you/hbase-data/path</value>. The tables and data you create in HBase are then written directly to your disk, as shown in the figure:
You can also specify a path on your distributed file system HDFS, for example hdfs://namenode_server:port/hbase_rootdir, and the data will then be written to your distributed file system.
4 Configuring HBase runtime parameters
Next you need to configure the hbase/conf/hbase-default.xml file. The following are the configuration parameters I consider more important.
hbase.client.write.buffer
Description: this parameter sets the size of the write data buffer. When the client and server transmit data, the server opens a write buffer to improve system performance. If this parameter is set large, it places certain demands on system memory and directly affects system performance.
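On the client side, the 0.20-era HTable API exposes this buffering directly. A minimal sketch (the table name, sizes and values are illustrative assumptions, and setAutoFlush/setWriteBufferSize/flushCommits are assumed available in your client version):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WriteBufferSketch {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "tab1");

        // Buffer writes on the client instead of sending each one at once.
        table.setAutoFlush(false);
        table.setWriteBufferSize(2 * 1024 * 1024);  // 2 MB, illustrative

        for (int i = 0; i < 1000; i++) {
            Put p = new Put(Bytes.toBytes("row" + i));
            p.add(Bytes.toBytes("name"), Bytes.toBytes("n"), Bytes.toBytes("v" + i));
            table.put(p);  // queued in the client-side buffer
        }
        table.flushCommits();  // push everything still buffered
    }
}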
hbase.master.meta.thread.rescanfrequency
Description: how often HMaster rescans the system tables root and meta. This parameter can be set a bit longer to reduce system overhead.
hbase.regionserver.handler.count
Description: the HBase/Hadoop server is designed with a multiplexed, non-blocking I/O model, so it can serve requests with a single thread; but because the methods the client calls are blocking I/O, the design places the objects passed in by clients into a queue, and a pool of handlers (threads) is created at server startup; the handlers poll the queue, take objects off it, and execute the corresponding methods. The default is 25; you can set a larger number according to your actual scenario.
hbase.regionserver.thread.splitcompactcheckfrequency
Description: this parameter is the time interval at which a region server checks whether to run a split/compaction. Note that a compact operation is performed before a split; the compact may be a minor compact or a major compact. After a compact, the midKey is taken from all the storefiles of all the stores. This midKey may not be exactly at the middle of the full data, and the data following a given row key may cross different HRegions.
hbase.hregion.max.filesize
Description: the maximum HStoreFile size in an HRegion. Once a column family in any table exceeds this size, it is split; the default HStoreFile size is 256 MB.
hfile.block.cache.size
Description: specifies the percentage of the JVM heap allocated to the HFile/StoreFile block cache. The default value is 0.2, meaning 20%; if you set it to 0, this option is disabled.
Hbase.zookeeper.property.maxClientCnxns
Description: this configuration option comes from ZooKeeper and indicates the number of concurrent connections a ZooKeeper client may open at the same time. ZooKeeper is the entry point for HBase, so the value of this parameter can be increased appropriately.
Hbase.regionserver.global.memstore.upperLimit
Description: configures the combined size of all memstores in a region server. The default value is 0.4, meaning 40% (of the heap); if set to 0, this option is disabled.
hbase.hregion.memstore.flush.size
Description: once the cached content in a memstore exceeds the configured size, it is written to disk. For example, a delete operation is first recorded in the memstore, marking a value, column or family as deleted; HBase periodically runs a major compaction over the stored files, at which point it flushes the memstore into a new HFile storage file. If no major compaction occurs within a certain time frame and the memstore exceeds its configured size, it is still flushed to disk.
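These parameters normally live in the XML configuration files, but for experimentation they can also be set programmatically on the same Configuration object the code in part 2 builds. A hedged sketch with arbitrary example values rather than recommendations; note that server-side parameters (such as the handler count) only take effect in the configuration read by the servers themselves, not in a client-side object like this one:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class TuningSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Client-side write buffer, in bytes (example value).
        conf.set("hbase.client.write.buffer", "2097152");
        // RPC handler threads per region server (example value;
        // only effective in the servers' own configuration).
        conf.set("hbase.regionserver.handler.count", "50");
        // Fraction of the heap used for the HFile block cache (example value).
        conf.set("hfile.block.cache.size", "0.25");
        // Max HStoreFile size before a region splits, in bytes (example value).
        conf.set("hbase.hregion.max.filesize", "268435456");

        HBaseConfiguration cfg = new HBaseConfiguration(conf);
        System.out.println("write buffer = " + cfg.get("hbase.client.write.buffer"));
    }
}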
5 log4j logs in HBase
By default HBase's log output has debug- and info-level logging enabled. You can adjust the log level to suit your needs; HBase's log4j configuration file is hbase/conf/log4j.properties.
4 - Storage
A table created in HBase can be distributed across multiple HRegions. That is, a table can be split into many chunks, and each chunk is what we call an HRegion. Each HRegion stores a contiguous range of a table's data. Every HRegion chunk of a user's large table is maintained by an HRegion server, and access to an HRegion chunk goes through that HRegion server: one HRegion chunk corresponds to one HRegion server, while a complete table can be saved across multiple HRegions, so the correspondence between an HRegion server and regions is one-to-many. Physically, each HRegion consists of three parts: HMemcache (cache), HLog (log), and HStore (persistence layer).
These relationships look like this in my mind, as shown in the figure:
1. The relationship between HRegionServer, HRegion, HMemcache, HLog, and HStore, as shown in the figure:
2. The distribution of the data in an HBase table across HRegionServers, as shown in the figure:
HBase reads data
When reading, HBase first looks in HMemcache; only if the data is not found there does it read from HStore. This improves read performance.
HBase writes data
When writing, HBase writes the data to both HMemcache and HLog: HMemcache is the cache, and HLog is the transaction log used to keep HMemcache and HStore synchronized. When a flush of the cache is triggered, the data is persisted to HStore and HMemcache is emptied.
When a client accesses data, it goes through HMaster. Each HRegion server keeps a long-lived connection to the HMaster server; HMaster is the manager of the HBase distributed system, and its main task is to tell each HRegion server which HRegions it should maintain. The user's data is saved on the Hadoop distributed file system. If the primary server HMaster freezes, the whole system becomes unavailable. Below I will consider how to solve the HMaster SPOF problem. This problem is a bit like Hadoop's SPOF problem: just one NameNode maintains the global DataNodes, and once it crashes HDFS is entirely down. Some people say heartbeat can solve the problem, but I always want to look for other solutions; given more time, there is always a way.
Yesterday I wrestled for a long time with an environment of hadoop-0.21.0 and hbase-0.20.6; it kept reporting errors like the following:
Exception in thread "main" java.io.IOException: Call to localhost/serv6:9000 failed on local exception: java.io.EOFException
10/11/10 15:34:34 ERROR master.HMaster: Can not start master
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.hadoop.hbase.master.HMaster.doMain(HMaster.java:1233)
    at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1274)
No matter what I tried, it could not connect to HDFS, nor could it connect to HMaster. Depressing.
I thought it over slowly, and then my eyes lit up at the java.io.EOFException: could this be caused by an inconsistent RPC protocol format, meaning the server and client versions do not match? On the HDFS server side everything was fine, so sure enough it was a version problem. In the end, pairing hadoop-0.20.2 with hbase-0.20.6 proved more stable.
The final effect is as shown in the figure:
Some textual description of the figure above: the Hadoop version is 0.20.2 and the HBase version is 0.20.6. A table tab1 was created in HBase; after exiting the HBase shell environment and viewing with the Hadoop command, the file system had something new: a freshly created tab1 directory. The figure shows HBase running on the distributed file system Apache HDFS.
5 - Cluster: Load Balancing and Failover
The previous article on HBase described HBase's architecture in a distributed environment. This article explains how HBase eliminates single points of failure (SPOF) in a distributed environment, runs a small experiment on HBase's high availability in a distributed environment to see some of the phenomena with my own eyes, and extends a few topics for thought.
Let's recap HBase's main components: HBaseMaster, HRegionServer, HBase Client, HBase Thrift Server, HBase REST Server.
HBaseMaster
HMaster is responsible for assigning regions to HRegionServers and for load balancing the HRegionServers in the cluster environment. HMaster also monitors the HRegionServers in the cluster; if an HRegionServer goes down, HBaseMaster will stop using that HRegionServer to provide service, and the HLo