MapReduce operations on HBase

Source: Internet
Author: User
My two cents:
This article provides sample code, but it does not go into the details of MapReduce at the HBase code level; it mainly records my own partial understanding and experience.

Recently, Medialets (Ref) shared their experience with MapReduce in their website architecture: they use HDFS as the foundation for distributed MapReduce computation, do the actual computing with a Python MapReduce framework, write the results into MongoDB for storage, and claim to be able to process millions of business events per second. The application scenarios for MapReduce keep getting richer: besides very large Internet companies such as Google and Yahoo, more and more small and medium portals are taking an interest in MapReduce and Hadoop, and Hadoop MapReduce distributed computing will come ever closer to the rest of us.

The Hadoop Map/Reduce framework is indeed easy to understand, and applications built on it can run on large clusters of thousands of machines. MapReduce is, to a certain extent, "brute force" computing: the more machines you add, the more significant the results. MapReduce provides a reliable, fault-tolerant way to process terabyte-scale data sets in parallel.
In an actual scenario, executing a Map/Reduce job first splits the input data into multiple blocks (the yellow chunks on the left of the figure below), which map tasks process in a distributed, parallel fashion. The MapReduce framework sorts the map output, and the sorted map results are then fed to the reduce tasks, which merge them into the final result.
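To make this data flow concrete, here is a minimal sketch of the classic word-count job using Hadoop's Java MapReduce API (a generic illustration of split/map/sort/shuffle/reduce, not the txt-to-HBase example discussed later in this article):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Each map task processes one input split and emits (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // the framework sorts and groups these by key
        }
    }
}

// The reducer receives all values for one key, already sorted and grouped, and sums them.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}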

Failover, distributed storage, job scheduling, fault-tolerance handling, network communication, and load balancing among the MapReduce compute nodes in the cluster do not need to be considered by developers; the MapReduce framework and its runtime environment have long taken care of them. As shown in the figure, in a cluster environment a master is responsible for scheduling all the tasks that make up a job. A large number of tasks sit in the master's task queue, and the master assigns them to different slave (worker) instances and monitors their execution. If a task fails, the master assigns it to a slave (worker) to be re-executed:

Generally, the input and output of a job are stored in the HDFS file system; that is, the MapReduce framework and the distributed file system run on the same set of nodes. This allows tasks to be scheduled efficiently onto the nodes where the data already resides, which makes efficient use of the network bandwidth of the entire cluster. The MapReduce framework consists of a single JobTracker (master) and multiple TaskTrackers (slaves) on the cluster nodes.

In the MapReduce client code, you specify the input/output locations (file path / DB / NoSQL); together with the job parameters, this forms the job configuration. The client code defines the map and reduce methods by implementing the appropriate abstract classes and writing your business logic inside those implementations, and it also declares the input and output types of your map/reduce. The Hadoop job client then submits the job (jar package, classes, executable program, etc.) and the configuration information to the JobTracker, which is responsible for distributing the software and configuration to the slaves, scheduling the tasks, monitoring their execution, and providing status and diagnostic information back to the job client. After the client submits a job, Hadoop runs it through the stages of input, splitting, sorting, shuffling, merging, and output. The output is ordered, because the MapReduce framework sorts by nature:
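As a rough sketch of that client-side job configuration (reusing the WordCountMapper and WordCountReducer classes from the sketch above; the class name and paths are illustrative, not taken from this article's downloadable example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");        // the job configuration
        job.setJarByClass(WordCountDriver.class);     // the jar the JobTracker distributes to slaves

        job.setMapperClass(WordCountMapper.class);    // your map implementation
        job.setReducerClass(WordCountReducer.class);  // your reduce implementation
        job.setOutputKeyClass(Text.class);            // declared output types
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input location
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output location

        // Submit the job to the JobTracker and wait, printing progress and diagnostics.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}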

MapReduce distributes operations on a data set across the nodes of the network, and each node periodically reports its execution status back. When a node's connection or computation times out, the master records that node as dead and sends the tasks assigned to it to other nodes to run. For example, Apache Hive is an implementation built on the MapReduce framework: Hive converts SQL statements into MapReduce tasks, distributes the execution of the SQL across the machines, and finally returns the computed result.

I wrote a code example. It uses the MapReduce framework to read data from a folder, format and process the content, and then write it into HBase. Based on the input conditions and state, multiple map tasks are generated to process the input content. The Mapper first reads all the file information from the directory and processes and formats it; once that is done, the results are handed to the Reducer. The Reducer performs the corresponding operations according to the input types defined by the client and stores the final results in HBase.

Example:
There is an input directory containing three files: 1.txt, 2.txt, and 3.txt. You need to create a table named tab1 in HBase with a column family named f1, and then run the code example. In the end, the data from the three txt files is all written into HBase. While it runs, you can see in the Eclipse console that the data is first read and formatted and then written into HBase:
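As a minimal sketch of such a txt-to-HBase job (the class names, the row-key choice, and the f1:content column qualifier here are my own illustration and may differ from the downloadable example below):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class TxtToHBase {

    // Mapper: reads one line of a txt file, "formats" it (here just a trim), and passes it on.
    static class TxtMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(offset, new Text(line.toString().trim()));
        }
    }

    // Reducer: turns each formatted line into a Put against table tab1, column family f1.
    static class HBaseWriteReducer
            extends TableReducer<LongWritable, Text, ImmutableBytesWritable> {
        @Override
        protected void reduce(LongWritable offset, Iterable<Text> lines, Context context)
                throws IOException, InterruptedException {
            for (Text line : lines) {
                // Illustration only: a real job would use a meaningful, collision-free row key.
                Put put = new Put(Bytes.toBytes(offset.get()));
                put.add(Bytes.toBytes("f1"), Bytes.toBytes("content"),
                        Bytes.toBytes(line.toString()));
                context.write(new ImmutableBytesWritable(put.getRow()), put);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "txt-to-hbase");
        job.setJarByClass(TxtToHBase.class);

        job.setMapperClass(TxtMapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));  // e.g. the input directory
        // Wires the reducer to TableOutputFormat so its Puts are written straight into tab1.
        TableMapReduceUtil.initTableReducerJob("tab1", HBaseWriteReducer.class, job);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}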

The Reduce calculation result:

Code example: http://javabloger-mini-books.googlecode.com/files/txt-to-hbase.rar

This example assumes a scenario like Baidu Library: at every moment, tens of thousands of people upload files to Baidu's servers, and files of different formats need to be processed, formatted, and finally saved in the shortest possible time. The documents the front-end servers receive from users are just like the three files under x:\input in this example, except that for Baidu Library the numbers would be far larger. A large number of documents are thrown at MapReduce; MapReduce hands the documents that need to be parsed, typeset, and formatted to the distributed Hadoop nodes, spreading the computing load across many CPUs. With so many machines working together, the documents are processed and saved to the database/NoSQL store very quickly, and users can immediately read the documents they uploaded online.

In offline scenarios you can also write data into HBase through MapReduce in another way: first throw the documents at MapReduce, use HFileOutputFormat to produce HBase data files (HFiles), and then import those data files into HBase. This approach is worth considering when migrating massive amounts of data. Based on this approach, HBase officially provides the importtsv tool; you can refer to the official HBase documentation (Ref).
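A minimal sketch of the HFile-generating step, assuming HBase 0.90-era APIs (the mapper logic, the table name tab1, and the paths are placeholders for illustration):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadPrepare {

    // Illustration only: turn each input line into a Put keyed by the line itself.
    static class LineToPutMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            byte[] row = Bytes.toBytes(line.toString());
            Put put = new Put(row);
            put.add(Bytes.toBytes("f1"), Bytes.toBytes("content"), row);
            context.write(new ImmutableBytesWritable(row), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "prepare-hfiles");
        job.setJarByClass(BulkLoadPrepare.class);

        job.setMapperClass(LineToPutMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // source documents
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // directory for the generated HFiles

        // Sets the output format, total-order partitioner and sort reducer so the
        // generated HFiles line up with the regions of the target table.
        HFileOutputFormat.configureIncrementalLoad(job, new HTable(conf, "tab1"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The generated HFiles are then moved into the table with HBase's completebulkload tool (LoadIncrementalHFiles); importtsv is a ready-made tool that covers the same ground for tab-separated input files.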

My two cents:
Recently, a project requires us to provide a solution for the largest network equipment vendor in China. The project mainly extends the contacts feature on the Android platform: for example, if the contact software is installed on both clients, the two users can send each other free text messages, similar to Fetion and KiKi. The product will also be promoted to markets outside China. If the number of online users in phase II reaches tens of millions, a large volume of offline messages may be generated, and we plan to use HBase to store those offline messages. On this point I learned directly from Facebook.

Related Articles:
HBase entry 6 - Communication between MySQL (RDBMS) and HBase
Lily - Distributed search based on HBase
MySQL migration tool to Hive/HBase
HBase entry 5 (cluster) - Load splitting and failure forwarding
Hive entry 3 - Integration of Hive and HBase
HBase entry 4
HBase entry 3
HBase entry 2 - Examples of Java operations on HBase
HBase Basics
JABase - An HBase-based distributed instant messaging (IM) system
HBase entry 7 - Security & Permissions

-End-

Original article: MapReduce operations on HBase. Thanks to the original author for sharing.
