HBase application scenarios and success stories (repost)


Sometimes the best way to understand a software product is to see it in action: what problems it can solve, and how those solutions fit into large-scale application architectures, can tell you a lot. Because HBase has many publicly known production deployments, we can do exactly that. This section describes some of the ways HBase is used.

 

Note: Do not limit yourself by assuming that HBase can only solve these kinds of problems. As a young technology, innovation driven by new use cases is what pushes the system forward. If you have a new idea and think it could benefit from the features HBase provides, try it. The community is happy to help you and to learn from your experience. That is the spirit of open source software.

 

HBase is modeled on Google's Bigtable, so let's begin by exploring the canonical Bigtable problem: storing the Internet.

 

The canonical Internet search problem: why Bigtable was invented

Search is the act of locating information you care about: for example, searching a book's index for the page numbers of the topic you want to read about, or searching the web for the pages that hold the information you want to find. Searching for documents containing particular words requires an index that maps each word to all of the documents that contain it. To make search possible, that index must be built first. This is exactly what Google and the other search engines do: their document collection is the entire Internet, and the search terms are whatever you type into the search box.

Bigtable, and HBase after it, provide storage for exactly this kind of document collection. Bigtable offers row-level access, so crawlers can insert and update individual documents. Search indexes can be generated efficiently by running MapReduce computations over the data stored in Bigtable, and a single document appearing in the results can be retrieved from Bigtable directly. Supporting these different access patterns was a key factor in Bigtable's design.
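To make the row-level access pattern concrete, here is a minimal sketch using the HBase Java client API. It stores one crawled page as one row and reads it back with a point get. The table name webtable, the column family contents, and the qualifier html are illustrative assumptions borrowed from the Bigtable paper's example, not names taken from this article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class WebtableSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("webtable"))) {

            // A crawler inserts (or updates) a single page as one row, keyed by
            // the reversed domain so pages from the same site sort together.
            Put put = new Put(Bytes.toBytes("com.example.www/index.html"));
            put.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"),
                          Bytes.toBytes("<html>...</html>"));
            table.put(put);

            // A serving application can later fetch that single document directly.
            Result result = table.get(new Get(Bytes.toBytes("com.example.www/index.html")));
            byte[] html = result.getValue(Bytes.toBytes("contents"), Bytes.toBytes("html"));
            System.out.println(Bytes.toString(html));
        }
    }
}
```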

 

Building an index of the Internet

1. Crawlers constantly scour the web for new pages. Each page is stored as one row in Bigtable.

2. MapReduce jobs run over the entire table, generating the search index that the web search application will use (a minimal sketch of such a job appears after this list).

Searching the Internet

3. A user issues a web search request.

4. The web search application queries the index built in step 2 and, where needed, retrieves matching documents directly from Bigtable.

5. The search results are presented to the user.
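The sketch below illustrates step 2: a MapReduce job that scans the whole table and emits (term, document) pairs from which an index could be built. It uses HBase's standard TableMapper/TableMapReduceUtil integration. The table and column names are the same illustrative ones as in the previous sketch, and the job is deliberately simplified: it writes raw pairs to HDFS instead of building real postings lists.

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IndexBuilder {

    // Map phase: called once per row, i.e. once per crawled page.
    public static class IndexMapper extends TableMapper<Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result page, Context ctx)
                throws IOException, InterruptedException {
            String doc = Bytes.toString(rowKey.get());
            byte[] html = page.getValue(Bytes.toBytes("contents"), Bytes.toBytes("html"));
            if (html == null) {
                return;
            }
            // Emit (term, document) pairs; a reducer could aggregate them into postings lists.
            for (String term : Bytes.toString(html).toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) {
                    ctx.write(new Text(term), new Text(doc));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(HBaseConfiguration.create(), "build-index");
        job.setJarByClass(IndexBuilder.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // bigger scanner batches for a full-table scan
        scan.setCacheBlocks(false);  // don't pollute the block cache from a batch job

        TableMapReduceUtil.initTableMapperJob(
                "webtable", scan, IndexMapper.class, Text.class, Text.class, job);
        job.setNumReduceTasks(0);    // sketch only: dump the raw (term, doc) pairs
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```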

 

Having covered the canonical HBase use case, let's look at other places HBase is used. The number of users willing to adopt HBase has grown rapidly over the past few years, partly because the product has become more reliable and better performing, and partly because more and more companies are investing significant resources in supporting and using it. As more commercial service providers offer support, users are increasingly confident about applying HBase to critical systems. A technology originally designed to store a continuously updated copy of the web also turns out to fit other Internet-related problems. For example, HBase has found a place in a variety of needs inside and around social networking companies: from storing personal communications to analyzing them, HBase has become a key piece of infrastructure at companies such as Facebook, Twitter, and StumbleUpon.

In this space there are three broad categories of HBase use, though HBase is by no means limited to them. To keep this section simple and clear, we concentrate on these main scenarios.

Capturing incremental data

Data often trickles in continuously, accumulating in a datastore for later use such as analysis, processing, and serving. Many HBase use cases fall into this category: HBase acts as the store that captures incremental data arriving from all kinds of sources. The source might be a web crawler (the canonical Bigtable problem we just discussed); it might be advertising-effectiveness data recording which ads users saw and for how long; or it might be time-series data recording all sorts of metrics. Below we discuss several successful use cases and the companies behind them.

 

Capturing monitoring metrics: OpenTSDB

The back-end infrastructure of a web product serving millions of users typically consists of hundreds or thousands of servers covering a range of functions: serving traffic, capturing logs, storing data, processing data, and so on. To keep the product healthy, it is vital to monitor the state of the servers and of the software running on them (from the OS up to the applications users interact with). Monitoring an entire environment at that scale requires a system that can collect and store metrics from many different sources. Every company has its own solution: some use commercial tools to collect and present metrics, while others build on open-source frameworks.

StumbleUpon built an open-source framework to collect all kinds of server metrics. Metrics collected over time are generally called time-series data: data points gathered and recorded in chronological order. StumbleUpon's framework is called OpenTSDB, short for open time series database. It uses HBase as its core platform to store and retrieve the collected metrics, and it was built so that the collection system could scale: metrics can be stored, queried, and retained for long periods, and new metrics can easily be added as features are added. StumbleUpon uses OpenTSDB to monitor all of its infrastructure and software, including its HBase clusters. Chapter 7 covers OpenTSDB in detail as an example application built on HBase.
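As a rough illustration of how time-series metrics can be laid out in HBase, the sketch below writes one data point per row, keyed by metric name plus timestamp so that one metric's points sit next to each other and can be read back with a range scan. This is a deliberate simplification for illustration only; OpenTSDB's real schema is more sophisticated (it packs many data points into a single row and encodes metric and tag names as fixed-width IDs). The table name tsdb-sketch and the column family v are hypothetical.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class MetricWriter {
    // Writes a single (metric, timestamp, value) data point as its own row.
    public static void writeDataPoint(Table table, String metric,
                                      long epochSeconds, double value) throws Exception {
        // Row key: metric name + fixed-width timestamp, so points for one metric
        // are stored contiguously and in time order.
        byte[] rowKey = Bytes.add(Bytes.toBytes(metric + ":"), Bytes.toBytes(epochSeconds));
        Put put = new Put(rowKey);
        put.addColumn(Bytes.toBytes("v"), Bytes.toBytes("value"), Bytes.toBytes(value));
        table.put(put);
    }

    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("tsdb-sketch"))) {
            writeDataPoint(table, "sys.cpu.user", System.currentTimeMillis() / 1000L, 42.5);
        }
    }
}
```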

 

Capturing user-interaction data: Facebook and StumbleUpon

Capturing monitoring data is one use; capturing user-interaction data is another. How do you track the activity of millions of users on a website? How do you know which site features are the most popular? How does one page view directly influence the next? For example: who saw what? How many times was a button clicked? Remember the Like button on Facebook and the Stumble and +1 buttons on StumbleUpon? That sounds a lot like a counting problem: every time a user likes a particular item, a counter is incremented.

StumbleUpon started out on MySQL, but as the service grew in popularity that approach ran into trouble: the sharply rising load from online users far exceeded what the MySQL clusters could handle. StumbleUpon eventually chose HBase to replace those clusters. At the time HBase did not directly provide every feature they required, so StumbleUpon made some small additions on top of HBase; that work was later contributed back to the project community.

Facebook uses HBase counters to measure how often people Like a particular page. Content creators and page owners get near-real-time information about how many users like their pages, so they can be more agile in deciding what content to offer. Facebook built a system called Facebook Insights around this, and it needed a scalable storage backend. The company considered many options, including relational databases, in-memory databases, and Cassandra, and finally settled on HBase. Building on HBase, Facebook can scale the service out to millions of users while continuing to draw on its existing experience running large HBase clusters. The system handles tens of billions of events per day and records hundreds of metrics.
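A minimal sketch of what such a counter looks like against the HBase client API; the table and column names here (page_likes, counts:likes) are illustrative, not Facebook's actual schema.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class LikeCounter {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("page_likes"))) {
            // Atomic, server-side increment: no read-modify-write race on the client.
            long likesSoFar = table.incrementColumnValue(
                    Bytes.toBytes("page#12345"),  // row: one row per page
                    Bytes.toBytes("counts"),      // column family
                    Bytes.toBytes("likes"),       // qualifier
                    1L);                          // amount to add
            System.out.println("Likes so far: " + likesSoFar);
        }
    }
}
```

Because the increment happens on the region server, many clients can bump the same counter concurrently without coordinating with one another.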

 

Telemetry: Mozilla and Trend Micro

Software operational data and software quality data are not as simple as monitoring metrics. Crash reports, for example, are useful operational data that are often used to gauge software quality and to plan the development roadmap. HBase can successfully capture and store crash reports generated on users' computers. Unlike the previous two scenarios, this use is not necessarily tied to a web-facing application.

 

The Mozilla Foundation is responsible for two products: the Firefox web browser and the Thunderbird email client. These tools are installed on millions of computers around the world and run on a variety of operating systems. When one of them crashes, a crash report may be sent back to Mozilla as a kind of bug report. How does Mozilla collect these reports, and what does it do with them once collected? In practice, a system called Socorro collects the reports and uses them to guide the R&D organization toward more stable products. Socorro's data storage and analysis are built on HBase. [1]

Using HBase allows even basic analysis to draw on far more data than was possible before, and that analysis helps Mozilla's developers focus their effort so that each release ships with as few bugs as possible.

Trend Micro provides Internet security and intrusion-management services to enterprise customers. An important part of security is awareness, and log collection and analysis are crucial to providing it. Trend Micro uses HBase to manage its web reputation database, which needs row-level updates as well as support for MapReduce batch processing. Much as in Mozilla's Socorro system, HBase is also used to collect and analyze log activity, taking in billions of records every day. HBase's flexible schema accommodates changing data structures: whenever the analysis process is adjusted, Trend Micro can simply add new attributes.
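The "flexible schema" point is easy to see in code: HBase columns are just qualifiers written at insert time, so a new attribute can start appearing on rows without any ALTER TABLE or migration. A hedged sketch, with an invented table name (web_reputation) and invented column names:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ReputationUpdate {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("web_reputation"))) {
            Put put = new Put(Bytes.toBytes("com.example.suspicious-site"));
            // An attribute the pipeline has always collected...
            put.addColumn(Bytes.toBytes("attr"), Bytes.toBytes("score"), Bytes.toBytes("12"));
            // ...and a brand-new attribute introduced by a change in the analysis process.
            // No schema migration is needed: the new qualifier simply starts appearing on rows.
            put.addColumn(Bytes.toBytes("attr"), Bytes.toBytes("threat_category"),
                          Bytes.toBytes("phishing"));
            table.put(put);
        }
    }
}
```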

 

Ad serving effectiveness and clickstreams

Over the past decade, online advertising has become a major source of revenue for Internet products: the service is offered to users for free, and ads are targeted at users while they use it. Precise targeting requires capturing and analyzing user-interaction data in detail in order to understand user characteristics; ads are then selected and placed based on those characteristics. Finer-grained interaction data yields better models, which yield better ad targeting and more revenue. This kind of data has two notable traits: it arrives as a continuous stream, and it partitions easily by user. Ideally the data can be used as soon as it is generated, so the user model is refined continuously and without delay; in other words, it can be used online.

 

Online and offline systems

The terms online and offline come up repeatedly. For newcomers, they describe the conditions under which a software system runs. Online systems need low latency; in some cases it is better for the system to answer quickly with no result than to take a long time to return the correct one. You can picture an online system as one with an impatient user waiting at the other end. Offline systems have no low-latency requirement: the user is willing to wait for an answer and does not expect an immediate response. Whether an application aims to be online or offline shapes many of its technical decisions. HBase is an online system, and its tight integration with Hadoop MapReduce also gives it offline access to its data.

 

HBase is well suited to collecting this kind of user-interaction data, and it has been applied successfully in exactly these scenarios: it incrementally captures first-hand clickstream and interaction data, which is then processed (cleaned, enriched, and served) by different means, MapReduce being one of them. At companies in this space you will find many such HBase deployments.
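Because the data partitions naturally by user, a common layout is to prefix each row key with the user ID, so that all of one user's interaction events come back with a single prefix scan. A hedged sketch; the row-key layout, table name (clickstream), and column names are illustrative assumptions:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ClickstreamReader {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("clickstream"))) {
            // Row keys look like "<userId>:<timestamp>", so a prefix scan returns
            // one user's events in time order.
            Scan scan = new Scan();
            scan.setRowPrefixFilter(Bytes.toBytes("user42:"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result event : scanner) {
                    byte[] page = event.getValue(Bytes.toBytes("e"), Bytes.toBytes("page"));
                    System.out.println(Bytes.toString(event.getRow()) + " -> "
                            + Bytes.toString(page));
                }
            }
        }
    }
}
```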

 

Content serving

One of the biggest traditional uses of databases is serving content to users. All kinds of databases back applications that serve all kinds of content, and as those applications have evolved over the years, so have the databases they depend on. Users want to consume and interact with ever more types of content, and the explosive growth of the Internet and of end-user devices places higher demands on how those applications are reached. The variety of devices also brings a challenge: different classes of device need the same content in different formats.

On one hand users consume content; on the other hand they also generate it. Tweets, Facebook posts, Instagram pictures, and Weibo updates are all examples of user-generated content.

Either way, a huge amount of content is consumed and produced. Enormous numbers of users consume and generate content through application systems, and those systems need something like HBase as their foundation.

A centralized content management system (CMS) can store content and serve it out. But as more and more users generate more and more content, a more scalable CMS solution is required.

Such a scalable CMS often uses HBase as its foundation, paired with other open-source frameworks such as Solr, to form a complete stack of functionality.

Salesforce provides a hosted CRM product, delivered to users through a web browser interface and exposing rich relational-database functionality. Long before Google published the papers that seeded the NoSQL idea, the only sensible choice for a large, mission-critical production database was a commercial relational database. Over the years, Salesforce scaled its system to hundreds of millions of transactions per day by means of database sharding and cutting-edge performance optimization.

When Salesforce surveyed the distributed-database landscape, they evaluated the NoSQL technologies on offer and ultimately decided to deploy HBase. The main reason for the choice was that Bigtable-style systems are the only architecture that seamlessly combines horizontal scalability with strong row-level consistency. In addition, Salesforce had already been using Hadoop for large-scale offline batch processing, so they could keep drawing on the valuable experience they had accumulated with Hadoop.

URL shorteners

URL shorteners have become very popular in recent years, and many such products exist. StumbleUpon uses its own shortener, named su.pr, which is built on HBase. The product shortens URLs, storing enormous numbers of short links together with their mappings to the original long URLs; HBase is what makes the product scalable.
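At its core a shortener is a key-value mapping that HBase handles naturally: the short code becomes the row key and the long URL is stored as a cell value. A minimal sketch; the table name (short_links) and column names are illustrative assumptions, not su.pr's actual design:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ShortLinks {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("short_links"))) {
            // Store the mapping: short code -> original long URL.
            Put put = new Put(Bytes.toBytes("aB3xQ"));
            put.addColumn(Bytes.toBytes("u"), Bytes.toBytes("url"),
                    Bytes.toBytes("https://example.com/some/very/long/path?with=params"));
            table.put(put);

            // Resolve a short link: a single point read by row key.
            Result result = table.get(new Get(Bytes.toBytes("aB3xQ")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("u"), Bytes.toBytes("url"))));
        }
    }
}
```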

User model services

The content served out of HBase is often not delivered straight to users but used to decide what should be delivered to them; this intermediate data enriches the user's interaction with an application. Remember the user model from the advertising scenario above? That user profile (or model) can be served from HBase. Such models take many forms and fit many situations: deciding which ad to show a particular user, making real-time pricing decisions while a user shops on an e-commerce portal, or adding background information and related content to search-engine results. Many of these use cases cannot be discussed publicly; if we said any more, we would be in trouble.

When a user transacts on an e-commerce site, the user-model service can be consulted for real-time pricing decisions, and the model has to be continuously refined with the new user data that keeps arriving.

Information exchange

The world grows smaller as social networks of every kind take off. A major role of social media sites is to help people interact: sometimes the interaction happens within a group (small or large), sometimes between just two people. Think of hundreds of millions of people carrying on conversations over social networks. Talking to someone far away is not enough; people also want to look back at the history of their conversations with others. Social networking companies are fortunate that storage is cheap, and innovation in the big-data space helps them make the most of that cheap storage.

Facebook's messaging system is discussed publicly quite often, and it may well have driven much of HBase's development. When you use Facebook, at some point you probably send or receive messages with your friends; this Facebook feature relies entirely on HBase, and every message users read and write is stored there. The system backing Facebook Messages needs high write throughput, very large tables, and strong consistency within a data center; other requirements include high read throughput, counter throughput, and automatic sharding. The engineers found HBase an ideal fit because it supports all of these requirements, it has an active user community, and Facebook's operations teams already had deep experience with Hadoop deployments. In the article "Hadoop goes realtime at Facebook," Facebook engineers explain the reasoning behind this decision and present their experience using Hadoop and HBase to build online systems.
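One common row-key pattern for this kind of workload is to key each message by recipient plus an inverted timestamp, so that a short scan from the start of a user's key range returns the newest messages first. The sketch below shows that general pattern, not Facebook's actual message schema; the table name (messages) and column names are invented for illustration:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class MessageStore {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("messages"))) {
            // Write one message. Inverting the timestamp makes newer messages
            // sort before older ones within a user's key range.
            long now = System.currentTimeMillis();
            byte[] rowKey = Bytes.add(Bytes.toBytes("user42#"),
                                      Bytes.toBytes(Long.MAX_VALUE - now));
            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("from"), Bytes.toBytes("user99"));
            put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("body"), Bytes.toBytes("Hello!"));
            table.put(put);

            // Read recent history: scan this user's key range, newest first,
            // and stop after a handful of messages.
            Scan scan = new Scan();
            scan.setRowPrefixFilter(Bytes.toBytes("user42#"));
            int shown = 0;
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result msg : scanner) {
                    System.out.println(Bytes.toString(
                            msg.getValue(Bytes.toBytes("m"), Bytes.toBytes("body"))));
                    if (++shown >= 20) {
                        break; // only the 20 most recent messages
                    }
                }
            }
        }
    }
}
```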

Facebook engineers shared some interesting numbers at HBaseCon 2012. Billions of messages are exchanged on the platform every day, translating into roughly 75 billion operations per day, and Facebook's HBase clusters handle about 1.5 million operations per second at peak. In terms of data size, the clusters take on terabytes of new data every month, which may make this the largest known HBase deployment, whether measured by number of servers or by number of users served.

The examples above show how HBase solves interesting problems both old and new. You may have noticed a common thread: HBase can be used to work with the same data both online and offline. That is one of HBase's distinguishing features.

 

From http://blog.sina.com.cn/s/blog_ae33b83901016azb.html
