Sometimes the best way to learn a software product is to see how it is used. The problems it solves, and how those solutions fit into larger application architectures, can tell you a lot. Because HBase has had many public production deployments, we can do exactly that. This section describes in some detail the scenarios in which people have used HBase successfully.
Note: Do not limit yourself to thinking that HBase can solve only these usage scenarios. It is a young technology, and innovation in how it is applied is driving the system's development. If you have a new idea and think you can benefit from the functionality HBase offers, try it. The community is happy to help you and to learn from your experience. That is the spirit of open source software.
HBase was modeled on Google's Bigtable, so let's start by exploring the canonical Bigtable problem: storing the Internet.
The canonical Internet search problem: why Bigtable was invented
Search is the act of locating information you care about: for example, finding the pages of a book that cover a topic you want to read about, or the web pages that contain the information you want. Searching for documents that contain particular words requires an index that maps each word to all the documents containing it. To make search possible, the index must be built first. That is exactly what Google and other search engines do. Their document collection is the entire Internet; the search terms are whatever you type into the search box.
Bigtable, and HBase in imitation of it, provides storage for this document collection. Bigtable provides row-level access, so the crawler can insert and update individual documents. The search index can be generated efficiently from Bigtable via MapReduce computations. If the result is a single document, it can be fetched directly from Bigtable. Supporting these varied access patterns was a key factor in Bigtable's design.
Building the Internet index
1. The crawler continuously fetches new pages, each of which is stored as a row in Bigtable.
2. A MapReduce job runs over the entire table, generating the index that the web search application will use.
Searching the Internet
3. A user initiates a web search request.
4. The web search application queries the prebuilt index, or fetches an individual document directly from Bigtable.
5. The search results are returned to the user.
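The indexing step above can be sketched in miniature. The snippet below is a toy simulation, not a real Bigtable or Hadoop API: a dictionary of crawled pages keyed by URL stands in for the table of rows, and a MapReduce-style map/reduce pair builds the inverted index that maps each word to the pages containing it. All names here are illustrative.

```python
from collections import defaultdict

# Stand-in for the Bigtable of crawled pages: row key -> document body.
crawled_pages = {
    "example.com/a": "hbase stores big tables",
    "example.com/b": "search needs an index",
    "example.com/c": "an index maps words to tables",
}

def map_phase(table):
    """Map step: emit one (word, url) pair per word occurrence."""
    for url, body in table.items():
        for word in body.split():
            yield word, url

def reduce_phase(pairs):
    """Reduce step: group the pairs by word into an inverted index."""
    index = defaultdict(set)
    for word, url in pairs:
        index[word].add(url)
    return index

index = reduce_phase(map_phase(crawled_pages))
print(sorted(index["tables"]))  # pages that contain the word "tables"
```

A search application would then answer a query by looking up each query term in `index`, exactly as in step 4 above.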
With the canonical HBase use case covered, let's look at other places HBase is used. The number of users willing to adopt HBase has grown rapidly over the past few years. That is partly because the product has become more reliable and performant, and even more because companies have begun investing significant resources to support and use it. As more commercial service providers offer support, users are increasingly confident applying HBase to critical systems. A technology designed to store a continuously updated copy of the web has turned out to suit other Internet-related needs as well. HBase, for example, has found a place in and around the varied needs of social networking companies. From storing personal communications to analyzing them, HBase has become key infrastructure at Facebook, Twitter, StumbleUpon, and other companies.
In this area, HBase has three main classes of use case, though it is not limited to them. To keep this section simple and focused, we introduce only the main ones here.
Capturing incremental data
Data usually trickles in steadily and accumulates in a datastore for future use, such as analysis, processing, and serving. Many HBase use cases fall into this category: using HBase as the datastore that captures incremental data arriving from a variety of sources. The source might be a web crawl (the canonical Bigtable problem we just discussed), advertising-effectiveness data recording what users saw and for how long, or time-series data recording metrics of every kind. Let's discuss a few successful use cases and the companies behind them.
Capturing monitoring metrics: OpenTSDB
The back-end infrastructure of a web product serving millions of users typically comprises hundreds or thousands of servers. These servers take on a variety of functions: serving traffic, capturing logs, storing data, processing data, and so on. To keep the product running properly, it is critical to monitor the health of the servers and of the software running on them (from the OS up to user interactions). Monitoring an entire environment at this scale requires collecting and storing metrics from the many different sources the monitoring system watches. Every company has its own approach: some use commercial tools to collect and display metrics, while others adopt open source frameworks.
StumbleUpon created an open source framework for collecting server monitoring metrics. Metrics collected over time are generally called time-series data: data collected and recorded in chronological order. StumbleUpon's framework is called OpenTSDB, which stands for Open Time Series Database. It uses HBase as its core platform for storing and retrieving the collected metrics. The goal of the framework is a scalable monitoring data collection system that can store and retrieve metric data over long periods and accommodate new metrics as additional needs arise. StumbleUpon uses OpenTSDB to monitor all of its infrastructure and software, including its HBase clusters themselves. We examine OpenTSDB in detail in chapter 7 as a sample application built on HBase.
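To make the time-series idea concrete, here is a deliberately simplified sketch of the kind of layout a store like OpenTSDB uses in HBase: one row per metric per hour, with each measurement kept in a column keyed by its offset within that hour, so consecutive readings land next to each other. The real OpenTSDB schema uses compact binary UIDs and tags; this toy version keeps everything human-readable, and all names are illustrative.

```python
def row_key(metric, ts):
    """Row key = metric name plus the hour-aligned base timestamp."""
    base = ts - (ts % 3600)
    return f"{metric}:{base}"

def column(ts):
    """Column qualifier = the reading's offset in seconds into its hour."""
    return ts % 3600

store = {}  # stand-in for an HBase table: row key -> {column -> value}

def put(metric, ts, value):
    store.setdefault(row_key(metric, ts), {})[column(ts)] = value

put("sys.cpu.user", 1700000000, 42.5)
put("sys.cpu.user", 1700000060, 43.1)  # same hour: same row, new column
put("sys.cpu.user", 1700003600, 40.0)  # next hour: a new row
print(len(store))
```

Packing an hour of readings into one row keeps each metric's recent history contiguous, which is what makes scans over a time range cheap.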
Capturing user interaction data: Facebook and StumbleUpon
Capturing monitoring data is one use; capturing user interaction data is another. How do you track the activity of millions of users on a site? How do you know which site features are most popular? How can one page view directly influence the next? For example: who saw what? How many times was a button clicked? Remember the Like button on Facebook and the +1 button on StumbleUpon? Does this sound like a counting problem? Every time a user likes a particular item, a counter is incremented.
StumbleUpon started out on MySQL, but as the service grew more popular that technology choice ran into trouble. The rapidly growing online load exceeded what the MySQL cluster could handle, and StumbleUpon ultimately chose HBase to replace it. At the time, HBase did not directly provide all the necessary features, so StumbleUpon made some small modifications on top of HBase, which it later contributed back to the project community.
Facebook uses HBase counters to measure how many people like particular pages. Content creators and page owners get near-real-time data on how many users like their pages, so they can judge more quickly what content to provide. Facebook built a system called Facebook Insights, which requires a scalable storage backend. The company considered many options, including relational databases, in-memory databases, and Cassandra, and finally decided on HBase. With HBase, Facebook can scale out horizontally to serve millions of users while continuing to draw on its existing experience running large HBase clusters. The system handles tens of billions of events per day, recording hundreds of metrics.
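The counter pattern described above can be modeled in a few lines. In HBase the server performs the cell increment atomically; in this toy simulation a lock stands in for that server-side atomicity, and the row and qualifier names are made up for illustration.

```python
import threading

counters = {}            # stand-in for counter cells: (row, qualifier) -> value
lock = threading.Lock()  # stands in for HBase's server-side atomic increment

def increment(row, qualifier, amount=1):
    """Mimic an atomic cell increment; returns the new value."""
    with lock:
        key = (row, qualifier)
        counters[key] = counters.get(key, 0) + amount
        return counters[key]

# Three users like the same page, then a fourth does.
for _ in range(3):
    increment("page:42", "likes")
print(increment("page:42", "likes"))  # 4
```

Because the increment happens server-side in HBase, many clients can bump the same counter concurrently without a read-modify-write race.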
Telemetry: Mozilla and Trend Micro
Software operational data and software quality data are not as simple as monitoring metrics. Software crash reports, for example, are useful operational data, often used to investigate software quality and to plan the development roadmap. HBase has been used successfully to capture and store crash reports generated on users' computers. Unlike the previous two cases, this scenario is not necessarily tied to web service applications.
The Mozilla Foundation is responsible for two products: the Firefox web browser and the Thunderbird email client. These tools are installed on millions of computers worldwide, across a variety of operating systems. When they crash, a crash report is sent back to Mozilla in the form of a bug report. How does Mozilla collect this data, and what does it do with it? A system called Socorro collects the reports to guide development toward more stable products. Socorro's data storage and analysis are built on HBase. [1]
Using HBase allows basic analysis to draw on far more data than before. That analysis is used to direct Mozilla's developers toward a sharper focus and the least buggy releases possible.
Trend Micro provides Internet security and intrusion management services to corporate clients. An important aspect of security is awareness, and log collection and analysis are essential to providing it. Trend Micro uses HBase to manage its web reputation database, which requires row-level updates and support for MapReduce batch processing. Somewhat like Mozilla's Socorro, HBase is also used to collect and analyze log activity, ingesting billions of records per day. HBase's flexible schema accommodates changing data structures, so Trend Micro can add new attributes whenever its analysis process is adjusted.
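The "flexible schema" point is worth making concrete. In an HBase-style store a row is just a map of column qualifiers to values, so a new attribute can start being written for new records without migrating old ones; older rows are simply sparse. The sketch below is a toy model with made-up attribute names, not Trend Micro's actual schema.

```python
table = {}  # stand-in for an HBase table: row key -> {qualifier -> value}

def put(row, **columns):
    """Write any set of column qualifiers to a row."""
    table.setdefault(row, {}).update(columns)

put("host-1", reputation=0.9, last_seen="2012-05-01")
# Later, the analysis pipeline starts recording a new attribute.
# No ALTER TABLE is needed; old rows simply lack the new column.
put("host-2", reputation=0.2, last_seen="2012-05-02", country="US")

print("country" in table["host-1"], table["host-2"]["country"])
```

Contrast this with a relational schema, where adding the `country` attribute would mean a schema migration across the whole table.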
Advertising effectiveness and clickstream
Over the past decade, online advertising has become a major source of revenue for Internet products. The service is offered to users for free, and targeted advertisements are shown to them as they use it. Precise targeting requires detailed capture and analysis of user interaction data in order to understand the user's characteristics; ads are then selected and served based on those characteristics. Fine-grained user interaction data leads to better models, which lead to better ad targeting and more revenue. This kind of data has two notable characteristics: it arrives as a continuous stream, and it partitions easily by user. Ideally the data becomes usable the moment it is produced, so the user model can be refined continuously and without delay, that is, online.
Online vs. offline systems
The terms online and offline come up repeatedly. For beginners: these terms describe the conditions under which a software system executes. Online systems require low latency; in some cases it is better to respond with no answer at all than to take a long time to produce the correct one. You can picture the user of an online system as impatiently tapping a foot. Offline systems do not need low latency; their users can wait for an answer and do not expect an immediate response. Whether an application aims to be online or offline affects many of its technical decisions. HBase is an online system, and its tight integration with Hadoop MapReduce also gives it offline access capabilities.
HBase is ideal for collecting this stream of user interaction data, and it has been applied successfully in this situation: it can incrementally capture raw clickstream and user interaction data, and different mechanisms (MapReduce among them) can then process that data (cleaning it, enriching it, consuming it). At companies of this kind, you will find many HBase use cases.
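The capture-then-process pattern above can be sketched as follows. The key idea is the row-key design: prefixing each event's key with the user ID keeps each user's activity contiguous in sorted order, so a later batch job can read one user's whole clickstream with a single prefix scan. This is an illustrative simulation (a sorted Python list stands in for HBase's sorted rows), not a real client API.

```python
import bisect

events = []  # stand-in for an HBase table: sorted (row_key, payload) pairs

def capture(user, ts, action):
    """Incremental capture: insert an event keyed by user + timestamp."""
    bisect.insort(events, (f"{user}:{ts:010d}", action))

def scan_user(user):
    """Batch-side read: a prefix scan over one user's events, in time order."""
    prefix = f"{user}:"
    return [action for key, action in events if key.startswith(prefix)]

capture("alice", 1700000100, "click:ad-7")
capture("bob",   1700000050, "view:home")
capture("alice", 1700000050, "view:home")
print(scan_user("alice"))
```

Zero-padding the timestamp keeps lexicographic key order identical to chronological order, which is why the scan comes back time-sorted.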
Content Services
One of the biggest traditional uses of databases is serving content to users. Databases of all kinds back applications that serve all sorts of content. These applications have evolved over the years, and so have the databases they depend on. The variety of content users want to consume and interact with keeps growing. Moreover, thanks to the rapid growth of the Internet and of client devices, these applications face ever greater connectivity demands. The variety of devices poses a challenge of its own: different classes of device need the same content in different formats.
On one side, users consume content; on the other, they generate it. Tweets, Facebook posts, Instagram pictures, and microblog entries are all examples. What these services have in common is that users both consume and generate enormous amounts of content. Huge numbers of users do so through applications, and those applications need a foundation like HBase.
A centralized content management system (CMS) can store content and serve it. But as more users generate more and more content, a more scalable CMS solution is needed. Such scalable CMSes often use HBase as the foundation, combined with other open source frameworks, such as Solr, to form a complete solution.
Salesforce provides a hosted CRM product, delivered to users through a web browser interface, that exposes rich relational database functionality. Long before Google published the papers describing its NoSQL prototypes, the only sensible choice for a large production database was a commercial relational one. Over the years, Salesforce scaled its system through database sharding and cutting-edge performance optimization to reach hundreds of millions of transactions per day.
When Salesforce surveyed distributed database options, it evaluated all the NoSQL technologies on the market and finally decided to deploy HBase. The main reason for the choice was that Bigtable-style systems are the only architecture that seamlessly combines horizontal scalability with row-level strong consistency. In addition, Salesforce already used Hadoop for large offline batch jobs, so it could keep drawing on the valuable experience it had accumulated with Hadoop.
URL shortening
URL shorteners have become very popular recently, and many similar products have sprung up. StumbleUpon runs a shortener called su.pr, which is built on HBase. The product shortens URLs and stores the enormous set of mappings between short links and their original long URLs; HBase gives it the ability to scale.
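A minimal sketch of the mapping such a service keeps: each short code acts as a row key whose value is the original long URL. Base-62 encoding of a sequence number (which in HBase could come from an atomic counter) is one common way to generate codes; it is shown here for illustration and is not su.pr's actual scheme.

```python
import string

ALPHABET = string.digits + string.ascii_letters  # base-62 symbol set

def encode(n):
    """Encode a non-negative integer as a base-62 short code."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

links = {}       # stand-in for the HBase table: short code -> long URL
next_id = [125]  # stand-in for an atomic HBase counter cell

def shorten(url):
    code = encode(next_id[0])
    next_id[0] += 1
    links[code] = url
    return code

code = shorten("https://example.com/a/very/long/path")
print(code, links[code])
```

Resolving a short link is then a single row lookup by code, which is exactly the point access pattern HBase is good at.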
User model services
Content processed by HBase is often not served to the user directly; instead, it is used to decide what content should be served. This intermediate data enriches the user's interactions. Remember the user model in the advertising scenario mentioned earlier? The user model (or profile) comes from HBase. Such a model can feed many different scenarios: deciding which ad to show a particular user, deciding what price to quote in real time while the user shops on an e-commerce site, adding background information and relevant context to a user's search engine queries, and so on. Many of these use cases cannot easily be discussed openly, and saying more would get us in trouble.
A user model service can drive real-time pricing when a user transacts on an e-commerce site. The model must be continuously refined as new user data is continuously generated.
Information exchange
Social networks of every kind are spanning the globe, and the world keeps getting smaller. An important function of a social network is helping people interact. Sometimes interaction happens within a group (small or large); sometimes it happens between two individuals. Think of hundreds of millions of people conversing over social networks. Talking to people far away is not enough; people also want to see the history of their conversations with others. Fortunately for social networking companies, storage is cheap, and innovation in the big data space helps them take advantage of that cheap storage.
Facebook's messaging system is frequently discussed in public, and it may also be a major driver of HBase's development. When you use Facebook, at some point you will probably send messages to, or receive messages from, your friends. This Facebook feature relies entirely on HBase: all the messages users read and write are stored there. The system backing Facebook messaging needs high write throughput, very large tables, and strong consistency within a datacenter. Other Facebook systems using HBase also demand high read throughput, counter throughput, and automatic sharding. Facebook's engineers found HBase an ideal solution because it supported all of these requirements, it had an active user community, and Facebook's operations team had deep experience with Hadoop deployments, among other reasons. In the paper "Hadoop goes realtime at Facebook," Facebook's engineers explain the logic behind the decision and present their experience building online systems with Hadoop and HBase.
Facebook engineers shared some interesting numbers at HBaseCon 2012. Billions of messages are exchanged on the platform every day, producing roughly 75 billion operations daily. At peak, Facebook's HBase clusters handle 1.5 million operations per second. In terms of data volume, the clusters take in 250 TB of new data each month. This is probably the largest known HBase deployment, whether measured by server count or by the number of users served.
Reposted from http://blog.sina.com.cn/s/blog_ae33b83901016azb.html