The combination of Hadoop and Couchbase


Hadoop and data processing

Hadoop combines a number of important features that make it useful for breaking large amounts of data down into smaller, more useful chunks.

The main component of Hadoop is the HDFS file system, which distributes information across the cluster. Information stored in this distributed format can be processed individually on each cluster node through a system called MapReduce. The MapReduce process converts the information stored in HDFS into smaller, processed, and more manageable chunks of data.

Because Hadoop runs on multiple nodes, it can be used to process very large amounts of input data and reduce that data to more useful blocks of information. This processing can be handled with a simple MapReduce system.

MapReduce converts incoming information (which is not necessarily in a structured format) into a structure that is easier to use, query, and process.

For example, a typical use is to process log information from hundreds of different applications so that specific problems, counts, or other events can be identified. By using the MapReduce format, you can begin to measure and identify trends, turning what would otherwise be a very large amount of information into smaller chunks of data. When examining the log of a web server, for example, you might want to see the errors that occur within a specific range on a particular page. You can write a MapReduce function to identify specific errors on particular pages and generate that information in the output. Using this method, you can reduce many lines of information from a log file to a much smaller collection of records containing only the error information.

Understanding MapReduce

MapReduce works in two stages. The mapping (map) phase takes the incoming information and maps it into a standardized format. For some types of information, this mapping can be direct and explicit. For example, if the input data is something like a web log, you might extract just a single column of data from the text of the log. For other data, the mapping can be more complex. When working with textual information, such as research papers, you may need to extract phrases or more complex chunks of data.

The refinement (reduce) phase collects and summarizes the data. The reduction can actually take many different forms, but a typical process produces a basic count, sum, or other statistic based on the individual records from the mapping phase.

Consider a simple example, such as the word count that is often used as the sample MapReduce job for Hadoop. The mapping phase breaks the original text down to identify individual words and generates an output block for each word. The reduce function takes these mapped blocks of information and reduces them, incrementing a count for each unique word it sees. Given a text file containing 100 words, the mapping process generates 100 blocks of data, but the reduce phase can summarize this, reporting the number of unique words (for example, 56) and the number of occurrences of each word.
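
To make the two phases concrete, here is a minimal word-count sketch using the Hadoop Java MapReduce API. The class names are illustrative; the structure (a Mapper that emits (word, 1) pairs and a Reducer that sums them) is the standard pattern described above.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every word in the input line.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: sum the counts for each unique word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}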

With web logs, the mapping takes the input data, identifies each error in the log file, and generates a block of data for each error containing the date, time, and page that caused the problem.
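
A mapper for the web-log case might look roughly like the sketch below. The space-delimited layout (date, time, page, status) and the class name are assumptions made for illustration, since real log formats vary; each matching line contributes one (page, 1) pair that a reducer can later sum into per-page error counts.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Maps one access-log line to (page, 1) whenever the response status is an error.
// Assumes a simple space-delimited layout: date time page status ...
public class ErrorLogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text page = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\\s+");
        if (fields.length < 4) {
            return; // skip malformed lines
        }
        String status = fields[3];
        // Treat 4xx and 5xx responses as errors (an assumption for this sketch).
        if (status.startsWith("4") || status.startsWith("5")) {
            page.set(fields[2]);
            context.write(page, ONE);
        }
    }
}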

Within Hadoop, the MapReduce phases run on the nodes where the source information blocks are stored. This is what enables Hadoop to handle such large sets of information: multiple nodes can process the data concurrently. With 100 nodes, for example, you can process 100 log files simultaneously, simplifying many gigabytes (or terabytes) of information much faster than a single node could.
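
The driver below is a minimal sketch of how such a job is configured and submitted, assuming the WordCountMapper and WordCountReducer classes from the earlier sketch. Hadoop itself splits the HDFS input and schedules map tasks on the nodes that hold the corresponding blocks, which is what gives the concurrency described above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountJob {
    public static void main(String[] args) throws Exception {
        // Configure the job; input and output paths come from the command line.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountJob.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}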

Hadoop Information

A major limitation of the core Hadoop product is that it cannot store and query information like a database. Data is added to the HDFS system, but you cannot ask Hadoop to return a list of all the data that matches a particular set of criteria. The main reason is that Hadoop does not structure, or even understand the structure of, the data stored in HDFS. This is why the MapReduce system is needed to analyze and process the information into a more structured format.

However, we can combine the processing power of Hadoop with a more traditional database so that we can query the data that Hadoop generates. There are a number of possible solutions, including some traditional SQL databases, but we can maintain the MapReduce style by using Couchbase Server, which is very effective for large datasets and also uses MapReduce for its own queries.
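
As one illustration of the hand-off (not necessarily the exact integration path intended here), the sketch below uses the Couchbase Java SDK 2.x to store a summarized record produced by a MapReduce job as a JSON document, so it can later be queried from Couchbase. The cluster address, bucket name, document ID, and field names are all hypothetical.

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonObject;

public class StoreSummary {
    public static void main(String[] args) {
        // Connect to a Couchbase cluster and open a bucket (names are hypothetical).
        Cluster cluster = CouchbaseCluster.create("127.0.0.1");
        Bucket bucket = cluster.openBucket("analytics");

        // Store one summarized record produced by a Hadoop MapReduce job,
        // for example the error count for a particular page on a given day.
        JsonObject summary = JsonObject.create()
                .put("page", "/checkout")
                .put("errors", 42)
                .put("date", "2014-01-01");
        bucket.upsert(JsonDocument.create("errors::/checkout::2014-01-01", summary));

        cluster.disconnect();
    }
}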

The basic structure of data sharing between systems is shown in Figure 1.

Figure 1. The basic structure of data sharing between systems
