Hadoop and metadata (solving impedance mismatch problems)


Apache Hadoop has launched an unprecedented revolution in how organizations handle data: free and scalable, Hadoop lets them create new value with new applications and extract insight from big data in far less time than before. The revolution is an attempt to build a data-processing model centered on Hadoop, but it also raises challenges: How do we collaborate within Hadoop's freedom? How do we store and process data in any format and share it as users wish? And how do we integrate different tools and other systems so that the data center becomes the computer?

For Hadoop users, the need for a metadata catalog is clear. Users do not want to reinvent the wheel; they want to collaborate with colleagues, sharing results and datasets along the way. Given these needs, it is easy to identify a common metadata mechanism for the layers above Hadoop: by registering data assets as metadata records, users not only understand their data assets more clearly but also discover and share them more efficiently. Remember: make your users do as little work as possible.

Users also want different toolsets and systems to work together, including Hadoop and non-Hadoop systems. Hadoop users clearly need interoperability among the tools on a cluster: Hive, Pig, Cascading, Java MapReduce, and Hadoop Streaming with Python, C/C++, Perl, and Ruby. The data formats include CSV, TSV, Thrift, Protocol Buffers, Avro, and SequenceFile, as well as RCFile, a Hive-specific format.

Finally, raw data usually does not originate in HDFS (the Hadoop Distributed File System). This calls for registering the resources of different kinds of systems at a central point, to support both ETL into HDFS and the publication of Hadoop analysis results to other systems.

Curt, you're right ... HCatalog really is important.

Curt Monash recently published an article titled "HCatalog--it's important," which makes the point from several angles; I recommend reading it. In the article, Curt argues that HCatalog, as the metadata service for a Hadoop cluster, can deliver value comparable to a database management system (DBMS). While that claim is still being tested, it is worth noting that HCatalog amounts to Hadoop's connection into the enterprise application ecosystem.

This article also offers a deeper look at the definition, history, and use of HCatalog.

Definition of HCatalog

One of Hadoop's most attractive features is its flexibility in handling semi-structured and unstructured data without requiring a schema. In most organizations, unstructured data accounts for 80% of total data and is growing 10 to 50 times faster than structured data. Hadoop excels at extracting structured data from unstructured data. HCatalog helps Hadoop deliver that value by giving analysts, systems, and applications access to the structured data it mines.

HCatalog is a table and metadata management system for Hadoop. Built on the metadata layer of Hive, it presents Hadoop data through a relational, SQL-like view. HCatalog lets users share data and metadata across Hive, Pig, and MapReduce. It also lets users write applications without caring how or where the data is stored, and insulates them from changes in schema or storage format.
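To make that concrete, here is a minimal sketch of reading an HCatalog-managed table from Java MapReduce. The database name "default", the table name "raw_events", and the assumption that column 0 holds a string are all hypothetical; the point is that the job names a table and receives HCatRecord objects, never touching file paths or storage formats.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

public class ReadRawEvents {

  // The mapper sees HCatRecord objects; it does not know or care whether
  // the underlying table is stored as RCFile, SequenceFile, or plain text.
  public static class EventMapper
      extends Mapper<WritableComparable<?>, HCatRecord, Text, NullWritable> {
    @Override
    protected void map(WritableComparable<?> key, HCatRecord value, Context ctx)
        throws java.io.IOException, InterruptedException {
      // Column 0 is assumed to be a string field in the hypothetical table.
      ctx.write(new Text(value.get(0).toString()), NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "read-raw-events");
    job.setJarByClass(ReadRawEvents.class);

    // Ask the metastore for the table by name: no paths, no formats.
    // "default" and "raw_events" are hypothetical database/table names.
    HCatInputFormat.setInput(job, "default", "raw_events");
    job.setInputFormatClass(HCatInputFormat.class);

    job.setMapperClass(EventMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

If an administrator later moves the table or switches its storage format, this job's code does not change.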

This flexibility ultimately reduces friction among data producers, consumers, and administrators, and gives them a clear basis for cooperation. A producer can add new columns to the data without breaking the applications that read it. An administrator can relocate data or change its storage format without affecting either producers or consumers. And with HCatalog, new datasets are easier to discover and to announce to their consumers.
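As a small illustration of that decoupling (the HiveServer2 URL, credentials, and table name below are hypothetical), a producer might append a column through Hive's JDBC driver; because the change is append-only, applications that read the existing columns keep working unchanged:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AddColumn {
  public static void main(String[] args) throws Exception {
    // Explicit driver load for older JDBC setups; JDBC 4 finds it automatically.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // Hypothetical HiveServer2 endpoint; adjust host, port, and database.
    String url = "jdbc:hive2://metastore-host:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "producer", "");
         Statement stmt = conn.createStatement()) {
      // Appending a column is a metadata-only change; existing readers of
      // raw_events that do not select referrer_url are unaffected.
      stmt.execute("ALTER TABLE raw_events ADD COLUMNS (referrer_url STRING)");
    }
  }
}
```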

Through HCatalog, tools can access the Hive metastore on Hadoop. HCatalog provides connectors for MapReduce and Pig, so users of those tools can read from and write to Hive's relational, column-oriented tables. For users who do not work in Hive, HCatalog provides a command-line tool that accepts Hive DDL statements against the metastore. It also provides a notification service, so a workflow tool such as Oozie can be notified when new data becomes available.
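For a sense of what sits beneath those connectors, here is a hedged sketch of direct metastore access using the Hive metastore's Java client (the Thrift URI below is hypothetical; in practice it is usually picked up from hive-site.xml):

```java
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.FieldSchema;

public class ListTables {
  public static void main(String[] args) throws Exception {
    HiveConf conf = new HiveConf();
    // Hypothetical metastore endpoint; normally configured in hive-site.xml.
    conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://metastore-host:9083");

    HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
    try {
      // Walk every table registered in the "default" database and print
      // the column names and types recorded in the metastore.
      for (String table : client.getAllTables("default")) {
        System.out.println(table);
        for (FieldSchema col : client.getFields("default", table)) {
          System.out.println("  " + col.getName() + ": " + col.getType());
        }
      }
    } finally {
      client.close();
    }
  }
}
```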

The REST Interface of Hadoop

Templeton is a character in the novel Charlotte's Web: a greedy rat who helps the protagonist, the pig Wilbur, but only in exchange for food. In Hadoop, Templeton helps HCatalog by providing a REST interface on top of its metadata. It gives Hadoop a REST API, letting external resources interact with Hadoop without using Hadoop's native, out-of-band APIs. This greedy rat offers all of us a simple and familiar interface and opens a door into Hadoop for every application developer.

Templeton behaves rather like a JDBC connector layered above Hive. Its REST interface exposes a dynamically shared metadata layer to existing and new applications over HTTP, opening the resources mapped into HCatalog and Hive to any HTTP client.
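As a minimal sketch of that door, the request below lists the tables in a database over plain HTTP. The host name is hypothetical; port 50111 is Templeton/WebHCat's conventional default, and in an unsecured cluster the caller identifies itself with the user.name query parameter:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ListTablesOverRest {
  public static void main(String[] args) throws Exception {
    // Templeton/WebHCat conventionally listens on port 50111; the host and
    // the user.name value here are hypothetical.
    URL url = new URL("http://gateway-host:50111/templeton/v1/"
        + "ddl/database/default/table?user.name=analyst");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");

    // The response is JSON along the lines of
    // {"tables":["raw_events", ...],"database":"default"}.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```

No Hadoop libraries are involved on the client side; any language with an HTTP client can do the same.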

Practical Applications of HCatalog

Here are three basic uses of HCatalog.

1. Enabling communication between tools

Heavy Hadoop users never process data with a single tool. Users and teams usually start with just one: Hive, Pig, MapReduce, or something else. As they dig deeper into Hadoop, they discover that their tool of choice is not optimal for every new task. Users who start with Hive for analytic queries often turn to Pig for ETL processing or for building data models, and users who start with Pig often find they would rather run analytic queries in Hive. Although tools such as Pig and MapReduce do not require metadata, they still benefit when it is present. Sharing a metadata store makes it easier for users to share data across tools. For example, it is already common to load and normalize data with MapReduce or Pig and then analyze it with Hive. When all of these tools share one metastore, users of each tool gain immediate access to data created with any other tool, with no loading or transfer steps in between.
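A hedged sketch of the producing half of that pattern: a map-only Java job parses raw TSV input and writes it into a hypothetical HCatalog table "clicks" with columns (user string, url string). Once the job commits, Hive and Pig users can query the table immediately. HCatOutputFormat method signatures have shifted slightly across HCatalog releases, so treat this as illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hive.hcatalog.data.DefaultHCatRecord;
import org.apache.hive.hcatalog.data.schema.HCatSchema;
import org.apache.hive.hcatalog.mapreduce.HCatOutputFormat;
import org.apache.hive.hcatalog.mapreduce.OutputJobInfo;

public class LoadClicks {

  // Parse raw TSV lines into (user, url) records for the target table.
  public static class ParseMapper
      extends Mapper<LongWritable, Text, NullWritable, DefaultHCatRecord> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws java.io.IOException, InterruptedException {
      String[] parts = value.toString().split("\t");
      DefaultHCatRecord record = new DefaultHCatRecord(2);
      record.set(0, parts[0]);  // user
      record.set(1, parts[1]);  // url
      ctx.write(NullWritable.get(), record);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "load-clicks");
    job.setJarByClass(LoadClicks.class);

    TextInputFormat.addInputPath(job, new Path(args[0]));
    job.setMapperClass(ParseMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(DefaultHCatRecord.class);

    // Register the output with the metastore: database "default",
    // table "clicks" (both hypothetical), no static partition values.
    HCatOutputFormat.setOutput(job,
        OutputJobInfo.create("default", "clicks", null));
    HCatSchema schema = HCatOutputFormat.getTableSchema(job.getConfiguration());
    HCatOutputFormat.setSchema(job, schema);
    job.setOutputFormatClass(HCatOutputFormat.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The job never specifies where "clicks" lives or how it is serialized; the metastore supplies both, which is exactly what lets a Hive analyst pick up the data the moment the job finishes.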

2. Data discovery

When Hadoop is used for analysis, users extract structured information from raw data, typically analyzing it with Pig, Hadoop Streaming, or MapReduce to find new points of interest. Often the value of the information emerges only within a larger analytic environment. By registering analysis results in HCatalog, your analytics platform can reach the content through REST services; here, the schema drives discovery. The results are also useful to data scientists, who typically take data or analysis results created by others as the input to their next discovery. Registering data in HCatalog is, in effect, announcing that new data is available.
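For instance (the host, user, and table names are again hypothetical), a downstream consumer can fetch the schema of a newly registered result table through Templeton's DDL resource and adapt to whatever columns it finds:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class DescribeTable {
  public static void main(String[] args) throws Exception {
    // Describe the hypothetical table "clicks_by_region"; the JSON response
    // lists its columns and types, so consumers can adapt to the schema.
    URL url = new URL("http://gateway-host:50111/templeton/v1/"
        + "ddl/database/default/table/clicks_by_region?user.name=analyst");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```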

3. System integration

As an environment for processing and storing data, Hadoop offers enterprise applications a wealth of opportunities. To exploit it fully, though, it must complement and work with existing tools: Hadoop should feed your analytics platform, or integrate with your operational data stores and web applications. The organization should enjoy the value of Hadoop without having to learn a whole new set of tools. With the REST services that Templeton provides, the platform can be opened to the enterprise through familiar APIs and a SQL-like language. In this way, it opens up the entire platform.

As Hadoop's connection to enterprise applications, HCatalog represents the next logical extension. Yes, Curt, it really is important ... it's important!
