Hadoop and Metadata (Removing the Impedance Mis-match)


Apache Hadoop launched an unprecedented revolution in how organizations process data: with freely scalable Hadoop, enterprises can create new value through new applications in less time than before and extract the insight they want from big data. This revolution encouraged enterprises to build a Hadoop-centered data-processing model, but it also posed challenges: How can we collaborate freely around Hadoop? How can we store and process whatever data we expect, and share that data as intended? And how do we integrate different tools and other systems into one data center, that is, into one computer?

As Hadoop users, our requirements for a metadata directory are clear. Users do not want to "reinvent the wheel"; they want to share results and data sets with colleagues as they work. Combining these needs, the general shape of a metadata mechanism layered on Hadoop is easy to determine: by registering data assets in a metadata record, we can understand those assets better and discover and share them more efficiently. Remember: make users do as little work as possible.

Users also want different tool sets and systems, Hadoop and non-Hadoop alike, to work together. As Hadoop users, we have clear requirements for interoperability among the tools on a Hadoop cluster: Hive, Pig, Cascading, Java MapReduce, and Hadoop Streaming jobs written in Python, C/C++, Perl, or Ruby. The data formats involved include CSV, TSV, Thrift, Protocol Buffers, Avro, SequenceFile, and RCFile, the Hive-specific columnar format.

Finally, raw data usually does not originate in HDFS (the Hadoop Distributed File System). In that case, resources from different kinds of systems need to be registered with a central node, both to support ETL into HDFS and to publish Hadoop's analysis results to other systems.

Curt, you're right... HCatalog really is important

Curt Monash recently published an article titled "HCatalog: it really matters," and we recommend reading it. In the article, Curt regards HCatalog as a metadata service for Hadoop clusters, with value comparable to that of a database management system (DBMS). While that claim is still being tested, it is worth noting that HCatalog is, in effect, the interface connecting Hadoop to the enterprise application ecosystem, and that is very important.

This article follows up with a closer look at the definition, history, and usage of HCatalog.

Definition of HCatalog

One of Hadoop's most attractive features is that it can flexibly process semi-structured and unstructured data without requiring a schema. In most organizations, unstructured data accounts for 80% of all data and is growing 10 to 50 times faster than structured data. Hadoop is very good at extracting structure from unstructured data, and once the question of access is solved, HCatalog helps Hadoop deliver that structured data to the analysts, systems, and applications that need it.

HCatalog is a metadata and table management system for Hadoop. Built on the metadata layer in Hive, it presents Hadoop data through a relational, SQL-like table abstraction. HCatalog allows users to share data and metadata across Hive, Pig, and MapReduce. Another benefit of this architecture is that application writers need not care where or in what format the data is stored, and they are shielded from the impact of schema and storage-format changes.

This flexibility reduces the friction between data producers, consumers, and administrators and gives them a clean foundation for cooperation. A producer can add new columns to the data without breaking the applications that read it. An administrator can migrate data or change its storage format without affecting producers or consumers. And with HCatalog, it is easier to discover new datasets and to notify users that they exist.

With HCatalog, other tools on Hadoop can access the Hive metastore. It provides connectors for MapReduce and Pig, so users of those tools can read from and write to Hive's tabular, column-oriented data. It offers a command-line tool, driven by Hive DDL statements, for users who do not go through Hive itself to operate on the metastore. And it provides a notification service, so a workflow tool such as Oozie can be notified when new data becomes available.
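To make the MapReduce connector concrete, here is a minimal sketch of a map-only job that reads rows from a Hive-managed table by name and writes one field out as text. It is not from the original article: the database and table names (`default`, `rawevents`), the assumption that column 0 is a string, and the output path are all placeholders, and the `org.apache.hcatalog` package names are the ones used before HCatalog moved into Hive (later releases use `org.apache.hive.hcatalog`).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hcatalog.data.HCatRecord;
import org.apache.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hcatalog.mapreduce.InputJobInfo;

public class ReadFromHCatalog {

  // The mapper sees one HCatRecord per row, whatever the underlying
  // storage format (text, SequenceFile, RCFile, ...) happens to be.
  public static class EventMapper
      extends Mapper<WritableComparable, HCatRecord, Text, NullWritable> {
    @Override
    protected void map(WritableComparable key, HCatRecord value, Context ctx)
        throws java.io.IOException, InterruptedException {
      // Assumes column 0 of the table is a string-typed field.
      ctx.write(new Text(String.valueOf(value.get(0))), NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "read-from-hcatalog");
    job.setJarByClass(ReadFromHCatalog.class);

    // Ask for the table by name; HCatalog looks up its location,
    // schema, and storage format in the Hive metastore.
    HCatInputFormat.setInput(job,
        InputJobInfo.create("default", "rawevents", null));
    job.setInputFormatClass(HCatInputFormat.class);

    job.setMapperClass(EventMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note that no input paths or file formats appear anywhere in the job: HCatalog resolves both from the metastore, which is exactly the insulation described above.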

Hadoop REST Interface

Templeton is a character in the novel Charlotte's Web: a greedy rat who helps the protagonist, Wilbur the pig, but only in exchange for food. In Hadoop, Templeton serves HCatalog by providing a REST interface over the metadata layer. It gives Hadoop a REST API through which external resources can interact with Hadoop without using Hadoop's own APIs. This greedy rat hands all of us a simple, common interface, and with it opens the door to Hadoop for every application developer.

Templeton acts rather like a JDBC connector for Hive. Its REST interface uses plain HTTP to offer a dynamic metadata layer to existing and new applications, exposing the resources mapped by HCatalog and Hive to any HTTP client.
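As a rough illustration, not from the original article, the sketch below lists the tables Templeton exposes in the `default` database over plain HTTP. The gateway host, the default Templeton port 50111, and the `user.name` value are assumptions to adapt to your own cluster; on a secured cluster the authentication step differs.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Lists the tables HCatalog knows about in the "default" database,
// via Templeton's REST API. Host, port, and user are placeholders.
public class ListTables {
  public static void main(String[] args) throws Exception {
    URL url = new URL(
        "http://hadoop-gateway:50111/templeton/v1/ddl/database/default/table"
        + "?user.name=alice");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");

    // The response is a small JSON document, e.g.
    // {"tables":["rawevents","dailysummary"],"database":"default"}
    BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), "UTF-8"));
    String line;
    while ((line = in.readLine()) != null) {
      System.out.println(line);
    }
    in.close();
    conn.disconnect();
  }
}
```

Because this is just HTTP and JSON, the same call works from a browser, curl, or any language with an HTTP client; no Hadoop client library is needed on the calling side.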

Practical Applications of HCatalog

Here are the three basic functions of HCatalog.

1. Communication between tools

Serious Hadoop users never process data with a single tool. Typically, a user or team starts with just one: Hive, Pig, MapReduce, or something else. As their use of Hadoop deepens, they find that the tool they chose is not optimal for their newer tasks. Users who began with Hive for analysis and queries come to prefer Pig for ETL processing or data-model construction; users who began with Pig find they want Hive for analytic queries. And although tools such as Pig and MapReduce do not require metadata, they still benefit when it is present. By sharing a metadata store, users can move data between tools with ease. A common pattern is to load and normalize data with MapReduce or Pig and then analyze it with Hive. When all of these tools share one metastore, the users of each tool gain instant access to data created with the others, with no loading or transfer steps. A sketch of this pattern follows.
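Here is a hedged sketch of what "no loading and transfer steps" looks like in code: a map-only MapReduce job that reads one HCatalog-registered table and writes another, so that the moment it finishes, Hive and Pig users can query the output by name. The table names (`rawevents`, `rawevents_copy`) are placeholders, and the `org.apache.hcatalog` packages again reflect the pre-Hive-merge naming.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hcatalog.data.DefaultHCatRecord;
import org.apache.hcatalog.data.HCatRecord;
import org.apache.hcatalog.data.schema.HCatSchema;
import org.apache.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hcatalog.mapreduce.HCatOutputFormat;
import org.apache.hcatalog.mapreduce.InputJobInfo;
import org.apache.hcatalog.mapreduce.OutputJobInfo;

public class CopyThroughMetastore {

  // Map-only pass: HCatOutputFormat ignores the key, so the usual
  // idiom is to write a null key with the record as the value.
  public static class PassThrough
      extends Mapper<WritableComparable, HCatRecord,
                     WritableComparable, DefaultHCatRecord> {
    @Override
    protected void map(WritableComparable key, HCatRecord value, Context ctx)
        throws java.io.IOException, InterruptedException {
      DefaultHCatRecord out = new DefaultHCatRecord(value.size());
      for (int i = 0; i < value.size(); i++) {
        out.set(i, value.get(i));
      }
      ctx.write(null, out);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "copy-through-metastore");
    job.setJarByClass(CopyThroughMetastore.class);

    // Both ends are named tables; no paths or file formats appear here.
    HCatInputFormat.setInput(job,
        InputJobInfo.create("default", "rawevents", null));
    HCatOutputFormat.setOutput(job,
        OutputJobInfo.create("default", "rawevents_copy", null));

    // Reuse the destination schema registered in the metastore, so the
    // rows this job writes match what Hive and Pig expect to read.
    HCatSchema schema = HCatOutputFormat.getTableSchema(job);
    HCatOutputFormat.setSchema(job, schema);

    job.setMapperClass(PassThrough.class);
    job.setNumReduceTasks(0);
    job.setInputFormatClass(HCatInputFormat.class);
    job.setOutputFormatClass(HCatOutputFormat.class);
    job.setOutputKeyClass(WritableComparable.class);
    job.setOutputValueClass(DefaultHCatRecord.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```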

2. Data discovery

When analyzing data, users rely on Hadoop to extract structured information from raw data. They typically use Pig, Hadoop Streaming, and MapReduce to analyze the data and surface new points of interest. Often, the value of such information only shows up in a larger analytic context. Publish the analysis results through HCatalog, and your analytics platform can reach them through the REST service; in this setting, the schema itself drives discovery. These findings are just as useful to data scientists, who commonly take data or analysis results created by others as the input to their next investigation. Registering data in HCatalog is, in effect, announcing that new data is available; a hedged example of such a discovery call follows.
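For instance, before pulling any data, an external platform might ask the REST service to describe a newly registered table. The sketch below, with a placeholder host, table, and user, fetches the schema of one table so that a consumer can decide whether it is the dataset they want.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Describes one registered table: the discovery step an external
// platform runs before pulling data. Names here are placeholders.
public class DescribeTable {
  public static void main(String[] args) throws Exception {
    URL url = new URL(
        "http://hadoop-gateway:50111/templeton/v1/ddl/database/default"
        + "/table/rawevents?user.name=alice");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");

    // The JSON body lists the columns and their types, e.g.
    // {"columns":[{"name":"userid","type":"string"}, ...], ...}
    BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), "UTF-8"));
    for (String line; (line = in.readLine()) != null; ) {
      System.out.println(line);
    }
    in.close();
    conn.disconnect();
  }
}
```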

3. System integration

As a data-processing and storage environment, Hadoop opens a world of opportunity to enterprise applications. To take full advantage of it, however, it must work with and strengthen your existing tools: Hadoop should feed your analytics platform and integrate with your operational data stores and web applications. The organization should enjoy the value of Hadoop without having to learn a new toolset. Through the REST services that Templeton provides, you can open the platform to the enterprise with familiar APIs and an SQL-like language, and in doing so open up the whole platform.

As preparation for Hadoop in enterprise applications, HCatalog is a natural extension. Yes, Curt, it matters... it matters a lot!

About the authors

Alan Gates is a co-founder of Hortonworks and was previously a member of the team at Yahoo! whose work turned Pig from a lab project into an independent open-source Apache project. Gates also took part in the design of HCatalog and guided it into the Apache Incubator. He earned a bachelor's degree in mathematics from Oregon State University and a master's degree in theology from Fuller Theological Seminary, and he is the author of Programming Pig, published by O'Reilly. Follow Gates on Twitter: @alanfgates.

Russell Jurney is currently focused on casino gaming data: he has built web applications that analyze the performance of slot machines in the United States and Mexico. Russell is the author of Agile Data (to be published by O'Reilly in March 2013). After working in startups, interactive media, and journalism, he moved to Silicon Valley to build large-scale analytics applications at Ning and LinkedIn. He now evangelizes Hadoop at Hortonworks, and lives on a cliff along California's Pacific coast with his wife Kate and two fuzzy dogs.

View the original English text: Hadoop and Metadata (Removing the Impedance Mis-match)

View the Chinese translation: http://www.infoq.com/cn/articles/HadoopMetadata
