Building a database using Hive

What if a company doesn't have the resources to build a complex, large-scale data analysis platform? What if its Business Intelligence (BI), data warehousing, and analysis tools cannot connect to an Apache Hadoop system, or are more complex than the requirements call for? Most businesses have employees with relational database management system (RDBMS) and Structured Query Language (SQL) experience. Apache Hive lets these database developers and data analysts use Hadoop without having to learn the Java programming language or MapReduce. You can design a star-model data warehouse, or a normalized database, without writing challenging MapReduce code. Suddenly, BI and analytics tools such as IBM Cognos or SPSS Statistics can connect to the Hadoop system.

Database

Building a database and being able to use its data is not a problem unique to Hadoop, or even to databases. For years, people have been organizing data, for example into libraries. The age-old questions remain: How do you categorize data? How do you bring all of the data together in an integrated platform, framework, or library? Over the years, many approaches have emerged.

People have invented many methods: the Dewey Decimal system, address books with personal and business names arranged alphabetically, metal filing cabinets, warehouses with shelving, card file systems, and so on. Employers try to track employees with time cards, punch clocks, and timesheets. People need to structure and organize data, and they need to report on and examine that data. If you can't access, structure, or understand the data, what is the practical value of storing so much of it?

RDBMSes use set theory and third normal form. Data warehousing has the Kimball and Inmon methodologies, star models, the Corporate Information Factory, and departmental data marts. There are master data management, enterprise resource planning, customer relationship management, electronic medical record, and many other systems that people use to organize transactions into some kind of structure and subject area. Now we have huge volumes of unstructured or semi-structured data from sources such as social media, email, call histories, machine instructions, telemetry, and so on. This new data needs to be integrated with the old data in very complex, very large systems that store both in a structured form. How do you organize the data so that a sales manager can produce better reports? How do you build a library so that executives can access charts and graphs?

You need to find a way to structure the data into a database. Otherwise, only data scientists can get at it. Sometimes people just need simple reports. Sometimes they just want to drag and drop or write SQL queries.

Big data, Hadoop, and InfoSphere BigInsights

This section introduces InfoSphere BigInsights and how it relates to Hadoop, big data, Hive, databases, and so on. InfoSphere BigInsights is IBM's distribution of Hadoop. You may be more familiar with the Apache and Cloudera distributions, but many vendors in the industry offer their own Hadoop distributions. Each starts with the open source MapReduce and Hadoop Distributed File System (HDFS) core and usually includes other tools such as ZooKeeper, Oozie, Sqoop, Hive, Pig, and HBase. What distinguishes these distributions from plain Hadoop is what they add on top of it. InfoSphere BigInsights is this type of distribution.

You can run InfoSphere BigInsights on top of the Cloudera distribution of Hadoop. In addition, InfoSphere BigInsights provides a fast, unstructured-data analysis engine that you can combine with InfoSphere Streams. InfoSphere Streams is a real-time analysis engine, which opens up the possibility of combining real-time and batch-oriented analysis.

InfoSphere BigInsights also includes BigSheets, a built-in, browser-based spreadsheet. It lets everyday analysts work with big data and Hadoop through a familiar spreadsheet-style interface. Other features include role-based security and administration with LDAP integration; integration with InfoSphere DataStage for extraction, transformation, and loading (ETL); accelerators for common use cases such as log and machine data analysis; an application catalog of common and reusable jobs; an Eclipse plug-in; and BigIndex, which is essentially a Lucene-based indexing tool built on top of Hadoop.

You can also improve performance with Adaptive MapReduce, compressed text files, and adaptive scheduling enhancements. In addition, you can integrate other applications, such as content analytics and Cognos Consumer Insights.

Hive

Hive is a powerful tool. It uses HDFS, a metadata store (an Apache Derby database by default), shell commands, drivers, a compiler, and an execution engine. It also supports Java Database Connectivity (JDBC) connections. With its SQL-like and database-like capabilities, Hive opens the big data Hadoop ecosystem to non-programmers. It also connects to external BI software, for example Cognos, through JDBC drivers and web clients.

You can rely on existing database developers instead of spending time and money hunting for Java MapReduce programmers. The benefit is that a database developer can write 10 to 15 lines of SQL code, which Hive optimizes and translates into MapReduce code, instead of forcing a programmer to write 200 lines or more of far more complex MapReduce code.
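
As a rough illustration (the table and column names below are hypothetical), a report that might take a couple of hundred lines of MapReduce can be expressed in a few lines of HiveQL:

    -- Hypothetical web_logs table; Hive compiles this query into MapReduce jobs.
    SELECT to_date(view_time) AS view_day,
           COUNT(*)           AS page_views
    FROM   web_logs
    WHERE  response_code = 200
    GROUP  BY to_date(view_time)
    ORDER  BY view_day;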

Hive is often described as a data warehouse infrastructure built on top of Hadoop. The fact is, Hive is not really a data warehouse. If you want to build a true data warehouse, you can use tools such as IBM Netezza. But if you want to build a database on Hadoop and you don't know Java or MapReduce, Hive can be a great choice (if you know SQL). Hive lets you write SQL-like queries with HiveQL against data in Hadoop and HBase, and it lets you build star models on top of HDFS.
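
For example, a minimal star-model sketch in HiveQL might look like the following (the table names, columns, and file formats are assumptions, not a prescribed design):

    -- One dimension table and one fact table, stored as delimited files in HDFS.
    CREATE TABLE dim_product (
        product_id   INT,
        product_name STRING,
        category     STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    CREATE TABLE fact_sales (
        sale_date  STRING,
        product_id INT,
        store_id   INT,
        quantity   INT,
        amount     DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    -- A report-style query across the star model.
    SELECT d.category, SUM(f.amount) AS total_sales
    FROM   fact_sales f
    JOIN   dim_product d ON f.product_id = d.product_id
    GROUP  BY d.category;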

Hive and RDBMSes

Hive is a schema-on-read system, whereas RDBMSes are typically schema-on-write systems. A traditional RDBMS validates data against the schema as it is written; if the data does not match the structure, it is rejected. Hive doesn't care about the structure of the data, at least not at first, and it doesn't validate the schema when you load the data. Rather, the schema is applied only once you run a query.
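
Here is a small sketch of what schema on read looks like in practice (the table, columns, and paths are hypothetical):

    -- Declaring the table and loading data does not validate anything;
    -- rows that don't match the declared columns surface as NULLs at query time.
    CREATE EXTERNAL TABLE raw_events (
        event_id   INT,
        event_time STRING,
        payload    STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/raw_events';

    LOAD DATA INPATH '/staging/events.tsv' INTO TABLE raw_events;

    -- The schema is applied only when the data is read.
    SELECT event_id, event_time FROM raw_events LIMIT 10;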

Limitations of Hive

There are some challenges when using Hive. First, it is not SQL-92 compliant. Some standard SQL features, such as NOT IN, NOT LIKE, and NOT EQUAL, do not exist or require workarounds. Similarly, some mathematical functions are strictly limited or missing. Timestamps and dates are relatively recent additions and are more compatible with Java dates than with SQL dates. Some simple features, such as date differences, do not work properly.
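
For instance, where a NOT IN subquery is not supported, the usual workaround is an outer join with an IS NULL filter. A sketch with hypothetical customers and orders tables:

    -- Customers that have no matching row in orders (the 'NOT IN' intent).
    SELECT c.customer_id
    FROM   customers c
    LEFT OUTER JOIN orders o
           ON c.customer_id = o.customer_id
    WHERE  o.customer_id IS NULL;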

In addition, Hive is not designed for low-latency, real-time, or near-real-time queries. SQL queries are converted into MapReduce jobs, which means that for some queries performance is likely to be lower than in a traditional RDBMS.

Another limitation is that the metadata store is a Derby database by default, which is not intended for enterprise or production use. Some Hadoop users use an external database as the metadata store instead, but these external metadata stores bring their own challenges and configuration problems. It also means someone has to maintain and administer an RDBMS outside of Hadoop.
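
As a rough sketch, pointing the metastore at an external MySQL database typically comes down to a few properties in hive-site.xml (the host, database name, and credentials below are placeholders, not a recommended setup):

    <!-- Sketch of hive-site.xml entries for an external MySQL metastore. -->
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://metastore-host:3306/hive_metastore</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hive</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>change-me</value>
    </property>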
