Building a database using Hive

Last Update:2014-12-25 Source: Internet

Author: User

Storing data is a sensible choice when you have a lot of it to work with, but no remarkable discovery or prediction will ever come from data that sits unused. Big data is a complex beast. Writing complex MapReduce programs in the Java programming language takes time, good resources, and specialized knowledge that most enterprises do not have. This is why building a database with a tool such as Hive on Hadoop can be a powerful solution.

What if a company doesn't have the resources to build a complex big data analysis platform? What if its business intelligence (BI), data warehousing, and analysis tools cannot connect to an Apache Hadoop system, or are more complex than its requirements call for? Most businesses have employees with relational database management system (RDBMS) and Structured Query Language (SQL) experience. Apache Hive allows these database developers and data analysts to use Hadoop without having to understand the Java programming language or MapReduce. You can now design a star-schema data warehouse or a normalized database without wrestling with MapReduce code. Suddenly, BI and analytics tools such as IBM Cognos or SPSS Statistics can connect to the Hadoop system.

Database

Building a database and being able to use its data is not a problem specific to Hadoop or to databases. For years, people have been accustomed to organizing data into libraries. The questions are age-old: How do you categorize data? How do you connect all the data to an integrated platform, cabinet, or library? Over the years, various solutions have emerged.

People invented many methods, such as the Dewey Decimal system. They arranged the names of people or businesses in address books in alphabetical order. There are metal filing cabinets, warehouses with shelves, address card file systems, and so on. Employers try to track employees with time cards, clocks, and timesheets. People need to structure and organize data, and they need to reflect on and examine that data. If you can't access, structure, or understand the data, what is the practical point of storing so much of it?

RDBMSes use set theory and third normal form. Data warehousing has Kimball, Inmon, star schemas, the Corporate Information Factory, and private data marts. Organizations have master data management, enterprise resource planning, customer relationship management, electronic medical records, and many other systems that people use to organize transactions into some kind of structure and subject area. Now we have large amounts of unstructured or semi-structured data from many sources, such as social media, email, call logs, machine instructions, telematics, and so on. This new data needs to be integrated into very complex, very large systems that store both old and new data in a structure. How do you categorize data so that a sales manager can get better reports? How do you build libraries so that executives can access charts and graphs?

You need to find a way to structure data into a database. Otherwise, you have a large amount of data that only data scientists can access. Sometimes people just need simple reports. Sometimes they just want to drag and drop or write SQL queries.

Big data, Hadoop, and InfoSphere BigInsights

This section introduces InfoSphere BigInsights and how it relates to Hadoop, big data, Hive, databases, and so on. InfoSphere BigInsights is IBM's distribution of Hadoop. You may know more about the Apache and Cloudera distributions, but many in the industry have dabbled in Hadoop. A distribution starts with open source Hadoop MapReduce and the Hadoop Distributed File System (HDFS), and often includes other tools such as ZooKeeper, Oozie, Sqoop, Hive, Pig, and HBase. What sets these distributions apart from plain Hadoop is what they add on top of it. InfoSphere BigInsights belongs to this category.

You can use InfoSphere BigInsights on top of the Cloudera distribution of Hadoop. In addition, InfoSphere BigInsights provides a fast analytics engine for unstructured data that you can combine with InfoSphere Streams. InfoSphere Streams is a real-time analytics engine, which opens up the possibility of combining real-time and batch-oriented analysis.

InfoSphere BigInsights also has a built-in, browser-based spreadsheet called BigSheets. This spreadsheet lets analysts work with big data and Hadoop in a familiar spreadsheet style. Other features include role-based security and administration with LDAP integration; integration with InfoSphere DataStage for extract, transform, and load (ETL); accelerators for common use cases, such as log and machine data analysis; an application catalog containing common and reusable jobs; Eclipse plug-ins; and BigIndex, which is essentially a Lucene indexing tool built on Hadoop.

You can also improve performance through MapReduce reuse, compressed text files, and adaptive scheduling enhancements. In addition, you can integrate other applications, such as content analytics and Cognos Consumer Insight.

Hive

Hive is a powerful tool. It uses HDFS, a metadata store (an Apache Derby database by default), shell commands, a driver, a compiler, and an execution engine. It also supports Java Database Connectivity (JDBC) connections. Because of its SQL-like and database-like capabilities, Hive opens the big data Hadoop ecosystem to non-programmers. It also connects to external BI software, such as Cognos, through the JDBC driver and web clients.

You can rely on your existing database developers instead of spending time and effort finding Java MapReduce programmers. The benefit is that a database developer can write 10 to 15 lines of SQL, which Hive then optimizes and translates into MapReduce code, instead of forcing a programmer or non-programmer to write 200 or more lines of complex MapReduce code.
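To make the contrast concrete, here is a minimal Python sketch of the map, shuffle, and reduce steps that a one-line aggregate such as `SELECT team, COUNT(*) FROM batting GROUP BY team` conceptually breaks down into. The `batting` table, the `team` column, and the sample rows are hypothetical, and this is not the code Hive generates; it only illustrates the kind of plumbing a short query hides.

```python
from collections import defaultdict

# Hypothetical sample rows of (player, team); not real data.
ROWS = [
    ("player_a", "BOS"),
    ("player_b", "NYA"),
    ("player_c", "BOS"),
    ("player_d", "CHN"),
    ("player_e", "NYA"),
    ("player_f", "BOS"),
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for _player, team in rows:
        yield team, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, which corresponds to COUNT(*)."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_team(rows):
    """Run the three phases in order on an in-memory list of rows."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_team(ROWS))
```

A real Java MapReduce job would add job configuration, input and output formats, and serialization on top of this logic, which is where much of the extra code comes from.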

Hive is often described as a data warehouse infrastructure built on Hadoop. In fact, Hive has nothing to do with a data warehouse. If you want to build a real data warehouse, you can use a tool such as IBM Netezza. But if you want to build a database on Hadoop and don't know Java or MapReduce, Hive can be a great choice (provided you know SQL). Hive lets you write SQL-like queries in HiveQL against Hadoop and HBase, and lets you build star schemas on top of HDFS.

Hive and RDBMSes

Hive is a schema-on-read system, whereas RDBMSes are typically schema-on-write systems. A traditional RDBMS validates the schema when data is written. If the data does not match the structure, it is rejected. Hive doesn't care about the structure of the data, at least not at first: it doesn't validate the schema when you load the data. Rather, it cares about the schema only when you run a query.
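The difference can be sketched in a few lines of Python. This is a toy model and not how Hive or any RDBMS is implemented: the column names and sample lines are hypothetical, and the assumption that an unparseable field reads back as NULL (`None` here) reflects the usual behavior of Hive's text-file tables rather than a guarantee for every storage format.

```python
# Hypothetical column names and types for a small batting table.
SCHEMA = [("player", str), ("year", int), ("hits", int)]


class SchemaOnWriteTable:
    """Validates every row when it is written, like a traditional RDBMS."""

    def __init__(self, schema):
        self.schema = schema
        self.rows = []

    def insert(self, line):
        fields = line.split(",")
        if len(fields) != len(self.schema):
            raise ValueError(f"wrong column count: {line!r}")
        # int("abc") raises ValueError, so a bad row is rejected before storage.
        row = tuple(typ(raw) for raw, (_name, typ) in zip(fields, self.schema))
        self.rows.append(row)


class SchemaOnReadTable:
    """Stores raw lines untouched and applies the schema only when queried."""

    def __init__(self, schema):
        self.schema = schema
        self.lines = []

    def load(self, line):
        self.lines.append(line)  # no validation at load time

    def select(self):
        for line in self.lines:
            fields = line.split(",")
            row = []
            for index, (_name, typ) in enumerate(self.schema):
                try:
                    row.append(typ(fields[index]))
                except (IndexError, ValueError):
                    row.append(None)  # missing or unparseable field reads as NULL
            yield tuple(row)
```

Loading the malformed line `player_a,nineteen27,192` into `SchemaOnWriteTable` raises an error immediately, while `SchemaOnReadTable` accepts it and only reveals the problem as a `None` in the `year` column at query time.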

Limitations of Hive

You may run into some challenges when using Hive. First, it is not SQL-92 compliant. Some standard SQL constructs, such as NOT IN, NOT LIKE, and NOT EQUAL, do not exist or require some sort of workaround. Similarly, some mathematical functions are severely limited or do not exist. Timestamp and date are recently added types, and they are more compatible with Java dates than with SQL dates. Some simple features, such as date differences, do not work properly.
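One common workaround for a missing NOT IN subquery is an anti-join: a LEFT OUTER JOIN that keeps only the rows with no match. The sketch below uses Python's built-in `sqlite3` module, not Hive, purely to show that the two forms return the same rows. The table and column names are hypothetical, and whether a given Hive version needs the workaround is something to check against its documentation.

```python
import sqlite3

# In-memory SQLite database standing in for a SQL engine; this is not Hive.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE players (player_id TEXT);
    CREATE TABLE hall_of_fame (player_id TEXT);
    INSERT INTO players VALUES ('p1'), ('p2'), ('p3'), ('p4');
    INSERT INTO hall_of_fame VALUES ('p2'), ('p4');
    """
)

# The form that an engine without NOT IN subqueries cannot run.
NOT_IN_QUERY = """
    SELECT player_id FROM players
    WHERE player_id NOT IN (SELECT player_id FROM hall_of_fame)
    ORDER BY player_id
"""

# The anti-join rewrite: keep players that found no partner row.
ANTI_JOIN_QUERY = """
    SELECT p.player_id
    FROM players p
    LEFT OUTER JOIN hall_of_fame h ON p.player_id = h.player_id
    WHERE h.player_id IS NULL
    ORDER BY p.player_id
"""

not_in_rows = [row[0] for row in conn.execute(NOT_IN_QUERY)]
anti_join_rows = [row[0] for row in conn.execute(ANTI_JOIN_QUERY)]

if __name__ == "__main__":
    print(not_in_rows, anti_join_rows)
```

The two queries agree only when the subquery column contains no NULL values; with NULLs present, standard NOT IN returns no rows while the anti-join still does.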

In addition, Hive was not developed for low-latency, real-time, or near-real-time queries. SQL queries are converted to MapReduce jobs, which means that some queries may perform worse than they would on a traditional RDBMS.

Another limitation is that the metadata store is a Derby database by default, which is not intended for enterprise or production use. Some Hadoop users use an external database as the metadata store instead, but external metadata stores bring their own challenges and configuration problems. It also means that someone has to maintain and manage an RDBMS outside of Hadoop.

Installing InfoSphere BigInsights

This baseball data sample shows you how to build a general-purpose database from flat files in Hive. Although the example is relatively small, it shows how easy it is to build a database using Hive, and you can run statistics against it to make sure it meets expectations. You will not have the luxury of checking information that way later, when you try to organize unstructured data.

Once the database is built, you can build a web or GUI front end in any language, as long as it connects through Hive JDBC. (Configuring and setting up a Thrift server and Hive JDBC is a topic for another article.) I used VMware Fusion to create an InfoSphere BigInsights virtual machine (VM) on my Apple MacBook. This is a simple test, so my VM has 1 GB of RAM and GB of solid-state disk storage space. The operating system is the 64-bit CentOS 6.4 Linux distribution. You can also use other tools, such as Oracle VM VirtualBox, or, if you are a Windows user, VMware Player to create the InfoSphere BigInsights VM. (Setting up a VM on Fusion, VMware Player, or VirtualBox is not covered in this article.)

Start by downloading IBM InfoSphere BigInsights Basic Edition. You need an IBM ID, or you can register for one, before you can download InfoSphere BigInsights Basic Edition.

Importing and analyzing data

You can get data almost anywhere now. Many sites provide data in comma-separated value (CSV) format: weather, energy, sports, finance, and blog data. For this example, I use structured data from the Sean Lahman website. Using unstructured data would be more laborious.
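As a rough illustration of how a CSV file maps onto a Hive table, the Python sketch below reads a CSV header and generates a matching `CREATE TABLE` statement. The sample content, the `batting` table name, and the choice to declare every column as `STRING` are assumptions made for the example; real files need their own names and types, and a column name that collides with a reserved word would need quoting.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded CSV file; not real data.
SAMPLE_CSV = (
    "playerID,yearID,teamID,H\n"
    "player_a,2001,BOS,150\n"
    "player_b,2001,NYA,142\n"
)


def hive_ddl_from_csv(csv_text, table_name):
    """Build a CREATE TABLE statement with one STRING column per header field."""
    header = next(csv.reader(io.StringIO(csv_text)))
    # Hive column names are case-insensitive, so lowercase them for consistency.
    columns = ",\n".join(f"  {name.strip().lower()} STRING" for name in header)
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} (\n{columns}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE"
    )


if __name__ == "__main__":
    print(hive_ddl_from_csv(SAMPLE_CSV, "batting"))
```

Because Hive does not validate data at load time, the header line itself would be read back as a data row unless it is stripped from the file or skipped before loading.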

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
