A Free Trial That Lets You Build Big!
Start building with 50+ products and up to 12 months usage for Elastic Compute Service
Storing them is a good choice when you need to work with a lot of data. An incredible discovery or future prediction will not come from unused data. Big data is a complex monster. Writing complex MapReduce programs in the Java programming language takes a lot of time, good resources and expertise, which is what most businesses don't have. This is why building a database with tools such as Hive on Hadoop can be a powerful solution.
Peter J Jamack is a large data analysis consultant with over 13 years of experience in business intelligence, data warehousing, analytics, large data and information management. He integrates structured and unstructured data into innovative integrated analysis solutions, use a wide variety of large data and MPP platforms to provide customers from all walks of life with large-scale, integrated analysis platforms, including insurance, government, media, finance, retail, social media, marketing and software.
What if a company doesn't have the resources to build a complex, large data analysis platform? When business Intelligence (BI), data warehousing, and analysis tools cannot connect to the Apache Hadoop system, or if they are more complex than requirements, what should they do? Most enterprises have a relational database management system ( rdbmses) and Structured Query Language (SQL) experienced employees. Apache Hive allows these database developers or data analysts to use Hadoop without having to understand the Java programming language or MapReduce. Now you can design a data warehouse for a star model, or a normalized database without challenging MapReduce code. Suddenly, BI and analytics tools, such as IBM Cognos or SPSS Statistics, can connect to the Hadoop system.
Building a database and being able to use that data is not a Hadoop or database problem. For years, people have been accustomed to organizing data into libraries. There are many age-old questions: How do you categorize data? How do you connect all of your data to an integrated platform, chassis, or library? Over the years, various programmes have emerged.
Many methods have been invented, such as the Dewey Decimal system. They arrange the names of people or businesses in the Address Book in alphabetical order. There are metal filing cabinets, warehouse with shelves, Address Card file system, and so on. Employers try to track employees with time cards, clocks, and timetables. People need to structure and organize data, and they need to reflect and examine that data. If you can't access, structure, or understand the data, what is the practical significance of storing so much data?
RDBMSes uses the set theory and the third normal form. The Data Warehouse has Kimball, Inmon, star model, Corporate information Factory, and private data marts. They have master data management, enterprise resource planning, customer relationship management, electronic medical records, and many other systems that people use to organize transactions into some kind of structure and topic. Now, we have a large number of unstructured or semi-structured data from various industries, such as social media, mail, call histories, mechanical instructions, remote information, and so on. These new data need to be integrated into a very complex, very large system that stores the old and new data in the structure. How can I classify the sales manager to improve the report? How do I build a library so that executives can access charts and graphs?
You need to find a way to structure data to a database. Otherwise, only a large number of data scientists can access the data. Sometimes people just need simple reports. Sometimes they just want to drag or write SQL queries.
Large data, Hadoop and Infosphere biginsights
This section will introduce you to Infosphere biginsights and how it relates to Hadoop, large data, Hive, databases, and so on. Infosphere Biginsights is an IBM partition for Hadoop. You may know more about Apache and Cloudera, but many people in the industry have dabbled in Hadoop. It starts with open source use of MapReduce Hadoop and Hadoop Distributed File Systems (HDFS), and often includes other tools such as Zookeeper, Oozie, Sqoop, Hive, Pig, and HBase. The difference between these distributions and ordinary Hadoop is that they are added to the top of the Hadoop. Infosphere biginsights belong to this type of version.
You can use Infosphere biginsights on top of the Cloudera version of Hadoop. In addition, Infosphere Biginsights provides a fast, unstructured analysis engine that you can combine with Infosphere Streams. Infosphere Streams is a real-time analysis engine that initiates the possibility of joint real-time analysis and batch-oriented analysis.
Infosphere Biginsights also has a built-in, browser-based spreadsheet bigsheets. This spreadsheet allows analysts to use large data and Hadoop in a spreadsheet style every day. Other features include role-based security and managed LDAP integration, integration with Infosphere DataStage for extraction, transformation, loading (ETL), commonly used accelerators for use cases, such as log and machine data analysis, and application directories containing common directories and reusable work; Eclipse plug-in; and Bigindex, which is actually a Lucene indexing tool built on Hadoop.
You can also improve performance with re-use MapReduce, compressed text files, and adaptive scheduling enhancements. In addition, you can integrate other applications, such as content analysis and Cognos Consumer Insights.
Hive is a powerful tool. It uses HDFS, the metadata store (an Apache Derby database by default), shell commands, drives, compilers, and execution engines. It also supports Java database connectivity (JDBC) connections. Because of its SQL-like capabilities and database-like capabilities, Hive can open large data Hadoop ecosystems for non-programmers. It also provides external BI software, for example, through JDBC drives and WEB clients and Cognos connections.
You can rely on existing database developers to find Java MapReduce programmers without time-consuming effort. The benefit of this is that you can have a database developer write 10-15 lines of SQL code and then optimize and translate it into MapReduce code instead of forcing a non programmer or programmer to write 200 lines of code, or even more complex MapReduce code.
Hive is often described as a data warehouse infrastructure built on Hadoop. The fact is, Hive has nothing to do with the Data warehouse. If you want to build a real data warehouse, you can use some tools, such as IBM Netezza. But if you want to build a database with Hadoop but don't have the knowledge of Java or MapReduce, then Hive would be a great choice (if you know SQL). Hive allows you to write SQL-like queries using Hadoop and HBase's HIVEQL, and allows you to build star models on top of HDFS.
Hive and RDBMSes
Hive is a read mode system, while rdbmses is a typical write mode system. Traditional rdmbses validates the model when writing data. If the data does not match the structure, it is rejected. Hive doesn't care about the structure of the data, at least not at the first time, it doesn't validate the model when you load the data. Rather, it cares about the model only after you run the query.
There may be some challenges when using Hive. First, it is incompatible with SQL-92. Some standard SQL functions, such as not in, don't like, and not EQUAL, do not exist, or require some sort of workspace. Similarly, some mathematical functions are strictly restricted or do not exist. Timestamp or date is the most recently added value, and is more Java date compatible than SQL date compatibility. Some simple features, such as data differences, do not work properly.
In addition, Hive is not developed for low latency, real-time, or near-real-time queries. SQL queries are converted to MapReduce, which means that for some queries, performance may be lower compared to traditional RDBMS.
Another limitation is that the metadata store is a Derby database by default and is not intended for enterprise or production. Some Hadoop users instead use external databases as metadata stores, but these external metadata stores also have their own challenges and configuration problems. This also means needing someone to maintain and manage RDBMS systems outside of Hadoop.
Install Infosphere biginsights
This baseball data sample shows you how to build a common database from a flat file in Hive. Although this example is relatively small, it shows how easy it is to build a database using Hive, which you can use to run statistics to make sure it meets expectations. You don't need to check that information in the future when you try to organize unstructured data.
Once you have finished building your database, you can build the Web or GUI front-end using any language, as long as you connect to Hive JDBC. (Configuring and setting up a thrift server, Hive JDBC is another topic). I used VMware Fusion to create a infosphere biginsights virtual machine (VM) on my Apple Macbook. This is a simple test so that my VM has 1 GB of RAM and GB of solid disk storage space. The operating system is the linux® of CentOS 6.4 64-bit distro. You can also use certain tools, such as Oracle VM VirtualBox, if you are a windows® Users, you can also use VMware Player to create Infosphere biginsights VMs. (Setting up VMS, VMware Player, or VirtualBox on Fusion is not covered in this article.) )
Start by downloading the IBM infosphere biginsights Base Edition. You need to have an IBM ID, or you can register an ID and then download the Infosphere biginsights base version.
Entering and analyzing data
Now you can get the data anywhere. Most sites provide data in comma-separated value (CSV) format: Weather, energy, sports, finance, and blog data. For example, I use structured data from the Sean Lahman Web site. The use of unstructured data can be laborious.
If you prefer to be in a more manual environment, then you can get from the linux® To complete it, you need to create a directory and then run wget:
$ Sudo Mkdir/user/baseball.
sudo wget http://seanlahman.com/files/database/lahman2012-csv.zip
This data uses the Creative Commons attribution-sharealike 3.0 Unported License.
The compressed file contains statistics for baseball and baseball players in a CSV file. The example contains four primary tables, each with only one column (player_id):
Master table.csv--athlete name, date of birth and biographical information
Hall of fame.csv--Hall of Fame poll data
teams.csv--Annual statistics and rankings
battingpost.csv--season's batting stats
pitchingpost.csv--season's pitching stats
fieldingof.csv--field Position data
fieldingpost.csv--field data after season
Quarterly Data on managershalf.csv--brokers
Quarterly data for the teamshalf.csv--team
salaries.csv--Player Payroll data
seriespost.csv--season Post series information
awardssharemanagers.csv--Broker Award Vote
awardsshareplayers.csv--Player Award vote
Most of the design database has been completed. PLAYER_ID is the primary key for four primary tables (master, batting, pitching, and Fielding). (For a better understanding of table structure and dependencies, read Readme2012.txt.) )
The design is very simple: the primary table is connected by player_id. Hive does not really use a primary key or the notion of referential integrity. Schema on Read means that Hive will discard everything you enter into the table. If the files are chaotic, you may need to find the best way to connect them. In addition, some transformations are required before loading data into HDFS or Hive. According to the Schema on Rea principle, bad data will completely become bad data in Hive. This is why data analysis, whether source-level or HDFS-level, is an important step. Without data analysis, the original data obtained was not available to anyone. Luckily, the baseball example contains some data that was cleaned and organized before you entered Hadoop.
Loading data into HDFS or Hive
Loading data into Hadoop uses many different theories and practices. Sometimes you can enter the original file directly into the HDFS. You might create a directory and subdirectories to organize your files, but copying or moving files from one place to another is a simple process.
For this example, simply issue the put command and create a directory named Baseball:
Building a database using Hive
As data analysis and design are completed, the next step is to build the database.
Although I didn't introduce all of the examples, if you followed me to build the first example, you would be able to understand how to do the rest of the steps. I usually build some SQL text scripts and then type them into or paste them into Hive. Other people can use Hue or other tools to build databases and tables.
For simplicity, we used the Hive Shell. The advanced steps are:
Create a baseball database
Create a table
Verify that the form is correct
You'll see some options, such as creating external or internal databases and tables, but in this case, you need to follow the internal default settings. In fact, internal means that the Hive processes the internal stored database. Listing 1 illustrates the process of the Hive shell.
Listing 1. Creating a Database
Create Database Baseball;
Create table Baseball. Master
(Lahmanid int, playerID int, ManagerID int, Hofid int, birthyear int,
birthmonth int, BirthDay int, birthcountry string, birthstate string,
Birthcity STRING, deathyear int, deathmonth int, Deathday int,
Deathcountry string, deathstate string, deathcity string,
Namefirst string, namelast string, namenote string, namegive string,
Namenick string, weight decimal, height decimal, bats string,
Throws STRING, debut int, Finalgame int,
Eton STRING, Lahman40id int, lahman45id int, retroid int,
Holtzid int, Hbrefid int)
ROW FORMAT delimited FIELDS terminated by ', ';
All other tables are also subject to this procedure. To load the data into the Hive table, the Hive shell will open again, and then run the following code:
LOAD DATA local Inpath master.csv OVERWRITE into TABLE baseball. Master;
Using Hive to build a standardized database
The baseball database is more or less standardized: There are four main tables and several auxiliary tables. Again, Hive is a Schema on Read, so you have to complete most of the data analysis and ETL phases because there is no index or referential integrity in the traditional rdbmses. If you want to use the indexing feature, you should use a tool similar to HBase next. Check the code in Listing 2.
Listing 2. Run a query
Select * from Master;
Select playerID from Master;
Select A.playerid, B.teamid, B.ab, B.R, B.H, b.2b, b.3b, b.hr, B.rbi
From Master A JOIN batting B on a.playerid = B.playerid;
This is the advantage of Hive and the benefits of building a database: it creates a structure for a chaotic world. As with unstructured or semi-structured data that we like to discuss, it is ultimately about who can analyze data, who can run reports based on it, and how you can get it to work quickly. Most users see Hive as some kind of black box: They don't care where the data comes from or what needs to be done to get the data in the right format. Nor does it matter how difficult it is to integrate or validate the data, as long as the data is accurate. This usually means that you have to be organized and structured. Otherwise, your database becomes a dead zone where unrestricted data is permanently stored, and no one can or will want to use that data.
The complex data warehouse is no longer the scene. Although the situation has improved in recent years, the concept remains the same: this is a business where business users want results rather than programming logic. This is why building a database in Hive is the right start.
Start building with 50+ products and up to 12 months usage for Elastic Compute Service