I use the online movie Big data, a total of 3 files, Movies.dat, User.dat, Ratings.dat. There were 3000/6000 and 1 million data respectively, just to do the experiment.
The following first describes the data structure:
Ratings FILE DESCRIPTION
================================================================================
All ratings is contained in the file "Ratings.dat" and is in the
Following format:
Userid::movieid::rating::timestamp
-UserIDs range between 1 and 6040
-Movieids range between 1 and 3952
-Ratings is made on a 5-star scale (Whole-star ratings only)
-Timestamp is represented in seconds since the epoch as returned by Time (2)
-Each user have at least ratings
USERS FILE DESCRIPTION
================================================================================
User information is in the file "Users.dat" and are in the following
Format
Userid::gender::age::occupation::zip-code
All demographic information are provided voluntarily by the users and are
Not checked for accuracy. Only the users who has provided some demographic
Information is included in the this data set.
-Gender is denoted by a ' M ' for male and ' F ' for female
-Age was chosen from the following ranges:
* 1: "Under 18"
* 18: "18-24"
* 25: "25-34"
* 35: "35-44"
* 45: "45-49"
* 50: "50-55"
* 56: "56+"
-Occupation is chosen from the following choices:
* 0: ' Other ' or not specified
* 1: "Academic/educator"
* 2: "Artist"
* 3: "Clerical/admin"
* 4: "College/grad student"
* 5: "Customer Service"
* 6: "Doctor/health Care"
* 7: "Executive/managerial"
* 8: "Farmer"
* 9: "Homemaker"
* Ten: "K-12 student"
* One: "lawyer"
*: "Programmer"
*: "Retired"
*: "Sales/marketing"
*: "Scientist"
*: "Self-employed"
*: "Technician/engineer"
*: "Tradesman/craftsman"
*: "Unemployed"
*: "Writer"
MOVIES FILE DESCRIPTION
================================================================================
Movie information is in the file "Movies.dat" and are in the following
Format
Movieid::title::genres
-titles is identical to titles provided by the IMDB (including
Year of release)
-Genres is pipe-separated and is selected from the following genres:
* Action
* Adventure
* Animation
* children ' s
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
****************************************************************************************************
Ii. entry into the focus
Start building a library, building a table:
Create DATABASE movies;
Use movies;
Try to build a table
CREATE TABLE users (Userid:long);
CREATE table users (Userid:bigint);
CREATE TABLE Ratings (userid Int,movieid int,rating int,timestamp timestamp) partitioned by (dt String) ROW FORMAT Delimited Fields TERMINATED by ':: ';
Error: Failed:parseexception line 1:55 FAILED to recognize predicate ' timestamp '. Failed rule: ' identifier ' in column specification
Timestamp does not support the string in the data structure, change it.
CREATE TABLE Ratings (userid Int,movieid int,rating int,timestamped Timestamp) partitioned by (dt String) ROW FORMAT Delimit ED fields TERMINATED by ', ';
LOAD DATA LOCAL inpath '/home/dyq/documents/movies/ratings-douhao.dat ' into table ratings PARTITION (dt= "20161201");
Hive> SELECT * from ratings limit 10;
Ok
111935null20161201
16613null20161201
19143null20161201
134084null20161201
123555null20161201
111973null20161201
112875null20161201
128045null20161201
15944null20161201
19194null20161201
Seems to use "::" To do the delimiter with trouble, replace it with my favorite ","
drop table ratings;
CREATE TABLE Ratings (userid Int,movieid int,rating int,timestamped string) partitioned by (dt String) ROW FORMAT Delimited F Ields TERMINATED by ', ';
Hive> SELECT * from ratings limit 10;
Ok
11193597830076020161201
1661397830210920161201
1914397830196820161201
13408497830027520161201
12355597882429120161201
11197397830226820161201
11287597830203920161201
12804597830071920161201
1594497830226820161201
1919497830136820161201
Time taken:0.122 seconds, Fetched:10 row (s)
All ok! The semantics of hive are really not strong enough to say.
The movies and users tables are established below.
CREATE TABLE Movies (MovieID int,title string,genres string) partitioned by (dt String) ROW FORMAT delimited fields TERMINATE D by ', ';
LOAD DATA LOCAL inpath '/home/dyq/documents/movies/movies-douhao.dat ' into table movies PARTITION (dt= "20161201");
CREATE TABLE Users (userid int,gender string,age int,occupation string,zip-code string) partitioned by (dt String) ROW FORMA T delimited fields TERMINATED by ', ';
Failed:parseexception line 1:73 cannot recognize input near '-' code ' "String ' in column type
CREATE TABLE Users (userid int,gender string,age int,occupation string,zipcode string) partitioned by (dt String) ROW FORMAT Delimited TERMINATED by ', ';
LOAD DATA LOCAL inpath '/home/dyq/documents/movies/users-douhao.dat ' into table users PARTITION (dt= "20161201");
Hive> SELECT * from users limit 10;
Ok
1f1104806720161201
2m56167007220161201
3m25155511720161201
4m4570246020161201
5m25205545520161201
6f5095511720161201
7m3510681020161201
8m25121141320161201
9m25176161420161201
10f3519537020161201
Time taken:0.168 seconds, Fetched:10 row (s)
*****************************************************************
To create an index:
CREATE index ratings_userid_index on table ratings (userid) as ' COMPACT ' with deferred rebuild;
Show index on ratings;
Drop index ratings_userid_index on ratings;
CREATE index ratings_movieid_index on table ratings (MovieID) as ' COMPACT ' with deferred rebuild;
Show index on ratings;
Drop index ratings_movieid_index on ratings;
Join before index:
Select movies.movieid,movies.title,ratings.rating from movies join ratings on (Movies.movieid=ratings.movieid);
Time taken:40.721 seconds, fetched:1000209 row (s)
Join after index:
Time taken:40.816 seconds, fetched:1000209 row (s)
To query a value:
Select movies.movieid,movies.title,ratings.rating from movies join ratings on (Movies.movieid=ratings.movieid) where movies.movieid=2716;
Time taken:33.834 seconds, fetched:2181 row (s)
After indexing:
Drop index ratings_movieid_index on ratings;
Drop index ratings_userid_index on ratings;
Select movies.movieid,movies.title,ratings.rating from movies join ratings on (Movies.movieid=ratings.movieid) where movies.movieid=2716;
Time taken:29.428 seconds, fetched:2181 row (s)
hive1.2.1 actual operation movie Big Data!