hive1.2.1實戰操作電影大資料！

最後更新：2016-12-02 來源：互聯網

上載者：User

創建阿里雲帳戶，並獲得超過 40 款產品的免費試用版；而企業帳戶則可以享有總值 $1200 的免費試用版。立即註冊！

標籤：market des 開始 release tab lin ken following cal

我採用的是網上的電影大資料，共有3個檔案，movies.dat、user.dat、ratings.dat。分別有3000/6000和1百萬資料，正好做實驗。

下面先介紹資料結構：

RATINGS FILE DESCRIPTION
================================================================================
All ratings are contained in the file "ratings.dat" and are in the
following format:

UserID::MovieID::Rating::Timestamp

- UserIDs range between 1 and 6040
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings
USERS FILE DESCRIPTION

================================================================================
User information is in the file "users.dat" and is in the following
format:

UserID::Gender::Age::Occupation::Zip-code

All demographic information is provided voluntarily by the users and is
not checked for accuracy. Only users who have provided some demographic
information are included in this data set.

- Gender is denoted by a "M" for male and "F" for female
- Age is chosen from the following ranges:

* 1: "Under 18"
* 18: "18-24"
* 25: "25-34"
* 35: "35-44"
* 45: "45-49"
* 50: "50-55"
* 56: "56+"

- Occupation is chosen from the following choices:

* 0: "other" or not specified
* 1: "academic/educator"
* 2: "artist"
* 3: "clerical/admin"
* 4: "college/grad student"
* 5: "customer service"
* 6: "doctor/health care"
* 7: "executive/managerial"
* 8: "farmer"
* 9: "homemaker"
* 10: "K-12 student"
* 11: "lawyer"
* 12: "programmer"
* 13: "retired"
* 14: "sales/marketing"
* 15: "scientist"
* 16: "self-employed"
* 17: "technician/engineer"
* 18: "tradesman/craftsman"
* 19: "unemployed"
* 20: "writer"

MOVIES FILE DESCRIPTION
================================================================================

Movie information is in the file "movies.dat" and is in the following
format:

MovieID::Title::Genres

- Titles are identical to titles provided by the IMDB (including
year of release)
- Genres are pipe-separated and are selected from the following genres:

* Action
* Adventure
* Animation
* Children‘s
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western

****************************************************************************************************

二、進入重點

開始建庫、建表：

create database movies;
use movies;
//試試建表
CREATE TABLE users(userid:Long);
create table users(userid:Bigint);
CREATE TABLE ratings(userid Int,movieid Int,rating Int,timestamp Timestamp)PARTITIONED BY(dt String) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘::‘;
出錯：FAILED: ParseException line 1:55 Failed to recognize predicate ‘timestamp‘. Failed rule: ‘identifier‘ in column specification

timestamp不支援資料結構裡的字串，改之。

CREATE TABLE ratings(userid Int,movieid Int,rating Int,timestamped Timestamp)PARTITIONED BY(dt String) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,‘;

LOAD DATA LOCAL INPATH ‘/home/dyq/Documents/movies/ratings-douhao.dat‘ into table ratings PARTITION(dt="20161201");
hive> select * from ratings limit 10;
OK
111935NULL20161201
16613NULL20161201
19143NULL20161201
134084NULL20161201
123555NULL20161201
111973NULL20161201
112875NULL20161201
128045NULL20161201
15944NULL20161201
19194NULL20161201

看來用"::"做分隔字元有了麻煩，替換成我喜歡的","

drop table ratings;
CREATE TABLE ratings(userid Int,movieid Int,rating Int,timestamped String)PARTITIONED BY(dt String) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,‘;

hive> select * from ratings limit 10;
OK
11193597830076020161201
1661397830210920161201
1914397830196820161201
13408497830027520161201
12355597882429120161201
11197397830226820161201
11287597830203920161201
12804597830071920161201
1594497830226820161201
1919497830136820161201
Time taken: 0.122 seconds, Fetched: 10 row(s)

一切OK！hive的語義真是不夠強大的說。

下面建立Movies和users表。

CREATE TABLE movies(movieid Int,title String,genres String)PARTITIONED BY(dt String) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,‘;

LOAD DATA LOCAL INPATH ‘/home/dyq/Documents/movies/movies-douhao.dat‘ into table movies PARTITION(dt="20161201");

CREATE TABLE users(userid Int,gender String,age Int,occupation String,zip-code String)PARTITIONED BY(dt String) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,‘;

FAILED: ParseException line 1:73 cannot recognize input near ‘-‘ ‘code‘ ‘String‘ in column type

CREATE TABLE users(userid Int,gender String,age Int,occupation String,zipcode String)PARTITIONED BY(dt String) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,‘;

LOAD DATA LOCAL INPATH ‘/home/dyq/Documents/movies/users-douhao.dat‘ into table users PARTITION(dt="20161201");

hive> select * from users limit 10;
OK
1F1104806720161201
2M56167007220161201
3M25155511720161201
4M4570246020161201
5M25205545520161201
6F5095511720161201
7M3510681020161201
8M25121141320161201
9M25176161420161201
10F3519537020161201
Time taken: 0.168 seconds, Fetched: 10 row(s)

*****************************************************************
建立索引：

create index ratings_userid_index on table ratings(userid) as ‘COMPACT‘ with deferred rebuild;
show index on ratings;
drop index ratings_userid_index on ratings;

create index ratings_movieid_index on table ratings(movieid) as ‘COMPACT‘ with deferred rebuild;
show index on ratings;
drop index ratings_movieid_index on ratings;

加索引前的join：
select movies.movieid,movies.title,ratings.rating from movies join ratings on(movies.movieid=ratings.movieid);
Time taken: 40.721 seconds, Fetched: 1000209 row(s)

加索引後的join：
Time taken: 40.816 seconds, Fetched: 1000209 row(s)

查詢某一個值：
select movies.movieid,movies.title,ratings.rating from movies join ratings on(movies.movieid=ratings.movieid) where movies.movieid=2716;
Time taken: 33.834 seconds, Fetched: 2181 row(s)

索引後：
drop index ratings_movieid_index on ratings;
drop index ratings_userid_index on ratings;
select movies.movieid,movies.title,ratings.rating from movies join ratings on(movies.movieid=ratings.movieid) where movies.movieid=2716;

Time taken: 29.428 seconds, Fetched: 2181 row(s)

hive1.2.1實戰操作電影大資料！

本文章原先以中文撰寫並發佈於 aliyun.com，亦設英文版本，僅作資訊用途。本網站不對文章的準確性，完整性或可靠性或其任何翻譯作出任何明示或暗示的陳述或保證。如對該文章有任何疑慮或投訴，請傳送電郵至 info-contact@alibabacloud.com 並提供相關疑慮或投訴的詳細說明。職員會於 5 個工作天內與您聯絡，一經驗證之後，即會刪除該侵權內容。

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

hive1.2.1實戰操作電影大資料！

聯繫我們

熱門內容

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

hive1.2.1實戰操作電影大資料！

聯繫我們

熱門內容

熱門主題

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support