Hive in Hadoop: querying the top 10 URLs in CDN access logs for a specified time period (with Python)

Description of the Hadoop environment:
Master node: node1
Slave nodes: node2, node3, node4
Remote server (Python client connecting to Hive): node29
Requirement: use Hive to find, for a specified time period, the top 10 most-accessed URLs in the CDN log.
PS: a Pig-based version of this query is covered in an earlier article:
http://shineforever.blog.51cto.com/1429204/1571124
Note: operating Hive remotely from Python requires the Thrift interface.
The Hive source package ships with the Thrift Python bindings:
[root@node1 shell]# ls -l /usr/local/hive-0.8.1/lib/py
total 28
drwxr-xr-x 2 hadoop hadoop 4096 Nov  5 15:29 fb303
drwxr-xr-x 2 hadoop hadoop 4096 Oct fb303_scripts
drwxr-xr-x 2 hadoop hadoop 4096 Nov  5 15:29 hive_metastore
drwxr-xr-x 2 hadoop hadoop 4096 Oct hive_serde
drwxr-xr-x 2 hadoop hadoop 4096 Nov  5 15:29 hive_service
drwxr-xr-x 2 hadoop hadoop 4096 Nov  5 15:20 queryplan
drwxr-xr-x 6 hadoop hadoop 4096 Nov  5 15:20 thrift
1) scp the relevant files to the corresponding directory on the remote node29:
scp -r /usr/local/hive-0.8.1/lib/py/* 172.16.41.29:/usr/local/hive_py/.
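After the copy, a quick sanity check on node29 confirms the bindings are importable. This check is my addition, not part of the original post; it assumes the /usr/local/hive_py destination used above:

#!/usr/bin/env python
import sys

# make the copied Hive Thrift bindings visible to Python
sys.path.append('/usr/local/hive_py')

# these imports fail with ImportError if the scp above was incomplete
from hive_service import ThriftHive
from thrift.transport import TSocket
print 'Hive Thrift bindings OK'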
2) Start the Hive Thrift server on node1:
[hadoop@node1 py]$ hive --service hiveserver
Starting Hive Thrift Server
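Before wiring up the full script, it can help to confirm from node29 that the Thrift port is reachable. A minimal sketch, assuming hiveserver listens on the default port 10000 and that node1 answers at 172.16.41.151 (the address used in the query script below):

#!/usr/bin/env python
import socket

# probe node1's Hive Thrift port (10000 is the hiveserver default)
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(5)
s.connect(('172.16.41.151', 10000))
print 'hiveserver is reachable'
s.close()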
3) Write the query script on node29:
#!/usr/bin/env python
#coding:utf-8
# Find the top 10 most-visited URLs in the CDN log for a specified time period.
import sys

# Load the Hive Thrift Python bindings copied over from node1.
sys.path.append('/usr/local/hive_py')

from hive_service import ThriftHive
from hive_service.ttypes import HiveServerException
from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

dbname = "default"
hsql = ("select request, count(request) as counts from cdnlog "
        "where time >= '[27/Oct/2014:10:40:00 +0800]' "
        "and time <= '[27/Oct/2014:10:49:59 +0800]' "
        "group by request order by counts desc limit 10")

def hiveExe(hsql, dbname):
    try:
        transport = TSocket.TSocket('172.16.41.151', 10000)
        transport = TTransport.TBufferedTransport(transport)
        protocol = TBinaryProtocol.TBinaryProtocol(transport)
        client = ThriftHive.Client(protocol)
        transport.open()
        # Load extra expression support (required); note that the jar path
        # below is on the remote Hive server, not local to this script!
        client.execute('add jar /usr/local/hive-0.8.1/lib/hive_contrib.jar')
        # client.execute("use " + dbname)
        # row = client.fetchOne()
        client.execute(hsql)
        results = client.fetchAll()  # fetch all result rows
        transport.close()
        return results
    except Thrift.TException, tx:
        print '%s' % (tx.message)

if __name__ == '__main__':
    results = hiveExe(hsql, dbname)
    num = len(results)
    for i in range(num):
        print results[i]
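Each element of the list returned by fetchAll() is one result row serialized as a single string, with columns separated by tabs. If the two columns need to be handled separately, a small sketch (my addition, assuming the two-column request/counts result of the query above):

# split each tab-delimited row returned by fetchAll()
def print_rows(results):
    for row in results:
        request, counts = row.split('\t')  # two columns from the query above
        print counts, request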
Execute the script on node29; the output is:
[Image: hive result.jpg (http://s3.51cto.com/wyfs02/M02/4D/DB/wKioL1RbHoPT0rHZAAHHICCIv7U438.jpg)]
The Hive execution progress on the node1 server looks like this:
[Image: hive process.jpg (http://s3.51cto.com/wyfs02/M02/4D/DC/wKiom1RbHjnhxxvdAAcmfK-RneY166.jpg)]
This article is from the "shine_forever" blog; please keep this source when reproducing it: http://shineforever.blog.51cto.com/1429204/1573439