Accessing Hadoop Hive from PHP through Thrift

Source: Internet
Author: User
Tags how to use sql

This article describes how to query Hadoop data with SQL. The approach is as follows: PHP submits an SQL query to Hive through Thrift, Hive converts the query into Hadoop jobs, and once Hadoop finishes executing them a result URI is returned. We then only need to read the content at that URI.

Thrift was created to solve the problem of calling services across programming languages. It supports many languages, such as C++, PHP, Java, and Python. Thrift is a lightweight cross-language service framework developed by Facebook and later handed over to the Apache Software Foundation. It is similar to Google's Protocol Buffers and to ICE. One of Thrift's major advantages is the wide range of languages it supports. It uses its own IDL (interface definition language) to describe service interfaces and data exchange formats.

Hive can access HBase data through a language similar to SQL. HBase is an open-source NoSQL database, an open-source implementation of Google's Bigtable paper. Together with Hadoop and HDFS, HBase can be used to store and process massive amounts of column-family data.

Thrift Official Website: http://thrift.apache.org/

1. Download and install Thrift on the server

Install Thrift on a cluster where both Hadoop and HBase are already installed.
(1) Download: fetch the Thrift source package with wget.

(2) Unpack: tar -xzf thrift-0.8.0.tar.gz
(3) Compile and install: if you are building from source, first run ./bootstrap.sh to create the ./configure file;
Next, run ./configure;
Then run make && make install
(4) Start: ./bin/hbase-daemon.sh start thrift
The default listening port of Thrift is 9090.
Reference: http://blog.csdn.net/hguisu/article/details/7298456
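After starting the service, it can be useful to confirm that something is actually listening on port 9090. The following is a minimal sketch using only the Python standard library; the function name is_port_open and the host value used when calling it are assumptions for illustration, not part of Thrift or HBase.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, is_port_open("localhost", 9090) should return True on the machine where the Thrift server was started. A successful connection only shows that the port is open, not that the service behind it speaks Thrift.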

2. Create a .thrift file

After the Thrift compiler is installed, you need to create a .thrift file. This file is an interface definition file in which you define the Thrift types and services. The services defined in this file are implemented by the server and called by the client. The Thrift compiler turns the .thrift file you wrote into interface source code for a target language, and the client and server you write are built on top of that generated code. To generate the interface source code for a given language from the .thrift file, run:

thrift --gen <language> <Thrift filename>
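When code is needed for several languages, the same command is simply repeated once per language. The sketch below builds those command lines in Python without running them; the helper name thrift_gen_commands and its parameters are hypothetical, and it assumes the thrift binary is on the PATH unless another location is passed in.

```python
def thrift_gen_commands(idl_file, languages, thrift_bin="thrift", recursive=False):
    """Build one `thrift --gen <language> <file>` argument list per target language."""
    commands = []
    for lang in languages:
        argv = [thrift_bin]
        if recursive:
            argv.append("-r")  # also generate code for included .thrift files
        argv += ["--gen", lang, idl_file]
        commands.append(argv)
    return commands


for argv in thrift_gen_commands("helloworld.thrift", ["py", "php"]):
    print(" ".join(argv))
```

Each returned list can be passed to subprocess.run(argv, check=True) to actually invoke the compiler.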

3. Thrift file description

Supported variable types

Type                  Description
bool                  true or false
byte                  8-bit signed integer
i16                   16-bit signed integer
i32                   32-bit signed integer
i64                   64-bit signed integer
double                64-bit floating point number
string                UTF-8 encoded string
binary                byte array
struct                struct
list<type>            ordered list of elements, similar to the STL vector
set<type>             unordered set of unique elements, similar to the STL set
map<type1,type2>      key-value mapping, similar to the STL map
exception             exception type that inherits from the exception base class of the target language
service               service containing multiple function interfaces (pure virtual functions)
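The integer types in the table differ only in bit width, which fixes the range of values each one can carry. As a small illustration (the names INT_BITS, int_range, and fits are made up for this sketch), the ranges can be derived in Python, whose own integers are unbounded:

```python
# Bit widths of the signed Thrift integer types listed in the table.
INT_BITS = {"byte": 8, "i16": 16, "i32": 32, "i64": 64}


def int_range(thrift_type):
    """Return (lowest, highest) value of a signed integer with that bit width."""
    bits = INT_BITS[thrift_type]
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def fits(thrift_type, value):
    """Return True if value can be represented in the given Thrift integer type."""
    low, high = int_range(thrift_type)
    return low <= value <= high


for name in INT_BITS:
    print(name, int_range(name))
```

A check like fits("i32", uid) before sending a value is one way to catch out-of-range numbers early on the client side.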

4. Developing a simple HelloWorld program, from server to client

Here Python is used for the server and PHP for the client; C++ could also be used for the server. The development process is described below.

(1) First, write a helloworld.thrift file on the server, as shown below:

service HelloWorld {
    string ping(1: string name),
    string getpng(),
}

(2) Compile helloworld.thrift on the server

Compile the helloworld.thrift file to generate the interface source code in the server and client languages.
/usr/local/thrift/bin/thrift -r --gen py helloworld.thrift
/usr/local/thrift/bin/thrift -r --gen php helloworld.thrift
# The gen-* directories are generated under the current directory.
The generated gen-py directory stays on the server, and the generated gen-php directory goes to the client.
(3) Write server code

    import sys
    sys.path.append('./gen-py')

    from helloworld import HelloWorld
    from helloworld.ttypes import *

    from thrift.transport import TSocket
    from thrift.transport import TTransport
    from thrift.protocol import TBinaryProtocol
    from thrift.server import TServer


    class HelloWorldHandler:
        def __init__(self):
            pass

        def ping(self, name):
            # Log the call on the server and echo the name back to the client
            print(name + ' from server.')
            return "%s from server." % name

        def getpng(self):
            # Return the raw contents of the image file
            with open("./logo.png", "rb") as f:
                return f.read()


    handler = HelloWorldHandler()
    processor = HelloWorld.Processor(handler)
    transport = TSocket.TServerSocket(port=9090)
    tfactory = TTransport.TBufferedTransportFactory()
    pfactory = TBinaryProtocol.TBinaryProtocolFactory()

    server = TServer.TSimpleServer(processor, transport, tfactory, pfactory)

    # You could do one of these for a multithreaded server
    #server = TServer.TThreadedServer(processor, transport, tfactory, pfactory)
    #server = TServer.TThreadPoolServer(processor, transport, tfactory, pfactory)

    print('Starting the server...')
    server.serve()
    print('done.')

(4) Write the client code

First copy the gen-php directory to the client.

<?php
try {
    // Include the Thrift client library files
    $GLOBALS['THRIFT_ROOT'] = './php/src';
    require_once $GLOBALS['THRIFT_ROOT'].'/Thrift.php';
    require_once $GLOBALS['THRIFT_ROOT'].'/protocol/TBinaryProtocol.php';
    require_once $GLOBALS['THRIFT_ROOT'].'/transport/TSocket.php';
    require_once $GLOBALS['THRIFT_ROOT'].'/transport/THttpClient.php';
    require_once $GLOBALS['THRIFT_ROOT'].'/transport/TBufferedTransport.php';

    // Silence errors while the generated code is loaded
    error_reporting(0);
    // Include the generated HelloWorld interface file
    $GEN_DIR = './gen-php';
    require_once $GEN_DIR.'/HelloWorld.php';
    error_reporting(E_ALL);

    $socket = new TSocket('*.*.*.*', 9090);
    $transport = new TBufferedTransport($socket, 1024, 1024);
    $protocol = new TBinaryProtocol($transport);
    $client = new HelloWorldClient($protocol);

    $transport->open();
    $a = $client->ping('xyq');
    echo $a;
    $transport->close();
} catch (TException $tx) {
    print 'TException: '.$tx->getMessage()."\n";
}
?>

Reference: http://blog.csdn.net/heiyeshuwu/article/details/5982222

5. Example from the official Thrift website

Apache Thrift lets you define data types and service interfaces in a simple .thrift file. The compiler takes that file as input and generates the source code used to build RPC clients and servers that communicate across programming languages.
The key code is shown below.

(1) thrift definition file

/* Interface data type */
struct UserProfile {
    1: i32 uid,
    2: string name,
    3: string blurb
}

/* Interface functions */
service UserStorage {
    void store(1: UserProfile user),
    UserProfile retrieve(1: i32 uid)
}

(2) Client Python implementation

    # Make an object
    up = UserProfile(uid=1,
                     name="Test User",
                     blurb="Thrift is great")

    # Talk to a server via TCP sockets, using a binary protocol
    transport = TSocket.TSocket("localhost", 9090)
    transport.open()
    protocol = TBinaryProtocol.TBinaryProtocol(transport)

    # Use the service we already defined
    service = UserStorage.Client(protocol)
    service.store(up)

    # Retrieve something as well
    up2 = service.retrieve(2)

(3) Server implementation in C++

class UserStorageHandler : virtual public UserStorageIf {
 public:
  UserStorageHandler() {
    // Your initialization goes here
  }

  void store(const UserProfile& user) {
    // Your implementation goes here
    printf("store\n");
  }

  void retrieve(UserProfile& _return, const int32_t uid) {
    // Your implementation goes here
    printf("retrieve\n");
  }
};

int main(int argc, char **argv) {
  int port = 9090;
  shared_ptr<UserStorageHandler> handler(new UserStorageHandler());
  shared_ptr<TProcessor> processor(new UserStorageProcessor(handler));
  shared_ptr<TServerTransport> serverTransport(new TServerSocket(port));
  shared_ptr<TTransportFactory> transportFactory(new TBufferedTransportFactory());
  shared_ptr<TProtocolFactory> protocolFactory(new TBinaryProtocolFactory());
  TSimpleServer server(processor, serverTransport, transportFactory, protocolFactory);
  server.serve();
}
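The C++ skeleton leaves the bodies of store and retrieve empty. As a rough idea of what a handler might do, here is a plain-Python sketch that keeps profiles in a dictionary. It does not use the Thrift library or any generated code: the UserProfile dataclass only mirrors the struct fields from the definition file, and InMemoryUserStorage is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass
class UserProfile:
    # Mirrors the fields of the UserProfile struct in the .thrift definition
    uid: int
    name: str
    blurb: str


class InMemoryUserStorage:
    """Hypothetical handler: keeps profiles in a dict keyed by uid."""

    def __init__(self):
        self._profiles = {}

    def store(self, user):
        # A later store with the same uid overwrites the earlier profile
        self._profiles[user.uid] = user

    def retrieve(self, uid):
        # Raises KeyError for an unknown uid; a real service would declare
        # its own exception type in the .thrift file
        return self._profiles[uid]


storage = InMemoryUserStorage()
storage.store(UserProfile(uid=1, name="Test User", blurb="Thrift is great"))
print(storage.retrieve(1).name)  # prints "Test User"
```

A real handler would persist the data somewhere durable; the dictionary is only there to make the two calls observable.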

6. Practical experience

(1) Thrift interface file

The file name is hive.thrift, and its content is as follows:

namespace java com.gj.data.hive.thrift

/**
 * Service for submitting Hive tasks to Hadoop. Typical usage: submit a task,
 * poll until the task is finished, obtain the result URI of the task, and
 * read the result file.
 * Example client usage in Java:
 *   long taskId = client.submitTask("abc@gj.com", "web", "select * from table1 where dt = '2017-04-10' limit 10;");
 *   if (taskId <= 0) {
 *       System.out.println("error submit");
 *       return;
 *   }
 *   // Poll until the task is finished
 *   int count = 50;
 *   while (count >= 0) {
 *       try {
 *           Thread.sleep(30 * 1000);
 *       } catch (InterruptedException ex) {}
 *       if (client.isTaskFinished(taskId)) {
 *           System.out.println(client.getResultURI(taskId));
 *           break;
 *       }
 *       count--;
 *   }
 */
service Hive {
    /**
     * Submit a task.
     * user - user name, a work email address such as abc@xxx.com
     * env - submission environment. Two environments are currently supported: the mobile site and web (the main site).
     * sql - the SQL statement to submit.
     * Return value: on success, a task ID greater than 0, which is used for subsequent queries; on failure, 0 or -1.
     */
    i64 submitTask(1: string user, 2: string env, 3: string sql);

    /**
     * Check whether a task is finished.
     * taskId - task ID
     * Return value: true if the task is finished, otherwise false.
     */
    bool isTaskFinished(1: i64 taskId);

    /**
     * Obtain the URI of the task result. The result data can be fetched from this URI.
     * taskId - task ID
     * Return value: the URI if the task has a result, otherwise an empty string.
     */
    string getResultURI(1: i64 taskId);

    /**
     * Retrieve the list of all tasks of a user.
     * user - user name, the full Ganji email address
     * Return value: list of task IDs; empty if the user has no tasks.
     */
    list<i64> getTasksByUserName(1: string user);
}
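The usage pattern described in the interface comments (submit, poll, then fetch the URI) can be sketched in Python as follows. FakeHiveClient is a hypothetical in-memory stand-in so the loop can run without a server; its method names are assumed to mirror the service definition, and the example.invalid URI is a placeholder. The sleep function is passed in as a parameter so the loop can be exercised without real waiting.

```python
import time


class FakeHiveClient:
    """Hypothetical stand-in for the generated client; finishes after N polls."""

    def __init__(self, polls_until_done=3):
        self._remaining = polls_until_done

    def submitTask(self, user, env, sql):
        return 1  # a task ID greater than 0 means the submission succeeded

    def isTaskFinished(self, task_id):
        self._remaining -= 1
        return self._remaining <= 0

    def getResultURI(self, task_id):
        return "http://example.invalid/result/%d" % task_id


def run_query(client, user, env, sql, max_polls=50, interval=30.0, sleep=time.sleep):
    """Submit a task, poll until it finishes, and return its result URI (or None)."""
    task_id = client.submitTask(user, env, sql)
    if task_id <= 0:
        raise RuntimeError("error submit")
    for _ in range(max_polls):
        sleep(interval)
        if client.isTaskFinished(task_id):
            return client.getResultURI(task_id)
    return None  # gave up after max_polls checks


uri = run_query(FakeHiveClient(), "abc@xxx.com", "web", "select 1;", sleep=lambda s: None)
print(uri)
```

Bounding the number of polls, as the Java example in the comments does with its counter, keeps a stuck task from blocking the caller forever.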

(2) Generate the PHP interface files for Hive (run on the server)

/usr/local/thrift/bin/thrift --gen php hive.thrift
The Hive.php and hive_types.php files are then generated under the gen-php directory.
Copy the Hive.php and hive_types.php files to the development directory of the PHP client.

(3) Configure the PHP client

To use Thrift from a PHP client, configure it as follows.

(A) Prepare the Thrift PHP client base classes

These base classes can be found in the Thrift source package under thriftsrc/lib/php/src, which contains the ext/, protocol/, and transport/ directories and the Thrift.php and autoload.php files. Copy these directories and files to the /Server/www/third_part/thrift-0.5.0/ directory on the client.

(B) Install the Thrift extension for PHP

To use Thrift, PHP also needs the PHP Thrift extension installed.
Proceed as follows:
Download the corresponding PHP Thrift extension and unpack it;
Go to ext/thrift_protocol under the source directory;
/usr/local/php/bin/phpize
./configure --with-php-config=/usr/local/php/bin/php-config --enable-thrift_protocol
make
make install
Add the generated thrift_protocol.so file to php.ini and restart the Apache service.

(4) PHP client implementation

File Name: updatehivedata. php

<?php
$GLOBALS['THRIFT_ROOT'] = '/Server/www/third_part/thrift-0.5.0';
require_once $GLOBALS['THRIFT_ROOT'].'/Thrift.php';
require_once $GLOBALS['THRIFT_ROOT'].'/packages/scribe.php';
require_once $GLOBALS['THRIFT_ROOT'].'/protocol/TBinaryProtocol.php';
require_once $GLOBALS['THRIFT_ROOT'].'/transport/TSocket.php';
require_once $GLOBALS['THRIFT_ROOT'].'/transport/THttpClient.php';
require_once $GLOBALS['THRIFT_ROOT'].'/transport/TFramedTransport.php';
require_once $GLOBALS['THRIFT_ROOT'].'/transport/TBufferedTransport.php';

// The generated files
require_once dirname(__FILE__).'/Hive.php';
//require_once dirname(__FILE__).'/hive_types.php';

error_reporting(E_ALL);
ini_set('display_errors', 'on');

$socket = new TSocket('hive.corp.gj.com', 13080);
$socket->setDebug(TRUE);
// Set the send and receive timeouts (milliseconds)
$socket->setSendTimeout(10000);
$socket->setRecvTimeout(10000);

//$transport = new TBufferedTransport($socket, 1024, 1024);
$transport = new TFramedTransport($socket);
$protocol = new TBinaryProtocol($transport);
$client = new HiveClient($protocol);

try {
    $transport->open();
} catch (TException $tx) {
    // Without an open connection nothing below can work, so stop here
    echo $tx->getMessage();
    exit;
}

// Get PV and UV
$taskId = $client->submitTask('xxx@xxx.com', 'web', "select regexp_extract(gjch, '^/([^/]+)', 1), count(*), count(distinct uuid) from table1 where dt = '2017-04-22' and gjch regexp '[^@]*/detail' group by regexp_extract(gjch, '^/([^/]+)', 1);");
if ($taskId <= 0) {
    echo 'error submit';
    exit;
}
echo $taskId."\n";

$count = 50;
while ($count > 0) {
    try {
        // sleep() takes seconds; this polls every 3 minutes
        sleep(3 * 60);
    } catch (TException $tx) {}
    if ($client->isTaskFinished($taskId)) {
        //echo $client->getResultURI($taskId);
        $url = $client->getResultURI($taskId);
        //echo $url;
        $handle = fopen($url, "rb");
        $content = stream_get_contents($handle);
        echo $content;
        fclose($handle);
        break;
    }
    $count--;
}
$transport->close();
?>

Because I am not responsible for the server side, I implemented only the PHP client, based on the Thrift definition file. The running result is as follows:

Here, the URL is the value returned by $client->getResultURI(). The page content is the content found at that URI.
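Reading the result is an ordinary fetch of whatever the URI points to. A minimal Python counterpart of the fopen and stream_get_contents calls might look like the sketch below; the function name read_result and the 30-second timeout are assumptions, and the format of the returned data depends on the server.

```python
from urllib.request import urlopen


def read_result(uri, timeout=30):
    """Return the raw bytes found at a result URI."""
    with urlopen(uri, timeout=timeout) as response:
        return response.read()
```

The function returns bytes rather than text because nothing in the interface definition states the encoding of the result file.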

7. Thrift class description

Transport classes (transport layer):
Responsible for data transmission. Several common classes:
TSocket: transfers data over a TCP socket;
TBufferedTransport: buffers the data handled by another transport object; reads are served from the buffer and writes go into the buffer first;
TFramedTransport: like TBufferedTransport, it buffers data, and it sends and receives each message as a length-prefixed frame;
TFileTransport: a file (log) transport class that lets the client send a file to the server, which then writes the received data to a file;
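The framing idea behind TFramedTransport can be illustrated with the standard library alone: each message is written as a 4-byte big-endian length followed by that many bytes, so the receiver always knows where a message ends. This is a simplified sketch and not the Thrift implementation; write_frame, read_exact, and read_frame are names chosen for the example.

```python
import io
import struct


def write_frame(stream, payload):
    """Write one frame: a 4-byte big-endian length, then the payload bytes."""
    stream.write(struct.pack("!i", len(payload)))
    stream.write(payload)


def read_exact(stream, size):
    """Read exactly size bytes or fail if the stream ends early."""
    data = stream.read(size)
    if len(data) != size:
        raise EOFError("stream ended inside a frame")
    return data


def read_frame(stream):
    """Read one frame written by write_frame and return its payload."""
    (length,) = struct.unpack("!i", read_exact(stream, 4))
    return read_exact(stream, length)


buffer = io.BytesIO()
write_frame(buffer, b"hello")
write_frame(buffer, b"thrift")
buffer.seek(0)
print(read_frame(buffer), read_frame(buffer))  # prints b'hello' b'thrift'
```

Client and server must agree on the transport: a framed client cannot talk to a server that expects unframed, buffered data.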

Protocol classes (protocol layer):
Responsible for data encoding. Common classes:
TBinaryProtocol: binary encoding;
TJSONProtocol: JSON encoding.
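To give a feel for what binary encoding means here, the sketch below encodes single values the way the binary protocol lays them out: integers as fixed-width big-endian bytes, and strings as a 4-byte length followed by UTF-8 data. It covers only individual values and leaves out message and field headers; the function names are invented for the example.

```python
import struct


def encode_i32(value):
    """Encode a 32-bit signed integer as 4 big-endian bytes."""
    return struct.pack("!i", value)


def encode_i64(value):
    """Encode a 64-bit signed integer as 8 big-endian bytes."""
    return struct.pack("!q", value)


def encode_string(text):
    """Encode a string as a 4-byte length followed by its UTF-8 bytes."""
    data = text.encode("utf-8")
    return encode_i32(len(data)) + data


def decode_string(data, offset=0):
    """Return (text, next_offset) for a string produced by encode_string."""
    (length,) = struct.unpack_from("!i", data, offset)
    start = offset + 4
    return data[start:start + length].decode("utf-8"), start + length


encoded = encode_string("xyq")
print(encoded)                 # prints b'\x00\x00\x00\x03xyq'
print(decode_string(encoded))  # prints ('xyq', 7)
```

A JSON encoding of the same value would be readable text but larger on the wire, which is the usual trade-off between the two protocol classes.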
