This article describes how to query Hadoop data with SQL. The approach is as follows: PHP submits an SQL query to Hive through Thrift, Hive converts the query into Hadoop jobs, and once Hadoop finishes executing them a result URI is returned; we then only need to read the content at that URI.
Thrift was created to solve the problem of calling services across programming languages. It is a lightweight cross-language service framework, originally developed at Facebook and later handed over to the Apache Software Foundation, and is comparable to Google's Protocol Buffers and ZeroC Ice. One of its main advantages is broad language support, including C++, PHP, Java, and Python. It uses its own interface definition language (IDL) to describe service interfaces and data exchange formats.
Hive can access HBase through an SQL-like language. HBase is an open-source NoSQL database, an open-source implementation of Google's Bigtable paper. Together with Hadoop and HDFS, HBase can be used to store and process massive amounts of column-family data.
Thrift Official Website: http://thrift.apache.org/
I. Download and install Thrift on the server
Install Thrift on a cluster where both Hadoop and HBase are already installed.
(1) Download: fetch the thrift-0.8.0.tar.gz source package with wget.
(2) Unpack: tar -xzf thrift-0.8.0.tar.gz
(3) Compile and install: if you build from source, first run ./bootstrap.sh to generate the ./configure script;
then run ./configure;
then run make && make install
(4) Start: ./bin/hbase-daemon.sh start thrift
The default listening port of Thrift is 9090.
Reference: http://blog.csdn.net/hguisu/article/details/7298456
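Once the service has been started, a quick way to confirm that something is listening on the default port 9090 mentioned above is to attempt a TCP connection. The following is a minimal sketch using only the Python standard library; the helper name is_port_open and the host "localhost" are assumptions made for illustration, not part of Thrift.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, is_port_open("localhost", 9090) should return True on the machine where the Thrift service was started, assuming it kept the default port.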
II. Create a .thrift file
After the Thrift compiler is installed, you need to create a .thrift file. This is the interface definition file, in which you define Thrift types and services. The services defined in this file are implemented by the server and called by the client. The Thrift compiler turns the .thrift file you wrote into interface source code for each target language, and the client libraries and the server you write are built on that generated code. To generate the interface source code for a given language from the .thrift file, run:
thrift --gen <language> <Thrift filename>
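When several target languages are needed, the same command is simply repeated once per language. The sketch below shows one way this could be scripted in Python; build_gen_command and generate are hypothetical helper names, and the script assumes the thrift compiler is reachable under the name given in thrift_bin.

```python
import subprocess


def build_gen_command(thrift_file, language, thrift_bin="thrift", recursive=False):
    """Build the argument list for one `thrift --gen <language> <file>` call."""
    cmd = [thrift_bin]
    if recursive:
        cmd.append("-r")  # also generate code for included .thrift files
    cmd += ["--gen", language, thrift_file]
    return cmd


def generate(thrift_file, languages, **kwargs):
    """Run the compiler once per target language (needs the thrift binary installed)."""
    for language in languages:
        subprocess.run(build_gen_command(thrift_file, language, **kwargs), check=True)
```

Calling generate("helloworld.thrift", ["py", "php"]) would run the compiler twice and leave one gen-* directory per language in the current directory.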
III. Thrift file description
Supported variable types:
Type                  Description
bool                  # true or false
byte                  # 8-bit signed integer
i16                   # 16-bit signed integer
i32                   # 32-bit signed integer
i64                   # 64-bit signed integer
double                # 64-bit floating point
string                # UTF-8 encoded string
binary                # unencoded byte sequence
struct                # structure
list<type>            # ordered list of elements, similar to the STL vector
set<type>             # unordered set of unique elements, similar to the STL set
map<type1,type2>      # key-value mapping, similar to the STL map
exception             # exception type, inheriting from the target language's native exception base class
service               # a service containing multiple function interfaces (comparable to pure virtual functions)
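To make the widths in the type list concrete, the sketch below uses Python's struct module to compute the size and value range of each fixed-width integer type. The table names INTEGER_FORMATS and the helper functions are illustrative assumptions; they are not part of the Thrift library.

```python
import struct

# struct format codes with standard sizes (no padding) for the signed integer types
INTEGER_FORMATS = {
    "byte": ">b",
    "i16": ">h",
    "i32": ">i",
    "i64": ">q",
}


def size_in_bytes(thrift_type):
    """Number of bytes occupied by one value of the given integer type."""
    return struct.calcsize(INTEGER_FORMATS[thrift_type])


def signed_range(thrift_type):
    """Smallest and largest value representable by the given integer type."""
    bits = 8 * size_in_bytes(thrift_type)
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def fits(thrift_type, value):
    """True if value can be stored in the given integer type without overflow."""
    low, high = signed_range(thrift_type)
    return low <= value <= high
```

A check such as fits("i32", task_id) is a cheap way to decide whether a field needs i64 instead, which is why identifiers that can grow large are usually declared as i64.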
IV. Developing a simple HelloWorld program, from server to client
Python is used for the server and PHP for the client; C++ could also be used for the server. The development process is described below.
(1) First, write a helloworld.thrift file on the server, as shown below:
service HelloWorld {
    string ping(1: string name),
    string getpng(),
}
(2) Compile helloworld.thrift on the server
Compile the helloworld.thrift file to generate the interface source code in the server and client languages.
/usr/local/thrift/bin/thrift -r --gen py helloworld.thrift
/usr/local/thrift/bin/thrift -r --gen php helloworld.thrift
# The gen-* directories are generated under the current directory.
The generated gen-py directory goes on the server, and the generated gen-php directory goes on the client.
(3) Write server code
import sys
sys.path.append('./gen-py')  # make the generated code importable

from helloworld import HelloWorld
from helloworld.ttypes import *
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol
from thrift.server import TServer

class HelloWorldHandler:
    """Implements the HelloWorld service defined in helloworld.thrift."""

    def __init__(self):
        pass

    def ping(self, name):
        print(name + ' from server.')
        return "%s from server." % name

    def getpng(self):
        # Read the whole image file and return its content
        with open("./logo.png", "rb") as f:
            return f.read()

handler = HelloWorldHandler()
processor = HelloWorld.Processor(handler)
# Pass the port by keyword so it is not mistaken for another positional argument
transport = TSocket.TServerSocket(port=9090)
tfactory = TTransport.TBufferedTransportFactory()
pfactory = TBinaryProtocol.TBinaryProtocolFactory()

server = TServer.TSimpleServer(processor, transport, tfactory, pfactory)
# You could do one of these for a multithreaded server
#server = TServer.TThreadedServer(processor, transport, tfactory, pfactory)
#server = TServer.TThreadPoolServer(processor, transport, tfactory, pfactory)

print('Starting the server...')
server.serve()
print('done.')
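Conceptually, the processor object in the server code receives a decoded method name and arguments and forwards them to the handler. The sketch below models only that dispatch step in plain Python, with no Thrift dependency; ToyProcessor is a hypothetical stand-in written for illustration and is not how the generated HelloWorld.Processor is actually implemented.

```python
class HelloWorldHandler:
    """Same ping logic as the server handler, without any Thrift dependency."""

    def ping(self, name):
        return "%s from server." % name


class ToyProcessor:
    """Hypothetical stand-in for a generated processor: name -> handler method."""

    def __init__(self, handler):
        self._handler = handler

    def process(self, method_name, *args):
        # Refuse private names so only service methods can be called remotely
        if method_name.startswith("_"):
            raise ValueError("unknown method: %s" % method_name)
        method = getattr(self._handler, method_name, None)
        if method is None:
            raise ValueError("unknown method: %s" % method_name)
        return method(*args)
```

The point of the separation is that the handler contains only business logic, while everything related to decoding requests and encoding replies stays in generated or library code.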
(4) Write the client code
First copy the gen-php directory to the client.
<?php
try {
    // Include the Thrift client library files
    $GLOBALS['THRIFT_ROOT'] = './php/src';
    require_once $GLOBALS['THRIFT_ROOT'].'/Thrift.php';
    require_once $GLOBALS['THRIFT_ROOT'].'/protocol/TBinaryProtocol.php';
    require_once $GLOBALS['THRIFT_ROOT'].'/transport/TSocket.php';
    require_once $GLOBALS['THRIFT_ROOT'].'/transport/THttpClient.php';
    require_once $GLOBALS['THRIFT_ROOT'].'/transport/TBufferedTransport.php';
    // Silence notices while the generated code is loaded (E_NONE is not a PHP constant, so 0 is used)
    error_reporting(0);
    // Include the HelloWorld interface file
    $GEN_DIR = './gen-php';
    require_once $GEN_DIR.'/HelloWorld.php';
    error_reporting(E_ALL);

    $socket = new TSocket('*.*.*.*', 9090);
    $transport = new TBufferedTransport($socket, 1024, 1024);
    $protocol = new TBinaryProtocol($transport);
    $client = new HelloWorldClient($protocol);

    $transport->open();
    $a = $client->ping('xyq');
    echo $a;
    $transport->close();
} catch (TException $tx) {
    print 'TException: '.$tx->getMessage()."\n";
}
?>
Finally, a reference link: http://blog.csdn.net/heiyeshuwu/article/details/5982222
2. Examples provided on Thrift's official website
Apache Thrift lets you define data types and service interfaces in a simple .thrift file. Taking that file as input, the compiler generates the source code used to build RPC clients and servers that communicate across programming languages.
The key code is provided below.
(1) Thrift definition file
/* Interface data type definition */
struct UserProfile {
    1: i32 uid,
    2: string name,
    3: string blurb
}
/* Interface function definition */
service UserStorage {
    void store(1: UserProfile user),
    UserProfile retrieve(1: i32 uid)
}
(2) Client Python implementation
from thrift.transport import TSocket
from thrift.protocol import TBinaryProtocol
# UserProfile and UserStorage come from the code generated by the Thrift compiler

# Make an object
up = UserProfile(uid=1, name="Test User", blurb="Thrift is great")

# Talk to a server via TCP sockets, using a binary protocol
transport = TSocket.TSocket("localhost", 9090)
transport.open()
protocol = TBinaryProtocol.TBinaryProtocol(transport)

# Use the service we already defined
service = UserStorage.Client(protocol)
service.store(up)

# Retrieve something as well
up2 = service.retrieve(2)
(3) Server C++ implementation
class UserStorageHandler : virtual public UserStorageIf {
 public:
  UserStorageHandler() {
    // Your initialization goes here
  }

  void store(const UserProfile& user) {
    // Your implementation goes here
    printf("store\n");
  }

  void retrieve(UserProfile& _return, const int32_t uid) {
    // Your implementation goes here
    printf("retrieve\n");
  }
};

int main(int argc, char **argv) {
  int port = 9090;
  shared_ptr<UserStorageHandler> handler(new UserStorageHandler());
  shared_ptr<TProcessor> processor(new UserStorageProcessor(handler));
  shared_ptr<TServerTransport> serverTransport(new TServerSocket(port));
  shared_ptr<TTransportFactory> transportFactory(new TBufferedTransportFactory());
  shared_ptr<TProtocolFactory> protocolFactory(new TBinaryProtocolFactory());
  TSimpleServer server(processor, serverTransport, transportFactory, protocolFactory);
  server.serve();
}
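The C++ skeleton above leaves store and retrieve as placeholders. As a rough sketch of what the handler logic might look like once filled in, here it is in Python with a plain dictionary as the backing store. The UserProfile dataclass and the InMemoryUserStorage class are illustrative assumptions; they are not what the Thrift compiler generates.

```python
from dataclasses import dataclass


@dataclass
class UserProfile:
    """Illustrative mirror of the UserProfile struct (uid, name, blurb)."""
    uid: int
    name: str
    blurb: str


class InMemoryUserStorage:
    """Hypothetical handler: keeps profiles in a dict keyed by uid."""

    def __init__(self):
        self._profiles = {}

    def store(self, user):
        # A later store with the same uid overwrites the earlier profile
        self._profiles[user.uid] = user

    def retrieve(self, uid):
        # Raises KeyError when no profile was stored under this uid
        return self._profiles[uid]
```

With this sketch, the Python client shown earlier would store uid 1 and then fail on retrieve(2), since nothing was stored under that uid; a real service would typically declare an exception in the .thrift file for that case.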
3. Practical Experience
(1). Thrift interface file
The file name is hive.thrift, as shown below:
namespace java com.gj.data.hive.thrift

/**
 * Service for submitting Hive tasks to Hadoop. Typical usage: submit a task, poll until the task
 * is finished, get the result URI of the task, then read the result file.
 * Client usage in Java:
 *   long taskid = client.submittask("abc@gj.com", "web", "select * from Table1 where dt = '2017-04-10' limit 10;");
 *   if (taskid <= 0) {
 *     System.out.println("error submit");
 *     return;
 *   }
 *   // poll until the task is finished
 *   int count = 50;
 *   while (count >= 0) {
 *     try {
 *       Thread.sleep(30 * 1000);
 *     } catch (InterruptedException ex) {}
 *     if (client.istaskfinished(taskid)) {
 *       System.out.println(client.getresulturi(taskid));
 *       break;
 *     }
 *     count--;
 *   }
 */
service hive {
    /**
     * Submit a task.
     * user - user name, the work email address, such as abc@xxx.com
     * env - submission environment; two are currently supported: mobile (mobile site) and web (main site)
     * sql - the SQL statement to submit
     * Return value: on success, a task id greater than 0, used for subsequent queries; on failure, 0 or -1.
     */
    i64 submittask(1: string user, 2: string env, 3: string sql);

    /**
     * Check whether a task is finished.
     * taskid - task id
     * Return value: true if the task is finished, otherwise false.
     */
    bool istaskfinished(1: i64 taskid);

    /**
     * Get the URI of the task result. The result data can be fetched through this URI.
     * taskid - task id
     * Return value: the URI if the task has a result, otherwise an empty string.
     */
    string getresulturi(1: i64 taskid);

    /**
     * Get the list of all tasks of a user.
     * user - user name, the full Ganji email address
     * Return value: list of task ids; empty if the user has no tasks.
     */
    list<i64> gettasksbyusername(1: string user);
}
(2) Generate the PHP and HBase interface files (done on the server)
/usr/local/thrift/bin/thrift --gen php hive.thrift
The hive.php and hive_types.php files are then generated under the gen-php directory.
Copy the hive.php and hive_types.php files to the PHP client's development directory.
(3) Configure the PHP client
To use Thrift from a PHP client, configure it as follows.
(A) Prepare the basic classes of the Thrift PHP client
These basic classes are in the Thrift source package, under thriftsrc/lib/php/src: the ext/, protocol/, and transport/ directories plus the Thrift.php and autoload.php files. Copy these directories and files to the /Server/www/third_part/thrift-0.5.0/ directory on the client.
(B) Install the Thrift extension for PHP
To use Thrift, PHP also needs the Thrift PHP extension installed.
The steps are as follows:
Download the corresponding PHP Thrift extension and unpack it;
Go to ext/thrift_protocol in the source tree;
/usr/local/php/bin/phpize
./configure --with-php-config=/usr/local/php/bin/php-config --enable-thrift_protocol
make
make install
Add the generated thrift_protocol.so file to php.ini and restart the Apache service.
(4) PHP client implementation
File name: updatehivedata.php
<?php
$GLOBALS['THRIFT_ROOT'] = '/Server/www/third_part/thrift-0.5.0';
require_once $GLOBALS['THRIFT_ROOT'].'/Thrift.php';
require_once $GLOBALS['THRIFT_ROOT'].'/packages/scribe.php';
require_once $GLOBALS['THRIFT_ROOT'].'/protocol/TBinaryProtocol.php';
require_once $GLOBALS['THRIFT_ROOT'].'/transport/TSocket.php';
require_once $GLOBALS['THRIFT_ROOT'].'/transport/THttpClient.php';
require_once $GLOBALS['THRIFT_ROOT'].'/transport/TFramedTransport.php';
require_once $GLOBALS['THRIFT_ROOT'].'/transport/TBufferedTransport.php';
// The generated files
require_once dirname(__FILE__).'/hive.php';
// require_once dirname(__FILE__).'/hive_types.php';

error_reporting(E_ALL);
ini_set('display_errors', 'on');

$socket = new TSocket('hive.corp.gj.com', 13080);
$socket->setDebug(true);
// Set the send and receive timeouts (milliseconds)
$socket->setSendTimeout(10000);
$socket->setRecvTimeout(10000);
// $transport = new TBufferedTransport($socket, 1024, 1024);
$transport = new TFramedTransport($socket);
$protocol = new TBinaryProtocol($transport);
$client = new hiveClient($protocol);

try {
    $transport->open();
} catch (TException $tx) {
    echo $tx->getMessage();
    exit; // nothing can be submitted without an open connection
}

// Get PV and UV
$taskid = $client->submittask('xxx@xxx.com', 'web', "select regexp_extract(gjch, '^/([^/]+)', 1), count(*), count(distinct UUID) from Table1 where dt = '2017-04-22' and gjch regexp '[^@]*/detail' group by regexp_extract(gjch, '^/([^/]+)', 1);");
if ($taskid <= 0) {
    echo 'error submit';
    exit;
}
echo $taskid."\n";

$count = 50;
while ($count > 0) {
    try {
        // sleep() takes seconds; here we poll every 3 minutes
        sleep(3 * 60);
    } catch (TException $tx) {}
    if ($client->istaskfinished($taskid)) {
        // echo $client->getresulturi($taskid);
        $url = $client->getresulturi($taskid);
        // echo $url;
        $handle = fopen($url, "rb");
        $content = stream_get_contents($handle);
        echo $content;
        fclose($handle);
        break;
    }
    $count--;
}
$transport->close();
?>
Because I am not responsible for the server side, the PHP client was implemented purely from the Thrift definition file. The running result is as follows:
Here, the URL is the value returned by $client->getresulturi(), and the page content is the content found at that URI.
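The submit, poll, and fetch pattern used by the client can be isolated into a small helper. The sketch below is written in Python against any object that offers istaskfinished and getresulturi with the meanings given in the interface file. FakeHiveClient and its example.invalid URI are hypothetical stand-ins so the loop can be exercised without a server; the sleep function is injectable so a test does not have to wait.

```python
import time


def wait_for_result_uri(client, task_id, max_polls=50, interval_seconds=180,
                        sleep=time.sleep):
    """Poll until the task is finished; return its result URI, or None on give-up."""
    for _ in range(max_polls):
        sleep(interval_seconds)
        if client.istaskfinished(task_id):
            return client.getresulturi(task_id)
    return None


class FakeHiveClient:
    """Hypothetical in-memory client that reports completion after a few polls."""

    def __init__(self, polls_until_done):
        self._remaining = polls_until_done

    def istaskfinished(self, task_id):
        self._remaining -= 1
        return self._remaining <= 0

    def getresulturi(self, task_id):
        # Placeholder URI; a real service returns the location of the result file
        return "http://example.invalid/result/%d" % task_id
```

Bounding the number of polls matters here: without max_polls, a task that never finishes would keep the client process alive indefinitely.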
V. Thrift class description
Transport classes (transport layer):
Responsible for data transmission. Several commonly used classes:
TSocket: transmits data over a TCP socket;
TBufferedTransport: buffers the data handled by an underlying transport object, that is, reads are served from the buffer and writes go into the buffer first;
TFramedTransport: like TBufferedTransport, it buffers data, and it additionally sends and receives data in frames, each prefixed with its length;
TFileTransport: a file (log) transport class that lets the client send a file to the server; the server writes the received data to a file;
Protocol classes (protocol layer):
Responsible for data encoding. Commonly used classes:
TBinaryProtocol: binary encoding;
TJSONProtocol: JSON encoding.
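As a rough illustration of what a framed transport and a binary protocol do with bytes, the sketch below assumes that each frame is a 4-byte big-endian length followed by the payload, and that a string is encoded as a 4-byte length followed by its UTF-8 bytes. The function names are made up for this example, and the code is a simplified model rather than the Thrift library's implementation.

```python
import io
import struct


def write_frame(stream, payload):
    """Write one frame: 4-byte big-endian length, then the payload bytes."""
    stream.write(struct.pack(">i", len(payload)))
    stream.write(payload)


def read_frame(stream):
    """Read one frame written by write_frame and return its payload."""
    header = stream.read(4)
    if len(header) < 4:
        raise EOFError("incomplete frame header")
    (length,) = struct.unpack(">i", header)
    payload = stream.read(length)
    if len(payload) < length:
        raise EOFError("incomplete frame body")
    return payload


def encode_string(text):
    """Binary-style string encoding: 4-byte length followed by UTF-8 bytes."""
    data = text.encode("utf-8")
    return struct.pack(">i", len(data)) + data
```

Because every frame announces its own length, the receiver can read exactly one message at a time from the stream, which is why client and server must agree on whether the framed transport is in use.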