Impala consists of three components: impalad, statestored, and the client (impala-shell). This article introduces the basic function of each component and the RPC between them.
Client: it can be the Python CLI (the officially provided impala_shell.py), JDBC/ODBC, or Hue. Whichever is used, it is in fact a Thrift client that connects to impalad on port 21000.
Impalad: it is divided into two parts, frontend and backend. The process runs three ThriftServers (beeswax_server, hs2_server, be_server) that provide services both outside and inside the system.
Statestored: the data exchange center for the backend services in the cluster. Each backend registers itself with statestored; from then on, statestored exchanges update messages with all registered backend services.
RPC
Component | Service | Port | Access Requirement | Comment
Impala Daemon | Impala Daemon Backend Port | 22000 | Internal | ImpalaBackendService export
Impala Daemon | Impala Daemon Frontend Port | 21000 | External | ImpalaService export
Impala Daemon | Impala Daemon HTTP Server Port | 25000 | External | Impala debug web server
Impala Daemon | StateStoreSubscriber Service Port | 23000 | Internal | StateStoreSubscriberService
Impala StateStore Daemon | StateStore HTTP Server Port | 25010 | External | StateStore debug web server
Impala StateStore Daemon | StateStore Service Port | 24000 | Internal | StateStoreService export
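The port table can also be written down as a small lookup structure, which is handy for things like a connectivity check or a firewall rule list. This is only a sketch: the names PORTS and external_ports are illustrative and are not part of Impala.

```python
# Default Impala ports, copied from the table above.
# PORTS and external_ports() are illustrative names, not Impala APIs.
PORTS = [
    # (component, service, port, access)
    ("impalad", "backend", 22000, "internal"),
    ("impalad", "frontend", 21000, "external"),
    ("impalad", "http", 25000, "external"),
    ("impalad", "statestore_subscriber", 23000, "internal"),
    ("statestored", "http", 25010, "external"),
    ("statestored", "statestore", 24000, "internal"),
]


def external_ports(component):
    """Return the sorted ports of `component` that are marked External."""
    return sorted(
        port
        for comp, _service, port, access in PORTS
        if comp == component and access == "external"
    )


if __name__ == "__main__":
    print(external_ports("impalad"))
    print(external_ports("statestored"))
```

Only the ports marked External need to be reachable by clients; the Internal ones are used between the daemons themselves.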
The following describes the Thrift RPC between the three components (in "A <-> B", A is the RPC client and B is the RPC server):
(1) Client <-> impalad (frontend)
BeeswaxService (beeswax.thrift): the client submits an SQL request through query(), then repeatedly calls get_state() to follow the progress of the query. Once the query has completed, it calls fetch() to retrieve the result.
TCLIService (cli_service.thrift): the client submits an SQL request in much the same way as above. More importantly, it supports DDL operations; for example, GetTables() returns the metadata of the specified table.
ImpalaService and ImpalaHiveServer2Service (ImpalaService.thrift) are subclasses of the two services above. Each adds some functions, while the core functions remain unchanged.
(2) Impalad (backend) <-> statestored
StateStoreService (StateStoreService.thrift): statestored holds the global state of all backend services in the system. It is a single-node central data exchange center, and its state is soft state: once the node goes down, the saved status information is gone. For example, when an impala backend starts, it calls StateStoreService.RegisterService() to register itself with statestored (it is actually identified by the StateStoreSubscriber bound to this backend service), and then calls StateStoreService.RegisterSubscriber() to indicate that the StateStoreSubscriber will receive updates from statestored.
(3) Statestored <-> impalad (backend)
StateStoreSubscriberService (StateStoreSubscriberService.thrift): statestored calls StateStoreSubscriberService.UpdateState() on a backend to push status updates to it. At the same time, the UpdateState() call returns some update information about this backend, collected in the impalad backend/StateStoreSubscriber, to statestored.
(4) Impalad (backend) <-> other impalad (backend) (each acts as both client and server to the other)
ImpalaInternalService (ImpalaInternalService.thrift): the coordinator on one backend asks the execution engine on another backend to run a plan fragment (it submits ExecPlanFragment and asks for ReportExecStatus to be sent back). This is discussed in detail in the backend analysis.
(5) Impalad backend <-> other frontend
ImpalaPlanService (ImpalaPlanService.thrift): other kinds of frontend can generate a TExecRequest and hand it to the backend for execution.
In addition, the Impala frontend is written in Java, while the backend is written in C++. The frontend parses the input SQL statement, generates an execution plan, and passes it to the backend through Thrift serialization/deserialization. TExecRequest (frontend.thrift) is the data structure passed between them; it represents a Query/DML/DDL request and is the data interface between frontend and backend during SQL execution. We can therefore replace impala-frontend, build a TExecRequest by some other means, and pass it to the backend for execution. This is what ImpalaPlanService was used for.
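The idea that the backend only depends on the serialized request, not on who produced it, can be illustrated with a toy example. Everything here is an assumption made for illustration: ExecRequest is a hypothetical stand-in for TExecRequest (whose real fields are defined in frontend.thrift), and JSON stands in for Thrift serialization.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ExecRequest:
    """Hypothetical stand-in for TExecRequest; not the real Thrift struct."""
    stmt_type: str        # "QUERY", "DML" or "DDL"
    sql: str
    plan_fragments: list


def frontend_plan(sql):
    """Toy 'frontend': classify the statement and emit a one-fragment plan."""
    first = sql.strip().split()[0].upper()
    stmt_type = "QUERY" if first == "SELECT" else "DML" if first == "INSERT" else "DDL"
    return ExecRequest(stmt_type, sql, ["fragment-0"])


def serialize(request):
    """JSON is used here in place of Thrift serialization."""
    return json.dumps(asdict(request)).encode("utf-8")


def backend_execute(wire_bytes):
    """Toy 'backend': it sees only the bytes, not which frontend built them."""
    request = ExecRequest(**json.loads(wire_bytes.decode("utf-8")))
    return "executing %s with %d fragment(s)" % (
        request.stmt_type, len(request.plan_fragments))


if __name__ == "__main__":
    # Request produced by the regular frontend.
    print(backend_execute(serialize(frontend_plan("select col1 from tbl"))))
    # Request pieced together by some other producer.
    print(backend_execute(serialize(ExecRequest("DDL", "create table t (i int)", []))))
```

Both calls go through the same backend entry point, which is the property that makes an alternative frontend possible.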
Impala component execution process
1. impala-shell
The client can submit a query to Impala through either the Beeswax or the HiveServer2 Thrift API. The two interfaces serve the same purpose: the client uses them to submit a query and get the result back.
impala_shell.py accesses Impala through Beeswax. Let's see how impala_shell.py submits a query to impalad.
(1) The command line parameters are parsed with OptionParser. If they contain --query or --query_file, execute_queries_non_interactive_mode(options) is called. This is a non-interactive query (a single SQL statement, or a file full of SQL statements). Otherwise the shell enters the ImpalaShell.cmdloop(intro) loop.
(2) Once in the command loop, the first step is to connect to an impalad. Entering "connect localhost:21000" calls the do_connect(self, args) function, which opens a socket connection to impalad using the host and port given by the user. The most important line is this one:
self.imp_service = ImpalaService.Client(protocol)
From this point on, imp_service is the client-side proxy, and all requests are submitted through it.
(3) Taking the select command as an example: if the client enters "select col1, col2 from tbl", the do_select(self, args) function is called. This function first creates a BeeswaxService.Query object and fills in the query statement and configuration. It then calls _query_with_result(), which submits the query through imp_service.query(query). Note that ImpalaService is asynchronous: the submission returns a QueryHandle, and the shell then polls _get_query_state() in a while loop. Once the SQL statement is found to be in the FINISHED state, the result is retrieved through the fetch() RPC.
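Steps (1) to (3) can be sketched as follows. This is a minimal sketch and not the code of impala_shell.py: FakeImpalaService is a hypothetical in-memory stand-in for the Thrift proxy imp_service, choose_mode and query_with_result are illustrative names, the short flags -q and -f are assumed, and the real fetch() RPC takes more arguments and returns results in batches.

```python
from optparse import OptionParser


def choose_mode(argv):
    """Step (1): decide between non-interactive and interactive mode."""
    parser = OptionParser()
    parser.add_option("-q", "--query", dest="query", default=None)
    parser.add_option("-f", "--query_file", dest="query_file", default=None)
    options, _args = parser.parse_args(argv)
    if options.query or options.query_file:
        return "non_interactive"
    return "interactive"


class FakeImpalaService:
    """Hypothetical stand-in for the client proxy created in step (2)."""

    def __init__(self, rows, polls_until_done=3):
        self._rows = rows
        self._polls_left = polls_until_done
        self._next_handle = 1

    def query(self, sql):
        # Asynchronous submit: return a handle immediately.
        handle = self._next_handle
        self._next_handle += 1
        return handle

    def get_state(self, handle):
        # Report RUNNING a fixed number of times, then FINISHED.
        if self._polls_left > 0:
            self._polls_left -= 1
            return "RUNNING"
        return "FINISHED"

    def fetch(self, handle):
        return list(self._rows)


def query_with_result(service, sql, sleep=lambda seconds: None):
    """Step (3): submit, poll the state until FINISHED, then fetch."""
    handle = service.query(sql)
    polls = 0
    while service.get_state(handle) != "FINISHED":
        polls += 1
        sleep(0.1)  # a real client would pass time.sleep here
    return service.fetch(handle), polls


if __name__ == "__main__":
    print(choose_mode(["-q", "select 1"]))
    service = FakeImpalaService([(1, "a"), (2, "b")], polls_until_done=2)
    print(query_with_result(service, "select col1, col2 from tbl"))
```

The sleep function is passed in as a parameter so the polling loop can be exercised without actually waiting.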
2. statestored
The statestored process provides the StateStoreService RPC service, while the StateStoreSubscriberService RPC service is provided inside the impalad process. The RPC logic of StateStoreService is implemented in the StateStore class.
When statestored receives a RegisterService RPC request from a backend, it calls StateStore::RegisterService() to process it. This mainly performs the following two tasks:
(1) Add the service to StateStore.service_instances_ according to the service_id provided in TRegisterServiceRequest.
Generally, only the service "impala_backend_service" exists in the whole impala cluster, so service_id = "impala_backend_service". Each backend is bound to this service, which forms a service-to-backend membership relationship. The relationship is stored in StateStore.service_instances_.
(2) The impalad backend sends its subscriber_address to statestored in the RegisterService call. On the statestored side, a corresponding Subscriber object (representing the backend bound to this Subscriber) is created from the subscriber_address. The Subscriber bound to the backend is added to the map StateStore.subscribers. Each Subscriber has a unique id, so every impala backend in the cluster has a globally unique id.
In this way, if a backend/StateStoreSubscriber fails, or an SQL task running in it has a problem, this is reflected in statestored, and the other related backends are notified.
So how is each backend updated? StateStore::UpdateLoop() periodically pushes to each backend the updates for all members of the services it subscribes to. The current policy is a full update; incremental updates are to be considered in the future.
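The registration and full-update behaviour described in this section can be modelled in a few lines. This is a simplified sketch under stated assumptions, not the StateStore class itself: the method names register_service, unregister and update_once are illustrative, a plain callback stands in for the UpdateState() RPC, and one call to update_once() represents a single iteration of the update loop.

```python
class StateStore:
    """Toy model of statestored's membership bookkeeping."""

    def __init__(self):
        self.service_instances = {}  # service_id -> {subscriber_id: backend address}
        self.subscribers = {}        # subscriber_id -> update callback
        self._next_id = 0

    def register_service(self, service_id, backend_address, on_update):
        """Record the backend under the service and give it a unique id."""
        subscriber_id = self._next_id
        self._next_id += 1
        self.service_instances.setdefault(service_id, {})[subscriber_id] = backend_address
        self.subscribers[subscriber_id] = on_update
        return subscriber_id

    def unregister(self, subscriber_id):
        """Drop a failed backend from every service it belonged to."""
        self.subscribers.pop(subscriber_id, None)
        for members in self.service_instances.values():
            members.pop(subscriber_id, None)

    def update_once(self):
        """Push the full membership (not a delta) to every subscriber."""
        snapshot = {
            service_id: sorted(members.values())
            for service_id, members in self.service_instances.items()
        }
        for callback in self.subscribers.values():
            callback(snapshot)


if __name__ == "__main__":
    store = StateStore()
    seen_a, seen_b = [], []
    store.register_service("impala_backend_service", "host-a:22000", seen_a.append)
    b = store.register_service("impala_backend_service", "host-b:22000", seen_b.append)
    store.update_once()
    store.unregister(b)   # simulate host-b failing
    store.update_once()
    print(seen_a)
```

Because every update carries the full membership, a backend that misses one round still converges on the next one; the price is that the message size grows with the cluster.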
3. impalad
The services of the impalad process are wrapped in the ImpalaServer class. ImpalaServer covers both the fe and be functions and implements the ImpalaService (Beeswax), ImpalaHiveServer2Service (HiveServer2), and ImpalaInternalService APIs.
The global function CreateImpalaServer() creates an ImpalaServer together with several ThriftServers:
(1) Create a ThriftServer named beeswax_server to provide the ImpalaService (Beeswax) service outside the system. It mainly serves queries and is the core service of the fe/frontend. Port 21000.
(2) Create a ThriftServer named hs2_server to provide ImpalaHiveServer2Service outside the system, covering Query, DML, and DDL operations. Port 21050.
(3) Create a ThriftServer named be_server to provide ImpalaInternalService to the other impalad processes in the system. Port 22000.
(4) Create an ImpalaServer object. The TProcessor of each of the three ThriftServers above is given this ImpalaServer object, so the RPC requests of all three Thrift services are handled by it. The most typical example: when a BeeswaxService.query() request is submitted through the Beeswax interface, the processing logic on the impalad side is the function void ImpalaServer::query(QueryHandle& query_handle, const Query& query), implemented in impala-beeswax-server.cc.
Below is the main function of impalad-main.cc:
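The arrangement in steps (1) to (4), three listeners sharing one handler object, can be sketched like this. ToyServer and ImpalaServerHandler are hypothetical stand-ins for ThriftServer and ImpalaServer; the sketch opens no sockets and only shows how calls arriving on different ports end up in the same object.

```python
class ImpalaServerHandler:
    """Hypothetical handler implementing methods from all three interfaces."""

    def __init__(self):
        self.calls = []  # (interface, payload) pairs, in arrival order

    def query(self, sql):                  # Beeswax-style entry point
        self.calls.append(("beeswax", sql))
        return len(self.calls)

    def ExecuteStatement(self, sql):       # HiveServer2-style entry point
        self.calls.append(("hs2", sql))
        return len(self.calls)

    def ExecPlanFragment(self, fragment):  # internal backend entry point
        self.calls.append(("internal", fragment))
        return len(self.calls)


class ToyServer:
    """Toy stand-in for a ThriftServer: it only forwards calls to its handler."""

    def __init__(self, name, port, handler, methods):
        self.name = name
        self.port = port
        self.handler = handler
        self.methods = frozenset(methods)

    def dispatch(self, method, *args):
        if method not in self.methods:
            raise ValueError("%s does not expose %s" % (self.name, method))
        return getattr(self.handler, method)(*args)


def create_impala_server():
    """Build one handler and three servers that all delegate to it."""
    handler = ImpalaServerHandler()
    beeswax = ToyServer("beeswax_server", 21000, handler, ["query"])
    hs2 = ToyServer("hs2_server", 21050, handler, ["ExecuteStatement"])
    be = ToyServer("be_server", 22000, handler, ["ExecPlanFragment"])
    return handler, beeswax, hs2, be


if __name__ == "__main__":
    handler, beeswax, hs2, be = create_impala_server()
    beeswax.dispatch("query", "select 1")
    be.dispatch("ExecPlanFragment", "fragment-0")
    print(handler.calls)
```

Each server exposes only its own interface, yet all state lives in the single handler, which is why a query submitted on one port can be coordinated with fragments arriving on another.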
int main(int argc, char** argv) {
  // Parse parameters and enable logging (based on Google gflags and glog).
  InitDaemon(argc, argv);

  LlvmCodeGen::InitializeLlvm();

  // Enable Kerberos security if requested.
  if (!FLAGS_principal.empty()) {
    EXIT_IF_ERROR(InitKerberos("Impalad"));
  }

  // The frontend, HBase and other related components are written in Java, so the
  // following lines initialize the JNI-related references and method ids.
  JniUtil::InitLibhdfs();
  EXIT_IF_ERROR(JniUtil::Init());
  EXIT_IF_ERROR(HBaseTableScanner::Init());
  EXIT_IF_ERROR(HBaseTableCache::Init());
  InitFeSupport();

  // ExecEnv is the execution environment of a Query/PlanFragment on the impalad
  // backend. It creates the SubscriptionManager, SimpleScheduler and various caches.
  ExecEnv exec_env;

  // Create the Beeswax, HiveServer2 and backend ThriftServers that receive client
  // requests. The real processing logic behind all three services is the
  // ImpalaServer* server object.
  ThriftServer* beeswax_server = NULL;
  ThriftServer* hs2_server = NULL;
  ThriftServer* be_server = NULL;
  ImpalaServer* server =
      CreateImpalaServer(&exec_env, &beeswax_server, &hs2_server, &be_server);

  // be_server provides services inside the system, so start it first.
  be_server->Start();

  // The key step here is starting the SubscriptionManager and the Scheduler.
  Status status = exec_env.StartServices();
  if (!status.ok()) {
    LOG(ERROR) << "Impalad services did not start correctly, exiting";
    ShutdownLogging();
    exit(1);
  }

  // register be service *after* starting the be server thread and after starting
  // the subscription mgr handler thread
  scoped_ptr<SubscriptionManager::UpdateCallback> cb;
  if (FLAGS_use_statestore) {
    THostPort host_port;
    host_port.port = FLAGS_be_port;
    host_port.ipaddress = FLAGS_ipaddress;
    host_port.hostname = FLAGS_hostname;
    // Register the be service with statestored. All be services in the cluster
    // form one group, so that query requests can be dispatched between backends.
    Status status =
        exec_env.subscription_mgr()->RegisterService(IMPALA_SERVICE_ID, host_port);

    unordered_set<string> services;
    services.insert(IMPALA_SERVICE_ID);
    // Register the callback. It is called every time the StateStoreSubscriber
    // receives an update from statestored.
    cb.reset(new SubscriptionManager::UpdateCallback(
        bind(mem_fn(&ImpalaServer::MembershipCallback), server, _1)));
    exec_env.subscription_mgr()->RegisterSubscription(services, "impala.server",
        cb.get());

    if (!status.ok()) {
      LOG(ERROR) << "Could not register with state store service: "
                 << status.GetErrorMsg();
      ShutdownLogging();
      exit(1);
    }
  }

  // this blocks until the beeswax and hs2 servers terminate
  // The be_server for the internal service has started successfully, so now start
  // beeswax_server and hs2_server.
  beeswax_server->Start();
  hs2_server->Start();
  beeswax_server->Join();
  hs2_server->Join();

  delete be_server;
  delete beeswax_server;
  delete hs2_server;
}
exec_env.StartServices() calls SubscriptionManager.Start() and StateStoreSubscriber.Start() to start a ThriftServer.
StateStoreSubscriber implements the StateStoreSubscriberService (defined in StateStoreSubscriberService.thrift). It receives updates from statestored and reports the updates of the backend bound to this StateStoreSubscriber back to statestored. In this way the backend becomes visible to the other backends, so it can accept tasks and updates originating from other impala backends (the backend updates themselves, of course, arrive through statestored).
References:
Http://www.sizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdf
Original article: Impala source code analysis (1): Impala architecture and RPC. Thanks to the original author for sharing.