Impala Learning – Impala Back-end Code Analysis

Table of Contents
1 Code Structure
2 Statestore
3 Scheduler
4 Impalad Boot Process
5 Coordinator
6 ExecNode
7 PlanFragmentExecutor

1 Code Structure

service: connects the front end and accepts client requests
runtime: classes needed at runtime, including Coordinator, DataStream, MemPool, Tuple, etc.
exec: ExecNode, the execution nodes
expr: expression evaluation
transport: Thrift transport
sasl: Simple Authentication and Security Layer
statestore: scheduling, name service, resource pools
codegen: code generation

2 Statestore
The statestore is a client/server information-subscription service. In Impala it mainly manages the current cluster membership state and supports scheduling and the discovery of failed processes. The statestore runs as a separate process, and each impalad establishes one or more connections to it.
The statestore provides four interfaces:
RegisterService: register with a service, i.e. join and become a member of that service
UnregisterService: cancel a member's registration with a service
RegisterSubscription: subscribe to the membership information of a service
UnregisterSubscription: cancel a subscription to a service's membership information
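As a rough illustration of these four interfaces (plus the heartbeat-based failure detection described next), here is a minimal sketch. All names (ToyStatestore, Tick, etc.) are invented for illustration and are not Impala's actual classes:

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical sketch of the statestore's four interfaces: each service
// keeps a membership set, and subscribers register interest in membership
// updates for a given service.
class ToyStatestore {
 public:
  // RegisterService: join and become a member of the named service.
  void RegisterService(const std::string& service, const std::string& member) {
    members_[service].insert(member);
    missed_heartbeats_[member] = 0;
  }
  // UnregisterService: cancel a member's registration.
  void UnregisterService(const std::string& service, const std::string& member) {
    members_[service].erase(member);
  }
  // RegisterSubscription / UnregisterSubscription: track who wants
  // membership updates for a service.
  void RegisterSubscription(const std::string& service, const std::string& sub) {
    subscribers_[service].insert(sub);
  }
  void UnregisterSubscription(const std::string& service, const std::string& sub) {
    subscribers_[service].erase(sub);
  }

  // Heartbeat bookkeeping: a member that misses more than max_missed
  // consecutive heartbeats is dropped from every service. In the
  // full-update model, the entire membership set would then be re-sent
  // to each subscriber.
  void Heartbeat(const std::string& member) { missed_heartbeats_[member] = 0; }
  void Tick(int max_missed) {
    for (auto& [member, missed] : missed_heartbeats_) {
      if (++missed > max_missed) {
        for (auto& [service, mset] : members_) mset.erase(member);
      }
    }
  }

  // Full-volume update: the complete current membership of a service.
  std::vector<std::string> Membership(const std::string& service) const {
    auto it = members_.find(service);
    if (it == members_.end()) return {};
    return {it->second.begin(), it->second.end()};
  }

 private:
  std::map<std::string, std::set<std::string>> members_;
  std::map<std::string, std::set<std::string>> subscribers_;
  std::map<std::string, int> missed_heartbeats_;
};
```

A member that keeps heartbeating stays in the membership; one that goes silent past the threshold is removed, which is the "continuous loss of N heartbeats" condition in simplified form.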
The statestore uses heartbeats to check whether registered members are still alive; the failure condition can be a heartbeat timeout or the continuous loss of N heartbeats. The statestore periodically pushes the membership of each subscribed service to its subscribers. The current update strategy is a full update; incremental updates may be considered in the future. An Impala cluster usually has only one service, and every impalad registers with it. Once registered, an impalad is visible to the other impalads and can therefore accept tasks from them.

3 Scheduler
After the coordinator obtains the execution plan, it asks the scheduler for the backends that can execute it and sends execution commands to those backends.
The scheduler provides two interfaces:
GetHosts: given a set of data locations, returns a set of machine addresses as close to the data as possible
GetAllKnownHosts: returns the addresses of all surviving machines
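The two interfaces, together with the locality-first/round-robin policy described below, can be sketched as follows. This is an illustrative toy (ToyScheduler is not Impala's real API), and like SimpleScheduler it deliberately ignores real-time load:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch of the SimpleScheduler policy: prefer a backend
// co-located with the data; otherwise fall back to round-robin over all
// live backends.
class ToyScheduler {
 public:
  explicit ToyScheduler(std::vector<std::string> live_backends)
      : backends_(std::move(live_backends)) {}

  // GetAllKnownHosts: every backend currently believed to be alive.
  const std::vector<std::string>& GetAllKnownHosts() const { return backends_; }

  // GetHosts: one backend per data location, as close to the data as possible.
  std::vector<std::string> GetHosts(const std::vector<std::string>& data_hosts) {
    std::vector<std::string> chosen;
    for (const auto& host : data_hosts) {
      // A backend runs on the data host: schedule a local read...
      if (std::find(backends_.begin(), backends_.end(), host) != backends_.end()) {
        chosen.push_back(host);
      } else {
        // ...otherwise pick the next backend round-robin.
        chosen.push_back(backends_[next_++ % backends_.size()]);
      }
    }
    return chosen;
  }

 private:
  std::vector<std::string> backends_;
  std::size_t next_ = 0;
};
```

Note how the number of returned hosts equals the number of data locations, matching the observation below that the number of backends returned depends on how the data is distributed.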
SimpleScheduler is currently the only implementation of the scheduler. The coordinator calls SimpleScheduler's GetHosts method to perform scheduling and remote task assignment. The algorithm in GetHosts is: first look for a backend at the same location as the data; if there is none, fall back to a round-robin algorithm. At present SimpleScheduler does not take the real-time load of the machines into account. The number of backends returned depends on the number of machines across which the data is distributed.

4 Impalad Boot Process

Initialize LLVM, HDFS, JNI, HBase, and the frontend.
Start the ImpalaServer.
Start the ThriftServer to accept Thrift requests.
Start ExecEnv.
Start the webserver.
Start the SubscriptionManager.
Start the Scheduler: subscribe to the statestore and register the callback function SimpleScheduler::UpdateMembership, which provides the currently available backends for scheduling.
SubscriptionManager::RegisterService: the statestore checks whether the service exists and creates a new ServiceInstance if it does not; it then checks whether the client is already in the membership of this ServiceInstance and adds it if it is not.
SubscriptionManager::RegisterSubscription: the statestore adds a subscriber that subscribes to the membership of this service and registers the callback function MembershipCallback, which updates the impala-server membership state when there is an update and serves as the callback for the failure detector.
Once an impalad has started, it can accept query requests, and it can also accept requests from other impalads to execute a PlanFragment.

5 Coordinator
The coordinator is responsible for executing a set of PlanFragments, and also for responding to client requests. The coordinator's own fragment is executed locally, while the others are sent to remote impalads for execution. The coordinator also monitors the overall execution state.
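One step in the Exec() flow described next is ComputeScanRangeAssignment(), which decides how much data each backend scans. A rough, hypothetical sketch of such an assignment (illustrative only, not Impala's actual code) sends each scan range to the replica backend with the fewest assigned bytes so far, keeping the per-backend scan volume roughly balanced:

```cpp
#include <map>
#include <string>
#include <vector>

struct ScanRange {
  std::string path;
  long length;                        // bytes to scan
  std::vector<std::string> replicas;  // hosts holding a copy of the data
};

// Toy scan-range assignment: greedily give each range to the least-loaded
// replica host, so the amount of data scanned per backend stays balanced.
std::map<std::string, std::vector<ScanRange>> AssignScanRanges(
    const std::vector<ScanRange>& ranges) {
  std::map<std::string, long> assigned_bytes;
  std::map<std::string, std::vector<ScanRange>> assignment;
  for (const auto& r : ranges) {
    std::string best;
    for (const auto& host : r.replicas) {
      if (best.empty() || assigned_bytes[host] < assigned_bytes[best]) best = host;
    }
    assigned_bytes[best] += r.length;
    assignment[best].push_back(r);
  }
  return assignment;
}
```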
Exec() is its most important function. Briefly, the process inside this function is:
ComputeFragmentExecParams():
  ComputeFragmentHosts(): for each fragment, depending on the nodes where its input data resides, call the scheduler's GetHosts method to determine which backends each phase executes on.
  For each fragment, compute its ExchangeNode parameters.
ComputeScanRangeAssignment(): calculate how much data each backend should scan.
executor_ = new PlanFragmentExecutor(): create a PlanFragmentExecutor, then call executor_->Prepare().
For each fragment, call ExecRemoteFragment() for each remote backend.
ProgressUpdater: update the status periodically.

6 ExecNode
ExecNode is the parent class of all execution nodes. Its main methods are Prepare(), Open(), GetNext(), Close(), and CreateTree(). ExecNodes are the classes that actually process data on an impalad, and include hash join, aggregation, scan, and so on. Multiple ExecNodes make up an execution tree; the root node finishes last and the leaf nodes execute first.
The order of execution in Impala is the opposite of Hive's: Impala uses a pull model, whereas Hive uses a push model. In Impala the execution entry point is the Open method of the root node, and that Open method invokes the Open and GetNext methods of its child nodes.
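The pull model described above can be sketched with a minimal two-node tree. The class names here are illustrative stand-ins, not Impala's real ExecNode hierarchy: calling Open()/GetNext() on the root drives the children, so the leaf scan produces data first even though the root is the entry point:

```cpp
#include <memory>
#include <utility>
#include <vector>

struct Row { int value; };

// Minimal pull-model ("volcano") node: the root's Open()/GetNext()
// cascade down to the children.
class ExecNode {
 public:
  virtual ~ExecNode() = default;
  virtual void Open() {}
  // Returns one batch of rows; an empty batch means end-of-stream (eos).
  virtual std::vector<Row> GetNext() = 0;
};

// Leaf node: produces the data (stands in for a scan node).
class ScanNode : public ExecNode {
 public:
  explicit ScanNode(std::vector<Row> rows) : rows_(std::move(rows)) {}
  std::vector<Row> GetNext() override {
    if (done_) return {};
    done_ = true;
    return rows_;
  }
 private:
  std::vector<Row> rows_;
  bool done_ = false;
};

// Inner node: pulls from its child and filters each row (stands in for a
// node evaluating its conjuncts_ on the input).
class FilterNode : public ExecNode {
 public:
  FilterNode(std::unique_ptr<ExecNode> child, int min_value)
      : child_(std::move(child)), min_value_(min_value) {}
  void Open() override { child_->Open(); }  // root Open() cascades downward
  std::vector<Row> GetNext() override {
    std::vector<Row> out;
    for (const Row& r : child_->GetNext()) {
      if (r.value >= min_value_) out.push_back(r);
    }
    return out;
  }
 private:
  std::unique_ptr<ExecNode> child_;
  int min_value_;
};
```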
The main data structures include:
ObjectPool* pool_
vector<Expr*> conjuncts_
vector<ExecNode*> children_
RowDescriptor row_descriptor_
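The conjuncts_ member holds the node's filter predicates, which EvalConjuncts() ANDs together per row. A sketch of that evaluation, with illustrative stand-in types for Expr and TupleRow:

```cpp
#include <functional>
#include <vector>

// Illustrative stand-ins for Impala's Expr and TupleRow types.
using Row = std::vector<int>;
using Expr = std::function<bool(const Row&)>;

// Sketch of EvalConjuncts(): a row survives only if every conjunct
// returns true; evaluation short-circuits on the first failure.
bool EvalConjuncts(const std::vector<Expr>& conjuncts, const Row& row) {
  for (const Expr& e : conjuncts) {
    if (!e(row)) return false;
  }
  return true;  // an empty conjunct list accepts every row
}
```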
The main functions include:
Prepare(): invoked before Open; performs code generation.
Open(): called before GetNext; does the preparatory work and invokes the Open of the child nodes.
GetNext(): invokes the GetNext of the child nodes; returns a batch of rows and an eos flag.
EvalConjuncts(): evaluates all the conjunct expressions and returns a boolean result.

7 PlanFragmentExecutor
Executes a single PlanFragment, including initialization and cleanup. Cleanup includes releasing resources and shutting down the data stream. Each executor has a callback for reporting execution status.
Its three main functions are:
Status Prepare(TExecPlanFragmentParams): prepare for execution. The main process is:
  DescriptorTbl::Create(): initialize the descriptor table.
  ExecNode::CreateTree(): initialize the execution tree. The execution tree consists of ExecNodes, each of which also provides the three functions Prepare(), Open(), and GetNext(). When initialization is complete, plan_ points to the root node of the execution tree.
  plan_->Prepare(): initialize the execution tree; if code generation can be used, call runtime_state_->llvm_codegen()->OptimizedModule() to optimize it.
  Set the scan ranges.
  Set up the sink, if required.
  Set up the profile counters.
Status Open(): start execution, and start a separate thread that reports status to the coordinator.
  plan_->Open(): starting from the root node, call the Open functions to begin execution.
  If there is a sink: sink_->Send(). If there is a write-back operation, such as an INSERT statement in the query, actively push the computed results to HDFS or HBase.
Status GetNext(RowBatch*): triggers the GetNext function of the execution tree. When GetNext reports done, all data has been processed and the executor can exit.
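The Prepare/Open/GetNext lifecycle above, together with the status-report callback, can be sketched as a driver loop. Everything here (ToyFragmentExecutor, RunFragment) is an invented illustration under the assumption of the three-phase lifecycle described in this section, not Impala's actual implementation:

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

using RowBatch = std::vector<int>;

// Toy executor with the three-phase lifecycle: Prepare() sets up the tree,
// Open() starts execution, GetNext() pulls batches until end-of-stream.
// A callback reports execution status, standing in for the separate
// status-report thread.
struct ToyFragmentExecutor {
  std::vector<RowBatch> batches;                   // stands in for the plan tree
  std::function<void(const std::string&)> report;  // status callback

  void Prepare() { report("prepared"); }  // descriptor table / tree setup
  void Open() { report("open"); }         // root Open() cascades downward
  bool GetNext(RowBatch* out) {           // returns false once all data is done
    if (next_ >= batches.size()) return false;
    *out = batches[next_++];
    return true;
  }

 private:
  std::size_t next_ = 0;
};

// Driver loop: run a fragment and hand every batch to a sink; returns the
// number of batches processed.
int RunFragment(ToyFragmentExecutor& exec, std::vector<RowBatch>* sink) {
  exec.Prepare();
  exec.Open();
  RowBatch batch;
  int n = 0;
  while (exec.GetNext(&batch)) {  // done: all data processed, executor exits
    sink->push_back(batch);
    ++n;
  }
  return n;
}
```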
Author: <kyle@localhost.localdomain>
Date: 2013-02-25 17:44:34 CST
HTML generated by Org-mode 6.21b in Emacs 23