Laxcus Big Data Management System 2.0 (9) - Chapter 7: Distributed Task Components


Chapter 7. Distributed Task Components

The distributed task component in Laxcus 2.0 builds on the 1.x release, consolidating middleware and distributed computing technology. It introduces a new set of data-computing components and data-building components that run in the new distributed environment, along with a new operational framework, operational management specifications, and API interfaces.

The changes in the new distributed task component are mainly reflected in data processing capacity. After the operating structure was readjusted, all the restrictions imposed by the original architecture were removed: distributed task components can now scale continuously with the cluster, providing effectively unlimited data processing capacity. This is sufficient to meet our current and future needs for a wide range of large-scale data processing businesses. At the same time, much of the management work formerly performed by the cluster administrator has been eliminated, with the system regulating and executing it by itself, so automated management capability is greatly improved.

In the Laxcus 2.0 release, distributed task components are divided into two types by deployment location: internal components and external components. Internal components run inside the cluster and perform most of the distributed processing; external components are client components that run only on the front node and process the results according to the user's need for feedback.

From the programmer's point of view, developing a new distributed task component is simple: there is no need to deal with the various system-level operations, and invoking a few API interfaces is enough to produce a conforming distributed task component. The process is similar to local database development, so anyone with Java programming experience can master it.

Since this article is about the Laxcus Big Data management system, programming details are outside its scope, so the following sections continue the practice of describing the concepts and functions of distributed task components. It is hoped that this introduction will give a better understanding of the Laxcus operating mechanism.

7.1 Stage Naming

In Chapter Two we introduced the concept of "naming", an abstract representation of an entity's resources. Stage naming is one kind of naming: it combines a base name with the "step" of distributed data processing, plus a user-signature extension, and describes the state and location of a running distributed task component at a given time. As shown in Figure 7.1, "step" describes the operation step of the current distributed component; "root" is the root naming, and each root naming must be globally unique. The work of ensuring the uniqueness of a root naming can be delegated to the system administrator, or confirmed with the administrator directly. "sub" is a child naming used during iterative data processing. "issuer" is the user signature, the SHA1 hash of the user name entered when the user logs in, used for security checks in the distributed environment.

Stage naming is the unique identity of a distributed task component during deployment and runtime in a Laxcus cluster, and it is used for both data computing and data building.

Figure 7.1 Stage naming parameters
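As a concrete illustration of the issuer signature, the sketch below hashes a login user name with SHA-1 and appends it to a "root.sub" step. The class name, the "#" separator, and the sample names are assumptions made for illustration only; they are not the real Laxcus API, and only the idea of a SHA-1 user-name hash comes from the text above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch of stage naming: a "root.sub" step plus an issuer
// signature (SHA-1 of the login user name). All names here are
// hypothetical, not the actual Laxcus interfaces.
public class StageNaming {
    // Compute the SHA-1 hex digest of a user name.
    static String issuer(String username) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(username.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-1 is always present in the JDK
        }
    }

    public static void main(String[] args) {
        String root = "demo_task"; // globally unique root naming (hypothetical)
        String sub = "to1";        // sub naming used in an iteration (hypothetical)
        System.out.println(root + "." + sub + "#" + issuer("alice"));
    }
}
```

The same user name always yields the same 40-character hex signature, which is what makes it usable as a security check across nodes.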

7.2 Data Computing and the Conduct Command

The data computing component is implemented on the Diffuse/Converge algorithm; its command in the distributed description language is conduct. A complete data computing component consists of five phases, each of which is a separate sub-component. The work content and processing scope of each phase are clearly stipulated by the system. The call node sits at the center, responsible for provisioning data resources and controlling the data processing operations.

7.2.1 Init Stage

The init phase initializes the data computation: according to the parameters and requirements provided in the conduct command, it allocates data resources for the subsequent work and provisions the running parameters. The resulting values serve as the reference for the subsequent data computation and are saved in the conduct command. The init phase component is an internal component deployed to the call node.

7.2.2 From Stage

The from phase corresponds to Diffuse in the Diffuse/Converge algorithm. This stage generates the raw data needed for the computation. Relative to the subsequent to-phase components, these data are their input data, and they are intermediate data for the entire data computing process. There are currently three ways to generate the data: (1) with an SQL SELECT statement; (2) from custom parameters, according to user-defined interpretation rules; (3) an SQL SELECT statement combined with custom parameters. In any case, the raw data is initially saved on the hard disk or in memory (saved to the hard disk by default; saving to memory must be explicitly specified by the programmer and depends on what the operating environment permits), and the resulting data bitmap is returned to the call node. The data bitmap is a kind of metadata in the Laxcus system and comes in various styles for different needs. To standardize the process, a basic interface has been defined; applications can add new elements on this basis to implement user-defined interpretation. The from phase component is an internal component deployed to the data node.
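The role of a from-phase component can be pictured with the following sketch: it produces intermediate data (to disk by default, to memory only on request) and reports a data bitmap describing it back to the call node. The `DataBitmap` class, the `launch` method, and all field names are hypothetical stand-ins invented to mirror the description; the real Laxcus interfaces differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a from-phase component. It generates raw
// intermediate data and returns a "data bitmap" (metadata about what
// was produced) to the call node. Names are illustrative only.
public class FromStageSketch {
    // Minimal stand-in for the metadata returned to the call node.
    static class DataBitmap {
        final long length;      // byte length of the intermediate data
        final String location;  // "disk" by default, "memory" if requested
        DataBitmap(long length, String location) {
            this.length = length;
            this.location = location;
        }
    }

    // Produce intermediate data and describe it with a bitmap.
    static DataBitmap launch(boolean toMemory) {
        List<byte[]> rows = new ArrayList<>();
        long total = 0;
        for (int i = 0; i < 100; i++) {          // pretend SELECT output
            byte[] row = ("row-" + i).getBytes();
            rows.add(row);
            total += row.length;
        }
        // Default target is disk; memory must be requested explicitly.
        return new DataBitmap(total, toMemory ? "memory" : "disk");
    }

    public static void main(String[] args) {
        DataBitmap bitmap = launch(false);
        System.out.println(bitmap.location + ": " + bitmap.length + " bytes");
    }
}
```

The point of the sketch is the contract, not the data: the heavy rows stay where they were produced, and only the small bitmap travels to the call node.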

7.2.3 To Stage

The to phase corresponds to Converge in the Diffuse/Converge algorithm. This stage performs the actual computation. As in the Converge description above, the to phase is iterative: it must occur at least once and may occur any number of times. When it occurs multiple times, the child to stages are declared with the SUBTO statement in the distributed description language, and each subto sub-naming must likewise be unique within its scope. The to phase iterations are specified by the programmer and written into the conduct command, which the call node interprets and executes. The input data of the first to phase comes from the from phase; each later iteration takes its input from the previous to phase. The iterations continue under the control of the call node until final completion, when the computation result is output to the call node; otherwise a data bitmap is returned to the call node for the next round. The to phase component is an internal component deployed to the work node.
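The iteration control described above can be reduced to a small sketch: the call node drives a programmer-specified number of to/subto rounds, each consuming the previous round's output. The class and method names, and the "halve the data" stand-in for a real computation, are assumptions for illustration only.

```java
// Hypothetical sketch of call-node control over to/subto iterations:
// each round consumes the previous round's output, and the loop ends
// when the programmer-specified iteration count (written into the
// conduct command) is reached. Not the real Laxcus API.
public class ToIterationSketch {
    // Stand-in for one to/subto round; a real round would converge
    // intermediate data, here we just shrink a size counter.
    static int runToPhase(int input) {
        return input / 2;
    }

    // Drive the iterations the way a call node would.
    static int run(int data, int iterations) {
        for (int i = 0; i < iterations; i++) {
            data = runToPhase(data); // to round, then subto rounds
        }
        return data;
    }

    public static void main(String[] args) {
        // 1000 units from the from phase, 3 programmer-specified rounds.
        System.out.println("final result size: " + run(1000, 3));
    }
}
```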

7.2.4 Balance Stage

The balance phase runs between the from/to/subto phases. Based on the data bitmap information returned by the previous phase, it allocates the data resources scattered across the data/work nodes for the subsequent to/subto phase, so that each to/subto task obtains as equal an amount of computation as possible and returns its result at roughly the same time, saving total computation time and increasing computing efficiency. If the previous stage's data bitmap was generated by the system, the interpretation work in the balance phase is also handled by the system; if it was generated by user-defined rules, the balance phase interprets it according to those rules. The balance phase component is an internal component deployed on the call node.

7.2.5 Put Stage

The put stage takes over the computation result of the last to/subto stage and produces the final output. The output is usually targeted at the computer screen or disk: the data is stored on disk, or displayed on screen in text or graphic form. The put stage component is a simple external component, deployed by the user on the user's own front node.

7.3 Data Building and the Establish Command

The data building component is implemented on the Scan/Sift algorithm and corresponds to the distributed description language command establish. A complete data building component consists of seven phases; as with the conduct command, each phase handles only one mandated piece of work, and the seven phases together form a complete data building process. Data building executes sequentially, without iterative behavior: the output of each phase is the input of the next, or directs where the next phase's input comes from. The call node again sits at the center, responsible for provisioning the data resources of the data building and monitoring the data processing.

7.3.1 Issue Stage

The issue phase is very similar to the conduct.init stage; it too is the preparation before data processing. This includes checking all the parameters in the establish command and checking and leveling the data resources for the subsequent phases. If a specified parameter is wrong, or a data resource does not meet the requirements, it refuses to execute and returns an error report to the user. The issue phase component is an internal component deployed on the call node.

7.3.2 Scan Stage

The scan phase performs data scanning. It occurs only on the data master node and examines the index information of the data blocks; because only the indexes are scanned, it is very fast. The Laxcus system now provides a standardized scan phase component interface: if there is no special need, the programmer only has to call this interface, and if there is a personalized need, it can be implemented on top of it. When the scan phase completes, a data bitmap containing the information needed by the subsequent phases is returned to the call node. The scan phase component is an internal component deployed on the data node.

7.3.3 Assign Stage

The assign phase takes over from the scan phase. It holds the data bitmaps coming from the data nodes until they have all been collected. It then checks the validity of all the associated build nodes defined in the issue phase, compares and reorganizes the data bitmaps, and distributes the adjusted bitmaps to those build nodes according to the principle of even distribution. The work of the assign phase has some similarity to conduct.balance: both consider the data balancing problem of the next phase. The assign phase component is an internal component deployed on the call node.

7.3.4 Sift Stage

The sift phase is the most important part of data building: all the substantive work of data building is executed and completed in the sift phase, making it the core stage. Because real-world data building varies so much, the sift phase interface the system provides is correspondingly broad. The system is responsible only for downloading the data from the data master nodes; for the rest of the work, the programmer can perform all kinds of data operations through the sift interface, within the bounds of security. When the sift phase completes, new data blocks are generated, and a fixed-format data bitmap describing them is produced and returned to the call node. The sift phase component is an internal component deployed on the build node.

7.3.5 Each Stage

The each stage accepts the data bitmap returned by the sift phase. According to the information in the bitmap, it contacts the associated data master nodes to confirm their presence, then reorganizes and levels the bitmap, generates a new data bitmap, and passes it to all the required data master nodes. In processing mode, the each stage is basically consistent with the assign phase; the difference is that the assign phase distributes metadata to the build nodes, while the each stage redistributes the metadata returned by the build nodes back to the data master nodes. It can be understood as the reverse operation of the assign phase. The each stage component is an internal component deployed on the call node.

7.3.6 Rise Stage

The rise phase takes over from the each stage. Its work is to follow the data bitmap provided by the each stage, find the specified build nodes and data blocks, and download them. When the download is complete, these data blocks are forwarded to the other associated data slave nodes. Because the content of the rise phase work is fixed, the system, as in the scan phase, provides a standardized interface: the programmer only needs to call it and leave the work to the system. When the rise phase completes, a data bitmap is still returned; the bitmap can be organized by the system's default settings or by user definition. The rise phase component is an internal component deployed on the data master node.

7.3.7 End Stage

The end stage takes over from the rise stage and is the final link of data building. Its work is similar to conduct.put: it displays the final result of the data building. Because the end output is much simpler and more standardized than conduct.put, usually just a hint message, the programmer generally does not need much processing and can hand it to the system's standardized interface output. The end stage component is an external component deployed on the front node.

7.4 Packaging

After the programming of a project is completed, the next task is packing. Packaging puts the compiled code files into a single file. The packaging operation can be done with the jar command provided by the Java runtime environment, or with the Ant tool integrated in Eclipse, and it is identical to a generic Java packaging operation. The difference is that the Laxcus system requires the distributed task component file name to have the suffix ".dtc"; the task node recognizes distributed task components by this suffix. In addition, each distributed task component package must provide a "tasks.xml" file. This is an XML-format configuration file, placed under the "Task-inf" directory in the root of the file package. It specifies all the data in a component package that needs to be provided to the system for identification and processing. Based on the parameters provided in this file, the task node pushes each distributed task component to its associated nodes. This is a very important file; its configuration and format are clearly defined by the system and no errors are allowed.

Figure 7.4.1 tasks.xml configuration file

Figure 7.4.2 Distributed task component directory

7.5 Release

After the packing work is complete, the next step is the release of the distributed task component. So as not to affect the cluster's normal operation, before a distributed task component is formally submitted to the cluster, we provide a test environment in which programmers can examine their own components. Because this is a test environment and usually lacks the resources of the production environment (users can set these up in the test environment themselves), the checks here focus mainly on format, configuration, and the running process. Once these are confirmed correct, the component can be submitted to the cluster.

The Laxcus big data system supports "hot release": a distributed task component is recognized and processed within a few minutes of entering the cluster and then takes effect. It is first delivered to the task node, which performs a series of checks on the component according to the configuration parameters provided in the "tasks.xml" file and distributes it to the associated nodes after confirmation. If an error is found during the check, the task node rejects the release and pushes the component, along with the error that occurred, to the watch node, where the administrator will contact the user. To avoid version conflict errors during hot release, users should stop related work before and during the release process and wait for the new release operation to complete.

There are two ways to post a distributed task component to the task node: one is to contact the cluster administrator and let him finish it; the other is for the user to bypass the cluster administrator through a web publishing interface and have the web program post it directly to the task node. In fact, the cluster administrator himself mainly uses this web interface to publish distributed task components. Management of the web publishing interface, however, remains in the cluster administrator's hands; whether he is willing to take the security risk and open this interface is his decision.

7.6 Partition

Whether it is data computing or data building, both are essentially distributed processing, and there is the problem of how to divide a large chunk of data into several small pieces of intermediate data during computation. Establishing the corresponding segmentation rules according to the different characteristics of the data is the work of "partition".

Partitions are usually divided first by the number of follow-on nodes: when a piece of data is split this way, each subsequent node gets one small chunk. More often, data attributes are also considered, and the data is segmented by attribute. For example, when executing a SELECT query with ORDER BY and GROUP BY clauses, the data is sorted into different categories based on the columns being ordered and grouped. For other personalized segmentation requirements that the system cannot handle uniformly, an API interface is set aside so that programmers can implement them themselves.

At present, the from/to phase components of data computing and the scan phase component of data building provide default partition processing.
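The simplest rule above, dividing by the number of follow-on nodes, can be sketched as follows. The class and method names are hypothetical helpers invented for illustration, not the Laxcus partition API.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the simplest partition rule: split a data range evenly by
// the number of follow-on nodes, so each node gets one near-equal
// chunk. Names are illustrative only, not the real Laxcus API.
public class PartitionSketch {
    // Split [0, total) into `nodes` near-equal [begin, end) ranges.
    static List<long[]> split(long total, int nodes) {
        List<long[]> parts = new ArrayList<>();
        long base = total / nodes, rest = total % nodes, begin = 0;
        for (int i = 0; i < nodes; i++) {
            long size = base + (i < rest ? 1 : 0); // spread the remainder
            parts.add(new long[] { begin, begin + size });
            begin += size;
        }
        return parts;
    }

    public static void main(String[] args) {
        for (long[] p : split(10, 3)) {
            System.out.println(p[0] + " - " + p[1]);
        }
    }
}
```

Attribute-based segmentation (e.g. by ORDER BY / GROUP BY columns) would replace the even range split with a grouping key, but the contract is the same: one piece per follow-on node.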

7.7 data balancing

On the basis of partitioning, if the computation time is to be shortest and the computing efficiency maximized, the problem of balanced data distribution must be considered.

In fact, the parameters needed for balance processing have already been prepared at partition time. During execution, the data balancing work of data computing is the responsibility of the balance phase, and that of data building is the responsibility of the each stage. When computer performance is not considered, simple data balancing is handled by data length, which is already provided in the data bitmap: in theory, two sets of data of the same content and length have the same execution time on two computers with the same hardware configuration. More complicated balancing must take data types and processing content into account; for example, addition and subtraction are certainly faster than multiplication, integer calculation is certainly faster than floating-point calculation, and multimedia audio and video data are certainly more computation-intensive than text data. As with partitioning, more sophisticated, personalized data balancing calculations are handled by the programmer through API interfaces.
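The simple length-based balancing described above can be sketched as a greedy allocation: each chunk, whose length the data bitmap already reports, goes to the node with the smallest accumulated load. The class and method names are illustrative assumptions, not the real balance-phase interface.

```java
// Sketch of length-based data balancing: assign each data chunk to the
// node with the smallest total length so far, so every to/subto task
// gets a similar amount of work. Illustrative only, not the Laxcus API.
public class BalanceSketch {
    // Returns the total byte length assigned to each node.
    static long[] balance(long[] chunkLengths, int nodes) {
        long[] load = new long[nodes];
        for (long len : chunkLengths) {
            int min = 0; // find the currently lightest node
            for (int i = 1; i < nodes; i++) {
                if (load[i] < load[min]) min = i;
            }
            load[min] += len; // give the chunk to the lightest node
        }
        return load;
    }

    public static void main(String[] args) {
        // Chunk lengths as reported by the data bitmaps (made-up values).
        long[] load = balance(new long[] { 9, 7, 5, 3, 2 }, 2);
        System.out.println(load[0] + ", " + load[1]);
    }
}
```

A refinement that weighted lengths by data type (integer vs. floating point, text vs. multimedia) would change only the cost of each chunk, not the allocation loop.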

7.8 Memory Mode

Distributed task components produce a large amount of intermediate data while running. By default, this intermediate data is saved to the hard disk by the system and released by the system after use. As mentioned many times before, because disk read/write efficiency is very low, the system has done a great deal of optimization. To meet some users' requirements for fast data processing and to improve processing efficiency, the system provides an interface option that allows intermediate data to be stored in memory. This feature requires the user to specify it explicitly when writing data to the interface, and the system allocates memory selectively based on resource usage at the time. In practice, when this is used together with "memory computing" in data access, data processing can completely skip the hard disk bottleneck and become a purely streaming process, with an efficiency boost of dozens of times. Streaming data processing is an urgent need for time-sensitive information services.

Figure 7.8 Data write-to-memory interface
