Rase Distributed Computing System

Last Update:2018-12-04 Source: Internet

Author: User


1. Introduction

The ranking and selection engine (rase) is a special-purpose distributed computing framework that runs distributed simulations driven by ranking and selection algorithms.

An example illustrates what rase does. Suppose there are 1000 table tennis players and we need to pick the one strong enough to represent the country. How do we tell who is strongest? We hold matches, and the results reveal the players' strength. The most naive approach is to have every pair of players play each other and see who comes out on top, but this is obviously time-consuming and laborious. Instead, we can eliminate the relatively weak players while the matches are still going on, so that the player we need is found quickly. Confidence interval theory from probability can be used to decide eliminations, which keeps the error of each elimination theoretically controllable. This is the basic idea of ranking and selection.

Rase takes three inputs: a set of alternatives (such as athlete data), a simulation algorithm (such as a match algorithm), and a PK algorithm, that is, a head-to-head comparison rule (such as an elimination rule). While the system runs, the alternatives are simulated in a loop to produce results (such as scores), and the PK algorithm uses those results to eliminate the relatively weak alternatives and remove them from the overall list. After running for some time, the system arrives at an optimal alternative.
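The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the rase implementation: the function names, the Gaussian "match" simulation, and the simple interval-overlap elimination rule with z = 2.0 are all hypothetical stand-ins for the user-supplied simulation and PK algorithms.

```python
import math
import random
import statistics


def simulate(alt, rng):
    """Hypothetical simulation: one noisy score around the alternative's strength."""
    return rng.gauss(alt["strength"], 1.0)


def interval(samples, z=2.0):
    """Approximate confidence interval for the mean score (assumed rule)."""
    mean = statistics.mean(samples)
    half = z * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean - half, mean + half


def select_best(alternatives, rounds=200, warmup=5, seed=0):
    rng = random.Random(seed)
    samples = {alt["name"]: [] for alt in alternatives}
    alive = list(alternatives)
    for round_no in range(1, rounds + 1):
        # Simulation step: every surviving alternative gets one more sample.
        for alt in alive:
            samples[alt["name"]].append(simulate(alt, rng))
        if round_no < warmup:
            continue  # too few samples to estimate a variance
        # PK step: drop alternatives whose upper bound is below the best lower bound.
        bounds = {alt["name"]: interval(samples[alt["name"]]) for alt in alive}
        best_lower = max(low for low, _ in bounds.values())
        alive = [alt for alt in alive if bounds[alt["name"]][1] >= best_lower]
        if len(alive) == 1:
            break
    return max(alive, key=lambda alt: statistics.mean(samples[alt["name"]]))


players = [
    {"name": "a", "strength": 1.0},
    {"name": "b", "strength": 2.0},
    {"name": "c", "strength": 6.0},
]
winner = select_best(players)
```

Weak players stop consuming simulation time as soon as their interval falls entirely below the leader's, which is where the saving over a full round-robin comes from.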

2. System Analysis

From the description above, rase needs at least two threads, one for simulation and one for PK, running concurrently and sharing the alternative data. That is the simplest model. In practice, however, the system usually has to run a very large number of simulations, and running time is measured in days, so improvements are needed. Simulations of different alternatives can run at the same time, just as many table tennis matches can be played simultaneously, so several threads can simulate in parallel. In other words, simulation scales horizontally, and so does PK (this is backed by mathematical theory). The original two threads thus become two types of threads, each of which can be scaled out to many threads. Because we want to use as many machines as possible for the computation, this becomes a distributed computing problem.

3. System Description

After several rounds of design and revision, we arrived at a workable distributed design. The structure of the system is described below.

The system consists of one master and multiple agents. The master runs on one machine, and every other machine runs its own agent. The master acts mainly as the data center: it distributes alternative data to the agents, collects the simulation results from the agents, runs PK on the alternatives, and eliminates some of them. Each agent obtains alternative data from the master, runs the simulation, and returns the results to the master.

The master holds three data structures. The first is mainlist, an array that stores the original parameters and the simulation results of every alternative. The other two are the queues mainqueue and prequeue, which store indexes into mainlist. The master also runs one master thread and multiple selector threads. At startup, the master puts all the alternatives in mainlist into mainqueue. The master thread sends the data in mainqueue to the agents, receives the agents' simulation results, and puts them in prequeue. Meanwhile, the selector threads take simulation results from prequeue, add each result to the corresponding alternative in mainlist, and PK that alternative against the other surviving alternatives in mainlist. If it survives, it is put back in mainqueue to wait for its next simulation.
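A minimal single-threaded sketch of the master-side data flow follows. It is an assumption-laden illustration rather than the actual rase code: the class and method names, the choice to carry (index, result) pairs in prequeue, and the fixed-margin PK rule are all hypothetical, and the real system runs the master and selector roles on separate threads with network transport in between.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Alternative:
    params: float                                 # original alternative parameters
    results: list = field(default_factory=list)   # accumulated simulation results
    alive: bool = True

    def mean(self):
        return sum(self.results) / len(self.results)


class Master:
    def __init__(self, params_list, margin=1.0):
        self.mainlist = [Alternative(p) for p in params_list]
        self.mainqueue = deque(range(len(self.mainlist)))  # indexes awaiting simulation
        self.prequeue = deque()                            # (index, result) awaiting PK
        self.margin = margin                               # hypothetical PK tolerance

    def dispatch(self, count):
        """Master-thread role: hand up to `count` (index, params) pairs to an agent."""
        batch = []
        while self.mainqueue and len(batch) < count:
            idx = self.mainqueue.popleft()
            batch.append((idx, self.mainlist[idx].params))
        return batch

    def receive(self, idx, result):
        """Master-thread role: store a simulation result returned by an agent."""
        self.prequeue.append((idx, result))

    def select_once(self):
        """Selector role: record one result, PK against survivors, re-queue if alive."""
        idx, result = self.prequeue.popleft()
        alt = self.mainlist[idx]
        alt.results.append(result)
        best = max(a.mean() for a in self.mainlist if a.alive and a.results)
        if alt.mean() < best - self.margin:
            alt.alive = False              # eliminated: never re-enters mainqueue
        else:
            self.mainqueue.append(idx)     # survives: waits for the next simulation
        return alt.alive
```

Here each dispatched index travels mainqueue, agent, prequeue, and back to mainqueue only if the alternative survives its PK.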

The agent holds two queues: altqueue and samplequeue. altqueue stores the alternative data sent by the master, and samplequeue stores the simulation results. The agent runs one agent thread and multiple slave threads. The agent thread pulls alternative data from the master into altqueue and sends the simulation results in samplequeue back to the master. Meanwhile, the slave threads take data from altqueue, perform the simulation, and put the results into samplequeue.
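The agent-side structure maps naturally onto Python's thread-safe `queue.Queue`. The sketch below is hypothetical: `simulate` is a placeholder for the user-supplied simulation, the network exchange with the master is replaced by a plain list passed to `run_agent`, and a `None` sentinel is an assumed way to shut the slave threads down.

```python
import queue
import threading


def simulate(params):
    """Placeholder for the user-supplied simulation algorithm (assumption)."""
    return params * 2


def slave(altqueue, samplequeue):
    """Slave thread: take alternatives from altqueue, simulate, store the result."""
    while True:
        item = altqueue.get()
        if item is None:          # sentinel: no more work
            break
        idx, params = item
        samplequeue.put((idx, simulate(params)))


def run_agent(batch, slave_count=4):
    """Agent-thread role: fill altqueue, then drain samplequeue once slaves finish."""
    altqueue = queue.Queue()
    samplequeue = queue.Queue()
    slaves = [
        threading.Thread(target=slave, args=(altqueue, samplequeue))
        for _ in range(slave_count)
    ]
    for thread in slaves:
        thread.start()
    for item in batch:            # stands in for data pulled from the master
        altqueue.put(item)
    for _ in slaves:              # one sentinel per slave thread
        altqueue.put(None)
    for thread in slaves:
        thread.join()
    results = []
    while not samplequeue.empty():
        results.append(samplequeue.get())
    return results                # stands in for results pushed to the master
```

Because the slaves finish in no fixed order, each result carries the index of its alternative so the master can match it to the right mainlist entry.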

For network communication, the master listens on multiple ports, and the agents pull data from it and push their results to it. Load balancing between agents is achieved through the way agents pull data. Each agent has a threshold; at a fixed interval it checks how many alternatives it currently holds and pulls (threshold - current count) alternatives from the master. An agent that processes data faster therefore pulls more, which balances the computing load.
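The pull rule amounts to a single subtraction. The following sketch assumes an in-memory `deque` in place of the network connection to the master; the function names are illustrative only.

```python
from collections import deque


def pull_count(threshold, current):
    """Number of alternatives to request: threshold minus what the agent holds."""
    return max(0, threshold - current)


def refill(altqueue, master_pending, threshold):
    """Top altqueue up towards the threshold from the master's pending data."""
    want = pull_count(threshold, len(altqueue))
    pulled = 0
    while pulled < want and master_pending:
        altqueue.append(master_pending.popleft())
        pulled += 1
    return pulled


altqueue = deque([1, 2])
master_pending = deque(range(10, 20))
first = refill(altqueue, master_pending, threshold=5)   # pulls 3
second = refill(altqueue, master_pending, threshold=5)  # pulls 0, already full
```

A fast agent drains its altqueue further between checks, so it asks for more on the next pull; a slow agent asks for little or nothing.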

4. Design advantages

1) Approximately sequential execution

Thanks to the queue mechanism, all alternatives are processed roughly in order, so the alternatives being compared in PK have been simulated nearly the same number of times. This is what allows the PK algorithm to be justified theoretically.

2) Multi-thread concurrency

Both the slave threads and the selector threads run concurrently. Slave concurrency is relatively simple because each alternative is simulated independently. Selector threads do interact with one another, but this form of concurrency can be justified mathematically. The benefit of concurrency is that multi-core CPUs are used as fully as possible.

3) Asynchronous transmission keeps CPU usage high

The most time-consuming parts of the system are the simulation in the slave threads and the PK in the selector threads, so these two are the bottlenecks. The queue mechanism makes the system asynchronous: slave and selector threads never wait for network transfers, so they can keep running without pause.

4) Index queue

mainqueue and prequeue are implemented as linked lists because nodes are added and removed frequently. mainlist is implemented as an array because traversing an array is much faster than traversing a linked list, and during PK all the alternatives in mainlist must be traversed. "Index queue" means that mainqueue and prequeue store indexes into mainlist, so the data can be fetched quickly through the index. This design plays to the strengths of both linked lists and arrays and improves efficiency.

5. Design disadvantages

Balance between simulation and PK

The system is ordered to a certain extent: prequeue and mainqueue on the master, together with altqueue and samplequeue on the agents, form a ring, and data travels around this ring in sequence. If selector PK is faster than slave simulation, data accumulates in mainqueue and cannot be simulated promptly, while prequeue holds less and less data; the selectors end up waiting, which lowers CPU usage on the master. If selector PK is slower than slave simulation, data accumulates in prequeue and cannot go through PK promptly, while mainqueue holds less and less data; the slaves end up waiting, which lowers CPU usage on the agents. Only by tuning parameters can simulation and PK be balanced so that CPU usage is maximized. However, as alternatives are eliminated, the total number of alternatives changes and the balance shifts with it, reducing system efficiency. This problem remains to be solved.
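The imbalance can be seen in a toy model of the ring. The sketch below is a deliberate simplification and not a model of the real system: it collapses altqueue and samplequeue into the simulation step, ignores elimination, and treats `sim_rate` and `pk_rate` as hypothetical fixed numbers of items processed per time step.

```python
def run_ring(total, sim_rate, pk_rate, steps):
    """Return (mainqueue, prequeue) sizes after `steps` time steps.

    Each step, slaves move up to `sim_rate` items from mainqueue to prequeue,
    then selectors move up to `pk_rate` items from prequeue back to mainqueue.
    """
    mainqueue, prequeue = total, 0
    for _ in range(steps):
        simulated = min(sim_rate, mainqueue)
        mainqueue -= simulated
        prequeue += simulated
        compared = min(pk_rate, prequeue)
        prequeue -= compared
        mainqueue += compared
    return mainqueue, prequeue


pk_faster = run_ring(total=100, sim_rate=5, pk_rate=10, steps=50)
sim_faster = run_ring(total=100, sim_rate=10, pk_rate=5, steps=50)
```

With PK faster, everything waits in mainqueue and the selectors sit idle half the time; with simulation faster, nearly everything piles up in prequeue and the slaves starve.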

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
