A new idea for handling massive data: distributed databases

Last Update:2018-12-03 Source: Internet

Author: User


I have tried many times to build three-tier applications with PB (PowerBuilder). I have used everything from Transport in PB7, through COM+ and MTS, to the WebServices support that arrived with PB11, and it has always been frustrating. Why? Transport, COM+, MTS, and WebServices all rest on the same proxy idea. The client sends a request to the middle tier; the middle tier serializes the data and transmits it to the client; the client works on and modifies the data; and then the client tells the middle tier to update it. Both theory and practice show that in development difficulty, stability, and access speed this cannot compete with a two-tier direct connection. PB's Transport and MTS are the more efficient options because the data they return is compact, whereas WebServices use XML, and XML serialization is expensive, which makes access somewhat slower.

At present neither I nor other developers of small application projects can build three tiers painlessly, so we all use two tiers for rapid development. I once evaluated this myself: developing database access with three-tier COM+ took about 5 times the development work of two tiers. Five times the work means five times the development cost, and the same goes for the delivery date and for later maintenance. The claim that three tiers reduce client updates is also a nightmare in practice; "modify the middle tier without changing the client" is just a nice slogan. I really wonder what the people who advocate three tiers have in mind!

For example, suppose we need to implement three operations: 1. an ordinary read-only query; 2. new data entry; 3. data modification. In a three-tier design the steps are as follows.

1. Ordinary query (read-only): A. The client connects to the middle tier. B. The middle tier instantiates objects, runs the query, serializes the result, and returns the description string of the adoset or ds. C. The client imports the description string into its own adoset or ds.

2. New data entry: A. The client inserts the data locally. B. The data is serialized and sent to the middle tier. C. The middle tier fills a DS with it, connects to the database, writes the data, and returns the result.

3. Data modification: A. The middle tier reads the raw data. B. The data is serialized and sent to the client. C. The client fills its local buffer with it. D. The client modifies the data. E. When the client saves, it sends the data back to the middle tier, which updates the row status and finally updates the database.

Compared with two-tier direct-connection development, the processing in the middle is complicated, many exceptions must be handled, and a MessageBox cannot be popped up directly. Many error codes have to be returned and turned into prompts on the client, and if an exception is not handled well the client cannot see the cause of the error. In a two-tier design, by contrast, any operation takes only one or at most two steps, and stability is much higher. The third tier simply adds more links to the data flow.

In addition, a system that processes massive data cannot handle the data row by row in program code. For example, to produce a business report, the conditions are passed as parameters to a stored procedure, which returns the result set that meets them. For data-processing programs, then, the third tier is redundant. It really is: it is just an access proxy.

Advocating three tiers merely to take data-processing load off the client would be a joke, because PCs are very well configured now. The old computers that earlier articles had in mind are not today's machines; a current dual-core PC, for example, can easily handle client-side logic without relying on a middle tier. I think an average server is no match for 20 ordinary PCs. Unless it is a minicomputer, its computing power is unconvincing, and a server cannot be given unlimited memory either, while the CPU and memory of today's PCs are frighteningly large. It is also unlikely that many middle-tier servers will be deployed. Putting the computation on the middle tier made sense in the past, when clients were weak; today it is probably the wrong idea. If 50 users all need to run some processing, it is better to fetch the base data and let the 50 clients do the work.

In addition, many people regard something like an HP entry-level server (2-) as a real server, when in fact it is not. Unless its configuration is upgraded, for example by adding CPUs and memory, its base configuration is very low, and some can do little more than run IIS. When my client heard that ECC memory costs 1500-2500 per GB, they did not want to add any memory! So promoting three tiers looks a bit like working for the server vendors.

Smart and automatic client upgrades now solve many of the update problems, so client updating is no reason to use three tiers either.

Complex statistical logic requires stored procedures anyway, so it is not a reason for three tiers either, because that business logic is not processed in the middle tier.

One performance advantage of three tiers is that the middle-tier server and the database sit on the same LAN, so communication between them may be faster. A middle tier can also play a major role in controlling complex transactions. But not all applications need that.

That brings distributed databases to mind.

Not only can the clients' basic data traffic be spread across multiple SQL servers, but stored procedures and functions can be executed on them directly, and SQL Server now supports developing program modules and extended functions. Given that, I think a tier that acts only as an access proxy is pointless. A distributed database system can be used on a larger scale: the machines that would have been middle-tier servers can be configured as a server group holding the distributed SQL data. Access and programming are also simpler, and concurrency and the number of connections should not be a problem. Judging from large-scale applications such as Google, the best way to handle large-scale, high-concurrency access is not a mainframe and not an extra tier, but distribution. Servers are assembled into groups and pipelines that can be managed separately. Data replication, synchronization, and partitioning among multiple servers, as well as complex access policies and load distribution, can all be handled at the programming level. Isn't programming against many servers as if accessing a single one exactly what we all hope for?

So I will read more articles on the subject and study it in depth. Of course, distribution here mainly means load balancing: spreading many users across multiple servers.

// The following is an article by another author:

Http://www.javaeye.com/topic/225650

At present the concept of distribution is becoming more and more popular, but in the database field distributed applications are still relatively rare. After reading about Google's Map/Reduce concept, I conceived a distributed database architecture and implemented a prototype of it. I am writing down its basic ideas here in the hope of starting a discussion. I have not been working for long, so if anything here is wrong or incomplete, I would welcome your advice. Thank you.

The purpose of this distributed database is to process massive data quickly. The basic idea is actually very simple. Data is distributed across multiple data nodes. When a SQL statement is executed, its semantics are analyzed and it is run on one or more of the databases. In this way the query load is spread across the nodes, and the processing time for massive data is greatly shortened.

Let's analyze a few simple SQL statements to see what is different in a distributed environment. Assume that we have two data nodes, A and B, and a table named table, where the rows with IDs 1 to 100 are stored on node A and the rows with IDs 101 to 200 are stored on node B. Each of the following SQL statements is executed on both databases.

select * from table where id = 1
Database A returns the row with ID 1, and database B returns an empty result. In this case, simply merging the results of A and B gives the correct answer.

select top 10 * from table
Database A returns 10 rows, and database B returns 10 rows. Merging them gives 20 rows, so the surplus 10 rows must be removed to obtain the correct result.

select * from table order by id
Database A and database B each return all of their data, but for the merged data to satisfy the order by condition, a sort must clearly be performed.

select top 10 * from table order by id
Database A and database B each return 10 rows. After merging, the data must be sorted and the surplus rows removed to ensure that the result is correct.
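The cases above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the original: the two nodes are simulated as in-memory lists, and the helper names (`run_on_all`, `merge`, `top`, `order_by`) are hypothetical, not the middleware's actual interface.

```python
from itertools import chain, islice

# Simulated data nodes: IDs 1-100 on node A, 101-200 on node B (as in the text).
NODE_A = [{"id": i} for i in range(1, 101)]
NODE_B = [{"id": i} for i in range(101, 201)]
NODES = [NODE_A, NODE_B]


def run_on_all(query):
    """Run the same per-node query function on every node."""
    return [query(rows) for rows in NODES]


def merge(partials):
    """Merging: concatenate the per-node result sets."""
    return list(chain.from_iterable(partials))


def top(rows, n):
    """Removal: keep only the first n rows of the merged result."""
    return list(islice(rows, n))


def order_by(rows, key):
    """Sorting: re-sort the merged rows by the given column."""
    return sorted(rows, key=lambda r: r[key])


def first_10_by_id(rows):
    """What each node does for 'select top 10 * from table order by id'."""
    return sorted(rows, key=lambda r: r["id"])[:10]


# select * from table where id = 1  -> merge only
by_id = merge(run_on_all(lambda rows: [r for r in rows if r["id"] == 1]))

# select top 10 * from table  -> merge, then remove the surplus rows
top10 = top(merge(run_on_all(lambda rows: rows[:10])), 10)

# select top 10 * from table order by id  -> merge, sort, then remove
top10_sorted = top(order_by(merge(run_on_all(first_10_by_id)), "id"), 10)
```

Each node still does the expensive filtering and sorting over its own rows; the coordinating step only handles the small per-node results.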

Other keywords in SQL statements that need special handling include MAX, MIN, COUNT, SUM, and AVG; I will not go through them here. These examples show that as long as the results of the queries on the different data nodes are post-processed, they can be turned into a result equivalent to that of a single-database query. In summary, only three operations are involved: merging, sorting, and removal. This is in fact very similar to the map/reduce idea: however complicated the action, in the end it comes down to a few simple operations. Of course these operations take some time, but with massive data volumes that time is in many cases negligible.
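The original does not show how the aggregate keywords are handled. One common way, sketched here as an assumption rather than as the author's implementation, is to have each node return partial aggregates and combine them. The function names are hypothetical, and every node is assumed to hold at least one row.

```python
# Values of some numeric column as stored on two simulated nodes.
node_a = list(range(1, 101))      # 100 rows on node A
node_b = list(range(101, 151))    # 50 rows on node B (fewer, on purpose)


def partial(values):
    """What a single node computes locally."""
    return {"count": len(values), "sum": sum(values),
            "min": min(values), "max": max(values)}


def combine(partials):
    """Combine per-node partial aggregates into the global aggregates."""
    count = sum(p["count"] for p in partials)
    total = sum(p["sum"] for p in partials)
    return {
        "count": count,
        "sum": total,
        "min": min(p["min"] for p in partials),
        "max": max(p["max"] for p in partials),
        # AVG is derived from the global SUM and COUNT; averaging the
        # per-node averages would be wrong when nodes hold different row counts.
        "avg": total / count,
    }


result = combine([partial(node_a), partial(node_b)])
```

AVG is the case that needs care: the nodes' own averages here are 50.5 and 125.5, whose mean is 88.0, while the true average over all 150 rows is 75.5.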

The statements above are simple ones. Some complicated SQL statements can only be completed if data is exchanged between the data nodes while the statement executes (an example is given at the end of the article). Therefore, to implement a distributed database that can handle SQL statements completely, the database kernel must be modified. My time was limited and modifying the kernel was unrealistic, so I implemented the prototype of this distributed database as middleware, using MSSQL 2000 as the database. The concept diagram of the distributed database I designed is shown in Appendix 1.

Data is distributed to the data nodes according to some rule (generally the primary key can simply be hashed). The distributed database server accesses each data node, performs the merging, sorting, and removal operations, and then returns the result to the program through a data interface.
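Hashing the primary key to choose a node can be sketched as follows. The node count, the function name `node_for`, and the choice of CRC32 as the hash are assumptions made for illustration; the original does not say which hash is used.

```python
import zlib

NODE_COUNT = 3  # assumed number of data nodes


def node_for(primary_key, node_count=NODE_COUNT):
    """Map a primary key to a node index with a stable hash."""
    # zlib.crc32 is used instead of hash() because Python's built-in hash
    # of strings is randomized per process, so it is not stable across runs.
    digest = zlib.crc32(str(primary_key).encode("utf-8"))
    return digest % node_count


# Place 1000 keys and record which node each one lands on.
placement = {}
for key in range(1, 1001):
    placement.setdefault(node_for(key), []).append(key)
```

A query on a single primary key can then be routed to `node_for(key)` alone. One known limitation of plain modulo hashing is that changing the number of nodes moves most keys to a different node.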

The data interfaces below are each suited to a different scenario:

Reader: Provides an interface for reading query results one row at a time. With massive data, a large amount of data sometimes has to be read for processing, and reading it all into memory at once is obviously unrealistic. In that case the data can be read row by row in reader mode and processed in batches.

Datafiller: Provides XML packaging for data and is suitable for reading small data volumes. It mainly provides a convenient interface for Web applications.

Command: Executes DELETE, UPDATE, INSERT, and other SQL statements that do not return data.

Bulkcopy: Batch insert interface. It mainly provides high-speed interfaces for large data import.
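As a rough idea of what a Reader-style interface could look like, the sketch below streams rows from several nodes through a generator. It is not the author's interface: in-memory SQLite databases stand in for the MSSQL data nodes, and `make_node` and `reader` are hypothetical names.

```python
import sqlite3


def make_node(ids):
    """Create an in-memory SQLite database standing in for one data node."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table t (id integer primary key, payload text)")
    conn.executemany("insert into t values (?, ?)",
                     [(i, f"row-{i}") for i in ids])
    return conn


def reader(nodes, sql, batch_size=100):
    """Yield rows one by one from every node without loading them all at once."""
    for conn in nodes:
        cursor = conn.execute(sql)
        while True:
            batch = cursor.fetchmany(batch_size)
            if not batch:
                break
            yield from batch


nodes = [make_node(range(1, 101)), make_node(range(101, 201))]
total = 0
for row_id, payload in reader(nodes, "select id, payload from t"):
    total += 1  # process each row here instead of collecting them all
```

Only one batch per node is held in memory at a time, which is the property the Reader interface is described as providing.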

The hard part of implementing this middleware is the semantic analysis of SQL statements. This should really be done with compiler techniques, but I did not use them in my implementation. One reason is time; the other is that with a middleware-based approach some complex SQL statements cannot produce correct results anyway. I therefore use regular expressions and similar methods to analyze each SQL statement and decide how its execution results should be processed and whether it should be sent to a single node or to multiple nodes. The process is shown in Appendix 2.
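A minimal sketch of regular-expression-based analysis might look like this. The patterns, the `plan` function, and the shape of the returned plan are all assumptions for illustration; the author's actual rules are not given, and patterns this simple only cover statements like the examples earlier in the article.

```python
import re

# Hypothetical rules: each pattern maps to one post-processing step.
TOP_RE = re.compile(r"\bselect\s+top\s+(\d+)\b", re.IGNORECASE)
ORDER_RE = re.compile(r"\border\s+by\s+(\w+)(\s+desc)?", re.IGNORECASE)
ID_EQ_RE = re.compile(r"\bwhere\s+id\s*=\s*(\d+)\s*$", re.IGNORECASE)


def plan(sql):
    """Return a small plan: which nodes to query and how to post-process."""
    sql = sql.strip().rstrip(";")
    result = {"target": "all", "steps": ["merge"]}
    id_match = ID_EQ_RE.search(sql)
    if id_match:
        # A lookup by primary key can be routed to a single node.
        result["target"] = ("single", int(id_match.group(1)))
    order_match = ORDER_RE.search(sql)
    if order_match:
        descending = bool(order_match.group(2))
        result["steps"].append(("sort", order_match.group(1), descending))
    top_match = TOP_RE.search(sql)
    if top_match:
        result["steps"].append(("remove", int(top_match.group(1))))
    return result
```

For example, `plan("Select top 10 * from Table order by ID")` asks for all nodes to be queried and the results to be merged, sorted by `ID`, and cut to 10 rows. Nested queries or joins would defeat patterns like these, which is consistent with the accuracy limits the author describes.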

Note that no extra latency may be introduced between sending a SQL statement, executing it, and returning the result; otherwise only a few dozen SQL statements can be executed per second. The model I used at first was an ordinary query thread model, shown in Appendix 3.

After each statement finishes executing, its execution status in a hashmap is set to finished. A query thread traverses the hashmap and sends the statements that have finished to the result processing module. To avoid 100% CPU usage, the query thread must contain a sleep statement. However, on Windows the minimum thread polling interval is 15 ms, and because the CPU gives priority to other threads, each sleep takes at least 20 ms. As a result, no matter how fast the SQL queries are, the processing speed of the distributed database is capped at 1000/20 = 50 statements per second. With my first model, at most 20 SQL statements could be processed per second, which is obviously not enough for Web applications.
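The polling model can be sketched as follows. This is a simplified illustration, not the author's code: a Python dict plays the role of the hashmap, a list stands in for the result processing module, and the 20 ms sleep is taken from the figure quoted above.

```python
import threading
import time

status = {}            # statement id -> "finished" (the "hashmap")
delivered = []         # stands in for the result processing module
lock = threading.Lock()


def execute(statement_id):
    """Simulated SQL execution: marks the statement finished when done."""
    with lock:
        status[statement_id] = "finished"


def poll(expected, sleep_seconds=0.02):
    """Query thread: scan the map, hand over finished statements, then sleep."""
    while len(delivered) < expected:
        with lock:
            done = [sid for sid, state in status.items() if state == "finished"]
            for sid in done:
                del status[sid]
        delivered.extend(done)
        time.sleep(sleep_seconds)  # avoids 100% CPU, but adds latency per pass


poller = threading.Thread(target=poll, args=(5,))
poller.start()
for sid in range(5):
    execute(sid)
poller.join()

# If every result has to wait for one polling pass, the ceiling is
# 1000 ms divided by the sleep time in ms, as computed in the text.
ceiling = 1000 / 20
```

However quickly `execute` finishes, a result is only noticed on the next pass of the loop, so the sleep interval rather than the query time bounds the latency.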

Later I switched to a semaphore mechanism: each query thread is assigned a semaphore when it is created. Whenever a SQL statement is executed, a monitoring thread is added to the thread pool; it blocks until all the semaphores have been signaled and then immediately sends the results to the result processing module. Windows handles semaphores very quickly, on the order of CPU instruction cycles. After this improvement, the time the distributed database takes to process a query is basically the same as the time needed to execute the query itself. Of course, this design uses a lot of threads and is very difficult to debug, so it must be designed with great care. When there are many data nodes, a thread pool with hundreds of threads has to be maintained, which in my opinion is very bad. I noticed that no matter how much data is being processed, MSSQL itself has only a little over 20 threads. Its design must be very refined and certainly different from mine. If you have a better solution to this problem, please let me know. Thank you.
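A minimal sketch of the semaphore approach, using Python's `threading.Semaphore` in place of Windows semaphores. The function names, the node count, and the simulated results are assumptions for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

NODE_COUNT = 3
results = {}
results_lock = threading.Lock()


def query_node(node_id, sql, done):
    """Query thread: run the statement on one node, then signal its semaphore."""
    with results_lock:
        results[node_id] = f"rows from node {node_id} for {sql!r}"  # simulated
    done.release()


def monitor(semaphores):
    """Monitoring thread: block until every node has signaled, then hand over."""
    for sem in semaphores:
        sem.acquire()          # blocks without polling or sleeping
    with results_lock:
        return [results[n] for n in sorted(results)]


# One extra worker so the monitor never starves the query threads.
with ThreadPoolExecutor(max_workers=NODE_COUNT + 1) as pool:
    semaphores = [threading.Semaphore(0) for _ in range(NODE_COUNT)]
    waiter = pool.submit(monitor, semaphores)
    for node_id, sem in enumerate(semaphores):
        pool.submit(query_node, node_id, "select * from table", sem)
    merged = waiter.result()
```

The monitor wakes as soon as the last node signals, so no sleep interval is added to the query time. The cost is the one the author notes: one blocked thread per statement in flight, in addition to one thread per node.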

The above is the basic concept and implementation of the distributed database middleware. Of course, a great deal of work remains before it could become a commercial product: permissions, data security, node fault handling, logging, and other modules all need much improvement. The middleware I have implemented so far is very simple, and because of MSSQL's limitations many modules are not implemented elegantly. The one gratifying thing is that performance is very good, which achieves the original goal of a distributed system. Currently three machines run as data nodes. Under random data access, the load is spread evenly across the nodes, and reading and writing large volumes of data is generally more than twice as fast as with a single database. Of course, distributed systems are not a cure-all, and some problems cannot be solved at present. For example:

1. Multi-table problem: for example, suppose there are a user table, a product ID table, and a transaction record table, and the transaction record table references the user table and the product ID table through foreign keys. If you execute

select * from transaction_record, product_id_table, user_table where transaction_record.product_id = product_id_table.id and transaction_record.user_id = user_table.user_id

With a statement like this, if the middleware only post-processes the execution results, errors will occur no matter how the tables are organized. Why? The reason is hard to explain briefly; if you are interested, think it through carefully.
Middleware cannot handle such statements at all. Only by modifying the kernel can data be exchanged between the data nodes while the statement executes. My current workaround is to put one of the tables on a single database. However, the resulting program looks odd: two different database classes are used for one query action, and programmers who do not understand the whole framework will not know why.
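The author leaves the reason as an exercise. One way to see it, sketched here under the assumption that each table is partitioned independently by its own primary key, is that a transaction row and the user row it refers to often live on different nodes, so a join run separately on each node never sees the pair. All names and data below are made up for illustration.

```python
NODES = 2

# Tiny made-up data set.
users = {1: "alice", 2: "bob"}                      # user_id -> name
transactions = [(1, 1), (2, 1), (3, 2), (4, 2)]     # (transaction_id, user_id)

# Each table is split by its own primary key modulo the node count.
node_users = [{k: v for k, v in users.items() if k % NODES == n}
              for n in range(NODES)]
node_tx = [[t for t in transactions if t[0] % NODES == n]
           for n in range(NODES)]


def join(tx_rows, user_rows):
    """transaction_record.user_id = user_table.user_id"""
    return [(tid, uid, user_rows[uid]) for tid, uid in tx_rows if uid in user_rows]


# Correct answer on a single database: every transaction finds its user.
single_db = join(transactions, users)

# Middleware approach: run the join on each node, then merge the results.
per_node = [row for n in range(NODES) for row in join(node_tx[n], node_users[n])]

# The workaround from the text: keep one table whole in a single database.
workaround = [row for n in range(NODES) for row in join(node_tx[n], users)]
```

Here the single database returns four joined rows, the per-node join returns only two, and no amount of merging, sorting, or removal afterwards can restore the missing rows. Joining each node's transactions against the complete user table recovers all four.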

2. Semantic analysis: in a distributed environment it is harder to convert SQL statements into operation primitives, and very hard to guarantee that the logic is completely correct. My discrete mathematics is weak, and at present the recognition accuracy cannot reach 100%, so I have had to keep a manual mode in the data interface, in which the programmer decides by hand how the data is processed. This is very ugly. Given the current recognition rate, complex SQL statements can be split into several simpler statements, or their processing can be customized in manual mode to guarantee correctness. I have no time at present to improve the analysis module, so that will have to wait.

I hope you can give me some advice on these questions; after all, working out ideas on one's own has many limitations. I personally feel there is still much to explore here, and it may be another way to process massive data. Finally, thank you for reading.

Appendix 1: Concept diagram of the distributed database.

Appendix 2: SQL query process.

Appendix 3: Query thread model.

There are also some articles for reference:

Http://tech.ddvip.com/2008-09/122180807067490.html

Http://www.ningoo.net/html/2009/amoeba_for_mysql_distribute_environment.html

Http://fineboy.cnblogs.com/archive/2005/08/03/206395.html

Http://news.chinabyte.com/368/115368.shtml

Http://news.csdn.net/n/20061124/98200.html

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
