Problems and attempts of relational algebra (5): Cloud data organization

Abstract: This article is based on a lecture given by Chairman Shang of Beijing Run Dry Software Technology Co., Ltd. at the Tsinghua Big Data Industry Federation.

Finally, let's briefly discuss the data organization of cloud computing.

Cloud data has several characteristics:

First, diversity. Cloud computing has to solve the problem of multi-tenancy, and the data structures of different users are obviously often not the same. Even for the same user and the same line of business, data structures differ across regions and periods. Take the financial system of a small company like ours: its data structures change year after year. If there is no particular kind of sales commission this year but there is one next year, fields or tables have to be added to handle it.

The diversity of data is actually a very fundamental requirement, because the world is that complex. Diversity also existed in the era of relational databases, but the scope of application then was relatively small, usually a single organization or even a single site, so the contradiction was not particularly serious and could be managed. In the cloud era it is not so easy to handle: you face a huge number of very different users and must provide continuous service to all of them.

Second, separation. Data from different times and places is separate, and data management needs to be federal rather than unitary. For example, Beijing's data is divided directly into districts such as Haidian and Chaoyang, but in Hebei province there is an intermediate layer of cities such as Shijiazhuang and Baoding. The hierarchies differ, and forcing them into the same data structure is awkward. Yet relational algebra in theory requires a single structure, so the resulting data structures are awkward.

Separation gives rise to diversity, but it is not only about diversity. Sometimes data needs to be separated even when it is not diverse. For example, Beijing has a lot of data and I want to index it, while Shanghai has little data of this kind and needs no index. Relational theory has no such mechanism, so this can only be worked around at the engineering level. In practice many database vendors find ways to partition data, but without support at the conceptual level many things become very difficult to implement.
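The Beijing and Shanghai example can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Partition` class, its `find` method and the sample records are hypothetical names invented here, not part of any real database or of the prototype described in the lecture. Each region decides for itself whether an index is built.

```python
from collections import defaultdict


class Partition:
    """One region's records; an index is built only if the region asks for it."""

    def __init__(self, records, indexed_field=None):
        self.records = list(records)
        self.indexed_field = indexed_field
        self.index = None
        if indexed_field is not None:
            # Hypothetical in-memory index: field value -> list of records.
            self.index = defaultdict(list)
            for rec in self.records:
                self.index[rec[indexed_field]].append(rec)

    def find(self, field, value):
        # Use the index when this partition has one on the field, otherwise scan.
        if self.index is not None and field == self.indexed_field:
            return list(self.index.get(value, []))
        return [r for r in self.records if r.get(field) == value]


# Beijing holds many records, so it is indexed by district.
beijing = Partition(
    [{"id": i, "district": "Haidian" if i % 2 else "Chaoyang"} for i in range(1000)],
    indexed_field="district",
)
# Shanghai holds few records, so it is left unindexed and simply scanned.
shanghai = Partition([{"id": 1, "district": "Pudong"}, {"id": 2, "district": "Xuhui"}])
```

Both partitions answer the same `find` call, so the caller does not need to know which regions are indexed. That is the sense in which the decision is separated per region.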

Third, ease of computation. This is the most critical point.

Storing the data is not the problem. We have to compute on it; data that cannot be computed on is meaningless.

We need diverse and separated data to be freely computable, so it is important to support batch operations on higher-order sets of data tables. A higher-order set is a set of sets. When facing diverse data, we will need many such higher-order set operations.

A computation in relational algebra involves only a fixed, limited number of tables, and you must write out explicitly which tables the operation applies to. If the tables to be operated on themselves form a set, relational algebra has no way to express that.

It may be hard to imagine tens of thousands or even millions of tables in one database. Still, I have heard that Beijing Mobile has tens of thousands of tables, accumulated over the years, which nobody dares to delete. Relational databases certainly do not encourage you to create tens of thousands of tables in a database, but the need for diversity is objective: there really are that many kinds of data structure. If we treat a table as a piece of data that can be operated on, and provide higher-order set operations that work on a set of these tens of thousands of tables, this is no longer a problem.
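A minimal Python sketch of what "a table as a piece of data" could look like follows. It rests on assumptions: the table names, the fields and the helper functions `select_tables` and `total` are hypothetical and chosen only for illustration. The point is that the set of tables is itself a value that can be filtered and aggregated, without naming each table in the query.

```python
# Each table is a list of dict rows; structures differ from table to table.
tables = {
    "sales_2021": [
        {"region": "Beijing", "amount": 100},
        {"region": "Hebei", "amount": 40},
    ],
    # A later table gained a commission field that earlier tables lack.
    "sales_2022": [{"region": "Beijing", "amount": 120, "commission": 6}],
    "staff": [{"name": "Li", "dept": "Finance"}],
}


def select_tables(tables, predicate):
    """Return the subset of tables (a set of sets of rows) that satisfies predicate."""
    return {name: rows for name, rows in tables.items() if predicate(name, rows)}


def total(tables, field):
    """Sum a field over every row of every table in the set, skipping rows without it."""
    return sum(row[field] for rows in tables.values() for row in rows if field in row)


# Pick every table that carries an "amount" field, whatever else it contains.
with_amount = select_tables(tables, lambda name, rows: any("amount" in r for r in rows))
grand_total = total(with_amount, "amount")
```

The same two calls would work unchanged on a dictionary holding tens of thousands of tables, which is the situation the paragraph above describes.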

Among the current mainstream technologies, SQL does well on ease of computation but comparatively poorly on separation and diversity. NoSQL, which has been fashionable in the past few years, supports diversity better but seriously sacrifices ease of computation and separation. Working directly on file systems, as with Hadoop, is good for diversity and separation but offers little ease of computation.

We want to design something that does well on all three of these measures so that it fits the cloud. Lacking any one of them causes a lot of trouble.

My design idea is to simulate everyday paper documents.

A basic data unit is used to guarantee diversity. The data structure is attached to the unit, so different data units can have different structures, and a unit can also have sub-structures. For example, besides name and gender, a person may also have family members, work history and so on, each of which may be a single record or many, all kept together. I call this kind of data super-structured data. The disadvantage of this kind of storage is lower computational efficiency, but in OLTP-style business a single task generally involves only a small amount of data, so the performance cost of diversity is still tolerable on modern computers.
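The person example above can be sketched as nested Python dictionaries. This is an assumption-laden illustration, not the storage format of the prototype: the field names and the helpers `describe_structure` and `total_years` are hypothetical. Each unit carries its own structure, and sub-structures are lists of records inside the unit.

```python
# Two data units of the same kind with different structures.
person_a = {
    "name": "Zhang",
    "gender": "F",
    "family": [{"name": "Zhang Sr.", "relation": "father"}],
    "work_history": [
        {"employer": "A", "years": 3},
        {"employer": "B", "years": 5},
    ],
}
person_b = {"name": "Wang", "gender": "M", "family": []}  # no work_history at all


def describe_structure(unit):
    """Read the structure off the unit itself: scalar fields map to a type name,
    sub-structures map to the sorted field names of their records."""
    structure = {}
    for key, value in unit.items():
        if isinstance(value, list):
            structure[key] = sorted({field for record in value for field in record})
        else:
            structure[key] = type(value).__name__
    return structure


def total_years(unit):
    """Compute over a sub-structure, tolerating units that do not have it."""
    return sum(job["years"] for job in unit.get("work_history", []))
```

Because the structure travels with the unit, `total_years` has to check for the sub-structure on every call. That per-unit check is one small instance of the efficiency cost mentioned above.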

A tree-based mechanism is used to store the data units and so ensure separation, just as paper documents are stored in small folders inside large folders. A relational database is actually a linear structure in which all tables are lined up in a single row, with at most two layers even when the schema concept is taken into account.

The key here is that we require information to be kept in the tree directory. This is in fact what we do when storing paper documents: once I have written "financial department" on the folder, the papers inside no longer need to say "financial department". The advantage is that you can change the hierarchy or move the data. For example, Beijing's data has two layers, while Hebei's has three, with an extra layer of prefecture-level cities in between.
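The folder analogy can be sketched with a nested dictionary standing in for the directory tree. Everything here is assumed for illustration: the sample districts and records, and the helpers `walk` and `records_with_location`, are hypothetical. The records never store their own location; it is recovered from the path when the data is read, so a two-layer Beijing and a three-layer Hebei can sit side by side.

```python
# Inner dicts are folders, lists are data units. Records carry no location field.
tree = {
    "Beijing": {
        "Haidian": [{"company": "A", "revenue": 10}],
        "Chaoyang": [{"company": "B", "revenue": 20}],
    },
    "Hebei": {
        # One extra layer of prefecture-level cities.
        "Shijiazhuang": {"Chang'an": [{"company": "C", "revenue": 5}]},
        "Baoding": {"Lianchi": [{"company": "D", "revenue": 7}]},
    },
}


def walk(node, path=()):
    """Yield (path, records) for every data unit below node, at any depth."""
    if isinstance(node, list):
        yield path, node
    else:
        for name, child in node.items():
            yield from walk(child, path + (name,))


def records_with_location(tree):
    """Attach the directory path to each record on read instead of storing it."""
    for path, records in walk(tree):
        for rec in records:
            yield {**rec, "location": "/".join(path)}


locations = [rec["location"] for rec in records_with_location(tree)]
```

Moving a data unit to another folder changes its location without touching a single record, which is the advantage the paragraph above points to.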

The paper mechanism works well for storage, but paper has almost no computational power. Our task is to add that ability.

This is where the higher-order set operations just mentioned come in: they define batch computations over the data units stored in these diverse, separated trees. I can compute over multiple directories at once, each containing multiple data units, and each data unit holds more than one record, just like a small table that can be computed on.
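A batch computation over several directories at once might look like the following Python sketch. It is an assumption-based illustration, not the prototype's query language: the tree contents and the helpers `data_units`, `resolve` and `total` are hypothetical. A query names a set of directories, and the aggregate runs over every record of every data unit found beneath them.

```python
# Inner dicts are directories, lists are data units (small tables of records).
tree = {
    "Beijing": {
        "Haidian": [{"company": "A", "revenue": 10}, {"company": "B", "revenue": 15}],
        "Chaoyang": [{"company": "C", "revenue": 20}],
    },
    "Hebei": {
        "Shijiazhuang": {"Chang'an": [{"company": "D", "revenue": 5}]},
        "Baoding": {"Lianchi": [{"company": "E", "revenue": 7}]},
    },
}


def data_units(node):
    """Yield every data unit stored at or below node, whatever the depth."""
    if isinstance(node, list):
        yield node
    else:
        for child in node.values():
            yield from data_units(child)


def resolve(tree, path):
    """Follow a directory path such as ("Beijing", "Haidian") down the tree."""
    node = tree
    for name in path:
        node = node[name]
    return node


def total(tree, directories, field):
    """Sum a field over all records of all data units under the given directories."""
    return sum(
        rec[field]
        for path in directories
        for unit in data_units(resolve(tree, path))
        for rec in unit
        if field in rec
    )


# One district of Beijing plus all of Hebei, in a single call.
haidian_and_hebei = total(tree, [("Beijing", "Haidian"), ("Hebei",)], "revenue")
```

A query restricted to `("Beijing", "Haidian")` never touches the other subtrees, which fits the point that separated regions do not interfere with each other.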

Separation reduces the coupling between data. If I compute only on the Beijing Haidian data and use up all the CPU, people in Shandong do not care and do not feel it at all, because the data of the two regions is unrelated. Data can be computed on uniformly while being stored separately.

Our research in this area is still at an early stage. We have built a database prototype on top of an ordinary file system, combined with the programming language we developed earlier. Logically it achieves diversity, separation and ease of computation, but its computational efficiency is relatively poor and it has not yet reached a practical stage. Further engineering optimization will follow.

Today we have talked about these problems of relational algebra and proposed a solution for each, along with tentative products. For now, however, each problem is handled separately. For example, the solution to the association description problem does not address the interaction of multi-layer tables. We have not yet been able to design a unified algebraic system that solves all the problems in one framework, which is what I hope to do in the next few years.

Finally, since this is my first time speaking to such a wide audience, I would like to take this opportunity to explain our company's slogan, so that everyone gets to know Run Dry a little better.

We want to promote progress in applications. Today's talk may look very theoretical, but we do not do purely theoretical research. Theory exists to be applied, and it is meaningful only if it can be engineered.

In addition, we promote application progress through technology, not through money, management, business models and the like, which we are not good at. Most critically, the technology must be innovative: we must own core technology, keep renewing it and be willing to negate ourselves, so that we can keep moving forward in the market.

That is my report for today. Thank you!
