Data processing and algebra

Source: Internet
Author: User

This is a written version of a lecture I gave at the Tsinghua Big Data Industry Federation, shared here for discussion.

The content is divided into five parts: (1) data processing and algebra; (2) association operations and description; (3) sequence operations and discretization; (4) hierarchical data and interaction; (5) cloud data organization.

This first part introduces the basic concepts and background. The third part covers data analysis and is the focus. My research on the last part is not yet deep enough, but it also involves relational algebra, so I have put it in with the rest.

Let us start with the programming process.

Programming is still not an easy job. Here we set aside the difficulties caused by unclear or changing requirements, which are what software engineering aims to address. There are problems that are completely unambiguous, whose solution you know, and for which you use the programming language you are most familiar with, and yet the program is still hard to write well.

For example, calculate how many consecutive days a stock has risen, or how much principal is left on a mortgage after the interest rate changes. Even for operations like these, you still have to write a lot of code.
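As a rough illustration of the first example above, here is a minimal Python sketch. The lecture does not define the task precisely, so this assumes it means the longest run of consecutive days on which the closing price is strictly higher than the previous day's. The function name `longest_rising_streak` and the sample prices are made up for illustration.

```python
def longest_rising_streak(closes):
    """Return the longest run of consecutive days on which the close rose.

    Assumption: a "rise" means a close strictly higher than the previous close.
    """
    longest = current = 0
    # Compare each day's close with the previous day's close.
    for prev, cur in zip(closes, closes[1:]):
        current = current + 1 if cur > prev else 0
        longest = max(longest, current)
    return longest


# Hypothetical closing prices, one per trading day.
closes = [10.0, 10.2, 10.5, 10.4, 10.6, 10.9, 11.3, 11.1]
print(longest_rising_streak(closes))  # prints 3
```

Even this small task needs an explicit loop and two pieces of running state, which is the kind of translation effort the text is describing.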

The stock market has been doing well recently, so many of today's examples are stock-related.

Essentially, writing a program means translating the idea for solving a problem into a precise, formal language that a computer can recognize. It is like an elementary school student solving a word problem, such as how much water flows into and out of a pool per minute, or how many chickens and how many rabbits there are. In the end the student writes down an arithmetic expression, which is the precise formal language, and the problem can then be solved. Solving a problem with a computer is similar: you get the problem, figure out a solution, and then translate the solution into actions the computer can understand and carry out.

So why is code difficult to write?

A large part of the reason is that the formal language used to record the solution is far removed from people's natural way of thinking and cannot describe our ideas directly. You have to take a roundabout route in the formal language, so that telling the computer how to do something is often harder than doing it yourself. That is, translating a solution into a formal language is often far more difficult than solving the problem itself. It often requires professionally trained people, namely programmers, whereas our energy should be focused on solving problems rather than on translating solutions.

Going further, I think the main reason is that the algebraic system underlying the formal language is not good enough.

What is an algebraic system? In layman's terms, it defines some data types and specifies some operations on those data types. It must ensure that the operations are closed, meaning they cannot produce data of a type outside the system, and that the logic is self-consistent, meaning the operations cannot produce a contradiction.
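To make the idea of closure concrete, here is a small Python sketch. It only checks a handful of sample values, so it illustrates closure rather than proving it. The helper name `closed_on_samples` and the sample values are made up for illustration.

```python
import operator


def closed_on_samples(op, samples):
    """Return True if op(x, y) is an int for every sampled pair of ints.

    This only checks the given samples; it illustrates closure, it does not prove it.
    """
    return all(isinstance(op(x, y), int) for x in samples for y in samples)


samples = [1, 2, 3, 7]  # hypothetical sample values

add_closed = closed_on_samples(operator.add, samples)      # int + int is an int
mul_closed = closed_on_samples(operator.mul, samples)      # int * int is an int
div_closed = closed_on_samples(operator.truediv, samples)  # int / int is a float in Python 3

print(add_closed, mul_closed, div_closed)  # prints True True False
```

Integers with addition and multiplication stay inside the integers, while true division leaves them, so a system of integers with division needs a wider data type to remain closed.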

The data types here are somewhat like classes in object-oriented programming, but they differ. Object orientation emphasizes the inheritance and overloading of classes, while the emphasis here is on operations.

Broadly speaking, when we process data we are performing operations within some algebraic system, just as everyday arithmetic operates on numbers. If the operations defined on those numbers are inconvenient, it causes a lot of trouble. For example, with only addition and subtraction and no multiplication, even buying groceries on the street would be a problem.

The algebraic system underlying a formal language greatly affects the efficiency and capability of data processing. Consider two examples of algebraic systems that are not good enough: Roman numerals and assembly language. General addition and subtraction with Roman numerals is very difficult; they are suited only to overly simple operations such as adding or subtracting 1.

    MOV AX, 3        ; AX = 3
    MOV BX, 5        ; BX = 5
    IMUL BX, BX, 7   ; BX = 5 * 7
    ADD AX, BX       ; AX = 3 + 5 * 7

This is the operation 3+5*7 written in assembly language. You have to translate the calculation into register operations, because assembly language offers only byte-like data types and operations on them. This is obviously troublesome, and for floating-point operations it is not at all obvious what to do. A high-level language is much easier to use, because it directly provides data types such as integers and real numbers together with their arithmetic. In this sense, Fortran is a great invention.
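For contrast, here is the same calculation as a minimal sketch in a high-level language (Python is used here only for illustration; the lecture itself names Fortran). The floating-point variant uses made-up values.

```python
# Integer arithmetic: the expression is written as it is thought.
result = 3 + 5 * 7
print(result)  # prints 38

# Real-number arithmetic needs no extra effort from the programmer.
real_result = 3.0 + 5.5 * 7  # hypothetical values for illustration
print(real_result)  # prints 41.5
```

The language supplies integers, real numbers, and operator precedence, so no register bookkeeping is needed.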

The purpose of analyzing data with computers is to find connections between things. Things are characterized by their attributes, which in technical terms means structured data. The vast majority of data analysis and processing requirements concern data that is already structured or is about to be structured.

Unstructured data may take up a lot of storage, but it involves few association operations. It has some very specialized algorithms, such as image recognition, that are hard to place within data analysis. Other data, such as logs and geographic information, must be structured before it can be analyzed.

With the arrival of big data, many people say the era of structured data has passed. I do not think so: structured data is still the focus of data analysis and processing.

Speaking of structured data, we must mention relational algebra, an algebraic system developed specifically for structured data and the theoretical basis of modern relational databases. The relational database is one of the most widely used kinds of database. In recent years, especially since the concept of big data appeared, relational databases have been criticized for various problems, but they remain very strong.

Structured data is a data type that appeared in large quantities only after computers came into wide use. Traditional mathematics seldom dealt with it, and relational algebra is one of the few pieces of mathematics invented specifically for computer science. Computer applications use a lot of mathematics, but most of it was invented by mathematicians decades or even hundreds of years ago. The probability theory, graph theory, and linear algebra used in today's popular data mining are basically of this kind, and computer science academia contributed little to them.

Relational algebra defines a number of operations on structured data: set intersection, union, and difference, as well as filtering, grouping, join, and other operations. The definition of a relation in relational algebra is abstract: roughly, a relation is defined over a set of attributes, and operations on relations are then studied, which is not easy to understand. Here, in layman's terms, simply think of a set of objects that have attributes, or what programmers usually call a set of records with fields.
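The operations named above can be sketched on "a set of records with fields" in plain Python. This is only an illustration under assumed data: the tables `trades` and `stocks`, their field names, and the price threshold are all made up, and real relational systems implement these operations very differently.

```python
from collections import defaultdict

# Hypothetical sample data: each record is a dict of fields.
trades = [
    {"code": "A", "day": 1, "price": 10.0},
    {"code": "A", "day": 2, "price": 10.5},
    {"code": "B", "day": 1, "price": 20.0},
    {"code": "B", "day": 2, "price": 19.0},
]
stocks = [
    {"code": "A", "name": "Alpha"},
    {"code": "B", "name": "Beta"},
]

# Filtering: keep only the records that satisfy a condition.
day2 = [r for r in trades if r["day"] == 2]

# Grouping: collect prices per stock code, then aggregate each group.
prices_by_code = defaultdict(list)
for r in trades:
    prices_by_code[r["code"]].append(r["price"])
max_price = {code: max(prices) for code, prices in prices_by_code.items()}

# Join: attach the stock name to each day-2 trade via the shared "code" field.
name_of = {s["code"]: s["name"] for s in stocks}
joined = [{**r, "name": name_of[r["code"]]} for r in day2 if r["code"] in name_of]

# Set operations on stock codes: union, intersection, difference.
traded_day2 = {r["code"] for r in day2}
high_priced = {r["code"] for r in trades if r["price"] > 15}
union = traded_day2 | high_priced
intersection = traded_day2 & high_priced
difference = traded_day2 - high_priced

print(max_price)   # prints {'A': 10.5, 'B': 20.0}
print(difference)  # prints {'A'}
```

Each result is again a collection of records or keys, so the operations can be chained, which is the closure property described earlier.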

Speaking of relational algebra, we must mention SQL, which is the formal language of relational algebra. I would say this formal language has not achieved its intended purpose.

What was SQL designed for? It was meant to make querying data easier for ordinary business people, who should not need to know much about technology. Why do I say so? Because it looks like English. The programming languages before it were purely formal, and their words were just symbols. SQL is very much like English, some statements can even be read as English sentences, and the hope was that anyone who understands English could use it. However, this goal has not been achieved. A slightly more complex query is hard to write in SQL and requires a real specialist. In addition, large volumes of data in databases are difficult to compute on because of performance factors.

To digress a little, many SQL textbooks claim that when you write SQL you only have to state what to do and need not care how it is done. This sounds grand, but it is in fact nonsense. I read the same claim about SAS the other day. Still, the nonsense does not diminish the greatness of these products.

In fact, any programming language satisfies this claim at some level: in assembly language you do not have to worry about how the electrons flow, and in SQL you still have to care about how the data in the tables is organized.

The shortcomings of SQL, as I summarize them, are the following:

Lack of stepwise computation. This one is easy to understand: solving a problem in a single step is always harder than solving it in several steps, yet SQL does not support working step by step, or at least does not encourage it.

Other issues will be covered in the following discussion.

In this sense, there is much room for development in relational algebra. It is now about 40 years old, which is still very young compared with the traditional fields of mathematics. Contrary to what some in the industry say, the study of structured data is far from finished. As an analogy, relational algebra today is a bit like calculus 300 years ago: there is a lot that can be done, and the threshold is not high. The calculus of 300 years ago can now be understood by junior students, whereas at the current frontier of mathematics you cannot even hold a conversation without 20 years of study. The theory of computer science is still at a stage where many things can be done.
