Basic Data Types in Data Processing Languages

Source: Internet
Author: User

Different programming languages have different basic data types, depending on what they were designed for. Java, C#, and similar languages are designed for general application development; their basic data types are atomic types such as strings, numbers, and Booleans, together with arrays and ordinary objects. SQL, PowerBuilder, R, and esProc are designed for data processing; their basic data type is a structured two-dimensional data table object. Take this SQL statement as an example:

select t1.id, t1.name, t1.value from T1 left join T2 on t1.id = t2.id

Here T1, T2, and the result of the query are all of this type. A record is made up of multiple fields, and a two-dimensional table is made up of multiple records with the same structure. Such a combination of data and field names is a structured two-dimensional data table object.

Why do data processing languages not use atomic types or ordinary objects as the basic type? Try representing T1 and T2 from the SQL statement above with arrays or ArrayList objects in Java: the logic becomes several times more complex and the code many times longer.
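To make the comparison concrete, here is a rough sketch of the same left join written by hand over plain lists of rows. It is in Python rather than Java to keep it short, and it is only an illustration: the sample rows, the column layout of T2, and the helper name left_join are assumptions, not part of the original example.

```python
# Rows are plain lists, so the programmer has to remember column positions.
# Hypothetical sample data: T1 is (id, name, value), T2 is (id, note).
T1 = [[1, "apple", 10], [2, "pear", 20], [3, "plum", 30]]
T2 = [[1, "x"], [1, "y"], [3, "z"]]


def left_join(left, right, left_key, right_key):
    """Hand-written left join: every left row appears at least once."""
    # Index the right-hand rows by key so each lookup is cheap.
    index = {}
    for r in right:
        index.setdefault(r[right_key], []).append(r)
    right_width = len(right[0]) if right else 0

    joined = []
    for l in left:
        matches = index.get(l[left_key])
        if matches:
            for r in matches:
                joined.append(l + r)
        else:
            # No match: pad with None, much as SQL pads with NULL.
            joined.append(l + [None] * right_width)
    return joined


# Roughly: select t1.id, t1.name, t1.value from T1 left join T2 on t1.id = t2.id
result = [row[:3] for row in left_join(T1, T2, 0, 0)]
print(result)
```

Even in this compact form, one line of SQL turns into an index, two nested loops, and positional slicing that silently breaks if a column is added.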

It is no coincidence that the basic data type in data processing languages is the structured two-dimensional data table object. There are three solid reasons for it.

It corresponds to real business. Most business data in the real world is structured. A payroll has an employee number, employee name, department, date, pre-tax wage, post-tax wage, and so on. Retail records have an order time, store number, cashier number, product name, unit price, and so on. Website logs have a browsing time, URL, visitor IP address, browser version, and other attributes. These attributes are equivalent to fields, and every record has the same structure. Even when such data is stored in text files rather than a database, it is still structured, so it is natural to represent it with a structured two-dimensional data table object. Such an object presents business data in the most intuitive way and reflects the actual business faithfully; whether for storage, computing, exchange, or sharing, this form is convenient and easy to understand.

It makes batch processing easy. Business data usually consists of many records with the same structure, as in payrolls, retail records, and website logs. Sometimes a single field of a single record is processed, but in most cases the whole data set is processed record by record. For example: calculate the post-tax salary from the pre-tax salary; calculate the amount from the unit price and quantity; calculate the daily online duration of each IP address. This kind of processing is batch processing. To implement it, you can traverse every member of an array by row and column number, as in Java, or you can operate on the data by business field name, as in SQL or esProc. The latter is easier to use and needs no loop statements. Programmers can work with the data intuitively from the business point of view, and the resulting code is shorter and easier to read.
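The two styles can be put side by side in a small sketch. It uses Python for illustration only; the field names, the sample payroll rows, and the flat TAX_RATE are all made up, and real tax rules are of course not a single multiplication.

```python
TAX_RATE = 0.2  # hypothetical flat rate, for illustration only

# Style 1: positional access, as with a two-dimensional array.
rows = [["E01", "Ann", 5000.0], ["E02", "Bob", 4000.0]]
PRE_TAX = 2  # the programmer must remember which column holds the pre-tax salary
post_tax_by_position = []
for i in range(len(rows)):
    post_tax_by_position.append(rows[i][PRE_TAX] * (1 - TAX_RATE))

# Style 2: access by business field name, one expression over all records.
payroll = [
    {"emp_no": "E01", "name": "Ann", "pre_tax": 5000.0},
    {"emp_no": "E02", "name": "Bob", "pre_tax": 4000.0},
]
post_tax_by_name = [rec["pre_tax"] * (1 - TAX_RATE) for rec in payroll]

print(post_tax_by_position)
print(post_tax_by_name)
```

Both produce the same numbers, but only the second one states the business rule in terms a reader of the payroll would recognize.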

It fits relational algebra. Relational algebra, proposed by E. F. Codd, is the underlying theory designed specifically for data processing and data query. It uses basic operations, join operations, aggregation operations, and division operations to describe the relationships and computing rules between business data, and in principle it can express data processing and query problems of almost any difficulty. Because relational algebra is concise and complete, most databases are designed according to this theory, and E. F. Codd is often called the father of relational databases. The structured two-dimensional data table object corresponds to the data structure at the center of Codd's theory. It can express all the operations of relational algebra and therefore makes the computing problems of data processing easy to implement. In fact, the database result set is arguably the earliest structured two-dimensional data table object.
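A few of these operations can be sketched over tables held as lists of records. The following Python is a minimal illustration under stated assumptions, not how any database implements them: the function names select, project, and divide, the sample enrollment data, and the restriction to non-empty tables are all choices made for this example.

```python
def select(table, predicate):
    """Selection: keep the records that satisfy the predicate."""
    return [row for row in table if predicate(row)]


def project(table, fields):
    """Projection: keep only the named fields, dropping duplicate records."""
    seen, out = set(), []
    for row in table:
        key = tuple(row[f] for f in fields)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(fields, key)))
    return out


def divide(dividend, divisor):
    """Division: values of the remaining fields that are paired with EVERY divisor record.

    Assumes both tables are non-empty and the divisor's fields all occur in the dividend.
    """
    divisor_fields = list(divisor[0])
    other_fields = [f for f in dividend[0] if f not in divisor_fields]
    required = {tuple(d[f] for f in divisor_fields) for d in divisor}
    paired = {}
    for row in dividend:
        key = tuple(row[f] for f in other_fields)
        paired.setdefault(key, set()).add(tuple(row[f] for f in divisor_fields))
    return [dict(zip(other_fields, k)) for k, vals in paired.items() if required <= vals]


# Hypothetical data: who is enrolled in what, and which courses are required.
enrolled = [
    {"student": "Ann", "course": "DB"},
    {"student": "Ann", "course": "OS"},
    {"student": "Bob", "course": "DB"},
]
required_courses = [{"course": "DB"}, {"course": "OS"}]

print(select(enrolled, lambda r: r["course"] == "DB"))
print(project(enrolled, ["student"]))
print(divide(enrolled, required_courses))  # students enrolled in all required courses
```

Each operation takes tables and returns a table, which is what lets them be combined freely once the table is the basic data type.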

In short, because it corresponds to real business data, makes batch processing easy to implement, and fits the theory of relational algebra, data processing languages use the structured two-dimensional data table object as their basic data type. With two-dimensional data table objects, the code is concise and easy to understand, and development is more efficient. The following examples illustrate this point:

SQL result set: group books by type and find the types whose average price is more than 15 yuan, together with that average price.

select avg(price), type from books group by type having avg(price) > 15

Sequence table (TSeq) of esProc: group by department and find the top 10 products sold by each department.

Products.group(Department).(~.top(Quantity;10))

Data window (DataWindow) of PowerBuilder: sort orders by price.

Order.SetSort('value D')
Order.Sort()

Data frame (data.frame) of R: left-join the orders table and the customer table on customerid.

merge(A1, B1, by.x = "customerid", by.y = "customerid", all.x = TRUE)

Now compare the code of SQL, esProc, and R for the same task: group the order data by department, and summarize the order count and sales of each department.

SQL:

select count(*), sum(sales) from orders group by Dept

esProc:

Orders.groups(Dept; count(~), sum(sales))

R:

result <- aggregate(orders$sales, list(orders$Dept), sum)
result$count <- tapply(orders$sales, orders$Dept, length)

Although the result set, sequence table, data window, and data frame are all structured two-dimensional data table objects with similar functions, there are slight differences between them.

SQL result set: feature-rich, widely used, versatile, and easy to use, it is the most popular data type among data processing languages. However, SQL does not implement relational algebra completely, which makes some operations inconvenient, such as set division.

Data window: it generally retrieves data with a SQL statement and writes the final result back to the database. Its main role is to bridge the gap between data and UI controls, which lets programmers quickly build highly interactive database applications. The DataWindow is mainly for presenting and editing data; it supports only single-table computing, and its data processing capability is weak.

Data frame: it has some structured computing capability, but as the earlier example shows, its syntax is harder to follow and the same function takes more work to implement. This is because R is mainly intended for scientific and statistical computing; its key data types are vectors and matrices, and the data frame was added to support structured data computing. From this point of view, the data frame is less specialized than the other three.

Sequence table: it specializes in data processing. It has the general advantages of the SQL result set and implements relational algebra completely. Sequence tables are also ordered, which suits order-related problems in data processing such as comparison with the previous period, comparison with the same period, ranking, and calculations over relative intervals. Sequence tables are also generic, which makes it easier to establish associations between data and to access multi-level associated data in object form. Compared with SQL, however, a sequence table is a pure in-memory object and cannot directly process big data.
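To show what order-related computing means in practice, here is a small Python sketch of two of the problems just listed: comparison with the previous period and ranking. It is not esProc code and says nothing about how a sequence table works internally; the monthly sales figures and the field names change and rank are invented for the example.

```python
# Hypothetical monthly sales, already sorted by month.
sales = [
    {"month": "2014-01", "amount": 100},
    {"month": "2014-02", "amount": 120},
    {"month": "2014-03", "amount": 90},
]

# Comparison with the previous period: each record needs its predecessor,
# which is only well defined because the records are kept in order.
for i, rec in enumerate(sales):
    rec["change"] = None if i == 0 else rec["amount"] - sales[i - 1]["amount"]

# Ranking: the position of each record when ordered by amount, largest first.
by_amount = sorted(sales, key=lambda r: r["amount"], reverse=True)
for rank, rec in enumerate(by_amount, start=1):
    rec["rank"] = rank

for rec in sales:
    print(rec)
```

Both calculations refer to a record's position relative to other records, which is awkward to express when a table is treated as an unordered set.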

As you can see, the structured two-dimensional data table object is directly tied to how specialized a data processing language is: the more powerful the former, the more professional the latter. Conversely, a language that lacks such an object can hardly be professional at data processing. To judge whether a programming language can be used to develop data analysis and processing programs efficiently, the key is to check whether it has a professional two-dimensional data table object and a corresponding class library.

 

Perl is often used for string searching and has some data processing capability. However, its code for such tasks is lengthy and complex, so it cannot be considered a professional data processing language. For example, grouping records and summing a field in Perl looks like this:

# Group the rows of @carts by field 1, then sum field 2 within each group.
my %groups = ();
foreach (@carts) {
    my $name = $_->[1];
    if (!defined $groups{$name}) {
        $groups{$name} = [$_];      # first row seen for this name
    }
    else {
        push @{ $groups{$name} }, $_;
    }
}
my @result = ();
foreach (keys %groups) {
    my $value = 0;
    while (my $row = pop @{ $groups{$_} }) {
        $value += $row->[2];
    }
    push @result, [$_, $value];
}

Python is easier to write, but its development efficiency still falls short of SQL, esProc, and R. The sample code is as follows:

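from itertools import groupby
from operator import itemgetter

# data is assumed to be a list of rows already sorted by column 0,
# because groupby only merges adjacent rows with the same key.
result = []
for key, items in groupby(data, itemgetter(0)):
    value1 = 0
    value2 = 0
    for subitem in items:
        value1 += subitem[1]
        value2 += subitem[2]
    result.append([key, value1, value2])
print(result)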

Perl and Python are not professional enough at data processing, and the most important reason is that they lack a structured two-dimensional data table object.

The esProc sequence table is not only a structured two-dimensional data table object; it is also ordered, generic, and supports step-by-step computing, which makes it more specialized than comparable languages. For example, take a more complex computing goal: find the stocks that have risen continuously for more than five days. The esProc code is as follows:

[Image: screenshot of the esProc code, 2014-09-15_140326.jpg]
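For readers who want to see the logic of this task spelled out, the following Python sketch solves the same problem. It is not a transcription of the esProc code, which was published only as an image. It assumes that each record is a (code, date, close) tuple with ISO-formatted dates, and that "rising for more than five days" means more than five consecutive day-over-day increases in the closing price; the function names and sample data are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def longest_rising_streak(prices):
    """Return the longest run of consecutive increases in a price series."""
    longest = current = 0
    for prev, curr in zip(prices, prices[1:]):
        current = current + 1 if curr > prev else 0
        longest = max(longest, current)
    return longest


def rising_stocks(records, min_days=5):
    """Return the codes whose longest rising streak exceeds min_days."""
    # Sort by code, then by date, so each group is in chronological order.
    ordered = sorted(records, key=itemgetter(0, 1))
    result = []
    for code, rows in groupby(ordered, key=itemgetter(0)):
        closes = [r[2] for r in rows]
        if longest_rising_streak(closes) > min_days:
            result.append(code)
    return result


# Hypothetical sample: A rises six days in a row; B rises five days, then falls.
records = (
    [("A", f"2014-09-{d:02d}", 10 + d) for d in range(1, 8)]
    + [("B", f"2014-09-{d:02d}", p) for d, p in enumerate([5, 6, 7, 8, 9, 10, 9], start=1)]
)
print(rising_stocks(records))
```

The task combines grouping with a calculation that depends on the order of records inside each group, which is the kind of problem an ordered table object is meant to simplify.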
