What is data abstraction?
Data abstraction is a programming paradigm on a par with object-oriented programming. The term "data abstraction" may sound unfamiliar; its other name is "abstract data type" (ADT).
"Supporting data abstraction" has always been a design goal of the C++ language. In the second edition of The C++ Programming Language (published in 1991) [2nd], Bjarne Stroustrup wrote:
The C++ programming language is designed to
be a better C
support data abstraction
support object-oriented programming
The third edition of this book (published in 1997) [3rd] adds one:
C++ is a general-purpose programming language with a bias towards systems programming that
is a better C,
supports data abstraction,
Supports object-oriented programming, and
Supports generic programming.
Bjarne Stroustrup's 1984 paper "Data Abstraction in C++" is available online at http://www.softwarepreservation.org/projects/c_plus_plus/cfront/release_e/doc/DataAbstraction.pdf. The same site also hosts Bjarne's articles on C++ operator overloading and complex arithmetic, which serve as a detailed description and example of data abstraction. Evidently C++ used data abstraction as a selling point in its early days: supporting data abstraction is a major advantage of C++ over C.
As the language's designer, Bjarne treats data abstraction as one of the four sub-languages of C++. This view is not universally accepted. For example, Scott Meyers, speaking as a language user, divides C++ into four sub-languages in Effective C++, Third Edition: C, Object-Oriented C++, Template C++, and the STL. Scott Meyers's classification has no separate data abstraction; it is folded into Object-Oriented C++.
So what is data abstraction?
In short, data abstraction is a way of describing data structures. Data abstraction means ADTs. An ADT mainly exposes the operations it supports, such as stack.push and stack.pop, and these operations should have well-defined time and space complexity. In addition, an ADT can hide its implementation details. For example, a stack can be implemented with either a dynamic array or a linked list.
By this definition, data abstraction looks similar to object-based programming, so what is the difference? The semantics differ. An ADT usually has value semantics, while object-based programming has object semantics. (For the definitions of these two semantics, see the previous article "C++ Engineering Practice (8): Value Semantics".) An ADT class can be copied, and the copy is independent of the original instance.
For example, after stack a; a.push(10); stack b = a; b.pop(); the element 10 is still in a.
Data abstraction in the C ++ standard library
complex<>, pair<>, vector<>, list<>, map<>, set<>, string, stack, and queue in the C++ standard library are all examples of data abstraction. vector is a dynamic array whose main operations include push_back(), size(), begin(), and end(). These operations not only have clear meanings but also have constant complexity (amortized constant in the case of push_back()). Similarly, list is a linked list, map is an ordered associative array, set is an ordered set, stack is a FILO stack, and queue is a FIFO queue. "Dynamic array", "linked list", "ordered set", "associative array", "stack", and "queue" are all abstract data types with clear definitions (operations and complexity).
Differences between data abstraction and object-oriented
This article treats data abstraction, object-based, and object-oriented as three separate programming paradigms. This fine-grained classification may help in understanding the differences between them.
In layman's terms, object-oriented programming has three main features: encapsulation, inheritance, and polymorphism. Object-based programming has only encapsulation, without inheritance or polymorphism; that is, it has only concrete classes and no abstract interfaces. Both have object semantics.
The real core idea of object-oriented thinking is message passing; "encapsulation, inheritance, polymorphism" is only its outward appearance.
What separates data abstraction from the other two is semantics. Data abstraction has value semantics, not object semantics. For example, TcpConnection and Buffer in muduo are both concrete classes, but the former is object-based while the latter is data abstraction.
Similarly, muduo::Date and muduo::Timestamp are both data abstraction. Although each class has only a single int/long data member, it defines a set of operations and hides its internal data, which turns it from data aggregation into data abstraction.
Data abstraction is about "data", which means an ADT class should be copyable: copying the data is all it takes. If a class represents some other kind of resource (a file, an employee, a printer, an account), it is object-based or object-oriented rather than data abstraction.
An ADT class can be a member of an object-based or object-oriented class, but not the other way round, because copying the ADT class would then become meaningless.
Language facilities required for data abstraction
Not every language supports data abstraction. The following briefly lists the language facilities that "data abstraction" requires.
Support data aggregation
Data aggregation, or value aggregates: defining a C-style struct that puts related data together. FORTRAN77 does not have this capability, so FORTRAN77 cannot implement ADTs. This kind of data-aggregating struct is the basis of ADTs. struct List, struct HashTable, and so on keep the data of a linked list or a hash table together in one structure rather than in several scattered variables.
Global functions and overloading
For example, if I define complex, I can also define complex sin(const complex& x); and complex exp(const complex& x); and so on. sin and exp are not members of complex but overloads of the global functions double sin(double) and double exp(double). This lets double a = sin(b); and complex a = sin(b); take the same form in code, without having to write complex a = b.sin();.
C can define global functions, but a new function cannot share the name of an existing one, so there is no overloading. Java has no global functions, and its Math class is closed, so sin(Complex) cannot be added to it.
Member functions and private data
Data can also be declared private to prevent accidental modification. Not every ADT is suited to private data, for example complex, point, and pair<>.
It must be possible to define operations inside the struct rather than only operating on it through global functions. For example, vector has a push_back() operation; push_back is part of vector and must directly modify vector's private data members, so it cannot be defined as a global function.
These two points amount to defining a class, which today's languages all support directly, with the exception of C.
Copy control
Copy control is the collective name for copying (stack a; stack b = a;) and assignment (stack b; b = a;).
What happens when I copy an ADT? For example, should every element of a stack be copied by value into the new stack?
If the language supports explicit control over object lifetime (for example, C++'s deterministic destruction) and the ADT uses dynamically allocated memory, copy control matters even more; otherwise, how could we prevent access to objects that are no longer valid?
Since a C++ class has value semantics by default, copy control is the necessary means of implementing deep copy. Moreover, the only resource an ADT uses is dynamically allocated memory, so deep copy is feasible. By contrast, a class in the object-based style often represents a real-world thing (such as Employee, Account, or File), for which deep copy is meaningless.
C has no copy control and no way to forbid copying; everything depends on the programmer's care. A FILE* can be copied at will, but as soon as one copy is closed, all other copies become invalid too, much like dangling pointers. Throughout C, resources (memory obtained from malloc, files opened with open(), connections opened with socket()) are represented by an integer or a pointer (that is, a "handle"). Handles of integer or pointer type can be copied at will, which easily leads to common errors such as double release, missed release, and use after release. C++ is a significant improvement in this respect, and boost::noncopyable is the Boost library most worth promoting.
Operator overloading
When writing a dynamic array, we want to use it like a built-in array, for example with subscripting. C++ can overload operator[] to achieve this.
When writing a complex number, we want to use it like a built-in double, for example with addition, subtraction, multiplication, and division. C++ can overload operators such as operator+ to achieve this.
When writing a date or time, we want to compare values directly with < and > and test equality with ==. C++ can overload operators such as operator< to achieve this.
This requires that the language allow operators to be overloaded both as members and as global functions. Operator overloading has been part of C++ from the start: Cfront Release E in 1984 already supported it and provided a complex class that is used no differently from today's standard library complex<>.
Without operator overloading, a user-defined ADT is used differently from the built-in types (think of the languages that distinguish == from equals, which makes code really cumbersome to write). Java has BigInteger, but BigInteger is used differently from the ordinary int/long:
public static BigInteger mean(BigInteger x, BigInteger y) {
  BigInteger two = BigInteger.valueOf(2);
  return x.add(y).divide(two);
}
public static long mean(long x, long y) {
  return (x + y) / 2;
}
Of course, operator overloading is easily abused because it looks cool. I think overloading addition, subtraction, multiplication, and division is appropriate only when the ADT represents a "numeric value"; in other cases, named functions are better. That is why muduo::Timestamp overloads only the relational operators and not the addition or subtraction operators. For another reason, see "C++ Engineering Practice (3): using code formats conducive to version management".
No loss of efficiency
"Abstraction" does not mean inefficiency. In C++, raising the level of abstraction does not reduce efficiency. Otherwise people would rather program at a low level than use the more convenient abstraction, and data abstraction would lose its market. We will see a concrete example below.
Templates and generics
If I write a vector of int, I do not want to implement the same code again for double and string. I should write vector as a template and instantiate it with different types, obtaining concrete types such as vector<int>, vector<double>, vector<complex>, and vector<string>.
Not every ADT needs this generic capability. A Date class has no need to be parameterized on its integer type; int32_t is enough.
Judging by the requirements above, not every object-oriented language supports data abstraction natively, which also shows that data abstraction is not a subset of object-oriented programming.
Data abstraction example
Next, let's look at two programs that simulate the N-body problem. The first is written in C and the second in C++.
The two programs use the same algorithm.
C version. For the complete code, see https://gist.github.com/115888920.file_nbody.c. The backbone of the code follows. planet stores a planet's position, velocity, and mass; position and velocity each have three components. The program simulates the gravity-dominated motion of several planets in three-dimensional space.
struct planet
{
  double x, y, z;     /* position */
  double vx, vy, vz;  /* velocity */
  double mass;
};

void advance(int nbodies, struct planet *bodies, double dt)
{
  /* update velocities from the pairwise gravitational attraction */
  for (int i = 0; i < nbodies; i++)
  {
    struct planet *p1 = &(bodies[i]);
    for (int j = i + 1; j < nbodies; j++)
    {
      struct planet *p2 = &(bodies[j]);
      double dx = p1->x - p2->x;
      double dy = p1->y - p2->y;
      double dz = p1->z - p2->z;
      double distance_squared = dx * dx + dy * dy + dz * dz;
      double distance = sqrt(distance_squared);
      double mag = dt / (distance * distance_squared);
      p1->vx -= dx * p2->mass * mag;
      p1->vy -= dy * p2->mass * mag;
      p1->vz -= dz * p2->mass * mag;
      p2->vx += dx * p1->mass * mag;
      p2->vy += dy * p1->mass * mag;
      p2->vz += dz * p1->mass * mag;
    }
  }
  /* update positions from the new velocities */
  for (int i = 0; i < nbodies; i++)
  {
    struct planet *p = &(bodies[i]);
    p->x += dt * p->vx;
    p->y += dt * p->vy;
    p->z += dt * p->vz;
  }
}
The core of the algorithm is the numerical integration implemented by advance(). It computes accelerations from the distances and gravitational attraction between the planets, updates the velocities, and then updates the planets' positions. The complexity of this naive algorithm is O(N^2).
For the C++ data abstraction version, see https://gist.github.com/115888920.file_nbody.cc.
First, define the Vector3 abstraction, which represents a three-dimensional vector. It can be either a position or a velocity. The operator overloads of Vector3 are omitted here; Vector3 supports the usual vector addition, subtraction, multiplication, and division.
Then define the Planet abstraction, which represents a planet and has two Vector3 members: position and velocity.
Note that, judged by their semantics, Vector3 is data abstraction while Planet is object-based.
struct Vector3
{
  Vector3(double x, double y, double z)
    : x(x), y(y), z(z)
  {
  }
  double x;
  double y;
  double z;
};
struct Planet
{
  Planet(const Vector3& position, const Vector3& velocity, double mass)
    : position(position), velocity(velocity), mass(mass)
  {
  }
  Vector3 position;
  Vector3 velocity;
  const double mass;
};
The advance() code for the same job is much shorter and easier to verify. (Imagine mistyping one of vx, vy, vz, dx, dy, or dz in the C version of advance(); that kind of error is hard to spot.)
void advance(int nbodies, Planet* bodies, double delta_time)
{
  // update velocities from the pairwise gravitational attraction
  for (Planet* p1 = bodies; p1 != bodies + nbodies; ++p1)
  {
    for (Planet* p2 = p1 + 1; p2 != bodies + nbodies; ++p2)
    {
      Vector3 difference = p1->position - p2->position;
      double distance_squared = magnitude_squared(difference);
      double distance = std::sqrt(distance_squared);
      double magnitude = delta_time / (distance * distance_squared);
      p1->velocity -= difference * p2->mass * magnitude;
      p2->velocity += difference * p1->mass * magnitude;
    }
  }
  // update positions from the new velocities
  for (Planet* p = bodies; p != bodies + nbodies; ++p)
  {
    p->position += delta_time * p->velocity;
  }
}
As for performance: although the C++ version uses the higher-level Vector3 abstraction, it runs as fast as the C version. A look at the memory layout explains why:
The members of a C struct are stored contiguously, and an array of structs is contiguous too.
Despite the Vector3 abstraction, the memory layout in C++ has not changed. The layout of Planet is the same as that of the C planet, and the layout of Planet[] is the same as that of the C array.
In addition, C++ inline functions play a huge role here. We can safely call operators such as Vector3::operator+= and the compiler will generate code as efficient as the C version.
Not every programming language can raise the level of abstraction without hurting performance. Consider the Java memory layout.
If we write a Java version of the N-body program with class Vector3, class Planet, and Planet[], the array holds only references to separately allocated Planet objects, and each Planet in turn holds references to separately allocated Vector3 objects.
This greatly reduces memory locality. Interested readers can compare the running efficiency of the Java and C++ versions.
Note: the N-body algorithm here serves only to compare performance and programming convenience across languages. The N-body algorithms actually used in scientific research employ more advanced algorithms and low-level optimizations, have O(N log N) complexity, and run far faster than this naive algorithm in large-scale simulations.
More examples
Date and Timestamp. The "data" in both classes is an integer; each defines a set of operations to express the concept of a date or a time.
BigInteger is itself a "number". With a BigInteger implemented in C++, the factorial function is natural to write. The first function below is the C++ version; the second is the Java version.
BigInteger factorial(int n)
{
  BigInteger result(1);
  for (int i = 1; i <= n; ++i) {
    result *= i;
  }
  return result;
}
public static BigInteger factorial(int n) {
  BigInteger result = BigInteger.ONE;
  for (int i = 1; i <= n; ++i) {
    result = result.multiply(BigInteger.valueOf(i));
  }
  return result;
}
The 3D homogeneous coordinate Vector4 in graphics and the corresponding 4x4 transformation matrix Matrix4.
In finance, "bid/offer prices" come up constantly and can be encapsulated as a BidOffer struct. The struct can have members such as mid() for the mid price and spread() for the price difference, plus addition and subtraction operators.
Summary
Data abstraction is an important abstraction mechanism in C++. It is suited to encapsulating "data", has simple semantics, and is easy to use. Data abstraction simplifies writing code and reduces accidental errors.